Box-Cox Transformations in Python
Many common machine learning algorithms assume, or at least work best when, the data is roughly normally distributed.
But what if your data isn't?
I experienced this frustration firsthand during my undergraduate thesis. I was attempting to predict which category of online slot machine a customer was using, based on information such as their bet size and speed of play. Unfortunately, no matter which algorithm I used or which hyperparameter I modified, I couldn't achieve accuracy above ~60%.
Nearing the end of the school semester, I was reading about improving classifier performance when I had my "Eureka!" moment: of course none of these algorithms were performing well. When people play slot machines, the vast majority bet the minimum stake, with only the most adventurous and financially well-off betting significantly more. My data was not normally distributed. A quick Google search for "How to fix non-normally distributed data" revealed the Box-Cox transformation, a seemingly simple way to transform data to be closer to a normal distribution. After writing a simple script to perform the transformation, my accuracy jumped to nearly 80%, an increase of roughly 20 percentage points.
The Transformation
The transformation relies primarily on a parameter lambda (λ), typically searched over the range -5 to 5, whose value is chosen automatically so that the transformed data is as close to normal as possible. Specifically, the data is transformed in the following way:
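For reference, the standard one-parameter Box-Cox definition for a strictly positive value y is usually written as below. The symbol y^(λ) for the transformed value is notation chosen here, not taken from the original post.

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln y, & \lambda = 0
\end{cases}
```

The λ = 0 branch is the limit of the first branch as λ approaches 0, which is why the transform varies smoothly with λ.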
Note: this formulation requires strictly positive values. For data that contains negative values, a second formulation can be used instead.
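As a rough sketch of the definition and of the positivity restriction, the transform can be written by hand with NumPy and checked against SciPy for a fixed λ. The helper name `box_cox` and the sample arrays are made up for illustration. For data with zeros or negatives, the sketch swaps in SciPy's Yeo-Johnson transform, which is a different technique and not necessarily the "second formulation" referred to above.

```python
import numpy as np
from scipy import stats


def box_cox(y, lam):
    """Apply the one-parameter Box-Cox transform to strictly positive data."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    # lambda == 0 is the limiting case of (y**lam - 1) / lam
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam


# Illustrative values only (assumption, not from the original dataset).
positive = np.array([1.0, 2.0, 5.0, 10.0, 50.0])

# The hand-written version should agree with SciPy for a fixed lambda.
assert np.allclose(box_cox(positive, 0.5), stats.boxcox(positive, lmbda=0.5))
assert np.allclose(box_cox(positive, 0.0), stats.boxcox(positive, lmbda=0.0))

# Data with zeros or negatives: Yeo-Johnson (a different transform, used here
# as an assumption) accepts any real value and also estimates its own lambda.
mixed = np.array([-3.0, -1.0, 0.0, 0.5, 2.0, 8.0, 40.0])
mixed_transformed, mixed_lambda = stats.yeojohnson(mixed)
```

Another common workaround is to add a constant to the data so that every value is positive before applying Box-Cox, at the cost of having to choose that constant.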
Writing Code
While the transformation is a tad easier in R, we can still perform it relatively easily in Python using the SciPy library. I will use some sample data from the Bureau of Transportation Statistics, specifically flight duration. My specific dataset is available here.
Let's begin by loading the data and visualizing it as a histogram:
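The original loading code and dataset are not reproduced here, so the following is only a minimal sketch. It assumes synthetic, right-skewed "durations" generated with NumPy as a stand-in for the flight data, and the helper `plot_histogram` is a hypothetical name.

```python
import numpy as np
from scipy import stats

# Stand-in data (assumption): the original flight-duration file is not
# available here, so we draw right-skewed "durations" in minutes instead.
rng = np.random.default_rng(seed=0)
durations = rng.lognormal(mean=4.7, sigma=0.5, size=5000)


def plot_histogram(values, title, bins=50):
    """Draw a histogram of `values`; matplotlib is imported only when called."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    ax.set_title(title)
    ax.set_xlabel("Value")
    ax.set_ylabel("Count")
    plt.show()
    return fig


# Sample skewness is a quick numeric check to go with the plot:
# values well above 0 indicate a long right tail.
raw_skew = stats.skew(durations)
print(f"skewness of raw data: {raw_skew:.2f}")
```

Calling `plot_histogram(durations, "Flight duration")` renders the figure. With real data, `durations` would instead be the duration column read from the downloaded file.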
Output: [Figure: histogram of the raw flight durations]
This data isn't horrible, but it is significantly skewed. Let's see if we can improve the shape a little.
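A minimal sketch of the transformation step with `scipy.stats.boxcox`, again using synthetic right-skewed data as a stand-in for the flight durations (an assumption, since the original dataset is not reproduced here):

```python
import numpy as np
from scipy import stats

# Stand-in data (assumption): right-skewed, strictly positive values used in
# place of the original flight durations.
rng = np.random.default_rng(seed=0)
durations = rng.lognormal(mean=4.7, sigma=0.5, size=5000)

# With no lambda supplied, boxcox estimates it by maximum likelihood and
# returns both the transformed array and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(durations)

skew_before = stats.skew(durations)
skew_after = stats.skew(transformed)

print(f"fitted lambda:   {fitted_lambda:.3f}")
print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
```

Skewness close to 0 after the transform suggests the distribution is now roughly symmetric. A histogram of `transformed` gives the visual version of the same check.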
Output: [Figure: histogram of the transformed flight durations]
The transformed data is now much closer to a symmetric, bell-shaped distribution and is ready to be used or transformed further.
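If the transformed values feed a machine learning model, one reasonable approach is to estimate λ on the training portion only and reuse it elsewhere. The sketch below assumes synthetic data and an arbitrary 1500/500 split. It uses `scipy.special.inv_boxcox` to map values back to the original units.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Synthetic, strictly positive data (assumption) split into two portions.
rng = np.random.default_rng(seed=1)
values = rng.lognormal(mean=4.7, sigma=0.5, size=2000)
train, test = values[:1500], values[1500:]

# Estimate lambda on the training portion only.
train_t, lam = stats.boxcox(train)

# Reuse that lambda for the held-out portion so both share one scale.
test_t = stats.boxcox(test, lmbda=lam)

# inv_boxcox maps transformed values back to the original units.
recovered = inv_boxcox(test_t, lam)
assert np.allclose(recovered, test)
```

Fitting λ separately on each portion would put the two sets on slightly different scales, which is usually not what a downstream model expects.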
Conclusion
Performing a Box-Cox transformation is a powerful and elegant way of making skewed data more normally distributed, and it can lead to significant improvements in machine learning performance, as our sample data transformation shows.