Kent MacDonald Data Science

Posts

Cluster-Robust Regression in Python

August 30, 2017

This is a short blog post about handling scenarios where one is investigating a correlation across some domain while controlling for an expected correlation across some other domain. If that sentence made no sense to you, don't worry; here is a simple example.

Research Question: Are Instagram users more likely to also post on Facebook or on Twitter?

Analysis Plan: Perform a t-Test on the difference in means between how often Instagram users post on Facebook and how often Instagram users post on Twitter.

Problem: Some users have Facebook accounts (Group A), some users have Twitter accounts (Group B), and some users have both (Group A/B), so we can't really use a Related Samples t-Test or an Independent Samples t-Test. Also, some users post many photos on Instagram and others only make the occasional post. We could therefore expect that the difference between users may be larger than the difference between what we actually want to measure, which is Facebo...


Parallel Hyper-Parameter Optimization in Python

August 01, 2017

Tuning the hyper-parameters used in many machine learning algorithms is as much an art as it is a science. Thankfully, a few tools can help us do it effectively. One of these is Grid Search, the process of creating a "grid" of possible hyper-parameter values, testing each possible combination of values via k-folds Cross Validation, and choosing the "best" combination based on performance on a user-defined metric such as accuracy, area under the ROC curve, or sensitivity. This process is very computationally expensive, especially as the number of hyper-parameters involved increases. We can significantly reduce the time taken to perform grid search by using parallel computing if we have a multi-core CPU or a CPU that supports hyper-threading. The idea of parallel computing is sometimes intimidating even to veteran programmers; thankfully, the work of parallel scaling can be done automatically through...
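As a rough illustration of the grid search idea, here is a minimal sketch that uses only the Python standard library. It is not the code from the full post: build_grid, toy_score and grid_search are hypothetical names, and toy_score is a made-up stand-in for a cross-validated metric. Threads are used only to keep the sketch self-contained; CPU-bound model fitting would normally need separate processes or a library that manages them.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def build_grid(param_values):
    """Expand {name: [values]} into a list of every combination as dicts."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]


def toy_score(params):
    """Hypothetical stand-in for a cross-validated score (higher is better)."""
    return -((params["depth"] - 3) ** 2) - (params["rate"] - 0.1) ** 2


def grid_search(param_values, score_fn, max_workers=4):
    """Score every combination concurrently; return (best_params, best_score)."""
    grid = build_grid(param_values)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, grid))  # results keep the grid order
    best_index = max(range(len(grid)), key=scores.__getitem__)
    return grid[best_index], scores[best_index]


# Assumed example grid: 3 x 3 = 9 combinations to evaluate.
search_space = {"depth": [1, 3, 5], "rate": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(search_space, toy_score)
print(best_params, best_score)
```

The grid grows multiplicatively with each added hyper-parameter, which is why spreading the evaluations across workers pays off.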


Box Cox Transformations in Python

July 17, 2017

Many common machine learning algorithms assume data is normally distributed. But what if your data isn't? I experienced this frustration firsthand during my undergraduate thesis. I was attempting to predict the category of online slot machine a customer was using based on some information about their bet size, speed of play, etc. Unfortunately, no matter what algorithm I used or what hyper-parameter I modified, I still couldn't achieve accuracy over ~60%. Nearing the end of the school semester, I was reading about improving classifier performance when I had my "Eureka!" moment: of course none of these algorithms were performing well. When people play slot machines, the vast majority will bet the minimum stakes, with only the most adventurous and financially well-off people betting significantly more. My data was indeed not normally distributed. A quick Google search for "How to fix non-normally distributed data" revealed the Box Cox Transformation. ...
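For reference, the one-parameter Box-Cox transform maps a positive value x to (x^lambda - 1) / lambda, or to ln(x) when lambda is 0. The sketch below is a minimal standard-library illustration, not the code from the full post: the bet sizes are made up, lambda is picked by hand rather than estimated, and box_cox and skewness are hypothetical helper names.

```python
import math


def box_cox(values, lam):
    """Apply the one-parameter Box-Cox transform to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


# Made-up, right-skewed bet sizes: mostly minimum stakes, a few large bets.
bets = [1, 1, 1, 1, 2, 2, 3, 5, 20, 100]
transformed = box_cox(bets, 0)  # lambda = 0 is the log transform

print(skewness(bets))         # strongly positive: long right tail
print(skewness(transformed))  # smaller: the tail has been pulled in
```

In practice lambda is usually estimated from the data rather than chosen by hand, and the transformed data should still be checked for normality.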


k-Folds Cross Validation in Python

July 11, 2017

How do we evaluate the accuracy of a machine learning algorithm? It is customary when evaluating any machine learning classification model to split the dataset into separate training and testing sets. One convention is 80% training and 20% testing, but what if your dataset is rather small? Testing on 20% of an already minimal dataset can give misleading accuracy estimates. Furthermore, such small selections may not be truly representative of the full dataset. One solution to this problem is k-folds cross validation.

The Basic Idea

If we have 100 rows in our data set, typically 20 of these rows would be selected as a testing set, leaving the remaining 80 rows as a training set. In k-Folds Cross Validation we start out just like that, except after we have divided, trained and tested the data, we will re-generate our training and testing datasets using a different 20% of the data as the testing set and add our old testing set into the remaining 80% for training. This pro...
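The rotation of test sets described above can be sketched in a few lines of plain Python. This is an illustrative sketch under simple assumptions, not the code from the full post: k_fold_indices is a hypothetical helper, the folds are contiguous blocks of row indices, and no shuffling is done (real data is usually shuffled first).

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size


# 100 rows and k = 5 gives five 80/20 splits, each with a different test block.
folds = list(k_fold_indices(n_rows=100, k=5))
for train, test in folds:
    print(len(train), len(test), "test rows", test[0], "to", test[-1])
```

Every row lands in a test set exactly once, so the average score over the folds uses all of the data for evaluation.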


Univariate Linear Regression in Python

July 10, 2017

Introduction

Does x predict y? This is the basic question that linear regression aims to answer, or at least give a hint about. Technically speaking, linear regression is a way of establishing whether two variables are related. In this post we need to be familiar with the idea of both dependent and independent variables. Generally, the dependent variable, or "y", is the variable that we are measuring (it can also help to frame this as the outcome). The independent variable, or "x", is the variable that is modified or changed. If these two variables are at all correlated, a change in the independent variable should result in a somewhat reliable change in the dependent variable. For example, let's say we are interested in how rainfall affects umbrella usage; we could hypothesize that the more it rains, the more likely people are to use an umbrella. In this case our independent variable is rainfall; let's quantify that as mm per day. Our dependent variable is umb...
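To make the x-predicts-y idea concrete, here is a minimal ordinary least squares sketch in plain Python. It is an illustration only, not the code from the full post: fit_line is a hypothetical helper, and both the rainfall figures and the umbrella_use values are made-up numbers.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: daily rainfall in mm (independent) and umbrella use (dependent).
rainfall_mm = [0, 2, 4, 6, 8, 10]
umbrella_use = [1, 5, 9, 14, 17, 21]

slope, intercept = fit_line(rainfall_mm, umbrella_use)
print(f"umbrella_use ~ {slope:.2f} * rainfall_mm + {intercept:.2f}")
```

A positive slope here says that umbrella use tends to rise with rainfall in this toy data; it does not by itself establish how reliable that relationship is.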
