Parallel Hyper-Parameter Optimization in Python
Tuning the specific hyper-parameters used in many machine learning algorithms is as much of an art as it is a science.
Thankfully, we can use a few tools to do it more effectively. One such tool is Grid Search: the process of building a "grid" of candidate hyper-parameter values, testing every possible combination of those values via k-fold cross-validation, and choosing the "best" combination based on a user-defined metric such as accuracy, area under the ROC curve, or sensitivity.
This process is very computationally expensive, especially as the number of hyper-parameters grows. If we have a multi-core CPU, or a CPU that supports hyper-threading, we can significantly reduce the time grid search takes by using parallel computing. The idea of parallel computing can intimidate even veteran programmers; thankfully, the work of parallel scaling is handled automatically by SK-Learn's GridSearchCV module.
Writing Code
We will use the "digits" dataset and DecisionTreeClassifier from SK-Learn in this example:
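A minimal sketch of the baseline; the 3-fold cross-validation, the random_state, and the variable names are assumptions rather than the original post's exact settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the digits dataset: 1,797 8x8 images of handwritten digits.
digits = load_digits()
X, y = digits.data, digits.target

# Baseline: a decision tree with default hyper-parameters, scored by
# mean cross-validated accuracy (the cv=3 fold count is an assumption).
clf = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=3)
print("{:.1%} Accuracy".format(scores.mean()))
```

Output: 77.9% Accuracy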
Not bad, but let's see if tweaking some parameters has any effect.
SK-Learn's DecisionTreeClassifier has quite a few hyper-parameters that can be tweaked; let's start by looking at two of them, along with some possible values:
Criterion
Description: The function to measure the quality of a split
Possible Values: "Gini", "Entropy"
Max Depth
Description: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
Possible Values: "None", Int (we will consider 2, 4, 6, 8 and 10)
(other parameters listed here)
We will create a "parameter grid" as a dictionary of possible values for these hyper-parameters:
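In code, this could look like the following; the values come from the list above, while the variable name param_grid is an assumption.

```python
# Candidate values for the two hyper-parameters described above.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 4, 6, 8, 10],
}
```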
Next we will pass the parameter grid to GridSearchCV, which will automatically run the classifier with each possible combination of parameters:
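A sketch of that step. The reporting loop follows the convention from the scikit-learn documentation, where the +/- interval spans two standard deviations; that formatting, like the variable names, is an assumption.

```python
from sklearn.model_selection import GridSearchCV

# Fit one classifier per combination in the grid, scored by
# 3-fold cross-validated accuracy.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

# Print the mean and spread of the CV score for every combination.
means = search.cv_results_['mean_test_score']
stds = search.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, search.cv_results_['params']):
    print("%0.3f (+/-%0.3f) for %r" % (mean, std * 2, params))
```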
Output:
0.772 (+/-0.061) for {'criterion': 'gini', 'max_depth': None}
0.312 (+/-0.027) for {'criterion': 'gini', 'max_depth': 2}
0.546 (+/-0.075) for {'criterion': 'gini', 'max_depth': 4}
0.706 (+/-0.089) for {'criterion': 'gini', 'max_depth': 6}
0.763 (+/-0.104) for {'criterion': 'gini', 'max_depth': 8}
0.778 (+/-0.080) for {'criterion': 'gini', 'max_depth': 10}
0.785 (+/-0.035) for {'criterion': 'entropy', 'max_depth': None}
0.354 (+/-0.012) for {'criterion': 'entropy', 'max_depth': 2}
0.625 (+/-0.050) for {'criterion': 'entropy', 'max_depth': 4}
0.763 (+/-0.033) for {'criterion': 'entropy', 'max_depth': 6}
0.787 (+/-0.043) for {'criterion': 'entropy', 'max_depth': 8}
0.798 (+/-0.036) for {'criterion': 'entropy', 'max_depth': 10}
As you can see, criterion: 'entropy' with max_depth: 10 gives the highest accuracy (79.8%).
Let's increase the size of our parameter grid:
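The exact expanded grid is not reproduced here, so the following is a plausible reconstruction: every list is an assumption, chosen only to match the 2x7x7x3x4 shape described below and to include the best parameters reported at the end.

```python
# A 2x7x7x3x4 grid: 1,176 combinations in total. The specific values
# are assumptions, apart from those implied by the final best_params_.
param_grid = {
    'criterion': ['gini', 'entropy'],               # 2 values
    'max_depth': [None, 2, 4, 6, 8, 10, 12],        # 7 values
    'min_samples_split': [2, 4, 6, 8, 10, 12, 14],  # 7 values
    'min_samples_leaf': [1, 2, 3],                  # 3 values
    'max_features': [None, 'sqrt', 'log2', 0.5],    # 4 values
}
```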
Problems
Now we could run this as is and it would work; however, there are two problems with running a grid of this size.
Output size:
This is a 2x7x7x3x4 grid, which results in 1,176 combinations.
Rather than reading the entire output, we will use the fitted grid search's "best_score_" and "best_params_" attributes:
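For example, assuming the fitted GridSearchCV object is named search, as in the earlier sketch:

```python
# best_score_ and best_params_ are attributes of a fitted
# GridSearchCV object, populated after calling fit().
print("Best Score: {:.1%}".format(search.best_score_))
print("Best Parameters:", search.best_params_)
```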
Speed:
Running a full 3-fold cross-validation on each of the 1,176 combinations means fitting the classifier 3,528 times! This can get seriously slow, so we will set the "n_jobs" parameter to -1, which allows grid search to use every available core in parallel to speed up the process:
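A sketch of the parallel run, carrying over the variable names from the earlier sketches:

```python
# n_jobs=-1 dispatches the cross-validation fits across all
# available CPU cores in parallel.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
```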
A quick look at Activity Monitor confirms that the script is running on all available cores:
Running on a quad-core i7 with hyper-threading enabled
Final Output:
It is important to note that we only analyzed a relatively small group of hyper-parameters here; typically you should spend a significant amount of time tuning your grid to find a truly optimal configuration. Nevertheless, we achieved the following results:
Best Score:
80.1% Accuracy (+2.2%)
Best Parameters:
{'min_samples_leaf': 1,
 'max_depth': 8,
 'max_features': None,
 'criterion': 'entropy',
 'min_samples_split': 2}