k-Folds Cross Validation in Python

How do we evaluate the accuracy of a machine learning algorithm?

It is customary when evaluating any machine learning classification model to split the dataset into separate training and testing sets.

One convention is 80% training and 20% testing, but what if your dataset is rather small? Testing on 20% of an already small dataset can produce misleading accuracy figures. Furthermore, such a small selection may not be truly representative of the full dataset. One of the solutions to this problem is k-folds cross validation.

The Basic Idea

If we have 100 rows in our data set, typically 20 of these rows would be selected as a testing set, leaving the remaining 80 rows as a training set.

In k-Folds Cross Validation we start out just like that, except that after we have divided, trained and tested the data, we regenerate the training and testing sets using a different 20% of the data as the testing set, folding the old testing set back into the remaining 80% for training. This process continues until every row in the original dataset has appeared in a testing set exactly once. The k in k-folds stands for how many of these train/test splits are created.

An illustrated example of k-folds cross validation
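Before any library code, here is a minimal sketch of that rotation, using the 100-row, 20%-per-fold numbers from the example above (so k = 5); the row indices simply stand in for actual data.

# Minimal sketch of the fold rotation: 100 rows, k = 5 folds of 20 rows each.
rows = list(range(100))
k = 5
fold_size = len(rows) // k
for fold in range(k):
    # each fold takes its turn as the testing set exactly once
    test_rows = rows[fold * fold_size:(fold + 1) * fold_size]
    train_rows = rows[:fold * fold_size] + rows[(fold + 1) * fold_size:]
    print("Fold " + str(fold + 1) + ": train " + str(len(train_rows)) + " rows, test " + str(len(test_rows)) + " rows")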

Starting Code

We will begin with a base-case example of a machine learning template. Our dataset will be the famous iris dataset, to which I have added headings and saved as a .csv file available here. We will try to predict the class of plant using sepal width, sepal length, petal length and petal width. To do this we will use Gaussian Naive Bayes from the sklearn library, splitting the dataset via the usual 80-20 split.

import pandas
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

def import_data():
    # import total dataset
    data = pandas.read_csv('iris_data.csv')
    # split via 80-20
    row_partition = int(data.shape[0] * 0.8)
    train = data[:row_partition]
    test = data[row_partition:]
    # get a list of column names
    headers = list(train.columns.values)
    # partition data
    x_train = train[headers[:-1]]
    y_train = train[headers[-1:]].values.ravel()
    x_test = test[headers[:-1]]
    y_test = test[headers[-1:]].values.ravel()
    return x_train, x_test, y_train, y_test

if __name__ == '__main__':
    # get training and testing sets
    x_train, x_test, y_train, y_test = import_data()
    # create and fit classifier
    classifier = GaussianNB()
    classifier.fit(x_train, y_train)
    # classify our test variables
    predictions = classifier.predict(x_test)
    # save and print accuracy
    accuracy = metrics.accuracy_score(y_test, predictions)
    print("Accuracy: " + str(accuracy))

Stratification

The above code outputs an accuracy of 93%; however, it has one major problem that becomes obvious when we look at the data contained in the actual splits.
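A quick way to see what went wrong is to print the class counts in each slice. This is just a diagnostic sketch, assuming (as in import_data above) that the class label is the last column of the .csv:

# Diagnostic sketch: count how many rows of each class end up in each slice.
# Assumes the class label is the last column, as in import_data above.
data = pandas.read_csv('iris_data.csv')
row_partition = int(data.shape[0] * 0.8)
label = data.columns[-1]
print(data[:row_partition][label].value_counts())   # classes in the 80% training slice
print(data[row_partition:][label].value_counts())   # classes in the 20% testing slice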


It turns out that the original dataset is sorted by class, so by slicing off the last 20% of rows we are selecting only data in the "iris-virginica" category. There are many ways to remedy this, one of which is stratification: making sure each slice contains each class in the same proportion as the full dataset. We will use the train_test_split function from sklearn.model_selection to demonstrate it:

from sklearn.model_selection import train_test_split

def import_data():
    # import total dataset
    data = pandas.read_csv('iris_data.csv')
    # get a list of column names
    headers = list(data.columns.values)
    x = data[headers[:-1]]
    y = data[headers[-1:]].values.ravel()
    # partition data, stratifying on the class labels
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y)
    return x_train, x_test, y_train, y_test

Running the above code results in accuracy measures anywhere from 80% to a perfect 100%, depending purely on how the data happens to be partitioned each time the code is run. That kind of variance is further evidence in support of cross-validation.
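To see that variance for yourself, a rough sketch is to repeat the split-and-score cycle a few times (the trial count of 5 is arbitrary), assuming the stratified import_data above:

# Rough sketch: repeat the random stratified split to watch the accuracy vary.
for trial in range(5):
    x_train, x_test, y_train, y_test = import_data()
    classifier = GaussianNB()
    classifier.fit(x_train, y_train)
    score = metrics.accuracy_score(y_test, classifier.predict(x_test))
    print("Trial " + str(trial + 1) + " accuracy: " + str(score))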

Adding k-Folds Cross Validation (finally)

We will now move on to adding proper, stratified k-Folds Cross Validation. (Note: there are many ways to do this; I am just showing one of the possibilities.)

from sklearn.model_selection import StratifiedKFold

if __name__ == '__main__':
    # get the full, unsplit dataset (import_data is modified to return x and y; see the complete listing below)
    x, y = import_data()
    # set to 10 folds
    skf = StratifiedKFold(n_splits=10)
    for train_index, test_index in skf.split(x, y):
        # specific ".loc" syntax for working with dataframes
        x_train, x_test = x.loc[train_index], x.loc[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # create and fit classifier
        classifier = GaussianNB()
        classifier.fit(x_train, y_train)
        # classify our test variables
        predictions = classifier.predict(x_test)
        # save and print accuracy for this fold
        accuracy = metrics.accuracy_score(y_test, predictions)
        print("Accuracy: " + str(accuracy))


Output:

Accuracy: 0.933333333333
Accuracy: 0.933333333333
Accuracy: 1.0
Accuracy: 0.933333333333
Accuracy: 0.933333333333
Accuracy: 0.933333333333
Accuracy: 0.866666666667
Accuracy: 1.0
Accuracy: 1.0
Accuracy: 1.0
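As an aside, if all you need are the per-fold scores, sklearn's cross_val_score can do the same bookkeeping for you; for a classifier and an integer cv it uses stratified folds by default. This is an equivalent shortcut, not a change to the approach above:

# Equivalent shortcut: let sklearn run the stratified folds and collect the scores.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(GaussianNB(), x, y, cv=10)
print(scores)          # one accuracy value per fold
print(scores.mean())   # average accuracy across the folds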


Combining Results

We can take this one step further by combining the results from all folds into final predicted_y and expected_y lists, which can then be compared to get a single, overall measure of classifier accuracy.


import pandas
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold

def import_data():
    # import total dataset
    data = pandas.read_csv('iris_data.csv')
    # get a list of column names
    headers = list(data.columns.values)
    # separate into independent and dependent variables
    x = data[headers[:-1]]
    y = data[headers[-1:]].values.ravel()
    return x, y

if __name__ == '__main__':
    # get independent and dependent variables
    x, y = import_data()
    # set to 10 folds
    skf = StratifiedKFold(n_splits=10)
    # blank lists to store predicted values and actual values
    predicted_y = []
    expected_y = []
    # partition data
    for train_index, test_index in skf.split(x, y):
        # specific ".loc" syntax for working with dataframes
        x_train, x_test = x.loc[train_index], x.loc[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # create and fit classifier
        classifier = GaussianNB()
        classifier.fit(x_train, y_train)
        # store result from classification
        predicted_y.extend(classifier.predict(x_test))
        # store expected result for this specific fold
        expected_y.extend(y_test)
    # save and print overall accuracy across all folds
    accuracy = metrics.accuracy_score(expected_y, predicted_y)
    print("Accuracy: " + str(accuracy))


Output: 95.3% accuracy
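Similarly, sklearn's cross_val_predict returns the out-of-fold predictions directly, which yields the same kind of combined accuracy measure; treat it as an alternative shortcut to the explicit loop above:

# Alternative shortcut: gather out-of-fold predictions, then score them all at once.
from sklearn.model_selection import cross_val_predict

fold_predictions = cross_val_predict(GaussianNB(), x, y, cv=10)
print("Accuracy: " + str(metrics.accuracy_score(y, fold_predictions)))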

Conclusion

As I stated above, this is just one of many ways to go about k-Folds Cross Validation. I personally find this method easiest to understand and expand upon, but your mileage may vary.

Project Source:
