In my last post I went over decision tree and random forest algorithms. In this post we will code a full, real example: recognising handwritten digits and predicting which digit appears in a new image.
As we said before, these algorithms are classified as “supervised machine learning”, so we will need to feed our machine lots of pre-classified examples, and it will learn as we feed it more and more digits.
We are not going to sit and classify thousands of images by hand. Because this is a well-known problem, we will use a dataset from sklearn called digits. It contains 1797 images that have already been classified into the corresponding digit (0-9), so we can save the hard labour.
We will start our program by loading the dataset and looking at an image:
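    from sklearn.datasets import load_digits
    import matplotlib.pyplot as plt

    digits = load_digits()
    print(digits.data.shape)       # (number of images, pixels per image)
    plt.gray()                     # use a grayscale colormap
    plt.matshow(digits.images[1])  # draw the second image in the dataset
    print(digits.target[1])        # the label that belongs to that image
    plt.show()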
* pyplot is a MATLAB-like plotting library, used here for viewing images.
* matplotlib.pyplot.gray() sets the default colormap to gray and applies it to the current image, if any.
The result is an 8×8 image of the digit 1:
* Each image is an 8×8 grid of integer pixel values in the range 0-16. If you want to understand the color representation, read here.
[[ 0. 0. 0. 12. 13. 5. 0. 0.]
[ 0. 0. 0. 11. 16. 9. 0. 0.]
[ 0. 0. 3. 15. 16. 6. 0. 0.]
[ 0. 7. 15. 16. 16. 2. 0. 0.]
[ 0. 0. 1. 16. 16. 3. 0. 0.]
[ 0. 0. 1. 16. 16. 6. 0. 0.]
[ 0. 0. 1. 16. 16. 6. 0. 0.]
[ 0. 0. 0. 11. 16. 10. 0. 0.]]
The actual number is 1 and we can review it using the command:
    print(digits.target[1])
* The target of each image is the digit it was labeled with in advance.
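Before training anything, it can help to check what load_digits actually returns. The following is only a small sketch for inspecting the dataset; the variable names (pixel_min, pixel_max, labels, counts) are mine and not part of the original program.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# images holds one 8x8 array per sample, target holds one label per sample
print(digits.images.shape)   # (1797, 8, 8)
print(digits.target.shape)   # (1797,)

# Pixel intensities should stay inside the 0..16 range described above
pixel_min = digits.images.min()
pixel_max = digits.images.max()
print(pixel_min, pixel_max)

# Which labels exist, and how many examples there are of each
labels, counts = np.unique(digits.target, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

If the counts are roughly equal per digit, plain accuracy is a reasonable first metric for the classifier.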
We have our examples of handwriting images and their classification ready. It’s time to take out the big guns and see what happens when we apply a random forest on it.
We will split our dataset into roughly 80% training data and 20% testing data and see how it goes:
    import numpy as np

    # Random boolean mask: each image goes to the training set with probability 0.8
    msk = np.random.rand(len(digits.images)) < 0.8
    train_image, test_image = digits.images[msk].copy(), digits.images[~msk].copy()
    train_target, test_target = digits.target[msk].copy(), digits.target[~msk].copy()
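The random mask gives a split that is only approximately 80/20 and changes on every run. If you prefer a fixed-size, repeatable split, one alternative is sklearn's train_test_split. This is a sketch of that alternative, not what the program in this post uses, and the seed value 0 is an arbitrary choice of mine.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# test_size=0.2 holds out 20% of the samples for testing;
# random_state=0 is an arbitrary seed that makes the split repeatable
train_image, test_image, train_target, test_target = train_test_split(
    digits.images, digits.target, test_size=0.2, random_state=0
)

print(len(train_image), len(test_image))
```

The images keep their 8×8 shape after the split, so they still need to be flattened before training.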
In order to run our random forest we need to flatten each image to one dimension:
    # -1 tells NumPy to work out the second dimension itself (8 * 8 = 64)
    train_image = train_image.reshape((len(train_image), -1))
    test_image_reshape = test_image.reshape((len(test_image), -1))
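To see what the reshape does, here is a minimal sketch that flattens the whole dataset and compares the result with digits.data, the already flattened copy that ships with the dataset. The names flat and same are mine.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each 8x8 image becomes one row of 64 values; -1 lets NumPy infer the 64
flat = digits.images.reshape((len(digits.images), -1))
print(flat.shape)  # (1797, 64)

# digits.data holds the same pixels, already flattened row by row
same = np.array_equal(flat, digits.data)
print(same)
```

So flattening by hand and using digits.data directly are interchangeable here; the manual reshape is mostly useful when you bring your own images.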
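Now we can run our classifier and test it:
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier()      # default settings
    rf.fit(train_image, train_target)  # learn from the flattened training images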
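RandomForestClassifier() with no arguments uses the library defaults and a fresh random seed each time, so two runs can produce slightly different forests. As a hedged sketch, the two parameters below control the number of trees and the seed; the values 50 and 42 are arbitrary choices of mine, and for brevity the forests are fitted on the full dataset rather than on a training split.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
X, y = digits.data, digits.target

# n_estimators is the number of trees; random_state fixes the randomness
rf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# With the same seed and the same data, both forests make the same predictions
identical = np.array_equal(rf_a.predict(X), rf_b.predict(X))
print(len(rf_a.estimators_), identical)
```

Fixing the seed is handy while experimenting, because a change in the score then comes from your change and not from luck.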
If we would like to check the accuracy of our model on the test data:
    # score returns the fraction of test images that were classified correctly
    score = rf.score(test_image_reshape, test_target)
    print('Score\t' + str(score))
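In my run the score was 0.9383, meaning we classified almost 94% of our test data correctly. Because both the split and the forest are random, your number will differ slightly from run to run.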
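A single accuracy number does not tell us which digits get confused with each other. One way to look closer is a confusion matrix. The sketch below is self-contained and uses a fixed seed (0, an arbitrary choice of mine) and train_test_split for the split, so its numbers will not match the score quoted above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

score = rf.score(X_test, y_test)

# Rows are the true digits, columns are the predicted digits
cm = confusion_matrix(y_test, rf.predict(X_test), labels=np.arange(10))
print(cm)

# The diagonal holds the correct predictions, so it reproduces the accuracy
accuracy_from_cm = np.trace(cm) / cm.sum()
print(score, accuracy_from_cm)
```

Any large number off the diagonal points at a pair of digits the forest tends to mix up.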
If we would like to predict one new image we would use the predict method:
    i = 15
    expected = test_target[i]
    # predict expects a 2D array (samples x features), so pass the image as a 1x64 row
    predicted = rf.predict(test_image_reshape[i].reshape(1, -1))
    print("predicted")
    print(predicted)
    print("expected")
    print(expected)
If you would like to classify a new image of a digit, make sure you first resize it to 8×8 and convert it to grayscale with pixel values in the same 0-16 range as the training data.
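As a rough sketch of that preparation step, the helper below shrinks a grayscale image to 8×8 by averaging blocks of pixels and rescales it to 0-16. The function name to_digits_format is hypothetical, and it assumes an input with values 0-255, bright pixels as ink, and a height and width that are multiples of 8. Real photos would also need cropping and centering, which is not covered here.

```python
import numpy as np

def to_digits_format(gray):
    """Hypothetical helper: block-average a grayscale image (values 0..255,
    bright = ink) down to 8x8 and rescale it to the 0..16 range."""
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    if h % 8 or w % 8:
        raise ValueError("height and width must be multiples of 8")
    # Split the image into an 8x8 grid of blocks and average each block
    blocks = gray.reshape(8, h // 8, 8, w // 8).mean(axis=(1, 3))
    small = blocks * 16.0 / 255.0
    # One row of 64 features, the shape that predict expects
    return small.reshape(1, -1)

# Synthetic 32x32 test image: a bright vertical bar on a dark background
img = np.zeros((32, 32))
img[:, 12:20] = 255
sample = to_digits_format(img)
print(sample.shape)
print(sample.reshape(8, 8))
```

The resulting sample can then be passed to the trained forest, for example rf.predict(sample).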
The full code:
    from sklearn.datasets import load_digits
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    import numpy as np

    # Load the dataset and look at one example
    digits = load_digits()
    print(digits.data.shape)
    plt.gray()
    plt.matshow(digits.images[1])
    print(digits.target[1])
    print(digits.images[1])
    # plt.show()

    # Random split: roughly 80% training data, 20% testing data
    msk = np.random.rand(len(digits.images)) < 0.8
    train_image, test_image = digits.images[msk].copy(), digits.images[~msk].copy()
    train_target, test_target = digits.target[msk].copy(), digits.target[~msk].copy()

    # Flatten each 8x8 image into a row of 64 values
    train_image = train_image.reshape((len(train_image), -1))
    test_image_reshape = test_image.reshape((len(test_image), -1))

    # Train the random forest
    rf = RandomForestClassifier()
    rf.fit(train_image, train_target)

    # Show one test image and compare the prediction with its label
    plt.gray()
    i = 15
    plt.matshow(test_image[i])
    plt.show()
    expected = test_target[i]
    # predict expects a 2D array, so pass the image as a 1x64 row
    predicted = rf.predict(test_image_reshape[i].reshape(1, -1))
    print("predicted")
    print(predicted)
    print("expected")
    print(expected)

    # Accuracy on the whole test set
    score = rf.score(test_image_reshape, test_target)
    print('Random Tree Classifier:\n')
    print('Score\t' + str(score))