Programming Assignment 1

Posted on May 31, 2017 by Govind Gopakumar

Please find the associated IPython notebook here.

In this assignment, we’ll try our hand at a few of the classifiers we know. It is meant to serve as an introduction both to the scikit-learn interface and to handling data in Python.

# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt

# Import the KNN Classifier
from sklearn.neighbors import KNeighborsClassifier

# Import the Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

# We also need some data. For now, we shall use one of scikit-learn's built-in
# datasets. We will learn how to use our own data later on.
from sklearn.datasets import load_iris

The above lines do the following tasks:

- Import the basic libraries we need (array handling, plotting)
- Import both the classifiers we studied (KNN, Decision Tree)
- Import a data loading utility for the Iris dataset

iris = load_iris()
X = iris.data
Y = iris.target
# Let us find out the dimensions of the data we are dealing with:

print(X.shape)
print(Y.shape)
(150, 4)
(150,)
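So we have 150 examples, each described by 4 features, with one label per example. If you are curious what those features and labels actually are, the object returned by load_iris also stores their names; a small optional sketch:

# Optional: inspect the names of the features and the classes
print(iris.feature_names)
print(iris.target_names)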

We have our data in hand. Let us proceed with classification!

Remember that Decision Tree does not require any particular parameters from our end, but we do need to specify the “K” in K nearest neighbors!

# Create the Classifiers
knnclf = KNeighborsClassifier(n_neighbors=5)
dectreeclf = DecisionTreeClassifier()
knnclf.fit(X, Y)
dectreeclf.fit(X, Y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

The above two calls illustrate the training procedure. Note that we barely have to do anything: just put the data in the appropriate matrix format and call the fit() function on each of our classifiers. (The block printed above is simply the notebook displaying the fitted DecisionTreeClassifier along with its default parameters.) We can now check how well our models have learnt from the data.

# clf.predict(X) returns the predictions of the classifier on the matrix X
# Here, we compare the predictions to the original labels Y and print the mean,
# which is the fraction of correct predictions (the accuracy on the training data)
print((knnclf.predict(X) == Y).mean())
print((dectreeclf.predict(X) == Y).mean())
0.966666666667
1.0
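As an aside, every scikit-learn classifier also exposes a score() method that computes this same mean accuracy in a single call; a minimal equivalent sketch:

# score() returns the mean accuracy of the classifier's predictions on (X, Y)
print(knnclf.score(X, Y))
print(dectreeclf.score(X, Y))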

As we can see above, the decision tree fits the training data better than the K nearest neighbors classifier. Keep in mind, though, that we evaluated on the very data we trained on: an unpruned decision tree can keep splitting until it classifies every training point correctly, so a perfect score here does not necessarily mean it generalizes better. I would encourage you to play with the different parameters of these scikit-learn classifiers, and see how high you can get this accuracy to go.
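To check how well the classifiers generalize rather than memorize, a common next step is to hold out part of the data for testing. Here is a minimal sketch using scikit-learn's train_test_split; the test_size and random_state values below are arbitrary choices for illustration:

# A fairer evaluation: train on one part of the data, test on the rest
# (test_size and random_state are arbitrary illustrative choices)
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

knnclf = KNeighborsClassifier(n_neighbors=5)
knnclf.fit(X_train, Y_train)
print(knnclf.score(X_test, Y_test))

dectreeclf = DecisionTreeClassifier()
dectreeclf.fit(X_train, Y_train)
print(dectreeclf.score(X_test, Y_test))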
