In this Python tutorial, learn to create plots from the sklearn digits dataset. Scikit-learn is widely used for data analysis and data mining, and it ships with a few standard datasets, such as digits and iris for classification. Older releases also included the Boston, MA house prices dataset for regression, which was removed in scikit-learn 1.2.
Digits Dataset sklearn
The sklearn digits dataset is made up of 1797 8×8 images. Each image, like the one shown below, is of a hand-written digit. In order to utilize an 8×8 figure like this, we will need to transform it into a feature vector with length 64.
Digits Dataset Analysis
This section focuses on analyzing the sklearn digits dataset so that we understand the data before we dive into visualization.
Load the Digits Dataset
We can load the digits dataset from sklearn.datasets by calling the load_digits() function. It returns an object that holds the digits data along with its associated attributes.
from sklearn import datasets
import matplotlib.pyplot as plt

digits = datasets.load_digits()
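To get a feel for what load_digits() returns, the following sketch prints the shapes of its main attributes. The attribute names (data, images, target, target_names) belong to the returned object; the comments describe what each one holds.

```python
from sklearn import datasets

digits = datasets.load_digits()

# Flattened pixel values: one row per image, one column per pixel
print(digits.data.shape)      # (1797, 64)

# The same pixels kept as 8 x 8 grids
print(digits.images.shape)    # (1797, 8, 8)

# One integer label (0-9) per image
print(digits.target.shape)    # (1797,)
print(digits.target_names)    # [0 1 2 3 4 5 6 7 8 9]
```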
Each feature in the digits dataset is the intensity of one pixel of an 8 x 8 image, so every image is stored as a feature vector of length 64. The first observation looks like this:
Input:
X = digits.data
y = digits.target
print(X[0])
Output:
[ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5. 0. 0. 3. 15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8. 8. 0. 0. 5. 8. 0. 0. 9. 8. 0. 0. 4. 11. 0. 1. 12. 7. 0. 0. 2. 14. 5. 10. 12. 0. 0. 0. 0. 6. 13. 10. 0. 0. 0.]
The observation's feature values are presented as a vector. To view the handwritten digit as a matrix instead, we can use the images attribute, which holds the same values as an 8 x 8 array.
Input:
print(digits.images[0])
Output:
[[ 0.  0.  5. 13.  9.  1.  0.  0.]
 [ 0.  0. 13. 15. 10. 15.  5.  0.]
 [ 0.  3. 15.  2.  0. 11.  8.  0.]
 [ 0.  4. 12.  0.  0.  8.  8.  0.]
 [ 0.  5.  8.  0.  0.  9.  8.  0.]
 [ 0.  4. 11.  0.  1. 12.  7.  0.]
 [ 0.  2. 14.  5. 10. 12.  0.  0.]
 [ 0.  0.  6. 13. 10.  0.  0.  0.]]
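The vector and matrix views hold the same pixel values. As a minimal sketch of the 8 x 8 to length-64 transformation, the snippet below flattens the images with NumPy's reshape and checks the result against the ready-made feature matrix. The variable names flattened and first_image are illustrative only.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8 x 8 image into a row of 64 values
n_samples = len(digits.images)
flattened = digits.images.reshape(n_samples, -1)
print(flattened.shape)                                 # (1797, 64)

# The flattened images match the feature matrix in digits.data
print(np.array_equal(flattened, digits.data))          # True

# Going the other way: turn one feature vector back into an image
first_image = digits.data[0].reshape(8, 8)
print(np.array_equal(first_image, digits.images[0]))   # True
```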
First Observation’s Feature Values as an Image
Input:
from sklearn import datasets
import matplotlib.pyplot as plt

digits = datasets.load_digits()

plt.gray()
plt.matshow(digits.images[0])
plt.show()
Output:
K-Nearest Neighbors Algorithm
The k-nearest neighbors (KNN) algorithm can be used to solve classification and regression problems. In this example, we import KNeighborsClassifier from sklearn.neighbors and train_test_split from sklearn.model_selection. We hold out 20% of the data as a test set (test_size=0.2), stratify the split by label, and set the random state to 42. The classifier uses 7 neighbors and is fit to the training data. The score() method returns the accuracy of the classifier's predictions on the test set, which we print.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
Output:
0.9833333333333333
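The choice of 7 neighbors is one option among many. As a rough sketch of how one might compare a few values of k under the same split, the loop below refits the classifier for several odd values and records the test accuracy. The range of k values and the scores dictionary are illustrative assumptions, not part of the original example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.2, random_state=42, stratify=digits.target,
)

# Candidate neighbor counts (an arbitrary illustrative range)
scores = {}
for k in range(1, 10, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)

for k, acc in scores.items():
    print(f"k={k}: test accuracy {acc:.4f}")
```

Comparing values of k on the test set is fine for a quick look, but picking k this way leaks test information into the model choice. Cross-validation on the training set would be the more careful approach.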
Logistic Regression Algorithm
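The logistic regression algorithm uses a linear equation of the independent predictors to compute a score. That score can be anywhere between negative infinity and positive infinity, but we need the output to be a class variable, for example 0 (no) or 1 (yes). Logistic regression therefore passes the score through a logistic (sigmoid) function, or a softmax when there are several classes, to obtain class probabilities, and it predicts the most probable class. In this example, we import LogisticRegression from sklearn.linear_model and reuse the training and test sets from the KNN example.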
Input:
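from sklearn.linear_model import LogisticRegression

# Note: the multi_class argument is deprecated since scikit-learn 1.5;
# drop it if your version emits a FutureWarning.
logisticRegr = LogisticRegression(solver='lbfgs', multi_class='auto')
logisticRegr.fit(X_train, y_train)

# Predict a single observation (reshaped to one row), then the first ten
logisticRegr.predict(X_test[0].reshape(1, -1))
logisticRegr.predict(X_test[0:10])

# Predict the whole test set and measure accuracy
predictions = logisticRegr.predict(X_test)
score = logisticRegr.score(X_test, y_test)
print(score)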
Output:
0.9611111111111111
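To make the step from an unbounded linear score to a class label concrete, the sketch below recomputes the class probabilities by hand from decision_function() using a softmax. It assumes a scikit-learn version in which the default multiclass formulation for the lbfgs solver is multinomial, and it raises max_iter (an assumption, not from the original example) to give the solver room to converge. The names raw_scores, probs, and manual_pred are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.2, random_state=42, stratify=digits.target,
)

# max_iter is raised here (an assumption) to help the solver converge
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Raw linear scores: one unbounded value per class for each sample
raw_scores = model.decision_function(X_test)

# Softmax: shift for numerical stability, exponentiate, normalize per row
shifted = raw_scores - raw_scores.max(axis=1, keepdims=True)
exp_scores = np.exp(shifted)
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# The most probable class is the predicted label
manual_pred = model.classes_[probs.argmax(axis=1)]
print(np.array_equal(manual_pred, model.predict(X_test)))
```

Each row of probs sums to 1, so the values can be read as the model's confidence in each of the ten digits.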
Digits Dataset Confusion Matrix
A confusion matrix is a table that describes a classification model's performance on a set of test data for which the true values are known. Here we compute it with sklearn.metrics and use Matplotlib and seaborn to plot it as a heatmap.
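Before plotting, it can help to look at the matrix as plain numbers. The following sketch builds a confusion matrix for a KNN classifier (chosen here only to keep the example quick and self-contained) and shows how accuracy and per-digit recall fall out of it. The names clf, y_pred, and recall_per_digit are illustrative.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.2, random_state=42, stratify=digits.target,
)

clf = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are actual labels, columns are predicted labels
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm.shape)  # (10, 10)

# Correct predictions sit on the diagonal, so accuracy is trace / total
accuracy = np.trace(cm) / cm.sum()
print(accuracy)

# Per-digit recall: correct predictions divided by the actual count per row
recall_per_digit = np.diag(cm) / cm.sum(axis=1)
print(recall_per_digit)
```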
Input:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

# Rows are actual labels, columns are predicted labels
cm = metrics.confusion_matrix(y_test, predictions)

plt.figure(figsize=(5, 5))
# The cells are integer counts, so format them as integers
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square=True, cmap='Blues_r')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = f'Accuracy Score: {score:.2f}'
plt.title(all_sample_title, size=12)
plt.show()
Output:
t-distributed Stochastic Neighbor Embedding (t-SNE)
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction algorithm used for exploring high-dimensional data. t-SNE maps multi-dimensional data to a low-dimensional space, typically two or three dimensions, that is suitable for human observation.
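In terms of array shapes, the mapping looks roughly like the sketch below: 64 pixel features per image go in, and two coordinates per image come out. The subset size of 300 is an arbitrary choice to keep the run short.

```python
from sklearn import datasets
from sklearn.manifold import TSNE

digits = datasets.load_digits()
X = digits.data[:300]  # small subset (an arbitrary choice) to keep the run short

tsne = TSNE(n_components=2, random_state=0)
X_2d = tsne.fit_transform(X)

print(X.shape)     # (300, 64)
print(X_2d.shape)  # (300, 2)
```

Note that scikit-learn's TSNE provides fit_transform() but no separate transform() method, so the embedding is computed for the data it is given and cannot be applied directly to new samples.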
Input:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE

# Load the dataset before slicing it; use the first 500 samples to keep t-SNE fast
digits = datasets.load_digits()
X = digits.data[:500]
y = digits.target[:500]

tsne = TSNE(n_components=2, random_state=0)
X_2d = tsne.fit_transform(X)

digits_ids = range(len(digits.target_names))

plt.figure(figsize=(6, 5))
colors = 'aqua', 'azure', 'coral', 'gold', 'green', 'fuchsia', 'maroon', 'purple', 'red', 'orange'
# Plot each digit class in its own color
for i, c, label in zip(digits_ids, colors, digits.target_names):
    plt.scatter(X_2d[y == i, 0], X_2d[y == i, 1], c=c, label=label)
plt.legend()
plt.show()
Output: