In this Python tutorial, learn to analyze the Wisconsin breast cancer dataset and make predictions with the decision tree machine learning algorithm. The Wisconsin breast cancer dataset can be downloaded from our datasets page.
Decision Trees Machine Learning Algorithm
Decision trees are a helpful way to make sense of a large dataset. They model the outcomes that follow from a series of decisions and estimate the probability of each possible outcome. Decision trees are also helpful for visualizing datasets: they have one of the clearest visual outputs of any form of statistical modeling or data mining. As a result, these trees are much easier to interpret than neural networks and are sometimes preferred in predictive data modeling.
The clearest way to set up a decision tree is to start from a particular premise and then plot out the multiple decisions that can follow from it. Each of these decisions then branches into its own set of possible outcomes. Whenever a new set of branches is added to the tree, the individual or algorithm must estimate the probability of reaching each of those outcomes.
Decision Trees Outcomes
From there, the algorithm continues to generate possible outcomes and determines which of them matter most, attaching a probability to each branch of the tree. A decision tree guided by a machine learning algorithm can revise the tree depending on how useful each piece of information turns out to be, and it may cut out outliers or other pieces of information that are not relevant to the eventual decision that needs to be made.
While the algorithm may consider that information during training, it does not have to display it in the final output, leaving less clutter and more simplicity in the decision tree's visualization. The reverse applies to decisions in the tree that are especially influential on the eventual outcome: these can be emphasized, perhaps shown in a larger font than the others.
Decision Trees Basic Parts
Decision trees have three basic parts that data professionals need to be aware of. The first is the root node: the starting premise from which every other decision flows. Next are the branches, often drawn as lines in a visualization of the decision tree; they represent the conditions that must be satisfied to reach each subsequent set of possibilities. Finally, and most importantly, there are the leaves: the final decisions that spring from the tree, each representing an outcome and its probability.
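To see these parts concretely, sklearn can print a fitted tree as indented text. Below is a minimal sketch, using the small built-in iris dataset rather than the breast cancer data we load later: each indentation level is a branch, and each line ending in "class:" is a leaf.

# A minimal illustrative sketch on a toy dataset, not the breast cancer data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(iris.data, iris.target)

# Each "|---" level is a branch; lines containing "class:" are leaves.
print(export_text(toy_tree, feature_names=list(iris.feature_names)))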
Decision Trees Classifier Model
The DecisionTreeClassifier() from sklearn is a very simple model to build, and a few parameters can be added to prune the tree without over-fitting. The random_state parameter is optional and defaults to None; it accepts an integer (or a RandomState instance) that seeds the random number generator, making the fitted tree reproducible.
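A minimal sketch of why fixing random_state matters: sklearn permutes the features before breaking ties between equally good splits, so two fits with the same seed choose the same splits and produce identical trees (illustrated here on the toy iris dataset).

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same seed -> the same splits are chosen, so the fitted trees match exactly.
tree_a = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=42).fit(X, y)
print(np.allclose(tree_a.feature_importances_, tree_b.feature_importances_))  # True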
Import Packages and Breast Cancer Data
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

breast_cancer = pd.read_csv('wisc_bc_data.csv')
Train and Test Data
Before we can create the training and test data, we must remove the 'id' column. This column provides no predictive value for diagnosing breast cancer, and it can be removed with Python's del statement.
del breast_cancer['id']

X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.loc[:, breast_cancer.columns != 'diagnosis'],
    breast_cancer['diagnosis'],
    stratify=breast_cancer['diagnosis'],
    random_state=66)
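As an aside, the same result can be achieved non-destructively with pandas' drop method, which returns a new DataFrame instead of mutating the original. A minimal sketch, starting from a freshly loaded DataFrame (before the del statement above has run):

# Non-destructive alternative to `del`: drop returns a new DataFrame.
raw = pd.read_csv('wisc_bc_data.csv')
features = raw.drop(columns=['id', 'diagnosis'])
labels = raw['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, stratify=labels, random_state=66)

# stratify keeps the benign/malignant ratio identical in both splits.
print(y_train.value_counts(normalize=True))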
Build the Decision Tree Classifier Model
Input:
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
print(f"Decision tree training set accuracy: {tree.score(X_train, y_train):.4f}")
print(f"Decision tree testing set accuracy: {tree.score(X_test, y_test):.4f}")
Output:
Decision tree training set accuracy: 1.0000
Decision tree testing set accuracy: 0.9510
As one can see, the training set accuracy is a perfect 1.0000, which is an indicator of over-fitting.
Decision Tree Classifier Model Over-fitting
Over-fitting usually occurs when the learning system fits the given training data so tightly that it becomes inaccurate at predicting outcomes for data it was not trained on. This is exactly what we can see with our training set.

With our decision tree classifier model, over-fitting occurred because the tree was grown to perfectly fit every sample in the training data set. In return, we ended up with branches encoding strict rules for sparse data. This is why the model's accuracy suffers when predicting samples that are not part of the training set.
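One way to see this over-fitting directly is to compare training and testing accuracy as the tree is allowed to grow deeper. A short sketch reusing the X_train/X_test split created earlier:

# A widening gap between the two scores is the signature of over-fitting.
for depth in range(1, 11):
    t = DecisionTreeClassifier(max_depth=depth, random_state=66)
    t.fit(X_train, y_train)
    print(f"depth={depth:2d}  "
          f"train={t.score(X_train, y_train):.4f}  "
          f"test={t.score(X_test, y_test):.4f}")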
Decision Tree Pruning
One of the methods used to address over-fitting in decision trees is called pruning, which is done after the initial training is complete. In pruning, you trim off branches of the tree by removing decision nodes, starting from the leaf nodes, in a way that does not disturb the overall accuracy.
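scikit-learn supports one form of post-pruning, minimal cost-complexity pruning, through the ccp_alpha parameter (available in scikit-learn 0.22 and later). A hedged sketch reusing the split from above; larger alphas prune away more nodes:

# cost_complexity_pruning_path returns the effective alphas for this data.
path = DecisionTreeClassifier(random_state=66).cost_complexity_pruning_path(
    X_train, y_train)

for alpha in path.ccp_alphas[::5]:  # sample every 5th alpha to keep it short
    pruned = DecisionTreeClassifier(random_state=66, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(f"alpha={alpha:.5f}  test accuracy={pruned.score(X_test, y_test):.4f}")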
Decision Tree Classifier Improvement
In this DecisionTreeClassifier(), we will add the optional max_depth parameter, an integer that sets the maximum depth of the decision tree. If it is left as None, the nodes will be expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
Input:
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)
print(f"Decision tree training set accuracy: {tree.score(X_train, y_train):.4f}")
print(f"Decision tree testing set accuracy: {tree.score(X_test, y_test):.4f}")
Output:
Decision tree training set accuracy: 0.9859
Decision tree testing set accuracy: 0.9371
As we can see, with max_depth set to 5 the tree no longer fits the training data perfectly: the training accuracy drops from 1.0000 to 0.9859, a sign that the over-fitting has been reduced and the model is relying on more general rules rather than memorizing the training samples.
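Rather than picking max_depth by hand, it can also be chosen with cross-validation. A sketch using GridSearchCV; the parameter grid here is an illustrative choice, not part of the original tutorial:

from sklearn.model_selection import GridSearchCV

# Search over a small grid of depths with 5-fold cross-validation on the
# training set only, so the test set stays untouched for final evaluation.
search = GridSearchCV(DecisionTreeClassifier(random_state=66),
                      param_grid={'max_depth': list(range(1, 11))},
                      cv=5)
search.fit(X_train, y_train)
print("Best max_depth:", search.best_params_['max_depth'])
print(f"Test accuracy: {search.score(X_test, y_test):.4f}")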
Decision Tree Classifier Feature Importance
The DecisionTreeClassifier() has a feature_importances_ attribute that returns the importance of each feature.
Input:
print(f"Breast Cancer Feature Importances: \n {format(tree.feature_importances_)} ")
Output:
Breast Cancer Feature Importances: [0. 0.0265003 0. 0.02285783 0. 0. 0. 0.01572222 0. 0. 0. 0. 0.01635447 0. 0. 0. 0. 0. 0. 0. 0.00908582 0.02449806 0.7643834 0.0057787 0. 0.01007777 0. 0.10474143 0. 0. ]
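The raw array is hard to read on its own; a short sketch that pairs each importance with its column name and sorts them. Importances sum to 1, and features the tree never splits on score 0.

# Pair each importance with its feature name, most important first.
importances = pd.Series(tree.feature_importances_,
                        index=X_train.columns).sort_values(ascending=False)
print(importances.head(10))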
Decision Tree Feature Importances Plot
Input:
# Collect the 30 feature names, skipping the 'diagnosis' label column.
breast_cancer_features = [x for x in breast_cancer.columns if x != 'diagnosis']

def breast_cancer_feature_importances_plot(model):
    plt.figure(figsize=(10, 5))
    n_features = 30
    plt.barh(range(n_features), model.feature_importances_,
             align='center', color='#FF69B4')
    plt.yticks(np.arange(n_features), breast_cancer_features)
    plt.title('Breast Cancer Decision Trees Feature Importances')
    plt.xlabel("Feature Importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)

breast_cancer_feature_importances_plot(tree)
plt.show()
Output: a horizontal bar chart of the importance scores for the 30 breast cancer features.
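Finally, since decision trees are prized for their readable visual output, the fitted tree itself can also be drawn. A minimal sketch using sklearn's plot_tree (available in scikit-learn 0.21 and later); limiting the rendered depth is an illustrative choice to keep the figure legible:

from sklearn.tree import plot_tree

# Draw the pruned tree; max_depth here limits only how much is rendered.
plt.figure(figsize=(20, 10))
plot_tree(tree, feature_names=list(X_train.columns),
          class_names=list(tree.classes_), filled=True, max_depth=2)
plt.show()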