In this Python tutorial, learn to analyze the Wisconsin breast cancer dataset for prediction using the random forest machine learning algorithm. The Wisconsin breast cancer dataset can be downloaded from our datasets page.
Random Forest Machine Learning Algorithm
Random forests are a decision tool used to classify data and help guide machines in making decisions. A random forest has the same basic structure as a decision tree: it is a machine learning algorithm that combines a large number of possible outcomes with the probability of each occurring. The random forest algorithm is particularly helpful when working with a machine learning training set. Some machine learning practitioners want their artificial intelligence systems to work within a small set of parameters; others want a more expansive model that goes beyond what any individual could reason through by hand.
Most data professionals will opt for a decision tree when at all possible. The decision tree is an approach to probability and percentages that is taught to many students in high school, and it can be understood by people who know nothing about any other data algorithm. Those same people may not be familiar with random forests and may be wary of the applicability of such a complex system. But random forests have their own applicability in many instances: a single decision tree can overfit its training set, which limits how reliably it can assign percentages to possible outcomes (a sketch contrasting the two follows below).
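To make that contrast concrete, here is a minimal sketch on a synthetic dataset (not the breast cancer data used later in this tutorial): an unpruned tree typically scores perfectly on the rows it was trained on, and the gap to its test score is the over-fitting a forest is meant to reduce.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Compare the train/test gap of the single tree against the forest
print(f"tree:   train {tree.score(X_tr, y_tr):.3f}, test {tree.score(X_te, y_te):.3f}")
print(f"forest: train {rf.score(X_tr, y_tr):.3f}, test {rf.score(X_te, y_te):.3f}")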
Possibilities and Probabilities
Random forests involve a machine learning program building a wide variety of decision trees and then observing how those trees develop. The random forest algorithm detects the patterns in what decisions and probabilities are available, then aggregates the trees: each tree votes, and the forest presents a set of possible outcomes with a set of probabilities, just like a decision tree does. The main benefit of random forests is that they can overcome the limitations of a single, basic decision tree; a minimal sketch of the voting idea follows.
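This sketch writes the voting out by hand rather than using sklearn's internals: each tree is fit on a bootstrap sample (rows drawn with replacement) and the forest's answer is the majority vote. The toy data and tree count here are arbitrary.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # toy rule to learn

trees = []
for _ in range(25):
    idx = rng.randint(0, len(X), len(X))  # bootstrap: sample rows with replacement
    trees.append(DecisionTreeClassifier(max_features='sqrt', random_state=0)
                 .fit(X[idx], y[idx]))    # random feature subset considered at each split

votes = np.array([t.predict(X) for t in trees])    # one row of predictions per tree
majority = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote across trees
print(f"voted accuracy on the training rows: {(majority == y).mean():.3f}")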
The random forest algorithm is also relatively easy to visualize, so a wider variety of individuals can view and understand its decisions than those made by other sophisticated algorithms. But even though the random forest algorithm is simpler than some alternatives, it is still vastly more complicated than a decision tree and cannot be followed as easily. Sometimes, a decision tree may be all that is needed and all that an individual should use for their particular machine learning problem.
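For instance, any single tree inside a fitted forest can still be drawn with sklearn's plot_tree; this sketch assumes `forest` is a fitted RandomForestClassifier, such as the one built later in this tutorial.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(forest.estimators_[0], max_depth=2, filled=True)  # top of the first tree only
plt.show()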
Random Forest Classifier Model
The RandomForestClassifier() from sklearn is a very simple model to build, and adding a few parameters helps reduce over-fitting. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default).
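A minimal sketch of those parameters (all standard scikit-learn): with bootstrap=True each tree trains on rows drawn with replacement, and setting oob_score=True scores each tree on the rows it never saw, giving a built-in validation estimate.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100,  # number of trees
                            bootstrap=True,    # sample rows with replacement (default)
                            oob_score=True,    # score on each tree's unseen rows
                            random_state=0)
# After rf.fit(X, y), rf.oob_score_ holds the out-of-bag accuracy estimate.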
Import Packages and Breast Cancer Dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the Wisconsin breast cancer dataset
breast_cancer = pd.read_csv('wisc_bc_data.csv')
Train and Test Data
Before we can create the training and test data, we must remove the ‘id’ column. This column provides no value for predicting breast cancer and can be removed with Python’s del statement.
# Drop the identifier column
del breast_cancer['id']

# Hold out a test set, stratified on the diagnosis label so both
# splits keep the same benign/malignant ratio
X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.loc[:, breast_cancer.columns != 'diagnosis'],
    breast_cancer['diagnosis'],
    stratify=breast_cancer['diagnosis'],
    random_state=66)
Build the Random Forest Classifier Model
The model below sets the n_estimators parameter to 100. n_estimators is the number of trees in the forest; it is an optional parameter whose default was 10 (raised to 100 in scikit-learn 0.22).
Input:
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)

print(f"Random forest training set accuracy: {forest.score(X_train, y_train):.4f}")
print(f"Random forest testing set accuracy: {forest.score(X_test, y_test):.4f}")
Output:
Random forest training set accuracy: 1.0000
Random forest testing set accuracy: 0.9441
The random forest with 100 trees gives us an accuracy of 94.41% on the testing data but 100% on the training set, which is an indicator of over-fitting. We can add the max_depth parameter and set it to 5.
Input:
forest = RandomForestClassifier(max_depth=5, n_estimators=100)
forest.fit(X_train, y_train)

print(f"Random forest training set accuracy: {forest.score(X_train, y_train):.4f}")
print(f"Random forest testing set accuracy: {forest.score(X_test, y_test):.4f}")
Output:
Random forest training set accuracy: 0.9930
Random forest testing set accuracy: 0.9441
By adding a max depth of 5, the training set accuracy dropped to 99.30%, reducing the over-fitting, while the testing set accuracy stayed the same at 94.41%.
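Rather than picking max_depth=5 by hand, one could search candidate depths with cross-validation on the training split; here is a minimal sketch using scikit-learn's GridSearchCV, with an arbitrary example grid.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                    param_grid={'max_depth': [3, 5, 7, None]},  # example grid
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, f"cv accuracy: {grid.best_score_:.4f}")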
Random Forest Classifier Feature Importance
The RandomForestClassifier() has an attribute, feature_importances_, that returns the importance of each feature.
Input:
print(f"Breast Cancer Feature Importances: \n {format(forest.feature_importances_)} ")
Output:
Breast Cancer Feature Importances: [0.03441597 0.00629157 0.06244324 0.08036168 0.00319567 0.01343588 0.02889459 0.08010194 0.00343783 0.00393573 0.01064791 0.00429263 0.01221628 0.03268579 0.00303577 0.00347593 0.0066475 0.00691506 0.00409171 0.00267978 0.09333557 0.01673479 0.16680822 0.11367018 0.01000071 0.01571671 0.04184256 0.12001528 0.01199975 0.0066738 ]
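The raw array is hard to read on its own; here is a minimal sketch pairing each importance with its column name and sorting (the variable names are illustrative).

# Every column except the 'diagnosis' label, in the order sklearn saw them
feature_names = [c for c in breast_cancer.columns if c != 'diagnosis']
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:  # five most important features
    print(f"{name}: {score:.4f}")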
Random Forest Classifier Feature Importance Plot
Input:
# Feature names: every column except the 'diagnosis' label. Selecting by
# name avoids the bug of excluding the wrong column by position.
breast_cancer_features = [c for c in breast_cancer.columns if c != 'diagnosis']

def breast_cancer_feature_importances_plot(model):
    plt.figure(figsize=(10, 5))
    n_features = len(breast_cancer_features)  # 30 features
    plt.barh(range(n_features), model.feature_importances_, align='center', color='#FF1493')
    plt.yticks(np.arange(n_features), breast_cancer_features)
    plt.title('Breast Cancer Random Forest Feature Importances')
    plt.xlabel("Feature Importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)

breast_cancer_feature_importances_plot(forest)
plt.show()
Output:
[Horizontal bar chart of the 30 breast cancer feature importances]
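Finally, the fitted forest can be used for the prediction this tutorial set out to do; a minimal sketch on the held-out test rows:

# Predict the diagnosis label ('B' benign / 'M' malignant in this dataset)
# for the first ten held-out rows
predictions = forest.predict(X_test)
print(predictions[:10])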