This Python tutorial covers diabetes prediction with the Random Forest machine learning algorithm. The dataset can be downloaded from the UCI Machine Learning Repository. If you're not familiar with the diabetes dataset, spend some time exploring the data with the step-by-step Diabetes Dataset Analysis tutorial.
Random Forest Classifier Model
The RandomForestClassifier() from sklearn is a simple model to build, and a few parameters can be added to reduce over-fitting. A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap=True (the default).
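To build intuition for what bootstrap=True means, here is a minimal sketch (not part of the tutorial's pipeline) that draws one bootstrap sample with NumPy: the sample is the same size as the original data, but because rows are drawn with replacement, only about 63% of the original rows appear in it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # pretend we have 1000 rows of training data

# A bootstrap sample: n draws from [0, n) WITH replacement.
indices = rng.integers(0, n, size=n)

# Fraction of distinct original rows that made it into the sample.
unique_fraction = len(np.unique(indices)) / n
print(f"Unique rows in bootstrap sample: {unique_fraction:.2%}")
```

Each tree in the forest is trained on a different such sample, which is what makes averaging the trees' predictions effective.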
Import Packages and Diabetes Data
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

diabetes = pd.read_csv('diabetes.csv')
Train and Test Data
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.loc[:, diabetes.columns != 'Outcome'],
    diabetes['Outcome'],
    stratify=diabetes['Outcome'],
    random_state=66)
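The stratify=diabetes['Outcome'] argument keeps the proportion of diabetic and non-diabetic cases the same in both splits. A small self-contained sketch with synthetic labels (a hypothetical stand-in for the Outcome column) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 65% negative, 35% positive.
y = np.array([0] * 65 + [1] * 35)
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=66)

# Both splits keep roughly the original 35% positive rate.
print(f"Train positive rate: {y_tr.mean():.2f}")
print(f"Test positive rate:  {y_te.mean():.2f}")
```

Without stratification, a random split of an imbalanced dataset can end up with noticeably different class proportions in train and test.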
Build the Random Forest Classifier Model
The model below sets the n_estimators parameter to 100. n_estimators is the number of trees in the forest; it is an optional parameter whose default is 100 in scikit-learn 0.22 and later (it was 10 in earlier versions).
Input:
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
print(f"Random forest training set accuracy: {forest.score(X_train, y_train):.4f}")
print(f"Random forest testing set accuracy: {forest.score(X_test, y_test):.4f}")
Output:
Random forest training set accuracy: 1.0000
Random forest testing set accuracy: 0.7917
The random forest with 100 trees gives us an accuracy of 79.17% on the testing data but 100% on the training set, which is an indicator of over-fitting. We can limit tree depth by setting the max_depth parameter to 5.
Input:
forest = RandomForestClassifier(max_depth=5, n_estimators=100)
forest.fit(X_train, y_train)
print(f"Random forest training set accuracy: {forest.score(X_train, y_train):.4f}")
print(f"Random forest testing set accuracy: {forest.score(X_test, y_test):.4f}")
Output:
Random forest training set accuracy: 0.8403
Random forest testing set accuracy: 0.7969
By capping the depth at 5, the training set accuracy dropped from 100% to 84.03%, reducing over-fitting, while the testing set accuracy improved from 79.17% to 79.69%.
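To see how max_depth trades training accuracy against generalization, here is a self-contained sketch that sweeps a few depths. It uses a synthetic dataset from make_classification as a stand-in for the diabetes CSV (so it runs without the file); the exact scores will differ from the tutorial's, but the pattern holds: unlimited depth memorizes the training set, while a moderate cap narrows the train/test gap.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 768 rows and 8 features, like the Pima diabetes data.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=66)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=66)

results = {}
for depth in [2, 5, None]:  # None = grow trees to full depth
    rf = RandomForestClassifier(max_depth=depth, n_estimators=100,
                                random_state=0)
    rf.fit(X_tr, y_tr)
    results[depth] = (rf.score(X_tr, y_tr), rf.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.4f}, "
          f"test={results[depth][1]:.4f}")
```

In practice, max_depth is usually tuned with cross-validation rather than picked by hand.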
Random Forest Classifier Feature Importance
The RandomForestClassifier() has an attribute, feature_importances_, that returns the importance of each feature.
Input:
print(f"Diabetes Feature Importances:\n{forest.feature_importances_}")
Output:
Diabetes Feature Importances:
[0.06407568 0.34152962 0.06332738 0.04797788 0.07042062 0.20293128
 0.09624571 0.11349184]
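The raw array is hard to read on its own; pairing each importance with its column name and sorting makes the ranking obvious. This sketch uses the standard Pima diabetes column names (Outcome excluded) and the importances printed above:

```python
import numpy as np

# Feature names of the Pima diabetes dataset, excluding the Outcome label.
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
# Importances from the fitted forest above.
importances = np.array([0.06407568, 0.34152962, 0.06332738, 0.04797788,
                        0.07042062, 0.20293128, 0.09624571, 0.11349184])

# Sort features from most to least important.
ranked = sorted(zip(features, importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:<25} {score:.4f}")
```

Glucose and BMI dominate the ranking, which matches clinical intuition for diabetes risk. Note that the importances sum to 1.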
Random Forest Classifier Feature Importance Plot
Input:
diabetes_features = [x for i, x in enumerate(diabetes.columns) if i != 8]

def diabetes_feature_importances_plot(model):
    plt.figure(figsize=(10, 5))
    n_features = 8
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), diabetes_features)
    plt.title('Diabetes Random Forest Feature Importances')
    plt.xlabel("Feature Importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)

diabetes_feature_importances_plot(forest)
plt.show()
Output: a horizontal bar chart of the feature importances, one bar per feature.