In this Python tutorial, learn to analyze the Wisconsin breast cancer dataset and build a prediction model using the gradient boosting machine learning algorithm. The Wisconsin breast cancer dataset can be downloaded from our datasets page.
Gradient Boosting Machine Learning Algorithm
Boosting is a common ensemble technique in machine learning and artificial intelligence. It builds a sequence of simple models, typically shallow decision trees, where each new model focuses on the examples that the previous models predicted poorly. Each individual model is a “weak learner”: on its own, it predicts only slightly better than chance. By combining many weak learners with a sophisticated training procedure, the goal is to produce a single model with much stronger predictive power. The term “boosting” refers to this process of boosting weak learners into a strong one.
A gradient boosting algorithm turns weak, poorly predictive models into stronger ones by fitting each new weak learner to the errors of the ensemble built so far. This approach is helpful for particularly large datasets where the data points may share few obvious features in common. One of the areas where gradient boosting is used most frequently is in search engine ranking. A ranking system has to take in a large number of potential searches that may not be directly related to one another and distill them down into an ordered list of results.
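To make the weak-versus-strong distinction concrete, here is a small sketch (not part of the original tutorial) that compares a single decision stump, a one-level decision tree and a classic weak learner, with a boosted ensemble of the same stumps on a synthetic dataset:

```python
# A minimal sketch: a single decision stump (weak learner) versus a
# gradient-boosted ensemble of stumps on synthetic classification data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One tree of depth 1: only slightly better than guessing.
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)

# Many depth-1 trees combined by boosting: a much stronger model.
boosted = GradientBoostingClassifier(max_depth=1, n_estimators=100,
                                     random_state=0).fit(X_train, y_train)

print(f"Single stump test accuracy:   {stump.score(X_test, y_test):.4f}")
print(f"Boosted stumps test accuracy: {boosted.score(X_test, y_test):.4f}")
```

On this synthetic data, the boosted ensemble should clearly outperform the lone stump, which is exactly the improvement that boosting is designed to deliver.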
Search Engine Boosting
This process has to be repeated for thousands or potentially millions of different results on a search engine such as Google or Yahoo. These companies make money when an individual can type in a handful of keywords and then find exactly what they want to see or buy. They need to refine their data systems in order to create algorithms that surface the exact information a user is looking for. Gradient boosting helps to make sure that the ranking of results corresponds to what an individual wants to see. A better-performing gradient booster greatly increases the chances that a search engine will be successful and will be able to make both an individual and a company a considerable amount of money over months or years.
Gradient Boosting Accuracy Prediction
The Wisconsin breast cancer dataset will be used to build a model with the gradient boosting algorithm and to measure its accuracy on the training and testing data. After building the model, we can record the training and testing accuracy and output a plot with a parameter range on the x-axis and the accuracy on the y-axis.
Import Packages and Breast Cancer Dataset
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

breast_cancer = pd.read_csv('wisc_bc_data.csv')
```
Train and Test Data
Before we can create the training and testing data, we must first remove the ‘id’ column. This column provides no value for predicting breast cancer and can be removed with Python’s del statement.
```python
del breast_cancer['id']

X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.loc[:, breast_cancer.columns != 'diagnosis'],
    breast_cancer['diagnosis'],
    stratify=breast_cancer['diagnosis'],
    random_state=66)
```
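The stratify argument is worth a quick check. It makes train_test_split preserve the benign/malignant class proportions in both splits. The sketch below (an addition, not part of the tutorial) demonstrates this using scikit-learn’s built-in copy of the Wisconsin dataset, so it runs without the CSV file:

```python
# A hedged sketch of what stratify= does, using scikit-learn's bundled
# copy of the Wisconsin breast cancer dataset (load_breast_cancer).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()  # target: 0 = malignant, 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, stratify=data.target, random_state=66)

# The malignant share should be nearly identical in all three.
print(f"Full dataset malignant share: {1 - data.target.mean():.3f}")
print(f"Train split malignant share:  {1 - y_tr.mean():.3f}")
print(f"Test split malignant share:   {1 - y_te.mean():.3f}")
```

Without stratification, an unlucky random split could leave one class under-represented in the test set, which would distort the accuracy numbers reported below.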
Build the Gradient Boosting Model
Input:
```python
boosting = GradientBoostingClassifier()
boosting.fit(X_train, y_train)

print(f"Gradient boosting training set accuracy: {boosting.score(X_train, y_train):.4f}")
print(f"Gradient boosting testing set accuracy: {boosting.score(X_test, y_test):.4f}")
```
Output:
```
Gradient boosting training set accuracy: 1.0000
Gradient boosting testing set accuracy: 0.9441
```
Gradient Boosting Model Over-fitting
Over-fitting usually occurs when a learning system fits the given training data so tightly that it becomes inaccurate at predicting outcomes on unseen data. This is exactly what we can see above: the training set accuracy is a perfect 1.0000, well above the testing set accuracy. One way to reduce over-fitting is to limit the depth of each tree.
Input:
```python
max_boosting = GradientBoostingClassifier(max_depth=1)
max_boosting.fit(X_train, y_train)

print(f"Gradient boosting training set accuracy: {max_boosting.score(X_train, y_train):.4f}")
print(f"Gradient boosting testing set accuracy: {max_boosting.score(X_test, y_test):.4f}")
```
Output:
```
Gradient boosting training set accuracy: 0.9883
Gradient boosting testing set accuracy: 0.9371
```
By limiting the trees to a maximum depth of 1, we decreased the model complexity. This lowered both the training and testing set accuracy slightly, but it also narrowed the gap between them, which is a sign of less over-fitting.
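The introduction mentioned a plot with a parameter range on the x-axis and accuracy on the y-axis. Here is a hedged sketch of one way to produce it; the choice of sweeping max_depth from 1 to 5 is an assumption, not part of the original tutorial, and scikit-learn’s bundled copy of the dataset is used so the snippet runs on its own:

```python
# A sketch: training and testing accuracy across a range of max_depth
# values, to visualize how tree depth trades off against over-fitting.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=66)

depths = range(1, 6)  # assumed range for illustration
train_acc, test_acc = [], []
for depth in depths:
    model = GradientBoostingClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc.append(model.score(X_train, y_train))
    test_acc.append(model.score(X_test, y_test))

plt.plot(depths, train_acc, label='Training accuracy')
plt.plot(depths, test_acc, label='Testing accuracy')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```

As the depth grows, training accuracy climbs toward 1.0 while testing accuracy levels off or dips, making the over-fitting region easy to spot on the plot.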
Decision Features Importance Plot
Input:
```python
# Feature names: every column except the 'diagnosis' label.
# (Filtering by name avoids off-by-one errors after 'id' was deleted.)
breast_cancer_features = [x for x in breast_cancer.columns if x != 'diagnosis']

def breast_cancer_feature_importances_plot(model):
    plt.figure(figsize=(10, 5))
    n_features = 30
    plt.barh(range(n_features), model.feature_importances_,
             align='center', color=['#FF1493'])
    plt.yticks(np.arange(n_features), breast_cancer_features)
    plt.title('Breast Cancer Gradient Boosting Feature Importances')
    plt.xlabel("Feature Importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)

breast_cancer_feature_importances_plot(max_boosting)
plt.show()
```
Output: