In this Python tutorial, learn to analyze the Wisconsin breast cancer dataset for prediction using the logistic regression algorithm. The Wisconsin breast cancer dataset can be downloaded from our datasets page.
Logistic Regression Machine Learning Algorithm Summary
Logistic regression is another way that a machine learning algorithm can make a decision. A system that uses logistic regression does more than decide whether a data point falls into one category or another: it estimates the probability that a particular point belongs to one group or the other. By learning these probabilities from a large dataset, the system can make predictions about future events. This has a number of practical real-world applications, because it helps quantify the chance that a particular event will happen.
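As a rough sketch of how such a probability is produced (not part of the tutorial's dataset walkthrough below), logistic regression passes a weighted sum of the input features through the logistic (sigmoid) function, which squashes any real number into the range 0 to 1. The weights, bias, and data point below are invented purely for illustration.

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number to a value between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and a single data point
weights = np.array([0.8, -1.2])
bias = 0.1
x = np.array([2.0, 1.5])

z = np.dot(weights, x) + bias  # weighted sum of the inputs
print(sigmoid(z))              # estimated probability of the positive class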
One common example comes from the medical field, where logistic regression systems have been used to estimate the chances of patient mortality. Patient mortality is a binary outcome, but its probability changes over time and can increase or decrease depending on the circumstances. A wide variety of factors affect the chances that a patient lives or dies over a given period. A logistic regression model weighs a large number of these factors based on how much each contributes to the patient surviving or dying. The end result is a model that outputs a survival probability for a patient given the factors provided as input. The model can be trained on real-world outcomes and evaluated on its performance in a number of different settings.
Logistic Regression Trained and Untrained Datasets
Logistic regression, like other forms of machine learning, can be applied to both trained (labeled) sets and untrained (unlabeled) sets. With a trained set, logistic regression learns the probability that certain events occur, using the events that actually occurred as the example set. For instance, a logistic regression system that identifies images may be given a set of relevant images as labeled examples, which it attempts to replicate dozens or hundreds of times. These trained logistic regression models can be effective at identifying images and making predictions based on previously labeled examples.
Untrained sets can also help detect relationships and patterns across a large amount of data. With an untrained set, the system attempts to estimate the probability of an event, that is, how close to 1 or 0 the chance is that the event occurs. This is particularly helpful when an individual does not start out knowing the ideal outcome they want their machine learning system to match. While untrained sets are certainly less predictable than trained sets, they can help provide a baseline for the data points that make up a regression line.
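As a minimal sketch of the labeled case described above (using a tiny hypothetical dataset invented for illustration, not the breast cancer data), the snippet below fits scikit-learn's LogisticRegression on labeled examples and then reports class probabilities for new, unlabeled points.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled examples: one feature, binary labels (0 or 1)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Probabilities for previously unseen (unlabeled) points
new_points = np.array([[2.5], [4.5]])
print(clf.predict_proba(new_points))  # columns are P(class 0) and P(class 1)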
Logistic Regression Model
We are able to fit the logistic regression model with the optional parameter C. C is an optional float parameter with a default value of 1.0. It is the inverse of the regularization strength and must be a positive float. As with support vector machines, smaller values specify stronger regularization.
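As a quick illustration of this effect (on a small synthetic dataset generated with scikit-learn's make_classification, used here only for illustration and not the tutorial's data), stronger regularization via a smaller C should shrink the learned coefficients toward zero.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary classification problem (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

strong_reg = LogisticRegression(C=0.01).fit(X, y)  # stronger regularization
weak_reg = LogisticRegression(C=100).fit(X, y)     # weaker regularization

# Smaller C pulls the coefficients toward zero
print(np.abs(strong_reg.coef_).sum())
print(np.abs(weak_reg.coef_).sum())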
Import Packages and Breast Cancer Data
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

breast_cancer = pd.read_csv("wisc_bc_data.csv")
Pandas head() Function
We can use the pandas head() function to view the first 5 rows of data.
Input:
print(breast_cancer.head())
Output:
         id diagnosis  ...  symmetry_worst  fractal_dimension_worst
0    842302         M  ...          0.4601                  0.11890
1    842517         M  ...          0.2750                  0.08902
2  84300903         M  ...          0.3613                  0.08758
3  84348301         M  ...          0.6638                  0.17300
4  84358402         M  ...          0.2364                  0.07678

[5 rows x 32 columns]
From the above output, we can see that the id column is of no value for this project, so we can simply remove it.
Input:
del breast_cancer['id']

print(breast_cancer.head())
Output:
  diagnosis  radius_mean  ...  symmetry_worst  fractal_dimension_worst
0         M        17.99  ...          0.4601                  0.11890
1         M        20.57  ...          0.2750                  0.08902
2         M        19.69  ...          0.3613                  0.08758
3         M        11.42  ...          0.6638                  0.17300
4         M        20.29  ...          0.2364                  0.07678

[5 rows x 31 columns]
Train and Test Data
X_train, X_test, y_train, y_test = train_test_split(breast_cancer.loc[:, breast_cancer.columns != 'diagnosis'],
                                                    breast_cancer['diagnosis'],
                                                    stratify=breast_cancer['diagnosis'],
                                                    random_state=66)
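As an optional sanity check (not part of the original walkthrough), you can print the shapes of the resulting splits. With scikit-learn's defaults, train_test_split holds out 25% of the rows for testing, and the stratify argument keeps the proportion of malignant and benign diagnoses the same in both splits.

# Optional check of the split sizes and class balance
print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))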
LogisticRegression()
Input:
model = LogisticRegression().fit(X_train, y_train)

print(f"Logistic Regression training set classification score: {model.score(X_train, y_train):.4f}")
print(f"Logistic Regression testing set classification score: {model.score(X_test, y_test):.4f}")
Output:
Logistic Regression training set classification score: 0.9554
Logistic Regression testing set classification score: 0.9371
LogisticRegression(C=0.01)
Input:
model_001 = LogisticRegression(C=0.01).fit(X_train, y_train)

print(f"Logistic Regression training set classification score: {model_001.score(X_train, y_train):.4f}")
print(f"Logistic Regression testing set classification score: {model_001.score(X_test, y_test):.4f}")
Output:
Logistic Regression training set classification score: 0.9484
Logistic Regression testing set classification score: 0.8951
LogisticRegression(C=100)
Input:
model_100 = LogisticRegression(C=100).fit(X_train, y_train)

print(f"Logistic Regression training set classification score: {model_100.score(X_train, y_train):.4f}")
print(f"Logistic Regression testing set classification score: {model_100.score(X_test, y_test):.4f}")
Output:
Logistic Regression training set classification score: 0.9671
Logistic Regression testing set classification score: 0.9650
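Beyond the accuracy scores above, the fitted models can also report the probability behind each prediction, which is the quantity logistic regression actually estimates. As an optional, illustrative addition (your exact numbers will depend on the split), the snippet below prints the predicted diagnoses and class probabilities for the first few test rows using the model_100 fit from above.

# Predicted diagnoses and class probabilities for the first five test rows
print(model_100.predict(X_test[:5]))
print(model_100.classes_)                  # column order of predict_proba
print(model_100.predict_proba(X_test[:5]))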
Logistic Regression Model Plot
In the logistic regression model plot we will take the above models and plot their coefficients. The enumerate() function will be used to iterate over the columns of the breast cancer dataset. enumerate() adds a counter to an iterable and returns an enumerate object, which can be used directly in loops or converted into a list of tuples with the list() function.
When plotting the models, coef_ is used; it is an array of shape (1, n_features) or (n_classes, n_features) and holds the coefficients of the features in the decision function. The .T attribute transposes this array, so coef_.T gives one coefficient value per feature for each model, which is convenient for plotting.
Input:
breast_cancer_features = [x for i, x in enumerate(breast_cancer.columns) if i != 0]

plt.figure(figsize=(12, 6))
plt.plot(model.coef_.T, 'o', label="LogisticRegression()")
plt.plot(model_001.coef_.T, 'v', label="LogisticRegression(C=0.01)")
plt.plot(model_100.coef_.T, '^', label="LogisticRegression(C=100)")
plt.xticks(range(len(breast_cancer_features)), breast_cancer_features, rotation=90)
plt.hlines(0, 0, len(breast_cancer_features))
plt.ylim(-3, 3)
plt.title('Logistic Regression')
plt.xlabel("Feature")
plt.ylabel("Coefficient Magnitude")
plt.legend()
plt.show()
Output: a plot of the coefficient magnitude for each feature, with one marker series per model.