In this Python tutorial, you will learn to analyze and visualize the Wisconsin breast cancer dataset. The tutorial explores how the data can be used to predict whether a breast tumor is benign or malignant. Separate posts will apply machine learning to the same data to make those predictions. The purpose of this tutorial is to provide a basic understanding of data cleansing, data exploration, feature selection, model evaluation and model selection.
Importance of Data Mining
The field of data mining is one of the most robust technological fields in the modern digital economy. More and more companies are hiring data professionals and spending time and money perfecting their data procedures. Data mining operations are becoming essential to modern competition in the 21st century economy. They are helping factories and health care offices save thousands of dollars every year in their operations. They are allowing companies to microtarget advertisements and get more out of each dollar spent. A continuing trend has been to start hiring individuals in this field who will help organize these data approaches.
Data Manipulation and Predictions
These individuals are tasked with learning a massive amount of information in a short period of time. They have to be familiar with the dozens or even hundreds of different ways that a computer can manipulate a large amount of data, and with how to use the manipulated data to draw conclusions. A data professional may have to streamline a thousand different procedures or make sense of a million data points.
In order to be successful, a data mining professional must be familiar with all of these procedures and how to use them. They must know how to analyze and apply common forms of machine learning. These forms are applicable for a wide variety of data situations and can help a data professional classify, organize, and use data in order to make predictions.
Breast Cancer Tumors
A mass of abnormal tissue is known as a tumor. There are two types of tumors: benign, which are non-cancerous, and malignant, which are cancerous.
Benign Tumors
In most cases, a tumor that a doctor diagnoses as benign will be left alone. Benign tumors are generally not aggressive toward the surrounding tissue, although in some cases they may continue to grow. If the tumor continues to grow and causes pain or discomfort by pressing against surrounding organs, it would be removed.
Malignant Tumors
Malignant tumors are aggressive and cancerous because they damage the surrounding tissue. They may be removed, depending on the severity or aggressiveness of the tumor.
Breast Cancer Dataset Analysis
The next few sections complete a data analysis of the breast cancer dataset before we move on to visualizing it.
Load the Breast Cancer Dataset
The first step is to load the breast cancer dataset by importing the data with pandas using the pd.read_csv method. This returns a DataFrame object containing the breast cancer data and the attributes associated with it.
pd.read_csv()
import pandas as pd

breast_cancer = pd.read_csv('wisc_bc_data.csv')
breast_cancer.columns
We can use the breast_cancer.columns attribute to print out the column names.
Input:
print(breast_cancer.columns)
Output:
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')
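The column names above follow a pattern: apart from id and diagnosis, each name is a measurement followed by a _mean, _se or _worst suffix. As a minimal sketch of how that pattern could be made explicit, the snippet below groups a handful of the names by suffix. The columns list is a hand-copied subset of the output and by_suffix is an illustrative name, not part of the original tutorial.

```python
from collections import defaultdict

# A subset of the column names, copied from the output above.
columns = [
    "radius_mean", "concave points_mean", "fractal_dimension_mean",
    "radius_se", "concave points_se", "fractal_dimension_se",
    "radius_worst", "concave points_worst", "fractal_dimension_worst",
]

# Split each name at its last underscore: "radius_mean" -> ("radius", "mean").
by_suffix = defaultdict(list)
for name in columns:
    feature, suffix = name.rsplit("_", 1)
    by_suffix[suffix].append(feature)

for suffix, features in by_suffix.items():
    print(suffix, features)
```

Splitting at the last underscore, rather than the first, keeps names such as fractal_dimension intact.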
breast_cancer.head()
The breast_cancer.head() method returns the first 5 rows of the data, which we then print.
Input:
print(breast_cancer.head())
Output:
         id diagnosis  ...  symmetry_worst  fractal_dimension_worst
0    842302         M  ...          0.4601                  0.11890
1    842517         M  ...          0.2750                  0.08902
2  84300903         M  ...          0.3613                  0.08758
3  84348301         M  ...          0.6638                  0.17300
4  84358402         M  ...          0.2364                  0.07678
breast_cancer.shape
The breast_cancer.shape attribute shows the total number of rows and the total number of columns as a (rows, columns) tuple.
Input:
print(breast_cancer.shape)
Output:
(569, 32)
breast_cancer.info()
The breast_cancer.info() method prints a summary of the breast cancer dataset in one output: the index range, each column's name, non-null count and dtype, and the memory usage.
Input:
# info() prints the summary itself and returns None, so no print() is needed
breast_cancer.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
id                         569 non-null int64
diagnosis                  569 non-null object
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non-null float64
concave points_se          569 non-null float64
symmetry_se                569 non-null float64
fractal_dimension_se       569 non-null float64
radius_worst               569 non-null float64
texture_worst              569 non-null float64
perimeter_worst            569 non-null float64
area_worst                 569 non-null float64
smoothness_worst           569 non-null float64
compactness_worst          569 non-null float64
concavity_worst            569 non-null float64
concave points_worst       569 non-null float64
symmetry_worst             569 non-null float64
fractal_dimension_worst    569 non-null float64
dtypes: float64(30), int64(1), object(1)
memory usage: 140.1+ KB
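The info() output above reports 569 non-null entries for every column, so this dataset has no missing values. For data that is less tidy, the same check can be made explicitly as part of data cleansing. The sketch below uses a small made-up frame (toy and its values are hypothetical, chosen to include one missing value and one duplicated row); on the real data the same calls would be applied to breast_cancer.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the real data comes from wisc_bc_data.csv.
toy = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "diagnosis": ["M", "B", "B", "B"],
    "radius_mean": [17.99, np.nan, 12.45, 12.45],
})

# Number of missing (NaN) values in each column.
missing_per_column = toy.isnull().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = toy.duplicated().sum()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

A column with a non-zero missing count would need to be filled or dropped before computing correlations or training a model.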
Breast Cancer Diagnosis Size
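The .size() method of a groupby object returns the number of rows in each group. Grouping by the diagnosis column therefore gives the number of benign (B) and malignant (M) cases. This differs from the .size attribute of a Series or DataFrame, which returns an int representing the total number of elements: the number of rows for a Series, and the number of rows times the number of columns for a DataFrame.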
Input:
print(breast_cancer.groupby('diagnosis').size())
Output:
diagnosis
B    357
M    212
dtype: int64
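The counts above can also be read as proportions, which shows how balanced the two classes are. As a small sketch, the snippet below rebuilds the two counts by hand (diagnosis_counts is an illustrative name) and divides by the total. On the loaded frame, breast_cancer['diagnosis'].value_counts(normalize=True) should give the same shares directly.

```python
import pandas as pd

# Counts copied from the groupby output above.
diagnosis_counts = pd.Series({"B": 357, "M": 212}, name="count")

# Share of each class among all 569 rows.
diagnosis_share = diagnosis_counts / diagnosis_counts.sum()

print(diagnosis_share.round(3))
```

Roughly 63% of the rows are benign and 37% malignant, so the classes are uneven but not severely imbalanced.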
Breast Cancer Data Visualization
The seaborn package offers a count plot, which can be thought of as a histogram across a categorical, instead of quantitative, variable. The count plot shows the count of observations in each categorical bin using bars.
When inputting data, the data can be passed in a variety of formats, including those below:
- A DataFrame in long-form, in which case the x, y, and hue variables will determine how the data is plotted.
- A DataFrame in wide-form, in which case each numeric column will be plotted.
- An array or list of vectors.
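To make the difference between wide-form and long-form concrete, here is a minimal sketch that converts one into the other with pandas melt(). The wide frame and its two rows are made up for illustration, and the feature and value column names are arbitrary choices.

```python
import pandas as pd

# Hypothetical wide-form frame: one column per measurement.
wide = pd.DataFrame({
    "diagnosis": ["M", "B"],
    "radius_mean": [17.99, 13.54],
    "texture_mean": [10.38, 14.36],
})

# Long-form: one row per (diagnosis, feature, value) observation.
long = wide.melt(id_vars="diagnosis", var_name="feature", value_name="value")

print(long)
```

The breast cancer frame is loaded in wide-form, with one column per feature, which is why the plots in this tutorial name the columns to use through x, y, vars and hue.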
Import Packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
Seaborn Diagnosis Countplot
The seaborn package provides a function called color_palette() that returns discrete color palettes. It serves as an interface for generating colors in seaborn, and it also accepts a list of Hex color codes of your choosing. In this example, we will provide the Hex color codes of #FF1493 (deep pink) and #FF69B4 (hot pink).
Input:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='diagnosis', data=breast_cancer, label="Count",
              palette=sns.color_palette(['#FF1493', '#FF69B4']),
              order=breast_cancer['diagnosis'].value_counts().index)

plt.show()
Output:
Pandas Pairwise Correlation
The pandas breast_cancer.corr() method is used to find the pairwise correlation of all numeric columns in the breast cancer dataframe. Correlation describes the strength of the relationship between two variables: a coefficient close to 1 or -1 means the two variables have a high (strong) correlation, while a coefficient close to 0 means a weak one.
We must first drop the ‘id’ column as it provides no value when analyzing the dataset.
Input:
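# The id column is an identifier, not a measurement
breast_cancer = breast_cancer.drop('id', axis=1)

# numeric_only=True skips the text diagnosis column (requires pandas 1.5 or later)
breast_cancer_corr = breast_cancer.corr(numeric_only=True)

print(breast_cancer_corr)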
Output:
                         radius_mean  ...  fractal_dimension_worst
radius_mean                 1.000000  ...                 0.007066
texture_mean                0.323782  ...                 0.119205
perimeter_mean              0.997855  ...                 0.051019
area_mean                   0.987357  ...                 0.003738
smoothness_mean             0.170581  ...                 0.499316
compactness_mean            0.506124  ...                 0.687382
concavity_mean              0.676764  ...                 0.514930
concave points_mean         0.822529  ...                 0.368661
symmetry_mean               0.147741  ...                 0.438413
fractal_dimension_mean     -0.311631  ...                 0.767297
radius_se                   0.679090  ...                 0.049559
texture_se                 -0.097317  ...                -0.045655
perimeter_se                0.674172  ...                 0.085433
area_se                     0.735864  ...                 0.017539
smoothness_se              -0.222600  ...                 0.101480
compactness_se              0.206000  ...                 0.590973
concavity_se                0.194204  ...                 0.439329
concave points_se           0.376169  ...                 0.310655
symmetry_se                -0.104321  ...                 0.078079
fractal_dimension_se       -0.042641  ...                 0.591328
radius_worst                0.969539  ...                 0.093492
texture_worst               0.297008  ...                 0.219122
perimeter_worst             0.965137  ...                 0.138957
area_worst                  0.941082  ...                 0.079647
smoothness_worst            0.119616  ...                 0.617624
compactness_worst           0.413463  ...                 0.810455
concavity_worst             0.526911  ...                 0.686511
concave points_worst        0.744214  ...                 0.511114
symmetry_worst              0.163953  ...                 0.537848
fractal_dimension_worst     0.007066  ...                 1.000000

[30 rows x 30 columns]
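Each entry in the matrix above is a Pearson correlation coefficient, which is the default method of corr(). As a rough sketch of what that number means, the snippet below computes the coefficient by hand for two made-up columns and compares it with the pandas result. The toy frame, its values and the names r_manual and r_pandas are illustrative assumptions, not data from the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for two feature columns.
toy = pd.DataFrame({
    "radius": [10.0, 12.0, 14.0, 17.0, 20.0],
    "perimeter": [65.0, 78.0, 92.0, 110.0, 132.0],
})

# Deviations of each column from its own mean.
x = toy["radius"] - toy["radius"].mean()
y = toy["perimeter"] - toy["perimeter"].mean()

# Pearson r: sum of co-deviations divided by the product of the deviation norms.
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same coefficient as computed by pandas.
r_pandas = toy["radius"].corr(toy["perimeter"])

print(r_manual, r_pandas)
```

The two values should agree up to floating-point rounding, and both always lie between -1 and 1.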
Seaborn Correlation Heatmap Matrix
A heatmap is a two-dimensional graphical representation of data values that are contained in a visualized matrix.
The seaborn Python package allows the creation of heatmaps, which can be tweaked using matplotlib tools. For this example, we will use matplotlib's ListedColormap to customize the colors of the heatmap.
Input:
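plt.figure(figsize=(8, 8))

# Take the tick labels from the correlation matrix itself: breast_cancer.columns
# still includes diagnosis, which would shift every label by one position
sns.heatmap(breast_cancer_corr, cbar=True, annot=False,
            yticklabels=breast_cancer_corr.columns,
            cmap=ListedColormap(['#C71585', '#DB7093', '#FF00FF', '#FF69B4', '#FFB6C1', '#FFC0CB']),
            xticklabels=breast_cancer_corr.columns)

plt.show()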
Output:
Highest Correlation Breast Cancer Features
In order to plot the next few plots, we must be able to analyze which features have the highest correlation.
The breast_cancer_corr data frame has 30 rows and 30 columns. This means that when we unstack and sort the correlations, the output will have 30 × 30 = 900 rows.
breast_cancer_corr.abs()
The abs() method returns the absolute value of every element in the dataframe, so strong negative correlations rank alongside strong positive ones.
high_correlation.unstack()
The unstack() method pivots the columns of the dataframe into rows, returning a Series with one row for each pair of features.
high_correlation_unstack.sort_values(ascending=False)
The sort_values() method sorts the values in ascending or descending order. We print the entries at positions 30 through 34 because the first 30 entries (positions 0 through 29) pair each feature with itself and have a correlation of 1.000000.
Input:
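high_correlation = breast_cancer_corr.abs()

# Flatten the 30 x 30 matrix into a Series indexed by (feature, feature) pairs
high_correlation_unstack = high_correlation.unstack()

high_correlation_sort = high_correlation_unstack.sort_values(ascending=False)

# Skip the 30 self-correlations of 1.0 and show the next five entries
print(high_correlation_sort.iloc[30:35])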
Output:
radius_mean      perimeter_mean     0.997855
perimeter_mean   radius_mean        0.997855
radius_worst     perimeter_worst    0.993708
perimeter_worst  radius_worst       0.993708
radius_mean      area_mean          0.987357
dtype: float64
From the above we can see the highest correlations between features that are not equal to 1.0000. In the next few sections we could use any of the above pairs, but for simplicity we will use radius_worst and perimeter_worst.
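The sorted output lists every pair twice, once in each order, because a correlation matrix is symmetric. One possible way to keep each pair only once, and to drop the self-correlations at the same time, is to keep only the cells above the diagonal before flattening. The sketch below applies that idea to a made-up 3 x 3 matrix (toy_corr and its values are hypothetical). This is an alternative to the slicing used above, not the method of the original tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 3 correlation matrix standing in for breast_cancer_corr.
toy_corr = pd.DataFrame(
    [[1.0, 0.9, -0.2],
     [0.9, 1.0, 0.5],
     [-0.2, 0.5, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# Boolean mask that is True strictly above the diagonal.
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)

# where() turns every other cell into NaN, stack() flattens the matrix
# into a Series, and dropna() removes the masked cells.
unique_pairs = (
    toy_corr.abs()
    .where(upper)
    .stack()
    .dropna()
    .sort_values(ascending=False)
)

print(unique_pairs)
```

For the 30 features in the real matrix this would leave 30 × 29 / 2 = 435 pairs instead of 900 rows.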
Seaborn High Correlation Scatterplot
The seaborn jointplot, with kind="scatter", will be used to plot two high-correlation variables from the Wisconsin breast cancer dataset. The two variables, radius_worst and perimeter_worst, have a high correlation, and the scatterplot shows the strength of that relationship.
Input:
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sns.jointplot(x="radius_worst", y="perimeter_worst", data=breast_cancer,
              kind="scatter", space=0, color="#FF1493", height=5, ratio=3)

plt.show()
Output:
Seaborn High Correlation Pairplot
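The seaborn pairplot visualizes the pairwise relationships between variables in a dataset: each pair of variables gets a scatterplot, and each variable's own distribution is shown on the diagonal. The variables can be continuous, and the hue parameter colors the points by diagnosis.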
Input:
sns.pairplot(breast_cancer, vars=["radius_worst", "perimeter_worst"],
             palette=sns.color_palette(['#FF1493', '#FF69B4']),
             hue='diagnosis', height=3)

plt.show()
Output:
Machine Learning Algorithms for Breast Cancer
Machine learning and data mining go hand-in-hand when working with data. Machine learning algorithms are drawn on by data mining and other tools that make use of big data. Large sets of data cannot easily be processed and understood by human beings because of the sheer quantity of quantitative data.
Multiple algorithms can be applied to the Wisconsin breast cancer dataset to predict a diagnosis of benign or malignant. The machine learning algorithms below will be implemented with the breast cancer dataset in separate tutorials to fully focus on each algorithm.
Python Machine Learning Algorithm in Scope:
R Programming Machine Learning Algorithm in Scope: