In this Python tutorial, we will analyze the Wisconsin breast cancer dataset and make predictions with the k-nearest neighbors (k-NN) machine learning algorithm. The Wisconsin breast cancer dataset can be downloaded from our datasets page.

## K-Nearest Neighbors Algorithm

k-Nearest Neighbors is an example of a classification algorithm: it places a particular data point into a particular category based on the categories of the points closest to it. The features it works with may be quantitative or qualitative. In this algorithm, k is the number of neighboring data points that are consulted when classifying a new point. The operator wants to figure out which category that new point fits into.

In order to do this, the algorithm finds the k training points nearest to the new point, usually by Euclidean distance, and takes a majority vote among their labels. The data points within that neighborhood are what drive the classification, so a different choice of k, and therefore a different set of neighbors, can lead to a different result. k-nearest neighbors is helpful for guiding machine learning and determining relationships while only knowing a limited amount about the situation, since it makes no assumptions about the underlying distribution of the data.
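The voting idea above can be sketched in a few lines of plain Python. The points and labels below are made up for illustration; they are not taken from the dataset used later in this tutorial.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # majority vote among the labels of those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy data: two clusters, labeled 'B' (benign) and 'M' (malignant)
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ['B', 'B', 'B', 'M', 'M', 'M']
print(knn_predict(points, labels, (2, 2), k=3))  # → B
```

The query point (2, 2) sits next to the first cluster, so all three of its nearest neighbors vote 'B'.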

### Artificial Intelligence System

There may not be enough data to determine a regression line or a confident decision tree. An artificial intelligence system can learn with k-nearest neighbors in a number of different ways. In guided (supervised) learning, an individual sets the parameters and labels the neighbors, and the program tries to match those classifications; its success is measured by how often it correctly predicts the class of a new point. In unguided (unsupervised) learning, the system makes its own determinations about how the points should be grouped and what neighborhood should be drawn around each point. Each approach is helpful in different situations where an individual wants to determine the classification for a group of data points.

### K-Nearest Neighbors Predictions

k-Nearest Neighbors can be helpful in a number of different processes and situations. One common use is in understanding natural processes and the behavior of unpredictable subjects, which often do not follow clear, rational rules. No single analysis may show when a person will buy a product or when a bird will drink at one particular watering hole rather than another. This algorithm can help by drawing similarities based on the proximity of a particular data point to a set of known data points. It can make predictions about the behavior of these subjects or classify a new observation into a group.

The results can also be displayed in a basic way. The essence of k-Nearest Neighbors is a data point, its associated parameters, and the other data points of the system, all of which can be plotted on a chart fairly easily. That chart can then be shown to the people involved in understanding, or perhaps paying for, the system. Like a decision tree, k-NN can be visualized in a way that is helpful for people who are not data professionals.
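As a sketch of that idea, two clusters of points can be plotted with matplotlib. The values below are synthetic stand-ins for benign and malignant cases, generated only for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# two made-up clusters standing in for benign and malignant cases
benign = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
malignant = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(20, 2))

fig, ax = plt.subplots()
ax.scatter(benign[:, 0], benign[:, 1], label="benign")
ax.scatter(malignant[:, 0], malignant[:, 1], label="malignant")
ax.legend()
fig.savefig("knn_points.png")
```

Anyone looking at the saved figure can see at a glance which cluster a new point would fall into, which is exactly the intuition k-NN formalizes.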

## K-Nearest Neighbors Model

We will use the Wisconsin breast cancer dataset to build a k-NN model and record its accuracy on the training and testing data for values of k ranging from 1 to 49. The result is a plot with k on the x-axis and accuracy on the y-axis.

### Import Packages and Dataset

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt

breast_cancer = pd.read_csv('wisc_bc_data.csv')
```

### Train and Testing Data

Before we can create the training and test data, we must remove the **id** column. This column provides no value for predicting breast cancer and can be removed with Python's **del** statement.

```python
del breast_cancer['id']

X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.loc[:, breast_cancer.columns != 'diagnosis'],
    breast_cancer['diagnosis'],
    stratify=breast_cancer['diagnosis'],
    random_state=66)

train_accuracy = []
test_accuracy = []
```

### Build the k-NN Model

**Input:**

```python
k = range(1, 50)

for n_neighbors in k:
    # build the model
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    # record training set accuracy
    train_accuracy.append(knn.score(X_train, y_train))
    # record test set accuracy
    test_accuracy.append(knn.score(X_test, y_test))

plt.plot(k, train_accuracy, label="Train Accuracy")
plt.plot(k, test_accuracy, label="Test Accuracy")
plt.title('Breast Cancer Diagnosis k-Nearest Neighbor Accuracy')
plt.ylabel("Accuracy")
plt.xlabel("k")
plt.legend()
plt.show()
```

**Output:**

## K-Nearest Neighbors Classifier Accuracy

From the above plot we can see that k = 3 gives the best balance between training and testing accuracy. We can set the number of neighbors to 3 and validate the k-NN score.
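Rather than reading the best k off the plot by eye, it can also be picked programmatically. A minimal sketch, using a short made-up `test_accuracy` list in place of the one built in the loop above:

```python
# hypothetical test accuracies for k = 1..5 (made-up values for illustration)
test_accuracy = [0.90, 0.91, 0.93, 0.92, 0.91]
k_values = range(1, 6)

# pair each k with its accuracy and take the k with the highest accuracy
best_k = max(zip(k_values, test_accuracy), key=lambda pair: pair[1])[0]
print(best_k)  # → 3
```

With the real `test_accuracy` list from the loop above, the same one-liner would recover the chart's peak.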

### k-NN of 3

**Input:**

```python
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print("k-Nearest Neighbor 3")
print(f"k-Nearest Neighbor classifier on training set: {knn.score(X_train, y_train):.4f}")
print(f"k-Nearest Neighbor classifier on testing set: {knn.score(X_test, y_test):.4f}")
```

**Output:**

```
k-Nearest Neighbor 3
k-Nearest Neighbor classifier on training set: 0.9577
k-Nearest Neighbor classifier on testing set: 0.9161
```

### k-NN of 5

**Input:**

```python
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("k-Nearest Neighbor 5")
print(f"k-Nearest Neighbor classifier on training set: {knn.score(X_train, y_train):.4f}")
print(f"k-Nearest Neighbor classifier on testing set: {knn.score(X_test, y_test):.4f}")
```

**Output:**

```
k-Nearest Neighbor 5
k-Nearest Neighbor classifier on training set: 0.9531
k-Nearest Neighbor classifier on testing set: 0.9021
```

### k-NN of 15

**Input:**

```python
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)

print("k-Nearest Neighbor 15")
print(f"k-Nearest Neighbor classifier on training set: {knn.score(X_train, y_train):.4f}")
print(f"k-Nearest Neighbor classifier on testing set: {knn.score(X_test, y_test):.4f}")
```

**Output:**

```
k-Nearest Neighbor 15
k-Nearest Neighbor classifier on training set: 0.9390
k-Nearest Neighbor classifier on testing set: 0.9091
```

### k-NN of 30

**Input:**

```python
knn = KNeighborsClassifier(n_neighbors=30)
knn.fit(X_train, y_train)

print("k-Nearest Neighbor 30")
print(f"k-Nearest Neighbor classifier on training set: {knn.score(X_train, y_train):.4f}")
print(f"k-Nearest Neighbor classifier on testing set: {knn.score(X_test, y_test):.4f}")
```

**Output:**

```
k-Nearest Neighbor 30
k-Nearest Neighbor classifier on training set: 0.9272
k-Nearest Neighbor classifier on testing set: 0.9091
```

### k-NN of 50

**Input:**

```python
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train, y_train)

print("k-Nearest Neighbor 50")
print(f"k-Nearest Neighbor classifier on training set: {knn.score(X_train, y_train):.4f}")
print(f"k-Nearest Neighbor classifier on testing set: {knn.score(X_test, y_test):.4f}")
```

**Output:**

```
k-Nearest Neighbor 50
k-Nearest Neighbor classifier on training set: 0.9178
k-Nearest Neighbor classifier on testing set: 0.9021
```