This Python tutorial covers diabetes prediction with the Support Vector Machine (SVM) machine learning algorithm. The dataset can be downloaded from the UCI Machine Learning Repository. If you’re not familiar with the diabetes dataset, work through the step-by-step guide in the Diabetes Dataset Analysis tutorial first.
Support Vector Machine Learning Algorithm
Support vector machines are effective in high-dimensional spaces, even when the number of dimensions is greater than the number of samples. The model also uses only a subset of the training points in the decision function (called support vectors), which makes it memory efficient.
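To make the idea of support vectors concrete, here is a minimal sketch. It uses synthetic data from `make_classification` rather than the diabetes data (an assumption made so the example runs on its own), with more features than samples, and counts how many training points the fitted model keeps.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (illustration only, not the diabetes data):
# 40 samples with 100 features, so dimensions outnumber samples.
X, y = make_classification(n_samples=40, n_features=100,
                           n_informative=10, random_state=0)

# A linear kernel keeps the example simple; any SVC kernel exposes support_vectors_.
model = SVC(kernel="linear")
model.fit(X, y)

# support_vectors_ holds the training points used in the decision function.
n_support = model.support_vectors_.shape[0]
print(f"Training samples: {X.shape[0]}, features: {X.shape[1]}")
print(f"Support vectors kept by the model: {n_support}")
```

The number of support vectors can never exceed the number of training samples, and only those points are needed to make predictions.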
Import Packages and Diabetes Data
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the diabetes data; assumes diabetes.csv is in the working directory
diabetes = pd.read_csv('diabetes.csv')
Train and Test Data
# Hold out a test set; stratify keeps the Outcome class balance the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.loc[:, diabetes.columns != 'Outcome'],
    diabetes['Outcome'],
    stratify=diabetes['Outcome'],
    random_state=66)
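The `stratify` argument in the split above asks `train_test_split` to preserve the class proportions in both subsets. A small sketch with hypothetical labels (100 negatives and 50 positives, not the real Outcome column) shows the effect:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical labels: 100 negatives (0) and 50 positives (1).
# The single feature is just the row index, used as a placeholder.
y = [0] * 100 + [1] * 50
X = [[i] for i in range(len(y))]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=66)

# Both subsets should hold roughly one positive for every two negatives.
print("Train class counts:", Counter(y_train))
print("Test class counts:", Counter(y_test))
```

With no `test_size` given, scikit-learn holds out 25% of the rows for testing.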
Build the Support Vector Machine Model
Input:
# Fit an SVM with default settings on the unscaled features
vector = SVC()
vector.fit(X_train, y_train)
print(f"Support vector machine training set accuracy: {vector.score(X_train, y_train):.4f}")
print(f"Support vector machine testing set accuracy: {vector.score(X_test, y_test):.4f}")
Output:
Support vector machine training set accuracy: 1.0000
Support vector machine testing set accuracy: 0.6510
The support vector machine scores 65.10% accuracy on the testing data but 100% on the training set. This gap is an indicator of over-fitting. SVMs require all features to vary on a similar scale, so we can use the MinMaxScaler to re-scale the diabetes data so that all features are on approximately the same scale.
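One way to see why scale matters: the default `SVC` kernel (RBF) depends on the squared Euclidean distance between samples, so a feature with a large numeric range dominates that distance. The sketch below uses two hypothetical patients with made-up values; the column names `Insulin` and `DiabetesPedigreeFunction` are assumed here for illustration and the numbers are not rows from diabetes.csv.

```python
# Hypothetical feature values for two patients (illustrative numbers only).
patient_a = {"Insulin": 90.0, "DiabetesPedigreeFunction": 0.25}
patient_b = {"Insulin": 180.0, "DiabetesPedigreeFunction": 1.25}

# Squared difference each feature contributes to the squared Euclidean distance.
contributions = {
    name: (patient_a[name] - patient_b[name]) ** 2 for name in patient_a
}
total = sum(contributions.values())

for name, value in contributions.items():
    share = value / total
    print(f"{name}: {value:.2f} ({share:.2%} of the squared distance)")
```

Almost all of the distance comes from the large-range feature, so without re-scaling the model effectively ignores the small-range one.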
SVM MinMaxScaler
The MinMaxScaler transforms features by scaling each feature to a given range. The estimator scales and translates each feature individually so that it lies in the given range (0 to 1 by default) on the training set.
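The transformation is x' = (x - min) / (max - min), with min and max taken per feature from the training set. Here is a minimal pure-Python sketch of that arithmetic for the default 0 to 1 range. The helper names `fit_min_max` and `transform_min_max` and the sample values are hypothetical, and the sketch assumes every feature has max > min.

```python
def fit_min_max(rows):
    """Return the per-column minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform_min_max(rows, mins, maxs):
    """Scale each value with (x - min) / (max - min), column by column."""
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Hypothetical two-feature data (illustrative values only).
train = [[80.0, 20.0], [120.0, 30.0], [200.0, 60.0]]
test = [[100.0, 70.0]]

# Learn min and max from the training rows only, then apply them to both sets.
mins, maxs = fit_min_max(train)
train_scaled = transform_min_max(train, mins, maxs)
test_scaled = transform_min_max(test, mins, maxs)

print(train_scaled)
print(test_scaled)
```

Because the range is learned from the training set, a test value beyond the training maximum (70.0 against a maximum of 60.0 here) scales to a number above 1.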
Input:
from sklearn.preprocessing import MinMaxScaler

# Learn each feature's min and max from the training set only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Apply the training-set scaling to the test set (transform only, no refit);
# exact accuracies may differ slightly from the figures reported below
X_test_scaled = scaler.transform(X_test)

vector = SVC()
vector.fit(X_train_scaled, y_train)
print(f"Support vector machine training set accuracy: {vector.score(X_train_scaled, y_train):.4f}")
print(f"Support vector machine testing set accuracy: {vector.score(X_test_scaled, y_test):.4f}")
Output:
Support vector machine training set accuracy: 0.7691
Support vector machine testing set accuracy: 0.7708
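As an optional variation, scaling and classification can be chained with scikit-learn's `make_pipeline`, so the scaler is always fitted on the training data only. This is a sketch on synthetic data (an assumption made so it runs without diabetes.csv), not the method used to produce the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data: 300 samples, 8 features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=66)

# fit() scales with statistics from X_train only; score() reuses that scaling on X_test.
model = make_pipeline(MinMaxScaler(), SVC())
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Pipeline testing set accuracy: {test_accuracy:.4f}")
```

To try it on the diabetes data, replace the synthetic `X` and `y` with the feature columns and the `Outcome` column.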
1 comment
If you are having trouble with UCI's gzip on Windows, you can get the CSV from Kaggle: https://www.kaggle.com/datasets/saurabh00007/diabetescsv