How to Predict Diabetes Using K-Nearest Neighbors in R
In this article, we are going to predict diabetes using the K-Nearest Neighbors (KNN) algorithm and analyze the diabetes dataset.
The K-Nearest Neighbors (KNN) algorithm is a popular supervised learning classifier frequently used by data scientists and machine learning enthusiasts for model building. The algorithm operates on the principle that data points in close proximity to the query data point tend to have similar labels. In layman's terms, KNN relies on distance to determine the nearest neighbors and classifies the query data point based on the majority class of its neighbors. This makes it a handy tool for data scientists and computer systems to make decisions based on the company a data point keeps.
To understand the workings of KNN, we first need to understand what exactly a classifier is.
A classifier is a machine learning algorithm that is used to categorize data based on their features or attributes. For example, a classifier can identify an email as spam or not spam based on its content, such as the repetition of words that induce a sense of urgency. Classifiers are like helpful assistants for computers that can sort and label things in many different ways, digital detectives that make sense of our data.
The working of the KNN algorithm is very simple and is based on the data points in the query point's surroundings. As the name suggests, the algorithm works by identifying and analysing the nearest k (a numeric constant) neighbors of the fed-in data point. After identifying its neighbors, the algorithm determines the most common label among them. If a new email arrives and is to be predicted as spam or not spam, its nearest neighbors, which are nothing but previously classified emails, are analysed, and based on the labels of the majority of emails in its surroundings, the new email is classified as either spam or not spam.
Broadly, the steps can be broken down as:
Choosing a value of k: k is a numeric constant which represents the number of neighbors to look at before making a prediction, and it is very important to choose a wise value of k.
Neighbor identification: After choosing a value of k, the algorithm starts by identifying and analysing the data points around the test data point.
Majority count: After analysing the nearby data points, the algorithm evaluates which categorical value is in the majority around the test data point.
Final prediction: Based on the majority count, the algorithm finalises a categorical value and gives it as the output for the test data point.
Dataset: Diabetes dataset
It is one of the most common datasets used by machine learning learners to understand how the K-Nearest Neighbors algorithm works. It contains records of patients who developed diabetes and patients who didn't. This dataset is considered a good resource to help us understand how machine learning algorithms work on real-world problems.
The dataset comprises 9 columns (features) with labels like Glucose, BloodPressure, SkinThickness, etc.
All the features in the dataset are:
Pregnancies: number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (binary result: 0 or 1)
For this project, we will need a few packages that will make the whole process easier:
caret – caret (Classification and Regression Training) is a very popular package used by the machine learning community. It provides a variety of functions for training and evaluating machine learning models and for splitting data into training and test sets.
class – The class library contains various machine learning models, of which the K-Nearest Neighbors algorithm is a part; this is the one we are going to use in building this project.
ggplot2 – ggplot2 is a package widely used by machine learning practitioners to plot and visualise their data as graphs and heatmaps to understand the data graphically.
If these packages are not already present, install them by running the following commands in the R console:
install.packages("caret")
install.packages("class")
install.packages("ggplot2")
These commands will install the packages for further use.
Here, two libraries have been imported: caret, to split the dataset into training and testing sets, and class, to bring the K-Nearest Neighbors classifier into our R script.
The libraries have been imported successfully into the project.
Before starting to build the model, it is always good practice to first gain insights about the dataset.
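The import code itself is not shown here; a minimal sketch (also loading ggplot2, which is used later for the heatmap) would be:
# Load the required libraries
library(caret)   # data splitting and confusionMatrix()
library(class)   # knn() classifier
library(ggplot2) # correlation heatmap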
First, let's import the dataset into the R script.
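A sketch of the import and inspection step, assuming the dataset is saved locally as diabetes.csv (the file name is an assumption):
# Read the dataset from a CSV file
diabetes <- read.csv("diabetes.csv")

# Print the first 6 rows and the summary statistics
head(diabetes)
summary(diabetes)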
Output:
  Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age
1           6     148            72            35       0 33.6                    0.627  50
2           1      85            66            29       0 26.6                    0.351  31
3           8     183            64             0       0 23.3                    0.672  32
4           1      89            66            23      94 28.1                    0.167  21
5           0     137            40            35     168 43.1                    2.288  33
6           5     116            74             0       0 25.6                    0.201  30
  Outcome
1       1
2       0
3       1
4       0
5       1
6       0
  Pregnancies        Glucose      BloodPressure    SkinThickness      Insulin
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00   Min.   :  0.0
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00   1st Qu.:  0.0
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00   Median : 30.5
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54   Mean   : 79.8
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00   3rd Qu.:127.2
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00   Max.   :846.0
      BMI        DiabetesPedigreeFunction      Age           Outcome
 Min.   : 0.00   Min.   :0.0780           Min.   :21.00   Min.   :0.000
 1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00   1st Qu.:0.000
 Median :32.00   Median :0.3725           Median :29.00   Median :0.000
 Mean   :31.99   Mean   :0.4719           Mean   :33.24   Mean   :0.349
 3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00   3rd Qu.:1.000
 Max.   :67.10   Max.   :2.4200           Max.   :81.00   Max.   :1.000
First, the dataset is imported into the R script so it can be read.
Next, the head() function in R, by default, returns the first 6 rows of the dataset.
The summary() function in R returns statistical measures like the minimum, quartiles, median and mean for each column.
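The output below is a per-column count of missing values; a sketch of the check that produces it (assuming the data frame is named diabetes):
# Count missing values in every column
colSums(is.na(diabetes))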
Output:
              Pregnancies                   Glucose             BloodPressure
                        0                         0                         0
            SkinThickness                   Insulin                       BMI
                        0                         0                         0
 DiabetesPedigreeFunction                       Age                   Outcome
                        0                         0                         0
To plot a correlation heatmap, first we need to understand what both terms are:
Correlation is a statistical measure used to identify and analyse the relationship between two variables; basically, the strength of this relationship is called the correlation.
A heatmap is a two-dimensional representation of data which shows different values in different shades of colour. In layman's terms, each cell's colour intensity corresponds to the value it represents, making it easier to identify patterns and trends.
Now, a correlation heatmap is a combination of both concepts: in simple words, it is a heatmap that represents values of correlation in different shades of colour to signify the relationship between variables.
Here, we plot such a correlation heatmap for our dataset.
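A sketch of how such a plot could be produced with ggplot2 (the object names and the base-R reshaping via as.data.frame(as.table(...)) are assumptions; the steps mirror the description that follows):
# Correlation matrix over all columns except Outcome
cor_matrix <- cor(diabetes[, -which(names(diabetes) == "Outcome")])

# Reshape the matrix into long format so ggplot2 can draw one tile per pair of variables
cor_data <- as.data.frame(as.table(cor_matrix))
names(cor_data) <- c("Var1", "Var2", "Correlation")

# Heatmap: red for negative, white for no, green for positive correlation;
# tilt the x-axis labels by 45 degrees to avoid overlap
ggplot(cor_data, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", mid = "white", high = "green", midpoint = 0) +
  ggtitle("Correlation Heatmap") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))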
Output:
Here, first we build a correlation matrix using all the columns except Outcome (the target variable) and then reshape it into a form suitable for ggplot.
Next, using the correlation data we plot the heatmap, with one colour for negative values and another for positive values. We then give the plot the title "Correlation Heatmap" and tilt the x-axis labels by 45° to avoid overlap with the other labels, using the theme() function of ggplot2.
As is evident from the plot, red shows negative correlation, white shows no correlation and green shows positive correlation. Different shades show different values of correlation, with a dark shade of green indicating a strong positive correlation and a dark shade of red indicating a strong negative correlation.
It can be concluded that the Age and SkinThickness variables have a slight negative correlation, while the BMI and SkinThickness variables have a positive correlation.
Now, it is possible that some of the features are on very different scales, which might hamper the output and accuracy of our model, so it is important to scale the dataset to make every feature equally relevant for our model.
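A sketch of the scaling step, assuming Outcome is the 9th column and should be left out of the scaling:
# Standardise all predictor columns (exclude the Outcome label)
scaled_features <- scale(diabetes[, -9])

# Preview the first 6 rows of the scaled data
head(scaled_features)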
Output:
     Pregnancies    Glucose BloodPressure SkinThickness    Insulin        BMI
[1,]   0.6395305  0.8477713     0.1495433     0.9066791 -0.6924393  0.2038799
[2,]  -0.8443348 -1.1226647    -0.1604412     0.5305558 -0.6924393 -0.6839762
[3,]   1.2330766  1.9424580    -0.2637694    -1.2873733 -0.6924393 -1.1025370
[4,]  -0.8443348 -0.9975577    -0.1604412     0.1544326  0.1232130 -0.4937213
[5,]  -1.1411079  0.5037269    -1.5037073     0.9066791  0.7653372  1.4088275
[6,]   0.3427574 -0.1530851     0.2528715    -1.2873733 -0.6924393 -0.8108128
     DiabetesPedigreeFunction         Age
[1,]                0.4681869  1.42506672
[2,]               -0.3648230 -0.19054773
[3,]                0.6040037 -0.10551539
[4,]               -0.9201630 -1.04087112
[5,]                5.4813370 -0.02048305
[6,]               -0.8175458 -0.27558007
The data is scaled using the scale() function, and its first 6 rows are printed to gain insights into the scaled dataset.
Before actually starting to build the model, we first need to split the dataset into training and test sets to train and evaluate our model, using the caret package.
Here, the dataset is split into training and test sets, with 85% of the dataset as the train set and 15% as the test set, and these are assigned to train_data and test_data respectively. Then the labels from the Outcome column of the train set and the test set are assigned to their own label vectors after being converted to factors with the levels (0, 1).
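A sketch of this split, under the assumption that the scaled matrix from the previous step is called scaled_features and that the label vectors are named train_labels and test_labels:
set.seed(123)  # for reproducibility (the seed value is an assumption)

# createDataPartition() from caret keeps the class proportions similar in both sets
split_index <- createDataPartition(diabetes$Outcome, p = 0.85, list = FALSE)

train_data <- scaled_features[split_index, ]
test_data  <- scaled_features[-split_index, ]

# Labels come from the unscaled Outcome column, converted to factors (0, 1)
train_labels <- factor(diabetes$Outcome[split_index])
test_labels  <- factor(diabetes$Outcome[-split_index])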
Now we need to build the model using the knn() function from the class package.
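One possible sketch of this step (it assumes the train_data, test_data, train_labels and test_labels objects created above) is a loop over candidate values of k, which would produce output of the form shown below:
for (k in 1:10) {
  # Predict the test labels using the k nearest neighbours
  predictions <- knn(train = train_data, test = test_data, cl = train_labels, k = k)

  # Accuracy comes from caret's confusionMatrix(); the error rate is its complement
  accuracy <- as.numeric(confusionMatrix(predictions, test_labels)$overall["Accuracy"])
  cat("----------------- For k =", k, "-----------------\n")
  cat("Accuracy:", accuracy, "\n")
  cat("The error rate is", (1 - accuracy) * 100, "%\n")
}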
Output:
----------------- For k = 1 -----------------
Accuracy: 0.7130435
The error rate is 28.69565 %
----------------- For k = 2 -----------------
Accuracy: 0.7391304
The error rate is 26.08696 %
----------------- For k = 3 -----------------
Accuracy: 0.7478261
The error rate is 25.21739 %
----------------- For k = 4 -----------------
Accuracy: 0.773913
The error rate is 22.6087 %
----------------- For k = 5 -----------------
Accuracy: 0.7565217
The error rate is 24.34783 %
----------------- For k = 6 -----------------
Accuracy: 0.773913
The error rate is 22.6087 %
----------------- For k = 7 -----------------
Accuracy: 0.7652174
The error rate is 23.47826 %
----------------- For k = 8 -----------------
Accuracy: 0.7565217
The error rate is 24.34783 %
----------------- For k = 9 -----------------
Accuracy: 0.7391304
The error rate is 26.08696 %
----------------- For k = 10 -----------------
Accuracy: 0.7391304
The error rate is 26.08696 %
Here, we have initialized a for loop to traverse all values of k up to 10 and check the accuracy and error rate of the model for each.
We train our model on different values of k, and from the confusion matrix we can read off the accuracy and calculate the error rate.
By doing this, we can find the value of k for which the model is most accurate and has the least error rate.
Now we take the value of k as 4 and again apply the KNN algorithm to our model.
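A sketch of this final fit (again assuming the objects from the split step):
# Train/predict with k = 4 and store the predictions
final_predictions <- knn(train = train_data, test = test_data, cl = train_labels, k = 4)

# Evaluate the predictions against the true test labels
cm <- confusionMatrix(final_predictions, test_labels)
print(cm)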
Output:
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 62 22
         1  6 25

               Accuracy : 0.7565
                 95% CI : (0.6677, 0.8317)
    No Information Rate : 0.5913
    P-Value [Acc > NIR] : 0.0001514

                  Kappa : 0.4683

 Mcnemar's Test P-Value : 0.0045864

            Sensitivity : 0.9118
            Specificity : 0.5319
         Pos Pred Value : 0.7381
         Neg Pred Value : 0.8065
             Prevalence : 0.5913
         Detection Rate : 0.5391
   Detection Prevalence : 0.7304
      Balanced Accuracy : 0.7218

       'Positive' Class : 0
Here, the confusion matrix also gives us the accuracy score of our model.
First, we set the value of k as 4 and proceed with training the model; 4 parameters are given to the knn() function (the training data, the test data, the training labels and the value of k).
The predictions from this model are stored in a variable which is then used to evaluate the confusion matrix. The confusionMatrix() function is given two parameters: the predictions and the actual test labels.
Now we need to find the error rate of our model.
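A sketch of the calculation described below (the object cm is the confusionMatrix() result from the previous step):
# Extract the confusion table and print it
conf_table <- as.table(cm$table)
print("Confusion table:")
print(conf_table)

# Incorrect predictions sit on the off-diagonal: cells [1,2] and [2,1]
misclassified <- conf_table[1, 2] + conf_table[2, 1]

# Error rate = incorrect predictions / total predictions, expressed as a percentage
error_rate <- misclassified / sum(conf_table) * 100
cat("The error rate is", error_rate, "%\n")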
Output:
[1] "Confusion table:"
Prediction 0 1
0 62 22
16 25
The Error Rate is 24.34783 %
First, we fetch the confusion table from the confusionMatrix() result, convert it into a table and print it.
Then we take the sum of the incorrect predictions made by our model (the false positives and false negatives), located at the [1,2] and [2,1] indices respectively of the 2-dimensional matrix.
Then we calculate the error rate by dividing the number of incorrect predictions by the total number of predictions.