In this R tutorial, we will analyze and visualize medical care expenses. We will use the lm() function to fit a linear regression model to data in R.
Health insurance companies must make money to stay afloat. To do so, they must collect more in yearly premiums than they spend on medical care for their beneficiaries. Insurers therefore invest time and money in developing models that accurately forecast medical expenses for the insured population.
The goal of this tutorial is to analyze patient data to estimate the average medical care expenses for segments of the population. These estimates can be used to create actuarial tables that set the price of patients' yearly premiums higher or lower, depending on the expected treatment costs.
Install and Load Packages
Below are the packages and libraries that we will need to load to complete this tutorial.
Input:
install.packages("C50")
install.packages("gmodels")
install.packages("party")
install.packages("RColorBrewer")
install.packages("psych")
library(C50)
library(gmodels)
library(party)
library(RColorBrewer)
library(psych)
Download and Load the Medical Care Expense Dataset
Since we will be using the medical care expenses dataset, you will need to download it. The dataset is already packaged and available for easy download from the dataset page or directly from here: Medical Care Expenses – insurance.csv
Input:
insurance <- read.csv("insurance.csv", stringsAsFactors = FALSE)
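As a quick optional check that the file loaded correctly, str() can be used to inspect the structure of the data frame. This is a minimal sketch, assuming insurance.csv is in your working directory:

Input:

# Inspect the structure; the dataset should contain 1,338 observations of 7 variables
str(insurance)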
View the Medical Care Expense Dataset
head() function
Now, let's take a look at the data using the head() command.
Input:
head(insurance)
Output:
  age    sex  bmi children smoker    region expenses
1  19 female 27.9        0    yes southwest 16884.92
2  18   male 33.8        1     no southeast  1725.55
3  28   male 33.0        3     no southeast  4449.46
4  33   male 22.7        0     no northwest 21984.47
5  32   male 28.9        0     no northwest  3866.86
6  31 female 25.7        0     no southeast  3756.62
As one can see from the above, the data recorded for each person are age, sex, BMI, number of children, smoker status, region, and expenses.
lm() function
The method we will use to model the insurance data is lm().
Input:
ins_model <- lm(expenses ~ age + sex + smoker, data = insurance)
The above fits a linear regression model to the data in R. The R formula syntax uses the tilde character to describe the model: the dependent variable, expenses, goes to the left of the tilde, while the independent variables go to the right, separated by + signs. Using a . adds all of the independent variables to the right of the tilde, as shown below.
Input:
ins_model <- lm(expenses ~ ., data = insurance)
Histogram of expenses
Creating a histogram of expenses shows that they range from $1,122 to $63,770, but most expenses fall between $0 and $15,000; the code to produce it is sketched below.
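The histogram code itself is not shown above, so here is a minimal sketch using base R graphics, with summary() included to confirm the quoted range:

Input:

# Verify the range of expenses (roughly $1,122 to $63,770)
summary(insurance$expenses)
# Plot the distribution of expenses
hist(insurance$expenses, main = "Histogram of Medical Expenses", xlab = "Expenses")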
Correlation Matrix for age, BMI, children, and expenses
Input:
cor(insurance[c("age", "bmi", "children", "expenses")])
Output:
               age        bmi   children   expenses
age      1.0000000 0.10934101 0.04246900 0.29900819
bmi      0.1093410 1.00000000 0.01264471 0.19857626
children 0.0424690 0.01264471 1.00000000 0.06799823
expenses 0.2990082 0.19857626 0.06799823 1.00000000
Scatterplot Matrix Using pairs()
Input:
pairs(insurance[c("age", "bmi", "children", "expenses")])
Output: a scatterplot matrix of the age, bmi, children, and expenses variables.
Scatterplot Matrix Using pairs.panels()
The scatterplot matrix below uses pairs.panels() from the psych package, which gives a richer view of the data. Above the diagonal, the scatterplots are replaced with correlation values; on the diagonal, a histogram depicts the distribution of values for each feature; and below the diagonal, the scatterplots are presented with additional visual information.
Correlation Ellipse
The oval-shaped object on each scatterplot is a correlation ellipse, which gives a visual indication of correlation strength. The dot at the center of the ellipse marks the point at the mean values of the x-axis and y-axis variables. The more stretched the ellipse, the stronger the correlation between the two variables; a nearly perfect circle, as with bmi and children, indicates a very weak correlation.
Input:
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])
Output: a pairs.panels() scatterplot matrix with correlation values above the diagonal, histograms on the diagonal, and correlation ellipses below.
lm() function
The lm() function will return a regression model object that can be used to make predictions. Interactions between independent variables can be specified using the * operator.
Input:
ins_model <- lm(expenses ~ age + sex + smoker + region, data = insurance)
ins_model <- lm(expenses ~ ., data = insurance)
ins_model
Output:
Call:
lm(formula = expenses ~ ., data = insurance)

Coefficients:
    (Intercept)              age          sexmale              bmi         children
       -11941.6            256.8           -131.4            339.3            475.7
      smokeryes  regionnorthwest  regionsoutheast  regionsouthwest
        23847.5           -352.8          -1035.6           -959.3
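Since lm() returns a fitted model object, it can be passed to predict() to forecast expenses for new observations. Below is a minimal sketch; the new_patient profile is a hypothetical example invented for illustration, not part of the dataset:

Input:

# Hypothetical patient profile; column names must match the model's predictors
new_patient <- data.frame(age = 30, sex = "female", bmi = 27.9,
                          children = 1, smoker = "no", region = "northwest")
# Forecast this patient's yearly medical expenses
predict(ins_model, new_patient)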
Improving Model Performance
The breakdown below, produced by calling summary() on the lm() model object, shows the residuals and coefficients. The residuals section provides summary statistics for the prediction errors, some of which are quite substantial. Fifty percent of the errors fall between the 1Q and 3Q values, meaning half of the predictions were between $2,850.90 over the true value and $1,383.90 under it.
For each estimated regression coefficient, the p-value provides an estimate of the probability that the true coefficient is zero given the value of the estimate. The model we created has several highly significant variables, and they seem to be related to the outcome in logical ways.
The multiple R-squared value, also known as the coefficient of determination, measures how well our model as a whole explains the values of the dependent variable: the closer the value is to 1.0, the better the model explains the data. In the output below, the multiple R-squared is 0.7509 and the adjusted R-squared is 0.7494, which tells us that the model explains roughly 75 percent of the variation in the dependent variable.
Input:
summary(ins_model)
Output:
Call:
lm(formula = expenses ~ ., data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max
-11302.7  -2850.9   -979.6   1383.9  29981.7

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      -11941.6      987.8 -12.089  < 2e-16 ***
age                 256.8       11.9  21.586  < 2e-16 ***
sexmale            -131.3      332.9  -0.395 0.693255
bmi                 339.3       28.6  11.864  < 2e-16 ***
children            475.7      137.8   3.452 0.000574 ***
smokeryes         23847.5      413.1  57.723  < 2e-16 ***
regionnorthwest    -352.8      476.3  -0.741 0.458976
regionsoutheast   -1035.6      478.7  -2.163 0.030685 *
regionsouthwest    -959.3      477.9  -2.007 0.044921 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494
F-statistic: 500.9 on 8 and 1329 DF,  p-value: < 2.2e-16
This raises the next question: how can we improve model performance?
The key difference between regression modeling and other machine learning approaches is that regression typically leaves feature selection and model specification to the user.
Adding non-linear relationships
Below, we will add age2 (the square of age) alongside age to the lm() formula, using the expenses ~ age + age2 form. This allows the model to separate the linear and non-linear impact of age on medical expenses; a sketch of how to create the age2 column follows.
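The age2 variable does not yet exist in the data frame; here is a minimal sketch creating it as the square of age:

Input:

# Create the squared age term used in the improved model
insurance$age2 <- insurance$age^2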
Converting a numeric variable to a binary indicator
The ifelse() function will be used to convert a numeric variable into a binary indicator.
Input:
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
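As a quick optional check, table() can tabulate the new indicator:

Input:

# Count observations below (0) and at or above (1) the BMI 30 threshold
table(insurance$bmi30)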
Putting it all together
From the above model improvements, the regression formula now includes:
- A non-linear term for age (age2)
- A binary indicator for a BMI of 30 or above (bmi30)
- An interaction between obesity and smoking (expenses ~ bmi30*smoker)
Input:
ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex + bmi30*smoker + region, data = insurance)
summary(ins_model2)
Output:
Call:
lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 * smoker + region, data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max
-17297.1  -1656.0  -1262.7   -727.8  24161.6

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       139.0053  1363.1359   0.102 0.918792
age               -32.6181    59.8250  -0.545 0.585690
age2                3.7307     0.7463   4.999 6.54e-07 ***
children          678.6017   105.8855   6.409 2.03e-10 ***
bmi               119.7715    34.2796   3.494 0.000492 ***
sexmale          -496.7690   244.3713  -2.033 0.042267 *
bmi30            -997.9355   422.9607  -2.359 0.018449 *
smokeryes       13404.5952   439.9591  30.468  < 2e-16 ***
regionnorthwest  -279.1661   349.2826  -0.799 0.424285
regionsoutheast  -828.0345   351.6484  -2.355 0.018682 *
regionsouthwest -1222.1619   350.5314  -3.487 0.000505 ***
bmi30:smokeryes 19810.1534   604.6769  32.762  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4445 on 1326 degrees of freedom
Multiple R-squared:  0.8664, Adjusted R-squared:  0.8653
F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16
As one can see from the above output, the adjusted R-squared has improved greatly, from about 0.75 to 0.87; the model now explains roughly 87 percent of the variation in medical treatment costs. The higher-order age2 term is statistically significant, as is the obesity indicator, bmi30. The interaction between obesity and smoking suggests a massive effect: in addition to the increased costs of over $13,404 for smoking alone, obese smokers spend another $19,810 per year. This suggests that smoking exacerbates diseases associated with obesity.
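To see the interaction's magnitude in practice, here is a sketch comparing predictions for a hypothetical obese patient who smokes versus an identical one who does not. The profile values are invented for illustration, and ins_model2 requires the derived age2 and bmi30 columns created earlier:

Input:

# Hypothetical 45-year-old obese patient; only smoker status differs between the two profiles
obese_smoker <- data.frame(age = 45, age2 = 45^2, children = 0, bmi = 32,
                           sex = "male", bmi30 = 1, smoker = "yes", region = "northeast")
obese_nonsmoker <- obese_smoker
obese_nonsmoker$smoker <- "no"
# For a linear model, the difference equals the smokeryes coefficient plus the
# bmi30:smokeryes interaction: 13404.60 + 19810.15, or about $33,215 per year
predict(ins_model2, obese_smoker) - predict(ins_model2, obese_nonsmoker)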