Linear regression

Linear regression#

Research question: What is the relationship between fertility and education?

#lets load the dataset for the linear regression model
data (swiss)
head(swiss)
FertilityAgricultureExaminationEducationCatholicInfant.Mortality
Courtelary80.2 17.0 15 12 9.9622.2
Delemont83.1 45.1 6 9 84.8422.2
Franches-Mnt92.5 39.7 5 5 93.4020.2
Moutier85.8 36.5 12 7 33.7720.3
Neuveville76.9 43.5 17 15 5.1620.6
Porrentruy76.1 35.3 9 7 90.5726.6
# Create a histogram with a normal distribution curve
hist(swiss$Fertility, probability = TRUE, col = "lightblue", main = "Histogram with Normal Distribution Curve")
_images/b73209834562ceebc0eb326e14ac310b39897930b1cf8b41a46787014b695bd0.png

According to the plot, it may assume that the outcome is normally distributed.

Now let’s investigate the relationship between the fertility and the education while controlling for Examination, Catholic, Agriculture

# Fit a linear regression model
lm_model <- lm(Fertility ~ Education + Examination + Catholic + Agriculture, data = swiss)

# Summarize the model
summary(lm_model)
Call:
lm(formula = Fertility ~ Education + Examination + Catholic + 
    Agriculture, data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.7813  -6.3308   0.8113   5.7205  15.5569 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 91.05542    6.94881  13.104  < 2e-16 ***
Education   -0.96161    0.19455  -4.943 1.28e-05 ***
Examination -0.26058    0.27411  -0.951  0.34722    
Catholic     0.12442    0.03727   3.339  0.00177 ** 
Agriculture -0.22065    0.07360  -2.998  0.00455 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.736 on 42 degrees of freedom
Multiple R-squared:  0.6498,	Adjusted R-squared:  0.6164 
F-statistic: 19.48 on 4 and 42 DF,  p-value: 3.95e-09
# Normal Q-Q Plot
plot(lm_model, which = 2, col = "blue")
_images/3648e9e25394a519fe76f97afc89f00e8eb2700c66394f63299b4d8aa52d1a5a.png

Logistic regression#

# Define a binary outcome variable based on Fertility rate
swiss$HighFertility <- ifelse(swiss$Fertility > median(swiss$Fertility), 1, 0)
logit_model <- glm(HighFertility ~ Education + Examination + Catholic + Agriculture,
                   data = swiss,
                   family = binomial)
summary(logit_model)
Call:
glm(formula = HighFertility ~ Education + Examination + Catholic + 
    Agriculture, family = binomial, data = swiss)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.07058  -0.37397  -0.01366   0.43406   1.60842  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 14.02121    5.10091   2.749  0.00598 **
Education   -0.15114    0.09363  -1.614  0.10647   
Examination -0.43822    0.17297  -2.533  0.01129 * 
Catholic     0.02535    0.01403   1.807  0.07078 . 
Agriculture -0.12432    0.04825  -2.576  0.00998 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 65.135  on 46  degrees of freedom
Residual deviance: 28.952  on 42  degrees of freedom
AIC: 38.952

Number of Fisher Scoring iterations: 6