Linear Regression
Regression modelling focuses on constructing linear models to predict or explain a continuous target variable, using techniques such as Ordinary Least Squares (OLS). It covers model assumptions, parameter estimation, diagnostic checks, and practical fixes for issues such as multicollinearity and residual non-normality. The aim is a thorough grounding, so that practitioners can interpret relationships and forecast numeric outcomes with rigor.
Multiple Linear Regression (MLR)
Learning Objectives
Explain the statistical model and key assumptions of multiple linear regression, and apply the Ordinary Least Squares (OLS) method for parameter estimation in Python.
Indicative Content
Purpose of MLR: Predict a continuous dependent variable using two or more independent variables (continuous or categorical)
Statistical model: y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε
Assumptions: Linearity in parameters, independence of errors, constant variance, normality of errors
Parameter estimation: Minimizing sum of squared errors via OLS
Partial regression coefficients: Holding all other variables constant
Python implementation (see the sketch below):
Using ols() from statsmodels.formula.api
Retrieving parameter estimates via summary()
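A minimal sketch of this workflow, using a small hypothetical dataset (the column names sales, tv, and price are invented for illustration):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: predict sales from advertising spend and price.
    df = pd.DataFrame({
        "sales": [12.1, 15.3, 14.2, 18.9, 20.4, 17.8],
        "tv":    [1.2, 2.0, 1.8, 3.1, 3.5, 2.9],
        "price": [9.9, 9.5, 9.7, 8.8, 8.5, 9.0],
    })

    # Fit y = b0 + b1*tv + b2*price + e by OLS.
    model = smf.ols("sales ~ tv + price", data=df).fit()

    # Full table: parameter estimates, t-tests, R-squared, F-statistic.
    print(model.summary())

    # Partial regression coefficients: effect of each predictor,
    # holding the other variables constant.
    print(model.params)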
Evaluating Model Fit
Learning Objectives
Conduct global (ANOVA F-test) and individual (t-test) significance testing, and interpret R², adjusted R², and residual analysis to assess overall regression performance.
Indicative Content
Partitioning total variation: Explained vs. unexplained
Global significance testing:
F-test with null hypothesis H₀: β₁ = β₂ = … = βₖ = 0 (no predictor contributes)
Individual significance testing:
t-test with null hypothesis H₀: βⱼ = 0 for a single coefficient
Goodness-of-fit measures:
R² (proportion of explained variation)
Adjusted R² (penalizes extra predictors)
Residual analysis: Identifying patterns or violations of assumptions
Python functions for evaluation (see the sketch below):
summary() from statsmodels
anova_lm() for the global F-test
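A short sketch of these evaluation steps, refitting the hypothetical sales model from the previous sketch (rsquared, rsquared_adj, fvalue, and f_pvalue are standard statsmodels result attributes):

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Refit the hypothetical sales model from the previous sketch.
    df = pd.DataFrame({
        "sales": [12.1, 15.3, 14.2, 18.9, 20.4, 17.8],
        "tv":    [1.2, 2.0, 1.8, 3.1, 3.5, 2.9],
        "price": [9.9, 9.5, 9.7, 8.8, 8.5, 9.0],
    })
    model = smf.ols("sales ~ tv + price", data=df).fit()

    # Goodness of fit: explained variation, penalized for extra predictors.
    print("R-squared:", model.rsquared)
    print("Adjusted R-squared:", model.rsquared_adj)

    # Global F-test: H0: all slope coefficients equal zero.
    print("F-statistic:", model.fvalue, "p-value:", model.f_pvalue)

    # ANOVA table partitioning variation by term.
    print(sm.stats.anova_lm(model, typ=2))

    # Residual analysis: random scatter around zero suggests no violation.
    plt.scatter(model.fittedvalues, model.resid)
    plt.axhline(0, linestyle="--")
    plt.show()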
Handling Categorical Variables in Regression
Learning Objectives
Incorporate categorical predictors into multiple linear regression through dummy coding, and interpret the resulting coefficients accurately.
Indicative Content
Inclusion of categorical variables in MLR (linear in parameters, but variables can be non-numeric)
Dummy coding to transform categories into binary indicators
Interpretation of partial regression coefficients for each category (relative to a base level)
Factor variables recognized in statsmodels with the formula-based approach (see the sketch below)
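A brief sketch of dummy coding via the formula interface, on an invented salary/department dataset; C() marks a column as categorical, and statsmodels handles the dummy coding:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "salary": [40, 55, 62, 48, 70, 66],
        "years":  [2, 5, 8, 3, 10, 9],
        "dept":   ["ops", "ops", "it", "it", "it", "ops"],
    })

    # C() marks dept as categorical; statsmodels dummy-codes it,
    # dropping the base level (here "it", the first alphabetically).
    model = smf.ols("salary ~ years + C(dept)", data=df).fit()

    # The C(dept)[T.ops] coefficient is the expected salary difference
    # for "ops" relative to the base level, holding years constant.
    print(model.params)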
Multicollinearity and Its Impact
Learning Objectives
Detect multicollinearity using the Variance Inflation Factor (VIF), understand its consequences for parameter stability, and apply remedial measures such as ridge regression.
Indicative Content
Definition: High inter-correlation among independent variables leads to unstable estimates
VIF (Variance Inflation Factor):
VIFⱼ = 1 / (1 − Rⱼ²), where Rⱼ² is the R² from regressing predictor j on the remaining predictors
VIF > 5 suggests serious multicollinearity
Consequences: Inflated standard errors, unreliable parameter estimates, potential overfitting
Remedial measures:
Removing highly correlated predictors
Applying ridge regression
Python functions (see the sketch below):
variance_inflation_factor() from statsmodels.stats.outliers_influence
Example usage of dmatrices() from patsy
Ridge regression approach
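A sketch of VIF computation with dmatrices() and variance_inflation_factor(), on synthetic data built to be collinear; the ridge step uses statsmodels' fit_regularized() with L1_wt=0 as one way to obtain a pure L2 penalty (scikit-learn's Ridge would be an alternative):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from patsy import dmatrices
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Synthetic data with two deliberately correlated predictors.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    df = pd.DataFrame({"x1": x1, "x2": x1 + rng.normal(scale=0.1, size=100)})
    df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=100)

    # Build the design matrix with patsy, then compute one VIF per column.
    y, X = dmatrices("y ~ x1 + x2", data=df, return_type="dataframe")
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    print(vif)  # x1/x2 values well above 5 flag serious multicollinearity
                # (the Intercept row can be ignored)

    # One remedial option: ridge via the elastic-net penalty with L1_wt=0,
    # which shrinks the unstable coefficients toward zero.
    ridge = sm.OLS(y, X).fit_regularized(alpha=1.0, L1_wt=0.0)
    print(ridge.params)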
Assumptions of Regression Models
Learning Objectives
Examine normality and homoscedasticity of residuals, identify influential observations (e.g., Cook’s Distance, DFBETAs), and use Python diagnostic tools to validate model assumptions.
Indicative Content
Normality of errors:
Q-Q Plot (should be roughly linear)
Shapiro-Wilk test (null hypothesis: residuals ~ Normal)
Homoscedasticity (constant variance):
Residual vs. predicted plot (random scatter indicates no violation)
Influential observations:
Cook’s Distance (threshold > 1 or 4/n)
DFBETAs measuring change in parameter estimates
Python functions/tests (see the sketch below):
shapiro() from scipy.stats
qqplot() from statsmodels.graphics.gofplots
Influence detection with statsmodels (e.g., influence_plot())
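A sketch of these diagnostics, again refitting the hypothetical sales model; cooks_distance and dfbetas come from the fitted model's get_influence() object:

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from scipy.stats import shapiro
    from statsmodels.graphics.gofplots import qqplot

    df = pd.DataFrame({
        "sales": [12.1, 15.3, 14.2, 18.9, 20.4, 17.8],
        "tv":    [1.2, 2.0, 1.8, 3.1, 3.5, 2.9],
        "price": [9.9, 9.5, 9.7, 8.8, 8.5, 9.0],
    })
    model = smf.ols("sales ~ tv + price", data=df).fit()
    resid = model.resid

    # Normality: Shapiro-Wilk (H0: residuals ~ Normal) and a Q-Q plot.
    stat, p = shapiro(resid)
    print("Shapiro-Wilk p-value:", p)  # small p suggests non-normality
    qqplot(resid, line="s")            # points near the line look normal

    # Influential observations: Cook's Distance and DFBETAs.
    influence = model.get_influence()
    cooks_d, _ = influence.cooks_distance
    print("Above 4/n threshold:", (cooks_d > 4 / len(df)).sum())
    print("DFBETAs:\n", influence.dfbetas)

    sm.graphics.influence_plot(model)  # combined leverage/residual view
    plt.show()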
Predicting New Data and Standardizing Coefficients
Learning Objectives
Generate predictions for new datasets from a fitted model and standardize coefficients to enable meaningful comparisons among predictors.
Indicative Content
Generating predictions:
Matching column names/variables in new data
Handling potential missing values
Confidence intervals for predictions
Standardized coefficients:
Subtract mean and divide by standard deviation for each variable
Compare relative importance of predictors on the same scale
Python implementation details (see the sketch below):
predict() method from statsmodels
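A sketch of prediction and standardization, continuing the hypothetical sales example; get_prediction() is the statsmodels route to interval estimates alongside predict():

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "sales": [12.1, 15.3, 14.2, 18.9, 20.4, 17.8],
        "tv":    [1.2, 2.0, 1.8, 3.1, 3.5, 2.9],
        "price": [9.9, 9.5, 9.7, 8.8, 8.5, 9.0],
    })
    model = smf.ols("sales ~ tv + price", data=df).fit()

    # New data must reuse the training column names; drop or impute
    # missing values before predicting.
    new_data = pd.DataFrame({"tv": [2.5, 3.0], "price": [9.2, 8.9]})
    print(model.predict(new_data))  # point predictions
    print(model.get_prediction(new_data).summary_frame(alpha=0.05))  # with intervals

    # Standardized coefficients: subtract the mean and divide by the
    # standard deviation for every variable, then refit.
    z = (df - df.mean()) / df.std()
    std_model = smf.ols("sales ~ tv + price", data=z).fit()
    print(std_model.params)  # magnitudes now comparable across predictors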
Tools and Methodologies
Tools
statsmodels (e.g., ols(), summary(), anova_lm(), qqplot(), influence_plot(), variance_inflation_factor()) for linear regression, significance testing, and diagnostic analyses
patsy for formula-based design matrices (dmatrices())
scipy.stats (e.g., shapiro()) for normality testing
numpy for numerical operations (e.g., handling arrays, computing algebraic expressions)
Methodologies
Fit multiple linear regression models to continuous outcome variables using Ordinary Least Squares (OLS) and interpret partial regression coefficients
Diagnose model stability and validity by checking assumptions (multicollinearity, homoscedasticity, residual normality) and applying fixes (removing collinear variables, considering ridge regression)
Generate predictions for new data, evaluate model fit (e.g., R², adjusted R²), and standardize coefficients to facilitate comparisons among predictors