Performing Linear Regression Analysis (Ordinary Least Squares) Using Python Statsmodels

The objective of this project is to perform linear regression analysis using the ordinary least squares (OLS) technique with Statsmodels, a Python library, to predict car prices based on the automobile dataset from the UCI Machine Learning Repository, a common dataset for regression analysis. The automobile dataset is from 1985, which is quite old, but it is suitable for the learning purposes of this project. You can find the project’s Jupyter notebook and the dataset (if you want to skip the extraction step) in my GitHub repository.

Import Libraries

import numpy as np
import pandas as pd

import scipy.stats as stats
import statsmodels.api as sm

import seaborn as sns
import matplotlib.pyplot as plt

# download dataset description from the source
from urllib import request 

import warnings
warnings.filterwarnings('ignore')

Data Preprocessing & Exploration

Extract the Dataset

import os
os.makedirs("dataset", exist_ok=True) # make sure the target folder exists

data_desc_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names"
request.urlretrieve(data_desc_url, "dataset/data_desc.names")

with open("dataset/data_desc.names") as data_desc:
    print(data_desc.read())


# URL of the source dataset
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df = pd.read_csv(data_url, header=None) # the file has no header row
df.head()

As shown above, the dataset doesn’t contain a column header. Here is the column list, created manually based on the dataset description.

column = [  'symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
            'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base',
            'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
            'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 
            'peak-rpm', 'city-mpg', 'highway-mpg', 'price' ]

df.columns = column
df.head()

Data Cleaning

print(df.shape)
print(df.isin(['?']).sum())


df.drop(['normalized-losses','symboling'], axis=1, inplace=True)
df.replace('?', np.nan, inplace=True)
df.dropna(inplace=True)
df.head()
df.shape
(193, 24)


After the rows with null values were removed, the dataset consists of 193 rows and 24 columns.

df.info()

df.info() shows that some columns don’t yet have the correct data type.

# to float
to_float = ['bore', 'stroke', 'horsepower', 'peak-rpm', 'price']

for i in to_float:
    df[i] = df[i].astype(float)
    
df.info()

Now the columns have the correct data types, and the data is ready for calculations.

plt.figure(figsize=(12,8))
sns.histplot(df['price'], kde=True) # axes-level histogram with a KDE curve
plt.title('Car Price Distribution')
plt.show()

As shown in the plot above, most car prices are concentrated in the range of 5,000 to 20,000, and the data has a right-skewed distribution.
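To put a number on that skew, pandas can report the sample skewness of the price column directly. This quick check is an addition and wasn’t part of the original notebook; a positive value confirms the right skew.

# sample skewness of price: > 0 means right-skewed
print(df['price'].skew())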

cat_var = df.loc[: , df.columns!='make'].select_dtypes(include=['object'])

plt.figure(figsize=(12,8))

# enumerate(cat_var.columns) yields (index, column name) pairs
for idx, col in enumerate(cat_var.columns):

    # 3 rows, 3 columns, subplot index = enumerate index + 1
    plt.subplot(3, 3, idx + 1)

    # x: the categorical column, y: price
    sns.boxplot(data=df, x=col, y='price', palette="husl")
    plt.xticks(rotation=45)
    plt.tight_layout()

plt.show()

Based on the plots there are some outliers, but keeping them is permissible here: their occurrence is natural, and no identifiable error can be corrected. So, in this step, the outliers won’t be removed.
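As a rough sanity check (an addition, not part of the original pipeline), the 1.5 × IQR rule behind the boxplot whiskers can count how many price values qualify as outliers within each body style:

# count price values beyond the 1.5*IQR whiskers within each body style
q1 = df.groupby('body-style')['price'].transform(lambda s: s.quantile(0.25))
q3 = df.groupby('body-style')['price'].transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
outliers = (df['price'] < q1 - 1.5 * iqr) | (df['price'] > q3 + 1.5 * iqr)
print(outliers.sum(), 'price outliers by the 1.5*IQR rule')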

num_var = df.loc[: , df.columns!='make'].select_dtypes(include=['float','int64'])
num_var.describe() # or df.describe()


plt.figure(figsize=(12,8))
sns.heatmap(num_var.corr(), annot=True, cmap='Blues')
plt.title('Correlation Heatmap of Numerical Variables')
plt.show()

Based on the heatmap, several variables have strong positive or negative correlations with price. In the simple linear regression analysis, engine-size will be used as the predictor. All the categorical and numerical variables explored above will then be used in the multiple linear regression analysis as well.
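To rank the candidate predictors explicitly, the price column of the correlation matrix can be sorted. This small convenience step uses the num_var frame defined above:

# correlation of each numerical variable with price, strongest first
print(num_var.corr()['price'].sort_values(ascending=False))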

Simple Linear Regression

# jointplot is figure-level, so its size is set via height rather than plt.figure()
g = sns.jointplot(data=df, x='engine-size', y='price', kind='reg', height=8)
g.fig.suptitle("Scatter Plot & Histogram of Engine Size vs. Price")
plt.tight_layout()
plt.show()


df_slr = df.copy() # slr: simple linear regression
df_slr = df_slr.sample(frac=1, random_state=101).reset_index(drop=True)

df_slr.head()


x_slr = df_slr['engine-size']
y_slr = df_slr['price']


x_slr = sm.add_constant(x_slr)
x_slr.head()


train_size = int(0.75 * len(x_slr))

x_train_slr = x_slr[:train_size]
y_train_slr = y_slr[:train_size]

x_test_slr = x_slr[train_size:]
y_test_slr = y_slr[train_size:]

x_train_slr.shape, x_test_slr.shape
((144, 2), (49, 2))


lm_slr = sm.OLS(y_train_slr, x_train_slr).fit()

lm_slr.summary()


fig = plt.figure(figsize=(12,8))

fig = sm.graphics.plot_regress_exog(lm_slr, 'engine-size', fig=fig)

plt.show()

The statsmodels plot_regress_exog function allows for viewing regression results against a single regressor, which in this case is engine-size. Four different plots are generated by this function: the fitted and actual values versus the regressor, the residuals versus the regressor, a partial regression plot, and a CCPR (component and component-plus-residual) plot.


sm.graphics.influence_plot(lm_slr)

The influence_plot can be used to gain a deeper understanding of the regression model: it identifies the records in the dataset that exert the most influence on the regression analysis. Influential data points are recognizable by their large circles in the plot; for example, the data point with ID 103 has a significant impact on the regression results.
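The same information is available numerically: the fitted results expose an influence object whose summary frame includes Cook’s distance and leverage for every training observation. A short follow-up sketch using the lm_slr model above:

# numerical influence measures for each training observation
influence = lm_slr.get_influence()
inf_df = influence.summary_frame()

# the largest Cook's distances flag the most influential points
print(inf_df['cooks_d'].sort_values(ascending=False).head())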

y_pred_slr = lm_slr.predict(x_test_slr)

y_pred_slr.head()


fig, ax = plt.subplots(figsize=(12,8))

plt.scatter(x_test_slr['engine-size'], y_test_slr)
plt.plot(x_test_slr['engine-size'], y_pred_slr, color='g')

plt.xlabel('Engine Size')
plt.ylabel('Price')

plt.show()

Here is a scatter plot of the test data (engine size vs. price) together with the fitted line of predicted values, one prediction for each x value in the test set. The line appears to approximate the test data quite well.
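That visual impression can be checked with numbers. Here is a minimal sketch that computes out-of-sample R-squared and RMSE by hand with NumPy; these metrics are an addition and not part of the original notebook:

# out-of-sample fit of the simple model on the held-out test set
ss_res = np.sum((y_test_slr - y_pred_slr) ** 2)
ss_tot = np.sum((y_test_slr - y_test_slr.mean()) ** 2)
r2_test = 1 - ss_res / ss_tot
rmse_test = np.sqrt(np.mean((y_test_slr - y_pred_slr) ** 2))
print(f'test R-squared: {r2_test:.3f}, test RMSE: {rmse_test:.2f}')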


Multiple Linear Regression

df_mlr = df.copy() # mlr: multiple linear regression
df_mlr.columns


cols = ['fuel-type','aspiration','num-of-doors','engine-location']

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for i in cols:
    df_mlr[i] = le.fit_transform(df_mlr[i])

# like DataFrame.head(), but returns randomly sampled rows
df_mlr[cols].sample(5)


df_mlr.drop(['make'], axis=1, inplace=True)
cat_columns = ['body-style', 'engine-type', 'drive-wheels', 'num-of-cylinders', 'fuel-system']
for i in cat_columns:
    # dtype=int keeps the dummies numeric (newer pandas defaults to bool)
    df_mlr = pd.concat([df_mlr.drop(i, axis=1),
                        pd.get_dummies(df_mlr[i],
                                       prefix = i,
                                       prefix_sep = '_',
                                       drop_first = True,
                                       dtype = int)], axis=1)

df_mlr.head()
# get the number of rows and columns
df_mlr.shape 
(193, 39)


# shuffle the rows (sampling with frac=1 returns every row in random order)
df_mlr = df_mlr.sample(frac=1, random_state=101).reset_index(drop=True)

# assign df_mlr into the x variables (exclude price variable)
x_mlr = df_mlr.drop(['price'], axis=1)

# assign df_mlr['price'] as the target
y_mlr = df_mlr['price'] 

# add constant
x_mlr = sm.add_constant(x_mlr) 

x_mlr.head()


train_size = int(0.75 * len(x_mlr))

x_train_mlr = x_mlr[:train_size]
y_train_mlr = y_mlr[:train_size]

x_test_mlr = x_mlr[train_size:]
y_test_mlr = y_mlr[train_size:]


lm_mlr = sm.OLS(y_train_mlr, x_train_mlr).fit()

lm_mlr.summary()
y_pred_mlr = lm_mlr.predict(x_test_mlr)

y_pred_mlr.head()


data_actual_pred = pd.DataFrame({'Actual Value' : y_test_mlr.to_numpy(),
                                 'Predicted Value' : y_pred_mlr})

data_actual_pred.head()


melted_data_actual_pred = pd.melt(data_actual_pred.reset_index(),
                                   id_vars=['index'],
                                   value_vars=['Actual Value', 'Predicted Value'])


plt.figure(figsize=(12,8))
sns.lineplot(data=melted_data_actual_pred, x='index', y='value', hue='variable')

plt.show()

The plot shows that the actual and predicted values track each other closely, which indicates a good model.
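To back the plot with numbers, the same hand-rolled test metrics can be computed for the multiple regression (again an addition, not part of the original notebook):

# out-of-sample error of the multiple regression model
rmse_mlr = np.sqrt(np.mean((y_test_mlr - y_pred_mlr) ** 2))
mae_mlr = np.mean(np.abs(y_test_mlr - y_pred_mlr))
print(f'test RMSE: {rmse_mlr:.2f}, test MAE: {mae_mlr:.2f}')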

Conclusions

Simple Linear Regression

Lastly, according to Chin (1998), an R-squared above 0.67 is categorized as substantial. Therefore, the closer the R-squared value is to 1, the better the fit of the model.
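For reference, the R-squared of each fitted model can be read straight off the statsmodels results objects (assuming lm_slr and lm_mlr from the sections above):

# in-sample goodness of fit reported by statsmodels
print(f'SLR R-squared: {lm_slr.rsquared:.3f}')
print(f'MLR R-squared: {lm_mlr.rsquared:.3f} (adjusted: {lm_mlr.rsquared_adj:.3f})')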

Credits