Handling Outliers & Missing Values in Python

When analyzing and modeling data, it’s common to encounter outliers and missing values, which can have a significant impact on the accuracy and validity of the results. Properly addressing these issues is crucial to ensure that the analysis or modeling is based on dependable data. Outliers can distort statistical models and negatively affect prediction accuracy, while missing values can lead to biased or inaccurate results by reducing the completeness of a dataset. Correctly handling these issues is necessary to enhance the precision and reliability of the analysis or modeling.

Python pandas is a powerful data manipulation library that offers various tools for handling outliers and missing values. Python pandas offers numerous methods for addressing missing values, such as replacing them with mean, median, or mode values, or removing them altogether. Moreover, pandas provides several statistical functions to identify and handle outliers. Additionally, pandas provides several statistical functions for detecting and handling outliers, such as the interquartile range method and the Z-score method.

In this tasks, I’m using Google Colaboratory (CoLab) as the coding environment. Google Colaboratory is a free Jupyter notebook interactive development environment provided by Google.

Download the dataset to follow along.

Import Library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# standardize / convert to z-score
from sklearn.preprocessing import StandardScaler

# ignore warning
import warnings
warnings.filterwarnings('ignore')

Handling Outliers

Connect Google Drive to a Google Colab Notebook

from google.colab import drive
drive.mount('/content/drive')

Read Data from Google Drive

cust_info = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset_customers/mall_customers_info.csv')
cust_score = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset_customers/mall_customers_score.csv')
customer_data_2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset_customers/customers_data_2.csv')
cust_info.head()
cust_score.head()
customer_data_2.head()

Merge & Concat Data

customer_data_1 = pd.merge(cust_info,cust_score[['CustomerID', 'Spending Score (1-100)']],how='inner')
customer_data_1.head()
customer_data_1.shape
cust_df = pd.concat([customer_data_1, customer_data_2])
cust_df.shape

Outliers Handling with Inter Quartile Range

df_iqr_outliers = pd.DataFrame.copy(cust_df)


df_iqr_outliers.plot(kind='box', rot=45)


Q1 = df_iqr_outliers['Annual_Income'].quantile(.25)
Q3 = df_iqr_outliers['Annual_Income'].quantile(.75)
iqr = Q3 - Q1
up_l = Q3 + 1.5 * iqr
lw_l = Q1 - 1.5 * iqr
print("upper limit: {} & lower limit: {}".format(up_l, lw_l))


Because there is no minus value in the Annual_Income column, filtering process to imputate outliers value (replace oultiers with other value) only using the upper limit value.

outliers_index = np.where(df_iqr_outliers['Annual_Income']>up_l)
print(outliers_index)


df_iqr_outliers.iloc[outliers_index]


df_iqr_outliers['Annual_Income'].iloc[outliers_index] = df_iqr_outliers['Annual_Income'].mean()

df_iqr_outliers.iloc[outliers_index]


df_iqr_outliers.plot(kind='box', rot=45)


Outliers Handling using Z-Score

df_z_score_outliers = pd.DataFrame.copy(cust_df)

df_z_score_outliers.boxplot(rot=45, grid=False)


scaled_df_z_score_outliers_annual_income = StandardScaler().fit_transform(df_z_score_outliers['Annual_Income'].values.reshape(-1,1))


df_z_score_outliers['scaled_Annual_Income'] = scaled_df_z_score_outliers_annual_income
df_z_score_outliers.boxplot(column='scaled_Annual_Income', grid=False)
df_z_score_outliers_outliers_removed = df_z_score_outliers.drop(df_z_score_outliers.index[np.where(df_z_score_outliers['scaled_Annual_Income']>3)])
df_z_score_outliers_outliers_removed = df_z_score_outliers_outliers_removed.drop('scaled_Annual_Income', axis=1)
df_z_score_outliers_outliers_removed.boxplot(rot=45, grid=False)


Handling Missing Value

DataFrame example

data = pd.DataFrame({ "A" : [4, 5, 7, np.nan, np.nan, 5, 8, np.nan, 3],
                       "B" : [100, 150, 130, 140, 180, 115, 155, 120, 105] })

print(data)


Fill missing value using mean (or other specific value)

data_v1 = pd.DataFrame.copy(data)


data_v1[data_v1["A"].isnull() == True]


mean = data_v1["A"][data_v1["A"].isnull() == False].mean()

mean


data_v1["A"][data_v1["A"].isnull() == True] = mean

# or using fillna()
data_v1["A"].fillna(mean, inplace=True)

print(data_v1)

Other value such as median, mode or a specific value can be assign using this method

Fill missing value using pandas.fillna methods

data_v2 = pd.DataFrame.copy(data)

np.where(data_v2['A'].isnull())


data_v2.iloc[np.where(data_v2['A'].isnull())]

Input limit paramater to limit the maximum number of consecutive NaN values to forward/backward fill and inplace=True to fill in-place. Other paramaters can be found on pandas documentation.

print(data_v2)


data_v2.fillna(method="pad")
data_v2.fillna(method="ffill", limit=1)


data_v2.fillna(method="backfill", limit=1)
data_v2.fillna(method="bfill")


Fill missing value using pandas.interpolate

Fill NaN values using interpolate method (read documentation).

data_v3 = pd.DataFrame.copy(data)

print(data_v3)
data_v3.interpolate()


Thank you for taking the time to read this post. I hope that the information and insights shared in this post have been valuable to you and have provided some helpful perspectives on the topic at hand.