M.A. Portfolio Website

Handling Outliers & Missing Values in Python

When analyzing and modeling data, it’s common to encounter outliers and missing values, which can have a significant impact on the accuracy and validity of the results. Properly addressing these issues is crucial to ensure that the analysis or modeling is based on dependable data. Outliers can distort statistical models and negatively affect prediction accuracy, while missing values can lead to biased or inaccurate results by reducing the completeness of a dataset. Correctly handling these issues is necessary to enhance the precision and reliability of the analysis or modeling.

Python pandas is a powerful data manipulation library that offers various tools for handling outliers and missing values. Python pandas offers numerous methods for addressing missing values, such as replacing them with mean, median, or mode values, or removing them altogether. Moreover, pandas provides several statistical functions to identify and handle outliers. Additionally, pandas provides several statistical functions for detecting and handling outliers, such as the interquartile range method and the Z-score method.

In this tasks, I’m using Google Colaboratory (CoLab) as the coding environment. Google Colaboratory is a free Jupyter notebook interactive development environment provided by Google.

Download the dataset to follow along.

Import Library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# standardize / convert to z-score
from sklearn.preprocessing import StandardScaler

# ignore warning
import warnings
warnings.filterwarnings('ignore')

Handling Outliers

Connect Google Drive to a Google Colab Notebook

from google.colab import drive
drive.mount('/content/drive')

Read Data from Google Drive

cust_info = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset_customers/mall_customers_info.csv')
cust_score = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset_customers/mall_customers_score.csv')
customer_data_2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset_customers/customers_data_2.csv')

cust_info.head()

cust_score.head()

customer_data_2.head()

Merge & Concat Data

customer_data_1 = pd.merge(cust_info,cust_score[['CustomerID', 'Spending Score (1-100)']],how='inner')
customer_data_1.head()

customer_data_1.shape

cust_df = pd.concat([customer_data_1, customer_data_2])
cust_df.shape

Outliers Handling with Inter Quartile Range

Duplicate concatenated dataframe.

df_iqr_outliers = pd.DataFrame.copy(cust_df)

Create box plot of df_iqr_outliers dataframe.

df_iqr_outliers.plot(kind='box', rot=45)

Calculate inter quartile range to generate the upper limit and the lower limit.

Q1 = df_iqr_outliers['Annual_Income'].quantile(.25)
Q3 = df_iqr_outliers['Annual_Income'].quantile(.75)
iqr = Q3 - Q1
up_l = Q3 + 1.5 * iqr
lw_l = Q1 - 1.5 * iqr
print("upper limit: {} & lower limit: {}".format(up_l, lw_l))

Using np.where to get index of value of Annual_Income column that greater than upper limit.

Because there is no minus value in the Annual_Income column, filtering process to imputate outliers value (replace oultiers with other value) only using the upper limit value.

outliers_index = np.where(df_iqr_outliers['Annual_Income']>up_l)
print(outliers_index)

Filter dataframe using index from np.where.

df_iqr_outliers.iloc[outliers_index]

Replace outliers with mean of Annual_Income column (without filtering outliers).

df_iqr_outliers['Annual_Income'].iloc[outliers_index] = df_iqr_outliers['Annual_Income'].mean()

df_iqr_outliers.iloc[outliers_index]

Create box plot using imputated outliers dataframe.

df_iqr_outliers.plot(kind='box', rot=45)

Outliers Handling using Z-Score

Duplicate concatenated dataframe & create box plot.

df_z_score_outliers = pd.DataFrame.copy(cust_df)

df_z_score_outliers.boxplot(rot=45, grid=False)

Standardizing (convert to z-score).

scaled_df_z_score_outliers_annual_income = StandardScaler().fit_transform(df_z_score_outliers['Annual_Income'].values.reshape(-1,1))

Add standardized annual income to df_z_score_outliers dataframe.

df_z_score_outliers['scaled_Annual_Income'] = scaled_df_z_score_outliers_annual_income

df_z_score_outliers.boxplot(column='scaled_Annual_Income', grid=False)

df_z_score_outliers_outliers_removed = df_z_score_outliers.drop(df_z_score_outliers.index[np.where(df_z_score_outliers['scaled_Annual_Income']>3)])
df_z_score_outliers_outliers_removed = df_z_score_outliers_outliers_removed.drop('scaled_Annual_Income', axis=1)

df_z_score_outliers_outliers_removed.boxplot(rot=45, grid=False)

Handling Missing Value

DataFrame example

data = pd.DataFrame({ "A" : [4, 5, 7, np.nan, np.nan, 5, 8, np.nan, 3],
                       "B" : [100, 150, 130, 140, 180, 115, 155, 120, 105] })

print(data)

Fill missing value using mean (or other specific value)

Duplicate data DataFrame.

data_v1 = pd.DataFrame.copy(data)

Filter null value using isnull() == True.

data_v1[data_v1["A"].isnull() == True]

Calculate mean from existing data in column A that doesn’t have null value.

mean = data_v1["A"][data_v1["A"].isnull() == False].mean()

mean

Assign mean as the replacement of missing values in column A.

data_v1["A"][data_v1["A"].isnull() == True] = mean

# or using fillna()
data_v1["A"].fillna(mean, inplace=True)

print(data_v1)

Other value such as median, mode or a specific value can be assign using this method

Fill missing value using `pandas.fillna` methods

Duplicate data DataFrame & using numpy.where to return the array index of missing values.

data_v2 = pd.DataFrame.copy(data)

np.where(data_v2['A'].isnull())

Using array index that has been generated from previous step to subset null value.

data_v2.iloc[np.where(data_v2['A'].isnull())]

Input limit paramater to limit the maximum number of consecutive NaN values to forward/backward fill and inplace=True to fill in-place. Other paramaters can be found on pandas documentation.

print(data_v2)

Using fillna with ffill or pad.

data_v2.fillna(method="pad")

data_v2.fillna(method="ffill", limit=1)

Using fillna with backfill or bfill.

data_v2.fillna(method="backfill", limit=1)

data_v2.fillna(method="bfill")

Fill missing value using `pandas.interpolate`

Fill NaN values using interpolate method (read documentation).

data_v3 = pd.DataFrame.copy(data)

print(data_v3)

data_v3.interpolate()

Thank you for taking the time to read this post. I hope that the information and insights shared in this post have been valuable to you and have provided some helpful perspectives on the topic at hand.

Prev Next Top