Handling Outliers & Missing Values in Python
When analyzing and modeling data, it’s common to encounter outliers and missing values, which can have a significant impact on the accuracy and validity of the results. Properly addressing these issues is crucial to ensure that the analysis or modeling is based on dependable data. Outliers can distort statistical models and negatively affect prediction accuracy, while missing values can lead to biased or inaccurate results by reducing the completeness of a dataset. Correctly handling these issues is necessary to enhance the precision and reliability of the analysis or modeling.
Python pandas is a powerful data manipulation library that offers various tools for handling outliers and missing values. Python pandas offers numerous methods for addressing missing values, such as replacing them with mean, median, or mode values, or removing them altogether. Moreover, pandas provides several statistical functions to identify and handle outliers. Additionally, pandas provides several statistical functions for detecting and handling outliers, such as the interquartile range method and the Z-score method.
In this tasks, I’m using Google Colaboratory (CoLab) as the coding environment. Google Colaboratory is a free Jupyter notebook interactive development environment provided by Google.
Download the dataset to follow along.
Import Library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# standardize / convert to z-score
from sklearn.preprocessing import StandardScaler
# ignore warning
import warnings
warnings.filterwarnings('ignore')
Handling Outliers
Connect Google Drive to a Google Colab Notebook
from google.colab import drive
drive.mount('/content/drive')
Read Data from Google Drive
cust_info = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset_customers/mall_customers_info.csv')
cust_score = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset_customers/mall_customers_score.csv')
customer_data_2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dataset_customers/customers_data_2.csv')
cust_info.head()
cust_score.head()
customer_data_2.head()
Merge & Concat Data
customer_data_1 = pd.merge(cust_info,cust_score[['CustomerID', 'Spending Score (1-100)']],how='inner')
customer_data_1.head()
customer_data_1.shape
cust_df = pd.concat([customer_data_1, customer_data_2])
cust_df.shape
Outliers Handling with Inter Quartile Range
- Duplicate concatenated dataframe.
df_iqr_outliers = pd.DataFrame.copy(cust_df)
- Create box plot of df_iqr_outliers dataframe.
df_iqr_outliers.plot(kind='box', rot=45)
- Calculate inter quartile range to generate the upper limit and the lower limit.
Q1 = df_iqr_outliers['Annual_Income'].quantile(.25)
Q3 = df_iqr_outliers['Annual_Income'].quantile(.75)
iqr = Q3 - Q1
up_l = Q3 + 1.5 * iqr
lw_l = Q1 - 1.5 * iqr
print("upper limit: {} & lower limit: {}".format(up_l, lw_l))
- Using
np.where
to get index of value of Annual_Income column that greater than upper limit.
Because there is no minus value in the Annual_Income column, filtering process to imputate outliers value (replace oultiers with other value) only using the upper limit value.
outliers_index = np.where(df_iqr_outliers['Annual_Income']>up_l)
print(outliers_index)
- Filter dataframe using index from
np.where
.
df_iqr_outliers.iloc[outliers_index]
- Replace outliers with mean of Annual_Income column (without filtering outliers).
df_iqr_outliers['Annual_Income'].iloc[outliers_index] = df_iqr_outliers['Annual_Income'].mean()
df_iqr_outliers.iloc[outliers_index]
- Create box plot using imputated outliers dataframe.
df_iqr_outliers.plot(kind='box', rot=45)
Outliers Handling using Z-Score
- Duplicate concatenated dataframe & create box plot.
df_z_score_outliers = pd.DataFrame.copy(cust_df)
df_z_score_outliers.boxplot(rot=45, grid=False)
- Standardizing (convert to z-score).
scaled_df_z_score_outliers_annual_income = StandardScaler().fit_transform(df_z_score_outliers['Annual_Income'].values.reshape(-1,1))
- Add standardized annual income to df_z_score_outliers dataframe.
df_z_score_outliers['scaled_Annual_Income'] = scaled_df_z_score_outliers_annual_income
df_z_score_outliers.boxplot(column='scaled_Annual_Income', grid=False)
df_z_score_outliers_outliers_removed = df_z_score_outliers.drop(df_z_score_outliers.index[np.where(df_z_score_outliers['scaled_Annual_Income']>3)])
df_z_score_outliers_outliers_removed = df_z_score_outliers_outliers_removed.drop('scaled_Annual_Income', axis=1)
df_z_score_outliers_outliers_removed.boxplot(rot=45, grid=False)
Handling Missing Value
DataFrame example
data = pd.DataFrame({ "A" : [4, 5, 7, np.nan, np.nan, 5, 8, np.nan, 3],
"B" : [100, 150, 130, 140, 180, 115, 155, 120, 105] })
print(data)
Fill missing value using mean (or other specific value)
- Duplicate data DataFrame.
data_v1 = pd.DataFrame.copy(data)
- Filter null value using isnull() == True.
data_v1[data_v1["A"].isnull() == True]
- Calculate mean from existing data in column A that doesn’t have null value.
mean = data_v1["A"][data_v1["A"].isnull() == False].mean()
mean
- Assign mean as the replacement of missing values in column A.
data_v1["A"][data_v1["A"].isnull() == True] = mean
# or using fillna()
data_v1["A"].fillna(mean, inplace=True)
print(data_v1)
Other value such as median, mode or a specific value can be assign using this method
Fill missing value using pandas.fillna
methods
- Duplicate data DataFrame & using numpy.where to return the array index of missing values.
data_v2 = pd.DataFrame.copy(data)
np.where(data_v2['A'].isnull())
- Using array index that has been generated from previous step to subset null value.
data_v2.iloc[np.where(data_v2['A'].isnull())]
Input limit paramater to limit the maximum number of consecutive NaN values to forward/backward fill and inplace=True to fill in-place. Other paramaters can be found on pandas documentation.
print(data_v2)
- Using
fillna
withffill
orpad
.
data_v2.fillna(method="pad")
data_v2.fillna(method="ffill", limit=1)
- Using
fillna
withbackfill
orbfill
.
data_v2.fillna(method="backfill", limit=1)
data_v2.fillna(method="bfill")
Fill missing value using pandas.interpolate
Fill NaN values using interpolate method (read documentation).
data_v3 = pd.DataFrame.copy(data)
print(data_v3)
data_v3.interpolate()
Thank you for taking the time to read this post. I hope that the information and insights shared in this post have been valuable to you and have provided some helpful perspectives on the topic at hand.