Exploratory Data Analysis of 911 Calls Data

This is the data capstone project from the Udemy course Python for Data Science and Machine Learning by Jose Portilla. It analyzes 911 call data from Kaggle using the Python libraries pandas, matplotlib, and seaborn.

The data contains the following fields:

Data and Setup


Import libraries and set %matplotlib inline

%matplotlib inline is an IPython magic command that enables inline plotting: plots are displayed directly below the cell that contains the plotting commands.
Source: pythonguides.com

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Read data

Read in the csv file as a dataframe called df.

df = pd.read_csv('911.csv')


df DataFrame Summary

Check the head of df.

df.head()

Check info() of df dataframe.

df.info()

Check the number of NA values.
isna() returns a boolean for each cell (True if NA, False otherwise), and sum() then adds up the True values in each column.

df.isna().sum()

Three columns contain NA values (zip, twp, addr).
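As a small illustration of how isna() and sum() combine, here is a sketch on a toy frame. The values below are invented for the example and are not taken from 911.csv; only the column names match the ones mentioned above.

```python
import numpy as np
import pandas as pd

# Toy data (invented values), reusing the column names zip, twp, addr.
toy = pd.DataFrame({
    "zip": [19401.0, np.nan, 19446.0],
    "twp": ["NORRISTOWN", "LANSDALE", None],
    "addr": ["MAIN ST", "BROAD ST", "OAK AVE"],
})

# isna() gives a boolean frame; sum() counts the True values per column.
na_mask = toy.isna()
na_counts = na_mask.sum()
print(na_counts)
```

Because True counts as 1 when summed, the result is the number of missing cells in each column.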

Value counts of zipcodes for 911 calls.

df['zip'].value_counts().sort_values(ascending=False)

Value counts of townships (twp) for 911 calls.

df['twp'].value_counts().sort_values(ascending=False)
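A side note on the two cells above: value_counts() already returns its result sorted in descending order by default, so the extra sort_values(ascending=False) should not change which values come first. A minimal sketch on an invented series (the labels A, B, C are placeholders, not real townships):

```python
import pandas as pd

# Placeholder labels standing in for township names.
twp = pd.Series(["A", "B", "A", "C", "A", "B"])

# value_counts() sorts by count, largest first, by default.
counts = twp.value_counts()

# head(n) then gives the n most frequent values.
top2 = counts.head(2)
print(top2)
```

On the real dataframe, something like df['twp'].value_counts().head(5) would give the five most frequent townships.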

Number of unique values in the title column.

df['title'].nunique()

Creating new features

In the title column, a “Reason/Department” is specified before the title code. These are EMS, Fire, and Traffic. For example, if the title column value is EMS: BACK PAINS/INJURY, the reason column value would be EMS.

The first method is a function that splits each string value in the title column on the ‘:’ character, appends the first element of the result to an initially empty list, and returns that list, which is then assigned as a new column (‘reason’).

def func_split(a):
    b = []
    for i in a:
        b.append(i.split(':')[0])
    return b

df['reason'] = func_split(df['title'])

The second method uses apply() with a custom lambda expression.

df['reason'] = df['title'].apply(lambda x:x.split(':')[0])
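A third option, not used in the original notebook, is pandas' vectorized string accessor. The sketch below compares it with the apply() approach on a few sample titles. Only the first title comes from the text above; the other two are illustrative assumptions.

```python
import pandas as pd

# First title is the example from the text; the others are made up.
titles = pd.Series([
    "EMS: BACK PAINS/INJURY",
    "Fire: GAS-ODOR/LEAK",
    "Traffic: VEHICLE ACCIDENT",
])

# apply() with a lambda, as in the second method.
reason_apply = titles.apply(lambda x: x.split(":")[0])

# Vectorized alternative: split every string, then take element 0.
reason_str = titles.str.split(":").str[0]

print(reason_str.tolist())
```

Both approaches should give the same labels; the .str version simply avoids writing the lambda.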

The most common reason for a 911 call based off of the reason column.

df['reason'].value_counts()

Create a countplot of 911 calls by reason. This plot sets palette='viridis': because seaborn is built on top of matplotlib, matplotlib’s built-in colormaps can be used to change the color style of seaborn plots.

sns.countplot(x='reason', data=df, palette='viridis')

Time information
Check the data type of the objects in the timeStamp column.

print(df['timeStamp'].dtypes)

The timestamps are still strings (objects). Use pd.to_datetime to convert the column from strings to DateTime objects.

df['timeStamp'] = pd.to_datetime(df['timeStamp'])
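If some timestamp strings might be malformed, pd.to_datetime accepts errors="coerce", which turns unparseable entries into NaT instead of raising. This is only a sketch on invented strings; nothing in the text suggests that 911.csv actually contains bad timestamps.

```python
import pandas as pd

# Invented inputs: one well-formed timestamp and one deliberately bad string.
raw = pd.Series(["2015-12-10 17:40:00", "not a timestamp"])

# errors="coerce" replaces values that cannot be parsed with NaT.
converted = pd.to_datetime(raw, errors="coerce")

print(converted.dtype)
print(converted.isna().tolist())
```

After the conversion the column has a datetime64 dtype rather than object.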

Grab specific attributes from a Datetime object by calling them. For example:

time = df['timeStamp'].iloc[0]
time.hour

Jupyter’s tab completion can be used to explore the various attributes. Now that the values in the timeStamp column are actually DateTime objects, use .apply() to create 3 new columns called Hour, Month, and Day of Week.

Notice that day_of_week returns an integer from 0 to 6. Use .map() with the ‘dmap’ dictionary to map these integers to the actual string names of the days:

dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['Hour'] = df['timeStamp'].apply(lambda x:x.hour)
df['Day of Week'] = df['timeStamp'].apply(lambda x:x.day_of_week).map(dmap)
df['Month'] = df['timeStamp'].apply(lambda x:x.month)
df.head()
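The same three columns can also be derived with the .dt accessor instead of apply(). The sketch below uses two invented timestamps and the dmap dictionary from above. It is an alternative, not what the original notebook does.

```python
import pandas as pd

# Two invented timestamps: a Thursday and a Sunday.
ts = pd.Series(pd.to_datetime(["2015-12-10 17:40:00", "2015-12-13 08:05:00"]))

dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}

# .dt exposes datetime attributes for the whole column at once.
hour = ts.dt.hour
day_of_week = ts.dt.dayofweek.map(dmap)
month = ts.dt.month

print(hour.tolist(), day_of_week.tolist(), month.tolist())
```

On large frames the accessor is usually faster than a row-wise lambda, and the resulting values should be identical.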

Create a countplot of the Day of Week column with the hue based on the reason column.

sns.countplot(x='Day of Week', data=df, hue='reason', palette='viridis')
plt.legend(bbox_to_anchor=(1.05, 1),loc=0, borderaxespad=0.)

The same for Month.

sns.countplot(x='Month', data=df, hue='reason', palette='viridis')
plt.legend(bbox_to_anchor=(1.25, 1), borderaxespad=0.)

The plot is missing some months. Fill in this information by plotting it another way, for example with a simple line plot that spans the missing months.
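One way to confirm which months are absent is to count calls per month and reindex against all twelve months. The month values below are invented for the sketch and do not reflect the actual gaps in 911.csv.

```python
import pandas as pd

# Invented month numbers; the real column would be df['Month'].
months = pd.Series([1, 1, 2, 3, 3, 3, 5, 12], name="Month")

# Count per month, then reindex over 1..12 so absent months show up as 0.
per_month = months.value_counts().sort_index()
full = per_month.reindex(range(1, 13), fill_value=0)

missing = full[full == 0].index.tolist()
print(missing)
```

A countplot only draws the categories that occur in the data, which is why months with no rows do not appear on its axis.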

Create a grouped dataframe called ‘byMonth’ and its visualizations
The DataFrame is grouped by the Month column, using the count() method for aggregation.

byMonth = df.groupby('Month').count()
byMonth
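Note that count() counts non-null values per column, so columns with missing data can report fewer calls than actually occurred in a month. A small sketch on an invented frame (the column name other is hypothetical) contrasts count() with size(), which counts rows:

```python
import pandas as pd

# Invented rows; 'other' is a hypothetical fully populated column.
toy = pd.DataFrame({
    "Month": [1, 1, 2],
    "twp": ["A", None, "B"],
    "other": [1, 1, 1],
})

by_count = toy.groupby("Month").count()  # non-null values per column
by_size = toy.groupby("Month").size()    # rows per group

print(by_count)
print(by_size)
```

Since twp has some NA values in the real data, counts taken from that column may be slightly lower than the number of rows per month.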

Create a simple plot of the dataframe indicating the count of calls per month.

byMonth['twp'].plot()

Use seaborn’s lmplot() to create a linear fit on the number of calls per month. lmplot() takes its x and y variables from columns, so the index has to be reset to turn Month back into a column.

byMonth.reset_index(inplace=True)
sns.lmplot(x='Month', y='twp', data=byMonth)
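To see the numbers behind a straight-line fit of this kind, the slope and intercept can be estimated with numpy.polyfit. This is a sketch on invented monthly counts. It approximates the sort of least-squares line lmplot() draws, not seaborn's exact internals.

```python
import numpy as np
import pandas as pd

# Invented monthly counts, indexed by Month like byMonth above.
by_month = pd.DataFrame(
    {"twp": [100, 90, 80, 70, 60]},
    index=pd.Index([1, 2, 3, 4, 5], name="Month"),
)

# reset_index() without inplace returns a new frame with Month as a column.
flat = by_month.reset_index()

# Degree-1 least-squares fit: twp is approximated by slope * Month + intercept.
slope, intercept = np.polyfit(flat["Month"], flat["twp"], deg=1)
print(slope, intercept)
```

A negative slope would indicate that call counts fall over the months covered by the fit.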

Create a new column called ‘Date’ that contains the date from the ‘timeStamp’ column.

df['Date'] = df['timeStamp'].apply(lambda x:x.date())
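The date can also be extracted with the .dt accessor and fed straight into a groupby. A minimal sketch on three invented timestamps:

```python
import pandas as pd

# Invented timestamps: two on one day, one just after midnight on the next.
ts = pd.Series(pd.to_datetime([
    "2015-12-10 17:40:00",
    "2015-12-10 23:59:59",
    "2015-12-11 00:00:01",
]))

# .dt.date drops the time component, leaving datetime.date objects.
dates = ts.dt.date

# Grouping by the date gives the number of calls per calendar day.
calls_per_day = ts.groupby(dates).count()
print(calls_per_day)
```

Timestamps one second apart can land on different dates, which is what makes the per-day grouping meaningful.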

Group by the Date column with the count() aggregate and plot the counts of 911 calls.

plt.figure(figsize=(10,5))
df.groupby('Date')['twp'].count().plot()
plt.tight_layout()

Recreate the above plot as 3 separate plots, each representing one reason for the 911 call.

plt.figure(figsize=(10,5))
df[df['reason']=='Traffic'].groupby('Date')['twp'].count().plot()
plt.title('Traffic')
plt.tight_layout()
plt.figure(figsize=(10,5))
df[df['reason']=='Fire'].groupby('Date')['twp'].count().plot()
plt.title('Fire')
plt.tight_layout()
plt.figure(figsize=(10,5))
df[df['reason']=='EMS'].groupby('Date')['twp'].count().plot()
plt.title('EMS')
plt.tight_layout()
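The three near-identical cells can be condensed by grouping on both Date and reason and unstacking the reason level into columns. The sketch below uses an invented four-row frame and stops short of plotting.

```python
import pandas as pd

# Invented rows standing in for df[['Date', 'reason', 'twp']].
toy = pd.DataFrame({
    "Date": ["2015-12-10", "2015-12-10", "2015-12-10", "2015-12-11"],
    "reason": ["EMS", "EMS", "Fire", "Traffic"],
    "twp": ["A", "B", "C", "D"],
})

# One row per date, one column per reason; absent combinations become 0.
daily = (
    toy.groupby(["Date", "reason"])["twp"]
    .count()
    .unstack(fill_value=0)
)
print(daily)
```

Calling daily.plot(subplots=True, figsize=(10, 8)) on such a frame should draw one panel per reason. Unlike the filtered versions above, days with no calls for a reason appear here as 0 rather than being skipped.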

Create heatmap and clustermap using restructured df dataframe
First the dataframe needs to be restructured so that the columns become the Hours and the index becomes the Day of the Week. There are several ways to do this; here pivot_table() with aggfunc='count' is used.

dayHour = df.pivot_table(index='Day of Week', columns='Hour', values='twp', aggfunc='count')
dayHour.head()
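As one of the other ways to build this table, groupby() followed by unstack() should produce the same counts as pivot_table(). The sketch below checks this on an invented four-row frame.

```python
import pandas as pd

# Invented rows using the same column names as the real dataframe.
toy = pd.DataFrame({
    "Day of Week": ["Mon", "Mon", "Tue", "Mon"],
    "Hour": [0, 0, 1, 1],
    "twp": ["A", "B", "C", "D"],
})

# Approach used in the text.
pivot = toy.pivot_table(index="Day of Week", columns="Hour",
                        values="twp", aggfunc="count")

# Alternative: group on both keys, count, then move Hour into the columns.
unstacked = toy.groupby(["Day of Week", "Hour"])["twp"].count().unstack()

print(pivot)
print(unstacked)
```

In both results, a day and hour combination with no calls shows up as NaN, which the heatmap leaves blank.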

Create a heatmap using the dayHour dataframe.

plt.figure(figsize=(12,6))
sns.heatmap(data=dayHour, cmap='plasma').tick_params(left=False, bottom=False)
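Because the Day of Week labels are strings, pivot_table() sorts the rows alphabetically rather than in calendar order. A possible fix, sketched on invented data, is to reindex the rows with an explicit list of days (the name day_order is introduced here for illustration):

```python
import pandas as pd

# Hypothetical helper list giving the calendar order of the labels.
day_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Invented rows covering only three of the seven days.
toy = pd.DataFrame({
    "Day of Week": ["Wed", "Mon", "Sun", "Mon"],
    "Hour": [9, 9, 9, 10],
    "twp": ["A", "B", "C", "D"],
})

pivot = toy.pivot_table(index="Day of Week", columns="Hour",
                        values="twp", aggfunc="count")

# reindex() puts rows in calendar order; days absent from the data become NaN rows.
ordered = pivot.reindex(day_order)
print(ordered)
```

Passing a frame reordered this way to sns.heatmap() would put Monday at the top and Sunday at the bottom.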

Create a clustermap using the dayHour dataframe.

sns.clustermap(data=dayHour, cmap='plasma', figsize=(8.5,8)).tick_params(right=False, bottom=False)

Repeat these same plots and operations for a dataframe that shows the Month as the column.

dayMonth = df.pivot_table(index='Day of Week', columns='Month', values='twp', aggfunc='count')
dayMonth.head()
plt.figure(figsize=(12,6))
sns.heatmap(data=dayMonth, cmap='plasma').tick_params(left=False, bottom=False)
sns.clustermap(data=dayMonth, cmap='plasma', figsize=(8.5,8)).tick_params(right=False, bottom=False)