This data set contains information about crimes and law enforcement agencies in the State of California, and is provided by the FBI via Kaggle Datasets. The data set consists of 8 CSV files, 4 with data about law enforcement officers and 4 with data about crimes. The data is poorly formatted, and only 2 of the CSV files can be loaded as Pandas DataFrames without much effort.
In this notebook, we'll be walking through the use of regular expressions and other low-level drudgery to parse this odd assortment of typos, Linux and Windows newline characters, and an all-around tasteless use of commas, and stuff this processing into some functions that will magically return nice tidy DataFrames.
(An alternative to using Python is to use command line tools like sed or awk to do the same kind of processing. However, sed is a line-based tool, and as we'll see, there are some issues with the data that require dealing with multiple lines. It's possible to use awk, but mastery of awk requires time and practice. Python is a good alternative.)
# must for data analysis
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib.pyplot import *
# useful for data wrangling
import io, os, re, subprocess
# for sanity
from pprint import pprint
data_files = os.listdir('data/')
pprint(data_files)
This notebook will demonstrate how to load just one of these files - but each one presents its own challenges. Ultimately, we can abstract away the details of each file into (at least) 8 classes or 8 static methods. This notebook will cover the procedural programming that would go into those methods and classes, and illustrate how the components work. Then each one can be packaged up into its own function.
Start by loading the lines of the file into a list of strings:
filename = data_files[1]
filepath = "data/" + filename
with open(filepath) as f:
    lines = f.readlines()
pprint([p.strip() for p in lines[:10]])
This data is going to be a bit tricky - there are commas inside the name field that are not surrounded by double quotes, and another field containing commas that is surrounded by double quotes. Checking the number of commas:
number_of_commas = [len(re.findall(',',p)) for p in lines]
print(np.min(number_of_commas))
print(np.max(number_of_commas))
One field (the quoted enrollment figure) always contains a comma, so every line has at least 5 commas; schools with a comma in their name have a total of 6. The parser, however, expects a fixed number of columns per line, so those extra name commas have to be dealt with.
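To see why the counts differ, consider two made-up lines in the same shape as this file (the school names and numbers here are invented for illustration):
demo = ['Some State University,"10,000",1,2,3',
        'Cal State University, Somewhere,"10,000",1,2,3']
print([len(re.findall(',', p)) for p in demo])  # [5, 6] - the comma in the name adds one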
We will use regular expressions to put double quotes around school names, to keep the commas in school names from confusing the parser. (Note that this is more efficiently done on the command line with sed or awk, but in this case the fixes needed are relatively minor.)
Finally, when we're finished manipulating the data, we'll pass the resulting list of parsed strings on to a DataFrame.
The data on school names follows a pattern: a string, possibly containing a comma, followed by a comma, a double quote, and a number. The comma-double quote combination that marks the end of the school name is unique, so we'll use that to mark the end of the school name.
<school name, perhaps including a comma>,"00,000"
We can use the regular expression r'^[^"]{1,},"[0-9]' to match the pattern. This pattern means:
^ - start at the beginning of the line
[^"]{1,} - look for one or more occurrences of a non-double-quote character...
," - ...followed by a comma and a double quote...
[0-9] - ...and all of that followed by a numerical digit
This will locate the string containing the name of the school. We can then wrap each piece in parentheses () to capture it as a group, and make a substitution:
re.sub( r'^([^"]{1,})(,"[0-9])' , r'"\1"\2' , my_string )
The first group, same as above, matches one or more characters that aren't a double quote - everything up to the quoted number. In the replacement string, \1 stands for whatever the first parenthesized group matched and \2 for the second, so wrapping \1 in double quotes gives us a properly quoted school name while leaving the rest of the line intact.
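As a quick sanity check, here's the substitution applied to a made-up line (the school name and numbers are invented for illustration):
example = 'University of Somewhere, Anytown,"12,345",6,7'
print(re.sub(r'^([^"]{1,})(,"[0-9])', r'"\1"\2', example))
# prints: "University of Somewhere, Anytown","12,345",6,7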
While we're at it, we can also add a check to make sure we're ignoring empty lines, which in this file means four or more commas in a row: ,,,,
# Parse each line with a regular expression
newlines = []
for p in lines[1:]:
    # Skip empty lines, which show up as runs of commas
    if len(re.findall(',,,,', p)) == 0:
        # Put double quotes around the school name
        newlines.append(re.sub(r'^([^"]{1,})(,"[0-9])', r'"\1"\2', p.strip()))
pprint(newlines[:10])
We now have the raw data as a list of strings, and we can process it with a bit more regular expression magic. Here's what the procedure will look like: join the list into one long string, wrap that string in a StringIO object so Pandas can read it like a file, clean up the header line to get the column names, and pass everything to the read_csv() method to get a DataFrame. The end result is a DataFrame that we can use to do a statistical analysis.
one_string = '\n'.join(newlines)
sio = io.StringIO(unicode(one_string))
columnstr = lines[0].strip()
# Get rid of stray carriage return characters
columnstr = re.sub(r'\r', ' ', columnstr)
columnstr = re.sub(r'\s+', ' ', columnstr)
# Fix what can ONLY have been a typo, making this file un-parsable
# without superhuman regex abilities (uncomment if your copy has it)
#columnstr = re.sub(',Campus','Campus',columnstr)
columns = [pp.strip() for pp in columnstr.split(",")]
df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
df.head()
Now that we've successfully wrangled some of this data into a DataFrame, let's take it for a spin and make sure we're able to visualize the data without any issues.
import seaborn as sns
sns.pairplot(df)
If a column is being treated as a string, as the Student enrollment column is in this case, we can cast it to a floating point number or an integer. We'll extract the Student enrollment column and use the Series.map() method to apply a conversion function that follows the current locale setting for numbers (which tells Python where to expect commas and decimal points, and what to do with them).
import locale
from locale import atof
locale.setlocale(locale.LC_NUMERIC, '')
try:
    df['Student enrollment'] = df['Student enrollment'].map(atof)
except AttributeError:
    # The column has already been converted to numbers
    pass
df.head()
Now we can plot distributions of the student enrollment variable, which has some large higher order moments:
sns.distplot(df['Student enrollment'],bins=15,kde=False)
On a somewhat unrelated note - we can use the student enrollment data to divvy schools up into small, medium, and large schools by taking quantiles of student enrollment. If we split the student enrollment distribution at its 33rd and 66th percentiles, we'll bin the schools into three roughly equal-sized groups:
# Divide the schools into three size bins using quantiles
slice1 = np.percentile(df['Student enrollment'], q=33)
slice2 = np.percentile(df['Student enrollment'], q=66)

def school_size(enrollment):
    if enrollment < slice1:
        return 'Small'
    elif enrollment < slice2:
        return 'Medium'
    else:
        return 'Large'

print("Small-Medium Cutoff: %d" % slice1)
print("Medium-Large Cutoff: %d" % slice2)
df['Size'] = df['Student enrollment'].map(school_size)
Now we can use that to get a conditional pair plot, with colors corresponding to the (terribly bland) labels of "Small", "Medium", and "Large".
sns.pairplot(df,hue="Size")
Of the functionality we implemented above, the most useful to abstract away into an object or a function is the process of turning a file into a DataFrame. Furthermore, each file will likely have its own challenges with parsing and processing that will be unique to it.
This is a task best suited for 8 static methods (or 8 functions): writing the parsing scripts as functions helps to separate the task of processing data from the task of analyzing data (hence the reason we don't delve too deeply into the data above - we just test out a few plots to make sure the DataFrame we imported is sound).
We'll start with the first four files, which contain information about law enforcement agencies broken down by agency, campus, city, and county. Each one has a roughly similar structure, so some code repeats from function to function, but there are too many particulars for heavier code sharing to be worthwhile.
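To give a flavor of what that shared code could look like - this is a sketch only, and the functions below keep the logic inline rather than calling a helper - the header cleanup step is nearly identical in every parser:
def clean_columns(columnstr):
    # Normalize whitespace, drop double quotes, and split a raw header line
    # into a list of column names (hypothetical helper, not used below)
    columnstr = re.sub(r'\s+', ' ', columnstr.strip())
    columnstr = re.sub('"', '', columnstr)
    return [c.strip() for c in columnstr.split(',')]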
def ca_law_enforcement_by_agency(data_directory):
    filename = 'ca_law_enforcement_by_agency.csv'
    # Load the whole file into a single string
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    content = re.sub(r'\r', ' ', content)
    # The header ends with the quoted word "civilians"; split there
    [header, data] = content.split('civilians"')
    header += 'civilians"'
    data = data.strip()
    # Each record starts with '<Word> Agencies'; use that marker to split records
    agencies = re.findall(r'\w+ Agencies', data)
    all_but_agencies = re.split(r'\w+ Agencies', data)
    del all_but_agencies[0]
    newlines = []
    for (a, aba) in zip(agencies, all_but_agencies):
        newlines.append(''.join([a, aba]))
    # Combine into one long string, and do more processing
    one_string = '\n'.join(newlines)
    sio = io.StringIO(unicode(one_string))
    # Process column names
    columnstr = header.strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = columnstr.split(",")
    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
df1 = ca_law_enforcement_by_agency('data/')
df1.head()
def ca_law_enforcement_by_campus(data_directory):
    filename = 'ca_law_enforcement_by_campus.csv'
    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()
    # Process each string in the list
    newlines = []
    for p in lines[1:]:
        # Skip empty lines, which show up as runs of commas
        if len(re.findall(',,,,', p)) == 0:
            # Put double quotes around the school name
            newlines.append(re.sub(r'^([^"]{1,})(,"[0-9])', r'"\1"\2', p.strip()))
    # Combine into one long string, and do more processing
    one_string = '\n'.join(newlines)
    sio = io.StringIO(unicode(one_string))
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub(r'\r', ' ', columnstr)
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columns = columnstr.split(",")
    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
df2 = ca_law_enforcement_by_campus('data/')
df2.head()
def ca_law_enforcement_by_city(data_directory):
    filename = 'ca_law_enforcement_by_city.csv'
    # Load the whole file into a single string
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    content = re.sub(r'\r', ' ', content)
    # The header ends with the quoted word "civilians"; split there
    [header, data] = content.split('civilians"')
    header += 'civilians"'
    data = data.strip()
    # Records run together on one line; restore a newline wherever a
    # trailing digit is followed by the leading letter of the next record
    one_string = re.sub(r'([0-9]) ([A-Za-z])', r'\1\n\2', data)
    sio = io.StringIO(unicode(one_string))
    # Process column names
    columnstr = header.strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = columnstr.split(",")
    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
df3 = ca_law_enforcement_by_city('data/')
df3.head()
def ca_law_enforcement_by_county(data_directory):
    filename = 'ca_law_enforcement_by_county.csv'
    # Load the whole file into a single string
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    content = re.sub(r'\r', ' ', content)
    # The header ends with the quoted word "civilians"; split there
    [header, data] = content.split('civilians"')
    header += 'civilians"'
    data = data.strip()
    # Records run together on one line; restore a newline wherever a
    # trailing digit is followed by the leading letter of the next record
    one_string = re.sub(r'([0-9]) ([A-Za-z])', r'\1\n\2', data)
    sio = io.StringIO(unicode(one_string))
    # Process column names
    columnstr = header.strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = columnstr.split(",")
    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
df4 = ca_law_enforcement_by_county('data/')
df4.head()
The second set of four data files contains data related to offenses, broken down by agency, campus, city, and county. Once we've written functions to parse this data and return it as Pandas DataFrames, we can get down to the business of analyzing the data in concert.
def ca_offenses_by_agency(data_directory):
    filename = 'ca_offenses_by_agency.csv'
    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]
    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
df5 = ca_offenses_by_agency('data/')
df5.head()
def ca_offenses_by_campus(data_directory):
    filename = 'ca_offenses_by_campus.csv'
    # This file uses carriage returns as line endings; split on those
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    lines = content.split('\r')
    # Drop a malformed row whose name carries a stray footnote marker
    lines = [l for l in lines if 'Medical Center, Sacramento5' not in l]
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]
    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
df6 = ca_offenses_by_campus('data/')
df6.head()
def ca_offenses_by_city(data_directory):
    filename = 'ca_offenses_by_city.csv'
    # This file uses carriage returns as line endings; split on those
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    lines = content.split('\r')
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]
    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
df7 = ca_offenses_by_city('data/')
df7.head()
def ca_offenses_by_county(data_directory):
    filename = 'ca_offenses_by_county.csv'
    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]
    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
df8 = ca_offenses_by_county('data/')
df8.head()
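Since all eight parsers share the same signature, we can also wrap them up so the whole collection loads with a single call. (This is a convenience sketch; the load_all function is our own addition, not part of the data set.)
def load_all(data_directory):
    # Run every parser and return a dictionary of DataFrames keyed by function name
    parsers = [ca_law_enforcement_by_agency, ca_law_enforcement_by_campus,
               ca_law_enforcement_by_city, ca_law_enforcement_by_county,
               ca_offenses_by_agency, ca_offenses_by_campus,
               ca_offenses_by_city, ca_offenses_by_county]
    return dict((p.__name__, p(data_directory)) for p in parsers)

dfs = load_all('data/')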
Now that we've got functions to handle the ugly bits of regular expressions, cleanup, and parsing, we can set to work with analysis.
We have many choices of where to go from here with the analysis. I won't even begin to plot distributions, since, with 8 tables of data, we'll be here all day.
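As one small example of combining tables - assuming the two city-level DataFrames share a 'City' column, which you should verify against the parsed headers before relying on it - we can line up law enforcement staffing with offenses for each city:
city_df = pd.merge(df3, df7, on='City')  # 'City' is an assumed column name
city_df.head()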
This notebook provides functions for obtaining each data file provided in this data set as a cleaned up DataFrame with a single function call, making it easier to combine these 8 DataFrames to explore this rich, multivariate dataset. Now get a move on, and get to the good stuff.