Introduction

This data set contains information about crimes and law enforcement agencies in the State of California, and is provided by the FBI via Kaggle Datasets. The data set consists of 8 CSV files, 4 with data about law enforcement officers and 4 with data about crimes. The data is poorly formatted, and only 2 of the CSV files can be loaded as Pandas DataFrames without much effort.

In this notebook, we'll walk through the use of regular expressions and other low-level drudgery to parse this odd assortment of typos, mixed Linux and Windows newline characters, and an all-around tasteless use of commas, then package the processing into functions that magically return nice tidy DataFrames.

(An alternative to using Python is to use command line tools like sed or awk to do the same kind of processing. However, sed is a line-based tool, and as we'll see, there are some issues with the data that require dealing with multiple lines. It's possible to use awk, but mastery of awk requires time and practice. Python is a good alternative.)

In [21]:
# must-haves for data analysis
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib.pyplot import *

# useful for data wrangling
import io, os, re, subprocess

# for sanity
from pprint import pprint
In [22]:
data_files = os.listdir('data/')
pprint(data_files)
['ca_law_enforcement_by_agency.csv',
 'ca_law_enforcement_by_campus.csv',
 'ca_law_enforcement_by_city.csv',
 'ca_law_enforcement_by_county.csv',
 'ca_offenses_by_agency.csv',
 'ca_offenses_by_campus.csv',
 'ca_offenses_by_city.csv',
 'ca_offenses_by_county.csv']

This notebook will demonstrate how to load just one of these files - but each one presents its own challenges. Ultimately, we can abstract away the details of each file into (at least) 8 classes or 8 static methods. This notebook will cover the procedural programming that would go into those methods and classes, and illustrate how the components work. Then they can all be packaged up into a function.

Loading the Data

Start by loading the lines of the file into a list of strings:

In [23]:
filename = data_files[1]
filewpath = "data/"+filename

with open(filewpath) as f:
    lines = f.readlines()

pprint([p.strip() for p in lines[:10]])
['University/College ,Campus,Student\renrollment,Total law\renforcement \remployees,Total\rofficers,Total\rcivilians',
 'Allan Hancock College,,"11,047",10.0,5.0,5.0',
 'California State Polytechnic University,Pomona,"23,966",27.0,19.0,8.0',
 'California State Polytechnic University,San Luis Obispo,"20,186",33.0,17.0,16.0',
 'California State University,Bakersfield,"8,720",21.0,14.0,7.0',
 'California State University,Channel Islands,"5,879",28.0,14.0,14.0',
 'California State University,Chico,"17,287",25.0,14.0,11.0',
 'California State University,Dominguez Hills,"14,687",26.0,20.0,6.0',
 'California State University,East Bay,"14,823",26.0,15.0,11.0',
 'California State University,Fresno,"23,179",31.0,20.0,11.0']

This data is going to be a bit tricky: some school names contain commas without surrounding double quotes, while the enrollment field contains a comma and is surrounded with double quotes. Checking the number of commas:

In [24]:
number_of_commas = [len(re.findall(',',p)) for p in lines]
print np.min(number_of_commas)
print np.max(number_of_commas)
5
6

Each data line has six fields separated by five commas, and the quoted enrollment field usually contains one more comma, for six commas total; lines without that extra comma (such as the header) have five. A school name containing a stray comma would push a line past six, so these counts confirm the quoted enrollment field is the main complication - but we still want exactly six columns per line after parsing.
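To see why that quoted field matters, compare a naive comma split against Python's csv module on one of the lines above:

```python
import csv

line = 'California State University,Bakersfield,"8,720",21.0,14.0,7.0'

naive = line.split(',')            # splits inside the quotes too
proper = next(csv.reader([line]))  # respects the double quotes

print(len(naive))   # 7 pieces - the enrollment field gets broken in two
print(len(proper))  # 6 fields - '8,720' survives intact
```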

We will use regular expressions to put double quotes around school names, to keep the commas in school names from confusing the parser. (Note that this is more efficiently done on the command line with sed or awk, but in this case the fixes needed are relatively minor.)

Once we're finished manipulating the data, we'll pass the resulting list of cleaned strings on to a DataFrame.

Properly Parsing School Names

The data on school names follows a pattern: a string (possibly containing a comma), followed by a comma, a double quote, and a number. The comma-double-quote combination that marks the end of the school name is unique to schools, so we'll use it to delimit the name.

<school name, perhaps including a comma>,"00,000"

We can use the regular expression r'^[^"]{1,},"[0-9]' to match the pattern. This pattern means:

  • ^ - start at the beginning of the line
  • [^"]{1,} - look for one or more occurrences of a non-double-quote character...
  • ," - ...followed by a comma and a double quote
  • [0-9] - and all of that followed by a numerical digit

This will locate the string containing the name of the school. We can wrap each piece in parentheses to form capture groups and make a substitution:

re.sub( r'^([^"]{1,})(,"[0-9])' ,  r'"\1"\2' , my_string )

The first group, same as above, matches one or more characters that aren't a double quote - everything up to the comma-double-quote marker, which contains the first and only double quote we care about finding.

In the replacement string, \1 stands for whatever the first parenthesized group matched and \2 for the second, so the school name comes out wrapped in double quotes.
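To make this concrete, here's the substitution applied to a made-up line with a comma in the school name (the school is hypothetical, but the shape matches the data above):

```python
import re

s = 'University of California, Davis,"34,175",101.0,58.0,43.0'

# Quote the school name by capturing everything before the ,"digit marker
fixed = re.sub(r'^([^"]{1,})(,"[0-9])', r'"\1"\2', s)
print(fixed)
# "University of California, Davis","34,175",101.0,58.0,43.0
```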

While we're at it, we can also add a check to make sure we're ignoring empty lines, which in this file means four or more commas in a row: ,,,,

In [25]:
# Parse each line with a regular expression
newlines = []
for p in lines[1:]:
    if( len(re.findall(',,,,',p))==0):
        #newlines.append(re.sub(r'^([^"]{1,})(,"[0-9])' ,  r'"\1"\2', p))
        newlines.append(p)
        
pprint(newlines[:10])
['Allan Hancock College,,"11,047",10.0,5.0,5.0\n',
 'California State Polytechnic University,Pomona,"23,966",27.0,19.0,8.0\n',
 'California State Polytechnic University,San Luis Obispo,"20,186",33.0,17.0,16.0\n',
 'California State University,Bakersfield,"8,720",21.0,14.0,7.0\n',
 'California State University,Channel Islands,"5,879",28.0,14.0,14.0\n',
 'California State University,Chico,"17,287",25.0,14.0,11.0\n',
 'California State University,Dominguez Hills,"14,687",26.0,20.0,6.0\n',
 'California State University,East Bay,"14,823",26.0,15.0,11.0\n',
 'California State University,Fresno,"23,179",31.0,20.0,11.0\n',
 'California State University,Fullerton,"38,128",37.0,27.0,10.0\n']

Getting a DataFrame

We now have the raw data as a list of strings; a bit more regular expression work will finish the job. Here's what the procedure will look like:

  • Join the list of strings together into one long string
  • Create a StringIO object to turn that string into a stream that can be read.
  • Extract column names from the (badly-mangled and poorly-formatted) header file
  • Pass the StringIO object and properly formatted column names to the Pandas read_csv() method to get a DataFrame.

The end result is a DataFrame that we can use to do a statistical analysis.
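A minimal sketch of those four steps, with made-up rows in the shape of the cleaned campus data (note that in Python 3, io.StringIO takes a str directly; the next cell wraps the string in unicode() for Python 2):

```python
import io
import pandas as pd

# Made-up rows and the column names recovered from the mangled header
rows = ['"Allan Hancock College",,"11,047",10.0,5.0,5.0',
        '"California State University",Chico,"17,287",25.0,14.0,11.0']
cols = ['University/College', 'Campus', 'Student enrollment',
        'Total law enforcement employees', 'Total officers', 'Total civilians']

sio = io.StringIO('\n'.join(rows))  # turn the string into a readable stream
df = pd.read_csv(sio, quotechar='"', names=cols, thousands=',')
print(df['Student enrollment'].tolist())  # [11047, 17287]
```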

In [26]:
one_string = '\n'.join(newlines)
sio = io.StringIO(unicode(one_string))

columnstr = lines[0].strip()

# Get rid of \r stuff
columnstr = re.sub('\r',' ',columnstr)
columnstr = re.sub('\s+',' ',columnstr)

# Fix what can ONLY have been a typo, making this file un-parsable without superhuman regex abilities
#columnstr = re.sub(',Campus','Campus',columnstr)

columns = [pp.strip() for pp in columnstr.split(",")]

df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')
df.head()
Out[26]:
University/College Campus Student enrollment Total law enforcement employees Total officers Total civilians
0 Allan Hancock College NaN 11047 10.0 5.0 5.0
1 California State Polytechnic University Pomona 23966 27.0 19.0 8.0
2 California State Polytechnic University San Luis Obispo 20186 33.0 17.0 16.0
3 California State University Bakersfield 8720 21.0 14.0 7.0
4 California State University Channel Islands 5879 28.0 14.0 14.0

Plotting a Distribution

Now that we've successfully wrangled some of this data into a DataFrame, let's take it for a spin and make sure we're able to visualize the data without any issues.

In [27]:
import seaborn as sns
In [28]:
sns.pairplot(df)
Out[28]:
<seaborn.axisgrid.PairGrid at 0x10b8fce50>

Casting a Column as Numeric

If a column is being treated as a string, as the Student enrollment column is in this case, we can cast it to a floating point number or an integer. We'll extract the Student enrollment column and use the Series.map() method to apply locale.atof(), which converts number strings according to the current locale setting (which tells Python where to expect commas and decimal points, and what to do with them).
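Note that locale.atof() depends on the locale actually configured on the machine (under the default C locale it rejects thousands separators entirely). A locale-free alternative - just a sketch, not what this notebook uses - strips the separators by hand:

```python
def to_number(s):
    """Convert a string like '23,966' (or a bare number) to a float."""
    if isinstance(s, str):
        return float(s.replace(',', ''))
    return float(s)

print(to_number('23,966'))  # 23966.0
print(to_number(8720))      # 8720.0
```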

In [29]:
import locale
from locale import atof
locale.setlocale(locale.LC_NUMERIC, '')
try:
    df['Student enrollment'] = df['Student enrollment'].map(atof)
except AttributeError:
    # Already done
    pass
df.head()
Out[29]:
University/College Campus Student enrollment Total law enforcement employees Total officers Total civilians
0 Allan Hancock College NaN 11047 10.0 5.0 5.0
1 California State Polytechnic University Pomona 23966 27.0 19.0 8.0
2 California State Polytechnic University San Luis Obispo 20186 33.0 17.0 16.0
3 California State University Bakersfield 8720 21.0 14.0 7.0
4 California State University Channel Islands 5879 28.0 14.0 14.0

Now we can plot distributions of the student enrollment variable, which has some large higher order moments:

In [30]:
sns.distplot(df['Student enrollment'],bins=15,kde=False)
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1054b7890>

Divvying Data Using Quantiles

On a somewhat unrelated note - we can use the student enrollment data to divvy up schools into small, medium, and large schools by taking percentiles of student enrollment. If we split the student enrollment distribution at its 33rd and 66th percentiles, we'll bin the schools into three roughly equal-sized groups:

In [31]:
# Divide the schools into three size bins using quantiles
slice1 = np.percentile(df['Student enrollment'],q=33)
slice2 = np.percentile(df['Student enrollment'],q=66)

def school_size(enrollment):
    if enrollment < slice1:
        return 'Small'
    elif enrollment < slice2:
        return 'Medium'
    else:
        return 'Large'

print "Small-Medium Cutoff: %d"%(slice1)
print "Medium-Large Cutoff: %d"%(slice2)

df['Size'] = df['Student enrollment'].map(school_size)
Small-Medium Cutoff: 16079
Medium-Large Cutoff: 29351

Now we can use that to get a conditional pair plot, with colors corresponding to the (terribly bland) labels of "Small", "Medium", and "Large".

In [32]:
sns.pairplot(df,hue="Size")
Out[32]:
<seaborn.axisgrid.PairGrid at 0x10e659fd0>

Functional Data Processing

Of the functionality we implemented above, the most useful to abstract away into an object or a function is the process of turning a file into a DataFrame. Furthermore, each file will likely present parsing and processing challenges unique to it.

This is a task best suited for 8 static methods, for the following reasons:

  • We have a large number of files, and each one has different formatting issues.
  • Each file has different special patterns, possible typos, etc.
  • The only shared information among each file parser is where the data file is located.

Writing the parsing scripts as functions helps separate the task of processing data from the task of analyzing data (hence we don't delve too deeply into the data above - we just test a few plots to make sure the imported DataFrame is sound).
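One lightweight way to tie the eight parsers together - a sketch built around a hypothetical load_all() helper - is a registry mapping a short name to each parsing function, so the analysis code has a single entry point:

```python
def load_all(data_directory, parsers):
    """Run every registered parser against one data directory.

    parsers maps a short name to a function of the data directory;
    returns a dict of name -> parsed result (DataFrames, in this notebook).
    """
    return dict((name, parse(data_directory)) for name, parse in parsers.items())

# Usage with a stand-in parser (the real parsing functions follow below):
parsers = {'by_campus': lambda d: d + '/ca_law_enforcement_by_campus.csv'}
print(load_all('data', parsers))
# {'by_campus': 'data/ca_law_enforcement_by_campus.csv'}
```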

Parsing Law Enforcement Data

We'll start with the first four files, which contain information about law enforcement agencies broken down by agency, campus, city, and county. Each has a roughly similar structure, so the functions share some code, but there are too many file-specific quirks to factor out much more.

In [33]:
def ca_law_enforcement_by_agency(data_directory):
    filename = 'ca_law_enforcement_by_agency.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        content = f.read()

    content = re.sub('\r',' ',content)
    [header,data] = content.split("civilians\"")
    header += "civilians\""
    
    data = data.strip()
    agencies = re.findall('\w+ Agencies', data)
    all_but_agencies = re.split('\w+ Agencies',data)
    del all_but_agencies[0]
    
    newlines = []
    for (a,aba) in zip(agencies,all_but_agencies):
        newlines.append(''.join([a,aba]))
    
    # Combine into one long string, and do more processing
    one_string = '\n'.join(newlines)
    sio = io.StringIO(unicode(one_string))
    
    # Process column names
    columnstr = header.strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df


df1 = ca_law_enforcement_by_agency('data/')
df1.head()
Out[33]:
State/Tribal/Other Agency Unit/Office Total law enforcement employees Total officers Total civilians
0 State Agencies Atascadero State Hospital NaN 139 128 11
1 State Agencies California State Fair NaN 3 3 0
2 State Agencies Coalinga State Hospital NaN 228 210 18
3 State Agencies Department of Parks and Recreation Capital 560 530 30
4 State Agencies Fairview Developmental Center NaN 17 14 3
In [34]:
def ca_law_enforcement_by_campus(data_directory):
    filename = 'ca_law_enforcement_by_campus.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()

    # Process each string in the list
    newlines = []
    for p in lines[1:]:
        if( len(re.findall(',,,,',p))==0):
            #newlines.append(re.sub(r'^([^"]{1,})(,"[0-9])' ,  r'"\1"\2', p))
            newlines.append(p)
            
    # Combine into one long string, and do more processing
    one_string = '\n'.join(newlines)
    sio = io.StringIO(unicode(one_string))

    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub('\r',' ',columnstr)
    columnstr = re.sub('\s+',' ',columnstr)
    columns = columnstr.split(",")

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df


df2 = ca_law_enforcement_by_campus('data/')
df2.head()
Out[34]:
University/College Campus Student enrollment Total law enforcement employees Total officers Total civilians
0 Allan Hancock College NaN 11047 10.0 5.0 5.0
1 California State Polytechnic University Pomona 23966 27.0 19.0 8.0
2 California State Polytechnic University San Luis Obispo 20186 33.0 17.0 16.0
3 California State University Bakersfield 8720 21.0 14.0 7.0
4 California State University Channel Islands 5879 28.0 14.0 14.0
In [35]:
def ca_law_enforcement_by_city(data_directory):
    filename = 'ca_law_enforcement_by_city.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        content = f.read()

    content = re.sub('\r',' ',content)
    [header,data] = content.split("civilians\"")
    header += "civilians\""
    
    data = data.strip()
        
    # Combine into one long string, and do more processing
    one_string = re.sub(r'([0-9]) ([A-Za-z])',r'\1\n\2',data)
    sio = io.StringIO(unicode(one_string))
    
    # Process column names
    columnstr = header.strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df


df3 = ca_law_enforcement_by_city('data/')
df3.head()
Out[35]:
City Population Total law enforcement employees Total officers Total civilians
0 Alameda 78613 112 83 29
1 Albany 19723 30 23 7
2 Alhambra 86175 128 85 43
3 Alturas 2566 6 5 1
4 Anaheim 349471 577 399 178
In [36]:
def ca_law_enforcement_by_county(data_directory):
    filename = 'ca_law_enforcement_by_county.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        content = f.read()

    content = re.sub('\r',' ',content)
    [header,data] = content.split("civilians\"")
    header += "civilians\""
    
    data = data.strip()
        
    # Combine into one long string, and do more processing
    one_string = re.sub(r'([0-9]) ([A-Za-z])',r'\1\n\2',data)
    sio = io.StringIO(unicode(one_string))
    
    # Process column names
    columnstr = header.strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df


df4 = ca_law_enforcement_by_county('data/')
df4.head()
Out[36]:
Metropolitan/Nonmetropolitan County Total law enforcement employees Total officers Total civilians
0 Metropolitan Counties Alameda 1560 978 582
1 Metropolitan Counties Butte 288 101 187
2 Metropolitan Counties Contra Costa 936 610 326
3 Metropolitan Counties El Dorado 349 164 185
4 Metropolitan Counties Fresno 1043 406 637

Parsing Offenses Data

The second set of four data files contain data related to offenses, broken down by agency, campus, city, and county. Once we've finished writing some functions to parse out this data and return it as Pandas DataFrames, we can get down to the business of analyzing the data in concert.

In [37]:
def ca_offenses_by_agency(data_directory):
    filename = 'ca_offenses_by_agency.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()
    
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]
    
    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df

df5 = ca_offenses_by_agency('data/')
df5.head()
Out[37]:
State/Tribal/Other Agency Unit/Office Violent crime Murder and nonnegligent manslaughter Rape (revised definition) Rape (legacy definition) Robbery Aggravated assault Property crime Burglary Larceny-theft Motor vehicle theft Arson
0 State Agencies Atascadero State Hospital4 NaN 321.0 0.0 0.0 NaN 0.0 321.0 4.0 0.0 2.0 2.0 0.0
1 State Agencies California State Fair4 NaN 12.0 0.0 0.0 NaN 0.0 12.0 79.0 21.0 56.0 2.0 1.0
2 State Agencies Coalinga State Hospital NaN 149.0 0.0 0.0 NaN 0.0 149.0 3.0 1.0 2.0 0.0 0.0
3 State Agencies Department of Parks and Recreation Angeles 2.0 0.0 0.0 NaN 1.0 1.0 26.0 8.0 18.0 0.0 0.0
4 State Agencies Department of Parks and Recreation Bay Area 0.0 0.0 0.0 NaN 0.0 0.0 1.0 0.0 0.0 1.0 0.0
In [38]:
def ca_offenses_by_campus(data_directory):
    filename = 'ca_offenses_by_campus.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    
    lines = content.split('\r')
    lines = [l for l in lines if 'Medical Center, Sacramento5' not in l]
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]
    
    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df

df6 = ca_offenses_by_campus('data/')
df6.head()
Out[38]:
University/College Campus Student enrollment Violent crime Murder and nonnegligent manslaughter Rape (revised definition) Rape (legacy definition) Robbery Aggravated assault Property crime Burglary Larceny-theft Motor vehicle theft Arson
0 Allan Hancock College NaN 11047.0 0.0 0.0 0.0 NaN 0.0 0.0 21.0 2.0 18.0 1.0 0.0
1 California State Polytechnic University Pomona 23966.0 6.0 0.0 4.0 NaN 1.0 1.0 173.0 5.0 150.0 18.0 1.0
2 California State Polytechnic University San Luis Obispo4 20186.0 3.0 0.0 2.0 NaN 0.0 1.0 163.0 7.0 149.0 7.0 1.0
3 California State University Bakersfield 8720.0 1.0 0.0 0.0 NaN 0.0 1.0 78.0 12.0 65.0 1.0 0.0
4 California State University Channel Islands 5879.0 2.0 0.0 0.0 NaN 1.0 1.0 62.0 7.0 54.0 1.0 0.0
In [39]:
def ca_offenses_by_city(data_directory):
    filename = 'ca_offenses_by_city.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    
    lines = content.split('\r')
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]
    
    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df

df7 = ca_offenses_by_city('data/')
df7.head()
Out[39]:
City Population Violent crime Murder and nonnegligent manslaughter Rape (revised definition) Rape (legacy definition) Robbery Aggravated assault Property crime Burglary Larceny-theft Motor vehicle theft Arson
0 Adelanto 33005 212 2 14 NaN 48 148 808 434 254 120 24
1 Agoura Hills 20970 15 0 1 NaN 6 8 310 82 217 11 0
2 Alameda 78613 148 2 7 NaN 61 78 1819 228 1245 346 18
3 Albany 19723 34 1 6 NaN 16 11 605 95 447 63 0
4 Alhambra3 86175 168 1 13 NaN 74 80 1929 305 1413 211 6
In [40]:
def ca_offenses_by_county(data_directory):
    filename = 'ca_offenses_by_county.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()
    
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]
    
    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df

df8 = ca_offenses_by_county('data/')
df8.head()
Out[40]:
Metropolitan/Nonmetropolitan County Violent crime Murder and nonnegligent manslaughter Rape (revised definition) Rape(legacy definition) Robbery Aggravated assault Property crime Burglary Larceny-theft Motor vehicle theft Arson
0 Metropolitan Counties Alameda 510.0 6.0 13.0 NaN 177.0 314.0 2077.0 463.0 985.0 629.0 11.0
1 Metropolitan Counties Butte3 155.0 4.0 20.0 NaN 14.0 117.0 1422.0 708.0 693.0 21.0 0.0
2 Metropolitan Counties Contra Costa 426.0 14.0 24.0 NaN 153.0 235.0 2013.0 660.0 1332.0 21.0 18.0
3 Metropolitan Counties El Dorado3 252.0 9.0 46.0 NaN 32.0 165.0 2031.0 577.0 1412.0 42.0 4.0
4 Metropolitan Counties Fresno 962.0 8.0 32.0 NaN 112.0 810.0 3810.0 1398.0 1838.0 574.0 145.0

Summary: Where To From Here

Now that we've got functions to handle the ugly bits of regular expressions, cleanup, and parsing, we can set to work with analysis.

We have many choices of where to go from here with the analysis. I won't even begin to plot distributions, since, with 8 tables of data, we'll be here all day.

This notebook provides functions for obtaining each data file provided in this data set as a cleaned up DataFrame with a single function call, making it easier to combine these 8 DataFrames to explore this rich, multivariate dataset. Now get a move on, and get to the good stuff.