In this notebook, we'll analyze the FBI's law enforcement and crime statistics for California. The first thing we'll do is define a set of functions to extract the data from the CSV files provided by the FBI via the Kaggle dataset. These functions are also defined in the "Cleaning Up The Crime Scene" notebook; they're copied here without explanation so we can focus on analyzing the data.
We'll use each of these in turn, but it's convenient to define them all up front and get them out of the way.
# essentials for data analysis
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib.pyplot import *
# useful for data wrangling
import io, os, re, subprocess
# for sanity
from pprint import pprint
def ca_law_enforcement_by_agency(data_directory):
    filename = 'ca_law_enforcement_by_agency.csv'

    # Load the file into one long string
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    content = re.sub('\r', ' ', content)

    # Split the header (which ends with 'civilians"') from the data
    [header, data] = content.split('civilians"')
    header += 'civilians"'
    data = data.strip()

    # Each record starts with something like "City Agencies";
    # use that marker to stitch the records back together, one per line
    agencies = re.findall(r'\w+ Agencies', data)
    all_but_agencies = re.split(r'\w+ Agencies', data)
    del all_but_agencies[0]

    newlines = []
    for (a, aba) in zip(agencies, all_but_agencies):
        newlines.append(''.join([a, aba]))

    # Combine into one long string, and do more processing
    one_string = '\n'.join(newlines)
    sio = io.StringIO(one_string)

    # Process column names
    columnstr = header.strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = [s.strip() for s in columnstr.split(',')]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
def ca_law_enforcement_by_campus(data_directory):
    filename = 'ca_law_enforcement_by_campus.csv'

    # Load the file into a list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()

    # Keep only data lines that aren't mostly-empty runs of commas
    newlines = []
    for p in lines[1:]:
        if ',,,,' not in p:
            newlines.append(p)

    # Combine into one long string, and do more processing
    one_string = '\n'.join(newlines)
    sio = io.StringIO(one_string)

    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub('\r', ' ', columnstr)
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columns = [s.strip() for s in columnstr.split(',')]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
def ca_law_enforcement_by_city(data_directory):
    filename = 'ca_law_enforcement_by_city.csv'

    # Load the file into one long string
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    content = re.sub('\r', ' ', content)

    # Split the header (which ends with 'civilians"') from the data
    [header, data] = content.split('civilians"')
    header += 'civilians"'
    data = data.strip()

    # Records run together; a digit followed by a letter marks a
    # record boundary, so insert newlines there
    one_string = re.sub(r'([0-9]) ([A-Za-z])', r'\1\n\2', data)
    sio = io.StringIO(one_string)

    # Process column names
    columnstr = header.strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = [s.strip() for s in columnstr.split(',')]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
def ca_law_enforcement_by_county(data_directory):
    filename = 'ca_law_enforcement_by_county.csv'

    # Load the file into one long string
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    content = re.sub('\r', ' ', content)

    # Split the header (which ends with 'civilians"') from the data
    [header, data] = content.split('civilians"')
    header += 'civilians"'
    data = data.strip()

    # Records run together; a digit followed by a letter marks a
    # record boundary, so insert newlines there
    one_string = re.sub(r'([0-9]) ([A-Za-z])', r'\1\n\2', data)
    sio = io.StringIO(one_string)

    # Process column names
    columnstr = header.strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = [s.strip() for s in columnstr.split(',')]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
def ca_offenses_by_agency(data_directory):
    filename = 'ca_offenses_by_agency.csv'

    # Load the file into a list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()

    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(one_line)

    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = [s.strip() for s in columnstr.split(',')]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
def ca_offenses_by_campus(data_directory):
    filename = 'ca_offenses_by_campus.csv'

    # Load the file and split it on carriage returns
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    lines = content.split('\r')

    # Drop a known-problematic record
    lines = [l for l in lines if 'Medical Center, Sacramento5' not in l]
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(one_line)

    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = [s.strip() for s in columnstr.split(',')]

    # Load the whole thing into Pandas, dropping rows with no school name
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    df = df[pd.notnull(df['University/College'])]
    return df
def ca_offenses_by_city(data_directory):
    filename = 'ca_offenses_by_city.csv'

    # Load the file and split it on carriage returns
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    lines = content.split('\r')

    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(one_line)

    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = [s.strip() for s in columnstr.split(',')]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
def ca_offenses_by_county(data_directory):
    filename = 'ca_offenses_by_county.csv'

    # Load the file into a list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()

    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(one_line)

    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub(r'\s+', ' ', columnstr)
    columnstr = re.sub('"', '', columnstr)
    columns = [s.strip() for s in columnstr.split(',')]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio, quotechar='"', names=columns, thousands=',')
    return df
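With the extraction functions defined, loading any of the tables is a one-liner. As a quick smoke test (a sketch that assumes all eight Kaggle CSVs live in the data/ directory used throughout this notebook), we can load everything and print the shapes:
# Smoke-test all eight loaders
loaders = [ca_law_enforcement_by_agency, ca_law_enforcement_by_campus,
           ca_law_enforcement_by_city, ca_law_enforcement_by_county,
           ca_offenses_by_agency, ca_offenses_by_campus,
           ca_offenses_by_city, ca_offenses_by_county]
for fn in loaders:
    print(fn.__name__, fn('data/').shape)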
Let's begin our analysis with college campuses in California. We can start with the two campus data files - law enforcement and offenses - and combine them into one DataFrame:
df_enforcement = ca_law_enforcement_by_campus('data/')
df_offenses = ca_offenses_by_campus('data/')
df_enforcement.head()
df_offenses.head()
len(df_offenses)
len(df_enforcement)
The fact that the number of rows in each DataFrame matches is a good sign. There are some stray trailing digits at the end of campus names (most likely footnote markers left over from the original report), so we'll use the map function to apply a regular-expression substitution to each value of the column.
# Strip any trailing digit from each campus name
df_offenses['Campus'] = df_offenses['Campus'].map(
    lambda x: re.sub(r'[0-9]$', '', x) if isinstance(x, str) else x)
It is now possible to merge these two DataFrames. Because the first three columns (University/College, Campus, and Student enrollment) should all match, we can merge on those three columns.
df_campus = pd.merge(df_offenses, df_enforcement,
                     on=list(df_enforcement.columns[:3]))
df_campus.head()
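As a quick sanity check on the merge: if the three key columns line up row-for-row, the merged DataFrame should have the same number of rows as each input. The assertion below is just a sketch; a failure here would mean some keys didn't match.
# The inner merge should preserve the row count if all keys matched
assert len(df_campus) == len(df_offenses) == len(df_enforcement)
len(df_campus)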
A summary of the variables in the table:

A few of the input variables are useful directly, but it's also useful to compute some derived quantities that we can use for statistics (think: normalized quantities). For example, the law enforcement statistics benefit from a per capita measure of law enforcement personnel, which enables cross-comparisons between, say, a campus and its surrounding city.

Law enforcement statistics:
- Per Capita Law Enforcement Personnel: total law enforcement employees per enrolled student
- Law Enforcement Civilians Per Officer: total civilians divided by total officers

Crime statistics:
- Aggregate Crime: violent crime plus property crime plus arson
- Per Capita Violent Crime, Per Capita Property Crime, and Per Capita Aggregate Crime: the same counts normalized by student enrollment

Law enforcement/crime combined statistics:
- Aggregate Crime Per Officer, Violent Crime Per Officer, and Property Crime Per Officer: crime counts divided by total officers
df_campus['Per Capita Law Enforcement Personnel'] = (df_campus['Total law enforcement employees'])/(df_campus['Student enrollment'])
df_campus['Law Enforcement Civilians Per Officer'] = (df_campus['Total civilians'])/(df_campus['Total officers'])
df_campus['Aggregate Crime'] = df_campus['Violent crime'] + df_campus['Property crime'] + df_campus['Arson']
df_campus['Per Capita Violent Crime'] = (df_campus['Violent crime'])/(df_campus['Student enrollment'])
df_campus['Per Capita Property Crime'] = (df_campus['Property crime'])/(df_campus['Student enrollment'])
df_campus['Per Capita Aggregate Crime'] = (df_campus['Violent crime'] + df_campus['Property crime'] + df_campus['Arson'])/(df_campus['Student enrollment'])
df_campus['Aggregate Crime Per Officer'] = (df_campus['Aggregate Crime'])/(df_campus['Total officers'])
df_campus['Violent Crime Per Officer'] = (df_campus['Violent crime'])/(df_campus['Total officers'])
df_campus['Property Crime Per Officer'] = (df_campus['Property crime'])/(df_campus['Total officers'])
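Before going further, a quick way to eyeball the new derived columns is to print their summary statistics:
# Summary statistics for a few of the derived quantities
df_campus[['Per Capita Law Enforcement Personnel',
           'Per Capita Aggregate Crime',
           'Aggregate Crime Per Officer']].describe()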
Like cutting a deck of cards, we can cut a DataFrame at particular locations to discretize and bin data.
Suppose we want to cut the DataFrame at different school sizes and specify names for each category. In our last notebook we used the np.percentile() function and specified the quantiles we wanted. That just got us the cut locations; it didn't actually cut or re-categorize the data for us.
This time, let's look at how to do that with Pandas, which offers a much easier way to cut the data up.
# Start with the data we are going to cut up
data = df_campus['Student enrollment']
# Quantile edges: bottom 20%, next 30%, next 30%, top 20%
bins = [0, 0.20, 0.50, 0.80, 1.0]
# Here's what qcut looks like:
pd.qcut(data, bins).head()
group_names = ['Super Tiny', 'Small-ish', 'Large-ish', 'Massive']
df_campus['School size'] = pd.qcut(data, bins, labels=group_names)
#pprint(dir(pd.qcut(data, bins, labels=group_names)))
df_campus['School size'].value_counts().sort_index()
One place to start is to look for known or expected trends. We might expect, going in, that schools in larger, more populated areas will have a higher incidence of crime, and a larger law enforcement presence, than others. We would also expect larger campuses - which tend to have higher concentrations of undergraduates living on campus - to see a higher frequency of crime and to require a heavier law enforcement presence.
But if we start with this known, or expected, relationship, we immediately find outliers.
import seaborn as sns
sns.stripplot(x="School size", y="Aggregate Crime", data=df_campus, jitter=True)
title('Aggregate Crime vs Campus Size')
show()
Note that if we look at the per capita incidence of crime instead, the picture looks very different: larger schools have a lower relative rate of crime. We'll dig into this in a moment.
sns.stripplot(x="School size", y="Per Capita Aggregate Crime", data=df_campus, jitter=True)
title('Per Capita Aggregate Crime vs Campus Size')
show()
The first plot above shows the aggregate crime (violent crime plus property crime plus arson) committed on campuses, versus their size. There seem to be two different groups plotted there - one large group for which aggregate crime rises very slowly with school size, and a smaller group of outliers for which aggregate crime is much higher and rises much more sharply. The trend of aggregate crime versus school size alone doesn't explain enough of the variation in the data.
The one outlier - the very tiny school with very high aggregate crime - is another example of how our initial assumptions and mental models can sometimes be incorrect. The University of California, San Francisco is a small medical school in the heart of San Francisco. The campus is located near Golden Gate Park in a relatively high-crime area, bucking the trend of smaller campuses having lower crime.
Here are the outliers in each category:
cols = ['University/College', 'Campus']

tiny_sorted = df_campus[df_campus['School size'] == 'Super Tiny'] \
    .sort_values('Aggregate Crime', ascending=False)[cols]
print(tiny_sorted.iloc[0])

smallish_sorted = df_campus[df_campus['School size'] == 'Small-ish'] \
    .sort_values('Aggregate Crime', ascending=False)[cols]
print(smallish_sorted.iloc[0:4])

largeish_sorted = df_campus[df_campus['School size'] == 'Large-ish'] \
    .sort_values('Aggregate Crime', ascending=False)[cols]
print(largeish_sorted.iloc[0:4])

massive_sorted = df_campus[df_campus['School size'] == 'Massive'] \
    .sort_values('Aggregate Crime', ascending=False)[cols]
print(massive_sorted.iloc[0:2])
It's pretty clear: the outliers are in cities, where crime rates tend to increase faster with student enrollment and campus size. But what about the other, larger group, where crime doesn't rise as fast with student enrollment? We can examine those as well, by listing the bottom of each sorted list:
print(tiny_sorted.iloc[-1])
print(smallish_sorted.iloc[-3:-1])
print(largeish_sorted.iloc[-4:-1])
print(massive_sorted.iloc[-5:-1])
Even at this low-crime end there are a few campuses in large cities (Cal State Los Angeles and Cal Poly Pomona), so we can't make blanket statements about city campuses versus non-city campuses. It would be useful to attach data about the surrounding city to each campus, since that would give us a quantitative variable to use instead of a fuzzy "these all sound like big cities."
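Mechanically, the loaders above already parse the city-level offense table, so the join itself is straightforward; what's missing is a campus-to-city mapping, which the campus tables don't provide. Here is a minimal sketch under that assumption - the city_lookup dictionary below is hypothetical and would need an entry for every campus:
# Hypothetical campus -> city mapping (illustrative entries only)
city_lookup = {'Pomona': 'Pomona', 'Los Angeles': 'Los Angeles'}
df_campus['City'] = df_campus['Campus'].map(city_lookup)  # NaN where unmapped

# Attach the surrounding city's crime statistics to each campus row
df_city = ca_offenses_by_city('data/')
df_campus_city = pd.merge(df_campus, df_city,
                          left_on='City', right_on=df_city.columns[0],
                          suffixes=('', ' (city)'))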
Examining the trend of aggregate crime versus school size revealed grouping in the data. We can use a factor plot to explore other factors; first, let's see what columns are available to us:
df_campus.columns
We are still looking for a quantitative way to split the aggregate crime versus school size data into two groups. Here's the jitter plot we saw before, showing the two clear groups:
sns.stripplot(x="School size", y="Aggregate Crime",
              data=df_campus, jitter=True)
show()
The solution to partitioning these two groups turns out to be something right in front of our nose: the school's name. That's right! We can use the school's name to group campuses together, and examine crime statistics for campuses across a given system. When we do this, we see that the University of California system has a much steeper correlation between school size and aggregate crime, accounting nearly exclusively for the outliers we spotted in the plot above.
To aggregate the data by university/college system, we'll use the value_counts() method to count how many rows share each university/college name. We'll then take every name with a count larger than 1 (that is, every system with more than one campus) as its own category, and lump the single-campus schools into an "Other" category.
unicol = df_campus['University/College']
university_categories = []

# Compile a list of all College/University names with more than 1 campus
for (name, count) in unicol.value_counts().items():
    if count > 1:
        university_categories.append(name)

## To filter out 1-campus schools entirely, use this:
#df_multi_campus = df_campus[df_campus['University/College'].map(lambda x: x in university_categories)]

# To add 1-campus schools to an "Other" category, use this:
df_campus['UCtemp'] = df_campus['University/College'].map(
    lambda x: x if x in university_categories else 'Other')
sns.lmplot(x="Student enrollment", y="Aggregate Crime",
           data=df_campus, hue="UCtemp")
show()
sns.lmplot(x="Student enrollment", y="Violent crime",
           data=df_campus, hue="UCtemp")
show()
At this point it's clear that the University of California system suffers from a higher overall rate of crime, but it isn't clear why (except that UC schools are typically located in larger cities). While there is a single outlier, UCLA, with an unusually high violent crime rate, the trend holds for schools across the UC system.
df_campus.sort_values('Violent crime',ascending=False).iloc[0:2]
Now let's go back and re-examine that jitter plot using the per capita incidence of crime. This gives a quite different picture: when we looked at the total number of crimes, crime at large schools looked "out of control", but here we see that the per capita incidence of crime on those campuses is not substantially outside the norm.
However, we do see two very small campuses that stick out from the rest of the "Super Tiny" category - UC Hastings and UC San Francisco, both University of California campuses located in the city of San Francisco.
These campuses are small enough that their total number of crimes was not outside the norm; but on a per capita basis, these two smaller campuses are much more dangerous (a given person is more likely to experience a crime). Splitting the crime data into violent crimes and property crimes shows that these two campuses are dangerous for different reasons.
f, axes = subplots(1, 3, figsize=(16, 4))
variables = ['Aggregate Crime', 'Violent Crime', 'Property Crime']
for ax, varlabel in zip(axes, variables):
    sns.stripplot(x="School size", y="Per Capita " + varlabel,
                  data=df_campus, jitter=True, ax=ax)
    ax.set_title(varlabel + ' vs Campus Size')
show()
label1 = ['University/College', 'Campus', 'Student enrollment']
label2 = ['Per Capita Aggregate Crime', 'Per Capita Violent Crime', 'Per Capita Property Crime']
tiny_schools = df_campus[df_campus['School size'] == 'Super Tiny']
for sort_label in label2:
    print("=" * 60)
    print("Schools Ranked By " + sort_label + ":")
    pprint(tiny_schools.sort_values(sort_label, ascending=False).iloc[0:3][label1 + label2].T)
Both UC Hastings and UC San Francisco cluster together in a high-crime category of small schools because both are small campuses located in San Francisco proper. But even then, we can see some major differences between the two: the UC Hastings campus has a per capita rate of violent crime that is 4.5 times higher than UC San Francisco's.
This is a reflection of the neighborhoods where these two schools are located: UC Hastings is close to downtown San Francisco and is on the border of the Tenderloin district, which holds the unenviable title of most crime-ridden neighborhood in San Francisco.
UC San Francisco, on the other hand, is more secluded, located between two large parks - Mt. Sutro and Golden Gate Park - in an area that is considerably safer than the Tenderloin.
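We can check that 4.5x figure directly. This is a sketch: the substring matches below ('Hastings' in the school name, 'San Francisco' in the campus name) are assumptions about how the two rows are labeled, and should be adjusted to whatever the printed tables above actually show:
# Ratio of per capita violent crime, UC Hastings vs. UC San Francisco
is_tiny = df_campus['School size'] == 'Super Tiny'
hastings = df_campus[is_tiny & df_campus['University/College'].str.contains('Hastings', na=False)]
ucsf = df_campus[is_tiny & df_campus['Campus'].str.contains('San Francisco', na=False)]
print(hastings['Per Capita Violent Crime'].iloc[0] /
      ucsf['Per Capita Violent Crime'].iloc[0])  # the text above cites roughly 4.5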
It is useful to look at the incidence of crime on different campuses to identify where crime happens; but we can also use the data about law enforcement agencies to ask whether the number of law enforcement officers and other personnel is related to crime rates.
(Note that this is a tricky chicken-and-egg problem: we can only determine whether more law enforcement officers correlate with more crime; we cannot determine which causes which. This matters because, on the one hand, we would expect more law enforcement personnel to help lower crime; on the other hand, we would expect the number of law enforcement personnel to be higher wherever a campus sits in an area with higher crime rates.)
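Before plotting, we can put a single number on that association - a Pearson correlation coefficient, which (as the note above stresses) measures correlation only, not causation:
# Pearson r between per capita personnel and per capita aggregate crime
r = df_campus['Per Capita Law Enforcement Personnel'].corr(
        df_campus['Per Capita Aggregate Crime'])
print(r)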
plot(df_campus['Per Capita Law Enforcement Personnel'], df_campus['Per Capita Aggregate Crime'],'o')
xlabel('Per Capita Law Enforcement Personnel')
ylabel('Per Capita Aggregate Crime')
show()
As we might have anticipated, we have some outliers: the two schools with small populations and high crime rates, inflating all of the per capita statistics.
print(df_campus[['University/College', 'Campus']][df_campus['Per Capita Law Enforcement Personnel'] > 0.005])
Let's filter this data out so we can get a clearer picture of the relationship between law enforcement personnel and incidence of crime.
df_campus_filtered = df_campus[df_campus['Per Capita Law Enforcement Personnel']<0.005]
plot(df_campus_filtered['Per Capita Law Enforcement Personnel'], df_campus_filtered['Per Capita Aggregate Crime'],'o')
xlabel('Per Capita Law Enforcement Personnel')
ylabel('Per Capita Aggregate Crime')
show()
There is a general upward trend - the more law enforcement personnel there are, the higher the per capita aggregate crime rate. Let's condition this plot on the ratio of law enforcement civilians to law enforcement officers, to see whether that makeup correlates with higher or lower crime rates.
The "Law Enforcement Civilians Per Officer" column contains the ratio we're interested in; we'll cut the data at the 33rd and 66th percentiles using the pd.qcut() function to split it into three named categories.
ratio_bins = [0.0, 0.33, 0.66, 1.0]
ratio_data = df_campus['Law Enforcement Civilians Per Officer']
# A low civilians-per-officer ratio means an officer-heavy agency, and a
# high ratio means a civilian-heavy one, so the labels run low to high:
ratio_labels = ['More Officers', 'Mixed', 'More Civilians']
df_campus['Law Enforcement Civilian Officer Ratio'] = pd.qcut(
    ratio_data, ratio_bins, labels=ratio_labels)
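A quick check that qcut produced three roughly equal-sized categories:
# Each tercile should hold roughly a third of the campuses
df_campus['Law Enforcement Civilian Officer Ratio'].value_counts().sort_index()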
Now we can use these new category labels to visualize our data categorically.
df_campus_filtered = df_campus[df_campus['Per Capita Law Enforcement Personnel'] < 0.005]
sns.lmplot(x="Per Capita Law Enforcement Personnel", y="Per Capita Aggregate Crime",
           hue="Law Enforcement Civilian Officer Ratio",
           data=df_campus_filtered)
xlim([0.0, 0.005])
ylim([0.00, 0.05])
show()
There is not a strong dependence between the per capita crime rate and the per capita number of law enforcement personnel.
If we look at the total numbers (including the two outliers, UC San Francisco and UC Hastings), we see a general upward trend - more officers means more crimes - but this is largely because larger campuses have more people and therefore more crime.
sns.lmplot(x="Total law enforcement employees", y="Aggregate Crime",
           hue="Law Enforcement Civilian Officer Ratio",
           data=df_campus)
show()
An analysis of campus law enforcement and crime data showed a couple of trends we anticipated, and a couple we did not. Specifically, we confirmed a general positive trend between total student enrollment and incidence of crime. However, this did not explain all of the data: there was quite a bit of variance around the enrollment-versus-crime trend line, attributable to several outliers. When we identified the outlier campuses, we found that they were all University of California campuses (which tended to have a higher overall incidence of crime). UC Berkeley and UCLA had the highest total counts of crime of any campus, while UC Hastings and UC San Francisco, both small campuses located in a large metropolitan area, had unusually high per capita rates of crime.
There also does not appear to be a strong relationship between the per capita number of law enforcement employees and the per capita incidence of crime, nor does the crime rate appear to depend on the makeup of the law enforcement agency (civilian-heavy versus officer-heavy).
In some sense, these findings are no surprise - we would expect, going in, that UC Hastings, located in a bad neighborhood in a big city, would have a higher rate of crime per student than other campuses. But the data did give us a breakdown between violent and property crime, which showed significant differences between crime at UC San Francisco and UC Hastings.
Perhaps the most important observation here is that the campus data sets are incomplete. While campus size gives a strong indication of the level of crime, it leaves multiple outliers unexplained. To properly describe (or predict) those outliers, a model needs data not contained in this data set: the crime rate of the surrounding neighborhood and city. Crime rates cannot be effectively described by campus size alone (too many outliers) or by the number of law enforcement officers (no significant relationship).