Introduction

In this notebook, we'll be analyzing the law enforcement and crime statistics data about California from the FBI. The first thing we are going to do is define a bunch of functions to extract the data from the CSV files provided by the FBI via the Kaggle Dataset. These functions are also defined in the "Cleaning Up The Crime Scene" notebook; they're copied and pasted here without explanation so we can focus on analyzing the data.

We'll be utilizing each of these in turn, but it will be useful to define them all up front and get it out of the way.

In [1]:
# must for data analysis
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib.pyplot import *

# useful for data wrangling
import io, os, re, subprocess

# for sanity
from pprint import pprint
In [2]:
def ca_law_enforcement_by_agency(data_directory):
    filename = 'ca_law_enforcement_by_agency.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        content = f.read()

    content = re.sub('\r',' ',content)
    [header,data] = content.split("civilians\"")
    header += "civilians\""
    
    data = data.strip()
    agencies = re.findall('\w+ Agencies', data)
    all_but_agencies = re.split('\w+ Agencies',data)
    del all_but_agencies[0]
    
    newlines = []
    for (a,aba) in zip(agencies,all_but_agencies):
        newlines.append(''.join([a,aba]))
    
    # Combine into one long string, and do more processing
    one_string = '\n'.join(newlines)
    sio = io.StringIO(unicode(one_string))
    
    # Process column names
    columnstr = header.strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df

def ca_law_enforcement_by_campus(data_directory):
    filename = 'ca_law_enforcement_by_campus.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()

    # Process each string in the list
    newlines = []
    for p in lines[1:]:
        if( len(re.findall(',,,,',p))==0):
            newlines.append(p)

    # Combine into one long string, and do more processing
    one_string = '\n'.join(newlines)
    sio = io.StringIO(unicode(one_string))

    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub('\r',' ',columnstr)
    columnstr = re.sub('\s+',' ',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')
    
    return df

def ca_law_enforcement_by_city(data_directory):
    filename = 'ca_law_enforcement_by_city.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        content = f.read()

    content = re.sub('\r',' ',content)
    [header,data] = content.split("civilians\"")
    header += "civilians\""
    
    data = data.strip()
    
    # Combine into one long string, and do more processing
    one_string = re.sub(r'([0-9]) ([A-Za-z])',r'\1\n\2',data)
    sio = io.StringIO(unicode(one_string))
    
    # Process column names
    columnstr = header.strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df

def ca_law_enforcement_by_county(data_directory):
    filename = 'ca_law_enforcement_by_county.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        content = f.read()

    content = re.sub('\r',' ',content)
    [header,data] = content.split("civilians\"")
    header += "civilians\""
    
    data = data.strip()
        
    # Combine into one long string, and do more processing
    one_string = re.sub(r'([0-9]) ([A-Za-z])',r'\1\n\2',data)
    sio = io.StringIO(unicode(one_string))
    
    # Process column names
    columnstr = header.strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df

def ca_offenses_by_agency(data_directory):
    filename = 'ca_offenses_by_agency.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()
    
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df

def ca_offenses_by_campus(data_directory):
    filename = 'ca_offenses_by_campus.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    
    lines = content.split('\r')
    lines = [l for l in lines if 'Medical Center, Sacramento5' not in l]
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')
    df = df[pd.notnull(df['University/College'])]

    return df

def ca_offenses_by_city(data_directory):
    filename = 'ca_offenses_by_city.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        content = f.read()
    
    lines = content.split('\r')
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]

    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')
    
    return df

def ca_offenses_by_county(data_directory):
    filename = 'ca_offenses_by_county.csv'

    # Load file into list of strings
    with open(data_directory + '/' + filename) as f:
        lines = f.readlines()
    
    one_line = '\n'.join(lines[1:])
    sio = io.StringIO(unicode(one_line))
    
    # Process column names
    columnstr = lines[0].strip()
    columnstr = re.sub('\s+',' ',columnstr)
    columnstr = re.sub('"','',columnstr)
    columns = columnstr.split(",")
    columns = [s.strip() for s in columns]
    
    # Load the whole thing into Pandas
    df = pd.read_csv(sio,quotechar='"',names=columns,thousands=',')

    return df
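
The column-name cleanup is repeated almost verbatim in every loader above. As a minimal sketch (the helper name `clean_columns` is hypothetical and does not appear in the original notebook), the shared steps could be factored out like this:

```python
import re

def clean_columns(header):
    """Turn a raw CSV header line into a list of clean column names.

    Hypothetical helper: it mirrors the steps each loader performs inline.
    """
    # Collapse newlines, carriage returns and repeated spaces into single spaces
    header = re.sub(r'\s+', ' ', header.strip())
    # Drop the quote characters around multi-word column names
    header = header.replace('"', '')
    # Split on commas and trim whitespace around each name
    return [name.strip() for name in header.split(',')]

# A header line with quotes and an embedded carriage return
columns = clean_columns('"University/College",Campus,"Student\renrollment"\n')
print(columns)
```

Like the loaders themselves, this sketch splits on every comma, so it assumes that no column name contains a comma.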

Campuses

Let's start our analysis with college campuses in California. There are two data files about campuses, one on law enforcement and one on crime. Let's combine these into one DataFrame:

In [3]:
df_enforcement = ca_law_enforcement_by_campus('data/')
df_offenses = ca_offenses_by_campus('data/')
In [4]:
df_enforcement.head()
Out[4]:
University/College Campus Student enrollment Total law enforcement employees Total officers Total civilians
0 Allan Hancock College NaN 11047 10.0 5.0 5.0
1 California State Polytechnic University Pomona 23966 27.0 19.0 8.0
2 California State Polytechnic University San Luis Obispo 20186 33.0 17.0 16.0
3 California State University Bakersfield 8720 21.0 14.0 7.0
4 California State University Channel Islands 5879 28.0 14.0 14.0
In [5]:
df_offenses.head()
Out[5]:
University/College Campus Student enrollment Violent crime Murder and nonnegligent manslaughter Rape (revised definition) Rape (legacy definition) Robbery Aggravated assault Property crime Burglary Larceny-theft Motor vehicle theft Arson
0 Allan Hancock College NaN 11047.0 0.0 0.0 0.0 NaN 0.0 0.0 21.0 2.0 18.0 1.0 0.0
1 California State Polytechnic University Pomona 23966.0 6.0 0.0 4.0 NaN 1.0 1.0 173.0 5.0 150.0 18.0 1.0
2 California State Polytechnic University San Luis Obispo4 20186.0 3.0 0.0 2.0 NaN 0.0 1.0 163.0 7.0 149.0 7.0 1.0
3 California State University Bakersfield 8720.0 1.0 0.0 0.0 NaN 0.0 1.0 78.0 12.0 65.0 1.0 0.0
4 California State University Channel Islands 5879.0 2.0 0.0 0.0 NaN 1.0 1.0 62.0 7.0 54.0 1.0 0.0
In [6]:
len(df_offenses)
Out[6]:
48
In [7]:
len(df_enforcement)
Out[7]:
48

The fact that the number of rows in each DataFrame matches is a good sign. There are a few typos in the campus column (stray trailing digits at the end of campus names, such as "San Luis Obispo4"), so we'll use the map function to apply a regular expression substitution to each value of the column.

In [8]:
for r in df_offenses['Campus']:
    if(type(r)==type(' ')):
        df_offenses['Campus'][df_offenses['Campus']==r].map(lambda x : re.sub(r'[0-9]$','',x))
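
Note that `Series.map` returns a new Series rather than modifying the column in place, so the result has to be assigned back for the cleanup to persist. A minimal sketch on a three-row miniature frame (the frame `demo` and the helper `strip_footnote_digit` are hypothetical; only the `Campus` column name and the "San Luis Obispo4" value come from the data above):

```python
import re
import pandas as pd

def strip_footnote_digit(value):
    # Leave missing values (NaN/None) untouched; only strings carry stray digits.
    # (On Python 2, test against `basestring` instead of `str`.)
    if not isinstance(value, str):
        return value
    return re.sub(r'[0-9]$', '', value)

# Miniature stand-in for the offenses table, for illustration only
demo = pd.DataFrame({'Campus': ['Pomona', 'San Luis Obispo4', None]})

# Assign the mapped result back to the column so the change is kept
demo['Campus'] = demo['Campus'].map(strip_footnote_digit)
print(demo['Campus'].tolist())
```

Applied to the real data, the same pattern would read `df_offenses['Campus'] = df_offenses['Campus'].map(strip_footnote_digit)`.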

It is now possible to merge these two DataFrames. Because the first three columns (University/College, Campus, and Student enrollment) should all match, we can specify that we want to merge the DataFrames on those three columns.

In [9]:
df_campus = pd.merge(df_offenses, df_enforcement, 
                     on=[df_enforcement.columns[0],df_enforcement.columns[1],df_enforcement.columns[2]])
df_campus.head()
Out[9]:
University/College Campus Student enrollment Violent crime Murder and nonnegligent manslaughter Rape (revised definition) Rape (legacy definition) Robbery Aggravated assault Property crime Burglary Larceny-theft Motor vehicle theft Arson Total law enforcement employees Total officers Total civilians
0 Allan Hancock College NaN 11047.0 0.0 0.0 0.0 NaN 0.0 0.0 21.0 2.0 18.0 1.0 0.0 10.0 5.0 5.0
1 California State Polytechnic University Pomona 23966.0 6.0 0.0 4.0 NaN 1.0 1.0 173.0 5.0 150.0 18.0 1.0 27.0 19.0 8.0
2 California State University Bakersfield 8720.0 1.0 0.0 0.0 NaN 0.0 1.0 78.0 12.0 65.0 1.0 0.0 21.0 14.0 7.0
3 California State University Channel Islands 5879.0 2.0 0.0 0.0 NaN 1.0 1.0 62.0 7.0 54.0 1.0 0.0 28.0 14.0 14.0
4 California State University Dominguez Hills 14687.0 4.0 0.0 0.0 NaN 0.0 4.0 72.0 5.0 65.0 2.0 0.0 26.0 20.0 6.0
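
An inner merge silently drops any row whose key columns differ between the two tables, for example because of a leftover trailing digit; the merged preview above jumps from Pomona straight to Bakersfield, which suggests that this happened to the San Luis Obispo row. One way to check, sketched here on a two-row miniature built from the rows shown earlier (the frames `left` and `right` are hypothetical), is an outer merge with `indicator=True`, which adds a `_merge` column recording where each row came from:

```python
import pandas as pd

# Two rows from each table, keyed on the campus name only for brevity
left = pd.DataFrame({'Campus': ['Pomona', 'San Luis Obispo4'],
                     'Violent crime': [6, 3]})
right = pd.DataFrame({'Campus': ['Pomona', 'San Luis Obispo'],
                      'Total officers': [19, 17]})

# The outer merge keeps every row; _merge is 'both', 'left_only' or 'right_only'
check = pd.merge(left, right, on='Campus', how='outer', indicator=True)

# Rows that would be lost in an inner merge
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Campus', '_merge']])
```

Any row listed in `unmatched` is a candidate for further cleanup before the final merge.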

Variables

A summary of variables in the table:

  • Violent crime is an aggregate of the next 5 columns - murder and nonnegligent manslaughter, rape (revised definition), rape (legacy definition), robbery, and aggravated assault.
  • Property crime is an aggregate of the next 3 columns - burglary, larceny-theft, motor vehicle theft.
  • Arson is a separate category not accounted for by violent crime or property crime columns.
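
As a quick sanity check of these definitions, the aggregates can be recomputed from their components. The sketch below uses the Cal Poly Pomona row from the `df_offenses.head()` output above and skips the missing legacy rape count (NaN), which is an assumption about how the FBI totals were built; the helper `total` is hypothetical:

```python
import math

# Cal Poly Pomona row, copied from the df_offenses.head() output above
row = {
    'Violent crime': 6.0,
    'Murder and nonnegligent manslaughter': 0.0,
    'Rape (revised definition)': 4.0,
    'Rape (legacy definition)': float('nan'),
    'Robbery': 1.0,
    'Aggravated assault': 1.0,
    'Property crime': 173.0,
    'Burglary': 5.0,
    'Larceny-theft': 150.0,
    'Motor vehicle theft': 18.0,
    'Arson': 1.0,
}

violent_parts = ['Murder and nonnegligent manslaughter',
                 'Rape (revised definition)', 'Rape (legacy definition)',
                 'Robbery', 'Aggravated assault']
property_parts = ['Burglary', 'Larceny-theft', 'Motor vehicle theft']

def total(row, names):
    # Sum the named columns, skipping missing values (NaN)
    return sum(row[n] for n in names if not math.isnan(row[n]))

violent_ok = total(row, violent_parts) == row['Violent crime']
property_ok = total(row, property_parts) == row['Property crime']
print(violent_ok, property_ok)
```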

Derived Quantities

A couple of input variables are useful directly, but it's also useful to compute some derived quantities that we can use for statistics (think: normalized quantities). For example, law enforcement statistics would benefit from a per capita measure of law enforcement personnel. This would enable cross-comparisons between, say, a campus and its surrounding city.

Law enforcement statistics:

  • Per capita law enforcement personnel. Hypothesis: this number will correlate with crime frequency.
  • Law enforcement civilians per law enforcement officer. Hypothesis: this number will decrease with crime frequency.

Crime statistics:

  • Aggregate crime: property + violent + arson

Law enforcement/crime combined statistics:

  • Number of violent crimes per law enforcement officer.
  • Number of property crimes per law enforcement officer.
In [10]:
df_campus['Per Capita Law Enforcement Personnel'] = (df_campus['Total law enforcement employees'])/(df_campus['Student enrollment'])
df_campus['Law Enforcement Civilians Per Officer'] = (df_campus['Total civilians'])/(df_campus['Total officers'])

df_campus['Aggregate Crime'] = df_campus['Violent crime'] + df_campus['Property crime'] + df_campus['Arson']
df_campus['Per Capita Violent Crime'] = (df_campus['Violent crime'])/(df_campus['Student enrollment'])
df_campus['Per Capita Property Crime'] = (df_campus['Property crime'])/(df_campus['Student enrollment'])
df_campus['Per Capita Aggregate Crime'] = (df_campus['Violent crime'] + df_campus['Property crime'] + df_campus['Arson'])/(df_campus['Student enrollment'])

df_campus['Aggregate Crime Per Officer'] = (df_campus['Aggregate Crime'])/(df_campus['Total officers'])
df_campus['Violent Crime Per Officer'] = (df_campus['Violent crime'])/(df_campus['Total officers'])
df_campus['Property Crime Per Officer'] = (df_campus['Property crime'])/(df_campus['Total officers'])

Cutting The Data

Like cutting a deck of cards, we can cut a DataFrame at particular locations to discretize and bin data.

Suppose we want to cut the DataFrame at different school sizes, and specify names for each category. In our last notebook we used the np.percentile() function and specified the quantile we wanted. That just got us the cut locations; it didn't actually cut or re-categorize the data for us.

This time, let's look at how we would do that with Pandas, which is a much easier way to cut the data up.

In [11]:
# Start with the data we are going to cut up
data = df_campus['Student enrollment']

bins = [0, 0.20, 0.5, 0.80, 1.0]

# Here's what qcut looks like:
pd.qcut(data,bins).head()
Out[11]:
0     (8720, 20225]
1    (20225, 34508]
2      [1003, 8720]
3      [1003, 8720]
4     (8720, 20225]
Name: Student enrollment, dtype: category
Categories (4, object): [[1003, 8720] < (8720, 20225] < (20225, 34508] < (34508, 41845]]
In [12]:
group_names = ['Super Tiny','Small-ish','Large-ish','Massive']
df_campus['School size'] = pd.qcut(data,bins,labels=group_names)
#pprint(dir(pd.qcut(data,bins,labels=group_names)))
In [13]:
pd.value_counts(df_campus['School size']).sort_index()
Out[13]:
Super Tiny     8
Small-ish     10
Large-ish     11
Massive        7
Name: School size, dtype: int64
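
To make the connection with `np.percentile()` explicit: `pd.qcut` is roughly equivalent to computing the quantile edges yourself and passing them to `pd.cut`. A minimal sketch on made-up enrollment figures (the `enrollment` values are hypothetical, not taken from the FBI data):

```python
import numpy as np
import pandas as pd

# Made-up enrollment numbers, for illustration only
enrollment = pd.Series([1000, 3000, 5000, 8000, 12000,
                        20000, 25000, 30000, 35000, 42000])
quantiles = [0, 0.20, 0.5, 0.80, 1.0]
labels = ['Super Tiny', 'Small-ish', 'Large-ish', 'Massive']

# One step: qcut finds the quantile edges and bins the data
by_qcut = pd.qcut(enrollment, quantiles, labels=labels)

# Two steps: compute the edges with np.percentile, then bin with cut.
# include_lowest=True keeps the smallest value inside the first bin.
edges = np.percentile(enrollment, [100 * q for q in quantiles])
by_cut = pd.cut(enrollment, edges, labels=labels, include_lowest=True)

print(by_qcut.value_counts().sort_index())
print(list(by_qcut) == list(by_cut))
```

Labels are assigned to bins in ascending order of the values, so the first name in `labels` always goes to the smallest schools.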

School Size vs Incidence of Crime: Jitter Plot

One place we might want to start is to look for known or expected trends. We might expect, going in, that schools in larger and more populated areas will have a higher incidence of crimes, and a larger law enforcement presence, than others. We would also expect this because larger campuses tend to have higher concentrations of undergraduates living on campus, which leads to a higher frequency of crime and requires a heavier law enforcement presence.

But if we start with this known, or expected, relationship, we immediately find outliers.

In [14]:
import seaborn as sns
In [15]:
sns.stripplot(x="School size", y="Aggregate Crime", data=df_campus, jitter=True)
title('Aggregate Crime vs Campus Size')
show()

It should be noted that if we look at the per capita incidence of crime, the picture looks very different: larger schools have a lower relative rate of crimes. We'll return to this in the section on per capita crime rates.

In [16]:
sns.stripplot(x="School size", y="Per Capita Aggregate Crime", data=df_campus, jitter=True)
title('Per Capita Aggregate Crime vs Campus Size')
show()

The first plot above shows the aggregate crime (violent crime plus property crime plus arson) committed on campuses, versus their size. There seem to be two different groups plotted there - one large group for which aggregate crime rises very slowly with school size, and another smaller group of outliers for which aggregate crime is much higher and rises much more sharply. The trend of aggregate crime versus school size doesn't explain enough of the variation in data.

The one outlier - the very tiny school with very high aggregate crime - is another example of how our initial assumptions and mental models can sometimes be incorrect. The University of California, San Francisco is a small medical school with a campus in the heart of San Francisco. The campus is located near Golden Gate Park in a relatively high crime area, bucking the trend of smaller campuses having lower crime.

Here are the outliers in each category:

In [17]:
tiny_sorted = (df_campus[df_campus['School size']=='Super Tiny'].sort_values('Aggregate Crime',ascending=False))[['University/College','Campus']]
print tiny_sorted.iloc[0]
University/College    University of California
Campus                           San Francisco
Name: 32, dtype: object
In [18]:
smallish_sorted = (df_campus[df_campus['School size']=='Small-ish'].sort_values('Aggregate Crime',ascending=False))[['University/College','Campus']]
print smallish_sorted.iloc[0:4]
             University/College          Campus
34     University of California      Santa Cruz
23      Sonoma State University             NaN
10  California State University  San Bernardino
5   California State University        East Bay
In [19]:
largeish_sorted = (df_campus[df_campus['School size']=='Large-ish'].sort_values('Aggregate Crime',ascending=False))[['University/College','Campus']]
print largeish_sorted.iloc[0:4]
            University/College     Campus
26    University of California      Davis
31    University of California  San Diego
21  San Diego State University        NaN
30    University of California  Riverside
In [20]:
massive_sorted = (df_campus[df_campus['School size']=='Massive'].sort_values('Aggregate Crime',ascending=False))[['University/College','Campus']]
print massive_sorted.iloc[0:2]
          University/College       Campus
28  University of California  Los Angeles
25  University of California     Berkeley

It's pretty clear: the outliers are in cities, where crime rates tend to increase faster with student enrollment and campus size. But what about the other, larger group, where crime doesn't rise as fast with student enrollment? We can examine those as well, by listing the bottom of the list:

In [21]:
print tiny_sorted.iloc[-1]
University/College    Marin Community College
Campus                                    NaN
Name: 18, dtype: object
In [22]:
print smallish_sorted.iloc[-3:-1]
             University/College      Campus
13      College of the Sequoias         NaN
11  California State University  San Marcos
In [23]:
print largeish_sorted.iloc[-4:-1]
                         University/College      Campus
9               California State University  Sacramento
1   California State Polytechnic University      Pomona
15                        El Camino College         NaN
In [24]:
print massive_sorted.iloc[-5:-1]
                         University/College      Campus
8               California State University  Northridge
24  State Center Community College District         NaN
14           Contra Costa Community College         NaN
19              Riverside Community College         NaN

There are a few campuses in large cities (Cal State Los Angeles and Cal Poly Pomona), so we can't make blanket statements about city campuses versus non-city campuses. It would be useful to include some data about the surrounding city with each campus, since that would give us a quantitative variable to use, instead of a more fuzzy "these all sound like big cities."

Factor Plot: School Size and...?

Examining the trend of aggregate crime versus school size revealed grouping in the data. We can use a factor plot to explore other factors.

In [25]:
df_campus.columns
Out[25]:
Index([u'University/College', u'Campus', u'Student enrollment',
       u'Violent crime', u'Murder and nonnegligent manslaughter',
       u'Rape (revised definition)', u'Rape (legacy definition)', u'Robbery',
       u'Aggravated assault', u'Property crime', u'Burglary', u'Larceny-theft',
       u'Motor vehicle theft', u'Arson', u'Total law enforcement employees',
       u'Total officers', u'Total civilians',
       u'Per Capita Law Enforcement Personnel',
       u'Law Enforcement Civilians Per Officer', u'Aggregate Crime',
       u'Per Capita Violent Crime', u'Per Capita Property Crime',
       u'Per Capita Aggregate Crime', u'Aggregate Crime Per Officer',
       u'Violent Crime Per Officer', u'Property Crime Per Officer',
       u'School size'],
      dtype='object')

We are still looking for a quantitative way to split the aggregate crime versus school size data into two groups. Here's the jitter plot we saw, which shows two clear groups:

In [26]:
sns.stripplot(x="School size", y="Aggregate Crime",
               data=df_campus, jitter=True)
show()

UC vs Cal State: The System

The solution to partitioning these two groups turns out to be something right in front of our nose: the school's name. That's right! We can use the school's name to group campuses together, and examine crime statistics for campuses across a given system. When we do this, we see that the University of California system has a much steeper relationship between school size and aggregate crime, accounting nearly exclusively for the outliers we spotted in the plot above.

To aggregate the data by university/college system, we will use the value_counts() method to count how many campuses share each university/college name. We then collect every name with a value count larger than 1 (meaning, university or college systems that have more than 1 campus) into a list of multi-campus systems. Finally, we'll label every school that is not in this list as "Other"; the commented-out line shows how to drop those single-campus schools instead.

In [27]:
unicol = df_campus['University/College']

# Compile a list of all College/University names with more than 1 campus
campus_counts = unicol.value_counts()
university_categories = list(campus_counts[campus_counts > 1].index)

## To filter out 1-campus schools, use this:
#df_multi_campus = df_campus[df_campus['University/College'].map(lambda x : x in university_categories)]

# To add 1-campus schools to an "Other" category, use this:
df_campus['UCtemp'] = df_campus['University/College'].map(lambda x : x if x in university_categories else "Other")
In [28]:
sns.lmplot(x="Student enrollment", y="Aggregate Crime",
               data=df_campus, hue="UCtemp")
show()
In [29]:
sns.lmplot(x="Student enrollment", y="Violent crime",
               data=df_campus, hue="UCtemp")
show()

At this point it's clear that the University of California system sees more crime at a given enrollment than the other systems, but it isn't clear why (except that UC schools are typically located in larger cities). While there is a single outlier, UCLA, with an unusually high number of violent crimes, the trend holds for schools across the UC system.

In [30]:
df_campus.sort_values('Violent crime',ascending=False).iloc[0:2]
Out[30]:
University/College Campus Student enrollment Violent crime Murder and nonnegligent manslaughter Rape (revised definition) Rape (legacy definition) Robbery Aggravated assault Property crime ... Law Enforcement Civilians Per Officer Aggregate Crime Per Capita Violent Crime Per Capita Property Crime Per Capita Aggregate Crime Aggregate Crime Per Officer Violent Crime Per Officer Property Crime Per Officer School size UCtemp
28 University of California Los Angeles 41845.0 97.0 0.0 27.0 NaN 22.0 48.0 817.0 ... 0.633333 916.0 0.002318 0.019524 0.021890 15.266667 1.616667 13.616667 Massive University of California
25 University of California Berkeley 37565.0 35.0 0.0 9.0 NaN 12.0 14.0 749.0 ... 1.323077 785.0 0.000932 0.019939 0.020897 12.076923 0.538462 11.523077 Massive University of California

2 rows × 28 columns
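
The steeper trend for the University of California system in the regression plots can be quantified with a least-squares slope per system. The sketch below is only an illustration: it uses the handful of campuses whose enrollment and crime counts are printed in the outputs above (aggregate crime is violent + property + arson), so the numbers are not a reproduction of the `lmplot` fits, and the helper `slope` is hypothetical.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = float(len(points))
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# (Student enrollment, Aggregate Crime) pairs read off the tables above
systems = {
    # Los Angeles, Berkeley
    'University of California': [(41845, 916), (37565, 785)],
    # Bakersfield, Channel Islands, Dominguez Hills
    'California State University': [(8720, 79), (5879, 64), (14687, 76)],
}

slopes = {name: slope(points) for name, points in systems.items()}
for name in sorted(slopes):
    print('%-30s %.4f additional crimes per student' % (name, slopes[name]))
```

With the full DataFrame, the same calculation could be run once per value of the `UCtemp` column.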

Per Capita Crime Rates

Now let's go back and re-examine that jitter plot with the per-capita incidence of crime. This gives a quite different picture of the incidence of crime. When we looked at the total number of crimes, crime at large schools looked "out of control", but here we see that the per capita incidence of crime on these campuses is not substantially outside the norm.

However, we do see two very small campuses that stick out from the rest of the "Super Tiny" category. Both are University of California campuses located in the city of San Francisco:

  • UC Hastings College of Law
  • UC San Francisco Medical School

These campuses are small enough that the total number of crimes was not outside the norm; but on a per-capita basis, these two smaller campuses are much more dangerous (higher likelihood of a given person experiencing a crime). Splitting out the crime data into violent crimes and property crimes shows us that these two campuses are dangerous for different reasons.

In [31]:
f, axes = subplots(1,3, figsize=(16, 4))

variables = ['Aggregate Crime','Violent Crime','Property Crime']

for ax,varlabel in zip(axes,variables):
    sns.stripplot(x="School size", y="Per Capita "+varlabel, data=df_campus, jitter=True, ax=ax)
    ax.set_title(varlabel+' vs Campus Size')
show()
In [32]:
label1 = ['University/College','Campus','Student enrollment']
label2 = ['Per Capita Aggregate Crime','Per Capita Violent Crime','Per Capita Property Crime']
tiny_schools = df_campus[df_campus['School size']=='Super Tiny']

for sort_label in label2:
    print "="*60
    print "Schools Ranked By "+sort_label+":"
    pprint( tiny_schools.sort_values(sort_label, ascending=False).iloc[0:3][label1+label2].T )
============================================================
Schools Ranked By Per Capita Aggregate Crime:
                                                  32  \
University/College          University of California   
Campus                                 San Francisco   
Student enrollment                              3170   
Per Capita Aggregate Crime                  0.142587   
Per Capita Violent Crime                  0.00473186   
Per Capita Property Crime                   0.137855   

                                                  27  \
University/College          University of California   
Campus                       Hastings College of Law   
Student enrollment                              1003   
Per Capita Aggregate Crime                 0.0588235   
Per Capita Violent Crime                   0.0189432   
Per Capita Property Crime                  0.0398804   

                                                   17  
University/College          Humboldt State University  
Campus                                            NaN  
Student enrollment                               8485  
Per Capita Aggregate Crime                  0.0190925  
Per Capita Violent Crime                  0.000824985  
Per Capita Property Crime                   0.0180318  
============================================================
Schools Ranked By Per Capita Violent Crime:
                                                  27  \
University/College          University of California   
Campus                       Hastings College of Law   
Student enrollment                              1003   
Per Capita Aggregate Crime                 0.0588235   
Per Capita Violent Crime                   0.0189432   
Per Capita Property Crime                  0.0398804   

                                                  32  \
University/College          University of California   
Campus                                 San Francisco   
Student enrollment                              3170   
Per Capita Aggregate Crime                  0.142587   
Per Capita Violent Crime                  0.00473186   
Per Capita Property Crime                   0.137855   

                                                     7   
University/College          California State University  
Campus                                     Monterey Bay  
Student enrollment                                 6631  
Per Capita Aggregate Crime                   0.00708792  
Per Capita Violent Crime                     0.00180968  
Per Capita Property Crime                    0.00527824  
============================================================
Schools Ranked By Per Capita Property Crime:
                                                  32  \
University/College          University of California   
Campus                                 San Francisco   
Student enrollment                              3170   
Per Capita Aggregate Crime                  0.142587   
Per Capita Violent Crime                  0.00473186   
Per Capita Property Crime                   0.137855   

                                                  27  \
University/College          University of California   
Campus                       Hastings College of Law   
Student enrollment                              1003   
Per Capita Aggregate Crime                 0.0588235   
Per Capita Violent Crime                   0.0189432   
Per Capita Property Crime                  0.0398804   

                                                   17  
University/College          Humboldt State University  
Campus                                            NaN  
Student enrollment                               8485  
Per Capita Aggregate Crime                  0.0190925  
Per Capita Violent Crime                  0.000824985  
Per Capita Property Crime                   0.0180318  

Both UC Hastings and UC San Francisco are clustered together in a high-crime category of small schools, because both are small campuses located in San Francisco proper. But even then, we can see some major differences between the two: the UC Hastings campus has a per-capita rate of violent crimes about 4 times that of UC San Francisco (0.0189 versus 0.0047), while UC San Francisco has a per-capita rate of property crimes about 3.5 times that of UC Hastings (0.138 versus 0.040).

This is a reflection of the neighborhoods where these two schools are located: UC Hastings is close to downtown San Francisco and is on the border of the Tenderloin district, which holds the unenviable title of most crime-ridden neighborhood in San Francisco.

UC San Francisco, on the other hand, is more secluded, located between two large parks - Mt. Sutro and Golden Gate Park. This area is relatively more safe than the Tenderloin.
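
The comparison between the two San Francisco campuses can be reproduced directly from the per-capita values printed in the ranking output above. A minimal sketch (the dictionary names are hypothetical; the numbers are copied from that output):

```python
# Per-capita rates copied from the "Super Tiny" ranking output above
hastings = {'violent': 0.0189432, 'property': 0.0398804}
ucsf = {'violent': 0.00473186, 'property': 0.137855}

# How many times higher is each campus on the measure where it leads?
violent_ratio = hastings['violent'] / ucsf['violent']      # Hastings relative to UCSF
property_ratio = ucsf['property'] / hastings['property']   # UCSF relative to Hastings

print('Violent crime, Hastings / UCSF:  %.2f' % violent_ratio)
print('Property crime, UCSF / Hastings: %.2f' % property_ratio)
```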

Gauging the Impact of Law Enforcement

It is useful to look at the incidence of crime on different campuses to identify where crime happens; but we can also use data about law enforcement agencies to determine whether the number of law enforcement officers and other personnel are related to crime rates.

(Note that this is a tricky chicken-and-the-egg problem: we can only determine whether more law enforcement officers correlates with more crime; we cannot determine which one causes which. This is important because, on the one hand, we would expect that more law enforcement personnel would help lower crime; on the other hand, we would expect the number of law enforcement personnel to be higher if a campus is located in an area with higher crime rates.)

In [33]:
plot(df_campus['Per Capita Law Enforcement Personnel'], df_campus['Per Capita Aggregate Crime'],'o')
xlabel('Per Capita Law Enforcement Personnel')
ylabel('Per Capita Aggregate Crime')
show()

As we might have anticipated, we have some outliers: the two schools with small populations and high crime rates, inflating all of the per capita statistics.

In [34]:
print df_campus[['University/College','Campus']][df_campus['Per Capita Law Enforcement Personnel']>0.005]
          University/College                   Campus
27  University of California  Hastings College of Law
32  University of California            San Francisco

Let's filter this data out so we can get a clearer picture of the relationship between law enforcement personnel and incidence of crime.

In [35]:
df_campus_filtered = df_campus[df_campus['Per Capita Law Enforcement Personnel']<0.005]
plot(df_campus_filtered['Per Capita Law Enforcement Personnel'], df_campus_filtered['Per Capita Aggregate Crime'],'o')
xlabel('Per Capita Law Enforcement Personnel')
ylabel('Per Capita Aggregate Crime')
show()

There is a general upward trend - the more law enforcement personnel there are, the higher the per capita aggregate crime rate. Let's condition this plot on the ratio of law enforcement civilians to law enforcement officers, to see whether it correlates with higher or lower crime rates.

The "Law Enforcement Civilians Per Officer" column contains the ratio we're interested in. We cut the data at the 0.33 and 0.66 quantiles (the 33rd and 66th percentiles) using the pd.qcut() function, which splits it into three roughly equal-sized categories, and give each category a label.

In [36]:
ratio_bins = [0.0,0.33,0.66,1.0]
ratio_data = df_campus['Law Enforcement Civilians Per Officer']
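# pd.qcut orders bins from lowest to highest ratio, so the lowest
# civilians-per-officer bin is the officer-heavy one
ratio_labels = ["More Officers","Mixed","More Civilians"]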

# Assign the categorical result directly; labels are passed by keyword for clarity
df_campus['Law Enforcement Civilian Officer Ratio'] = pd.qcut(ratio_data, ratio_bins, labels=ratio_labels)
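To see what pd.qcut() does with these quantile edges, here is a small sketch on made-up ratios (the values and the neutral labels "Low", "Middle", "High" are illustrative only). Unlike pd.cut(), which splits on fixed values, pd.qcut() picks the bin edges from the data so that each bin holds roughly the same number of rows.

```python
import pandas as pd

# Made-up civilians-per-officer ratios, for illustration only
ratios = pd.Series([0.1, 0.2, 0.3, 0.5, 0.6, 0.8, 1.0, 1.5, 2.0])

# Bin edges are the 0.0, 0.33, 0.66 and 1.0 quantiles of the data itself;
# labels are applied in ascending order of the ratio
terciles = pd.qcut(ratios, [0.0, 0.33, 0.66, 1.0],
                   labels=["Low", "Middle", "High"])

print(pd.DataFrame({"ratio": ratios, "tercile": terciles}))
print(terciles.value_counts().sort_index())
```

Because the labels follow the ascending order of the bins, the first label always describes the smallest ratios.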

Now we can use these new category labels to visualize our data categorically.

In [37]:
# Re-apply the outlier filter so the filtered frame picks up the new category column
df_campus_filtered = df_campus[df_campus['Per Capita Law Enforcement Personnel']<0.005]
sns.lmplot(x="Per Capita Law Enforcement Personnel", y="Per Capita Aggregate Crime", 
           hue="Law Enforcement Civilian Officer Ratio",
           data=df_campus_filtered)
xlim([0.0,0.005])
ylim([0.00,0.05])
show()
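sns.lmplot draws one regression line per hue category. A rough numeric counterpart is to compute a least-squares slope within each group. The sketch below is an assumption-laden illustration rather than part of the original notebook: `slope_per_group` is a hypothetical helper and the demo data are made up.

```python
import numpy as np
import pandas as pd

def slope_per_group(df, x_col, y_col, group_col):
    """Least-squares slope of y on x within each group."""
    slopes = {}
    # observed=True skips empty categories when group_col is categorical
    for name, group in df.groupby(group_col, observed=True):
        slope, intercept = np.polyfit(group[x_col], group[y_col], 1)
        slopes[name] = slope
    return slopes

# Made-up numbers, for illustration only
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 5.0, 5.0, 5.0],
    "group": ["A", "A", "A", "B", "B", "B"],
})
slopes = slope_per_group(demo, "x", "y", "group")
print(slopes)
```

On the notebook's data, this could be called as `slope_per_group(df_campus_filtered, 'Per Capita Law Enforcement Personnel', 'Per Capita Aggregate Crime', 'Law Enforcement Civilian Officer Ratio')`. With only a handful of campuses per category, the slopes should be read as rough indications rather than firm estimates.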

There is not a strong dependence between the per capita crime rate and the per capita number of law enforcement personnel.

If we look at the total numbers (including the two outliers, UC San Francisco and UC Hastings), we see a general upward trend - more officers means more crimes - but this can be attributed to campus size: larger campuses have more people, and therefore both more officers and more crime.

In [38]:
sns.lmplot(x="Total law enforcement employees", y="Aggregate Crime", 
           hue="Law Enforcement Civilian Officer Ratio",
           data=df_campus)
show()
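The size effect described above can be illustrated with a small sketch on made-up campuses (none of these numbers come from the FBI data, and the column names are hypothetical). The per capita rates are chosen to be only weakly related, yet the totals correlate strongly simply because both scale with enrollment.

```python
import pandas as pd

# Made-up campuses, for illustration only; not taken from the FBI data
demo = pd.DataFrame({
    "enrollment": [1000, 5000, 10000, 20000, 40000],
    "officers_per_capita": [0.002, 0.001, 0.002, 0.001, 0.002],
    "crime_per_capita": [0.01, 0.01, 0.02, 0.02, 0.01],
})

# Totals scale with enrollment, so both grow with campus size
demo["officers_total"] = demo["enrollment"] * demo["officers_per_capita"]
demo["crime_total"] = demo["enrollment"] * demo["crime_per_capita"]

total_r = demo["officers_total"].corr(demo["crime_total"])
per_capita_r = demo["officers_per_capita"].corr(demo["crime_per_capita"])
print("correlation of totals:     %.2f" % total_r)
print("correlation of per capita: %.2f" % per_capita_r)
```

This is why the per capita plots, rather than the totals, are the fairer test of whether staffing levels and crime are related.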

Conclusions from Campus Data Analysis

An analysis of campus law enforcement and crime data showed a couple of trends we anticipated, and a couple we did not. Specifically, we confirmed a general positive relationship between total student enrollment and incidence of crime. However, this did not explain all of the data: there was quite a bit of variance around the enrollment versus incidence of crime trend line, and this variance can be attributed to several outliers. When we identified the campuses that were outliers, we found that they were all University of California campuses (which tended to have a higher overall incidence of crime). UC Berkeley and UCLA had the highest rates of crime (both total and per capita) of any campus, but UC Hastings and UC San Francisco, both small campuses located in a large metropolitan area, had unusually high per-capita rates of crime.

There also does not appear to be a relationship between the per-capita number of law enforcement employees and the per-capita incidence of crime, nor does the crime rate appear to depend on the makeup of the law enforcement agency (whether it is civilian-heavy or officer-heavy).

In some sense, these findings are no surprise - we would have expected, going in, that UC Hastings, located in a bad neighborhood in a big city, would have a higher rate of crime per student than other campuses. But the data did give us a breakdown between violent and property crime, which showed significant differences between crime at UC San Francisco and UC Hastings.

Perhaps the most important observation here is that the campus data sets are incomplete. While the size of the campus gives a strong indication of the level of crime, this approach leaves multiple outliers unexplained. Properly describing (or predicting) these outliers with a model requires data not contained in the data set - the crime rate of the surrounding neighborhood and city. The crime rates cannot be effectively described using campus size alone (too many outliers) or the number of law enforcement officers (no clear relationship).
