In this notebook, we'll be analyzing the law enforcement and crime statistics data about California from the FBI, specifically looking at campuses versus cities in California. The first thing we are going to do is define a bunch of functions to extract the data from the CSV files provided by the FBI via the Kaggle Dataset. These functions are also defined in the "Cleaning Up The Crime Scene" notebook; they're copied and pasted here without explanation so we can focus on analyzing the data.
We already covered the campus law enforcement data, specifically how to merge the two data files together into a master data set, so in this notebook we'll write a function to import and merge that information with a single call. We also explored some of its variables and added some useful derived quantities; we'll do the same for the city data in this notebook, and see if we can use cross-connections between the two data sets to put campus crime statistics into a broader context.
# must for data analysis
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib.pyplot import *
import seaborn as sns
# useful for data wrangling
import io, os, re, subprocess
# for sanity
from pprint import pprint
We'll start with a function to import data about all California campuses and stitch it together, as we did in the last notebook, "California Crime Compendium - Campuses". This consisted of two function calls to convert two files to two DataFrames, a regular expression to fix a few errant digits, and finally a call to pandas.merge() to stitch the two DataFrames together.
This required some delicate handling of the input files, in the ca_law_enforcement_by_campus() and ca_offenses_by_campus() functions, to strip whitespace in the proper spots and avoid mistakes like treating "Campus" as a different column than "Campus ".
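As a sketch of the kind of cleanup involved (the column names and numbers here are made up for illustration, not taken from the FBI files), stripping whitespace from headers is what makes the later merge line up:

```python
import pandas as pd

# Hypothetical frames whose key columns differ only by a trailing space
left = pd.DataFrame({'Campus ': ['Berkeley', 'Davis'], 'Officers': [90, 60]})
right = pd.DataFrame({'Campus': ['Berkeley', 'Davis'], 'Offenses': [400, 250]})

# Normalize the headers so 'Campus ' and 'Campus' become the same merge key
left.columns = left.columns.str.strip()
merged = pd.merge(left, right, on='Campus')
print(merged.shape)  # (2, 3)
```

Without the strip, pd.merge() would raise an error (or require an explicit left_on/right_on) because the two key columns would not share a name.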
Speaking of which - prior notebooks copied and pasted this functionality into a huge cell. It has since been abstracted away into the cali_crime module, so that we can better apply the principle of continuous improvement.
from cali_crime import *
def import_campus_df(data_directory):
    df_enforcement = ca_law_enforcement_by_campus(data_directory)
    df_offenses = ca_offenses_by_campus(data_directory)
    # Fix errant digits: strip a trailing digit from string-valued campus names
    mask = df_offenses['Campus'].map(lambda x: isinstance(x, str))
    df_offenses.loc[mask, 'Campus'] = df_offenses.loc[mask, 'Campus'].map(lambda x: re.sub(r'[0-9]$', '', x))
    df_campus = pd.merge(df_offenses, df_enforcement,
                         on=[df_enforcement.columns[0], df_enforcement.columns[1], df_enforcement.columns[2]])
    # Useful derived quantities
    df_campus['Per Capita Law Enforcement Personnel'] = df_campus['Total law enforcement employees']/df_campus['Student enrollment']
    df_campus['Law Enforcement Civilians Per Officer'] = df_campus['Total civilians']/df_campus['Total officers']
    df_campus['Aggregate Crime'] = df_campus['Violent crime'] + df_campus['Property crime'] + df_campus['Arson']
    df_campus['Per Capita Violent Crime'] = df_campus['Violent crime']/df_campus['Student enrollment']
    df_campus['Per Capita Property Crime'] = df_campus['Property crime']/df_campus['Student enrollment']
    df_campus['Per Capita Aggregate Crime'] = df_campus['Aggregate Crime']/df_campus['Student enrollment']
    df_campus['Aggregate Crime Per Officer'] = df_campus['Aggregate Crime']/df_campus['Total officers']
    df_campus['Violent Crime Per Officer'] = df_campus['Violent crime']/df_campus['Total officers']
    df_campus['Property Crime Per Officer'] = df_campus['Property crime']/df_campus['Total officers']
    return df_campus
df_campus = import_campus_df('data/')
Importing city data may require some additional cleanup, as the campus data did. (Edit: as it turns out, the criminal offense statistics had digits at the end of some city names, so the same regular expression cleaned those up.)
The city data set is significantly larger than the campus data set, so it should prove interesting. Let's see what we've got.
df_city_enforcement = ca_law_enforcement_by_city('data/')
df_city_offenses = ca_offenses_by_city('data/')
df_city_enforcement.head()
df_city_offenses.head()
print(len(df_city_offenses['City']))
# Fix more errant digits
df_city_offenses['City'] = df_city_offenses['City'].map(lambda x : re.sub(r'[0-9]$','',x))
df_city_offenses.head()
df_city = pd.merge(df_city_offenses, df_city_enforcement,
on=[df_city_enforcement.columns[0],df_city_enforcement.columns[1]])
df_city.head()
print(df_city['City'][df_city['Violent crime']==df_city['Violent crime'].max()])
print(df_city['City'][df_city['Property crime']==df_city['Property crime'].max()])
# Stuff it all in a function:
def import_city_df(data_directory):
    df_city_enforcement = ca_law_enforcement_by_city(data_directory)
    df_city_offenses = ca_offenses_by_city(data_directory)
    # Fix errant digits
    df_city_offenses['City'] = df_city_offenses['City'].map(lambda x : re.sub(r'[0-9]$','',x))
    df_city = pd.merge(df_city_offenses, df_city_enforcement,
                       on=[df_city_enforcement.columns[0],df_city_enforcement.columns[1]])
    # Useful derived quantities
    df_city['Aggregate Crime'] = df_city['Violent crime'] + df_city['Property crime']
    df_city['Per Capita Violent Crime'] = df_city['Violent crime']/df_city['Population']
    df_city['Per Capita Property Crime'] = df_city['Property crime']/df_city['Population']
    df_city['Per Capita Aggregate Crime'] = df_city['Aggregate Crime']/df_city['Population']
    df_city['Per Capita Law Enforcement Personnel'] = df_city['Total law enforcement employees']/df_city['Population']
    df_city['Aggregate Crime Per Officer'] = df_city['Aggregate Crime']/df_city['Total officers']
    df_city['Violent Crime Per Officer'] = df_city['Violent crime']/df_city['Total officers']
    df_city['Property Crime Per Officer'] = df_city['Property crime']/df_city['Total officers']
    return df_city
Let's examine the total amount of violent and property crime in cities. When looking at population data, you're looking across many, many scales - from hundreds to millions. Logs are essential when visualizing this kind of information. We'll take the log transform of each quantity, and plot its histogram.
fig, ax2 = subplots(1, figsize=(8,4))
sns.distplot(df_city['Population'][df_city['Population']>0].map(lambda x : np.log10(x)), label='Population', ax=ax2)
sns.distplot(df_city['Violent crime'][df_city['Violent crime']>0].map( lambda x : np.log10(x) ), label='Violent', ax=ax2)
sns.distplot(df_city['Property crime'][df_city['Property crime']>0].map( lambda x : np.log10(x) ), label='Property', ax=ax2)
ax2.set_xlabel('Log Number (Pwr Of 10)')
ax2.set_title('Log Population/Crime Rate')
ax2.legend()
show()
This is quite a remarkable figure, integrating an enormous amount of data about human behavior. Violent crime, property crime, and population all have nearly identical distributions, which indicates that property crime and violent crime generally follow population trends (with property crime occurring at a rate of $\frac{10^3}{10^5}$, or about 1 incident of property crime per 100 people per year, while violent crime occurs at a rate of $\frac{10^2}{10^5}$, or about 1 incident of violent crime per 1,000 people per year). We know from experience, of course, that those incidents are definitely not evenly distributed.
There are slight differences in the distributions of property crime (stronger tails at both higher and lower values) and violent crime (stronger tails at higher values) compared to the general population distribution. If we assume that violence is normally distributed (in the statistical, not the moral, sense), which is the most reasonable conclusion from the similarities in the distributions above, and occurs at roughly the rate given above in any given population of humans, these tail peaks indicate deviations from that "normal" behavior - not just in very small communities where the violent crime rate is zero, but in the accumulation of a lower violent crime rate across all of society (leading to a shift in the spread of the distribution).
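As a quick sanity check on the arithmetic behind those rates (the peak locations of roughly $10^5$, $10^3$, and $10^2$ are read off the histogram by eye):

```python
# Approximate histogram peaks (orders of magnitude, assumed from the figure)
population = 10**5      # typical city population
property_crime = 10**3  # typical property crime count
violent_crime = 10**2   # typical violent crime count

print(property_crime / population)  # 0.01  -> ~1 property crime per 100 people per year
print(violent_crime / population)   # 0.001 -> ~1 violent crime per 1,000 people per year
```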
We can also see this in quantile-quantile plots of these distributions. Outliers on the high end are shown on the right side of the quantile-quantile plot, while the outliers on the low end are shown on the left side of the plot.
import scipy.stats as stats
import statsmodels.api as sm
fig, (ax1,ax2) = subplots(1,2, figsize=(14,4))
vcqq = sm.qqplot(df_city['Violent crime'][df_city['Violent crime']>0].map( lambda x : np.log10(x) ),
fit=True, line='45', ax=ax1)
ax1.set_title('Quantile-Quantile Plot: Violent Crime')
pcqq = sm.qqplot(df_city['Property crime'][df_city['Property crime']>0].map( lambda x : np.log10(x) ),
fit=True, line='45', ax=ax2)
ax2.set_title('Quantile-Quantile Plot: Property Crime')
show()
We can see which cities correspond to these outlier behaviors (obviously, the point at the very top right is Los Angeles, which has five times as many reported violent crimes as the runner-up, San Francisco). We can pick out the cities that represent the high and low ends, the abnormal tails, of the violent crime and property crime distributions:
vcqq = stats.probplot(df_city['Violent crime'][df_city['Violent crime']>0].map( lambda x : np.log10(x) ), dist="norm")
# len: 327
vcqqx = vcqq[0][0]
vcqqy = vcqq[0][1]
cut1 = 15
cut2 = 310
plot(vcqqx[:cut1],vcqqy[:cut1],'o', color=sns.xkcd_rgb["dusty green"])
plot(vcqqx[cut1:cut2],vcqqy[cut1:cut2],'o', color=sns.xkcd_rgb["denim blue"])
plot(vcqqx[cut2:],vcqqy[cut2:],'o', color=sns.xkcd_rgb["watermelon"])
title('Violent Crime: Quantile-Quantile Color Coded for Outliers')
show()
The blue points all follow a 45-degree line pretty closely, which indicates they are normally distributed (the little dip at x = -1 is the bulge in the left side of the distribution of violent crime). The green and red sections deviate significantly from the 45-degree line, indicating an abnormally high or low incidence of violent crime for the human population sampled by this data, which is...
print("%0.3e" % df_city['Population'].sum())
26.7 million people.
The two slices of the DataFrame containing city data show the cities corresponding to these green and red points.
print("="*40)
print("Unusually high total incidence of violent crime:")
print(df_city[['City','Population','Violent crime']].sort_values('Violent crime',ascending=False)[:cut1])
There you go - your list of the cities inflating the tail at the top end of the violent crime distribution. These are the watermelon-colored points in the color-coded quantile-quantile plot, and the first few rows are the top cities for violent crime in California.
print("="*40)
print("Unusually low total incidence of violent crime:")
print(df_city[['City','Population','Violent crime']].sort_values('Violent crime',ascending=True)[:df_city.shape[0]-cut2])
And for those who prefer Pleasantville, USA, here is the other end of the spectrum. The cities listed above are the blue points in the color-coded quantile-quantile plot.
pcqq = stats.probplot(df_city['Property crime'][df_city['Property crime']>1].map( lambda x : np.log10(x) ), dist='norm')
pcqqx = pcqq[0][0]
pcqqy = pcqq[0][1]
cut1 = 6
cut2 = 312
plot(pcqqx[:cut1],pcqqy[:cut1],'o', color=sns.xkcd_rgb["dusty green"])
plot(pcqqx[cut1:cut2],pcqqy[cut1:cut2],'o', color=sns.xkcd_rgb["denim blue"])
plot(pcqqx[cut2:],pcqqy[cut2:],'o', color=sns.xkcd_rgb["watermelon"])
title('Property Crime: Quantile-Quantile Color Coded for Outliers')
show()
print("="*40)
print("Unusually high total incidence of property crime:")
print(df_city[['City','Population','Property crime']].sort_values('Property crime',ascending=False)[:cut1])
San Jose jumps up and Stockton drops down when switching from violent crime to property crime.
print("="*40)
print("Unusually low total incidence of property crime:")
print(df_city[['City','Population','Property crime']].sort_values('Property crime',ascending=True)[:df_city.shape[0]-cut2])
The campus data set showed that switching to a per capita perspective can give a dramatically different view of the data, so let's see how that looks.
# Useful derived quantities
df_city['Aggregate Crime'] = df_city['Violent crime'] + df_city['Property crime']
df_city['Per Capita Violent Crime'] = df_city['Violent crime']/df_city['Population']
df_city['Per Capita Property Crime'] = df_city['Property crime']/df_city['Population']
df_city['Per Capita Aggregate Crime'] = df_city['Aggregate Crime']/df_city['Population']
df_city['Per Capita Law Enforcement Personnel'] = (df_city['Total law enforcement employees'])/(df_city['Population'])
df_city['Aggregate Crime Per Officer'] = (df_city['Aggregate Crime'])/(df_city['Total officers'])
df_city['Violent Crime Per Officer'] = (df_city['Violent crime'])/(df_city['Total officers'])
df_city['Property Crime Per Officer'] = (df_city['Property crime'])/(df_city['Total officers'])
fig, ax1 = subplots(1, figsize=(8,4))
sns.distplot(df_city['Per Capita Violent Crime'][df_city['Violent crime']>0].map( lambda x : np.log10(x) ), label='Violent', ax=ax1)
sns.distplot(df_city['Per Capita Property Crime'][df_city['Property crime']>0].map( lambda x : np.log10(x) ), label='Property', ax=ax1)
ax1.set_xlabel('Log Number (Pwr Of 10)')
ax1.set_title('Log Per Capita Crime Rate')
ax1.legend()
show()
Indeed, we get an interesting view of the data here. The shorter, fatter distribution of per capita violent crime indicates a lower overall rate of occurrence on average, and that it occurs more in bunches, as suggested by the large variance (wide distribution). This follows the old saying: "violence begets violence."
Property crime, on the other hand, is narrower, with small peaks in the tails. These peaks are likely industrial districts, where already high per capita crime rates are inflated by a low number of residents. The narrower distribution (smaller variance) means that property crime occurs at around the same rate everywhere, and the higher mean (distribution shifted to the right) means property crime occurs across society at some baseline rate (again, on the log scale above, -2 means $10^{-2}$ or 1/100, which means 1 incident per 100 people).
vcqq = stats.probplot(df_city['Per Capita Violent Crime'][df_city['Violent crime']>0].map( lambda x : np.log10(x) ), dist="norm")
# len: 327
vcqqx = vcqq[0][0]
vcqqy = vcqq[0][1]
cut1 = 3
cut2 = 323
plot(vcqqx[:cut1],vcqqy[:cut1],'o', color=sns.xkcd_rgb["dusty green"])
plot(vcqqx[cut1:cut2],vcqqy[cut1:cut2],'o', color=sns.xkcd_rgb["denim blue"])
plot(vcqqx[cut2:],vcqqy[cut2:],'o', color=sns.xkcd_rgb["watermelon"])
title('Per Capita Violent Crime: Quantile-Quantile Color Coded for Outliers')
show()
print("="*40)
print("Unusually high per capita incidence of violent crime:")
print(df_city[['City','Population','Per Capita Violent Crime']].sort_values('Per Capita Violent Crime',ascending=False)[:cut1])
print("="*40)
print("Unusually low per capita incidence of violent crime:")
print(df_city[['City','Population','Per Capita Violent Crime']].sort_values('Per Capita Violent Crime',ascending=True)[:df_city.shape[0]-cut2])
Let's return to the campus data that we analyzed in the "California Crime Compendium - Campuses" notebook. We concluded that the data set didn't have enough information to adequately predict levels of crime, because we didn't have crime statistics for the surrounding metropolitan areas. Let's find those stats and pair them up.
df_campus.head()
We'll start with the "Campus" field, and see how many of them match cities in the city crime stats DataFrame:
z = pd.DataFrame([])
for (i, row) in df_campus.iterrows():
    if pd.notnull(row['Campus']):
        city = row['Campus']
        if city in df_city['City'].values:
            z = pd.concat([ z, df_city[df_city['City']==city] ])
print(len(z))
print(z['City'])
A good start: we can extract crime stats for these 14 campus/city combinations automatically. We can go back and improve the matches later by breaking the university names into tokens and searching for matching tokens, or by manually tagging the city of different campuses by hand.
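As a rough sketch of what that token-matching improvement could look like (the campus names and city set below are illustrative, not taken from the actual DataFrames):

```python
# Known city names (illustrative subset)
known_cities = {'Berkeley', 'Davis', 'Fresno', 'Pomona'}

def match_city(campus_name, cities=known_cities):
    """Return the first word of a campus name that is also a known city, else None."""
    tokens = campus_name.replace(',', ' ').split()
    for token in tokens:
        if token in cities:
            return token
    return None

print(match_city('Davis'))            # 'Davis'
print(match_city('Fresno State'))     # 'Fresno'
print(match_city('Cal Poly Pomona'))  # 'Pomona'
print(match_city('Humboldt'))         # None
```

This simple single-token approach would still miss multi-word cities like "Santa Barbara", which is part of why manual tagging remains the fallback.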
Let's extract what we're interested in, and visualize it:
campus_matches = pd.DataFrame([])
for (i, row) in df_campus.iterrows():
    if pd.notnull(row['Campus']):
        city = row['Campus']
        if city in df_city['City'].values:
            # Merge this campus row with its matching city row
            ff = row.to_frame().transpose()
            merged = pd.merge( ff, df_city[df_city['City']==city], left_on='Campus', right_on='City', suffixes=(' campus',' city'))
            campus_matches = pd.concat([ campus_matches, merged ])
print(len(campus_matches))
print(campus_matches.columns)
print(campus_matches[['University/College','Campus']])
We have 14 schools for which we also have a matching city. Note that these will be biased, since they consist primarily of University of California schools, and we saw in a prior notebook that University of California schools have a stronger trend between campus size and the incidence of crime. We'll manually tag each university and look at the whole set shortly. But for now, we'll look at joint campus and city crime statistics, and see how the safety of a campus compares to the safety of the surrounding city.
g = sns.jointplot(x="Per Capita Aggregate Crime campus",y="Per Capita Aggregate Crime city",data=campus_matches)
UCSF shows up again as a huge outlier on the per capita aggregate crime occurring on campus versus in the surrounding city. Filtering out that value, we can examine the rest of the distribution more closely:
filtered_campus_matches = campus_matches.sort_values('Per Capita Aggregate Crime campus',ascending=False)[1:]
g = sns.jointplot(x="Per Capita Aggregate Crime campus",y="Per Capita Aggregate Crime city",data=filtered_campus_matches)
fig = figure(figsize=(10,4))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.plot(filtered_campus_matches['Per Capita Aggregate Crime campus'],
filtered_campus_matches['Per Capita Aggregate Crime city'],
'o')
ax1.set_xlabel('Per Capita Crime Rate, Campus')
ax1.set_ylabel('Per Capita Crime Rate, City')
ax1.set_title('City vs Campus Crime Rates')
ax2.plot(campus_matches['Per Capita Aggregate Crime campus'],
campus_matches['Per Capita Aggregate Crime city'],
'o')
ax2.set_xlabel('Per Capita Crime Rate, Campus')
ax2.set_ylabel('Per Capita Crime Rate, City')
ax2.set_title('City vs Campus Crime Rates, Extended Axis')
show()
The plot above seems to be clustered into four major groups: high campus crime/low city crime (3 schools, bottom right), moderate campus crime/moderate city crime (7 schools, center), low campus crime/high city crime (2 schools, upper left), and high campus crime/high city crime (2 schools, upper right).
Our one outlier is, you guessed it, UCSF: it's a small campus in a large city, but suffers from lots of crime. And the city is contained in its name, so it made it into our data set. The per capita incidence of crime is 7 times higher at UCSF than at UCLA or UC Berkeley.
fields = ['University/College','Campus','Per Capita Aggregate Crime campus']
campus_matches[fields].sort_values(fields[2],ascending=False)[:3]
If we throw population into the mix with a scatter plot, we see some other interesting trends. Starting with student population (log) versus per capita crime rate, we can see that, with the exception of UCSF, crime on campus occurs at a much lower rate than in the surrounding cities. Start by adding some quantities to the DataFrame:
# city to campus crime ratio:
# large means city is more dangerous than campus.
# small is very bad.
campus_matches['City to Campus Per Capita Crime Ratio'] = campus_matches['Per Capita Aggregate Crime city']/campus_matches['Per Capita Aggregate Crime campus']
campus_matches['City to Campus Per Capita Violent Crime Ratio'] = campus_matches['Per Capita Violent Crime city']/campus_matches['Per Capita Violent Crime campus']
campus_matches['City to Campus Per Capita Property Crime Ratio'] = campus_matches['Per Capita Property Crime city']/campus_matches['Per Capita Property Crime campus']
# city to campus population ratio:
# larger means, big city and small campus. UCSF.
# small means, small city and big campus. Davis.
campus_matches['City to Campus Population Ratio'] = campus_matches['Population']/campus_matches['Student enrollment']
campus_matches['City to Campus Log Population Ratio'] = campus_matches['City to Campus Population Ratio'].map(lambda x : np.log10(x))
Now we'll build three scatterplots: the first compares student enrollment with the per capita crime rates on campus and in the surrounding city, and the other two show the city:campus ratio of population versus the city:campus ratio of per capita crime rates. Together these will help identify patterns between crime on campus and crime in the surrounding area.
f, (ax1,ax2,ax3) = subplots(1,3, figsize=(14,4))
# ----------
# plot 1
# Student enrollment vs per capita crime rate (city and campus)
ax1.semilogx(campus_matches['Student enrollment'],campus_matches['Per Capita Aggregate Crime city'],'o',label='City')
ax1.semilogx(campus_matches['Student enrollment'],campus_matches['Per Capita Aggregate Crime campus'],'o',label='Campus')
ax1.legend()
ax1.set_xlabel('Student Enrollment')
ax1.set_ylabel('Per Capita Crime Rate')
ax1.set_title('Population vs. Aggregate Crime Rate')
# -----------
# plot 2
# City:campus population ratio vs City:campus per capita crime ratio
# clipped x axis
ax2.semilogy(campus_matches['City to Campus Population Ratio'],
campus_matches['City to Campus Per Capita Crime Ratio'],
'o', color=sns.xkcd_rgb['muted pink'])
#ax2.semilogy(campus_matches['City to Campus Population Ratio'],
# campus_matches['City to Campus Per Capita Violent Crime Ratio'],
# 'o', color=sns.xkcd_rgb['muted blue'])
ax2.set_xlabel('City:Campus Ratio: Population')
ax2.set_ylabel('City:Campus Ratio: Per Capita Aggregate Crime')
ax2.set_xlim([0,50])
ax2.set_title('Population Ratio vs. Aggregate Crime Rate Ratio')
# ------------
# plot 3
# City:campus population ratio vs City:campus per capita crime ratio
ax3.semilogy(campus_matches['City to Campus Population Ratio'][campus_matches['University/College']!='University of California'],
             campus_matches['City to Campus Per Capita Crime Ratio'][campus_matches['University/College']!='University of California'],
             'o', color=sns.xkcd_rgb['faded red'], label='Non-UC')
ax3.semilogy(campus_matches['City to Campus Population Ratio'][campus_matches['University/College']=='University of California'],
             campus_matches['City to Campus Per Capita Crime Ratio'][campus_matches['University/College']=='University of California'],
             'o', color=sns.xkcd_rgb['faded purple'], label='Univ of Calif')
ax3.legend(loc='best')
ax3.set_xlabel('City:Campus Ratio: Population')
ax3.set_ylabel('City:Campus Ratio: Per Capita Aggregate Crime')
ax3.set_title('Population Ratio vs. Crime Rate Ratio, Extended')
show()
The middle and right plots show the same quantity, but the plot on the far right is color-coded for University of California schools and plots an extended range on the x-axis.
Both plots show UCSF as a huge outlier - a high crime rate and a small campus population. (We don't see UC Hastings because it wasn't matched up with San Francisco. We'll address that shortly.)
Let's look at the first plot, on the left: the city and campus per capita crime rates. It shows an interesting pattern: if we set aside UCSF as an outlier, the two categories of data trend in opposite directions. The blue points, the per capita crime rate in the surrounding city, start very high and gradually decrease with campus size. The green points, the per capita crime rate on campus, start very low and gradually increase with campus size.
This curious result suggests that increasing population has opposite effects on crime rates for college populations and city-dwellers. It's important to note that the cities here are not cities in California generally - they are specifically cities with colleges and universities. As a college/university city grows, the per capita frequency of crime tends to decrease, while the opposite happens at colleges and universities themselves: as their populations increase, the frequency of crime increases.
The second plot, in the middle, and the last plot, on the right, illustrate the same thing with different axes. These plots show a remarkable pattern in the nature of crime at colleges and universities, but should be read carefully to interpret properly. The x-axis shows the population ratio (city to campus), and the y-axis shows the city-to-campus ratio of crime; thus an x-value of 10 means the city is 10 times larger than the campus. The schools with low values here are UC Davis, UC Santa Cruz, and UC Santa Barbara, which are large schools relative to their surrounding environs. UC Berkeley also has a small value, though it is obviously in a somewhat different situation than UC Davis.
fields = ['University/College','Campus','City to Campus Population Ratio']
campus_matches[fields].sort_values(fields[2],ascending=True)[:4]
The y-axis shows the ratio of the per capita incidence of crime in the city to the per capita incidence of crime on campus, on a log scale. A low value (less than 1) means there is a higher incidence of crime per capita on campus than in the surrounding city, while a high value (greater than 1) means the opposite. The first notable fact is that all of the points except UCSF lie above $10^0$ - or 1 - meaning that, with the exception of UCSF, campuses are safer than the cities that surround them, across the board.
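To make the interpretation concrete, here is a toy calculation with made-up per capita rates (not values from the data):

```python
# Hypothetical per capita rates: crimes per person per year
city_rate = 0.04
campus_rate = 0.02

# Typical campus: city rate exceeds campus rate, ratio above 1 -> campus is safer
safer_campus = city_rate / campus_rate   # 2.0

# UCSF-like case: campus rate exceeds city rate, ratio below 1 -> campus is more dangerous
ucsf_like = campus_rate / 0.04           # 0.5

print(safer_campus, ucsf_like)
```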
Here's UCSF, the only campus with a per capita incidence of crime higher than the surrounding city:
fields = ['University/College','Campus','City to Campus Per Capita Crime Ratio']
campus_matches[fields].sort_values(fields[2],ascending=True)[:4]
This graph has the shape of an inverted V, which is significant. The V starts around 1: when a campus makes up 25% or more of a city's population, the crime rate on campus is within an order of magnitude (1-10 times smaller) of the crime rate in the surrounding city.
The steady increase in this graph means that as the surrounding city gets larger relative to the campus, the crime rates in the city also get larger relative to the crime rates on campus. This trend indicates campuses are (relatively) safer than the surrounding city.
This trend stops abruptly when the city becomes more than about 10 times the size of the campus - and immediately reverses: as the cities surrounding colleges and universities increase in size relative to the campus, the crime rates on campus rise to match the level of the surrounding city.
The two big outliers on the right are also surprising - particularly the bottom right point. These are campuses in large cities (Los Angeles) where total crime is extremely high. The ratio of per capita crime in the surrounding city to per capita crime on campus is close to 1 for the bottom right point (closer to 10 for the upper point), indicating a campus where the high rate of crime in the surrounding city pervades the campus as well.
Now that we've compared the ratio of city to campus populations and crime rates, let's split up the type of crime into violent crime and property crime.
f, (ax1,ax2) = subplots(1,2, figsize=(10,4))
# -----------
# plot 2
# City:campus population ratio vs City:campus per capita crime ratio
# clipped x axis
ax1.semilogy(campus_matches['City to Campus Population Ratio'],
campus_matches['City to Campus Per Capita Crime Ratio'],
'o', color=sns.xkcd_rgb['muted purple'],
label='Aggregate')
ax1.semilogy(campus_matches['City to Campus Population Ratio'],
campus_matches['City to Campus Per Capita Violent Crime Ratio'],
'o', color=sns.xkcd_rgb['muted pink'],
label='Violent')
ax1.semilogy(campus_matches['City to Campus Population Ratio'],
campus_matches['City to Campus Per Capita Property Crime Ratio'],
'o', color=sns.xkcd_rgb['muted blue'],
label='Property')
ax1.set_xlim([0,50])
ax2.semilogy(campus_matches['City to Campus Population Ratio'],
campus_matches['City to Campus Per Capita Crime Ratio'],
'o', color=sns.xkcd_rgb['muted purple'],
label='Aggregate')
ax2.semilogy(campus_matches['City to Campus Population Ratio'],
campus_matches['City to Campus Per Capita Violent Crime Ratio'],
'o', color=sns.xkcd_rgb['muted pink'],
label='Violent')
ax2.semilogy(campus_matches['City to Campus Population Ratio'],
campus_matches['City to Campus Per Capita Property Crime Ratio'],
'o', color=sns.xkcd_rgb['muted blue'],
label='Property')
ax1.legend(loc='best')
ax1.set_xlabel('City:Campus Ratio: Population')
ax1.set_ylabel('City:Campus Ratio: Per Capita Crime')
ax1.set_title('Population Ratio vs. Crime Rate Ratio')
ax2.legend(loc='best')
ax2.set_xlabel('City:Campus Ratio: Population')
ax2.set_ylabel('City:Campus Ratio: Per Capita Crime')
ax2.set_title('(Same Plot, Extended Axis)')
show()
If we look at violent crime and property crime separately, we see a different trend - violent crime happens less frequently on campus (the pink points tend higher, meaning the ratio of violent crime in surrounding cities to violent crime on campus is higher).
fields = ['University/College','Campus','City to Campus Per Capita Violent Crime Ratio']
campus_matches[fields].sort_values(fields[2],ascending=True)[:4]
fields = ['University/College','Campus','City to Campus Per Capita Property Crime Ratio']
campus_matches[fields].sort_values(fields[2],ascending=True)[:4]
But remember, we're only looking at a subset of 14 schools, which we successfully matched to cities. Next we'll hard-code city locations and look at the broader set of data, using the same plots, and see if the same patterns emerge.
If we manually enter the name of the surrounding (or nearest) city for each college or university, we can look at a more complete set of joint campus and city statistics. We'll fill in some of the details linking cities to campuses that the original data set is missing - some sleuthing with a map and an encyclopedia is enough to complete the task. (Note: I also wrote a short Python script that guessed at the Wikipedia page for each college or university, and used urllib2 and regular expressions to search the text of each page for the campus's location.)
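That script isn't reproduced here, but the core of the idea might look like this sketch (the page text and the "located in ..., California" pattern are hypothetical stand-ins, with no network access):

```python
import re

# Stand-in for the downloaded text of a school's Wikipedia page
page_text = "The university's main campus is located in Arcata, California."

# Pull out the city from a "located in <City>, California" phrase
m = re.search(r'located in ([A-Z][A-Za-z .]*), California', page_text)
city = m.group(1) if m else None
print(city)  # 'Arcata'
```

Real pages are far messier than this, which is why manual tagging ended up being the reliable route.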
df_campus[['University/College','Campus']].head()
list_o_cities = ["Santa Barbara",
"Pomona",
"Bakersfield",
"Santa Barbara",
"Los Angeles",
"Hayward",
"Fresno",
"Santa Cruz",
"Los Angeles",
"Sacramento",
"San Bernardino",
"San Diego",
"Turlock",
"Fresno",
"San Pablo",
"Los Angeles",
"Sunnyvale",
"Arcata",
"San Rafael",
"Riverside",
"San Bernardino",
"San Diego",
"San Francisco",
"Cotati",
"Fresno",
"Berkeley",
"Davis",
"San Francisco",
"Los Angeles",
"Merced",
"Riverside",
"San Diego",
"San Francisco",
"Santa Barbara",
"Santa Cruz",
"Ventura"
]
s = pd.Series(list_o_cities)
df_campus['City'] = s
# Check if any cities are not found in the list of cities we have crime stats on
# (should output nothing)
for i in df_campus['City'].values:
    if i not in df_city['City'].values:
        print(i)
new_campus_matches = pd.DataFrame([])
for (i, row) in df_campus.iterrows():
    # Merge this campus row with its matching city row
    ff = row.to_frame().transpose()
    city = row['City']
    merged = pd.merge( ff, df_city[df_city['City']==city], on='City',
                       suffixes=(' campus',' city'))
    new_campus_matches = pd.concat([ new_campus_matches, merged ])
# Now we have a big list of joint campus/city statistics:
print(len(new_campus_matches))
new_campus_matches.head()
We've just expanded the joint campus-city data set from 14 data points to 36. This is a big jump! It will be useful to repeat some of the same plots on the more complete data set:
g = sns.jointplot(x="Per Capita Aggregate Crime campus",y="Per Capita Aggregate Crime city",data=new_campus_matches)
Filtering out the outliers (UCSF and UC Hastings), we can get a better picture of the distribution:
filtered_new_campus_matches = new_campus_matches[new_campus_matches['Per Capita Aggregate Crime campus']<0.025]
g = sns.jointplot(x="Per Capita Aggregate Crime campus",y="Per Capita Aggregate Crime city",data=filtered_new_campus_matches)
The four clusters that we identified earlier still seem to be there: upper right (2 points), upper left (4 points), middle (most points), lower right (5 points). This time we have two outliers: UC Hastings and UCSF, so those form a fifth cluster. Here is a scatterplot showing the same data as above, but without the marginals:
fig = figure(figsize=(10,4))
# Left panel: x-axis clipped to exclude the outliers
ax1 = fig.add_subplot(121)
ax1.plot(new_campus_matches['Per Capita Aggregate Crime campus'],
         new_campus_matches['Per Capita Aggregate Crime city'],
         'o')
ax1.set_xlabel('Per Capita Crime Rate, Campus')
ax1.set_ylabel('Per Capita Crime Rate, City')
ax1.set_xlim([0.0,0.02])
# Right panel: full range, outliers included
ax2 = fig.add_subplot(122)
ax2.plot(new_campus_matches['Per Capita Aggregate Crime campus'],
         new_campus_matches['Per Capita Aggregate Crime city'],
         'o')
ax2.set_xlabel('Per Capita Crime Rate, Campus')
ax2.set_ylabel('Per Capita Crime Rate, City')
show()
Again we see the two outliers, UCSF and UC Hastings, head and shoulders above the other campuses in terms of per capita crime rate, followed by the two campuses with the highest total number of crimes:
fields = ['University/College','Campus','Per Capita Aggregate Crime campus']
new_campus_matches[fields].sort_values(fields[2],ascending=False)[:4]
Looking at distributions of populations, distributions of crime, and the absolute versus per capita crime rates all reveal interesting patterns, but to put law enforcement and crime on campus into proper perspective with the surrounding city, ratios are more useful.
Here these ratios are recomputed, since the data has been re-loaded and re-joined.
# City-to-campus crime ratio:
#   large means the city is more dangerous than the campus;
#   small means the campus is more dangerous than the city - very bad.
# (The +0.0001 guards against division by zero for campuses reporting no crime.)
new_campus_matches['City to Campus Per Capita Crime Ratio'] = new_campus_matches['Per Capita Aggregate Crime city']/(new_campus_matches['Per Capita Aggregate Crime campus']+0.0001)
new_campus_matches['City to Campus Per Capita Violent Crime Ratio'] = new_campus_matches['Per Capita Violent Crime city']/(new_campus_matches['Per Capita Violent Crime campus']+0.0001)
new_campus_matches['City to Campus Per Capita Property Crime Ratio'] = new_campus_matches['Per Capita Property Crime city']/(new_campus_matches['Per Capita Property Crime campus']+0.0001)
# City-to-campus population ratio:
#   large means a big city and a small campus (UCSF);
#   small means a small city and a big campus (Davis).
new_campus_matches['City to Campus Population Ratio'] = new_campus_matches['Population']/new_campus_matches['Student enrollment']
new_campus_matches['City to Campus Log Population Ratio'] = new_campus_matches['City to Campus Population Ratio'].map(lambda x : np.log10(x))
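The +0.0001 added to each denominator above is a simple guard against division by zero for campuses that reported no crime, at the cost of slightly biasing every ratio. One alternative, sketched here on toy numbers (not the real data), is to divide only where the denominator is nonzero and mark the rest as NaN:

```python
import numpy as np

city = np.array([0.02, 0.03, 0.01])
campus = np.array([0.01, 0.0, 0.005])  # one campus reported zero crime

# Divide only where the denominator is nonzero; leave the rest as NaN
ratio = np.divide(city, campus,
                  out=np.full_like(city, np.nan),
                  where=campus != 0)
print(ratio)  # [2. nan 2.]
```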
We've finished a good portion of our analysis using derived quantities that are city-to-campus ratios of various quantities. However, while these ratios themselves are useful, they can be made more intuitive by turning them into campus-to-city ratios. Here we'll repeat some of our derived quantity calculations, but inverting the ratio to be campus-to-city.
# Campus-to-city crime ratio:
#   large means the campus is more dangerous than the city - very bad;
#   small means the campus is safer than the city.
new_campus_matches['Campus to City Per Capita Crime Ratio'] = new_campus_matches['Per Capita Aggregate Crime campus']/new_campus_matches['Per Capita Aggregate Crime city']
new_campus_matches['Campus to City Per Capita Violent Crime Ratio'] = new_campus_matches['Per Capita Violent Crime campus']/new_campus_matches['Per Capita Violent Crime city']
new_campus_matches['Campus to City Per Capita Property Crime Ratio'] = new_campus_matches['Per Capita Property Crime campus']/new_campus_matches['Per Capita Property Crime city']
# Campus-to-city population ratio:
#   small means a small campus in a big city (UCSF);
#   large means a big campus in a small city (Davis).
new_campus_matches['Campus to City Population Ratio'] = new_campus_matches['Student enrollment']/new_campus_matches['Population']
new_campus_matches['Campus to City Log Population Ratio'] = new_campus_matches['Campus to City Population Ratio'].map(lambda x : np.log10(x))
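One handy property of the log ratio: inverting the underlying ratio just flips its sign, so the campus-to-city plots are mirror images of the city-to-campus ones about zero. A quick numeric check:

```python
import numpy as np

pop_ratio = 0.1  # campus is 10% of the surrounding city
# log10 of the inverse ratio is the negative of log10 of the ratio
assert np.isclose(np.log10(pop_ratio), -np.log10(1.0 / pop_ratio))
print(np.log10(pop_ratio))  # -1.0
```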
Now the plot we were looking at before takes on a more intuitive look: per capita crime on campus approaches the surrounding city's rate when the campus is either very large or very small relative to the city, and the drop in campus crime in between now appears as a dip in the middle of the graph.
f, (ax1,ax2) = subplots(1,2, figsize=(10,4))
# -----------
# plot 1
ax1.semilogy(new_campus_matches['Campus to City Log Population Ratio'],
new_campus_matches['Campus to City Per Capita Crime Ratio'],
'o', color=sns.xkcd_rgb['muted purple'],
label='Aggregate')
ax1.semilogy(new_campus_matches['Campus to City Log Population Ratio'],
new_campus_matches['Campus to City Per Capita Violent Crime Ratio'],
'o', color=sns.xkcd_rgb['muted pink'],
label='Violent')
ax1.semilogy(new_campus_matches['Campus to City Log Population Ratio'],
new_campus_matches['Campus to City Per Capita Property Crime Ratio'],
'o', color=sns.xkcd_rgb['muted blue'],
label='Property')
ax1.set_xlabel('Campus:City Log Ratio: Population')
ax1.set_ylabel('Campus:City Ratio: Per Capita Crime')
ax1.set_title('Population Ratio vs. Crime Rate Ratio')
ax1.legend(loc='best')
# -----------
# plot 2
# Campus:city log population ratio vs. campus:city per capita
# aggregate crime ratio
ax2.semilogy(new_campus_matches['Campus to City Log Population Ratio'],
new_campus_matches['Campus to City Per Capita Crime Ratio'],
'o', color=sns.xkcd_rgb['muted pink'])
ax2.set_xlabel('Campus:City Log Ratio: Population')
ax2.set_ylabel('Campus:City Ratio: Per Capita Aggregate Crime')
ax2.set_title('Population Ratio vs. Aggregate Crime Rate Ratio')
show()
Again, flipping this ratio helps with interpreting these plots - violent crime is lower on campuses than in the surrounding city, which shows up in the figure as the pink (violent) points sitting below the blue (property) and purple (aggregate) points.
We identified the interesting pattern in the campus-to-city ratio of per capita crime versus the campus-to-city ratio of population: if the population of the campus composes about 10% of the surrounding population, there is a "sweet spot" in terms of having lower crime rates on campus than in the surrounding population. When the campus composes more than 10% of the surrounding population, the rate of crime on campus gradually goes up as the campus accounts for more of the surrounding population. When the campus composes less than 10% of the surrounding population, the pattern is similar: rate of crime on campus goes up as the campus gets smaller relative to the surrounding area.
If we split campuses into two bins, campuses composing more or less than 10% of the surrounding population, we can perform a linear regression on the data to quantify differences in these trends:
name_fields = ['University/College','Campus','City']
fields = ['Campus to City Log Population Ratio','Campus to City Per Capita Crime Ratio']
# .copy() avoids SettingWithCopyWarning when adding the label column
low = new_campus_matches.loc[new_campus_matches[fields[0]] < -1.0, name_fields+fields].copy()
low['Campus-City Ratio'] = 'Campus < 10% City'
low = low.sort_values(fields[0],ascending=True)
hi = new_campus_matches.loc[new_campus_matches[fields[0]] > -1.0, name_fields+fields].copy()
hi['Campus-City Ratio'] = 'Campus > 10% City'
hi = hi.sort_values(fields[0],ascending=True)
both = pd.concat([low,hi])
both = both.sort_values(fields[0],ascending=True)
sns.lmplot(data=both, x=fields[0], y=fields[1],hue='Campus-City Ratio')
show()
Let's look at one more thing before we wrap up: we'll examine some residuals to identify outliers, and see if we can figure out what the pattern is (represented by the regression line), and what the exceptions are (the causes of substantial deviation from the regression line).
Seaborn's lmplot doesn't hand back the linear model it fits in a usable form, so we'll build our own first, using statsmodels.
import statsmodels.api as sm
# ----------
# Build the linear model for
# low campus pop/city pop ratios
# ensure the y values are float64 (they can come through as Python objects)
lowx_names = low[name_fields+fields].copy()
lowx = low[fields[0]].values
lowy = low[fields[1]].astype(np.float64).values
# add_constant prepends an intercept column to the design matrix
lowx = sm.add_constant(lowx)
lowlm = sm.OLS(lowy, lowx).fit()
# ----------
# Build the linear model for
# hi campus pop/city pop ratios
hix_names = hi[name_fields+fields].copy()
hix = hi[fields[0]].values
hiy = hi[fields[1]].astype(np.float64).values
hix = sm.add_constant(hix)
hilm = sm.OLS(hiy, hix).fit()
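For intuition, the add_constant/OLS pattern above is just ordinary least squares on a design matrix with an intercept column; the same fit can be reproduced with numpy alone. A sketch on synthetic data (the coefficients 0.5 and 2.0 are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 1, 30)
y = 0.5 + 2.0 * x + 0.01 * rng.standard_normal(30)

# Design matrix with an intercept column, like sm.add_constant(x)
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef
print(round(intercept, 2), round(slope, 2))  # recovers ~0.5 and ~2.0
```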
If we examine these models in more detail, we can see they aren't doing too hot, mainly because of a small number of major outliers - likely schools with special situations that we can capture with another variable. We'll drill into this model to find out how to do that.
lowlm.summary()
hilm.summary()
We'll visualize the points being regressed and the regression line, and then visualize the residuals, which will highlight the outliers:
# Create a grid of x values for the regression line (xprime, yprime)
lowxprime = np.linspace(-3, 1, 100)
lowxprime = sm.add_constant(lowxprime)
lowyprime = lowlm.predict(lowxprime)
# Residuals at the original x values (lowx already includes the constant)
lowresid = lowy - lowlm.predict(lowx)
# Same grid and residuals for the high-ratio model
hixprime = np.linspace(-3, 1, 100)
hixprime = sm.add_constant(hixprime)
hiyprime = hilm.predict(hixprime)
hiresid = hiy - hilm.predict(hix)
# Oh, and colors too
colorz = ['dusty blue','dusty green']
colors = [sns.xkcd_rgb[z] for z in colorz]
# Now visualize everything we've assembled:
fig = figure(figsize=(6,4))
ax1 = fig.add_subplot(111)
ax1.plot(lowx[:,1],lowy,'o', color=colors[0])
ax1.plot(lowxprime[:,1], lowyprime,'-', color=colors[0])
ax1.plot(hix[:,1],hiy,'o', color=colors[1])
ax1.plot(hixprime[:,1], hiyprime,'-', color=colors[1])
ax1.set_xlabel('Campus to City Log Population Ratio')
ax1.set_ylabel('Campus to City Per Capita Crime Rate Ratio')
show()
At this point the lowlm
object represents the blue linear model shown above, while the hilm
object represents the green linear model.
The linear models fit to these data indicate trends in how crime changes as a function of the size of the campus and the size of the surrounding population. We can also quantify residuals from the models to determine how accurately they predict our data, find the outliers, and improve the models by parameterizing the crime rate with other variables (which would account for some of the variability and the deviations from the models' predictions).
We have residuals, which are key to doing this. We'll make a quantile plot of the residuals, using the probplot function from the scipy.stats
package.
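For reference, probplot returns the theoretical quantiles, the ordered data, and a least-squares fit of one against the other; for normally distributed data the points hug a straight line. A sketch on synthetic residuals (the data here are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
fake_resid = rng.standard_normal(50)

(osm, osr), (slope, intercept, r) = stats.probplot(fake_resid, dist='norm')
# r is the correlation of the Q-Q points; near 1 means "close to normal"
print(round(r, 3))
```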
from scipy import stats

# More colors
quantile_colorz = ['dusty green','denim blue','watermelon']
quantile_colors = [sns.xkcd_rgb[z] for z in quantile_colorz]
# Get the low side linear model residual quantiles (UCSF/UC Hastings will be the huge outliers)
lowqq = stats.probplot( lowresid, dist='norm' )
lowqqx = lowqq[0][0]
lowqqy = lowqq[0][1]
# Get the hi side linear model residual quantiles
hiqq = stats.probplot( hiresid, dist='norm' )
hiqqx = hiqq[0][0]
hiqqy = hiqq[0][1]
# The cuts are where colors change (used to highlight outliers)
fig = figure(figsize=(6,4))
ax = fig.add_subplot(111)
cut1 = 0
cut2 = 21
ax.plot(lowqqx[:cut1], lowqqy[:cut1], 'o', color=quantile_colors[0])
ax.plot(lowqqx[cut1:cut2],lowqqy[cut1:cut2],'o', color=quantile_colors[1])
ax.plot(lowqqx[cut2:], lowqqy[cut2:], 'o', color=quantile_colors[2])
ax.set_xlabel('Normal Distribution')
ax.set_ylabel('Data')
ax.set_title('Quantile-Quantile of Residuals for Population Ratio-Crime Ratio\nLow School:City Pop. Ratio')
show()
It turns out that among this cluster of campuses, there is only one major outlier. Let's figure out which one it is. We saved all of the original data in lowx_names
and hix_names
, so we can sort those and identify the sole outlier. The outlier is on the right side - at a higher value of the data (a higher per capita crime ratio, the y variable) - so we'll pass ascending=False
to sort from high to low:
lowx_names.sort_values('Campus to City Per Capita Crime Ratio',ascending=False)[:3]
It's interesting to note that only the UCSF campus showed up on the quantile plot as an outlier - the UC Hastings campus deviated a bit from the 45 degree line that represents a normal distribution, but nothing like UCSF.
# The cuts are where colors change (used to highlight outliers)
fig = figure(figsize=(6,4))
ax = fig.add_subplot(111)
cut1 = 6
cut2 = 11
ax.plot(hiqqx[:cut1], hiqqy[:cut1], 'o', color=quantile_colors[0])
ax.plot(hiqqx[cut1:cut2],hiqqy[cut1:cut2],'o', color=quantile_colors[1])
ax.plot(hiqqx[cut2:], hiqqy[cut2:], 'o', color=quantile_colors[2])
ax.set_xlabel('Normal Distribution')
ax.set_ylabel('Data')
ax.set_title('Quantile-Quantile of Residuals for Population Ratio-Crime Ratio\nHi School:City Pop. Ratio')
show()
Yikes, this distribution is pretty far from normal - the points lie along the 45 degree line only lopsidedly, with large deviations at the tails.
Here's the distribution plotted with Seaborn:
sns.distplot(hiqqy)
xlabel('Residual y - yhat')
The high moments of this distribution (longer tail, taller shoulder, asymmetry, etc.) mean multiple campuses are not matched well by the model. Outliers on the low end are negative residuals, meaning the model overpredicted; outliers on the right, positive side are values higher than the model predicted.
This is a model that could use some improvement. Let's start with the biggest outliers, at the right side of the quantile-quantile plot, corresponding to the highest values. Remember we're looking at campuses that are a sizeable portion of the community around them (over 10%). The three campuses listed below have rates of crime incidence that more closely reflect the surrounding city. One could speculate this is due to the lack of a boundary, a more "porous" campus, or a weaker sense of community respect for the campus as a safe space.
print(hix_names.sort_values('Campus to City Per Capita Crime Ratio',ascending=False)[:3])
On the flip side are the campuses that maintain a lower incidence of crime than would be expected given their size relative to the community - campuses where some phenomenon complementary to the one just mentioned preserves the campus as a safe place in the community.
print(hix_names.sort_values('Campus to City Per Capita Crime Ratio',ascending=True)[:3])
It seems like there's an element of the rural at work here... Let's plot residuals versus total population of the city and see if there are any patterns to be seen.
hix_names['Resid'] = hiresid
lowx_names['Resid'] = lowresid
z = pd.merge(hix_names,new_campus_matches,how='inner')
z['Log Population'] = z['Population'].map(lambda x : np.log10(x))
print(z.columns)
We are almost to our goal of identifying a way to bin up and quantify the characteristics of cities. We've picked an interesting relationship between the campus-to-city ratio of populations and the campus-to-city ratio of crime rates. We were able to fit a linear model to these two variables by splitting the data into two bins and fitting them separately with models. We then used a quantile plot to analyze the outliers.
We're now looking at the residuals (we plotted their quantiles above) and seeing how they might change, functionally, with some variables. For example, if we found that the residuals had a very clear, very obvious quadratic relationship with some particular variable $x_J$, then we would go back and revise the model to include a parameter for a quadratic term.
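As a toy illustration of that idea (synthetic numbers, not the crime data): fit a straight line to data that secretly contains a quadratic term, and the residuals correlate strongly with $x^2$, flagging the missing term.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 60)
y = 1.0 + 0.5 * x + 0.8 * x**2 + 0.05 * rng.standard_normal(60)

# Fit a straight line only (intercept + slope)
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Strong correlation of the residuals with x**2 flags the missing term
r = np.corrcoef(resid, x**2)[0, 1]
print(round(r, 2))  # close to 1
```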
Seaborn's PairGrid
functionality, which creates pairwise plots of variables, is great for this purpose. We can map a scatterplot onto the PairGrid to plot the residuals against various other factors we expect might affect them. The grid below picks one law enforcement statistic and two population statistics.
(Note that earlier we showed that law enforcement stats didn't correlate very strongly with rates of crime, and that crime incidence is most closely tied to the surrounding city.)
fields = ['Resid',
'Per Capita Law Enforcement Personnel campus',
'Student enrollment',
'Population']
g = sns.PairGrid(z[fields])
g.map_diag(hist)
g.map_offdiag(scatter);
qqdat = z.sort_values('Aggregate Crime Per Officer campus')['Resid']