The end goal here is a function that returns the bag of words used in a President's inaugural speech.
In another notebook, we'll parse that data, put it into a format D3.js likes, and export it as a JSON file.
Finally, we'll visualize it on the Pelican site with a JavaScript visualization built using the D3.js library.
%matplotlib inline
from __future__ import division
# bring in the big guns
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import nltk
# The io module makes unicode easier to deal with
import io, os, re
data_prefix = '../'
fnames = [f for f in os.listdir(data_prefix) if 'txt' in f]
print "Opening",fnames[0]
with open(data_prefix+fnames[0],'r') as f:
    contents = f.read()
tokens = nltk.word_tokenize(contents)
print "Statistics for George Washington's first inaugural address:"
print "Number of words:\t%d"%(len(tokens))
print "Number of unique words:\t%d"%(len(set(tokens)))
print "Average word length:\t%0.2f"%( np.sum([len(x) for x in tokens])/(1.0*len(tokens)) )
#print set(tokens)
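Note that nltk.word_tokenize treats punctuation marks as tokens of their own, so the counts above include commas, periods, and so forth. Here's a minimal sketch (not used in the rest of the notebook) of how you might restrict the statistics to purely alphabetic tokens:
# Illustrative only -- not applied in the analysis below.
# Keep only purely alphabetic tokens, lowercased, before computing statistics.
words_only = [t.lower() for t in tokens if t.isalpha()]
print "Number of words (alphabetic only):\t%d"%(len(words_only))
print "Number of unique words (alphabetic only):\t%d"%(len(set(words_only)))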
Now let's stuff this into a function so we can do cross-comparisons. The easiest way is to pack everything into a dictionary:
Input: filename (prefix and all), plus the speaker's name and the date of the address
Output: dictionary
Dictionary keys: name, date, word_count, unique_word_count, avg_word_length, set_unique_words
# This function takes a name and date,
# which we'll get from the metadata file.
def get_speech_dict(prefix,fname,name,date):
    #print "Opening",fname
    with io.open(prefix+fname,'r') as f:
        contents = f.read()
    tokens = nltk.word_tokenize(contents)
    sd = {}
    sd['name'] = name
    sd['date'] = date
    sd['word_count'] = len(tokens)
    sd['unique_word_count'] = len(set(tokens))
    sd['avg_word_length'] = np.sum([len(x) for x in tokens])/(1.0*len(tokens))
    sd['set_unique_words'] = list(set(tokens))
    return sd
Now that we have a convenient way of parsing a single speech and packing useful information about it into a dictionary, we can loop over all of the speeches and start to build a picture of the connections between them.
The ultimate aim is a visualization of the number of words in common between different inaugural speeches. Each speech will occupy a (presumably fixed-width) segment on the outer perimeter of a chord diagram; the connections into or out of that segment will depend on the number of words held in common between that speech and each of the others.
To do this, we'll need to create a list of information about each speech. We'll then construct an $N \times N$ connections matrix, where entry $(i,j)$ holds the number of words shared between speech $i$ and speech $j$ (the size of the intersection of their sets of words), as in the short sketch below.
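To make the intersection idea concrete, here is a tiny sketch using made-up word sets rather than actual speech data:
# Toy example (made-up data): the connection value for a pair of speeches
# is the size of the intersection of their sets of unique words.
speech_a = set(['fellow','citizens','liberty','union','constitution'])
speech_b = set(['citizens','union','duty','nation'])
print "Words in common:", len(speech_a.intersection(speech_b))   # 2 ('citizens', 'union')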
md = pd.read_csv('../metadata.csv',header=0)
print md.head()
list_of_sd = []
prefix = '../'
for (i,row) in md.iterrows():
    sd = get_speech_dict(prefix, row['filename'], row['name'], row['date'])
    list_of_sd.append(sd)
Now list_of_sd holds one dictionary per speech, each containing several key-value pairs. Here are the keys (the set_unique_words entry is really long, so we won't print it here):
print list_of_sd[0].keys()
It would be interesting to look at some histograms of different quantities, like the number of words or the number of unique words. Our data is not in a convenient format for that, so we'll put the numbers into lists, then pass those lists to Seaborn to create histograms:
# Let's visualize some quantities.
uwc = []
wc = []
wl = []
for sd in list_of_sd:
    uwc.append(sd['unique_word_count'])
    wc.append(sd['word_count'])
    wl.append(sd['avg_word_length'])
The three lists uwc (unique word count), wc (word count), and wl (average word length) are now used to construct histograms. First, a little bit of color setup to make the plots look nice:
colors = [sns.xkcd_rgb[j] for j in ['dusty purple','dusty blue','dusty green']]
fig = plt.figure(figsize=(6,12))
ax1,ax2,ax3 = [fig.add_subplot(311+i) for i in range(3)]
sns.distplot(uwc, ax=ax1, color=colors[0])
ax1.set_xlabel('Unique word count')
sns.distplot(wc, ax=ax2, color=colors[1])
ax2.set_xlabel('Word count')
sns.distplot(wl, ax=ax3, color=colors[2])
ax3.set_xlabel('Average word length')
plt.show()
Next, we'll run through each of the speeches to construct a cross-word matrix. The rows and columns will be identical, each indexed by the speeches. A given entry $(i,j)$ of the matrix represents the number of words shared in common between speech $i$ and speech $j$. (The diagonal entries are set to 0.)
# Construct cross-word matrix
crosswords = []
for i,sd1 in enumerate(list_of_sd):
    crosswurds = []
    wurds1 = set(sd1['set_unique_words'])
    for j,sd2 in enumerate(list_of_sd):
        wurds2 = set(sd2['set_unique_words'])
        shared = len( wurds2.intersection(wurds1) )
        if(i==j):
            shared = 0
        crosswurds.append(shared)
    crosswords.append(crosswurds)
The variable crosswords is a list of lists, which we now convert to a NumPy array:
cw = np.array(crosswords)
#print cw
and that is converted to a Pandas DataFrame complete with column labels:
df = pd.DataFrame(cw,columns=md['name'])
Everything is now in place for the heatmap. We pass the DataFrame of cross-word counts to the heatmap() function, specifying the x and y tick labels:
fig = plt.figure(figsize=(14,14))
xlab = md['name'].tolist()
ylab = xlab
sns.heatmap(data=df, square=True, xticklabels=xlab, yticklabels=ylab)
plt.title('Cross-Word Connections Between Inaugural Addresses')
plt.show()
The Chord Diagram D3 Block illustrates how to draw a chord diagram. The data must be in a matrix form:
// From http://mkweb.bcgsc.ca/circos/guide/tables/
var matrix = [
  [11975,  5871, 8916, 2868],
  [ 1951, 10048, 2060, 6171],
  [ 8010, 16145, 8090, 8045],
  [ 1013,   990,  940, 6907]
];
But be careful: that example uses D3 version 4, whose API differs from earlier versions.
This Stack Overflow post gives a nice description of the important aspects of the chord diagram visualization.
Basically, there's a whole bunch of hugger-mugger about the JavaScript and shapes and layout and drawing the SVG, but what it all boils down to, really, is the matrix above. Everything else is overhead.
So, we can dump the above heatmap, as an array of arrays, into a JSON dictionary with two keys, labels and data:
{
  'labels' : ['George Washington (1789)', 'George Washington (1793)', ... ],
  'data' : [[  0, 100, 210, 400, ... ],
            [201,   0, 450, 120, ... ],
            ... ]
}
json_labels = ["%s (%d)"%(name,date) for (name,date) in zip(md['name'].tolist(),md['date'].tolist())]
json_labels_small = json_labels[:6]
print json_labels_small
json_data = cw.tolist()
json_data_small = cw[:6,:6].tolist()
#print json_data
json_dict = { 'labels' : json_labels,
              'data' : json_data }
json_dict_small = { 'labels' : json_labels_small,
                    'data' : json_data_small }
import json
with open('chord_crosswords.json','w') as f:
    json.dump(json_dict,f)
with open('chord_crosswords_small.json','w') as f:
    json.dump(json_dict_small,f)
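As a quick sanity check (just an illustration, not part of the original pipeline), we can read the small file back in and confirm that the matrix is square and matches the number of labels:
# Reload the small JSON file and verify its shape against the labels.
with open('chord_crosswords_small.json','r') as f:
    check = json.load(f)
print "Number of labels:", len(check['labels'])
print "Matrix size: %d x %d"%(len(check['data']), len(check['data'][0]))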