Inaugural Speech Analysis

Table of Contents

  • Single Speech Analysis
  • Multiple Speech Analysis
  • Preparing Data for D3

Single Speech Analysis

The goal here is to create a function that returns the bag of words used in a President's inaugural speech.

In another notebook, we'll parse that data, put it into a format D3.js likes, and export it as a JSON file.

Finally, we'll visualize it on the Pelican site with a Javascript visualization built on the D3.js library.

In [92]:
%matplotlib inline 

from __future__ import division

# bring in the big guns
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
import numpy as np
import nltk

# The io module makes unicode easier to deal with
import io, os, re
In [93]:
data_prefix = '../'
# sort so the numbered files come back in chronological order
fnames = sorted(f for f in os.listdir(data_prefix) if f.endswith('.txt'))

print "Opening",fnames[0]

with open(data_prefix+fnames[0],'r') as f:
    contents = f.read()

tokens = nltk.word_tokenize(contents)
Opening 01_washington_1789.txt
In [94]:
print "Statistics for George Washington's first inaugural address:"
print "Number of words:\t%d"%(len(tokens))
print "Number of unique words:\t%d"%(len(set(tokens)))
print "Average word length:\t%0.2f"%( np.sum([len(x) for x in tokens])/(1.0*len(tokens)) )
Statistics for George Washington's first inaugural address:
Number of words:	1539
Number of unique words:	627
Average word length:	4.67
In [95]:
#print set(tokens)

Now let's stuff this into a function so we can do cross-comparisons. The easiest way is to put things into a dictionary:

Input: path prefix and filename, plus the President's name and the speech date

Output: dictionary

Dictionary keys:

  • name
  • date
  • number of words
  • number of unique words
  • average word length
  • set of unique words
In [96]:
# This function takes a name and date, 
# which we'll get from the metadata file.
def get_speech_dict(prefix,fname,name,date):
    #print "Opening",fname

    with io.open(prefix+fname,'r') as f:
        contents = f.read()
        
    tokens = nltk.word_tokenize(contents)

    sd = {}
    sd['name'] = name
    sd['date'] = date
    
    sd['word_count'] = len(tokens)
    sd['unique_word_count'] = len(set(tokens))
    sd['avg_word_length'] = np.sum([len(x) for x in tokens])/(1.0*len(tokens))
    sd['set_unique_words'] = list(set(tokens))
    
    return sd
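
As a quick sanity check, we could call this function directly on the first speech — a minimal sketch, assuming the same file layout as above (the name and date would normally come from the metadata file):

# Hypothetical one-off call to check the function works
sd = get_speech_dict('../', '01_washington_1789.txt',
                     'George Washington', 1789)
print sd['word_count'], sd['unique_word_count']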

Multiple Speech Analysis

Now that we have a convenient way of parsing a single speech and packing useful information about it into a dictionary, we can loop over all the speeches and start to build a picture of the connections between them.

The ultimate aim is to create a visualization of the number of words in common between different inaugural speeches. Each speech will be an arc on the outer perimeter of a chord diagram, presumably of fixed width; the chords into or out of each arc will depend on the number of words that speech holds in common with the others.

To do this, we'll need to create a list of information about each speech. We'll then construct an $N \times N$ connection matrix, where entry $(i,j)$ holds the number of words shared between speech $i$ and speech $j$ (the size of the intersection of their sets of unique words).
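
As a toy illustration of that intersection count (made-up word sets, not taken from the speeches):

# Two tiny "speeches" as sets of unique words (made-up data)
a = set(['fellow', 'citizens', 'union', 'liberty'])
b = set(['citizens', 'liberty', 'constitution'])
print len(a.intersection(b))    # 2: 'citizens' and 'liberty'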

In [97]:
md = pd.read_csv('../metadata.csv',header=0)
In [98]:
print md.head()
                 filename               name  date
0  01_washington_1789.txt  George Washington  1789
1  02_washington_1793.txt  George Washington  1793
2  03_adams_john_1797.txt         John Adams  1797
3   04_jefferson_1801.txt   Thomas Jefferson  1801
4   05_jefferson_1805.txt   Thomas Jefferson  1805
In [99]:
list_of_sd = []
prefix = '../'
for (i,row) in md.iterrows():
    sd = get_speech_dict(prefix, row['filename'], row['name'], row['date'])
    list_of_sd.append(sd)

Now each entry of list_of_sd is a dictionary containing several key-value pairs, one per speech. Here are the keys (the set_unique_words list is really long, so we won't print it here):

In [134]:
print list_of_sd[0].keys()
['name', 'set_unique_words', 'word_count', 'date', 'avg_word_length', 'unique_word_count']

It would be interesting to look at some histograms of different quantities, like number of words or number of unique words. Our data is not in a convenient format for that, so we'll collect the numbers into lists, then pass those lists to Seaborn to create histograms:

In [135]:
# Let's visualize some quantities.

uwc = []
wc = []
wl = []
for sd in list_of_sd:
    uwc.append(sd['unique_word_count'])
    wc.append(sd['word_count'])
    wl.append(sd['avg_word_length'])

The three lists uwc (unique word count), wc (word count), and wl (average word length) are now used to construct histograms. First, a little bit of color setup to make the plots look nice:

In [136]:
colors = [sns.xkcd_rgb[j] for j in ['dusty purple','dusty blue','dusty green']]
In [137]:
fig = plt.figure(figsize=(6,12))
ax1,ax2,ax3 = [fig.add_subplot(311+i) for i in range(3)]

sns.distplot(uwc,ax=ax1,
            color=colors[0])
ax1.set_xlabel('Unique word count')

sns.distplot(wc,ax=ax2,
            color=colors[1])
ax2.set_xlabel('Word count')

sns.distplot(wl,ax=ax3,
            color=colors[2])
ax3.set_xlabel('Average word length')

plt.show()

Next, we'll run through each of the speeches to construct a cross-word matrix. The rows and columns are identical, one per speech. Entry $(i,j)$ in the matrix is the number of words shared in common between speech $i$ and speech $j$. (The diagonal is set to 0.)

In [138]:
# Construct cross-word matrix:
# entry (i,j) counts the words shared by speeches i and j.
crosswords = []
for i,sd1 in enumerate(list_of_sd):
    row = []
    words1 = set(sd1['set_unique_words'])
    for j,sd2 in enumerate(list_of_sd):
        words2 = set(sd2['set_unique_words'])
        shared = len( words1.intersection(words2) )
        if i==j:
            shared = 0
        row.append(shared)

    crosswords.append(row)

The variable crosswords is a list of lists, which we now convert to a NumPy array:

In [139]:
cw = np.array(crosswords)
#print cw

and that is converted to a Pandas DataFrame complete with column labels:

In [140]:
df = pd.DataFrame(cw,columns=md['name'])
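
A quick peek at the upper-left corner confirms the layout — a sketch, since the values depend on the actual speech texts:

# Inspect a 3x3 corner of the cross-word DataFrame
print df.iloc[:3, :3]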

Everything is now in place for the heatmap. We pass the DataFrame of cross-word counts to the heatmap() function, specifying x and y labels:

In [141]:
fig = plt.figure(figsize=(14,14))
xlab = md['name'].tolist()
ylab = xlab
sns.heatmap(data=df, square=True, xticklabels=xlab, yticklabels=ylab)
plt.title('Cross-Word Connections Between Inaugural Addresses')
plt.show()

Observations

  • George Washington's second inaugural speech - extremely short - shows up as a white band: it is so terse that it shares very few words with any other speech.
  • Some curious connections span generations - Truman, Eisenhower, and Reagan gave inaugural speeches that shared much of their language with Van Buren and W. H. Harrison.
  • The earlier speeches are more closely connected, language-wise, than later speeches are to each other. (More drift in language, perhaps, and certainly a greater volume of language output by the President and the United States government.)

Preparing Data for D3

The Chord Diagram D3 Block illustrates how to draw a chord diagram. The data must be in matrix form:

// From http://mkweb.bcgsc.ca/circos/guide/tables/
var matrix = [
  [11975,  5871, 8916, 2868],
  [ 1951, 10048, 2060, 6171],
  [ 8010, 16145, 8090, 8045],
  [ 1013,   990,  940, 6907]
];

But note that that example uses D3 version 4, so be careful about API differences.

This Stack Overflow post gives a nice description of the important aspects of the chord diagram visualization.

Basically, there's a whole lot of hugger-mugger about the Javascript, the shapes, the layout, and drawing the SVG, but what it all really boils down to is the matrix above. Everything else is overhead.

So, we can dump the above heatmap, as an array of arrays, into a JSON dictionary with a single key:

{
    'labels' : ['George Washington (1789)', 'George Washington (1793)', ... ],

    'data'   : [[  0, 100, 210, 400, ... ],
                [201,   0, 450, 120, ... ],
                ... ]
}
In [130]:
json_labels = ["%s (%d)"%(name,date) for (name,date) in zip(md['name'].tolist(),md['date'].tolist())]
json_labels_small = json_labels[:6]
print json_labels_small
['George Washington (1789)', 'George Washington (1793)', 'John Adams (1797)', 'Thomas Jefferson (1801)', 'Thomas Jefferson (1805)', 'James Madison (1809)']
In [131]:
json_data = cw.tolist()
json_data_small = cw[:6,:6].tolist()
#print json_data
In [132]:
json_dict = { 'labels' : json_labels,
              'data' : json_data }
json_dict_small = { 'labels' : json_labels_small,
                    'data' : json_data_small }
In [133]:
import json
with open('chord_crosswords.json','w') as f:
    json.dump(json_dict,f)
with open('chord_crosswords_small.json','w') as f:
    json.dump(json_dict_small,f)
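
As a final sanity check — a sketch, since the exact values depend on the run — we can read the small file back and confirm its structure:

# Round-trip the small JSON file to verify its shape
with open('chord_crosswords_small.json','r') as f:
    loaded = json.load(f)
print loaded['labels'][:2]
print len(loaded['data']), len(loaded['data'][0])   # expect 6 6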