In this notebook, we illustrate the use of Logistic Regression to categorize the abalone shell data set by number of rings. The notebook starts by importing the data as a scikit Bunch object. It then builds a cross-validated Logistic Regression model using a 70/30 split of training and test data, and plots the confusion matrix.
The results turn out to be pretty dismal. However, we can improve the model quite a bit by using results from prior notebooks, where we saw that adding a volume variable and normalizing the input variables were both helpful.
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as stats
from sklearn import metrics, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.utils import Bunch
import pickle
The function to load the data reads it from a CSV file, but returns it as a scikit-learn Bunch object: a dictionary-like container whose keys are also accessible as attributes, with the inputs and outputs stored separately.
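To make that concrete, here is a tiny standalone illustration (toy values only, not part of the analysis) of the two access styles a Bunch supports:
# A Bunch behaves like a dictionary whose keys are also attributes.
# The real data is loaded by load_data() below.
b = Bunch(data=[[1, 2], [3, 4]], target=[0, 1])
print(b.data)       # attribute access
print(b['target'])  # dictionary access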
def load_data():
    # Load the data from this file
    data_file = 'abalone/Dataset.data'

    # x data labels
    xnlabs = ['Sex']
    xqlabs = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']
    xlabs = xnlabs + xqlabs

    # y data labels
    ylabs = ['Rings']

    # Load data to dataframe
    df = pd.read_csv(data_file, header=None, sep=' ', names=xlabs+ylabs)

    # Filter zero values of height/length/diameter
    df = df[df['Height']>0.0]
    df = df[df['Length']>0.0]
    df = df[df['Diameter']>0.0]

    # One-hot encode the categorical Sex variable
    dummies = pd.get_dummies(df[xnlabs], prefix='Sex')
    dfdummies = df[xqlabs+ylabs].join(dummies)
    xqlabs = xqlabs + dummies.columns.tolist()

    return Bunch(data = dfdummies[xqlabs],
                 target = df[ylabs],
                 feature_names = xqlabs,
                 target_names = ylabs)
# Load the dataset
dataset = load_data()
X = dataset.data
y = dataset.target
print(X.head())
print("-"*20)
print(y.head())
Now we can split the data into two parts: a training set and a testing set. We'll use the training set to fit the model parameters, and the testing set to assess how well the model does on data it hasn't seen. Splitting the data this way is common when cross-validating a model (for example, cutting the data in different places to see if there are significant changes in the fit parameters).
# Split into a training set and a test set
# 70% train, 30% test
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3)
Now we create a logistic regression model, which predicts abalone age as a categorical variable (the class of 1 ring, the class of 2 rings, and so on).
# Fit the training data to the model
model = LogisticRegression()
model.fit(X_train, y_train.values.ravel())
print(model)
Once we've trained the model on the training set, we assess it with the testing set. If we cut our data into k pieces, repeated this procedure using each of the k cuts as the testing set, and compared the resulting parameters, that would be called k-fold cross-validation.
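For reference, here is a minimal sketch of what that k-fold procedure would look like using scikit-learn's KFold splitter (the fold count, random seed, and the fold_model name are arbitrary choices; we do this more cleanly with cross_val_score later in the notebook):
from sklearn.model_selection import KFold

# Each of the k folds takes a turn as the testing set.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    fold_model = LogisticRegression()
    fold_model.fit(X.iloc[train_idx], y.iloc[train_idx].values.ravel())
    print(fold_model.score(X.iloc[test_idx], y.iloc[test_idx].values.ravel()))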
# Make predictions
yhat_test = model.predict(X_test)

# Make sure y_test is a numpy array of integers
y_test = y_test['Rings'].astype(int).values

# Compare yhat_test to y_test to determine how well the model did.
# Accuracy is not usually a good way to assess categorical models,
# but in this case, we're guessing age, so the categories are quantitative.
print(model.score(X_test, y_test))
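As a sanity check, the classifier's score() is just mean accuracy, so it should match the fraction of exact matches computed by hand:
# score() for a classifier is mean accuracy: the fraction of exact matches.
print(np.mean(yhat_test == y_test))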
## Yikes. This model may not be worth saving.
#with open('logistic_regression.pickle', 'wb') as f:
#    pickle.dump(model, f)
fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111)
sns.heatmap(metrics.confusion_matrix(y_test, yhat_test),
            cmap="GnBu", square=True, ax=ax)
ax.set_title('Heatmap: Confusion Matrix for \nLogistic Regression Model')
ax.set_xlabel('Predicted Age')
ax.set_ylabel('Actual Age')
plt.show()

#print(metrics.confusion_matrix(y_test, yhat_test))
print(metrics.classification_report(y_test, yhat_test))
To interpret the report above: the precision of a category is the fraction of abalones we assigned to that category that actually belong to it. Most of the abalones have between 7 and 11 rings. For these categories our precision is around 20-30%, meaning that 70-80% of the abalones we placed in these categories (i.e., that we guessed have 7-11 rings) actually have a different number of rings.
The recall of the 7-10 ring categories is about 40%, which means that 60% of the abalones that should have been placed in those categories were not.
So basically, a lot of miscategorization, with most of it happening in the 7-11 ring categories (which also happen to be the most common).
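As a concrete check on those definitions, here is a toy example with invented labels showing how per-class precision and recall fall out of a handful of predictions:
# Six abalones whose true ring class is 7 or 8 (made-up labels).
y_true = [7, 7, 7, 8, 8, 8]
y_pred = [7, 7, 8, 8, 8, 7]

# For class 7: we predicted "7" three times and two were right (precision 2/3);
# of the three true 7s, we recovered two (recall 2/3).
print(metrics.precision_score(y_true, y_pred, average=None))
print(metrics.recall_score(y_true, y_pred, average=None))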
resid = y_test - yhat_test
print(np.mean(resid))
print(np.std(resid))
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111)
stats.probplot(resid, dist='norm', plot=ax)
plt.show()
For our next step, we'll fit the same model with standardized inputs, to see how well that does.
# Split into a training set and a test set
# 70% train, 30% test
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3)
# Repeat above, but with scaled inputs
Xscaler = preprocessing.StandardScaler().fit(X_train)
Xstd_train = Xscaler.transform(X_train)
Xstd_test = Xscaler.transform(X_test)
modelstd = LogisticRegression()
modelstd.fit(Xstd_train, y_train.values.ravel())
# Make predictions
yhatstd_test = modelstd.predict(Xstd_test)
y_test = y_test['Rings'].values
# This is not usually a good way to assess categorical models,
# but in this case, we're guessing age, so the categories are quantitative.
print(modelstd.score(Xstd_test, y_test))
fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111)
sns.heatmap(metrics.confusion_matrix(y_test, yhatstd_test),
            cmap="GnBu", square=True, ax=ax)
ax.set_title('Heatmap: Confusion Matrix for \nNormalized Logistic Regression Model')
ax.set_xlabel('Predicted Age')
ax.set_ylabel('Actual Age')
plt.show()
resid = y_test - yhatstd_test
print(np.mean(resid))
print(np.std(resid))
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111)
stats.probplot(resid, dist='norm', plot=ax)
plt.show()
This model, like the corresponding unscaled version, is pretty terrible. We're underpredicting abalone age by a substantial amount, and the residuals still have curvature.
Moving forward, we can try adding a few additional features to our logistic regression model (more input variables, transformed responses, etc.). However, to do that we'll want to be a bit more careful about how we're assessing our models.
Here, we'll implement a k-fold cross validation of our logistic regression parameters, so we can be sure we're not just getting lucky or unlucky with how we cut our data set. To do this with scikit-learn we'll use some of the goodies provided in the scikit-learn cross-validation documentation. Namely, we'll build a logistic regression model (which we'll use to fit the data), a shuffle split object (which we'll use to split the data at random into training and test sets), and a pipeline to connect the standard scaler to the logistic regression model.
When we run the cross_val_score() function, we'll pass it the pipeline as our "model" and the shuffle split object as our cross-validation object. We'll also pass it our original inputs and outputs, X and y (note that we no longer have to split the data, standardize it, fit the model, and compare the predictions by hand).
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import ShuffleSplit
# Make a logistic regression model
mod = LogisticRegression()
# Make a ShuffleSplit object to split data into training/testing data sets randomly
cv = ShuffleSplit(n_splits=4, test_size=0.3, random_state=0)
# This will be our "model":
# a pipeline that scales our inputs first,
# then passes them to the logistic regression model
clf = make_pipeline(preprocessing.StandardScaler(), mod)
cross_val_score(clf, X, y, cv=cv)
This is a big improvement in workflow, if not in accuracy: we now split the data into training and testing sets randomly, four different times, and see what the score of each model is. Note that if we want to access the predictions themselves, we can use the cross_val_predict() function instead of cross_val_score(). That will let us compute things like a confusion matrix or run a classification report.
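For example, here is a sketch of how cross_val_predict() could feed the same reporting tools we used above (the yhat_cv name is ours; note that cross_val_predict requires a cross-validation object whose test sets partition the data, so we use a KFold rather than the ShuffleSplit above):
from sklearn.model_selection import KFold

# Out-of-fold predictions for every sample, using the same scaling pipeline.
yhat_cv = cross_val_predict(clf, X, y.values.ravel(), cv=KFold(n_splits=4))
print(metrics.classification_report(y.values.ravel(), yhat_cv))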
Now that we have a more quantitative way to assess our models, let's start adding in some factors to see if we can improve our logistic regression model.
def load_data_with_volume():
    # Load the data from this file
    data_file = 'abalone/Dataset.data'

    # x data labels
    xnlabs = ['Sex']
    xqlabs = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']
    xlabs = xnlabs + xqlabs

    # y data labels
    ylabs = ['Rings']

    # Load data to dataframe
    df = pd.read_csv(data_file, header=None, sep=' ', names=xlabs+ylabs)

    # Filter zero values of height/length/diameter
    df = df[df['Height']>0.0]
    df = df[df['Length']>0.0]
    df = df[df['Diameter']>0.0]

    # -----------------------------
    # Add volume
    df['Volume'] = df['Height']*df['Length']*df['Diameter']
    xqlabs.append('Volume')

    # Add dimensions squared
    df['Height2'] = df['Height']**2
    df['Length2'] = df['Length']**2
    df['Diameter2'] = df['Diameter']**2
    xqlabs += ['Height2','Length2','Diameter2']

    # Add interactions
    df['Height-Length'] = df['Height']*df['Length']
    df['Length-Diameter'] = df['Length']*df['Diameter']
    df['Height-Diameter'] = df['Height']*df['Diameter']
    xqlabs += ['Height-Length','Length-Diameter','Height-Diameter']

    # Add dimensions cubed
    df['Height3'] = df['Height']**3
    df['Length3'] = df['Length']**3
    df['Diameter3'] = df['Diameter']**3
    xqlabs += ['Height3','Length3','Diameter3']
    # -----------------------------

    # One-hot encode the categorical Sex variable
    dummies = pd.get_dummies(df[xnlabs], prefix='Sex')
    dfdummies = df[xqlabs+ylabs].join(dummies)
    xqlabs = xqlabs + dummies.columns.tolist()

    return Bunch(data = dfdummies[xqlabs],
                 target = df[ylabs],
                 feature_names = xqlabs,
                 target_names = ylabs)
# Load the dataset
datasetV = load_data_with_volume()
XV = datasetV.data
yV = datasetV.target
# Make a logistic regression model
mod = LogisticRegression()
# Make a ShuffleSplit object to split data into training/testing data sets randomly
cv = ShuffleSplit(n_splits=4, test_size=0.3, random_state=0)
# This will be our "model":
# a pipeline that scales our inputs first,
# then passes them to the logistic regression model
clf = make_pipeline(preprocessing.StandardScaler(), mod)
cross_val_score(clf, XV, yV, cv=cv)
Adding higher-order input variables to our model didn't help much. Although we didn't explore variable interactions very deeply, it's clear they're only getting us a boost of less than 0.05 in the model score. Let's actually fit the model to the data, using the same model, pipeline, and data set, but this time use cross_val_predict() instead of cross_val_score() so we can get the actual predictions from our model.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=4)

print(XV.values.shape)
#print(len(yV.values))
print(yV.values.ravel().shape)

# Because this is an array of shape (N,1)
# and we need an array of shape (N,),
# we must flatten it.
yV = yV.values.ravel()

yhatV = cross_val_predict(clf, XV, yV, cv=skf)
print(len(yV))
print(len(yhatV))
fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111)
sns.heatmap(metrics.confusion_matrix(yV, yhatV),
            cmap="GnBu", square=True, ax=ax)
ax.set_title('Heatmap: Confusion Matrix for \nNormalized Logistic Regression Model')
ax.set_xlabel('Predicted Age')
ax.set_ylabel('Actual Age')
plt.show()
Throwing in the towel here... The logistic regression model performs very poorly compared to other techniques like ridge regression or support vector regression, and it would take a lot of effort, focused on this particular model form, to get it anywhere close to support vector regression.
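For comparison, here is a sketch of how the same cross-validation harness could be pointed at a support vector regressor, treating rings as a continuous response rather than as classes (the svr_clf name and the rbf kernel are arbitrary choices). The regressor's default R^2 scores aren't directly comparable to the classification accuracies above; this only shows how little code the swap takes:
from sklearn.svm import SVR

# Same scale-then-fit pipeline, but with a support vector regressor.
svr_clf = make_pipeline(preprocessing.StandardScaler(), SVR(kernel='rbf'))
print(cross_val_score(svr_clf, XV, yV, cv=cv))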