QUESTION ONE
Is There a Named Woman
GETTING STARTED
If you are having trouble getting BookNLP to run, here are some tricks that may get it working:
-BookNLP is designed to work on Unix-like systems such as Linux and macOS, so if you are on a Windows machine, try installing Ubuntu (for example, through the Windows Subsystem for Linux) and then navigate to your C: drive through Ubuntu with the command 'cd /mnt/c'
-Make sure that Java (in Ubuntu or through the terminal) is up to date. You can check if it is with the command 'java -version'
-Try adding '--add-modules java.se.ee' after './runjava' when you are calling BookNLP
​
The file that you need from the BookNLP results will be saved in the "Output" folder in the book-nlp-master folder that you downloaded from GitHub. The file will be saved as 'book.id.book', but this is really just a JSON file. In order to use it, simply rename the file to 'BOOKNAME.json'.
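The renaming step can also be done from Python. The following is a minimal sketch and not part of the original tutorial: it works in a temporary folder with a stand-in file so that it can run anywhere, and both the file name 'wuthering.json' and the stand-in contents are assumptions made for illustration. It copies the file instead of renaming it, so the original BookNLP output stays untouched.

```python
import json
import shutil
import tempfile
from pathlib import Path

# Work in a temporary folder so the sketch runs anywhere; in practice
# this would be the BookNLP output folder.
output_dir = Path(tempfile.mkdtemp())

# Stand-in for the file BookNLP writes; the contents are hypothetical.
booknlp_file = output_dir / "book.id.book"
booknlp_file.write_text('{"characters": []}', encoding="utf-8")

# Copy the file under a .json name (the name is an assumption).
json_file = output_dir / "wuthering.json"
shutil.copyfile(booknlp_file, json_file)

# The copy can now be read like any other JSON file.
with open(json_file, "r", encoding="utf-8") as read_file:
    data = json.load(read_file)

print(sorted(p.name for p in output_dir.iterdir()))
```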
​
For part one of this tutorial, we are going to be modifying the code provided by NLP For Hackers’ introduction to machine learning.
Before we begin, let’s start by talking about what machine learning is, exactly. Machine learning is a methodology that fits under the wider umbrella of “Artificial Intelligence.” To put it simply, machine learning relies on the idea that computers can learn from data without being explicitly programmed. As the name implies, machine learning teaches a computer how to learn. There are a couple of ways to classify machine learning problems, but for this tutorial, we are going to focus on a set of problems called “classification.”
The process of machine learning starts by feeding the computer a set of data from which it will learn. This is called a training set. Basically, this just means that we are going to provide the computer with a set of data that is already categorized (or classified, in this case), and then we are going to let the computer try to learn which characteristics the data in each category share.
For example, if you were someone who had never seen a produce aisle in a grocery store before, and you had to decide how to group the produce, you might notice that typically, fruit are not green. This may mean that you put everything that isn’t green into one category and call that category “fruit.” Of course, there are plenty of exceptions to this observation, and it won’t work every time, but you should be able to correctly classify the majority of the fruit in the produce aisle. This is exactly what the computer is trying to do, but instead of grouping produce, we are going to try and get the computer to group character names.
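The produce example can be written out as a few lines of Python. This is only a sketch: the list of produce is invented for illustration, and the rule is written by hand here, whereas a machine learning algorithm would derive a rule like this from the labelled examples on its own. It shows how a simple rule can classify most items correctly while still getting the exceptions wrong.

```python
# Hypothetical produce items, invented for illustration:
# (name, colour, true category)
produce = [
    ("banana", "yellow", "fruit"),
    ("strawberry", "red", "fruit"),
    ("orange", "orange", "fruit"),
    ("spinach", "green", "vegetable"),
    ("broccoli", "green", "vegetable"),
    ("lime", "green", "fruit"),          # exception: a green fruit
    ("carrot", "orange", "vegetable"),   # exception: a non-green vegetable
    ("blueberry", "blue", "fruit"),
]


def classify(colour):
    """Apply the rule of thumb: anything that is not green is a fruit."""
    return "vegetable" if colour == "green" else "fruit"


# Count how many items the rule puts in the right category.
correct = sum(1 for _, colour, label in produce if classify(colour) == label)
accuracy = correct / len(produce)
print(f"{correct} of {len(produce)} items classified correctly ({accuracy:.0%})")
```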
You may be wondering where we are going to get our training data from. We can’t possibly know every character name and its assumed gender. Thankfully, US Census records are maintained by the National Archives and Records Administration, and an overview is available online (https://www.census.gov/history/www/through_the_decades/overview/). Additionally, the US Social Security Administration maintains a list of the most popular baby names and genders by decade (https://www.ssa.gov/oact/babynames/decades/names1880s.html). Unfortunately, this list only goes back as far as 1880, but it will still be useful for our purposes. From this information, we will be able to construct a list of nineteenth-century names and genders. This tutorial is only going to consider nineteenth-century texts because nineteenth-century names follow predictable gender conventions. However, even though we have accounted for the possibility that modern names do not follow predictable gender conventions in the same way, you can still see how our method of selecting training data is not perfect and may introduce an element of bias into the data.
Bias is a big problem in machine learning. Although machine learning may be understood as training a computer to learn with as little human intervention as possible, because the training data has to be curated by a human, the data will naturally be altered according to the programmer’s bias. For example, certain words such as “firefighter,” or “police officer” are often coded as masculine even though they are technically gender neutral. This is something to be aware of as you proceed through this tutorial. Where are you likely to introduce bias into your data?
Another issue of bias that we have to consider is the fact that we are pulling our names and genders from the US Census. We know that census data in the nineteenth century was not recorded evenly across gender, class, or race. Because of this history, our data might not be ideal for our training set, though it is likely one of the better records of names and genders available for the nineteenth century. One question to consider as you proceed through this tutorial is how you might draw from other historical records in order to develop a better training set.
Now, it is time to actually start looking at the code. This tutorial is going to proceed with modifying files that are produced by David Bamman’s natural language processing pipeline called BookNLP. You can find a link to BookNLP’s Github as well as instructions for running BookNLP on the Methodologies and Tools page. The file, in particular, that we will be using is the .JSON file that is generated once you have run a plain text file of your chosen novel through BookNLP.
In Python, giving a variable a name that is all capitalized, like TRAIN_SPLIT, indicates that it is a variable whose value is not meant to change.
In Python, the method ".lower()" converts a string (or a series of letters) into lower case.
Here, we're using the NumPy function 'vectorize' to wrap our gender_features function so that it can be applied to a whole array of names at once. The result is an array with one set of features per name, which is the form the rest of the code expects.
THE CODE
We are going to start by importing a couple of useful libraries. In Python, libraries are just pockets of code that allow you to perform actions without having to write the code yourself.
​
​
LIBRARIES
Pandas
Pandas is a Python library that allows for the easy manipulation of data. In this case, we are going to be using Pandas because it makes working with CSV files easy.
Numpy
Numpy is a Python extension module that extends Python into a language that can manipulate numbers and work with math efficiently
Glob
Glob allows programmers to find the paths of files whose names match a pattern. This library will come in handy when we are opening and closing local files such as the .JSON file produced by BookNLP
OS
Similarly to Glob, OS will help us access local files and save new files easily
The yellow highlighted code indicates that the code has been modified from this tutorial
import pandas as pd
import numpy as np
import glob
import os
​
​
​
Now, we are going to create a variable "names" to hold our csv of names and genders that we created using the census data.
names = pd.read_csv('names_dataset.csv')
Now, we need to convert our .csv file (which Pandas has read into Python as a Data Frame) into a Numpy matrix so we can use it as training data. Afterwards, we'll determine how much of the data we will use as training data (we're going to use 80%)
# Saving the data as a matrix makes it easier to access later
# (to_numpy() replaces as_matrix(), which was removed in pandas 1.0)
names = names.to_numpy()[:, 1:]
# We're using 80% of the data for training
TRAIN_SPLIT = 0.8
Here, we're going to be using a useful function that NLP for Hackers has developed for their tutorial. In Python, a 'function' is a pocket of code that will only run when it's called. In this case, we are defining a function called "gender_features," and we're going to hold off on actually performing the actions under that definition until we have supplied the function with a name to break down into features.
def gender_features(name):
    name = name.lower()
    return {
        'first-letter': name[0],       # First letter
        'first2-letters': name[0:2],   # First 2 letters
        'first3-letters': name[0:3],   # First 3 letters
        'last-letter': name[-1],       # Last letter
        'last2-letters': name[-2:],    # Last 2 letters
        'last3-letters': name[-3:],    # Last 3 letters
    }
English names (and particularly nineteenth-century names) tend to follow fairly predictable naming conventions (for example, a name that ends with an "a" is likely to be gendered "girl"), so splitting a name into sets of letters will help us get at these conventions more easily.
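Try it yourself! If you run your code with the command print(gender_features("Mary")), you should get the following (the order of the keys may differ depending on your version of Python):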
​
{'first2-letters': 'ma', 'last-letter': 'y', 'first-letter': 'm', 'last2-letters': 'ry', 'last3-letters': 'ary', 'first3-letters': 'mar'}
Now, we're going to actually extract the features and genders from the .CSV file that we provided to the code earlier.
​
As the comments in the code indicate, the variable 'X' will contain the name features (first letters, last letters) and the 'y' variable will contain the genders that we have based on the .CSV file
print(gender_features("Mary"))
features = np.vectorize(gender_features)

# Extract the features for the whole dataset
X = features(names[:, 0])  # X contains the features
# Get the gender column
y = names[:, 1]  # y contains the targets
This code will actually begin splitting our data into a training set and a testing set.
​
Training set = the set of data that the code will use as a "cheat sheet," i.e. lists of both names and genders already provided
​
Testing set = the set of just names which the code will try to categorize according to gender
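from sklearn.utils import shuffle
# Shuffle the features and the genders together so each name keeps its label
X, y = shuffle(X, y)

# The first 80% of the data is for training, the rest is for testing
X_train, X_test = X[:int(TRAIN_SPLIT * len(X))], X[int(TRAIN_SPLIT * len(X)):]
y_train, y_test = y[:int(TRAIN_SPLIT * len(y))], y[int(TRAIN_SPLIT * len(y)):]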
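To see what the split does, here is a minimal sketch on a toy list. It is not the tutorial's code: the ten names and labels are made up for illustration, and Python's built-in random module stands in for sklearn.utils.shuffle so that the sketch has no dependencies. With ten examples and a split of 0.8, eight end up in the training set and two in the testing set.

```python
import random

TRAIN_SPLIT = 0.8

# Hypothetical toy data: ten names with made-up labels
names = ["Mary", "John", "Anna", "William", "Emma",
         "James", "Clara", "George", "Ida", "Henry"]
genders = ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"]

# Shuffle names and labels together so each name keeps its own label.
# The fixed seed only makes the shuffle repeatable.
pairs = list(zip(names, genders))
random.Random(0).shuffle(pairs)

# Everything before the cut is training data; everything after is testing data.
cut = int(TRAIN_SPLIT * len(pairs))
train, test = pairs[:cut], pairs[cut:]
print(len(train), "training examples,", len(test), "testing examples")
```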
Now, we need to transform the letters into something that the computer will actually be able to understand. This means we're going to transform letters into vectors.
​
And this is all that we need to set up our gender classifier!
Next, we're going to import the library "json." This library will allow us to read the .JSON file that we generated from BookNLP. Make sure that the .JSON file is in the same folder as your Python script.
​
We are going to open the Wuthering Heights .JSON file, read it, and save the contents of the file as the variable "data"
from sklearn.feature_extraction import DictVectorizer
​
vectorizer = DictVectorizer()
vectorizer.fit(X_train)
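The code further down calls clf.predict, but this section never defines clf. The following is a minimal sketch of how a classifier could be trained on the vectorized features. It assumes a decision tree (scikit-learn's DecisionTreeClassifier), which is a choice made here and not confirmed by this tutorial, and it uses a tiny invented training set and a plain list comprehension so that it runs on its own.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier


def gender_features(name):
    """Break a name into the letter features used in this tutorial."""
    name = name.lower()
    return {
        'first-letter': name[0],
        'first2-letters': name[0:2],
        'first3-letters': name[0:3],
        'last-letter': name[-1],
        'last2-letters': name[-2:],
        'last3-letters': name[-3:],
    }


# Hypothetical miniature training set, invented for illustration
train_names = ["Mary", "Anna", "Emma", "Clara", "John", "William", "James", "Henry"]
y_train = ["F", "F", "F", "F", "M", "M", "M", "M"]
X_train = [gender_features(n) for n in train_names]

# Learn which letter features exist, then turn each dictionary into a vector
vectorizer = DictVectorizer()
vectorizer.fit(X_train)

# Assumption: a decision tree stands in for the tutorial's classifier
clf = DecisionTreeClassifier()
clf.fit(vectorizer.transform(X_train), y_train)

# Predict the gender label for a single name
print(clf.predict(vectorizer.transform([gender_features("Mary")])))
```

With the real data, X_train and y_train would be the arrays produced by the split above, and the testing set could be used to check how often the classifier is right.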
​
​
import json
# This library will let us read JSON files directly into our code

with open("wuthering.json", "r") as read_file:
    # We're going to save the contents of the file as "data"
    data = json.load(read_file)
This function, which we will call "find_names" will simply read in the data variable that we saved earlier, and locate a particular entry in the data.
​
BookNLP saves character names in the .JSON file under the key "n." What this function does is open up the data variable and check whether it contains this key, "n." If it does, the function yields that value. Then, for every list stored in the data, the function calls itself on each item in that list, so it keeps searching for "n" until it has read the entire variable. (see Python cheatsheet entry for "recursion")
# Now, this function is going to search through the data and look for the field "n"
def find_names(data):
    if 'n' in data:
        yield data['n']
    for k in data:
        if isinstance(data[k], list):  # if the value stored under key 'k' is a list
            for i in data[k]:
                for j in find_names(i):
                    yield j
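To see how the recursion works, the function can be run on a small hand-made dictionary. This is a sketch only: the sample below is a simplified, hypothetical stand-in for BookNLP's output, and the names and counts in it are invented for illustration.

```python
def find_names(data):
    """Yield every value stored under the key 'n', at any depth."""
    if 'n' in data:
        yield data['n']
    for k in data:
        if isinstance(data[k], list):  # the value under key 'k' is a list
            for i in data[k]:
                for j in find_names(i):
                    yield j


# Hypothetical, simplified stand-in for the BookNLP .JSON contents
sample = {
    "characters": [
        {"id": 0, "names": [{"n": "Heathcliff", "c": 12},
                            {"n": "Mr. Heathcliff", "c": 4}]},
        {"id": 1, "names": [{"n": "Catherine", "c": 9}]},
    ]
}

# The function returns a generator, so we convert it into a list to print it
print(list(find_names(sample)))
```

The top-level dictionary has no "n" key, so nothing is yielded there; the function only finds names once it has worked its way down through the "characters" and "names" lists.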
Now, we're going to create a new variable, "nameslist," that is going to hold the results of the find_names function (which we will convert into a list)
To test that your code is working, try to print out the names list.
These lines of code will declare a variable called "genders" to hold the results of our machine learning algorithm. Then, we are going to feed the results into a while loop. The loop will look through the results and print "Yes" if there is at least one name that is gendered as a woman, stopping as soon as it finds one.
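# Save the names into a list variable
nameslist = list(find_names(data))
print(nameslist)
# clf must be a trained classifier that has been defined before this line
genders = clf.predict(vectorizer.transform(features(nameslist)))
i = 0
# Use < and not <= so that we never read past the end of the results
while i < len(genders):
    if genders[i] == "F":
        print("Yes")
        break  # one named woman is enough, so stop looking
    else:
        i = i + 1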
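The same check can be written more compactly with Python's built-in any(), which stops at the first match just as the loop does. This is a sketch under assumptions: the function name has_named_woman is made up for illustration, the label "F" follows the code above, and the list of predictions is a hypothetical stand-in for the output of the classifier.

```python
def has_named_woman(genders):
    """Return True if at least one predicted gender is "F"."""
    return any(g == "F" for g in genders)


# Hypothetical predictions, standing in for the classifier's output
genders = ["M", "M", "F", "M"]

print("Yes" if has_named_woman(genders) else "No")
```

Unlike the loop, this version also gives an explicit "No" when no name is gendered as a woman.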
The next portion of this tutorial will focus on modifying the gender results of Step One to fit the nineteenth century, more specifically, and then will move on to addressing the second question of the Bechdel Test which addresses speech.