
Question Two

Do two named women speak to each other?


GETTING STARTED

For this portion of the tutorial, we are going to focus on modifying the gender results we achieved in part one of the tutorial, as well as extracting the dialogue that BookNLP pairs with character names. As we have discussed in previous sections, diversity is a huge problem in the field of machine learning, for reasons we began to uncover through our initial determination of gender in Question One. However, the issue runs even deeper than what we have covered so far.

To some degree, one could argue that because we sourced names and genders from the U.S. Census, what the code really reflects is the bias of those particular governmental record-keeping systems. Question Two is where we begin to complicate that notion. Since we are working specifically with nineteenth-century novels, there is a whole other layer of gender-related categorization issues that we must consider. For example, while we have thus far created a model that will successfully identify gender based on a name, many nineteenth-century fictional characters are only ever known by their last name. For this reason, we must consider questions such as: can we really consider a female character to be actualized in the novel if we only ever know her in terms of her relationship to a father or brother? Another question we might consider: if we were to assemble a list of words that immediately indicate gender in instances of dialogue (for example, 'father,' 'brother,' or 'husband'), at what point can we be satisfied that we have accounted for all such instances? Do we consider words such as 'firefighter' or 'doctor' to be gendered male because of how limited the nineteenth-century workforce was in terms of gender? If we make these decisions, do we prevent the model from extracting instances where there might have been a woman doctor or firefighter?

All of these questions, and more, are questions that you as the researcher must decide. All of these decisions introduce the possibility that contemporary gender bias might be introduced to the model. These biases can present themselves in obvious ways (as in this example), but more often than not, they exist in the systems all around us in ways that are difficult to spot. For example, in a November 2019 Medium article about gender bias and AI, Robert Munro shows how, in most contemporary, widely used Natural Language Processing technologies, the pronoun "hers" is not recognized as a pronoun. These NLP technologies include the parser that BookNLP uses, which is produced and maintained by Stanford University and is particularly popular in Digital Humanities. Unsurprisingly, Munro shows that these same technologies have no issue identifying "his" as a pronoun. These biases aren't a matter of "algorithmic bias," a term that has become popular in recent years. Rather, they reflect the biases of both the programmer and the language itself. You can read more about the biases against "hers" here.

Keep all of these ideas in mind as you proceed through the tutorial, as we will be returning to these concepts later.

LIBRARIES

THE CODE

We are going to start by importing a couple of useful libraries. In Python, libraries are just bundles of prewritten code that allow you to perform actions without having to write the code yourself.


JSON

json is a Python library that allows a .JSON file to be read as a Python dictionary object. This library allows users to work directly with a .JSON file rather than converting it by hand.
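As a quick illustration, here is json parsing a small string into a dictionary (json.load does the same for a file object); the sample data below is invented for this example:

import json

# A toy document shaped loosely like BookNLP output (invented for illustration)
record = json.loads('{"characters": [{"n": "Elizabeth"}]}')

print(record["characters"][0]["n"])  # Elizabeth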

Six

Six is a library that smooths the transition of code from Python 2 to Python 3. This library allows us to use iteritems, which will iterate over a large list or dictionary.

Default Dict

defaultdict is a way of preventing a KeyError if we iterate over a dictionary and provide a key that doesn't exist. Rather than raise an error, defaultdict will return a default value for the missing key.
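For instance, a regular dictionary raises a KeyError for a missing key, while a defaultdict built with list quietly returns an empty list:

from collections import defaultdict

speech = defaultdict(list)        # missing keys default to an empty list
speech["Jane"].append("Hello")

print(speech["Jane"])   # ['Hello']
print(speech["Emma"])   # [] -- no KeyError, even though "Emma" was never added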


import json
from six import iteritems

from collections import defaultdict
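Before the loop below can run, the .JSON file that BookNLP produced in part one needs to be read into the dictionary object 'data'. Here is a minimal sketch; the filename "book.json" and the file contents are assumptions invented so the snippet runs on its own, so substitute the file from your own novel:

import json

# Invented stand-in for the BookNLP output file from part one
with open("book.json", "w") as f:
    f.write('{"characters": [{"n": "Elizabeth"}, {"n": "Darcy"}]}')

# The actual loading step: read the .JSON file into a nested dictionary
with open("book.json") as f:
    data = json.load(f)

print(len(data["characters"]))  # 2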

We're just going to start by declaring a few counter variables and setting them equal to 0. These variables will help us keep track of what position we are in later. Additionally, we're going to declare two empty lists to hold the respective male and female names, an empty list called "keys" to record the positions of female characters, and a dictionary called "attributes" which will hold each character's name, gender, and dialogue.

q = 0
j = 0
y = 0
p = 0

attributes = {} #dictionary
female_names = []
male_names = []
keys = [] #will record the keys of female characters later

Now, we're going to create a nested loop that will iterate through our .JSON file. Remember from part one of this tutorial that by this point in the code, we have imported the .JSON file as a dictionary object called "data," so any instance of the variable "data" refers to that dictionary. Since our dictionary is nested (with the character names listed under the key "characters"), our second loop (a while loop) will continue looping as long as the variable j is less than the number of characters. This way, we make sure that we grab all of the characters' data.

for i in data:
    while j < (len(data['characters'])):

For the next portion of this tutorial, we are going to be using a function called 'nested_lookup.' Since the point of this tutorial is not to teach you how functions work, I'm not going to go line by line and explain how this one works. All you need to know is that this function allows us to look for particular values and keys in a nested dictionary without necessarily knowing how deeply nested it is. For example, we might not know how deeply embedded the 'name' entry is in our data dictionary. Because we might not have this information, we need some way to check each entry until we find what we're looking for. That is what this function does for us.

def nested_lookup(key, document, wild=False, with_keys=False):
    """Lookup a key in a nested document, return a list of values"""
    if with_keys:
        d = defaultdict(list)
        for k, v in _nested_lookup(key, document, wild=wild, with_keys=with_keys):
            d[k].append(v)
        return d
    return list(_nested_lookup(key, document, wild=wild, with_keys=with_keys))

def _nested_lookup(key, document, wild=False, with_keys=False):
    """Lookup a key in a nested document, yield a value"""
    if isinstance(document, list):
        for d in document:
            for result in _nested_lookup(key, d, wild=wild, with_keys=with_keys):
                yield result

    if isinstance(document, dict):
        for k, v in iteritems(document):
            if key == k or (wild and key.lower() in k.lower()):
                if with_keys:
                    yield k, v
                else:
                    yield v
            elif isinstance(v, dict):
                for result in _nested_lookup(key, v, wild=wild, with_keys=with_keys):
                    yield result
            elif isinstance(v, list):
                for d in v:
                    for result in _nested_lookup(key, d, wild=wild, with_keys=with_keys):
                        yield result
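To see what the lookup does, here is a compact standalone version of the same idea applied to a toy dictionary (the data is invented; the real 'n' entries come from BookNLP's output):

def tiny_nested_lookup(key, document):
    """Compact version of the lookup above: recurse through lists and
    dicts, collecting every value stored under `key` at any depth."""
    results = []
    if isinstance(document, list):
        for item in document:
            results.extend(tiny_nested_lookup(key, item))
    elif isinstance(document, dict):
        for k, v in document.items():
            if k == key:
                results.append(v)
            else:
                results.extend(tiny_nested_lookup(key, v))
    return results

# 'n' sits at different depths, but both values are still found
doc = {"characters": [{"agent": {"n": "Elizabeth"}}, {"n": "Darcy"}]}
print(tiny_nested_lookup("n", doc))  # ['Elizabeth', 'Darcy']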

Next, we're going to declare some variables within our nested loop. Remember, indentation matters in Python, so make sure that all of your spacing matches. The variables that we are declaring are as follows:

The 'key' variable is a way of keeping track of where we are in the data variable. In a way, the key is just a way of keeping our place. Because we are setting the different coordinates for the key as 'i' and 'j', the key will gradually increment, which will allow us to traverse the entire dictionary.

The 'name' variable will look for the entry 'n' (which stands for name) in the data dictionary. Then, the name that is associated with the particular key that the loop is currently on will be looked up. The correct name/key pairing becomes the 'name' variable.

The 'gender' variable will take the name from above and apply our prediction model to it (similar to how we did this in Question One).

        key = data[i][j]

        name = nested_lookup('n', key)
        gender = clf.predict(vectorizer.transform(features(name)))

There are a few things that we need to fine-tune about how we are determining gender. As we discussed in the introduction above, many nineteenth-century characters are primarily known by their last name. If you give a gender predictor a name, it can't distinguish between a first and a last name, so it will attempt to gender a name like "Mrs. Adams" or "Mr. Bennett," which may alter our results quite significantly. These conditional statements correct for that issue by changing the gender of the name if it is preceded by either "Mrs." or "Mr." You can choose to skip this step if you wish, though remember that your results may vary.
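The prefix checks below hinge on slicing the first three characters of the name string; a quick illustration of what that slice captures:

# The first three characters are enough to spot both titles:
# 'Mr. Darcy'[:3] is 'Mr.' and 'Mrs. Bennet'[:3] is 'Mrs'
for full_name in ["Mr. Darcy", "Mrs. Bennet", "Elizabeth"]:
    print(repr(full_name[:3]))  # 'Mr.', then 'Mrs', then 'Eli'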

        if gender[0] == 'F':
            if name[0][:3] == 'Mr.':
                gender[0] = 'M'
                male_names.append(name[q])
            else:
                female_names.append(name[q])
        elif gender[0] == 'M':
            if name[0][:3] == 'Mrs':
                gender[0] = 'F'
                female_names.append(name[q])
            else:
                male_names.append(name[q])

If you chose to correct the gender of last names, your final loop code will look like this:

for i in data:
    while j < (len(data['characters'])):
        key = data[i][j]

        name = nested_lookup('n', key)
        gender = clf.predict(vectorizer.transform(features(name)))

        if gender[0] == 'F':
            if name[0][:3] == 'Mr.':
                gender[0] = 'M'
                male_names.append(name[q]) #we only need one of the names provided by BookNLP; here we take the first one
            else:
                female_names.append(name[q])
        elif gender[0] == 'M':
            if name[0][:3] == 'Mrs':
                gender[0] = 'F'
                female_names.append(name[q])
            else:
                male_names.append(name[q])

If you chose not to correct the gender of last names, your final loop code will look like this:

for i in data:
    while j < (len(data['characters'])):
        key = data[i][j]

        name = nested_lookup('n', key)
        gender = clf.predict(vectorizer.transform(features(name)))

        if gender[0] == 'M':
            male_names.append(name[q]) #we only need one of the names provided by BookNLP; here we take the first one
        elif gender[0] == 'F':
            female_names.append(name[q])

There is one last bit of code that we need to add to the bottom of the loop (no matter which option you chose above). This code will look through our names and genders, and if the gender of a particular name is 'F', then that name, its gender, and the speech associated with the character will be added to the dictionary object called 'attributes'.

Then, we are going to record the current position in the 'keys' list, and increment the variables p, y, and j, which we have used to help us keep our place.

Remember that all of this code is still located within the larger loop, so the indentations should all match up.

        if gender[0] == 'F':
            attributes.update({y: [{'name': name[0]},
                                   {'gender': gender[0]},
                                   {'speaking': nested_lookup('w', nested_lookup('speaking', key))}]})
            keys.append(y)
        p = p + 1
        y = y + 1
        j = j + 1

Now, we are just going to run a few print statements to make sure that everything looks good. The output of 'female_names' should be a complete list of all the names the code has identified as belonging to a woman, the 'male_names' output should be all of the names of men, and then the 'attributes' output should be a dictionary of all names, genders, and spoken dialogue.

print(female_names)

print(male_names)

print(attributes)

Now comes the tricky part of this tutorial. We have to have some way of determining whether one of the characters we have identified as a woman is speaking to another woman. We're going to do that by establishing a list of "female_terms," or words that we can be reasonably sure signal that a woman is speaking to another woman. We're going to look specifically for these words in the instances of spoken dialogue that we have extracted.

female_terms = []

Below are two options for constructing our terms list. Neither option is comprehensive; they simply represent a looser and a more conservative approach to this problem.

The first option is the more conservative approach: we make zero assumptions about gender based on the perceived gender of specific roles. For this approach, we construct a list of female pronouns along with a limited list of roles that are guaranteed to be female, such as 'sister' or 'mother.' Choosing this approach might mean that some instances of gender in dialogue will not be recognized by the code.

The second option is the looser approach: we make some assumptions about particular roles using educated guesses. In addition to obviously gendered roles such as 'sister' or 'mother,' we also include words such as 'maid' or 'servant.'

female_terms = ["her", "Her", "hers", "Hers", "her's", "Her's", "madam", "Madam", "Miss", "miss", "she", "She", "mother", "Mother", "sister", "Sister", "mama", "Mama", "wife", "Wife"]
to_women = female_names + female_terms

BIAS WARNING


At this point in the tutorial, you should be asking yourself whether your decisions regarding gender are being informed by your own expectations regarding gender. How will you ensure that you limit the amount of gender bias present in your code?

female_terms = ["her", "Her", "hers", "Hers", "her's", "Her's", "madam", "Madam", "Miss", "miss", "she", "She", "mother", "Mother", "sister", "Sister", "mama", "Mama", "wife", "Wife", "maid", "Maid", "servant", "Servant", "cook", "Cook", "launderer", "Launderer", "washer", "Washer", "nanny", "Nanny", "factory worker", "Factory worker"]


to_women = female_names + female_terms

Choosing the looser approach might mean that some of your educated guesses about gender roles are incorrect. In that case, you might pass over, or include, instances that shouldn't be counted.
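With 'to_women' in hand, the last step is to scan each identified woman's dialogue for any of those terms. One way that check could look is sketched below; everything here, including the sample 'attributes' entry and the matching rule (flag a woman whose dialogue contains any term or female name other than her own), is an assumption about how you might answer Question Two, not code from the tutorial:

# Assumed example list: female names found earlier plus the female terms
to_women = ["Elizabeth", "Jane", "her", "sister", "mother"]

# Assumed example entry, shaped like the 'attributes' dictionary built above
attributes = {
    0: [{'name': 'Elizabeth'},
        {'gender': 'F'},
        {'speaking': [["Oh", "my", "dear", "sister"]]}],
}

women_addressing_women = []
for entry in attributes.values():
    name = entry[0]['name']
    # Flatten the tokenized utterances into one list of words
    words = [w for utterance in entry[2]['speaking'] for w in utterance]
    # Flag the speaker if any term (other than her own name) appears in her dialogue
    if any(term in words for term in to_women if term != name):
        women_addressing_women.append(name)

print(women_addressing_women)  # ['Elizabeth']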
