1. Introduction and Hypothesis¶
More female writers emerged in 19th-century America than in any preceding century. During this time, the Industrial Revolution began a period of transformation for the American experience. In 1848, the first feminist convention was held at Seneca Falls, marking the beginning of a new era for women in America. As the first-wave feminist movement grew, increased access to education led to a significant rise in the number of female fiction writers. Studying works by female writers provides us with unique perspectives on the female experience during a time of great social change. However, at the time these novels were written, female fiction was not regarded as 'serious literature,' and speaking out against social expectations of women could draw harsh criticism.
Most scholars have argued that female writers in 19th-century America only began presenting their true views on the role of women toward the end of the century. At the beginning of the century, most works by women writers expressed a passive acceptance of their domestic roles. By the end of the century, more female writers had begun to present the woman as a protagonist who resists the repression of a patriarchal society and demands a more equal partnership with a man.
To investigate these claims of a gradual feminist movement, I will aim to explore the following question: To what extent do female writers speak out against social expectations of women?
I hypothesize that more female writers speak out against social expectations toward the end of the century, after the Seneca Falls Convention in 1848. Topics and rhetoric surrounding the strong, independent woman should contrast with the more conformist, domestic writing from earlier in the century. Additionally, I hypothesize that the topics female writers took up during this time will center largely on the woman's experience within her family. As feminist attitudes gained momentum over the course of the century, we would expect to find a growing market for feminist writing that challenges conventional sex roles.
2. Corpus, data, and methods¶
Corpus and Data¶
From a corpus of 1,540 volumes of American fiction published between 1789 and 1875, I have pulled works by female authors, resulting in a corpus of 420 novels. Because this corpus covers only about 40% of the American fiction of the period (1789-1875), I am limited to novels by female authors that have been preserved in American academic libraries. The corpus includes only 180 unique female authors, and because preservation depended on academic libraries, novels are unevenly distributed across authors: roughly 33% of the novels were written by just 10 authors, who may not encapsulate the views or topics of all female writers of the time. Additionally, the majority of authors are white; there is only one Black author in this corpus, which largely limits my exploration of the female perspective to that of white women. Lastly, the distribution of publication dates shows that far more novels were written mid-century, which could limit comparisons of features across the century.
Methods¶
Pre-processing: I will first subset the corpus to include only female writers of the 19th century. To pre-process, I will label each novel as published before or after 1848, the year of the first feminist convention and the conventional start of first-wave feminism. Next, I will split each novel into ~10 chunks: I will tokenize the text, count tokens as I go, and emit a chunk each time the count reaches a target chunk length (a minimal sketch of this logic follows). I will then detokenize these token lists back into strings and store the chunks in both a labeled dictionary and a nested list, ready to be passed to a vectorizer and an LDA model. Finally, I will transform the nested list into a DataFrame so that I can merge it with the metadata for later analysis.
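Below is a minimal, illustrative sketch of that chunking logic; the full implementation, which also skips punctuation when counting and detokenizes each chunk back into a string, appears in the Chunking section.

# Illustrative sketch: accumulate tokens until a target chunk length is
# reached, then emit a chunk; keep any uneven final chunk
def chunk_tokens(tokens, chunk_length):
    chunks, working = [], []
    for token in tokens:
        working.append(token)
        if len(working) == chunk_length:
            chunks.append(working)
            working = []
    if working:
        chunks.append(working)
    return chunks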
Feminist Topic Modeling: After chunking and vectorizing, I will use topic modeling to find patterns in subject matter among female writers during this time. This will capture overall changes in thought over time, which I expect will show a difference in topics between the first and second halves of the century. By looking at top words and different variables from the metadata, I can further analyze my topic-modeling results. Additionally, to better label each topic, I will examine embedding-based similarities between words. This will help me determine whether more female writers speak out about the experiences of women after the feminist movement begins to take hold and demand for feminist thought grows.
Regression and Classification: I will use my topic-model features to build two models: a regressor and a classifier. The regressor will test whether topics are informative enough to predict the actual publication year of a book. The classifier will test whether topics can predict whether a novel was published before or after the first feminist convention in 1848. Together, these predictors will show whether novels by female authors differentiate themselves by topic over time.
Pre-processing¶
# Imports
import os
import string
from collections import Counter, defaultdict
from glob import glob

import matplotlib.pyplot as plt  # needed for the plots below
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordDetokenizer
from sklearn.preprocessing import StandardScaler
# File locations
# Note that metadata are supplied as a TSV file
# Text files are in a directory, one file per (long, novel-like) document
metadata_file = os.path.join('data', 'us_fiction', 'corpus_data.tsv')
text_dir = os.path.join('data', 'us_fiction', 'us_texts')
# Load the metadata
metadata = pd.read_csv(
metadata_file,
sep='\t',
low_memory=False
).set_index('source_id')
# Subset metadata to female authors only (copy so we can add columns later)
fem = metadata[metadata['gender'] == 'F'].copy()
fem.describe()
 | pub_date | gender_guess | born | died | words
---|---|---|---|---|---
count | 420.000000 | 420.000000 | 307.000000 | 294.000000 | 420.000000 |
mean | 1860.038095 | 0.288095 | 1816.446254 | 1885.911565 | 85458.197619 |
std | 12.452621 | 0.453416 | 15.356745 | 16.712761 | 40334.030996 |
min | 1797.000000 | 0.000000 | 1759.000000 | 1840.000000 | 4521.000000 |
25% | 1854.000000 | 0.000000 | 1809.000000 | 1874.000000 | 57685.750000 |
50% | 1862.000000 | 0.000000 | 1818.000000 | 1886.000000 | 81393.500000 |
75% | 1870.000000 | 1.000000 | 1827.000000 | 1896.750000 | 111335.500000 |
max | 1875.000000 | 1.000000 | 1854.000000 | 1932.000000 | 247258.000000 |
Here I have pulled the metadata for the corpus of 420 novels. The data is now limited to female authors in a new DataFrame, fem.
print('fraction of novels written by the top ten authors: {:.3f}%'.format(
fem.author.value_counts(normalize=True)[:10].sum()*100))
fraction of novels written by the top ten authors: 33.095%
fem.head()
source_id | author | title | pub_place | publisher | pub_date | gender | gender_guess | ethnicity | occupation | occupation_free | state_born | state_main | state_died | born | died | words
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
eaf002 | Bacon, Delia Salter | Tales of the puritans | New Haven [Conn.] | A. H. Maltby | 1831 | F | 0.0 | White | Education | Teacher | OH | CT | CT | 1811.0 | 1859.0 | 70010 |
eaf003 | Bacon, Delia Salter | Love's martyr | Cincinnati | Printed by E. Morgan and Co. | 1838 | F | 0.0 | White | Education | Teacher | OH | CT | CT | 1811.0 | 1859.0 | 13547 |
eaf004 | Bacon, Delia Salter | The bride of Fort Edward | New York | Samuel Colman | 1839 | F | 0.0 | White | Education | Teacher | OH | CT | CT | 1811.0 | 1859.0 | 34309 |
eaf026 | Brooks, Maria Gowen | Idomen, or, The vale of Yumuri | New York | Samuel Colman | 1843 | F | 0.0 | White | Writer | Poet | MA | MA | Cuba | 1794.0 | 1845.0 | 48844 |
eaf041 | Child, Lydia Maria Francis | Hobomok | Boston | Cummings, Hilliard & Co. | 1824 | F | 0.0 | White | Politics-Government-Activism | Activist | MA | MA | MA | 1802.0 | 1880.0 | 56056 |
# Label novels as published before (0) or during/after (1) 1848
fem['label'] = (fem['pub_date'] >= 1848).astype(int)
print('baseline: {}'.format(fem['label'].sum()/len(fem)))
baseline: 0.8952380952380953
# Distribution of publication dates in the female-authored corpus
fem.pub_date.plot.hist(bins=fem.pub_date.max()-fem.pub_date.min()+1);
Looking at the distribution of authors and publication dates, ~33% of novels are concentrated amongst 10 authors and a large number of novels were published mid-century. These distributions will be important to keep in mind, as they may skew the results of analyses.
novel_files = glob(os.path.join(text_dir, '*'))
# Map each file's basename to its path, then pull the female-authored novels
file_lookup = {os.path.basename(n): n for n in novel_files}
fem_files = [file_lookup[i] for i in fem.index if i in file_lookup]
len(fem_files)
420
all_novels = {}
for n in fem_files:
    title = os.path.basename(n)
    with open(n, 'r') as f:
        all_novels[title] = f.read()
Above, I have opened all of the corpus files; next I will begin data exploration by chunking the texts for LDA.
Chunking¶
punct = set(string.punctuation)
detok = TreebankWordDetokenizer()  # reuse one detokenizer instance
all_chunks = {}
years = []
labels = []
nest_chunks = []
for file in fem_files:
    title = os.path.basename(file)
    wordcount = fem.loc[title, 'words']  # use wordcount from metadata to simplify
    chunk_length = round(wordcount/10)
    with open(file, 'r') as f:
        counter = 0          # running token count (punctuation excluded)
        chunks = []          # chunks from this novel
        working_tokens = []  # working token list for the current chunk
        lines = f.readlines()
        for line in lines:
            tokens = word_tokenize(line)  # tokenize by line
            for token in tokens:          # loop over tokens
                working_tokens.append(token.lower())
                if token not in punct:    # don't count punctuation toward chunk length
                    counter += 1
                if counter == chunk_length:  # emit a chunk when the count hits the target
                    counter = 0
                    chunks.append(detok.detokenize(working_tokens))  # back to a string
                    years.append(fem.loc[title, 'pub_date'])
                    labels.append(fem.loc[title, 'label'])
                    working_tokens = []
        if working_tokens:  # account for an uneven (non-empty) final chunk
            chunks.append(detok.detokenize(working_tokens))
            years.append(fem.loc[title, 'pub_date'])
            labels.append(fem.loc[title, 'label'])
    all_chunks[title] = chunks
    nest_chunks += chunks
len(all_chunks)
420
chunked_df=pd.DataFrame.from_dict(all_chunks, orient='index')
new=chunked_df.merge(fem[['pub_date']], left_index=True, right_index=True)
new.head()
source_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | pub_date
---|---|---|---|---|---|---|---|---|---|---|---|---
eaf002 | the regicides we dig no lands for tyrants but ... | , move your chair for mr. russel . margaret ga... | and there hollowed into deep recesses which pa... | companions, or even cherished in her secret he... | no fancy then; that deep groan had borne its o... | more noble than a resting place in the tombs o... | by which she had leaned a few hours since, was... | dwelling, which she was that moment passing . ... | her in french with ease and fluency . listen, ... | remember a sin of far more deadly hue than aug... | None | 1831 |
eaf003 | the loved, the hated, the adored, each mortal,... | bright attributes that the common experience o... | family as nothing utterly worthless, that you ... | of these savages . the unconscious and bewilde... | home with him to england, and told me how happ... | window . the sound of retiring steps soon echo... | had last vanished listening breathlessly for s... | soon lost to her view, as with that strange an... | hill, preparing to scalp and murder each other... | to the left, in a small glen, but as the level... | to cloud the brightness of the eternal noontid... | 1838 |
eaf004 | 2 the bride of fort edward . part first . indu... | blowing without there, wasting for ever; and n... | and stones of this dull earth were precious to... | .' surely the bitterness is deep when that whi... | , my lady! here's some one at the gate . (an o... | ? this don't look much like it . 4 th sol . i ... | meet now, we are parted for ever; if i do not ... | speedily report our absence . 2nd sol . well, ... | as good . yes sir, yes sir, they are flocking ... | could not be . they told us she was murdered, ... | None | 1839 |
eaf026 | recital . the fireside . various misfortunes h... | their short robes or tunics of clean linen, bo... | ; and that his voluntary absence from a more h... | too cold marble before the picture of idomen; ... | hare and ptarmigan were seen in the sparkling ... | black bearing my usual impression . ` i looked... | of a desolate soul! power even to seek the gra... | confessing to him all i had felt; but the powe... | , said ethelwald, your address, and you shall ... | that roundness of form most remarked in the la... | None | 1843 |
eaf041 | 1 * hobomok . chap . i . how daur ye try sic s... | not worth the tears, which an onion draweth fo... | gratefully partaken, and all john's exploits i... | was considered as ishmael in the house of abra... | a controversial discussion with the plymouth e... | into her apartment, and hiding her face in the... | abide with you, on account of her sometime acq... | , in an agitated voice . verily, my dear wife,... | , as she uttered some mournful and incoherent ... | of his heart . he had proceeded near half a mi... | as he forgiveth you . and nowe, god in his mer... | 1824 |
Here, I have tokenized and chunked all of the text into ~10 chunks per novel, then detokenized each chunk into a single string to be passed to the vectorizer. Finally, I have created a DataFrame that stores these chunks under their text-file names, which will let me merge with the metadata later to look further into topical patterns.
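As a quick, illustrative sanity check (not part of the pipeline itself), I can confirm that each novel produced roughly 10-11 chunks:

# Distribution of chunk counts across novels (expect ~10-11 per novel)
chunk_counts = pd.Series({title: len(chunks) for title, chunks in all_chunks.items()})
print(chunk_counts.describe())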
Vectorizer to LDA¶
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(
input = 'content',
encoding = 'utf-8',
strip_accents = 'unicode',
stop_words='english',
lowercase = True,
max_df = .75,
min_df = 0.01
)
X = vectorizer.fit_transform(nest_chunks)
print("Feature matrix shape:", X.shape)
print("Total vectorized words in the corpus:", X.sum())
print("Average vectorized chunk length:", int(X.sum()/X.shape[0]), "tokens")
Feature matrix shape: (4608, 16684)
Total vectorized words in the corpus: 10769245
Average vectorized chunk length: 2337 tokens
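For context on the max_df/min_df settings, the following illustrative comparison (an extra diagnostic, not part of the pipeline) shows how much the document-frequency cutoffs prune the vocabulary:

# A term must appear in at least 1% and at most 75% of chunks to be kept
unpruned = CountVectorizer(stop_words='english').fit(nest_chunks)
print('unpruned vocabulary size:', len(unpruned.get_feature_names_out()))
print('pruned vocabulary size: ', X.shape[1])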
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(
n_components=20, # Number of topics to find
n_jobs=-1, # Use all CPU cores
verbose=1, # Print progress
max_iter=50, # Might want more in production work
evaluate_every=0 # Set >=1 to test for convergence (slow, but can stop iteration)
)
lda.fit(X)
iteration: 1 of max_iter: 50
iteration: 2 of max_iter: 50
...
iteration: 50 of max_iter: 50
LatentDirichletAllocation(evaluate_every=0, max_iter=50, n_components=20, n_jobs=-1, verbose=1)
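One optional diagnostic (a sketch using scikit-learn's built-in perplexity method, not part of the original analysis) scores the fitted model on the training matrix; this is mainly useful for comparing runs with different topic counts:

# Lower perplexity is better; compare across n_components settings
print('perplexity:', round(lda.perplexity(X), 1))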
# from topic modeling lecture
def print_top_words(model, feature_names, n_top_words, hide_stops=False):
if hide_stops:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
for topic_idx, topic in enumerate(model.components_):
message = f"Topic {topic_idx: >2}: "
top_words_idx = topic.argsort()
if not hide_stops:
top_words = [feature_names[i]
for i in top_words_idx[:-n_top_words - 1:-1]]
else:
top_words = []
i = 1
while len(top_words) < n_top_words:
if feature_names[top_words_idx[-i]] not in ENGLISH_STOP_WORDS:
top_words.append(feature_names[top_words_idx[-i]])
i += 1
message += " ".join(top_words)
print(message)
print()
print_top_words(lda, vectorizer.get_feature_names_out(), n_top_words=10, hide_stops=True)
Topic 0: thou thy thee sweet spirit heaven beauty earth eye deep Topic 1: alice bessie frank laura herbert willie replied sister hamilton ashley Topic 2: ll tom yer got dat em ye miss ve master Topic 3: church lord holy mary christ sister prayer faith christian heaven Topic 4: philip miss susan answered lucia honor thoughts spoke need power Topic 5: water boy black trees road sea horse red wind sun Topic 6: elsie ll loved dead matter hard miss ve lips marry Topic 7: mary ll says doctor got kate money ruth bed boy Topic 8: ll reply boy daisy ve replied warren glad business son Topic 9: agnes mabel robert uncle john julia fanny charles lydia richard Topic 10: ellen lucy aunt clara mary isabel cora miss constance esther Topic 11: captain sir colonel lord miss sybil exclaimed general heaven inquired Topic 12: helen replied daughter son power fate future placed fortune louise Topic 13: happiness replied feelings edith husband affection continued manner loved florence Topic 14: lips fell cried arms answered wild pale dead bed husband Topic 15: jane country replied received miss party indian till eye son Topic 16: harry lily miss amy owen arnold nancy adele marcia ophelia Topic 17: women society human power country eva self fact state certain Topic 18: miss pretty ladies dress music cousin beauty gentleman hair party Topic 19: margaret maud roland effie allan bruce irene edmund st bishop
Using the function provided in lecture, print_top_words, I can now see the top 10 words for each of the 20 topics. I will now apply the fitted LDA model to transform the feature matrix, which outputs, for each chunk, its probability of belonging to each topic.
Doc-topic Matrix¶
import warnings
with warnings.catch_warnings():
warnings.simplefilter('ignore')
doc_topic_matrix = lda.transform(X)
print("Doc-topic matrix shape:", doc_topic_matrix.shape)
Doc-topic matrix shape: (4608, 20)
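Each row of this matrix is a probability distribution over the 20 topics, which a quick illustrative check (not in the original analysis) confirms:

# Every row of the doc-topic matrix should sum to ~1
print('first five row sums:', doc_topic_matrix.sum(axis=1)[:5].round(3))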
sorted_df=new[[0,1,2,3,4,5,6,7,8,9,10]].T
sorted_df=sorted_df.fillna(' ')
# Examine dominant topics per text
def find_dominant(model, df, vectorizer):
    """
    Returns a DataFrame labeling each document with its dominant topic
    """
    labeled_df = pd.DataFrame()
    for c in df.columns:
        doc = [' '.join(df[c])]  # rejoin the novel's chunks into one text
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            proba = model.transform(vectorizer.transform(doc))
        top = proba[0].argmax()  # index of the most probable topic
        labeled_df.loc[c, 'dom_top'] = top
        labeled_df.loc[c, 'top_prob'] = proba[0][top]
    return labeled_df
dom_df=find_dominant(lda, sorted_df, vectorizer)
dom_df.head()
source_id | dom_top | top_prob
---|---|---
eaf002 | 0.0 | 0.291249 |
eaf003 | 0.0 | 0.385452 |
eaf004 | 0.0 | 0.394618 |
eaf026 | 0.0 | 0.422981 |
eaf041 | 15.0 | 0.291122 |
After executing the function find_dominant, I can now merge with the metadata and create some visualizations based on the dominant topic of each novel.
Dominant Topics and Keyword Weighting¶
new_fem=fem.merge(dom_df, left_index=True, right_index=True)
plt.figure(figsize=(10,8))
g=sns.histplot(x='dom_top', hue='label', data=new_fem, multiple='stack',
palette='pastel', binwidth=1, stat='density', common_norm=False)
g.set_xticks(range(20), labels=range(20))
plt.title('Dominant Topic Labels by Time Period')
plt.xlabel('Dominant Topic')
plt.legend(title='Year', loc='upper right', labels=['1848 and After', 'Before 1848'])
plt.show()
no_15=len(new_fem[(new_fem['dom_top']==15) & (new_fem['label']==0)])
print(f'number of novels written before 1848 with dominant topic 15: {no_15}')
number of novels written before 1848 with dominant topic 15: 23
We see that for novels written before 1848, a large majority have dominant topic 15, whose top words are: jane country replied received miss party indian till eye son. For novels written in 1848 and after, the most common dominant topic is topic 13, whose top words are: happiness replied feelings edith husband affection continued manner loved florence. At first glance, the top ten words of topic 13 may align with my hypothesis that female writers would write more about the grievances of domestic life later in the century.
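To back up the histogram with exact counts, here is a small illustrative cross-tabulation (an added check, not in the original analysis) of dominant topic by period label:

# Rows are dominant topics; columns are period labels (0 = pre-1848, 1 = 1848+)
print(pd.crosstab(new_fem['dom_top'], new_fem['label']))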
fig, axs = plt.subplots(5, 4, figsize=(30,20))
col=0
row=0
for topic_idx, topic in enumerate(lda.components_):
message = f"Topic {topic_idx: >2}: "
top_words_idx = topic.argsort()
word_scores=lda.components_[topic_idx]
feats_names = vectorizer.get_feature_names_out()
top_words = [feats_names[i]
for i in top_words_idx[:-10 - 1:-1]]
top_scores = [word_scores[i]
for i in top_words_idx[:-10 - 1:-1]]
ax1=sns.barplot(x=top_words, y=top_scores, ax=axs[row, col], palette='Accent')
if row==4:
col+=1
row+=1
if row>4:
row=0
ax1.title.set_text(f'Topic {topic_idx}')
plt.tight_layout()
Here you can see the weights of individual words within a topic. For Topic 13, the most heavily weighted word is happiness, but other words carry similar weight. For Topic 15, the most heavily weighted word is jane, followed by country and replied.
# Examine topic distribution in a random chunk
import random
doc_idx = random.randrange(len(doc_topic_matrix))
print(f"Topic distributions in chunk {doc_idx} ({years[doc_idx]})")
print("Sample of the chunk:\n", nest_chunks[doc_idx][:2000], "\n")
# Report the dominant topic for this chunk
print(f'Dominant Topic {doc_topic_matrix[doc_idx].argmax()}')
Topic distributions in chunk 1147 (1859)
Sample of the chunk:
 medora in the boudoir at lazy-bank, his restless excitement was such, that he could scarcely remain quietly seated . she came in, looking a shade paler than usual, and met him with an embarrassment in her manner almost equal to his own . little did he suspect the weary vigil she had kept last night, in endeavoring to school herself to the calm contemplation of accepting the offer that she felt sure would the next day he made, or that she met him with the deliberate intention of not committing herself until she was well assured that his worldly prospects were as brilliant as she had been led to suppose . floyd ventured to seat himself beside her on the sofa, and then with trembling lips began: — "miss medora, you can readily guess that since last night all my thoughts have been full of you, and that i have come here this morning to tell you how much i love you ." then glancing at her downcast eyes, he went on hurriedly . do not condemn my selfishness, though i have nothing now to give you but a true and faithful heart . i should not have ventured to address you had i not been, until to-day, ignorant of my real situation ." the color which his words had ripened to a blush in medora's cheeks faded slowly away, and she released herself from the trembling arm that he had stolen about her waist . at this act, floyd, who had paused, awaiting her reply in breathless silence, exclaimed passionately: — "oh, medora, do not repulse me; only, for pity's sake, tell me that you love me a little ." she glanced at his convulsed features, and touched by the depth of real feeling they expressed, said gently— "my friend, do not agitate yourself so much; let us speak calmly on this subject, so important to us both ." "i will do anything you choose," replied floyd, with more composure, "if you will only tell me that you love me ." without noticing this appeal, medora asked: "why do you talk of your selfishness towards me? i have ever found you a thoughtful friend ." "i trust i should alw
Dominant Topic 13
This sample shows a snippet of a chunk with dominant topic 13. In this snippet, the female character is calm, collected, and strong-willed. She even speaks back to the man in the scene, saying, 'why do you talk of your selfishness towards me?' while ignoring his appeal for her to tell him she loves him. Since this is only one sample, I will need to look further into the semantic meanings of the topics' top words. Next, I will use embeddings to examine the top 10 words from topics 13 and 15 and their similarities to the words family, woman, strong, and home. I have chosen these words for exploratory reasons based on my initial hypothesis.
Embedding Similarities¶
def find_word_sims(vocab, nlp, sim_word):
"""
Returns DataFrame with cosine similarities between all words to the target similarity word
"""
vector_count = 0 # check to make sure all words have vectors
for v in vocab+[sim_word]:
if v in nlp.vocab.strings:
if nlp.vocab[v].has_vector:
vector_count+=1
# Make a word-vector matrix with labels
vector_matrix = np.zeros([vector_count,nlp.vocab.vectors_length]) # Initialize the output matrix
counter = 0
vocab_dict = {} # Dictionary to hold word index positions in the matrix
vocab_list = [] # List to hold words in order
for v in vocab+[sim_word]:
if v in nlp.vocab.strings:
if nlp.vocab[v].has_vector: # only want the ones with embeddings
vocab_dict[v] = counter # record position of this word
vocab_list.append(v) # add to our list of words
# l2-normalize the vector and update matrix
vector_matrix[counter] = nlp.vocab[v].vector/nlp.vocab[v].vector_norm
counter+=1 # increment counter
similarities = np.dot(vector_matrix, vector_matrix[vocab_dict[sim_word]])
top_n = np.argsort(similarities)[-10:][::-1]
# print(f'top ten words and their similarity to the word {sim_word}:\n')
df=pd.DataFrame()
for i in top_n:
        if vocab_list[i] != sim_word:  # exclude the target word itself
word=vocab_list[i]
df.loc[word, 'sim']= similarities[i]
df.loc[:, 'sim_word']=sim_word
return df
# Collect the top 10 words for topics 13 and 15
feats_names = vectorizer.get_feature_names_out()
vocab = []
for topic_idx in (13, 15):
    top_words_idx = lda.components_[topic_idx].argsort()
    vocab.append([feats_names[i] for i in top_words_idx[:-11:-1]])
nlp=spacy.load('en_core_web_lg')
sim_words=['family', 'woman', 'strong', 'home']
# TOPIC 13
fig, axs = plt.subplots(4, 1, figsize=(10,15))
for i in range(len(sim_words)):
df=find_word_sims(vocab[0], nlp, sim_words[i]).reset_index()
ax1= sns.barplot(x='index', y='sim', data=df, ax=axs[i], palette='Accent', order=vocab[0])
ax1.title.set_text(f'Topic 13 Word Similarities to {sim_words[i]}')
ax1.set_ylim(0,1)
ax1.set_xlabel('top 10 topic words')
ax1.set_ylabel('cosine similarity')
plt.tight_layout()
# TOPIC 15
fig, axs = plt.subplots(4, 1, figsize=(10,15))
for i in range(len(sim_words)):
df=find_word_sims(vocab[1], nlp, sim_words[i]).reset_index()
ax1= sns.barplot(x='index', y='sim', data=df, ax=axs[i], palette='Accent', order=vocab[1])
ax1.title.set_text(f'Topic 15 Word Similarities to {sim_words[i]}')
ax1.set_ylim(0,1)
ax1.set_xlabel('top 10 topic words')
ax1.set_ylabel('cosine similarity')
plt.tight_layout()
Looking purely at word similarities, it is clear that words from topic 13 are more similar to family, strong, and woman than words from topic 15, while the two sets of top words appear about equally similar to home. In the next section, I will use my topic-model features in a linear regression model and a classification model to determine the importance of topics in labeling novels.
Regression and Classification¶
# Predict novel date from topic content
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
# Fit and predict using topics
X_topics = StandardScaler().fit_transform(doc_topic_matrix)
predictor = LinearRegression().fit(X_topics, years)
y_pred = predictor.predict(X_topics)
# Score
print("Mean cross-validated R^2 (topics):", round(np.mean(cross_val_score(predictor,
X_topics,
years,
scoring='r2',
cv=10)),3))
# Plot
fig,ax = plt.subplots(figsize=(12,8))
sns.regplot(x=years, y=y_pred, scatter_kws={'alpha':0.1})
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Topics as features")
plt.show()
Mean cross-validated R^2 (topics): -0.482
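To put the negative R^2 in context, a constant-prediction baseline scores near zero by construction; the comparison below is an added illustrative check, not part of the original analysis:

# A DummyRegressor that always predicts the training-fold mean year
from sklearn.dummy import DummyRegressor
print("Mean cross-validated R^2 (mean baseline):",
      round(np.mean(cross_val_score(DummyRegressor(strategy='mean'),
                                    X_topics, years, scoring='r2', cv=10)), 3))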
Treating this as a regression problem performs poorly, which was expected given the skewed distribution of publication dates in the corpus. I will now use my topic features to fit a LogisticRegression model that predicts each chunk's label as before 1848 or during/after 1848.
# Predict novel label as before or after 1848 from topic content
from sklearn.linear_model import LogisticRegression
# Fit and predict using topics
predictor_log = LogisticRegression(max_iter=300).fit(X_topics, labels)
# Score
print("Mean accuracy (topics):", round(np.mean(cross_val_score(predictor_log,
X_topics,
labels,
scoring='accuracy',
cv=10)),3))
Mean accuracy (topics): 0.916
Treating this as a classification problem works much better, with a mean cross-validated accuracy of 0.916. Note, though, that the majority-class baseline computed earlier is 0.895, so topic features add a modest but real improvement when labeling a chunk as before or during/after 1848.
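To see where those gains over the 0.895 baseline come from, the confusion matrix below is built from cross-validated predictions (an added diagnostic, not in the original analysis):

# Rows are true labels, columns are predicted labels (0 = pre-1848, 1 = 1848+)
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
preds = cross_val_predict(predictor_log, X_topics, labels, cv=10)
print(confusion_matrix(labels, preds))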
3. Results¶
After pre-processing the texts into chunks suitable for LDA, I carried out topic modeling on female-written novels from 1797 to 1875. Topic modeling shows that novels written after the first feminist convention in 1848 discuss different topics than those written before 1848. Specifically, I found that the dominant topics of novels vary greatly with the period in which they were written. Additionally, I looked at the weights of the top 10 words for the most common dominant topic within each label. From the histogram, we see that Topic 13 is the most common dominant topic for novels written in or after 1848. The word with the most weight in this topic is happiness. However, to assess my hypothesis, I also had to look at the context of this word.
Looking at a short text sample with dominant topic 13, the woman's character clearly exhibits strength and resilience. Topic 15, on the other hand, has three highly weighted words: jane, country, and replied. These words are not particularly indicative of a certain way of thinking, but they also do not appear to relate to the words from topic 13. To investigate the semantics, I used word embeddings to compare all top words from topics 13 and 15 to the words home, woman, family, and strong. These words were taken from my initial hypothesis that women would write about the strong female character near the end of the century and focus on topics of family and home life. Looking at word similarities, I found that the top words of topic 13 were more similar to the words woman, family, and strong than the top words of topic 15. I have presented these similarities in a set of bar charts.
To determine the statistical importance of topics as features, I created a linear regression model as well as a logistic regression classifier. With the regression model, I tried to use topics to predict the year a novel was published, and found a mean cross-validated R$^2$ of -0.482, worse than simply predicting the mean year. With the classifier, topics proved informative for predicting whether novels were written before or during/after 1848, yielding an accuracy of 0.916 against a majority-class baseline of 0.895.
4. Discussion and Conclusions¶
The results of my classifier, using topic-model features as inputs, show that topics help predict whether a novel was written before or after the first feminist convention in 1848. In my hypothesis, I predicted that novels from these two periods would differ greatly in content. More specifically, I believed that novels written after 1848 would focus more on the role of a woman in a family and on dissatisfaction with domestic life. Topic modeling on its own, however, makes it difficult to infer the proper label for a topic, as it assumes that the prevalence of top words can segment different topics.
After initially generating 20 topics and looking at the dominant topic of each novel, I compared top words from two topics, 13 and 15, which were the dominant topics for label 1 (post-1848) and label 0 (pre-1848), respectively. Using word embeddings, I could tell whether these words fall under an umbrella word based on cosine similarities. My results support the hypothesis that the topics of post-1848 novels may best fall under the umbrella words family and woman. This suggests that analyzing works by American female writers of the 19th century can offer a unique perspective on the progression of the first-wave feminist movement.
There are a few limitations of my study which could be addressed in future work. First, the corpus I used was limited to novels preserved in academic libraries, which skewed the distribution of publication dates toward the middle and later decades of the period, as these novels are among the most famous. This analysis should be repeated on a larger and more evenly distributed corpus to validate its results. Additionally, I did not have time to look beyond word embeddings and the two dominant topics I focused on. It would be valuable to find proper labels for each topic based on n-grams and document embeddings.
5. Acknowledgements¶
Prof. Matthew Wilkens and the INFO 3350 TAs