Getting the Data: Mining

After it became clear that chapters were not a common feature of texts in this corpus, I settled on what I thought would be the largest common structural element: the paragraph.

It had also become clear to me that manipulating the TEI files with regular expressions would be more tedious than I had imagined.

I decided to mine the data using OutWit Hub, a program that provides simple, powerful tools for extracting information from webpages, automating the process over a series of many pages, and exporting the extracted (or in OutWit’s parlance, scraped) data in a convenient CSV file.

[Figure: Paragraph scraper in OutWit Hub]

Preparing a scraper in OutWit to mine paragraphs from a webpage was incredibly easy; the figure above shows the scraper I used.

[Figure: One poem passed over by OutWit]

After crawling through all of the narratives in the collection, the scraper produced a CSV file with nearly 100,000 rows, each holding the full text of one of the paragraphs that make up the narratives.
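For readers without OutWit Hub, the same kind of extraction can be sketched with Python's standard library. This is not how OutWit works internally, only a minimal illustration of the idea: keep whatever sits between paragraph tags and write one CSV row per paragraph. The names `ParagraphExtractor`, `extract_paragraphs`, and `write_rows`, the sample page, and the CSV column names are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside each <p> element of an HTML page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []    # finished paragraph strings
        self._inside_p = False  # True while between <p> and </p>
        self._pieces = []       # text fragments of the current paragraph

    def _flush(self):
        # Join fragments and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._pieces).split())
        if text:
            self.paragraphs.append(text)
        self._pieces = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._inside_p:
                self._flush()  # an unclosed <p> ends where the next begins
            self._inside_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            self._flush()
            self._inside_p = False

    def handle_data(self, data):
        if self._inside_p:
            self._pieces.append(data)


def extract_paragraphs(html):
    """Return the text of every <p> element in the given HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    if parser._inside_p:
        parser._flush()  # page ended inside an unclosed <p>
    return parser.paragraphs


def write_rows(source, paragraphs, out):
    """Write one CSV row per paragraph; the column names are hypothetical."""
    writer = csv.writer(out)
    writer.writerow(["source", "paragraph"])
    for text in paragraphs:
        writer.writerow([source, text])


if __name__ == "__main__":
    # Hypothetical sample page; a real run would loop over downloaded pages.
    page = "<h1>Title</h1><p>First  paragraph.</p><p>Second <i>one</i>.</p>"
    buffer = io.StringIO()
    write_rows("example.html", extract_paragraphs(page), buffer)
    print(buffer.getvalue())
```

Note that the heading in the sample page is dropped because it is not inside paragraph tags, which is the same behavior discussed in this post.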

It is important to emphasize here that OutWit simply extracted everything contained within paragraph tags from the webpages I directed it to, and that set of text is slightly different from a print reader’s idea of a paragraph. Elements of webpages that are tagged as paragraphs include headers and metadata (like page titles, copyrights, and transcription notes). Conversely, some prominent text elements that would make for rich close reading were not enclosed in paragraph tags. The most typical text features that escaped the grasp of my paragraph scraper were poems.
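One way to gauge what a paragraph-only scraper leaves behind is to collect the opposite set: any text that is not inside paragraph tags. The sketch below is an illustration under stated assumptions, not part of the workflow described in this post. The class and function names are hypothetical, as is the sample markup (a poem in a `div` with line breaks); the actual pages may mark up verse differently.

```python
from html.parser import HTMLParser


class OutsideParagraphText(HTMLParser):
    """Collect text that is NOT enclosed in a <p> element."""

    def __init__(self):
        super().__init__()
        self.missed = []        # text chunks a paragraph scraper would skip
        self._inside_p = False  # True while between <p> and </p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        # Collapse whitespace and ignore chunks that are only whitespace.
        text = " ".join(data.split())
        if text and not self._inside_p:
            self.missed.append(text)


def text_outside_paragraphs(html):
    """Return the text chunks that fall outside every <p> element.

    Simplifications: an unclosed <p> hides the text that follows it,
    and the contents of <script> or <style> are counted as text.
    """
    parser = OutsideParagraphText()
    parser.feed(html)
    parser.close()
    return parser.missed


if __name__ == "__main__":
    # Hypothetical page: a heading, one prose paragraph, and a poem.
    sample = (
        "<h1>Narrative</h1>"
        "<p>Prose here.</p>"
        '<div class="poem">Line one<br>Line two</div>'
    )
    for chunk in text_outside_paragraphs(sample):
        print(chunk)
```

Running a check like this over a handful of pages would give a rough sense of how much verse and other untagged text the paragraph scrape passes over.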

For the purposes of this analysis, the ease and speed of extraction outweighed the loss of these relatively few text elements. With nearly 100,000 paragraphs, there ought to be plenty to analyze for now!

by Matthew McClellan