Interlude: Massaging the Metadata
Before rushing ahead to throw my text into MALLET and get some topic models – which was very tempting! – I wanted to make sure I had some way to connect the data that OutWit had mined with the parent texts from which each paragraph had been extracted. Additionally, preserving the order of the paragraphs would open more potential paths of analysis (perhaps searching for stability and dramatic change in topics across neighboring paragraphs?). Luckily, OutWit’s scraper had assigned a unique ID and the source URL to each row of output; I just needed to transform DocSouth’s metadata to provide a field that could be joined to OutWit’s source URL.
Though OutWit had extracted text from the XML pages, and thus exported URLs pointing to the XML pages, the metadata provided by DocSouth included full URLs only for the plain text and landing pages of each narrative.
For comparison’s sake, examples of each page:
- Equiano landing page: http://docsouth.unc.edu/neh/equiano1/menu.html
- Equiano plain text: http://docsouth.unc.edu/full-text/na-slave-narratives/data/texts/neh-equiano1-equiano1.txt
- Equiano XML: http://docsouth.unc.edu/neh/equiano1/equiano1.xml
The “filename” field of each entry provided a complete “URL-suffix” for each text, so constructing a complete URL that matched the “source URL” field in OutWit’s export required only substituting slashes for hyphens and prepending the URL stem. Now I could match every paragraph to its parent XML document in my database.
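The filename-to-URL transformation can be sketched in a few lines of Python. This is an illustration under assumptions, not the procedure actually used: the URL stem is inferred from the Equiano XML example, the “filename” values are assumed to look like `neh-equiano1-equiano1.txt`, and the names `xml_url_from_filename`, `attach_metadata`, `filename`, and `source_url` are hypothetical.

```python
# Assumed URL stem, inferred from the Equiano XML example URL.
URL_STEM = "http://docsouth.unc.edu/"


def xml_url_from_filename(filename: str) -> str:
    """Turn a DocSouth 'filename' value into the matching XML URL.

    Assumes values such as 'neh-equiano1-equiano1.txt' (hypothetical format).
    """
    # Drop a trailing .txt or .xml extension if one is present.
    stem, dot, ext = filename.rpartition(".")
    base = stem if dot and ext in ("txt", "xml") else filename
    # Hyphens in the filename stand in for slashes in the URL path.
    return URL_STEM + base.replace("-", "/") + ".xml"


def attach_metadata(paragraphs, metadata):
    """Attach each paragraph's parent metadata record by matching URLs.

    'paragraphs' and 'metadata' are lists of dicts; the keys 'source_url'
    and 'filename' are assumed names for the relevant fields.
    """
    by_url = {xml_url_from_filename(m["filename"]): m for m in metadata}
    return [{**p, "parent": by_url.get(p["source_url"])} for p in paragraphs]
```

Note that this replaces every hyphen, so it would break on any directory or file name that legitimately contains one; such cases would need to be checked by hand.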
Finally, chronology is of obvious interest to anyone reading this corpus with a historical perspective. The vast majority of the texts are associated with a single year. The publishing records of some texts in the collection are spotty, however, leading the editors to assign ranges (“[between 1902 and 1912]”), more specific dates (“Vol. 15, (July 1909)”), and approximations in multiple ways (“[1837?]”, “186-?”).
In response, I created fields to account for both range and approximate dates: Year_begin, Year_end, and Circa. These departures were rare enough, and the variation in their representation wide enough, that I manually addressed them.
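For illustration only, since these cases were handled by hand, here is a minimal sketch of how the example date strings might map onto the three fields. The function name `parse_date` is hypothetical. Treating a question mark as the signal for Circa, and expanding a decade such as “186-?” to 1860–1869, are assumptions rather than the choices actually made.

```python
import re


def parse_date(raw: str) -> dict:
    """Map a catalog date string onto Year_begin, Year_end, and Circa."""
    # Assumption: a question mark marks an approximate date.
    circa = "?" in raw
    # Full four-digit years, e.g. 1902 and 1912 in a "between" range.
    years = [int(y) for y in re.findall(r"\b\d{4}\b", raw)]
    # Incomplete decade such as "186-": three digits, then a hyphen.
    decade = re.search(r"\b(\d{3})-(?!\d)", raw)
    if years:
        begin, end = min(years), max(years)
    elif decade:
        # Assumption: a decade spans its first through last year.
        begin = int(decade.group(1)) * 10
        end = begin + 9
        circa = True
    else:
        begin = end = None
    return {"Year_begin": begin, "Year_end": end, "Circa": circa}


for example in ["[between 1902 and 1912]", "Vol. 15, (July 1909)",
                "[1837?]", "186-?"]:
    print(example, parse_date(example))
```

With so few irregular entries, a script like this mostly serves as a check on the manual work; any string it cannot parse comes back with empty year fields and can be reviewed individually.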