Interlude: Massaging the Metadata

Before rushing ahead to throw my text into MALLET and get some topic models (which was very tempting!), I wanted to make sure I had a way to connect the data that OutWit had mined with the parent text from which each paragraph had been extracted. Preserving the order of the paragraphs would also open more potential paths of analysis (perhaps searching for stability and dramatic change in topics across neighboring paragraphs?). Luckily, OutWit's scraper had assigned a unique ID and the source URL to each row of output; I just needed to transform DocSouth's metadata to provide a field that could be joined to OutWit's source URL.

metadata original.png

A sample from the metadata provided by DocSouth

Because OutWit had extracted text from the XML pages, the URLs it exported pointed to those XML pages. The metadata provided by DocSouth, however, only included full URLs for the plain-text version and the landing page of each narrative.

The “filename” field of each entry provided a complete “URL-suffix” for each text. Replacing its hyphens with slashes and prepending the URL stem produced a complete URL matching the “source URL” field in OutWit’s export. Now I could match every paragraph to its parent XML document in my database:

TOC transform xml 2.png

Transform to XML url

TOC paragraph relational.png

Relation in FMP database
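The same transformation and join can be sketched outside the database. The following is a minimal Python illustration, not the FileMaker setup shown above. The URL stem, the sample filename, and the dictionary keys (`filename`, `source_url`) are all hypothetical stand-ins for the real DocSouth and OutWit values.

```python
# Hypothetical placeholder; substitute the real DocSouth URL stem.
URL_STEM = "https://example.org/"


def xml_url_from_filename(filename: str, stem: str = URL_STEM) -> str:
    """Turn a hyphen-delimited "filename" into a full URL.

    Assumes every hyphen in the filename stands for a path separator,
    as described in the text above.
    """
    return stem + filename.replace("-", "/")


def attach_parents(paragraphs: list[dict], documents: list[dict]) -> list[dict]:
    """Pair each scraped paragraph with its parent metadata record.

    Paragraph order is preserved; a paragraph whose source URL has no
    match gets a parent of None.
    """
    by_url = {xml_url_from_filename(d["filename"]): d for d in documents}
    return [{**p, "parent": by_url.get(p["source_url"])} for p in paragraphs]


if __name__ == "__main__":
    # Invented sample rows, for illustration only.
    docs = [{"filename": "neh-sample-sample.xml", "title": "A Sample Narrative"}]
    paras = [
        {"id": 1, "source_url": "https://example.org/neh/sample/sample.xml"},
        {"id": 2, "source_url": "https://example.org/neh/other/other.xml"},
    ]
    for row in attach_parents(paras, docs):
        print(row["id"], row["parent"]["title"] if row["parent"] else "no match")
```

Building the lookup dictionary once keeps the join to a single pass over the paragraphs, and leaving unmatched rows in place (rather than dropping them) makes it easy to spot filenames that did not transform as expected.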

02 toc date format.png

Date format in original metadata

Finally, chronology is of obvious interest to anyone reading this corpus from a historical perspective. The vast majority of the texts are associated with a single year. The publishing records of some texts are spotty, however, which led the editors to record their dates in several other ways: as ranges (“[between 1902 and 1912]”), as more specific dates (“Vol. 15, (July 1909)”), and as approximations in more than one notation (“[1837?]”, “186-?”).

In response, I created fields to account for both ranges and approximate dates: Year_begin, Year_end, and Circa. These departures were rare enough, and their representation varied enough, that I handled them manually.
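For a larger corpus, the same three fields could be filled programmatically. The sketch below is a hedged illustration only, since the values here were entered by hand. It covers just the four notations quoted above, and it rests on two assumptions of mine: an explicit range is not flagged as circa, and a decade such as “186-?” spans 1860 to 1869. The `ParsedDate` type and `parse_date` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class ParsedDate(NamedTuple):
    year_begin: Optional[int]
    year_end: Optional[int]
    circa: bool


def parse_date(raw: str) -> ParsedDate:
    """Map a catalog date string onto Year_begin, Year_end, and Circa."""
    text = raw.strip()

    # Explicit range, e.g. "[between 1902 and 1912]"
    m = re.search(r"between\s+(\d{4})\s+and\s+(\d{4})", text)
    if m:
        return ParsedDate(int(m.group(1)), int(m.group(2)), False)

    # Decade with an unknown final digit, e.g. "186-?"
    m = re.search(r"(?<!\d)(\d{3})-(?!\d)", text)
    if m:
        decade = int(m.group(1)) * 10
        return ParsedDate(decade, decade + 9, True)

    # Single year, optionally marked uncertain, e.g. "[1837?]" or "(July 1909)"
    m = re.search(r"(?<!\d)(\d{4})(?!\d)", text)
    if m:
        year = int(m.group(1))
        return ParsedDate(year, year, "?" in text)

    # Nothing recognizable: leave the fields empty for manual review.
    return ParsedDate(None, None, False)


if __name__ == "__main__":
    for sample in ["[between 1902 and 1912]", "Vol. 15, (July 1909)", "[1837?]", "186-?"]:
        print(sample, "->", parse_date(sample))
```

Any string the patterns do not recognize comes back with empty year fields, so the rare oddities can still be routed to manual correction.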

by Matthew McClellan