Getting the Data: False Start

To my great excitement, DocSouth provides a handy zip package including every narrative in the collection, in both plain text (.txt) and marked-up TEI versions (.xml), as well as metadata for the collection in an Excel file.

For topic modeling to produce more meaningful results, I would need a larger body of texts. While 300 narratives might be no mean feat for a manual close reader, topic modeling software like MALLET shines when fed thousands of texts rather than hundreds.

In order to create a larger corpus to feed to MALLET, I wanted to break up the narratives into smaller chunks, while maintaining as much narrative integrity as possible within each chunk. I hoped to avoid breaking up each text according to an arbitrary marker of length, such as the page or X number of characters, which might produce chunks that begin or end in the middle of a sentence or word. The chapter seemed a natural narrative unit to explore first.

I began by exploring the Equiano XML file to take an inventory of TEI tags used in the file, paying attention to patterns of tags around significant features of the text, such as titles, chapter breaks, and arguments (summaries of the action at the beginning of every chapter).
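A tag inventory of this kind can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the project: the function name `tag_inventory` and the sample string are made up, and a regular expression stands in for a real XML parser on the assumption that the goal is a quick count of element names rather than a validated parse.

```python
import re
from collections import Counter

# Captures the element name of every opening or self-closing tag,
# e.g. "div2" in <div2 type="chapter">. Closing tags (</div2>),
# comments (<!-- -->) and declarations (<?xml ?>) are not matched.
OPEN_TAG = re.compile(r"<([A-Za-z][\w.-]*)")


def tag_inventory(xml_text):
    """Count how often each element name is opened in an XML string."""
    return Counter(OPEN_TAG.findall(xml_text))


# Hypothetical miniature document, not taken from the DocSouth files.
sample = (
    '<text><body><div1 type="book"><div2 type="chapter">'
    "<head>A heading</head><p>One.</p><p>Two.</p>"
    "</div2></div1></body></text>"
)

inventory = tag_inventory(sample)
for name, count in inventory.most_common():
    print(name, count)
```

Reading a real file into `xml_text` and sorting the counts makes it easier to see which tags recur around titles, chapter breaks, and arguments.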

Using regular expressions, I was able to isolate the chapters and arguments of Equiano’s text.
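As a rough sketch of what isolating chapters and arguments with regular expressions can look like, the Python below splits a string on the chapter tag found in the Equiano file. Several things here are assumptions rather than facts about the DocSouth encoding: that each chapter ends at the next `</div2>`, that `<div2>` elements are never nested, and that the argument is wrapped in an `<argument>` element. The function names and the sample text are hypothetical.

```python
import re

# The opening tag as it appears in the Equiano file; everything up to
# the next </div2> is treated as one chapter (assumes no nested <div2>).
CHAPTER = re.compile(
    r'<div2 type="chapter" org="uniform" sample="complete" part="N">'
    r"(.*?)</div2>",
    re.DOTALL,
)
# Assumption: the chapter summary is marked up as <argument>...</argument>.
ARGUMENT = re.compile(r"<argument>(.*?)</argument>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Drop markup and collapse whitespace, leaving plain text."""
    return " ".join(TAG.sub(" ", fragment).split())


def split_chapters(xml_text):
    """Return one dict per chapter, with the argument kept apart from the text."""
    chapters = []
    for match in CHAPTER.finditer(xml_text):
        body = match.group(1)
        argument = ARGUMENT.search(body)
        chapters.append({
            "argument": strip_tags(argument.group(1)) if argument else "",
            "text": strip_tags(ARGUMENT.sub(" ", body)),
        })
    return chapters


# Hypothetical two-chapter sample, not taken from the DocSouth files.
sample = (
    '<div2 type="chapter" org="uniform" sample="complete" part="N">'
    "<head>Chapter one</head><argument><p>A summary.</p></argument>"
    "<p>First body paragraph.</p></div2>"
    '<div2 type="chapter" org="uniform" sample="complete" part="N">'
    "<head>Chapter two</head><p>Second body paragraph.</p></div2>"
)
chapters = split_chapters(sample)
print(len(chapters), chapters[0]["argument"])
```

Keeping the argument separate from the chapter text leaves open whether the summaries are included in what eventually gets fed to MALLET.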

But then I found a problem with my tidy solution: The regular expression I had used to find chapter headings in Equiano did not work when applied to another lengthy text I knew had chapters, one of the narratives by William Wells Brown. The two texts used different code to mark chapters:

  • Equiano: <div2 type="chapter" org="uniform" sample="complete" part="N">
  • Brown: <div2 type="chapter">
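The mismatch between these two tags is easy to reproduce. In the Python sketch below, `strict` is a pattern built from the literal Equiano tag, and `tolerant` is one possible loosening that accepts any `<div2>` whose attributes include `type="chapter"`. Both pattern names are illustrative, and the tolerant version only addresses this one discrepancy, not every encoding difference in the collection.

```python
import re

EQUIANO_TAG = '<div2 type="chapter" org="uniform" sample="complete" part="N">'
BROWN_TAG = '<div2 type="chapter">'

# Strict: the literal Equiano tag, extra attributes and all.
strict = re.compile(re.escape(EQUIANO_TAG))

# Tolerant: any <div2 ...> tag that carries type="chapter",
# whatever other attributes come before or after it.
tolerant = re.compile(r'<div2\b[^>]*\btype="chapter"[^>]*>')

results = {}
for label, tag in (("Equiano", EQUIANO_TAG), ("Brown", BROWN_TAG)):
    results[label] = (bool(strict.search(tag)), bool(tolerant.search(tag)))
    print(label, "strict:", results[label][0], "tolerant:", results[label][1])
```

The strict pattern matches only the Equiano tag, while the tolerant one matches both.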

The problem was broader than this one discrepancy. The texts included in the DocSouth collection appear to have been encoded differently, depending on the funding source. The different funders can be seen in the URLs of the documents.

After a comparison of prominent features like the title pages, chapter breaks, and headings, I determined that many of the tags for major structural elements were not consistent across all of the texts.

Moreover, many of the pieces did not have chapters at all! Perhaps this should have been obvious from the start, but the great diversity of the narratives extends to the form and length of the documents. While Equiano’s narrative is novel-length and divided into chapters, it proved to be exceptional in its length and structure. Many of the pieces were shorter, published as broadsides or pamphlets, and had no chapters or formal sections to speak of. As a result, my initial idea for breaking up the works by chapter would not work.

by Matthew McClellan