Preparing Data for Mallet

Since Outwit simply mined everything enclosed in paragraph tags, there was bound to be some junk data that did not belong with the text I wanted to analyze. At the beginning of each text, several “paragraphs” of header information appeared, including the copyright notice, descriptions of the transcription and encoding process, and more. Although the headers did not all contain exactly the same content, every header began with the same first line and ended with the same last line. That made it straightforward to remove these unwanted lines with regular expressions.

Common "junk" lines removed
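
The header removal described above can be sketched roughly as follows. This is a minimal Python illustration, not the exact expressions I ran: the two marker patterns (`HEADER_START`, `HEADER_END`) and the function name `strip_header` are hypothetical stand-ins, since the actual first and last header lines are not reproduced in this post. It assumes each text is available as a list of paragraph strings.

```python
import re

# Hypothetical patterns: stand-ins for the real first and last header lines,
# which are not quoted in this post.
HEADER_START = re.compile(r"^\s*Copyright\b", re.IGNORECASE)
HEADER_END = re.compile(r"^\s*End of header\s*$", re.IGNORECASE)


def strip_header(paragraphs, start=HEADER_START, end=HEADER_END):
    """Return the paragraphs with the header block removed.

    The header block runs from the first paragraph matching `start`
    through the next paragraph matching `end`, inclusive. If either
    marker is missing, the paragraphs are returned unchanged.
    """
    start_idx = next(
        (i for i, p in enumerate(paragraphs) if start.search(p)), None
    )
    if start_idx is None:
        return list(paragraphs)

    end_idx = next(
        (i for i in range(start_idx, len(paragraphs)) if end.search(paragraphs[i])),
        None,
    )
    if end_idx is None:
        return list(paragraphs)

    # Keep everything before the header and everything after it.
    return paragraphs[:start_idx] + paragraphs[end_idx + 1:]


sample = [
    "Copyright 2010 Example Library",
    "Transcribed and encoded by hand.",
    "End of header",
    "[Title Page Image]",
    "Chapter one text.",
]
cleaned = strip_header(sample)
```

Because the contents between the two markers are never inspected, this approach works even when the headers differ from text to text, as long as the first and last lines stay constant.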

After getting rid of these common lines, some junk data remained: the full text of the first entry in my “clean” spreadsheet reads “[Title Page Image]”. Lines like this appeared to be few and far between, and too far outnumbered by meaningful text to influence the topic models, so I decided against writing more regular expressions to remove the remaining junk entries.

Now the text was ready for topic modeling!

by Matthew McClellan