MALLET: Proof Reader and Topic Modeler

[Figure: Topic full of semicolons]

[Figure: Sample culprit entries]

My first attempt at topic modeling produced one topic composed entirely of words joined by semicolons.

For some reason, my OutWit scraper had picked up phantom semicolons in certain paragraphs. After a brief replacement with regular expressions, I was back in business.
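For readers facing a similar cleanup, here is a minimal sketch of that kind of replacement in Python. The pattern is an assumption based only on the symptom described above (words joined by semicolons); the function name is hypothetical, and this is not necessarily the expression I used.

```python
import re

# Assumption: a "phantom" semicolon sits directly between two word
# characters (word;word), unlike a normal semicolon, which is followed
# by a space.
PHANTOM_SEMICOLON = re.compile(r"(?<=\w);(?=\w)")

def strip_phantom_semicolons(text: str) -> str:
    """Replace semicolons wedged between word characters with a space."""
    return PHANTOM_SEMICOLON.sub(" ", text)

print(strip_phantom_semicolons("the;court;ruled that; in part"))
```

Ordinary semicolons followed by a space are left alone, so legitimate punctuation in the paragraphs should survive the cleanup.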

After experimenting with settings in MALLET, I found that the following command produced the results that seemed most useful for my initial analysis:

bin\mallet train-topics --input 11_28.mallet --num-threads 2 --num-topics 40 --optimize-interval 10 --output-model 11_28.model --output-doc-topics 11_28_composition.txt --output-topic-keys 11_28_keys.txt --output-state 11_28.gz

Perhaps the most important option above is “--optimize-interval.” If that option is omitted, MALLET assumes that each of the topics occurs at a similar frequency across the corpus. With it, MALLET drops this assumption (here it re-estimates its topic weights every 10 iterations) and is free to construct topics of differing frequencies. This turned out to have enormous implications for my dataset!
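To make the "similar frequency" assumption concrete, here is a small sketch of the arithmetic behind it. In a topic model of this kind, each topic has a prior weight, and a topic's expected share of a document is its weight divided by the sum of all weights. The total weight of 5.0 is my understanding of MALLET's default and should be treated as an assumption; the uneven weights are made up purely for illustration.

```python
def expected_topic_shares(weights):
    """Expected share of each topic: its weight over the total weight."""
    total = sum(weights)
    return [w / total for w in weights]

# Without optimization: 40 topics share one total weight equally
# (assumed total of 5.0, so each topic gets 5.0 / 40 = 0.125).
equal = expected_topic_shares([5.0 / 40] * 40)
print(equal[0])

# With optimization the weights may differ; these values are hypothetical.
uneven = expected_topic_shares([2.0, 1.0, 1.0])
print(uneven)
```

With equal weights, every topic is expected to claim the same small slice of each paragraph; once the weights are allowed to differ, some topics can be common and others rare.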

When MALLET constructed topics with equal frequencies, none of my topic models produced even a single paragraph that was at least 70 percent associated with a topic, and none produced more than 40 paragraphs that were at least 50 percent associated with a topic.

When MALLET was freed to find topics of different frequencies, it identified 28,500 paragraphs that were at least 50 percent associated with one topic, and 254 paragraphs at least 90 percent associated with a single topic.
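Counts like these can be pulled from the file written by --output-doc-topics. The sketch below assumes the layout that recent MALLET versions write: one tab-separated row per document holding an index, a name, and then one proportion per topic. Older versions write topic and proportion pairs instead and would need a different parser. The function names and sample rows are hypothetical.

```python
def max_topic_shares(lines):
    """Yield the largest topic proportion found on each document row."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and a header comment, if present
        fields = line.split("\t")
        # Assumed layout: index, name, then one proportion per topic.
        shares = [float(value) for value in fields[2:]]
        if shares:
            yield max(shares)

def count_at_least(lines, threshold):
    """Count rows whose strongest topic meets or exceeds the threshold."""
    return sum(1 for share in max_topic_shares(lines) if share >= threshold)

# Made-up rows standing in for a real composition file.
sample = [
    "#doc\tname\ttopic proportions",
    "0\tpara_0\t0.92\t0.05\t0.03",
    "1\tpara_1\t0.40\t0.35\t0.25",
    "2\tpara_2\t0.10\t0.55\t0.35",
]
print(count_at_least(sample, 0.5), count_at_least(sample, 0.9))
```

To run it on real output, pass an open file instead of the sample list, for example open("11_28_composition.txt", encoding="utf-8").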

by Matthew McClellan