Visualizing a massive dataset: Technical Obstacles

[Image: 12_08 broke excel.png]

The limits of Excel

The size of this dataset poses a major practical obstacle to visualization and analysis. I am using Tableau Public to create my visualizations. It is a relatively lightweight, user-friendly program with robust tools for hosting, sharing, and embedding dynamic charts. Unlike FileMaker Pro, however, it is not a robust database management program, and it struggles to process the ~1.4 GB CSV containing the information for each of the 40 topics in each of the 100,000 paragraphs of the corpus. Attempts to work with the whole corpus reliably froze Tableau. I had braced myself for this possibility after Excel refused to open the CSV I created from MALLET's output (the spreadsheet had about 3.8 million rows, and Excel's limit is just over 1 million rows).

In the short run, I have chosen to respond to this challenge by limiting my visualization and analysis to subsets. These subsets are limited either by topic breadth (few topics, whole timespan) or by chronology (few years/samples, many or all topics).
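
One way to cut subsets like these without ever opening the full file is to stream the CSV one row at a time and copy only the matching rows to a smaller file. The following is a minimal sketch in Python, not a record of the exact procedure used here. The function name `write_subset` and the column names `topic` and `year` are assumptions for illustration; the actual layout of the CSV built from MALLET's output may differ.

```python
import csv


def write_subset(src_path, dst_path, topics=None, years=None):
    """Stream src_path row by row and copy only matching rows to dst_path.

    topics and years are sets of strings to keep; None means "keep all".
    The column names "topic" and "year" are assumed, not taken from the
    real file, and would need to match its actual header.
    Returns the number of rows written (excluding the header).
    """
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Only one row is held in memory at a time.
            if topics is not None and row["topic"] not in topics:
                continue
            if years is not None and row["year"] not in years:
                continue
            writer.writerow(row)
            kept += 1
    return kept
```

With hypothetical file names, a "few topics, whole timespan" subset would be `write_subset("corpus.csv", "topics_3_17.csv", topics={"3", "17"})`, and a "few years, all topics" subset would be `write_subset("corpus.csv", "years_1861_1862.csv", years={"1861", "1862"})`. Either output should be small enough for Tableau Public or Excel to open.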

My immediate response to this is disappointment, as backing away from the massive, detailed dataset seems to betray the primary purpose of computer-aided distant reading. In the long run, I hope that my further experiments in visualization with these subsets will lead me to more interesting methods of analyzing the whole corpus.

by Matthew McClellan