Visualizing a Massive Dataset: What Good Are Sparklines?

In general, sparklines are useful for presenting condensed portraits of change over time. They typically have no labeled axes, and each one is scaled to its own data, so they emphasize the pattern of variation rather than absolute frequency or magnitude.
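To illustrate what it means for each sparkline to carry its own scale, here is a minimal Python sketch that renders a series as a one-line text sparkline. The function name `sparkline` and the sample values are hypothetical; they are not taken from this project's data or code. Each series is rescaled to its own minimum and maximum, which is why two sparklines can look identical even when their magnitudes differ widely.

```python
# Hypothetical illustration; not the code used to produce the charts discussed here.
BLOCKS = "▁▂▃▄▅▆▇█"  # eight block characters, from lowest to highest


def sparkline(values):
    """Render a sequence of numbers as a one-line text sparkline.

    The series is rescaled to its own min and max, so the result shows
    the shape of change rather than absolute magnitude.
    """
    values = list(values)
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series has no variation to show; use the lowest block.
        return BLOCKS[0] * len(values)
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[round((v - lo) / span * top)] for v in values)


# Two made-up series with the same shape but very different magnitudes.
small = [0.01, 0.02, 0.04, 0.03, 0.01]
large = [100, 200, 400, 300, 100]
print(sparkline(small))
print(sparkline(large))
```

Both calls print the same row of blocks, because only the pattern survives the rescaling. That is the trade-off a sparkline makes: it is easy to compare shapes, and impossible to compare magnitudes without additional labels.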

With nearly 100,000 paragraphs in this dataset, there may be too many data points for sparklines to serve their intended purpose. It is difficult to generalize about the whole corpus from these graphs alone. Part of the difficulty comes from data density: the combination of high data resolution and low visual resolution makes the graphs hard to decode. As a result, I have broken at least one rule of sparklines and made mine considerably larger than normal for readability.

I hope these tools of distant reading help us gain insights about the whole corpus, but I hesitate to draw conclusions when the charts show no obvious trends. They are, however, good at identifying outliers and inflection points. Based on these sparklines, can I infer whether the genre became more or less violent over time? Or more or less Christian? These questions are not really answerable with my topic modeling output, nor would the answers be particularly interesting to historians, literary critics, or other readers. Moreover, works with fewer paragraphs are more prone to outlier averages, and subsequent close reading of the source materials would challenge any hasty generalizations based on such small samples.
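The caution about small samples can be sketched in Python. This is a rough illustration under stated assumptions, not the method used for this project: the function name `flag_outlier_works`, the thresholds `z_threshold` and `min_paragraphs`, and the sample data are all hypothetical. It flags works whose average topic weight sits far from the average across works, and reports the paragraph count alongside each one so that outliers resting on only a few paragraphs can be treated skeptically.

```python
from statistics import mean, pstdev


def flag_outlier_works(works, z_threshold=2.0, min_paragraphs=30):
    """Flag works whose average topic weight is unusually high or low.

    `works` maps a title to a list of per-paragraph topic weights.
    Returns tuples of (title, average, paragraph_count, is_small_sample).
    Both thresholds are arbitrary illustrative defaults.
    """
    averages = {title: mean(w) for title, w in works.items() if w}
    center = mean(list(averages.values()))
    spread = pstdev(list(averages.values()))
    if spread == 0:
        return []  # every work has the same average; nothing stands out
    flagged = []
    for title, avg in averages.items():
        z = (avg - center) / spread
        if abs(z) >= z_threshold:
            n = len(works[title])
            flagged.append((title, avg, n, n < min_paragraphs))
    return flagged


# Made-up data: nine long works with a low topic weight, one very short
# work with a high one.
works = {f"Work {i}": [0.05] * 200 for i in range(1, 10)}
works["Short work"] = [0.6, 0.6, 0.6]

flagged = flag_outlier_works(works)
for title, avg, n, small in flagged:
    note = " (small sample, read closely before generalizing)" if small else ""
    print(f"{title}: average {avg:.2f} over {n} paragraphs{note}")
```

In this invented example the only flagged work is the three-paragraph one, which is exactly the kind of outlier that calls for close reading rather than a conclusion about the genre.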

These outliers and inflection points may provide rich grounds for further analysis via close reading. For some of these, I have begun assembling visualizations that are finer-grained than sparklines but coarser than a direct reading of the text itself. Let’s call this medium reading for now.

I have tried to address the problem of reading cluttered static sparklines for the whole corpus by presenting dynamic sparklines for select topics, which can be filtered by date range. I have also visualized a small subset of the dataset, the first eight narratives, with data at the paragraph level.

by Matthew McClellan