Phase 2: Formatting the Data

After I exported the data from OutWit Hub to a spreadsheet, it looked like this:

[Image: Unformatted data exported from OutWit Hub]

After formatting with regular expressions, the data now looks like this:

[Image: Formatted data, ready for entry into a database]

I will describe how I did this in two main stages:

  1. Formatting the Date and Place column
  2. Formatting the Contents column

I will summarize the steps I took in the first stage, then walk through the second stage in more detail.

But first, the overarching plan. As the first picture shows, the unformatted data is separated into three main columns corresponding to the categories used in designing the scraper: Source, Date and Place, and Contents. The formatted data is organized into seven columns: Source, Year, Month and Day, Place, Creator, Recipient, and Contents. It was a great challenge to format nearly 5,000 separate, varying entries so as to extract the additional information wherever it was included (some entries have information for all seven columns; others have less). But it is my hope that the larger number of categories will eventually enable more ways to manipulate and use the data.
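
To make the three-to-seven column plan concrete, here is a minimal sketch in Python. It is not the procedure I actually used, and the entry layouts it assumes ("1863 March 4, Washington" for Date and Place, "Creator to Recipient. Text" for Contents) are hypothetical stand-ins rather than the real export format. The function name `split_entry` and both patterns are likewise illustrative only. The point is simply that optional groups let one pattern handle entries that carry less information.

```python
import re

# ASSUMPTION: "Date and Place" looks like "1863 March 4, Washington".
# Only the year is required; month/day and place are optional.
DATE_PLACE = re.compile(
    r"^(?P<year>\d{4})"                                       # four-digit year
    r"(?:\s+(?P<month_day>[A-Z][a-z]+\.?(?:\s+\d{1,2})?))?"   # optional month and day
    r"(?:,\s*(?P<place>.+))?$"                                # optional place after a comma
)

# ASSUMPTION: "Contents" may open with "Creator to Recipient." before the text.
CONTENTS = re.compile(
    r"^(?P<creator>[^.]+?)\s+to\s+(?P<recipient>[^.]+)\.\s*(?P<rest>.*)$"
)


def split_entry(source, date_place, contents):
    """Map one three-column entry onto the seven target columns."""
    row = {
        "Source": source,
        "Year": "",
        "Month and Day": "",
        "Place": "",
        "Creator": "",
        "Recipient": "",
        "Contents": contents.strip(),
    }

    date_match = DATE_PLACE.match(date_place.strip())
    if date_match:
        row["Year"] = date_match.group("year")
        row["Month and Day"] = date_match.group("month_day") or ""
        row["Place"] = (date_match.group("place") or "").strip()

    contents_match = CONTENTS.match(contents.strip())
    if contents_match:
        row["Creator"] = contents_match.group("creator").strip()
        row["Recipient"] = contents_match.group("recipient").strip()
        row["Contents"] = contents_match.group("rest").strip()

    return row


# Made-up entries, for illustration only.
full = split_entry("Vol. 1", "1863 March 4, Washington",
                   "John Smith to Mary Jones. Letter about supplies.")
sparse = split_entry("Vol. 1", "1863", "Note on the weather")
```

An entry that does not match a pattern keeps empty strings in the columns it cannot fill, which mirrors the situation described above where some entries have less information than others.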

by Patrick Meehan