The Date and Place Column

3.2.PNG

Intermediate step in the formatting of the Date and Place column

The Date and Place column combines several elements in one line. The day is arguably a candidate for its own category, but I chose to separate three constituent elements: the year, the month and day, and the listed place of the document’s origin. I used regular expressions to identify these three elements in each line, taking into account the considerable variety from line to line, and marked each with its own pair of tags: <y></y> for the year; <m></m> for the month and day; and <P></P> for the place, as seen in the image above.

I could then replace the boundaries between these tags with tabs and import the result into a spreadsheet as tab-separated columns.

This was more complicated than it sounds because of the inconsistency of the data in each entry. Some entries, for instance, gave a range of years in which the document could have been issued. Another common complication was the use of brackets to indicate ambiguity, whether for the year, the month or day, the entire date, the place, or any combination thereof. Because it was difficult to find consistent patterns across 4,959 entries, using regular expressions effectively took a lot of creative trial and error.

3.3.PNG

First column contains unformatted date/place content, followed by three new columns exported from the word-processing file

In sum, this is how I did it:

  1. To start, I imported the data as plaintext into a word processor with fairly robust support for regular expressions (LibreOffice).
  2. I placed the opening tag <y> at the beginning of each line, assuming that each entry began with the year.
  3. Then I placed the closing tag </y> after each string of four digits followed by a space and a string of letters. There were complications, especially because of the ways that the sources’ editors marked ambiguity: ranges (1340-1345), brackets ([1227 Dezember – 1229 April]), the / character (e.g. 1427/28), and so on. In addition, words occasionally modify the year or month, such as ca. for circa or wohl (meaning “probably”). I found ways to deal with these kinks, though there are doubtless still imperfections in the formatted data.
  4. I used the tag <m> before the first letter-character following the </y> tag.
  5. The closing tag </m> came next, using a regular expression that relies on the consistent German convention of placing a period (.) after the day. A main complication here (and throughout this entire phase) was that many entries had no day or month and instead included “o.M.” or “o.T.”, meaning ohne Monat and ohne Tag (“without month” and “without day”). I adjusted for this by first eliminating the period between the letters of the abbreviation (i.e. o.M. changed to oM.), then writing a regular expression to place the </m> tag after the last period preceding a string of letters and spaces without digits (since place names do not contain digits).
  6. The remaining steps are simple. The <P> tag went immediately after the </m> tag, and the </P> tag went at the end of each line.
  7. Once every line had three sets of tags, I placed a tab between each separated element (i.e. in between every >< combination).
  8. Finally, I deleted all the tags and imported the finished result to a spreadsheet, which automatically created three columns corresponding to the tabs.
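
As a rough illustration, the steps above could be sketched in Python instead of LibreOffice's find-and-replace. This is a simplified sketch under stated assumptions, not the procedure actually used: the sample entries are invented, the function names tag_entry and tags_to_tsv are hypothetical, and the year pattern only covers plain years, ranges (1340-1345) and the / form (1427/28). Bracketed dates and qualifiers such as ca. or wohl would need further patterns.

```python
import re

# Invented sample entries in the "year month day. place" shape described above.
SAMPLE_ENTRIES = [
    "1340 Dezember 3. Wien",
    "1427/28 Mai 12. Prag",
    "1340-1345 o.T. Graz",
]

def tag_entry(line):
    """Wrap year, month/day and place in <y>, <m> and <P> tags (steps 2-6)."""
    # Step 5 preparation: o.M. -> oM. and o.T. -> oT., so that the
    # abbreviation carries only one period.
    line = re.sub(r"\bo\.([MT])\.", r"o\1.", line)
    # Step 2: assume every entry begins with the year.
    line = "<y>" + line
    # Steps 3-4: close the year after four digits (optionally a range or
    # "/28" suffix) followed by a space and a letter, then open <m>.
    line = re.sub(
        r"(\d{4}(?:[-/]\d{2,4})?) (?=[A-Za-zÄÖÜäöüß])",
        r"\1</y><m>",
        line,
        count=1,
    )
    # Steps 5-6: the last period followed only by non-digit characters ends
    # the month/day; everything after it is treated as the place.
    line = re.sub(r"\.\s*([^\d.]+)$", r".</m><P>\1</P>", line, count=1)
    return line

def tags_to_tsv(tagged):
    """Put a tab between adjacent tags, then delete the tags (steps 7-8)."""
    tabbed = tagged.replace("><", ">\t<")
    return re.sub(r"</?[ymP]>", "", tabbed)

if __name__ == "__main__":
    for entry in SAMPLE_ENTRIES:
        print(tags_to_tsv(tag_entry(entry)))
```

Each printed line has three tab-separated fields, which a spreadsheet would import as three columns. A place name containing a period (an abbreviation, for example) would defeat the last pattern, which is one reason the real data needed many intermediate adjustments.
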
3.4.JPG

The unclosed brackets problem

It is worth emphasizing that these steps summarize the overall arc of the process, but leave out much of the detail as well as the many intermediate steps along the way to adjust for inconsistencies in the data. There was also a good deal of clean-up to do using regular expressions after importing to a spreadsheet. Many of the date entries, for instance, were left with unclosed brackets (see left).

To deal with this particular issue, I used the following regular expressions to search:

       For the Year column: (\[[ a-z0-9]+)(?!\])(\n)

       For the Month/Day column:  ^(?!\])([a-zA-Z0-9\.\-\?äöüß]+)(\])

I then placed the missing bracket at the open end: a closing bracket at the end of the Year entry, or an opening bracket at the beginning of the Month/Day entry.
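
The same clean-up could be sketched in Python. This is only an illustration with simpler patterns than the ones quoted above: it assumes a bracketed date was split so that the opening bracket stayed in the Year cell and the closing bracket ended up in the Month/Day cell. The function name close_brackets and the sample values are hypothetical.

```python
import re

def close_brackets(year, month_day):
    """Balance brackets that were split across the Year and Month/Day cells."""
    # Year cell opens a bracket that is never closed: append "]".
    if re.search(r"\[[^\]]*$", year):
        year = year + "]"
    # Month/Day cell closes a bracket it never opened: prepend "[".
    if re.search(r"^[^\[]*\]", month_day):
        month_day = "[" + month_day
    return year, month_day

if __name__ == "__main__":
    # Invented example of a bracketed date after the split into columns.
    print(close_brackets("[1227", "Dezember 3.]"))
    # Cells whose brackets are already balanced are left untouched.
    print(close_brackets("[1340]", "Mai 12."))
```

Checking each cell separately keeps the fix local to a single column, at the cost of missing cases where a bracket spans all three columns.
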

by Patrick Meehan