The Contents column

Each document in the Preußisches Urkundenbuch is accompanied by a brief description of its contents written by the editors. I realized that although these descriptions are too summary and contrived to be useful for close analysis, they include vital elements of information that could be potential fields in a database. Take the following entry (PrUB, JS-FS 10) dated January 6, 1383:

Großfürst Jagiello [Jogaila] von Litauen an Hochmeister Konrad Zöllner von Rotenstein: erläutert, daß er die Fürsten Witold [Vytautas] und Tokwyl von Litauen, die sich beim Orden aufhalten, nicht wieder ins Land lassen kann, weil er ihnen kein Vertrauen entgegenbringt, erklärt sich bereit, mit den Herzögen von Masowien bis Ostern einen Waffenstillstand einzugehen, wenn diese die arrestierten Waren und Kaufleute aus Wilna [Vilnius] freigeben, hat nichts gegen die Zahlungen für die Burg Wizna, wirft dem Orden aber vor, die Gegner Litauens zu unterstützen. Die Samaiten haben sich Jagiello und seinem Bruder Skirgal unterworfen, so daß der Hochmeister ihre Vertreter nicht mehr zu sich rufen braucht.

Just from the first line of text, it is obvious that this is a letter from the Grand Prince Jagiello of Lithuania written to the Grandmaster of the Teutonic Order, Konrad Zöllner von Rotenstein. Reading on, the description reveals some of the details of this particular diplomatic exchange, but for the purposes of this database, the first line alone provides two significant details: the name of the agent behind the document’s creation (Grand Prince Jagiello) and its intended audience or recipient (Hochmeister Konrad Zöllner von Rotenstein). In fact, the vast majority of the 4,959 entries include a “creator” (the first name or list of names in the description). Reference to “audience/recipients” is not common until entries for the late fourteenth century, but from then on, most entries do include this.

But considering the vast number of names and combinations of information in the contents, how is it possible to systematically extract this information from thousands of entries? At first I thought I might be able to make a list of all the possible titles (Pope, Bishop, King, Duke, Duchess, Grandmaster, etc.), but this quickly proved futile. Instead, the key is in two essential components of the first line of text: the word an designating the recipient, and the verb describing the intent of the sender or the document (here, it is erläutert). So, I separated out these two fields from the rest of the contents with two powerful regular expressions.

First, I played off the syntactical rule of German whereby the subject of a sentence is normally its first component, and the verb is its second; everything else comes after. By and large, the descriptions fit this rule. Moreover, since all relevant verbs are in the third person, they will end either in “t” or “en” (with a few exceptions). I wrote a regular expression to identify all lower-case words ending in “t” or “en”, with the condition that they be at least four letters long (five for those ending in “en”) in order to avoid the common article den and the preposition mit:

^.+(?=(\b[a-zäöüß]{3,}(t|en)\b))
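To see what this expression picks up, here is a minimal Python sketch run against the opening of the entry quoted above. It assumes Python's re module rather than whatever tool the searches were originally run in, and the function names are invented for illustration. find_candidates uses only the word pattern from inside the lookahead; prefix_before_last_candidate applies the full anchored expression, whose greedy .+ (in Python, at least) stops just before the last candidate on a line rather than the first.

```python
import re

# The full expression exactly as given in the text.
ANCHORED = re.compile(r"^.+(?=(\b[a-zäöüß]{3,}(t|en)\b))")

# Only the word pattern from inside the lookahead (group made non-capturing
# so that findall returns whole words).
CANDIDATE = re.compile(r"\b[a-zäöüß]{3,}(?:t|en)\b")


def find_candidates(line):
    """Return every lower-case word that the pattern treats as a possible verb."""
    return CANDIDATE.findall(line)


def prefix_before_last_candidate(line):
    """Return the text the anchored expression matches (empty if no match)."""
    match = ANCHORED.match(line)
    return match.group(0) if match else ""


# Opening of the sample entry (PrUB, JS-FS 10), truncated for brevity.
sample = (
    "Großfürst Jagiello [Jogaila] von Litauen an Hochmeister Konrad Zöllner "
    "von Rotenstein: erläutert, daß er die Fürsten Witold [Vytautas] und "
    "Tokwyl von Litauen, die sich beim Orden aufhalten"
)

if __name__ == "__main__":
    print(find_candidates(sample))
    print(prefix_before_last_candidate(sample))
```

Capitalized nouns such as Fürsten are skipped because the character class only allows lower-case letters, while den and mit are too short to qualify.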

The “names” portion of contents descriptions

I placed an identifying tag (the symbol < ) before each verb. This made it easy to write an expression to insert a tab before the first < in every line, thus separating the relevant names from the rest of the description. In plaintext, the “name” portion looked like this:

I cleaned up the results, removing extraneous symbols and numbers, for instance. I also marked the anomalies of the bunch with an easily recallable and searchable symbol (% in this case); one example is lines that started with a lower-case letter and thus not with a name.
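The tagging, tab-splitting and flagging steps described above could be sketched in Python roughly as follows. This is an illustration under assumptions, not a record of the original workflow: the function names are invented, the % is assumed to go at the start of the line, and the word pattern is the one from the verb expression.

```python
import re

# Word pattern taken from the verb expression; group made non-capturing.
CANDIDATE = re.compile(r"\b[a-zäöüß]{3,}(?:t|en)\b")


def tag_candidates(line):
    """Place the tag '<' in front of every possible verb."""
    return CANDIDATE.sub(lambda m: "<" + m.group(0), line)


def tab_before_first_tag(tagged):
    """Insert a tab before the first '<' only."""
    return tagged.replace("<", "\t<", 1)


def flag_anomaly(names):
    """Prefix '%' when the names portion starts with a lower-case letter."""
    return "%" + names if names[:1].islower() else names


def names_column(line):
    """Return the (possibly flagged) names portion of one description."""
    split_line = tab_before_first_tag(tag_candidates(line))
    names = split_line.split("\t", 1)[0].strip()
    return flag_anomaly(names)


if __name__ == "__main__":
    line = (
        "Großfürst Jagiello [Jogaila] von Litauen an Hochmeister "
        "Konrad Zöllner von Rotenstein: erläutert, daß er"
    )
    print(names_column(line))
```

Note that the colon stays attached to the names portion at this stage.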

The next step was to separate the “creators” from the “recipients” by relying on the convention of using the word an after the first name or set of names and before the second. The following regular expression identified each an:

^(.+)( )(an )

and the following one replaced the an with the more easily identifiable <AN >, while also inserting a tab to separate “creator” and “recipient” into two categories:

$1\t<AN >
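For readers working in Python rather than a text editor, the same search and replace might look like the sketch below. Python writes the backreference as \1 instead of $1; the function name and the choice to strip the <AN > marker again after splitting are assumptions made for the example.

```python
import re

# The search expression as given in the text.
AN_PATTERN = re.compile(r"^(.+)( )(an )")


def split_creator_recipient(names):
    """Split a names portion into (creator, recipient) at the word 'an'.

    Returns (names, "") when no ' an ' is present.
    """
    # Python equivalent of the replacement $1\t<AN >
    marked = AN_PATTERN.sub(r"\1\t<AN >", names, count=1)
    if "\t" not in marked:
        return names, ""
    creator, rest = marked.split("\t", 1)
    return creator, rest.replace("<AN >", "", 1)


if __name__ == "__main__":
    names = (
        "Großfürst Jagiello [Jogaila] von Litauen an "
        "Hochmeister Konrad Zöllner von Rotenstein:"
    )
    print(split_creator_recipient(names))
```

Because .+ is greedy, a names portion containing more than one “ an ” would be split at the last occurrence, so such lines are worth checking by hand.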

Now the single Contents column has been divided into three. The first column (Creator) did not require too much clean-up right away. The “AN-column” required extensive clean-up, however, and I will describe two major tasks involved in this:

Dealing with Colons

Many contents descriptions (especially those for later entries) include colons which separate the personal information from the actual contents, such as in the above example of the Grand Prince and the Grandmaster. Sometimes, words following the colon (and thus not part of the “personal” information) ended up in the AN-column. To correct for this, I used the following regular expression to identify all colons:

( )?(:)( )?

Then I inserted a tab before it in order to separate it from the recipient-category. This was the start of a temporary column of entries that would ultimately be joined back up with the Description column.
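A rough Python version of this colon step is sketched below. The function name is invented, the input string is made up for illustration from words in the sample entry, and the sketch assumes that only the first colon in a cell matters.

```python
import re

# The colon expression as given in the text.
COLON = re.compile(r"( )?(:)( )?")


def split_at_colon(cell):
    """Split an AN-column cell into (recipient, overflow) at the first colon.

    The overflow keeps its leading colon so it can later be rejoined with
    the Description column. Returns (cell, "") when there is no colon.
    """
    # Insert a tab before the colon, dropping any space in front of it
    # and keeping any space after it.
    marked = COLON.sub(lambda m: "\t:" + (m.group(3) or ""), cell, count=1)
    recipient, _, overflow = marked.partition("\t")
    return recipient, overflow


if __name__ == "__main__":
    # Made-up cell in which words after the colon slipped into the AN-column.
    print(split_at_colon("Hochmeister Konrad Zöllner von Rotenstein: hat nichts"))
```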

Dealing with “False Ends” and Missing Information

The second main task in “cleaning up” the AN-column stemmed from the imperfection of the regular expression used to identify verbs, namely:

^.+(?=(\b[a-zäöüß]{3,}(t|en)\b))

Perhaps the biggest problem with this expression is that it identified all words at least five letters in length ending in –en. In addition to certain verb conjugations, two other huge syntactic categories include words ending in –en: many plural nouns, and many adjectival declensions. Thus, the regular expression cut off some entries too early, before an actual verb. I call these the “False Ends.”

I doubt that I have yet fully corrected for this, but I have done a significant amount of clean-up by searching for a group of “triggers” often followed by a word ending in –en. These words include: die, der, des, seines, seinen, einem, dem. Two other common “false ends” were the word und and the very common word betreffend, meaning “concerning”, and thus a sort of verbal parallel to the function of the colon, as described above.

I searched for these terms individually, added the symbols >> to help identify them, and then went through case by case to pull out information that had been dragged into the Description column. Time-consuming, but effective.
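The marking pass could be approximated in Python as below. This is only a sketch: the text does not say where the >> went relative to each term, so placing it directly in front is an assumption, the function names are invented, and the example cell is made up. The case-by-case review still has to be done by a person.

```python
import re

# Trigger words and other common "false ends" listed in the text.
TRIGGERS = (
    "die", "der", "des", "seines", "seinen", "einem", "dem",
    "und", "betreffend",
)
TRIGGER_PATTERN = re.compile(r"\b(" + "|".join(TRIGGERS) + r")\b")


def mark_triggers(cell):
    """Put '>>' in front of every trigger word so it is easy to search for."""
    return TRIGGER_PATTERN.sub(r">>\1", cell)


def needs_review(cell):
    """True when the cell contains at least one trigger word."""
    return TRIGGER_PATTERN.search(cell) is not None


if __name__ == "__main__":
    # Made-up cell that was cut off before a word ending in -en.
    print(mark_triggers("Komtur von Danzig und der"))
```

The word boundaries keep the pattern from firing inside longer words such as wieder or Orden, and the match is case-sensitive, so a capitalized Die at the start of a line is not marked.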

by Patrick Meehan