Dataset and Curation

Since the goal of this project is to better understand how medieval writers thought about the astral sciences, data for this project is drawn from medieval texts in the original Latin. To date, I have worked closely with 16 texts from the twelfth through early fourteenth centuries, ranging in length from about 5 pages to 90 pages:

Adelard of Bath's De Eodem et Diverso (c. 1120)
Hugh of Saint Victor's translations of Epitome Dindimi In Philosophiam and Practica Geometriae (c. 1120-1140)
Hugh of Santalla's translation of Fragmentum Pseudo Aristotelis (c. 1140)
Alexander Neckham's Suppletio Defectuum, book II (c. 1200)
Liber de stellis beibeniis, attributed to Hermes Trismegistus (1218)
John of Sacrobosco's Tractatus De Sphaera (c. 1230)
Michael Scot's translation of al-Bitruji's De Motibus Celorum (c. 1230)
Robert Grosseteste's De Cometis, De Motu Supercelestium, and De Sphaera (c. 1200-1235)
Raymond Llull's Liber Facilis Scientiae and Questiones Factae Supra Librum Facilis Scientiae (1311)

While this dataset could, in theory, grow to include thousands of entries, this initial proof-of-concept project makes use of select Latin manuscripts dating from the twelfth through fourteenth centuries, drawn from the corpus of texts recognized by historians of medieval science, and available at Harvard libraries and online. It excludes works of poetry and literature as well as works in the vernacular, although such sources may eventually be incorporated if the research were to expand its scope beyond discussions among the scholarly elite.

Why these texts?

These authors' works represent the spectrum of information available to literate Europeans in the centuries after knowledge of the astral sciences, which had been developed and refined by Arabic scholars, was translated into Latin. Some of the works I consider are translations themselves, such as those attributed to Hugh of Saint Victor, Hugh of Santalla, and Michael Scot. Some authors of original works (such as Adelard of Bath) present the astral sciences favorably, and others (such as Raymond Llull) sceptically, for various reasons. Having collected data from across this spectrum from which to draw preliminary conclusions, I hope to add works until my dataset includes at least 200 discrete texts. However, due to the tedious process of obtaining "clean" texts described below, reaching that goal will take at least a year.

Medieval conversations about the astral sciences extended beyond their study per se, and played an important role in determining the parameters of what was considered "useful" or "good" knowledge, especially in thirteenth-, fourteenth-, and fifteenth-century Europe (after the so-called Twelfth Century Renaissance). To represent this rich intellectual context, I will also be analyzing to what extent the astral sciences were included in two other medieval topics of conversation. The first of these is writing on "cosmology," which often combines astronomy and philosophy. Pierre d'Ailly's Ymago Mundi (1410), although more properly a work of cosmography, will be included in my preliminary analyses since it incorporates d'Ailly's views on astrology. The second of the two medieval topics I consider has to do with the organization of "the sciences," specifically as they were taught according to the seven liberal arts. John of Salisbury's Metalogicon (twelfth century) will be the first of these works, although I hope to eventually have a dozen works representing influential thinkers' opinions.

Due to the admittedly partial selection of sources, this project does not aim to provide a comprehensive portrait of ideas about the astral sciences, or even a comprehensive sketch based on known writings. Rather, it attempts to (a) build an accessible, meaningful, and useful relational database drawn from key late medieval Latin texts, (b) identify relevant passages in those texts and render them usable through an OCR program, and (c) determine the “tone” of a text by means of textual analytics and quantitative corpus linguistics, before (d) displaying this information in an intuitive format.

A page of Carmody's edited version of Scot's translation of al-Bitruji's De Motibus Celorum (On the Motions of the Heavens). I took this image with my smartphone's camera and then ran it through an OCR program to turn it into searchable text. While it would be ideal to have a flatter page when taking a photo of it, most OCR programs compensate for page curvature... and when your text is 82 pages long (as this one was), sometimes it's okay to sacrifice best practices for the sake of getting through tediously repetitive tasks!


This page from Carmody's 1952 critical edition shows the various manuscripts that he considered when producing the "master" version used in my project. He comments on aberrations or differences among the manuscripts in his footnotes. In using printed materials, I have been able to defer to editors' curatorial decisions about which sections of the manuscript are considered authentic or most authoritative in the chain of transmission.

Gathering Data

For this stage of the project, my data has been drawn almost exclusively from printed text. Gathering large amounts of data quickly has been made fairly easy by first locating printed sources and taking a photo of each relevant page, before rendering that information as useable data.

First, I locate sources using either my existing knowledge (since no survey of medieval history of science texts would be complete without Sacrobosco's De Sphaera or d'Ailly's Imago Mundi!) or a database (such as Medieval Sources Bibliography or a research library's catalog). Next, I determine the relevant sections of those texts by skimming the Latin, and take pictures of relevant sections (for which my smartphone's camera has sufficed).

Fortunately, my use of printed sources implicitly addresses a curatorial concern which will reappear when I begin drawing data directly from the archives: which version of a manuscript is considered authoritative? Most printed sources include a list of extant manuscripts in their introduction to the printed text, as shown in the second image on the right. The editors of those critical editions note discrepancies among manuscript versions in the footnotes, and make executive decisions about which manuscript (or family of manuscripts) to privilege when producing their "master" version. Were I to analyze a manuscript held in one of Harvard's libraries, for example, I would have to justify my use of that manuscript over other extant versions, but relying on printed critical editions relieves me of that burden... for now!

One of my current texts, Sacrobosco's Tractatus De Sphaera, was most readily available online, so I simply copied its text directly from a webpage. (Although the site claimed to host a downloadable PDF, the site was not well maintained and the link was broken, so I had to highlight text on the webpage directly and copy-paste it into a word processor.) Unfortunately, this process required me to manually alter words which had been visually incorporated into the text.

Another text, d'Ailly's Imago Mundi, was available as a bilingual digitized book, so I toggled to each Latin page and captured screenshot images, which left me with the same forms of data as most of my printed-and-photographed material.

Locating these printed texts requires a surprising amount of time, so I anticipate that the collection and database-building work that I am doing will serve as the cornerstone for further research.

Correcting word-final "mm" to "rum" in Hugh of Santalla's Fragmentum Pseudo Aristotelis using regexr.com. Here, I used code (shown in bright blue) to identify a given set of characters ("-mm") and replace occurrences of those characters with "-rum" (circled in yellow) so that the text reads properly. In a case such as this with only three examples, this procedure might not seem necessary, but if the document is long, these codes can save a lot of time (and sanity)!

Here, I used regexr.com to find irregular OCR results in order to correct them on a case-by-case basis. I first wrote a code instructing the regular expressions software to locate irregular words with a capital letter in the middle (left). Then I compared words it found to the original JPEG image (right) in order to verify what the correct replacement should be. So, in Hugh of Santalla's Fragmentum Pseudo Aristotelis shown here, "muUer" became "mulier."

Making the Text Useful

After identifying and collecting printed material, I run those images through an OCR program and rectify any errors using regular expressions software.

Unfortunately, having images of printed text doesn't allow me to adequately analyze the text in ways that are useful to this project. Fortunately, the technology powering OCR (Optical Character Recognition) software has made significant improvements in recent years, so that I can run my images through a program which turns my JPEG image into searchable plain text within minutes.

However, OCR technology is far from perfect, and many of the characters that the program thinks it recognizes are wrong. One way to rectify OCR programs' errors relatively easily is to make use of Regular Expressions programs, freely available for downloading or directly on the web, which I have used in two ways for this project. Regular expressions software responds to code which you input in order to identify a given set of characters, and to replace them with another specified set of characters (or eliminate them) in a "batch" process, rather than correcting each one-by-one.

First, I used regular expressions programs the way they were intended-- that is, to identify a given set of characters and replace them with another. The first image on the left shows how I attempted to change words that (incorrectly) end in "-mm" to words that (correctly) end in "-rum," so that "fixamm" becomes "fixarum." The most significant difficulty in this step was creating a code that instructed the program to find not just any "mm" (since, for example, the "summam" in "Felicitatis ergo summam..." includes an "mm" that we want to preserve), but only "-mm" at the end of a word. Once I wrote that code, I was able to "batch" replace any word-final "-mm" with the correct "-rum." In a case such as this with only three examples, this procedure might not seem necessary, but if the document is over 80 pages (as this one was), these codes can streamline a significant portion of tedious work!
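The word-final replacement described above can be sketched in Python. This is only an illustration, not the expression actually entered into regexr.com: the function name `fix_word_final_mm` is hypothetical, the sample string is assembled from the words quoted above, and the sketch assumes that a word boundary (`\b`) is an adequate test for "end of a word" in the OCR output.

```python
import re

# "mm" followed by a word boundary, i.e. "mm" at the very end of a word.
# The "mm" inside "summam" is followed by another letter, so it is skipped.
WORD_FINAL_MM = re.compile(r"mm\b")


def fix_word_final_mm(text: str) -> str:
    """Batch-replace word-final "mm" with "rum" (hypothetical helper)."""
    return WORD_FINAL_MM.sub("rum", text)


# Sample string assembled for illustration from words quoted in the text.
sample = "Felicitatis ergo summam ... fixamm"
print(fix_word_final_mm(sample))
```

A blanket replacement like this would also rewrite any word that legitimately ends in "mm", so it may be worth reviewing the list of matches before applying the substitution.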

Second, I used regular expressions programs NOT as they were intended-- that is, for only their identifying capabilities, and then changed errors on a case-by-case basis rather than "batch" correcting them. For instance, knowing that it is extremely unlikely to have a Latin word with capitalized letters in the middle, I wrote a code telling my regular expressions program to identify words like "aUa" and "muUer." In both these cases, the "-U-" should be replaced by "-li-" but this is not a sufficiently regular error to warrant my "batch" correcting every uppercase "-U-" to a "-li-." So, as shown in the second image on the left, I first used the regular expressions software to locate an irregular word, and then compared it to the original myself in order to verify what the correct replacement should be. So, "muUer" became "mulier."
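The "identify only" use of regular expressions could look something like the following Python sketch. The pattern and the function name `find_suspect_words` are assumptions made for illustration, not the code used on regexr.com, and the pattern only covers unaccented ASCII letters. It flags any word in which a lowercase letter is immediately followed by an uppercase one, and it reports positions rather than changing anything, so each hit can be checked against the page image by hand.

```python
import re

# A word containing a lowercase letter immediately followed by an uppercase
# letter, e.g. "muUer" or "aUa". Ordinary capitalized words ("Roma") and
# all-caps words do not match.
MID_CAPITAL = re.compile(r"\b[A-Za-z]*[a-z][A-Z][A-Za-z]*\b")


def find_suspect_words(text: str) -> list[tuple[int, str]]:
    """Return (character offset, word) pairs for manual review."""
    return [(m.start(), m.group()) for m in MID_CAPITAL.finditer(text)]


# Made-up sample line containing the two errors mentioned in the text.
for offset, word in find_suspect_words("Et muUer aUa Roma"):
    print(offset, word)
```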

This side-by-side comparison of Pierre d'Ailly's Imago Mundi, the JPEG image on the left and the OCR text version on the right, shows the difficulty of rendering Latin abbreviations, especially sigla, into text-searchable formats. I am currently working on a series of codes to assist in streamlining these corrections.

Examples of Latin sigla, which illustrate the difficulties faced when trying to OCR a medieval text which preserves sigla and ligatures.

Specific Challenges with Latin

Some modern printed texts preserve medieval Latin character abbreviations such as ligatures and sigla. Ligatures are contractions of two characters into a single one, as when "ae" is combined into "æ"; sigla are scribal abbreviations similar to shorthand markings (see the second figure on the right). Both developed for written Latin in the Middle Ages as a means to save space on a writing surface, and while they are often rendered in longhand in modern transliterations (so that "æ" is written as "ae"), this practice is not always followed.

As illustrated in the first image on the right, my OCR program was not able to represent Latin ligatures and sigla in any meaningful way. Currently, I am reading about coding language protocols for dealing with Latin abbreviations, and hope to write some regular expressions code to apply to particularly problematic texts in the coming weeks.
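For the ligature half of this problem, one possible starting point is a simple character mapping, sketched here in Python. Everything in it is an assumption for illustration: the table `LIGATURES` covers only "æ" and "œ" and their capitals, the capitals are expanded as "Ae" and "Oe", and the function name `expand_ligatures` is hypothetical. It also presumes that the OCR output already contains the ligature characters correctly, and it does nothing for sigla, which need text-specific rules.

```python
# Hypothetical mapping from single ligature characters to their longhand
# spellings; extend it as other ligatures turn up in a given edition.
LIGATURES = {
    "æ": "ae",
    "Æ": "Ae",
    "œ": "oe",
    "Œ": "Oe",
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Rewrite ligatures in longhand so the text is easier to search."""
    return text.translate(_TABLE)


print(expand_ligatures("cælum et pœna"))
```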

Current Developments

Texts currently in production include Pierre d'Ailly's Ymago Mundi (c. 1410); some treatises by William of Conches (c. 1090 - c. 1160); Abraham Ibn Ezra's Reshit Hokma (c. 1150); Ripoll Manuscript 255; John of Salisbury's Metalogicon (c. 1160); and Raymond Llull's lengthy Testamentum on alchemy (1332).

by Allyssa Metzger