Building an Open Data Community for Biblical Studies

This blog is devoted to a community of data: freely licensed data for studying the Bible in its original languages. Currently, this data is not well understood outside of a small community of people who are familiar with both biblical languages and digital technologies, but it is immensely valuable to anyone who wants to study these texts in the original languages. Unfortunately, some of it is not even available with a user interface, and the systems that work well with one dataset may not be able to use others. To harvest this value, we need to find better ways to present the data to end users, teach each other how to better leverage the information produced by other groups in the community, and learn better ways to produce and curate our datasets. In this blog, I will explain things I learn along the way.

I have always been fascinated by data modeling and query languages, and have helped develop or design object databases, XML databases, and mixed-model databases that support both XML and JSON. I am also fascinated by human languages and literature, and briefly majored in medieval German literature. But the one book I read most often is the New Testament, and I prefer to read it in Greek. When I joined the B-Greek forum (then a mailing list), I found a community of experts who could help me learn, especially Carl Conrad, who poured lots of time into helping beginners. In 1998, when Dave Marotta wanted to step down as administrator of B-Greek, I took over.

In 2001, I attended a social event for members of the B-Greek forum at the Society of Biblical Literature, and Patrick Durusau suggested I do an impromptu presentation on XQuery, an XML query language that I helped invent. I said I couldn’t do that for this audience without biblical texts, so he introduced me to Matt O’Donnell, who handed me a 5 1/4 inch floppy with an analysis of Galatians from the opentext.org project. An hour later, I was giving a presentation that ran meaningful queries and explored various aspects of the text. Ever since then, I have been doing queries on biblical texts, but much of the data I wanted was not freely available due to copyright restrictions, and that limited what I could do with it.

In 2013, Randall Tan of the Asia Bible Society (now known as the Global Bible Initiative) was able to provide me with some of the treebank data he had developed together with Andi Wu. When I asked, they agreed to let Randall and me publish this data on GitHub. We launched biblicalhumanities.org to publish this data, then started working together to encourage others to publish high-value datasets under open licenses. We now publish treebanks in two different formats, for both the SBLGNT and Nestle 1904, using James Tauber’s morphology. But we quickly realized that our most important work is to foster the work of others who create open resources, helping them design their data in ways that allow it to be used with other datasets, and building a community of open data. This is how we describe our work:

biblicalhumanities.org is a community of computer scientists, Bible scholars, and digital humanists collaborating to create open digital resources for biblical studies. Our emphasis is on open resources for biblical languages, such as morphologically tagged texts, treebanks, and lexicons. We hope that these resources will be used widely for teaching, research, and resources used to read and study the Bible.

We are working to grow a community, not to own it or control it. We try to track resources that exist, create resources that are missing, and help people coordinate with others who are working on similar things to maximize interoperability and minimize duplication of effort. See our dashboard for an overview of these resources. We are now beginning to create standards to maximize interoperability among resources.

As part of that work, we have published a set of guidelines and a dashboard to track resources that are released.

Since then, the Global Education and Research Technology session of SBL and biblicalhumanities.org have worked together to provide opportunities for data providers to present on their datasets, with an annual get-together for data providers at the SBL Annual Meetings.

During this time, several groups have released datasets that dramatically change what we can do. The Eep Talstra Centre for Bible and Computer released their ETCBC Hebrew Text Database under a free license; SIL International released Levinsohn’s Discourse Analysis of the Greek New Testament; a variety of high-quality lexicons have been completed for both Greek and Hebrew; Perseus released Cramer’s Catenae, Swete’s Septuagint, and Migne’s Patrologia Graeca; Sefaria released a massive collection of Jewish texts; and Bruce Robertson dramatically improved the quality of ancient Greek OCR and scanned a massive collection of Greek texts, including several important commentaries. Codex Sinaiticus released their transcriptions as XML, and Alan Bunning released transcriptions of all major Greek papyri before AD 400. Until recently, we were dismayed by the shortage of high-quality, freely licensed resources. Now we are swimming in them.