KITAB Corpus

The Arabic Textual Tradition

The medieval Arabic textual tradition is one of the most prolific in human history. Works were produced across a territory stretching from modern Spain to Central Asia, and their subject matter covered Islam but also much more, from rulers, their courts, and administration to literature, biographies, philosophy, medicine, mathematics, geography, travel, and many other topics.

Some highlights of our collection:

  • Approximately 4,200 unique Arabic texts from the beginning of Islam up to the early 20th century – covering all genres
  • Inclusion of major digital databases from the Middle East – plus additional texts contributed by partner projects
  • URIs (uniform resource identifiers) for all texts – required for corpus building and data analytics
  • Core metadata (author’s name, death date, book title, publisher and year of publication) – with reviews of this data by scholars
  • Pre-processing of texts (including tagging) – to facilitate computational analysis

Recently, KITAB has been working on Arabic-script Optical Character Recognition (OCR) through its partnership with the Open Islamicate Texts Initiative (OpenITI). You can see the results of our work here. Most importantly, our OCR solution, designed by Benjamin Kiessling (Leipzig University), has achieved accuracy rates in the high nineties (percent) for classical Arabic texts. We are now designing a corpus builder that will allow individuals and projects to OCR and post-correct their own scanned books and convert them into machine-readable texts.
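This page does not say how the accuracy rates above are measured. One common convention in OCR evaluation is character accuracy: one minus the edit distance between the OCR output and a hand-checked reference transcription, divided by the length of the reference. The sketch below illustrates that convention only; the function names are hypothetical and it is not presented as the project's own evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly, from 0.0 to 1.0.

    Assumption: both strings are compared code point by code point,
    with no normalisation of diacritics or letter variants.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# One wrong letter in a four-letter word gives 75% character accuracy.
print(character_accuracy("كتاب", "كتات"))
```

On real pages the reference would be a manually corrected transcription of the same scan, and any normalisation of the Arabic script applied before comparison would affect the reported figure.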

The system works by training the computer to recognise different typefaces. It involves a substantial investment in a shared database of training data and technical infrastructure that projects can access and feature in their own applications. A test version of the corpus builder is scheduled for completion in late November 2017.

The work is led by the OpenITI team (Sarah Bowen Savant, Maxim Romanov, and Matthew Miller) and is currently being funded by the Roshan Institute for Persian Studies at the University of Maryland, the Alexander von Humboldt chair in Digital Humanities at Leipzig University, the Institute for the Study of Muslim Civilisations at the Aga Khan University, and the Harvard University Law School (through its SHARIASource project). Additional funding is being sought, and collaboration is welcomed from other projects.
