About the corpus

The KITAB corpus is the Arabic subcorpus of the Open Islamicate Texts Initiative (OpenITI), which was founded in 2016 by Maxim Romanov, Sarah Savant and Matt Miller. OpenITI is a multi-institutional effort with two aims: to create a scientifically curated and open corpus of Islamicate (mainly Arabic and Persian) texts that complies with international data standards, is accompanied by scholarly metadata and is representative of the diversity of historic traditions; and to build a full digital text production pipeline (for information on OpenITI, see here).

To date, the KITAB corpus comprises more than 10,200 text files plus their corresponding metadata files. Most of them contain premodern works in line with KITAB’s research focus on the premodern Arabic written tradition from its origins in the second century AH/eighth century CE up to roughly the ninth century AH/fifteenth century CE.

The works come from a variety of sources that fall into three major categories (for a comprehensive list of our sources, see here): (1) digital online libraries such as al-Maktaba al-Shamila and Shia Online, (2) research projects such as the Graeco-Arabic Studies Corpus, the Digital Averroes Research Environment and Ptolemaeus Arabus et Latinus and (3) KITAB’s own Optical Character Recognition (OCR) pipeline, which is based on Kraken (see here and here).

Further text files are contributed by individual researchers and KITAB team members, most notably in the form of manual transcriptions of manuscripts.

While the bulk of our texts currently come from online libraries and are reproductions of modern print editions, the relative weight of the individual source categories is likely to change. As OCR tools improve, more texts, including OCR’d manuscript sources, will enter the KITAB corpus through our own OCR pipeline.

The texts in our corpus span a wide range of subject matter and genres but reflect the focal points of our major source collections. The OCR pipeline that KITAB is building will allow us to address lacunae in our corpus on a broader scale and to include under-represented, or even unrepresented, categories of texts and more marginal subjects. Specialised partner research projects that will contribute their texts to the OpenITI/KITAB corpus, as well as individual contributions from the field, will further diversify our corpus.