First Open Access Release of Our Arabic Corpus

Scholars working in Arabic can now download the entire corpus used by the KITAB team through Zenodo, an Open Science platform that supports open access.

The corpus is available at https://doi.org/10.5281/zenodo.3082464 and contains:

Number of authors: 1,859.

Number of titles: 4,288, totalling 755,689,541 words.

Including multiple versions of the same titles: 7,144 texts, totalling 1,520,667,360 words.

The texts sit within the OpenITI corpus (the KITAB project is the major contributor of Arabic texts to the OpenITI). All major versions of the corpus, as well as analytical data sets generated from the corpus with different methods, will be published on Zenodo in the future as part of our commitment to open access. For the release notes, including the structure of the data, see here.

The goal of the OpenITI is to build a machine-actionable corpus of premodern texts in Islamicate languages to encourage computational analysis of the Islamicate written tradition. Most of the Arabic texts have been collected from open-access online collections of premodern and modern Arabic texts such as http://shamela.ws/ and http://shiaonlinelibrary.com/ .

If you are using this version of the corpus, please cite it in the following manner:

Maxim Romanov and Masoumeh Seydi, OpenITI: A Machine-Readable Corpus of Islamicate Texts (Version 2019.1.1) [data set] (June 2019), Zenodo, doi:10.5281/zenodo.3082464.


Sarah Bowen Savant
