Citation: please use the citation information (available to export in various formats) on the publication page.
A new version (2025.1.9) of the OpenITI corpus is now available at Zenodo.
This release is the first to contain a Persian subcorpus as well as a new subcorpus (OpenITI-mss) of transcriptions of various handwritten materials: codices, letters, administrative documents, etc. The first batch of documentary materials comes from the Invisible East corpus (https://www.invisible-east.org/); these documents were written in Arabic, New Persian, Middle Persian, Bactrian, and Judaeo-Persian (Persian written in Hebrew script). The first manuscript transcriptions were provided by Lorenz Nigst and Stefan Kamola. We hope to grow this part of the corpus through cooperation with libraries, research projects, and other scholars.
Unlike the texts in the Arabic and Persian subcorpora, which are identified by URIs for their authors, titles, and digital versions, the texts in the manuscript corpus are identified by URIs for the collection they are held in, the material object they are written on, and their transcription. Full documentation of the manuscript corpus URIs and metadata files is available at https://github.com/OpenITI/MSS.
This release includes 9,106 unique books by 3,618 authors. When all available versions are taken into account, the corpus now consists of 14,107 text files. The complete collection, comprising books and manuscripts in multiple languages, contains a total of 2,348,893,857 words. Below is an overview of the subcorpora:
Arabic:
- Total texts: 13,320
- Unique authors: 3,373
- Unique titles: 8,755
- Total tokens: 2,310,017,480
- Total characters: 9,547,056,881
Persian:
- Total texts: 354
- Unique authors: 262
- Unique titles: 351
- Total tokens: 38,775,041
- Total characters: 145,796,954
Manuscripts:
- Total texts: 433
- Total tokens: 101,336
- Total characters: 417,485
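As a quick consistency check, the per-subcorpus figures above can be added up and compared with the overall totals quoted earlier. The following is a minimal sketch in Python; the dictionary layout and variable names are purely illustrative and are not part of the OpenITI release. Author counts are left out because they do not simply add up (3,373 + 262 is more than the 3,618 reported overall, possibly because some authors appear in both subcorpora; the announcement does not say).

```python
# Figures copied from the subcorpus overview in this announcement.
# The structure and names below are illustrative only.
subcorpora = {
    "Arabic": {"texts": 13_320, "titles": 8_755, "tokens": 2_310_017_480},
    "Persian": {"texts": 354, "titles": 351, "tokens": 38_775_041},
    # No title count is given for the manuscript subcorpus.
    "Manuscripts": {"texts": 433, "tokens": 101_336},
}

total_texts = sum(s["texts"] for s in subcorpora.values())
total_titles = sum(s.get("titles", 0) for s in subcorpora.values())
total_tokens = sum(s["tokens"] for s in subcorpora.values())

# Compare with the overall totals stated in the announcement.
assert total_texts == 14_107
assert total_titles == 9_106
assert total_tokens == 2_348_893_857

print(f"{total_texts:,} text files, {total_titles:,} unique books, "
      f"{total_tokens:,} words")
```

The three sums match the reported totals of 14,107 text files, 9,106 unique books, and 2,348,893,857 words.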
The following graphs illustrate the overall changes in the number of books, authors, and words across all published versions. Each graph presents figures for both the full range of Hijri centuries and the period before the 10th century AH. In the graphs, "Primary" refers to the unique works in the corpus (that is, excluding different versions of the same book).
For an overview of modified URIs, changes and additions, and annotation updates, please see the release notes and the corresponding TSV files in this release.


