A new version of the corpus used by the KITAB team is now available for download on Zenodo, an open science platform that supports open access. This is the second release developed by the OpenITI organisation. It is also accessible on GitHub.
The current release features 7,119 books, including all versions and editions (1,464,011,669 words), representing 4,285 unique works written by 1,833 authors. Among these, 446 books are annotated in OpenITI mARkdown. Moreover, the project team has made corrections to the books’ metadata. Major corrections are noted in the release note, which also provides statistics on the corpus, as well as a list of current and past contributors to the corpus. The release note is available here.
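To put these figures in proportion, a few ratios can be derived from them. The short Python sketch below does only that arithmetic: the five constants are copied from the paragraph above, while the function name and dictionary keys are illustrative choices of ours and are not part of any OpenITI or KITAB tooling.

```python
# Figures quoted in the release announcement above.
BOOKS = 7_119            # all versions and editions
WORDS = 1_464_011_669    # total word count across all books
WORKS = 4_285            # unique works
AUTHORS = 1_833          # authors
ANNOTATED = 446          # books annotated in OpenITI mARkdown


def release_ratios():
    """Return simple ratios derived from the published release figures.

    The keys are hypothetical labels chosen for this sketch.
    """
    return {
        "words_per_book": WORDS / BOOKS,
        "versions_per_work": BOOKS / WORKS,
        "works_per_author": WORKS / AUTHORS,
        "annotated_share_percent": 100 * ANNOTATED / BOOKS,
    }


if __name__ == "__main__":
    for name, value in release_ratios().items():
        print(f"{name}: {value:,.2f}")
```

Read this way, the release averages roughly 205,600 words per book, about 1.7 versions per work and 2.3 works per author, with a little over 6% of the books annotated. These are plain averages and say nothing about how unevenly texts are distributed across authors or periods.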
The release metadata describes the texts included in this version. Arabic fields for titles, authors and tags are being added in the current version. In addition to the release metadata, the latest version of the corpus can be searched by book title, author, tag and other fields using KITAB’s metadata application. The application is continuously updated and is available here.
Broadly, OpenITI aims to develop a machine-actionable corpus of premodern texts in Islamicate languages that can facilitate computational analysis of the Islamicate written tradition. The analytical data sets (including text reuse statistics) generated by the KITAB team using this corpus will also be published on Zenodo in the future.
To cite this version, please use the following (the bibliographical export is also available on the publication page):
Lorenz Nigst, Maxim Romanov, Sarah Bowen Savant, Masoumeh Seydi and Peter Verkinderen, OpenITI: A Machine-Readable Corpus of Islamicate Texts (Version 2020.1.2) [data set] (June 2020), Zenodo, http://doi.org/10.5281/zenodo.3891466.