OpenITI release 2022.2.7

The KITAB team has released a new version (2022.2.7) of the OpenITI corpus at Zenodo. The release is open access. It is our seventh release (second release in 2022). You can also access the release through our GitHub repository.

This release features 6,858 unique books written by 2,854 authors. For some of these books, multiple versions are in the corpus (representing different print editions and/or digitizations); including all these versions, the corpus now comprises 11,296 text files (containing a total of 2,262,096,158 words).

Major additions to the corpus:

The KITAB team has OCRed 75 new texts using the improved OCR models for Arabic script developed by the OpenITI-AOCP project. To OCR these texts, the team used the command-line OCR software Kraken (these texts have an ID that starts with `Kraken`, followed by a date), sometimes through the graphical user interface of the open-source OCR platform eScriptorium (these texts have an ID that starts with `EScr`). Most of these OCRed texts have not been post-corrected and still contain a significant number of transcription errors (the metadata files accompanying these texts carry an “UNCORRECTED_OCR” tag).
Note: These texts should be used only for types of analysis that are not affected by small transcription errors (for example, our text reuse algorithm passim has no problem with small transcription errors).
The Library of Arabic Literature has generously provided 13 texts (IDs start with `LAL`); more texts from this wonderful collection will be added in future corpus releases. Some of these texts are entirely new additions to the corpus, such as al-Quḍāʿī’s Kitāb al-Shihāb, Ḥannā Diyāb’s Book of Travels and the anonymous story collection A Hundred and One Nights.
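The ID prefixes mentioned above (`Kraken`, `EScr`, `LAL`) make it possible to sort the new texts by provenance. The following is a minimal sketch, not part of the release itself: it assumes that the version ID is the third dot-separated part of a file name (author, book, version ID), and the example file names are made up for illustration.

```python
# Prefixes named in this announcement, mapped to a short description.
SOURCE_PREFIXES = {
    "Kraken": "OCR with Kraken (command line)",
    "EScr": "OCR through eScriptorium",
    "LAL": "Library of Arabic Literature",
}


def version_id_from_filename(filename: str) -> str:
    """Return the version ID of a text file.

    Assumption: file names look like <author>.<book>.<versionID>[.<extension>].
    """
    parts = filename.split(".")
    if len(parts) < 3:
        raise ValueError(f"unexpected file name: {filename!r}")
    return parts[2]


def source_of(version_id: str) -> str:
    """Classify a version ID by the prefixes listed in the announcement."""
    for prefix, label in SOURCE_PREFIXES.items():
        if version_id.startswith(prefix):
            return label
    return "other"


# Hypothetical file names, for illustration only.
examples = [
    "0000Author.Book.Kraken20220101-ara1",
    "0000Author.Book.EScr0001-ara1",
    "0000Author.Book.LAL0001-ara1.completed",
]
for name in examples:
    print(name, "->", source_of(version_id_from_filename(name)))
```

Texts classified as OCR output by a check like this are the ones the note above applies to: unless they have been post-corrected, they are best kept out of analyses that are sensitive to transcription errors.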

A full list of the new additions can be found in the `ids.csv` file in the release notes accompanying the corpus release.

In addition to the new texts, annotation work has been carried out: structural annotation (section headers) was added to 50 texts, and 62 previously annotated texts were vetted by a second annotator. For a full list of the new annotation work, see the `annotation_update.csv` file in the release notes accompanying the corpus release. Currently, the corpus contains 552 books that have been annotated and vetted (extension: `.mARkdown`), and 647 books that have gone through one round of annotation (extension: `.completed`).
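Because the annotation status is encoded in the file extension, the files in a local copy of the corpus can be tallied by status. The sketch below is one possible way to do this, under the assumption that every text file without a `.mARkdown` or `.completed` extension counts as "other"; the file names in the example are hypothetical.

```python
from collections import Counter


def annotation_status(filename: str) -> str:
    """Map a file name to its annotation status, based on its extension."""
    if filename.endswith(".mARkdown"):
        return "annotated and vetted"
    if filename.endswith(".completed"):
        return "annotated once"
    # Assumption: anything else has not been annotated, or is not a text file.
    return "other"


def tally(filenames):
    """Count how many file names fall into each annotation status."""
    return Counter(annotation_status(name) for name in filenames)


# Hypothetical file names, for illustration only.
names = [
    "0000Author.BookA.Src0001-ara1.mARkdown",
    "0000Author.BookB.Src0002-ara1.completed",
    "0000Author.BookC.Src0003-ara1",
]
print(tally(names))
```

In practice the list of names could come from walking the downloaded release with `pathlib.Path.rglob`; the directory layout of the release is not described here, so that step is left out.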

Finally, some URIs were updated to correct mistakes. For an overview of these URI changes and for statistics on the corpus, please see the release notes and the accompanying CSV files, which are available with the release and in the GitHub repository.

To cite this version, please include the following information (a bibliographical export is available on the publication page):

Nigst, Lorenz, Romanov, Maxim, Savant, Sarah Bowen, Seydi, Masoumeh, & Verkinderen, Peter. (2023). OpenITI: a Machine-Readable Corpus of Islamicate Texts (2022.2.7) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7687795

Masoumeh Seydi and Peter Verkinderen