Scholars working in Arabic can now download the entire corpus used by the KITAB team through Zenodo, an Open Science platform that supports Open Access.
The corpus is available at https://doi.org/10.5281/zenodo.3082464 and contains:
Number of authors: 1,859.
Number of titles: 4,288, totaling 755,689,541 words.
Multiple versions of the same titles: 7,144, totaling 1,520,667,360 words.
The texts sit within the OpenITI corpus (the KITAB project is the major contributor of Arabic texts to the OpenITI). All major versions of the corpus, as well as analytical datasets generated from the corpus with different methods, will be published on Zenodo in the future as part of our commitment to Open Access. For the release notes, including the structure of the data, see here.
The goal of the OpenITI is to build a machine-actionable corpus of premodern texts in Islamicate languages to encourage computational analysis of the Islamicate written tradition. Most of the Arabic texts have been collected from open-access online collections of premodern and modern Arabic texts such as http://shamela.ws/ and http://shiaonlinelibrary.com/.
If you use the corpus, please cite it as follows:
Maxim Romanov and Masoumeh Seydi. 2019. “OpenITI: A Machine-readable Corpus of Islamicate Texts”. Zenodo. doi:10.5281/zenodo.3082464.
By Masoumeh Seydi and Maxim Romanov
In this paper we propose a new online Arabic corpus of news articles, named ANT Corpus, collected from RSS feeds. Each document represents an article structured in the standard XML TREC format. We use the ANT Corpus for Text Classification (TC), applying SVM and Naive Bayes (NB) classifiers to assign each article to its predefined category. We also study the contribution of term weighting, stop-word removal and light stemming to Arabic TC. The experimental results show that text length considerably affects TC accuracy and that title words alone are not significant enough to achieve good classification rates. In conclusion, the SVM method gives the best classification results for both titles and texts.
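To make the preprocessing and classification steps concrete, here is a minimal sketch of stop-word removal, light stemming and a Naive Bayes classifier, written with the Python standard library only. It is not the authors' implementation. The stop-word list, the prefix and suffix lists, the four training sentences and the category names are all hypothetical and far smaller than anything usable in practice. Term weighting and the SVM classifier are left out, since they would normally come from a machine-learning library.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny lists for illustration only.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}
PREFIXES = ("وال", "بال", "كال", "فال", "ال")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")


def light_stem(token):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def preprocess(text):
    """Split into runs of Arabic letters, drop stop words, then stem."""
    tokens = re.findall(r"[\u0621-\u064A]+", text)
    return [light_stem(t) for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.term_counts[label].update(preprocess(text))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, text):
        # Terms never seen in training carry no evidence and are skipped.
        tokens = [t for t in preprocess(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set: two categories, two short texts each.
train_texts = [
    "فاز الفريق في المباراة",
    "سجل اللاعب هدفا في المباراة",
    "ارتفعت أسعار النفط في الأسواق",
    "تراجع سعر الدولار في البنوك",
]
train_labels = ["sport", "sport", "economy", "economy"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("المباراة بين الفريقين"))
```

Light stemming lets inflected forms such as الفريق and الفريقين share one feature (فريق), which matters most for short inputs such as titles, where each document contributes only a few terms.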