Common Questions about the Corpus
What does KITAB analyse?
We focus on the written Arabic tradition from its origins in the eighth century up to roughly the fifteenth century. We aim to include as many texts as possible, however, so you can also find texts written after 1500.
What quality of texts is necessary for computational analysis?
It depends on the type of analysis. Text reuse detection methods can be adapted to detect patterns even in badly OCRed texts (as has been done for studies of 19th-century newspapers). For our own OCRed texts, however, we aim to obtain digital files that match the printed editions on which they are based to within 1 or 2%.
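As a rough illustration of what a 1 or 2% tolerance could mean in practice, the sketch below computes a character error rate: the edit distance between a hand-checked transcription and the OCR output, divided by the length of the transcription. This is an assumption made for illustration only. The answer above does not specify how the match is measured, and the function names and the 0.02 threshold are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the reference text.

    Assumption: the reference is a hand-checked transcription of the
    printed edition, and error is counted per character.
    """
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


def within_tolerance(reference: str, ocr_output: str,
                     tolerance: float = 0.02) -> bool:
    """True if the OCR output differs from the reference by at most
    the given fraction of characters (hypothetical 2% default)."""
    return character_error_rate(reference, ocr_output) <= tolerance
```

A measure of this kind is sensitive to normalisation choices (for example, whether vowel marks or whitespace are counted), so any real comparison would need to fix those conventions first.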
Are your texts representative of the historical Arabic tradition?
Yes and no. We have a lot of texts. Digitisation efforts in the Middle East over the past twenty years have been extensive and impressive. But equally, we have nowhere near what once existed. The electronic files we have are based on printed editions. Many printed editions have not been converted into machine-readable texts. And many manuscripts have not made it into print.
There are some biases in what we do have, with an overrepresentation of works by particular authors and on particular topics. Texts with religious and legal significance are heavily represented.
A major aim of KITAB is to increase both the number and diversity of texts in our corpus. We seek works treating, for example, philosophy and science.
Can I propose additional texts for KITAB?
Yes. Please check the Open Islamicate Texts Initiative first to ensure that we do not already have the text.
If your text is already available in machine-readable format, we will add it to the corpus.
If your text is not already available in machine-readable format, please add it to our list of books to OCR in the future.
How reliable are texts in digital format?
Generally speaking, our digital texts are reliable reproductions of modern printed editions. We are finding this as we annotate digital files of books and compare them to their printed counterparts.
There are, however, problems with printed editions and with the extent to which any edition serves as a witness to a historical written work. One aim of KITAB is to develop our algorithm and our text alignment and annotation tools to facilitate comparison between different editions of books. This will help scholars gain a fuller picture of their sources and the variations between them. It might also aid in the creation of critical editions.
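To give a sense of what comparing editions can involve, here is a minimal sketch that aligns two short passages word by word and lists where they differ. It uses Python's standard `difflib` module as a stand-in and is not KITAB's algorithm or tooling. The function name and the whitespace tokenisation are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def edition_variants(edition_a: str, edition_b: str):
    """List the word-level differences between two editions of a passage.

    Returns tuples of (operation, words_in_a, words_in_b), where the
    operation is 'replace', 'delete' or 'insert'. Tokenising on
    whitespace is a simplifying assumption.
    """
    words_a = edition_a.split()
    words_b = edition_b.split()
    # autojunk=False stops difflib from ignoring very frequent words
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    variants = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical stretch, nothing to report
        variants.append((
            tag,
            " ".join(words_a[a_start:a_end]),
            " ".join(words_b[b_start:b_end]),
        ))
    return variants
```

For example, `edition_variants("the king went to the city", "the king travelled to the city")` reports a single `replace` of "went" by "travelled". Book-length texts with OCR noise and spelling variation would call for more robust alignment than this.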
We will also tackle problems with metadata, some of which originate in the printed editions themselves. For example, many books are reprinted without acknowledgement of the original editions or their editors. We need to verify bibliographic data.