Analysis of citation networks
Ryan Muther is working with KITAB to develop methods to automatically detect names within isnads. The goal is to use the order of names within isnads and the organisation of isnads within texts to solve complex disambiguation problems. How can we know whether different names in different isnads refer to the same person? How do we decide whether similar or identical names refer to different people? This project aims to use computer science methods to tackle these kinds of questions, which are essential for understanding how books were written and how authors cited their sources.
Automatic Quran detection
Peter Verkinderen is working with Alicia González Martínez and Karen Bauer to automatically tag Quranic quotations within the OpenITI Corpus. The goal is not only to identify the quotations, but also to identify the relevant Sura and Aya. To pursue this they must decide where the complex boundary between Quranic language and Quranic quotation lies. How many words, for example, are needed for a phrase to be considered a Quranic quotation? We hope to share our initial data for the corpus soon.
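To make the minimum-length question concrete, here is a minimal sketch in Python of how a word threshold changes what counts as a quotation. It is an illustration only, not the project's tagging method: the function name find_quotations, the min_words parameter, the two-verse reference table and the use of exact matching on whitespace-separated, unvocalised tokens are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Toy reference table (an assumption for this sketch): (sura, aya) -> unvocalised text.
REFERENCE: Dict[Tuple[int, int], str] = {
    (1, 1): "بسم الله الرحمن الرحيم",
    (1, 2): "الحمد لله رب العالمين",
}


def find_quotations(
    text: str,
    reference: Dict[Tuple[int, int], str],
    min_words: int = 3,
) -> List[Tuple[int, int, int, int]]:
    """Return (sura, aya, start, end) for exact token matches of at least
    min_words consecutive verse words; start/end are token offsets in text."""
    tokens = text.split()
    hits: List[Tuple[int, int, int, int]] = []
    for (sura, aya), verse in reference.items():
        verse_tokens = verse.split()
        # Try the longest stretch of the verse first, then shorter ones,
        # and stop at the first length that produces a match.
        for n in range(len(verse_tokens), min_words - 1, -1):
            found = False
            for v in range(len(verse_tokens) - n + 1):
                gram = verse_tokens[v:v + n]
                for t in range(len(tokens) - n + 1):
                    if tokens[t:t + n] == gram:
                        hits.append((sura, aya, t, t + n))
                        found = True
            if found:
                break
    return hits


if __name__ == "__main__":
    print(find_quotations("قال الحمد لله رب العالمين ثم مضى", REFERENCE))
    # The same two-word phrase is ignored or tagged depending on the threshold.
    print(find_quotations("الحمد لله", REFERENCE, min_words=3))
    print(find_quotations("الحمد لله", REFERENCE, min_words=2))
```

With a threshold of three words the short phrase is treated as ordinary Quranic language, while a threshold of two tags it as a quotation of Sura 1, Aya 2. A realistic tagger would also need to handle orthographic variation and vocalisation, which this sketch ignores.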
Text reuse detection evaluation
Ryan Muther and David Smith are working with KITAB to evaluate and improve the performance of the text reuse detection algorithm, passim. Passim was designed for use with English-language newspapers, so the team are testing how effectively it detects text reuse in Arabic. We are doing this by annotating ground truth for text reuse in a variety of OpenITI texts. How much text needs to be shared to be considered text reuse? Where do we draw the boundaries of text reuse instances? How should we treat commonly used phrases found throughout the corpus? These are the sorts of questions that this project deals with when deciding what the ground truth should look like. The good news: our preliminary evaluation data suggests passim performs well with Arabic, but there is still room for improvement!
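As a rough illustration of what comparing detected reuse against annotated ground truth can involve, the sketch below computes token-level precision and recall for spans in a single text. It is not the project's evaluation code and does not call passim: the Span type, the function names and the representation of reuse instances as half-open (start, end) token offsets are assumptions made for this example.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # half-open token range [start, end); hypothetical format


def covered_tokens(spans: Iterable[Span]) -> Set[int]:
    """Collect every token position covered by at least one span."""
    positions: Set[int] = set()
    for start, end in spans:
        positions.update(range(start, end))
    return positions


def precision_recall(predicted: Iterable[Span], gold: Iterable[Span]) -> Tuple[float, float]:
    """Token-level precision and recall of predicted spans against gold spans."""
    pred_tokens = covered_tokens(predicted)
    gold_tokens = covered_tokens(gold)
    overlap = len(pred_tokens & gold_tokens)
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(gold_tokens) if gold_tokens else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = [(10, 20)]       # annotated reuse instance
    predicted = [(12, 22)]  # detected instance with shifted boundaries
    print(precision_recall(predicted, gold))
```

Measuring at the token level means a detection with slightly different boundaries still earns partial credit, which is one way to keep the boundary question from dominating the score. Scoring whole instances instead would be an equally reasonable choice.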