At present, the OpenITI/KITAB corpus comprises 10,243 text files, 6,268 of which are unique titles.
Such a large, and growing, number of texts makes quality control challenging. But at the same time, it is precisely this large number of texts that can be the basis for quantitative methods of quality control.
In this blog post, I would like to briefly explore how we can make use of a large number of annotated lines of poetry for a simple quality check of such lines.
Many of the texts in the OpenITI/KITAB corpus contain poetic verses. Many of these verses are found in texts that carry the word diwan in the title; others are not. Some of these verses are annotated and tagged as verses; others are not.
Regardless of where verses are found, if they are annotated according to the OpenITI mARkdown system used by OpenITI/KITAB, the tag %~% is inserted either between two of the hemistichs of a verse or before or after a verse in case there are not two hemistichs (see also here under ‘Verses of poetry’).
In many instances, the tag is inserted manually in the course of the annotation work carried out by KITAB team members. In the vast majority of cases, however, the tag is inserted automatically by KITAB when new text files are converted to the OpenITI mARkdown format.
However the tag %~% has been inserted, the number of lines containing it is substantial, and if we search our corpus for lines containing %~%, we get tens of thousands of results.
To give a sense of the scale, for the following exploration, I have limited myself to texts attributed to authors whose date of death falls between the years 1 and 799 of the Hijri calendar.
Even if we limit ourselves to texts from within the said chronological window, there are far more than 300,000 lines that contain the tag %~%!
Such a large number should tell us something rather solid about lines of poetry in our corpus. Not least, it should tell us pretty accurately how long they typically are.
The length of each string tagged as a line of poetry in our corpus can easily be determined computationally, and the resulting hundreds of thousands of values indeed give us a relatively good idea of what to expect from verses in terms of length.
Accordingly, outliers are of considerable interest when it comes to flagging strings that seem to be either too long or too short. The many can be used to spot the few.
Not all of these outliers are problematic, but the chance that they actually are is more than just a remote one.
It is worthwhile plotting the relevant data to get a better sense and overview of how the lengths of strings which contain the tag %~% are distributed across the OpenITI/KITAB corpus (as has been said, for this exploration, the texts used fall within the chronological window 0001–0799 AH). The following figure shows a distribution in which the x-axis shows the AH death dates and the y-axis the lengths of the strings containing the tag %~%:
Two facts are readily palpable: First, the bulk of the strings in our corpus which have been tagged as poetic verses are somewhere between thirty and eighty characters long (which in this exploration also includes non-Arabic characters). Second, not only are there outliers, but many of these outliers seem to accumulate around a few text files.
If we zoom in on strings whose length is between 80 and 300 characters, this can be seen more clearly:
For example, the values eye-catchingly stacked at 0283 AH all belong to the text file containing Ibn al-Rumi’s diwan. The highest character count value here is 278, which is considerably longer than expected from a line of poetry. Indeed, if we go into the text file and check up on this particular string, there are issues with the text file (see the image below).
The reasons some of the strings that contain the tag %~% are too long differ. At the end of the day, it will be up to a human annotator to make a judgement call and fix what caused the problem (should there be any).
Nonetheless, already at this point, the data that we can generate from our corpus allows us to identify files with ‘odd’ lines of poetry and directs us to the affected text files relatively efficiently.
Thus, from the more than 300,000 lines that contain the tag %~%, fewer than 1,000 are longer than eighty characters, and fewer than 300 exceed 100 characters. The latter are scattered across only about thirty-five texts.
KITAB will automatically flag these cases in our yml version metadata files.