As a quick refresher, in the example isnad below, there are five transmitters involved in the chain, each underlined.
حدثنا أبو داود قال: حدثنا هشام، عن قتادة، عن الحسن عن سمرة، أن النبي صلى الله عليه وسلم
Abū Dāwūd told us, saying, ‘Hishām told us, from Qatādah, from al-Ḥasan, from Samurah that the Prophet, may the peace and blessing of God be on him (al-Ṭayālisī, 204AH)
Rather than just thinking of this isnad as text alone, we can think of it as representing a chain of individuals involved in the transmission of information, or a simple linear network with five nodes. These networks from isnads, when taken collectively, can be a valuable resource for solving many problems in book history surrounding the sources a particular writer is drawing upon when composing a text. While understanding such a network is something that a reader can do quite easily (and indeed does) while reading a text for a reasonable number of isnads at once, working with a collection as large as the number of isnads even in just a single text becomes very difficult for a human reader in terms of the effort involved, particularly if the text is long.
One of the common issues involved in understanding isnads is that, even within one work, an author will often refer to the same transmitter by a variety of different names. We collected a dataset of 2,379 isnads from Ibn ‘Asakir’s 12th century CE (6th century AH) historical work Tarikh Madinat Dimashq all involving the 9th century CE (3nd century AH) scholar Muhammad ibn Sa’d. The dataset contains 14,454 names, of which 90% have been manually assigned the to the individuals they referred. We found that, of the 44 individuals involved in transmission, around half had multiple different surface forms of their name, with one individual having as many as 25 surface forms, as can be seen in Figure 1 below.
Figure 1: Frequencies of surface form counts for individuals in the Ibn ‘Asakir dataset
However, the isnad data itself is also the source of a potential solution to this issue of name ambiguity, since the context a name appears in can help us determine who the individual is. Consider the following isnad fragments taken from the Ibn ‘Asakir data, with transmitters underlined:
أخبرنا أبو بكر الحاسب أنا أبو محمد الشيرازي أنا أبو عمر بن حيوية
أخبرنا أبو بكر الحاسب أنا أبو محمد الجوهري أنا أبو عمر بن حيوية
Both contain the transmitters Abu Bakr al-Hasib, Abu Muhammad al-Shirazi, and Abu ‘Umar b. Hayawayya. One will note that Ibn ‘Asakir used two different names to refer to the second transmitter, Abu Muhammad (as al-Shirazi and as al-Jawhari). However, the two names are not only internally very similar to each other but also occur in similar contexts, with identical names on either side of the ambiguous name. Thus, if we had a way to represent names that lets us account for contextual similarity of potentially ambiguous names, we could try to find groups of similar names that are likely to refer to the same individual. The problem then becomes figuring out how to represent names and how to find clusters of similar names. We will tackle each of those in turn.
For representing the names, it would be useful to have representations that are similar when both the word or its context are similar. Helpfully, this property is a central idea behind a lot of work in natural language processing focused on learning good contextual representations of text, called word embeddings, which are essentially long vectors which encode the meaning of a word in its context, that can be used to accomplish a wide variety of tasks, which we can take advantage of when trying to represent names. Using one such model, a multilingual English-Arabic BERT model trained on modern Arabic (Lan et al, 2020) as a base, in conjunction with some additional work to improve the model’s ability to understand Classical Arabic and how names in isnads are constructed, we can represent names using their word embeddings from that model. For multi-word names, we can average the embeddings of each word to obtain a representation of the full name.
With the name representations in hand, we can now turn our attention to clustering them to determine which names refer to the same individual. To that end, we apply the social network analysis tool of community detection. To do this, we will first need to construct a network using the collection of name embeddings as nodes. We experiment with two different methods of constructing the network, using different methods for deciding which mentions should be connected to which other mentions. As an aside, it is important to note that since each vector represents a point, we can calculate distances between names using Euclidean distance which allows us to talk about embeddings as being closer or further away from each other. Additionally, in both cases, the cosine similarity (a measure of the angle between two vectors that ranges from 1 to -1, with higher values denoting greater similarity and perpendicular vectors having 0 similarity) between the two embeddings is used as the weight for any edge added to the network.
The first method is more straightforward, with each embedding being connected to its k nearest neighbors, as demonstrated in Figure 2 below with a simple 2-dimensional example where k=2. While this is a sensible starting point, it has the undesirable property of forcing all mentions to be connected to at least k neighbors. As such, names that are close to fewer than k names will be forced to have at least k neighbors in the network regardless of how similar the names in question are. The second method, which we will call the surface form heuristic and demonstrate in Figure 3, takes a more fine-grained approach to remedy this. For each name, we calculate the distance d from that name to the most-distant identical name and connect it to all names at most that distant in embedding space. That way, more isolated regions of the embedding space can have more sparse networks, rather than being subsumed into a larger cluster with a large number of lower weight edges. One could add a multiplier m to look beyond that radius, linking to any mentions within the distance m*d, but we will demonstrate the method with m=1 in Figure 3 below.
Figure 2: A demonstration of the kNN network creation method. Beginning with the fully unconnected points (top left), each name is in turn connected to its two nearest neighbors (as in the top right and bottom left), until the final network (bottom right) in constructed
Figure 3: A demonstration of the surface form heuristic network creation method. Beginning with the fully unconnected points, (colored according to their surface form for ease of understanding) (top left), each name is connected to any name at most as distant as the most-distant identical name
Once the network is constructed, the only remaining step is to detect communities in it. We experiment with both simple label propagation, which only uses the structure of the network to detect communities, largely ignoring the edge weights, and the Leiden algorithm proposed in (Traang et al, 2019), which iteratively refines the communities in starting from a network in which every node is in its own cluster and combining similar clusters.
To review, we are trying to assign the names to clusters so that the names that refer to the same individual are all in the same cluster. Evaluating our success at this task is somewhat difficult, as clustering problems like this have many evaluation metrics due to the inherent difficulty of quantifying how good a clustering is. We use several metrics, each with their own advantages and disadvantages. For brevity, we will present only the final scores we use to compare models to each other, which is itself the average of three other metrics. A perfect assignment of names to individuals would receive a score of 1.
Table 1: A comparison of different methods of constructing networks from names
From Table 1 above, it is clear that the surface form method does significantly better then constructing the network using k nearest neighbors, especially when using the more advanced Leiden community detection algorithm.
While being able to describe how we solve this process in the abstract is useful, it is perhaps more informative to present a case study in which we’re interested in counting the number of individuals a collection of names refers to solves an actual problem of historical interest. As a case study, we are going to attempt to determine the number of direct transmitters of Ibn Sa’d material to Ibn Asakir. In the language of networks, this means that rather than working with the complete dataset of all names in all isnads, we will consider only the first name in each isnad. In addition to that selection process, we set aside half of the data to select the values of k and m which maximize the scoring metric we use above, and evaluate how well that optimized model performs on the other half of the data. The test set consists of 1,070 names with a total of 63 surface forms for 25 individuals. The results are shown in Table 2 below. For comparison, we also show results of a naive clustering algorithm in which each surface form is used to create its own cluster, which makes no effort to resolve ambiguity across variation in surface forms.
|Model||Score||# of Transmitters|
|kNN, k = 135, Leiden||.661||4|
|kNN, k = 135, LP||.618||3|
|Surface Form m=1.15, Leiden||.791||32|
|Surface Form m=1.05, LP||.779||40|
Table 2: Results for clustering first names in isnads to count the number of direct transmitters to Ibn ‘Asakir
From these results, it is clear that the kNN models tend to vastly over-aggregate the names, resulting in too few clusters, while the reverse problem occurs with the surface form linking heuristic, albeit to a lesser degree. In other words, the kNN models claim there are fewer transmitters than there actually are, while the surface form-based model does the reverse and finds too many transmitters. The latter method does, however, outperform the naive method regardless of the choice of community detection algorithm, and implies that these results on new data would serve as a good starting point for human disambiguation rather than starting from scratch as was done to create this data in the first place.
As with my earlier work on identifying isnads, this research would not be possible without close collaboration with other members of the KITAB team. In particular, much of the work to create labeled data for this research was done by Professor Sarah Savant and researcher Masoumeh Seydi.