From Networks to Named Entities and Back Again: Exploring Isnad Networks
I have previously written for the project blog about detecting isnads, Arabic citations which give a chain of individuals involved in the transmission of information rather than the single source common in other forms of citation, in the OpenITI corpus using machine learning. In this post, I will focus on one of the problems that arise when trying to analyse isnads computationally and one potential way of solving it: the issue of name ambiguity when trying to discern who is who in isnad networks. I also offer an application of these methods to a case study determining the number of direct transmitters of Ibn Saʿd to Ibn ʿAsakir.
As a quick refresher, in the example isnad below, there are five transmitters involved in the chain, each underlined.
حدثنا أبو داود قال: حدثنا هشام، عن قتادة، عن الحسن عن سمرة، أن النبي صلى الله عليه وسلم
Abu Dawud told us, saying, ‘Hisham told us, from Qatada, from al-Hasan, from Samura, that the Prophet, may the peace and blessing of God be on him … (al-Tayalisi, 204 AH)
Rather than just thinking of this isnad as text alone, we can think of it as representing a chain of individuals involved in the transmission of information, or a simple linear network with five nodes. These networks from isnads, when taken collectively, can be a valuable resource for solving many problems in book history surrounding the sources a particular writer was drawing upon when composing a text. While understanding such a network is something that a reader can do (and indeed does) quite easily while reading a text for a reasonable number of isnads at once, working with a collection as large as the number of isnads even in just a single text becomes very difficult for a human reader in terms of the effort involved, particularly if the text is long.
One of the common issues involved in understanding isnads is that even within one work, an author will often refer to the same transmitter by a variety of different names. We collected a data set of 2,379 isnads from Ibn ʿAsakir’s sixth/twelfth-century historical work Taʾrikh madinat Dimashq. all involving the third/ninth-century scholar Muhammad b. Saʿd. The data set contains 14,454 names, of which 90% have been manually assigned to the individuals to whom they refer. Our annotation process found that the annotated names refer to forty-four distinct individuals. We found that of the forty-four individuals involved in transmission, around half were referred to by multiple different surface forms of their names, with one individual having as many as twenty-five surface forms, as can be seen in Figure 1 below.
Figure 1: Frequencies of surface form counts for individuals in the Ibn ʿAsakir data set.
However, the isnad data itself is also the source of a potential solution to this issue of name ambiguity, since the context a name appears in can help us determine who the individual is. Consider the following isnad fragments taken from the Ibn ʿAsakir data, with transmitters underlined:
أخبرنا أبو بكر الحاسب أنا أبو محمد الشيرازي أنا أبو عمر بن حيوية
أخبرنا أبو بكر الحاسب أنا أبو محمد الجوهري أنا أبو عمر بن حيوية
Both contain the transmitters Abu Bakr al-Hasib, Abu Muhammad al-Shirazi and Abu ʿUmar b. Hayawayya. One will note that Ibn ʿAsakir used two different names to refer to the second transmitter, Abu Muhammad (al-Shirazi and al-Jawhari). However, the two names are not only internally very similar to each other but also occur in similar contexts, with identical names on either side of the ambiguous name. Thus, if we had a way to represent names that let us account for contextual similarity of potentially ambiguous names, we could try to find groups of similar names that are likely to refer to the same individual. The problem then becomes figuring out how to represent names and how to find clusters of similar names. We will tackle each of those in turn.
For representing the names, it would be useful to have representations that are similar when either the names or their contexts are similar. Helpfully, this property is central to a lot of work in natural language processing focused on learning good contextual representations of text, called word embeddings, which are essentially long vectors which encode the meaning of a word in its context. These can be used to accomplish a wide variety of tasks, which we can take advantage of when trying to represent names. Using one such model, a multilingual English-Arabic BERT model trained on modern Arabic,1 as a base, in conjunction with some additional work to improve the model’s ability to understand classical Arabic and how names in isnads are constructed, we can represent names using their word embeddings from that model. For multiword names, we can average the embeddings of each word to obtain a representation of the full name.
With the name representations in hand, we can then turn our attention to clustering them to determine which names refer to the same individual. To that end, we apply the social network analysis tool of community detection. To do this, we first need to construct a network using the collection of name embeddings as nodes. We experiment with two different methods of constructing the network, using different methods for deciding which mentions should be connected to which other mentions. As an aside, it is important to note that since each vector represents a point, we can calculate distances between names using Euclidean distance, which allows us to talk about embeddings as being closer or further away from each other. Additionally, in both cases, the cosine similarity (a measure of the angle between two vectors that ranges from 1 to -1, with higher values denoting greater similarity and perpendicular vectors having 0 similarity) between the two embeddings is used as the weight for any edge added to the network.
The first method is more straightforward, with each embedding being connected to its k nearest neighbours, as demonstrated in Figure 2 below with a simple two-dimensional example where k = 2. While this is a sensible starting point, it has the undesirable property of forcing all mentions to be connected to at least k neighbours. As such, names that are close to fewer than k names will be forced to have at least k neighbours in the network regardless of how similar the names in question are. The second method, which we call the surface-form heuristic and demonstrate in Figure 3, takes a more fine-grained approach to remedy this. For each name, we calculate the distance, d, from that name to the most-distant identical name and connect it to all names at most that distant in embedding space. That way, more isolated regions of the embedding space can have more sparse networks, rather than being subsumed into a larger cluster with a large number of lower-weight edges. One could add a multiplier, m, to look beyond that radius, linking to any mentions within the distance m * d, but we demonstrate the method with m = 1 in Figure 3 below.
Figure 2: A demonstration of the kNN network creation method. Beginning with the fully unconnected points (top left), each name is in turn connected to its two nearest neighbours (as in the top right and bottom left), until the final network (bottom right) is constructed.
Figure 3: A demonstration of the surface-form heuristic network creation method. Beginning with the fully unconnected points (top left, coloured according to their surface forms for ease of understanding), each name is connected to any name at most as distant as the most-distant identical name.
Once the network is constructed, the only remaining step is to detect communities in it. We experiment with both simple label propagation, which uses only the structure of the network to detect communities, largely ignoring the edge weights, and the Leiden algorithm proposed by Traang et al.,2 which iteratively refines the communities, starting from a network in which every node is in its own cluster and combining similar clusters.
To review, we are trying to assign the names to clusters so that the names that refer to the same individual are all in the same cluster. Evaluating our success at this task is somewhat difficult, as clustering problems like this have many evaluation metrics due to the inherent difficulty of quantifying how good a clustering is. We use several metrics, each with their own advantages and disadvantages. For brevity, we will present only the final scores we use to compare models to each other, which is itself the average of three other metrics. A perfect assignment of names to individuals would receive a score of 1.
|kNN, k = 100||Leiden||.727|
|kNN, k = 100||LP||.591|
Table 1: A comparison of different methods of constructing networks from names.
From Table 1 above, it is clear that the surface form method does significantly better than constructing the network using k nearest neighbours, especially when using the more advanced Leiden community detection algorithm.
While being able to describe how we solve this process in the abstract is useful, it is perhaps more informative to present a case study in which counting the number of individuals a collection of names refers to solves an actual problem of historical interest. As a case study, we are going to attempt to determine the number of direct transmitters of Ibn Saʿd’s material to Ibn Asakir. In the language of networks, this means that rather than working with the complete data set of all names in all isnads, we will consider only the first name in each isnad. In addition to that selection process, we set aside half of the data to select the values of k and m which maximise the scoring metric we use above, and evaluate how well that optimised model performs on the other half of the data. The test set consists of 1,070 names with a total of sixty-three surface forms for twenty-five individuals. The results are shown in Table 2 below. For comparison, we also show the results of a naive clustering algorithm in which each surface form is used to create its own cluster and which makes no effort to resolve ambiguity across variation in surface forms.
|Model||Score||# of Transmitters|
|kNN, k = 135, Leiden||.661||4|
|kNN, k = 135, LP||.618||3|
|Surface form, m =1.15, Leiden||.791||32|
|Surface form, m = 1.05, LP||.779||40|
Table 2: Results of clustering first names in isnads to count the number of direct transmitters to Ibn ʿAsakir.
From these results, it is clear that the kNN models tend to vastly over-aggregate the names, resulting in too few clusters, while the reverse problem occurs with the surface-form heuristic, albeit to a lesser degree. In other words, the kNN models claim there are fewer transmitters than there actually are, while the surface form–based model does the reverse and finds too many transmitters. The latter method does, however, outperform the naive method regardless of the choice of community detection algorithm, which implies that these results on new data would serve as a good starting point for human disambiguation rather than starting from scratch as was done to create this data in the first place.
As with my earlier work on identifying isnads, this research would not be possible without close collaboration with other members of the KITAB team. In particular, much of the work to create labelled data for this research was done by Sarah Savant and Masoumeh Seydi.
Lan, W., Chen, Y., Xu, W., & Ritter, A. (2020). “An Empirical Study of Pre-trained Transformers for Arabic Information Extraction.” In Proceedings of The 2020 Conference on Empirical Methods on Natural Language Processing (EMNLP). ↩