Our Pilot

Our current main method was authored by David Smith of Northeastern University and is well suited to work on a large corpus such as the one we have for Arabic. David’s most noted project so far has been on “viral texts” that circulated widely in 19th-century American newspapers. Specifically, he and a team of researchers focused on newspapers in the Library of Congress’s Chronicling America online newspaper archive, comprising 1.6 billion words from 41,829 issues of 132 newspapers. He has also used his algorithm to study how the wording of bills introduced into the US Congress has been reused and circulated.

passim, the software that uses Smith’s algorithm, works by comparing words in one text against words in one or more other texts. The process involves a few simple steps repeated millions, billions, or even trillions of times, and it creates a large data set. The algorithm operates by matching word n-grams and by pruning pairs of texts with small n-gram overlap. Maxim Romanov played a leading role in running passim for the pilot. He began by formatting and normalising all machine-readable files, and then searched using 5-gram units. He ran passim on a computer cluster at Tufts University: it took about 4 days with 10 cores and 64 gigabytes of memory to process our data. For our analysis, however, this data still required post-processing, which Maxim did on a server at the Alexander von Humboldt Chair at Leipzig University.
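The two ideas named above, matching word n-grams and pruning pairs with small overlap, can be illustrated with a minimal Python sketch. It is not passim's own code: the function names, the whitespace tokenisation, and the `min_shared` threshold are assumptions made for illustration only.

```python
from itertools import combinations


def word_ngrams(text, n=5):
    """Return the set of word n-grams (as tuples) found in a text."""
    words = text.split()  # assumption: text is already normalised
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def prune_pairs(texts, n=5, min_shared=2):
    """Keep only pairs of texts that share at least `min_shared` n-grams.

    `texts` maps a (hypothetical) text id to its text. The threshold is
    illustrative and is not a setting taken from passim.
    """
    grams = {text_id: word_ngrams(text, n) for text_id, text in texts.items()}
    kept = {}
    for a, b in combinations(sorted(grams), 2):
        shared = len(grams[a] & grams[b])
        if shared >= min_shared:
            kept[(a, b)] = shared
    return kept


texts = {
    "A": "one two three four five six seven",
    "B": "zero one two three four five six",
    "C": "completely different words appear in this text",
}
print(prune_pairs(texts))  # only the pair ("A", "B") survives pruning
```

Comparing every pair directly, as this sketch does, is only practical for a handful of texts; it is meant to show what "small n-gram overlap" means, not how to process a corpus of thousands of books.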

To be a bit more specific: the algorithm ran across 100-word chunks, sliding across them in five-word units, looking for common sequences. It began with the first five words of the first text and slid across the second text, looking for these five words within the first 100-word chunk (in sequences of words 1-5, 2-6, 3-7, and so forth). It continued in this way until it had compared all of the words in every chunk of the first book to all of the words in every chunk of every other book in the corpus. When the computer found a match, it created a record, which was stored in a file pertaining to the pair of books. The computer proceeded like this for every pairing of books in the corpus, generating records and files. From our pilot, comparing about 3,500 books, we generated about 650,000 such files, now stored at the AKU in a database created by Ahmad Sakhi. This means that about 5% of the time, when any two books were compared, at least one record was generated. Most such records and files cannot be reckoned interesting (the overlaps are inconsequential, often with only a small number of records in a file), but tens of thousands of the files do contain common passages that merit analysis.
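The chunk-and-slide procedure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the pilot's actual code: the function names and the record layout are hypothetical, chunks are taken as consecutive non-overlapping blocks of 100 words, and a dictionary lookup stands in for the word-by-word comparison.

```python
def chunk_words(words, size=100):
    """Split a list of words into consecutive chunks of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def windows(chunk, n=5):
    """Map each n-word window in a chunk to the offsets where it starts."""
    index = {}
    for i in range(len(chunk) - n + 1):
        index.setdefault(tuple(chunk[i:i + n]), []).append(i)
    return index


def match_records(book_a, book_b, size=100, n=5):
    """Return one record per n-word window shared by two books.

    Each record is (chunk_in_a, offset_in_a, chunk_in_b, offset_in_b);
    this layout is an assumption made for the example.
    """
    chunks_b = [windows(c, n) for c in chunk_words(book_b.split(), size)]
    records = []
    for ca, chunk in enumerate(chunk_words(book_a.split(), size)):
        # Slide over words 1-5, 2-6, 3-7, ... of this chunk.
        for i in range(len(chunk) - n + 1):
            gram = tuple(chunk[i:i + n])
            for cb, index in enumerate(chunks_b):
                for j in index.get(gram, []):
                    records.append((ca, i, cb, j))
    return records


print(match_records("a b c d e f g", "x a b c d e y"))
```

Because the chunks in this sketch do not overlap, a shared passage that straddles a chunk boundary would be missed; a fuller implementation would need to handle that case.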

Our next aim is to explore another type of data generated by passim, focused on clusters of related passages from across books. In this case, the algorithm generates an index from the files of repeated n-grams and then extracts candidate document pairs from this index. Combined with enhanced metadata, this second type of data will be especially useful for studying the networks through which ideas passed.
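The index-based step mentioned above can be pictured as an inverted index from n-grams to the documents that contain them. The Python sketch below is a rough illustration under that assumption; the function names and data layout are hypothetical and do not reproduce passim's internals.

```python
from collections import defaultdict
from itertools import combinations


def build_index(docs, n=5):
    """Map each word n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.split()
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(doc_id)
    return index


def candidate_pairs(index):
    """Count, for each pair of documents, the n-grams the two share."""
    counts = defaultdict(int)
    for doc_ids in index.values():
        # Only n-grams found in two or more documents yield pairs.
        for pair in combinations(sorted(doc_ids), 2):
            counts[pair] += 1
    return dict(counts)


docs = {
    "A": "one two three four five six seven",
    "B": "zero one two three four five six",
    "C": "completely different words appear in this text",
}
print(candidate_pairs(build_index(docs)))
```

The point of the index is that document pairs are read off from shared n-grams directly, so pairs with nothing in common are never compared at all.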

Click here to explore our first data visualisation (2015).

Click here to explore our second data visualisation (2017).


