When aggregating publication bibliographic records harvested from hundreds of repositories, CRIS systems, and web sources, it is common to encounter so-called duplicates, namely groups of records describing the very same real-world document. Deduplication of a collection of bibliographic records is the process of scanning the whole collection to identify such groups and replace each of them with a single record (i.e. “merging”) that unambiguously describes the publication in the collection.
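The process just described can be sketched in a few lines: group records that share a normalized key and keep one representative per group. This is a minimal illustration, not the methodology discussed later; the record fields, the title-based normalization, and the “keep the richest record” merge strategy are all simplifying assumptions.

```python
from collections import defaultdict

def normalize(title):
    # Crude normalization (assumption): lowercase and drop
    # non-alphanumeric characters so trivially different spellings
    # of the same title collide on one key.
    return "".join(ch for ch in title.lower() if ch.isalnum())

def deduplicate(records):
    # Group records by normalized title, then keep the record with
    # the most fields as a stand-in for a real merge step.
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["title"])].append(rec)
    return [max(group, key=len) for group in groups.values()]

records = [
    {"title": "Deep Learning", "doi": "10.1000/example"},  # hypothetical DOI
    {"title": "DEEP LEARNING"},                             # duplicate of the above
    {"title": "Record Linkage in Practice"},
]

deduped = deduplicate(records)  # two records survive the merge
```

Real systems replace the naive title key with blocking and pairwise similarity functions over several fields (authors, dates, identifiers), but the group-then-merge shape stays the same.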
Without doubt, humans are the best agents for this task, and no machinery will ever match their ability to judge whether two records are indeed duplicates. On the other hand, humans are effective only when the collection is small and the process is rarely repeated, which is unfortunately not the case for scholarly communication aggregators and infrastructures such as Google Scholar, OpenAIRE, DOAJ, BASE, OAIster, etc. In fact, an extensive literature exists on automated deduplication methodologies, also known as “record linkage”, “disambiguation”, “named entity recognition”, etc.