SjClust: A Framework for Incorporating Clustering into Set Similarity Join Algorithms

Cuzzocrea, Alfredo
2018-01-01

Abstract

A critical task in data cleaning and integration is identifying duplicate records that represent the same real-world entity. Similarity joins are widely used to detect pairs of similar records, followed by a clustering algorithm that groups together the records referring to the same entity. Unfortunately, the clustering algorithm is used strictly as a post-processing step, which slows down overall performance and means that final results are produced only at the end of the whole process. Motivated by this limitation, in this article we propose and experimentally evaluate SjClust, a framework that integrates similarity join and clustering into a single operation. The basic idea of our proposal is to introduce a variety of cluster representations that are smoothly merged during the set similarity task carried out by the join algorithm. A further optimization is applied on top of this framework. Results from an extensive experimental campaign show that SjClust outperforms previous approaches by an order of magnitude in most settings.
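To make the idea of folding clustering into the join concrete, the following is a minimal sketch and not the SjClust algorithm itself. It assumes whitespace tokenization, Jaccard similarity, a naive all-pairs join, and a union-find structure standing in for the cluster representations; none of these choices, nor the names `join_and_cluster` and `UnionFind`, come from the abstract, and the sample records are invented for illustration.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class UnionFind:
    """Hypothetical stand-in for a cluster representation."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def join_and_cluster(records: list[str], threshold: float) -> list[list[int]]:
    """Naive all-pairs similarity join that merges clusters as pairs are found."""
    token_sets = [frozenset(r.lower().split()) for r in records]
    uf = UnionFind(len(records))
    for i, j in combinations(range(len(records)), 2):
        # Records already in the same cluster need no similarity computation.
        if uf.find(i) == uf.find(j):
            continue
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            uf.union(i, j)
    # Collect record indices by cluster root.
    clusters: dict[int, list[int]] = {}
    for i in range(len(records)):
        clusters.setdefault(uf.find(i), []).append(i)
    return sorted(clusters.values())


# Invented sample records, for illustration only.
sample = ["John Smith", "smith john", "Mary Jones", "mary jones jr", "Ann Lee"]
print(join_and_cluster(sample, threshold=0.6))
```

Because clusters are merged inside the join loop, pairs whose records already share a cluster are skipped, and no separate clustering pass is needed after the join. A real set similarity join would replace the all-pairs loop with index-based candidate generation.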

Use this identifier to cite or link to this document: https://hdl.handle.net/11368/2939025