Automatic topography of high-dimensional data sets by non-parametric density peak clustering

Data analysis in high-dimensional spaces aims to obtain a synthetic description of a data set, revealing its main structure and its salient features. We introduce an approach that provides this description in the form of a topography of the data, namely a human-readable chart of the probability density from which the data are sampled. The approach is based on an unsupervised extension of Density Peak clustering and on a non-parametric density estimator that measures the probability density in the manifold containing the data. This makes it possible to find automatically the number and height of the peaks of the probability density, and the depth of the “valleys” separating them. Importantly, the density estimator also provides a measure of its own error, which makes it possible to distinguish genuine density peaks from density fluctuations due to finite sampling. The approach thus provides robust, visual information about the heights of the density peaks, their statistical reliability and their hierarchical organization, offering a conceptually powerful extension of standard clustering partitions. We show that this framework is particularly useful in the analysis of complex data sets.
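
The workflow summarized in the abstract can be illustrated with a minimal sketch. It is not the published algorithm: it assumes a plain fixed-k nearest-neighbour density estimate computed in the ambient dimension (the paper's estimator works in the data manifold), an approximate error of 1/sqrt(k) on the log-density, and a simple threshold `Z` for deciding whether a peak stands out from the saddle that separates it from a neighbouring peak. The function names, `k`, `Z` and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def knn_log_density(X, k):
    """Fixed-k nearest-neighbour log-density (up to a constant) and its error."""
    n, d = X.shape
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    nbrs = np.argsort(dist, axis=1)[:, 1:k + 1]   # column 0 is the point itself
    r_k = dist[np.arange(n), nbrs[:, -1]]         # distance to the k-th neighbour
    log_rho = np.log(k) - d * np.log(r_k)         # ASSUMPTION: ambient d, not intrinsic
    err = np.full(n, 1.0 / np.sqrt(k))            # ASSUMPTION: approximate error of log_rho
    return log_rho, err, nbrs


def assign_to_peaks(log_rho, nbrs):
    """Candidate peaks are local density maxima; other points follow their
    nearest neighbour of higher density."""
    n = len(log_rho)
    order = np.argsort(-log_rho, kind="stable")   # densest point first
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)                    # strict ordering, so ties cannot cycle
    higher = rank[nbrs] < rank[:, None]           # neighbours denser than the point
    is_peak = ~higher.any(axis=1)
    parent = nbrs[np.arange(n), higher.argmax(axis=1)]  # first (nearest) denser neighbour
    labels = np.full(n, -1)
    labels[is_peak] = np.arange(is_peak.sum())
    for i in order:                               # parents are always labelled first
        if not is_peak[i]:
            labels[i] = labels[parent[i]]
    return labels


def merge_insignificant(log_rho, err, nbrs, labels, Z=1.5):
    """Merge a peak into a denser neighbour when its height above the saddle
    between them is within Z times the summed errors (illustrative criterion)."""
    labels = labels.copy()
    while True:
        peak = {}
        for c in np.unique(labels):
            members = np.flatnonzero(labels == c)
            peak[c] = members[np.argmax(log_rho[members])]
        saddle = {}                               # (a, b) -> index of the saddle point
        for i in range(len(labels)):
            for j in nbrs[i]:
                a, b = labels[i], labels[j]
                if a == b:
                    continue
                key = (min(a, b), max(a, b))
                p = i if log_rho[i] < log_rho[j] else j
                if key not in saddle or log_rho[p] > log_rho[saddle[key]]:
                    saddle[key] = p
        candidates = []
        for (a, b), p in saddle.items():
            lo, hi = (a, b) if log_rho[peak[a]] < log_rho[peak[b]] else (b, a)
            if log_rho[peak[lo]] - log_rho[p] < Z * (err[peak[lo]] + err[p]):
                candidates.append((log_rho[p], lo, hi))
        if not candidates:
            return labels
        _, lo, hi = max(candidates)               # merge across the highest saddle first
        labels[labels == lo] = hi


# Two well-separated synthetic blobs (made-up data, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(50.0, 1.0, (200, 2))])
log_rho, err, nbrs = knn_log_density(X, k=15)
raw = assign_to_peaks(log_rho, nbrs)
merged = merge_insignificant(log_rho, err, nbrs, raw, Z=1.5)
print(len(np.unique(raw)), "candidate peaks ->", len(np.unique(merged)), "after merging")
```

The peak heights and the saddle densities recorded during merging are the ingredients of the "topography": heights and valley depths on a log-density scale, with the error estimate deciding which peaks are kept. Raising `Z` makes the test stricter and leaves fewer peaks.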

d'Errico M.; Facco E.; Laio A.; Rodriguez Garcia A.

2021-01-01

Year: 2021
Publication status: Published
Journal: INFORMATION SCIENCES
DOI: https://dx.doi.org/10.1016/j.ins.2021.01.010
URL: https://www.sciencedirect.com/science/article/pii/S0020025521000116
Appears in types: 1.1 Journal article

Files in this record:

File	Access	Type	License	Size	Format
1-s2.0-S0020025521000116-main.pdf	Closed access	Publisher's version	Publisher copyright	3.44 MB	Adobe PDF
1-s2.0-S0020025521000116-main-Post_print.pdf	Open access from 27/01/2024	Final post-review draft (post-print)	Creative Commons	4.1 MB	Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11368/3034860
