Learning the evolution of disciplines from scientific literature: A functional clustering approach to normalized keyword count trajectories

Trevisani, Matilde
2018-01-01

Abstract

The growing availability of large diachronic corpora of scientific literature makes it possible to read the temporal evolution of concepts, methods and applications, i.e., the history of the disciplines involved in the strand under investigation. After the most relevant keywords have been retrieved, bag-of-words approaches produce words × time-points contingency tables, i.e., the frequencies of each word in the set of texts grouped by time point. By analysing word counts over the observed period, the study aims to reconstruct the “life-cycle” of each word, cluster words that have similar life-cycles and thereby detect prototypical or exemplary temporal patterns. Unveiling such relevant and (according to expert opinion) meaningful inner dynamics enables us to trace a historical narrative of the discipline of interest. However, different readings of that history are possible depending on the type of data normalization, which is needed to account for the fluctuating size of texts across time and for the general problems of data sparsity and strong asymmetry. This study proposes a methodology consisting of (1) a stepwise information retrieval procedure for keyword selection and (2) a two-stage functional clustering approach for statistical learning. Moreover, a sample of possible normalizations of word frequencies is considered, showing that the concept of curve similarity induced in clustering by each type of transformation heavily affects the composition and size of the groups. The corpus of titles of scientific papers published in the American Statistical Association journals over the period 1888–2012 is examined for illustration.
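As a rough illustration of the words × time-points table and of why normalization matters, the following minimal Python sketch builds such a table from a toy corpus and applies two simple transformations. The corpus, the function names and the two normalizations (relative frequency per time point, and scaling each word's trajectory by its own maximum) are assumptions chosen for illustration only; they are not necessarily the normalizations examined in the paper, and the functional clustering step is not shown.

```python
from collections import Counter

# Hypothetical toy corpus of (year, title) pairs; not data from the study.
titles = [
    (1900, "census methods"),
    (1900, "census tables"),
    (1950, "sampling methods"),
    (1950, "sampling theory"),
    (1950, "census sampling"),
    (2000, "bayesian methods"),
]

def contingency_table(docs, keywords):
    """Return (time_points, {word: [count at each time point]})."""
    time_points = sorted({year for year, _ in docs})
    counts = {t: Counter() for t in time_points}
    for year, text in docs:
        counts[year].update(text.lower().split())
    table = {w: [counts[t][w] for t in time_points] for w in keywords}
    return time_points, table

def tokens_per_time_point(docs):
    """Total number of tokens at each time point, in chronological order."""
    totals = Counter()
    for year, text in docs:
        totals[year] += len(text.split())
    return [totals[t] for t in sorted(totals)]

def relative_frequency(table, sizes):
    """Divide each count by the size of the texts at that time point."""
    return {w: [c / n for c, n in zip(row, sizes)] for w, row in table.items()}

def scale_by_row_max(table):
    """Rescale each word's trajectory so that its peak equals 1."""
    return {
        w: [c / max(row) if max(row) else 0.0 for c in row]
        for w, row in table.items()
    }

time_points, table = contingency_table(titles, ["census", "methods"])
sizes = tokens_per_time_point(titles)
by_size = relative_frequency(table, sizes)
by_peak = scale_by_row_max(table)
```

In this toy example the raw counts for "methods" are constant over time, whereas its relative frequency varies with the size of the texts at each time point, so the two transformations suggest different curve shapes for the same word.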
Files in this record:

Trevisani_Tuzzi_KNOSYS_2018.pdf
Access: closed
Type: Publisher's version
License: Publisher copyright
Size: 3.41 MB
Format: Adobe PDF

2921084_Trevisani_Tuzzi_KNOSYS_2018-PostPrint.pdf
Access: open
Description: PostPrint VQR3
Type: Final post-refereeing draft (post-print)
License: Digital Rights Management not defined
Size: 4.02 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11368/2921084
Citations
  • PMC: ND
  • Scopus: 11
  • ISI: 7