Web-based Data Collection and Quality Issues in Co-Authorship Network Analysis

Domenico De Stefano; Maria Prosperina Vitale; Susanna Zaccarin
2019-01-01

Abstract

Scientific collaboration is an important driver of research progress that supports researchers in generating novel ideas. It has also been recognized as a key factor in measuring and evaluating the scientific performance of scholars. Among the widespread applications of Social Network Analysis (SNA) in recent decades, the study of co-authorship networks, used as a proxy for scholars’ collaborative behavior, is one of the topics that has benefited most from the SNA perspective. Seminal studies explored co-authorship networks in various fields using data gathered from large online international Digital Libraries (DLs), either general (e.g., ISI-WOS, Scopus) or thematically oriented (e.g., Econlit for Economics or Medline for Medical Sciences), rather than data collected through interviews or questionnaires administered directly to the authors of the papers. Another stream of research focuses on interactions among members of a given target population (e.g., scholars involved in a scientific community or affiliated with a given institution) in order to retrieve the pattern of collaborative behaviors and its effect on the scholars’ scientific performance. In this case, recent literature has pointed out that international DLs provide only partial coverage of a scholar’s entire scientific production, as well as undercoverage of the target population. Integrating international data sources with more specialized, local bibliographic archives can help in constructing a complete database. Hence, when merging different and heterogeneous archives, several issues must be resolved: i) the definition of network boundaries (affecting the type of nodes to be included in the network); ii) the identification of duplicated publication records (affecting network ties); iii) the treatment of scholar synonyms and homonyms (affecting the number of network nodes); and iv) the name disambiguation of co-authors external to the target population.
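As a rough illustration of issues ii) and iii), the sketch below merges records from hypothetical archives, collapses duplicates on a normalized title and year, and derives weighted co-authorship ties. The record layout, the `normalize` and `build_network` helpers, the matching key, and the sample names are all assumptions made for this example; they are not the procedure used in the study.

```python
import re
import unicodedata
from itertools import combinations


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def build_network(records):
    """Return (unique_publications, weighted_edges) from raw records.

    Each record is a dict with 'title', 'year' and 'authors'
    (a hypothetical layout). Records sharing a normalized title and
    year are treated as the same publication.
    """
    publications = {}
    for rec in records:
        key = (normalize(rec["title"]), rec["year"])
        authors = {normalize(a) for a in rec["authors"]}
        # A duplicate adds its author names but not an extra publication.
        publications.setdefault(key, set()).update(authors)

    edges = {}
    for authors in publications.values():
        for a, b in combinations(sorted(authors), 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return publications, edges


# Invented sample data: the first two records describe the same paper,
# entered separately at two universities with different formatting.
records = [
    {"title": "Co-authorship Networks in Statistics", "year": 2018,
     "authors": ["Rossi, Anna", "Bianchi, Luca"]},
    {"title": "Co-Authorship networks in statistics.", "year": 2018,
     "authors": ["ROSSI, Anna", "Bianchi, Luca"]},
    {"title": "Sampling Designs", "year": 2019,
     "authors": ["Rossi, Anna", "Verdi, Paolo"]},
]

publications, edges = build_network(records)
```

Without the deduplication step, the tie between the first two authors would be counted twice, which is how duplicated records distort network ties. Exact matching on a normalized key is deliberately simple; real archives would likely need fuzzy matching as well.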
In this study, we address these issues by reconstructing the co-authorship network of a particular scientific community, namely Italian academic statisticians. We collect their bibliographic records from an online platform, the Institutional Research Information System (IRIS), which is available at most Italian universities and includes international as well as national publications. The platform shares the pros and cons of other nationally based DLs. Although it guarantees a high coverage rate of our target population and its scientific production, retrieving co-authorship ties among scholars requires combining the data held in the separate platform deployments at each university. In addition, data quality is affected by the fact that authors enter publication data manually. Moreover, no details are provided on co-authors external to the target population, which implies a considerable effort in author name disambiguation. To deal with these aspects, we first propose a web scraping procedure based on a semi-automatic tool that retrieves publication metadata from the online platform in order to reduce manual adjustments. Second, we introduce a network-based approach to author name disambiguation that requires a minimal set of record attributes (identifier, co-authors, venue). Finally, we discuss the extension of the proposed procedure to related theoretical contexts.
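To make the idea of network-based disambiguation on minimal attributes more concrete, here is a minimal sketch of one generic heuristic. It assumes that two author mentions with compatible names (same surname and first initial) refer to the same person when their records share a co-author or a venue. The `name_key` and `disambiguate` functions, the "Surname, Given" name format, the linking rule, and the sample data are all hypothetical; the abstract does not describe the authors' actual algorithm.

```python
from collections import defaultdict


def name_key(name):
    """Blocking key: surname plus first initial ('Rossi, Anna' -> ('rossi', 'a'))."""
    surname, _, given = name.partition(",")
    return surname.strip().lower(), given.strip()[:1].lower()


def disambiguate(records):
    """Group author mentions that plausibly refer to the same person.

    records: list of dicts with 'id', 'authors' and 'venue'.
    Returns a dict mapping (record id, author name) -> cluster label.
    """
    mentions = [(r["id"], a) for r in records for a in r["authors"]]
    parent = {m: m for m in mentions}

    def find(m):
        # Union-find lookup with path halving.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    by_id = {r["id"]: r for r in records}
    blocks = defaultdict(list)
    for m in mentions:
        blocks[name_key(m[1])].append(m)

    for block in blocks.values():
        for i, m1 in enumerate(block):
            for m2 in block[i + 1:]:
                if m1[0] == m2[0]:
                    continue  # two authors of one record are distinct people
                r1, r2 = by_id[m1[0]], by_id[m2[0]]
                # Co-author keys of each record, excluding the mention itself.
                co1 = {name_key(a) for a in r1["authors"]} - {name_key(m1[1])}
                co2 = {name_key(a) for a in r2["authors"]} - {name_key(m2[1])}
                if co1 & co2 or r1["venue"] == r2["venue"]:
                    parent[find(m1)] = find(m2)

    return {m: find(m) for m in mentions}


# Invented sample data for illustration only.
records = [
    {"id": "p1", "authors": ["Rossi, Anna", "Smith, J."], "venue": "Journal A"},
    {"id": "p2", "authors": ["Rossi, A.", "Smith, John"], "venue": "Journal B"},
    {"id": "p3", "authors": ["Rossi, Aldo", "Verdi, Paolo"], "venue": "Journal C"},
]

clusters = disambiguate(records)
```

In this toy input, "Rossi, Anna" and "Rossi, A." end up in the same cluster because both records share a co-author with key `('smith', 'j')`, while "Rossi, Aldo" stays separate. The rule is intentionally permissive: a shared venue alone triggers a merge, which could conflate homonyms publishing in the same outlet, so stricter evidence thresholds may be preferable in practice.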
Year: 2019
ISBN: 9788894312096

Use this identifier to cite or link to this document: https://hdl.handle.net/11368/2942481