Turning vice into virtue: Using batch-effects to detect errors in large genomic data sets / Mafessoni, F; Prasad, R. B.; Groop, L.; Hansson, O.; Prüfer, K.. - In: GENOME BIOLOGY AND EVOLUTION. - ISSN 1759-6653. - 15:3(2018), pp. 488-503. [10.1093/gbe/evy199]

Turning vice into virtue: Using batch-effects to detect errors in large genomic data sets

MAFESSONI F;
2018-01-01

Abstract

It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling data sets with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear as SNPs. Here, we devise a method to detect systematic errors in combined data sets. To measure quality differences between individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects. Applying our method to the 1000 Genomes data set, we find that coding regions are enriched for errors, where 1% of the higher-frequency variants are predicted to be erroneous, whereas errors outside of coding regions are much rarer (<0.001%). As expected, predicted errors are found less often than other variants in a data set that was generated with a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000 Genomes errors are also found in other large data sets; our observation is thus not specific to the 1000 Genomes data set. Our results show that batch effects can be turned into a virtue by using the resulting variation in large-scale data sets to detect systematic errors. © 2018 Oxford University Press. All rights reserved.
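The idea of counting co-occurring variant pairs on different chromosomes can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`cooccurring_pairs`, `pair_load`), the input layout, the simple observed-versus-expected ratio, and the `min_ratio` threshold are all assumptions made for illustration, and the toy data are invented. The sketch flags pairs of variants on different chromosomes that are carried together more often than independence would predict, then counts how many such pairs each individual carries; an unusually high count in some genomes would hint at a batch-specific artifact.

```python
from itertools import combinations


def cooccurring_pairs(carriers, chrom, n_individuals, min_ratio=2.0):
    """Return inter-chromosomal variant pairs that co-occur more than expected.

    carriers: dict mapping variant id -> set of individual indices carrying it
    chrom:    dict mapping variant id -> chromosome label
    min_ratio is a hypothetical threshold on observed/expected co-occurrence.
    """
    pairs = []
    for a, b in combinations(sorted(carriers), 2):
        if chrom[a] == chrom[b]:
            continue  # only pairs on different chromosomes are considered
        observed = len(carriers[a] & carriers[b])
        # Expected number of shared carriers if the two variants were independent
        expected = len(carriers[a]) * len(carriers[b]) / n_individuals
        if expected > 0 and observed / expected >= min_ratio:
            pairs.append((a, b))
    return pairs


def pair_load(pairs, carriers, n_individuals):
    """Count, for each individual, how many flagged pairs that individual carries."""
    load = [0] * n_individuals
    for a, b in pairs:
        for i in carriers[a] & carriers[b]:
            load[i] += 1
    return load


# Toy data (invented): individuals 0 and 1 share three variants on
# three different chromosomes, as a shared batch artifact might produce.
carriers = {
    "v1": {0, 1},
    "v2": {0, 1},
    "v3": {0, 1},
    "v4": {2, 4},
    "v5": {3, 5},
}
chrom = {"v1": "chr1", "v2": "chr2", "v3": "chr3", "v4": "chr1", "v5": "chr2"}

pairs = cooccurring_pairs(carriers, chrom, n_individuals=6)
load = pair_load(pairs, carriers, n_individuals=6)
print(pairs)
print(load)
```

In this toy example only individuals 0 and 1 carry flagged pairs, so they stand out from the rest. A real analysis would need a proper statistical test and much larger data; the ratio threshold here only stands in for that step.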
Use this identifier to cite or link to this document: https://hdl.handle.net/11368/3096303