We study the problem of making a de Bruijn graph (dBG), constructed from a collection of strings, weakly connected while minimizing the total cost of edge additions. The input graph is a dBG that can be made weakly connected by adding edges (along with extra nodes if needed) from the underlying complete dBG. The problem arises from genome reconstruction, where the dBG is constructed from a set of sequences generated from a genome sample by a sequencing experiment. Due to sequencing errors, the dBG is never Eulerian in practice and is often not even weakly connected. We show the following results for a dBG G(V,E) of order k consisting of d weakly connected components: 1) Making G weakly connected by adding a set of edges of minimal total cost is NP-hard. 2) No PTAS exists for making G weakly connected by adding a set of edges of minimal total cost (unless the unique games conjecture fails). We complement this result by showing that there does exist a polynomial-time (2-2/d)-approximation algorithm for the problem. 3) We consider a restricted version of the above problem, where we are asked to make G weakly connected by only adding directed paths between pairs of components. We show that making G weakly connected by adding d-1 such paths of minimal total cost can be done in 𝒪(k|V|α(|V|)+|E|) time, where α(⋅) is the inverse Ackermann function. This improves on the 𝒪(k|V|log(|V|)+|E|)-time algorithm proposed by Bernardini et al. [CPM 2022] for the same restricted problem. 4) An ILP formulation of polynomial size for making G Eulerian with minimal total cost.

Connecting de Bruijn Graphs

Giulia Bernardini
;
2024-01-01

Abstract

We study the problem of making a de Bruijn graph (dBG), constructed from a collection of strings, weakly connected while minimizing the total cost of edge additions. The input graph is a dBG that can be made weakly connected by adding edges (along with extra nodes if needed) from the underlying complete dBG. The problem arises from genome reconstruction, where the dBG is constructed from a set of sequences generated from a genome sample by a sequencing experiment. Due to sequencing errors, the dBG is never Eulerian in practice and is often not even weakly connected. We show the following results for a dBG G(V,E) of order k consisting of d weakly connected components: 1) Making G weakly connected by adding a set of edges of minimal total cost is NP-hard. 2) No PTAS exists for making G weakly connected by adding a set of edges of minimal total cost (unless the unique games conjecture fails). We complement this result by showing that there does exist a polynomial-time (2-2/d)-approximation algorithm for the problem. 3) We consider a restricted version of the above problem, where we are asked to make G weakly connected by only adding directed paths between pairs of components. We show that making G weakly connected by adding d-1 such paths of minimal total cost can be done in 𝒪(k|V|α(|V|)+|E|) time, where α(⋅) is the inverse Ackermann function. This improves on the 𝒪(k|V|log(|V|)+|E|)-time algorithm proposed by Bernardini et al. [CPM 2022] for the same restricted problem. 4) An ILP formulation of polynomial size for making G Eulerian with minimal total cost.
File in questo prodotto:
File Dimensione Formato  
LIPIcs.CPM.2024.6.pdf

accesso aperto

Tipologia: Documento in Versione Editoriale
Licenza: Creative commons
Dimensione 1.35 MB
Formato Adobe PDF
1.35 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11368/3079058
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact