An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an O~(nmω−1)+O(N)-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where ω denotes the matrix multiplication exponent and the O~(⋅) notation suppresses polylog factors. In the k-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most k errors. k-EDSM can be solved in O(k2mG+kN) time under edit distance, where G denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, G is only bounded by N, and so even for k=1, the existing algorithm runs in Ω(mN) time in the worst case. Here we make progress in this direction. We show that 1-EDSM can be solved in O((nm2+N)logm) or O(nm3+N) time under edit distance. For the decision version of the problem, we present a faster O(nm2logm−−−−−√+Nloglogm)-time algorithm. Our algorithms rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or range emptiness), which we show how to solve efficiently.
Elastic-Degenerate String Matching with 1 Error
Bernardini, Giulia
;
2022-01-01
Abstract
An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an O~(nmω−1)+O(N)-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where ω denotes the matrix multiplication exponent and the O~(⋅) notation suppresses polylog factors. In the k-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most k errors. k-EDSM can be solved in O(k2mG+kN) time under edit distance, where G denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, G is only bounded by N, and so even for k=1, the existing algorithm runs in Ω(mN) time in the worst case. Here we make progress in this direction. We show that 1-EDSM can be solved in O((nm2+N)logm) or O(nm3+N) time under edit distance. For the decision version of the problem, we present a faster O(nm2logm−−−−−√+Nloglogm)-time algorithm. Our algorithms rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or range emptiness), which we show how to solve efficiently.File | Dimensione | Formato | |
---|---|---|---|
978-3-031-20624-5_2.pdf
Accesso chiuso
Descrizione: https://link.springer.com/chapter/10.1007/978-3-031-20624-5_2
Licenza:
Copyright Editore
Dimensione
527.25 kB
Formato
Adobe PDF
|
527.25 kB | Adobe PDF | Visualizza/Apri Richiedi una copia |
pre print.pdf
accesso aperto
Tipologia:
Documento in Pre-print
Licenza:
Digital Rights Management non definito
Dimensione
813.93 kB
Formato
Adobe PDF
|
813.93 kB | Adobe PDF | Visualizza/Apri |
978-3-031-20624-5_2-Post_print.pdf
Open Access dal 30/10/2023
Tipologia:
Bozza finale post-referaggio (post-print)
Licenza:
Digital Rights Management non definito
Dimensione
1.05 MB
Formato
Adobe PDF
|
1.05 MB | Adobe PDF | Visualizza/Apri |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.