Let Substr_k(X) denote the set of length-k substrings of a given string X for a given integer k > 0. We study the following basic string problem, called z-Shortest _k-Equivalent Strings: Given a set _k of n length-k strings and an integer z > 0, list z shortest distinct strings T₁,…,T_z such that Substr_k(T_i) = _k, for all i ∈ [1,z]. The z-Shortest _k-Equivalent Strings problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. The 1-Shortest _k-Equivalent Strings, referred to as Shortest _k-Equivalent String, asks for a shortest string X such that Substr_k(X) = _k. Our main contributions are summarized below: - Given a directed graph G(V,E), the Directed Chinese Postman (DCP) problem asks for a shortest closed walk that visits every edge of G at least once. DCP can be solved in ̃(|E||V|) time using an algorithm for min-cost flow. We show, via a non-trivial reduction, that if Shortest _k-Equivalent String over a binary alphabet has a near-linear-time solution then so does DCP. - We show that the length of a shortest string output by Shortest _k-Equivalent String is in (k+n²). We generalize this bound by showing that the total length of z shortest strings is in (zk+zn²+z²n). We derive these upper bounds by showing (asymptotically tight) bounds on the total length of z shortest Eulerian walks in general directed graphs. - We present an algorithm for solving z-Shortest _k-Equivalent Strings in (nk+n²log²n+zn²log n+|output|) time. If z = 1, the time becomes (nk+n²log²n) by the fact that the size of the input is Θ(nk) and the size of the output is (k+n²).

On Strings Having the Same Length- k Substrings

Giulia Bernardini
;
2022-01-01

Abstract

Let Substr_k(X) denote the set of length-k substrings of a given string X for a given integer k > 0. We study the following basic string problem, called z-Shortest _k-Equivalent Strings: Given a set _k of n length-k strings and an integer z > 0, list z shortest distinct strings T₁,…,T_z such that Substr_k(T_i) = _k, for all i ∈ [1,z]. The z-Shortest _k-Equivalent Strings problem arises naturally as an encoding problem in many real-world applications; e.g., in data privacy, in data compression, and in bioinformatics. The 1-Shortest _k-Equivalent Strings, referred to as Shortest _k-Equivalent String, asks for a shortest string X such that Substr_k(X) = _k. Our main contributions are summarized below: - Given a directed graph G(V,E), the Directed Chinese Postman (DCP) problem asks for a shortest closed walk that visits every edge of G at least once. DCP can be solved in ̃(|E||V|) time using an algorithm for min-cost flow. We show, via a non-trivial reduction, that if Shortest _k-Equivalent String over a binary alphabet has a near-linear-time solution then so does DCP. - We show that the length of a shortest string output by Shortest _k-Equivalent String is in (k+n²). We generalize this bound by showing that the total length of z shortest strings is in (zk+zn²+z²n). We derive these upper bounds by showing (asymptotically tight) bounds on the total length of z shortest Eulerian walks in general directed graphs. - We present an algorithm for solving z-Shortest _k-Equivalent Strings in (nk+n²log²n+zn²log n+|output|) time. If z = 1, the time becomes (nk+n²log²n) by the fact that the size of the input is Θ(nk) and the size of the output is (k+n²).
File in questo prodotto:
File Dimensione Formato  
LIPIcs-CPM-2022-16.pdf

accesso aperto

Tipologia: Documento in Versione Editoriale
Licenza: Creative commons
Dimensione 988.05 kB
Formato Adobe PDF
988.05 kB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11368/3024591
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact