A string is often provided with numerical scores (utilities) which quantify the importance, interest, profit, or risk of the letters occurring at every position of the string. For example, every DNA fragment produced by modern sequencing machines comes with a confidence score per position. Motivated by the abundance of strings with utilities, we introduce Utility-oriented String Mining (USM), a natural generalization of the classic frequent substring mining problem. Given a string S of length n and a threshold V, USM asks for every string R whose utility U(R) is at least V, where U is a function that maps R to a utility score based on the utilities of all letters of every occurrence of R in S. In addition, our work makes the following contributions: (1) We identify a class U of utility functions for which USM admits an O(n^2)-time algorithm. (2) We prove that no listing algorithm solves the USM problem in subquadratic time for every utility function, or even for every function in U. (3) We propose an O(n log n)-time algorithm that solves USM for a class of monotone functions from U. (4) We design another O(n log n)-time algorithm for the same problem that is comparable in runtime but offers drastic space savings in practice when, in addition, a lower bound on the length of the output strings is provided as input. (5) We demonstrate experimentally using publicly available, billion-letter datasets that our algorithms are many times more efficient, in terms of runtime and/or space, compared to an Apriori-like baseline which employs advanced string processing tools

Utility-Oriented String Mining

Bernardini, Giulia;Pisanti, Nadia;
2024-01-01

Abstract

A string is often provided with numerical scores (utilities) which quantify the importance, interest, profit, or risk of the letters occurring at every position of the string. For example, every DNA fragment produced by modern sequencing machines comes with a confidence score per position. Motivated by the abundance of strings with utilities, we introduce Utility-oriented String Mining (USM), a natural generalization of the classic frequent substring mining problem. Given a string S of length n and a threshold V, USM asks for every string R whose utility U(R) is at least V, where U is a function that maps R to a utility score based on the utilities of all letters of every occurrence of R in S. In addition, our work makes the following contributions: (1) We identify a class U of utility functions for which USM admits an O(n^2)-time algorithm. (2) We prove that no listing algorithm solves the USM problem in subquadratic time for every utility function, or even for every function in U. (3) We propose an O(n log n)-time algorithm that solves USM for a class of monotone functions from U. (4) We design another O(n log n)-time algorithm for the same problem that is comparable in runtime but offers drastic space savings in practice when, in addition, a lower bound on the length of the output strings is provided as input. (5) We demonstrate experimentally using publicly available, billion-letter datasets that our algorithms are many times more efficient, in terms of runtime and/or space, compared to an Apriori-like baseline which employs advanced string processing tools
File in questo prodotto:
File Dimensione Formato  
SDM24.pdf

Accesso chiuso

Tipologia: Documento in Versione Editoriale
Licenza: Copyright Editore
Dimensione 856.12 kB
Formato Adobe PDF
856.12 kB Adobe PDF   Visualizza/Apri   Richiedi una copia
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11368/3072798
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact