Utility-Oriented String Mining

Bernardini, Giulia; Chen, Huiping; Conte, Alessio; Grossi, Roberto; Guerrini, Veronica; Loukides, Grigorios; Pisanti, Nadia; Pissis, Solon P.

doi:10.1137/1.9781611978032.22

A string is often provided with numerical scores (utilities) which quantify the importance, interest, profit, or risk of the letters occurring at every position of the string. For example, every DNA fragment produced by modern sequencing machines comes with a confidence score per position. Motivated by the abundance of strings with utilities, we introduce Utility-oriented String Mining (USM), a natural generalization of the classic frequent substring mining problem. Given a string S of length n and a threshold V, USM asks for every string R whose utility U(R) is at least V, where U is a function that maps R to a utility score based on the utilities of all letters of every occurrence of R in S. In addition, our work makes the following contributions: (1) We identify a class U of utility functions for which USM admits an O(n^2)-time algorithm. (2) We prove that no listing algorithm solves the USM problem in subquadratic time for every utility function, or even for every function in U. (3) We propose an O(n log n)-time algorithm that solves USM for a class of monotone functions from U. (4) We design another O(n log n)-time algorithm for the same problem that is comparable in runtime but offers drastic space savings in practice when, in addition, a lower bound on the length of the output strings is provided as input. (5) We demonstrate experimentally using publicly available, billion-letter datasets that our algorithms are many times more efficient, in terms of runtime and/or space, compared to an Apriori-like baseline which employs advanced string processing tools

Utility-Oriented String Mining / Bernardini, G., Chen, H., Conte, A., Grossi, R., Guerrini, V., Loukides, G., Pisanti, N., Pissis, S.P.. - ELETTRONICO. - (2024), pp. 190-198. (2024 SIAM International Conference on Data Mining Houston, TX, U.S. 18-20 April 2024) [10.1137/1.9781611978032.22].