Jaccard-like fuzzy distances for computational linguistics

Franzoi, L.

doi:10.1109/SYNASC.2017.00040

Back in 1967 the Croat linguist Z. Muljacic introduced a fuzzy generalization of crisp Hamming distances between binary strings of length n; he wanted to show that Dalmatic, nowadays extinct, is a bridge between the Western group of Romance languages and the Eastern group, basically Romanian. Each language is described by means of n features F_i, which can be present or absent, and so is encoded by a string x_1 ... x_n, where x_i is the truth degree of the proposition "feature F_i is present in the language". However, presence or absence can be ill-defined; consequently, each x_i is rather a truth degree in [0,1] of a multi-valued logic: it is crisp only when x_i = 0 (false, absent) or x_i = 1 (true, present), and strictly fuzzy otherwise. More recently Longobardi et al. [1], [2] have covered the case when a feature F_i is undefined because it is logically inconsistent with the truth degrees assigned to features F_1, ..., F_{i-1}, and the case when a feature is irrelevant because it is crisply absent or "almost" absent in both languages. The latter case requires a Jaccard variant of the original distance. We modify the fuzzy Hamming distance, as in Muljacic's case [3], [4], by going to its Jaccard variant, and we do the same with fuzzy Hamming distinguishabilities, which are a subtle but meaningful variation of the fuzzy Hamming distance [3]. Using Steinhaus transforms, a technical tool that yields the Jaccard-like variant of a given distance, we obtain four metric distances: fuzzy distance and distinguishability without irrelevance, and their corresponding Jaccard variants with both fuzziness and irrelevance. Accordingly, we cluster Muljacic's original data in four ways and comment on the differences; all this paves the way towards jointly gauging ill-defined, irrelevant, and also conditionally undefined features, as in [1], [2].
The tools developed here for the first time will be used on up-to-date linguistic data within the activities of the Human Language Technologies Research Center, Bucharest University.
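
The abstract does not spell out its formulas, so the following is only a minimal sketch of how a Steinhaus transform turns a distance into a Jaccard-like one. It assumes the per-feature disagreement |x_i - y_i| (which reduces to the crisp Hamming distance on binary strings, but may differ from the fuzzy Hamming distance and distinguishability actually used in the paper) and an all-zero pivot string, meaning "every feature crisply absent". The names l1_distance, steinhaus and the sample vectors are hypothetical.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Distance = Callable[[Vector, Vector], float]


def l1_distance(x: Vector, y: Vector) -> float:
    """Sum of per-feature disagreements |x_i - y_i| (assumed stand-in for
    the paper's fuzzy Hamming distance; equals Hamming on crisp strings)."""
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


def steinhaus(d: Distance, pivot: Vector) -> Distance:
    """Steinhaus transform of the metric d with respect to a fixed pivot:
    2 d(x, y) / (d(x, pivot) + d(y, pivot) + d(x, y))."""
    def transformed(x: Vector, y: Vector) -> float:
        dxy = d(x, y)
        denom = d(x, pivot) + d(y, pivot) + dxy
        # denom is zero only when x and y both coincide with the pivot
        return 0.0 if denom == 0 else 2.0 * dxy / denom
    return transformed


# Hypothetical truth degrees for four features in two languages.
lang_a = [1.0, 0.5, 0.0, 0.0]
lang_b = [1.0, 0.0, 0.0, 0.0]
all_absent = [0.0] * len(lang_a)  # assumed pivot: every feature absent

jaccard_like = steinhaus(l1_distance, all_absent)
print(l1_distance(lang_a, lang_b))
print(jaccard_like(lang_a, lang_b))
```

With this choice of pivot, the transformed distance equals the sum of |x_i - y_i| divided by the sum of max(x_i, y_i), so features that are absent in both languages contribute nothing to either sum; on crisp strings it is the usual Jaccard distance.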

ArTS Archivio della ricerca di Trieste