Distance Estimation

MinHash Jaccard estimation

Given \(k\)-mer sets \(A\) and \(B\), the MinHash algorithm provides an estimation of the Jaccard index:

\[j(A_s,B_s) = \frac {\lvert A_s \cap B_s \rvert} s\]

where \(A_s\) and \(B_s\) are subsets such that \(\lvert A_s \cup B_s \rvert\) is equal to the sketch size, \(s\), allowing for a known error bound as suggested by Broder [1]. This is done by using a merge-sort algorithm to find common values between the two sorted sketches and terminating when the total number of hashes seen reaches the sketch size (or all hashes in both sketches have been seen).

Mash distance formulation

For mutating a sequence with \(t\) total \(k\)-mers and a conserved \(k\)-mer count \(w\), an approximate mutation rate \(d\) can be estimated using a Poisson model of mutations occurring in \(k\)-mers, as suggested by Fan et al. [2]:

\[d = \frac {-1} k \ln \frac w t\]

In order to use a Jaccard estimate \(j\) between two \(k\)-mer sets of arbitrary sizes, the Jaccard estimate can be framed in terms of the conserved \(k\)-mer count \(w\) and the average set size \(n\):

\[j \approx \frac w {2n - w}\]

To substitute \(n\) for the total \(k\)-mer count \(t\) in the mutation estimation, this approximation can be reformulated as:

\[\frac w n \approx \frac {2j} {1 + j}\]

Substituting \(\frac w n\) for \(\frac w t\) thus yields the Mash distance:

\[\begin{split}D(k,j)=\Bigg\{\begin{split} &1&,\ j=0\\ &\frac {-1} k \ln \frac {2j} {1 + j}&,\ 0<j\le 1 \end{split}\end{split}\]

Assessing significance with p-values

Since MinHash distances are probabilistic estimates, it is important to consider the probability of seeing a given distance by chance. mash dist thus provides p-values with distance estimations. Lower p-values correspond to more confident distance estimations, and will often be rounded down to 0 due to floating point limits. If p-values are high (above, say, 0.01), the \(k\)-mer size is probably too small for the size of the genomes being compared.

When estimating the distance of genome 1 and genome 2 from sketches with the properties:

\(\Sigma\) := alphabet

\(k\) := \(k\)-mer size

\(l_1\) := length of genome 1

\(l_2\) := length of genome 2

\(s\) := sketch size

\(x\) := number of shared \(k\)-mers between sketches of size \(s\) of genome 1 and genome 2

…the chance of a \(k\)-mer appearing in random sequences of lengths \(l_1\) and \(l_2\) are estimated as:

\[ \begin{align}\begin{aligned}r_1 = 1-(1-\frac{1}{{\lvert\Sigma\rvert}^k})^{l_1} \approx \frac{l_1}{l_1+{\lvert\Sigma\rvert}^k}\\r_2 = 1-(1-\frac{1}{{\lvert\Sigma\rvert}^k})^{l_2} \approx \frac{l_2}{l_2+{\lvert\Sigma\rvert}^k}\end{aligned}\end{align} \]

The expected Jaccard index of the sketches of the random sequences is then:

\[j_r = \frac{r_1 r_2}{r_1 + r_2 - r_1 r_2}\]

…and the probability of observing at least \(x\) shared \(k\)-mers can be estimated with the tail of a cumulative binomial distribution:

\[p = 1 - \sum\limits_{i=0}^{x-1} \binom{s}{i} j_r^i (1-j_r)^{s-i}\]
[1]Broder, A.Z. On the resemblance and containment of documents. Compression and Complexity of Sequences 1997 - Proceedings, 21-29 (1998).
[2]Fan, H., Ives, A.R., Surget-Groba, Y. & Cannon, C.H. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC genomics 16, 522 (2015).