Supporting Data

Mash: fast genome and metagenome distance estimation using MinHash

RefSeqSketches.msh.gz: Mash sketch database (k=16, s=400) for RefSeq release 70 (48MB)

RefSeqSketchesDefaults.msh.gz: Mash sketch database (k=21, s=1000) for RefSeq release 70 (255MB)

Escherichia.tar.gz: Names and accessions for 500 selected Escherichia genomes, pairwise ANI, and pairwise Jaccard indexes for various k-mer and sketch sizes (24MB)

mash-1.0.tar.gz: Mash version 1.0 codebase (93KB)

SRR2671867.BaAmes.poretools.fastq.gz: Nanopore 1D + 2D sequences generated by poretools (157MB)

SRR2671868.Bc10987.poretools.fastq.gz: Nanopore 1D + 2D sequences generated by poretools (250MB)

Mash Screen: High-throughput sequence containment estimation for genome discovery

Custom scripts and intermediate data:


Data files:

Mash Sketch databases for RefSeq release 88:

art.fastq.gz: Simulated reads for Shakya experiment

Figure 5:

Screen of SRA metagenomes vs. RefSeq

These files have a line for each RefSeq genome listing all metagenomic SRA runs (as of August 2018) with Mash Containment Scores above the specified threshold. They are provided for two screen modes:

  • nucl: Genomic RefSeq sequences
  • prot: Proteomic RefSeq sequences (combined amino acid sequences per organism). NOTE: The protein tables are not p-value filtered, so large runs (more than roughly 50Gb) may have spurious hits. They also do not contain plasmids. Updates coming soon!

…and at two thresholds:

  • 95idy: 95% Mash Containment Score, any coverage. Useful for finding runs containing a specific genome.
  • 80idy_3x: 80% Mash Containment Score, at least 3x median k-mer multiplicity. Useful for finding related, but novel, sequences.
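The two thresholds can be read as simple predicates on a screen result. The following is a minimal sketch, not the code used to build the tables: the function name `matching_tables` is hypothetical, the score is assumed to be expressed as a fraction between 0 and 1, and treating the boundaries as inclusive is an assumption.

```python
from typing import List


def matching_tables(score: float, median_multiplicity: float) -> List[str]:
    """Return the threshold labels a single screen result would satisfy.

    score: Mash Containment Score as a fraction in [0, 1] (assumed scale).
    median_multiplicity: median k-mer multiplicity of the hit.
    Boundaries are treated as inclusive, which is an assumption.
    """
    tables = []
    # 95idy: at least 95% containment, any coverage.
    if score >= 0.95:
        tables.append("95idy")
    # 80idy_3x: at least 80% containment and at least 3x median multiplicity.
    if score >= 0.80 and median_multiplicity >= 3:
        tables.append("80idy_3x")
    return tables


if __name__ == "__main__":
    print(matching_tables(0.97, 1))   # high identity, low coverage
    print(matching_tables(0.85, 5))   # related sequence, good coverage
    print(matching_tables(0.85, 1))   # satisfies neither threshold
```

A result can satisfy both thresholds at once, which is why the function returns a list rather than a single label.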

The files are tab separated, with each line beginning with a RefSeq assembly accession, followed by SRA accessions, for example:

GCF_000001215.4       SRR3401361      SRR3540373
GCF_000001405.36      SRR5127794      ERR1539652      SRR413753       ERR206081
GCF_000001405.38      SRR5127794      ERR1539652      ERR1711677      SRR413753       ERR206081

We also provide simple scripts for searching these files: search.tar
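For readers who prefer to work with the tables directly, the format described above is straightforward to parse. The following is a minimal sketch and is not the contents of search.tar; the function names `load_hits` and `assemblies_for_run` are hypothetical, and the sample data are the three example lines shown above.

```python
import io
from typing import Dict, Iterable, List


def load_hits(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each RefSeq assembly accession to its list of SRA run accessions."""
    hits: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        # First field is the assembly accession; the rest are SRA runs.
        hits[fields[0]] = [f for f in fields[1:] if f]
    return hits


def assemblies_for_run(hits: Dict[str, List[str]], run: str) -> List[str]:
    """Return the assembly accessions whose line lists the given SRA run."""
    return sorted(acc for acc, runs in hits.items() if run in runs)


# The three example lines from the text, joined with tabs.
example = (
    "GCF_000001215.4\tSRR3401361\tSRR3540373\n"
    "GCF_000001405.36\tSRR5127794\tERR1539652\tSRR413753\tERR206081\n"
    "GCF_000001405.38\tSRR5127794\tERR1539652\tERR1711677\tSRR413753\tERR206081\n"
)
hits = load_hits(io.StringIO(example))

if __name__ == "__main__":
    print(hits["GCF_000001215.4"])
    print(assemblies_for_run(hits, "ERR1711677"))
```

To read a real table, pass an open file handle to `load_hits` in place of the `io.StringIO` object. Note that the full tables are large, so holding the whole mapping in memory may not be practical; streaming line by line and testing only the first field is a lighter alternative.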

Public data sources

The BLAST nr database was downloaded from*.

HMP data were downloaded from, reads from the Ilumina/ directory and coding sequences from the HMGI/ directory. Within these folders, sample SRS015937 resides in tongue_dorsum/ and SRS020263 in right_retroauricular_crease/.

SRA runs were downloaded with the SRA Toolkit.

RefSeq genomes were downloaded from the genomes/refseq/ directory of

Public data products

Quebec Polyomavirus has been submitted to GenBank as BK010702.