Mash: fast genome and metagenome distance estimation using MinHash
RefSeqSketches.msh.gz: Mash sketch database (k=16, s=400) for RefSeq release 70 (48MB)
RefSeqSketchesDefaults.msh.gz: Mash sketch database (k=21, s=1000) for RefSeq release 70 (255MB)
Escherichia.tar.gz: Names and accessions for 500 selected Escherichia genomes, pairwise ANI, and pairwise Jaccard indexes for various k-mer and sketch sizes (24MB)
mash-1.0.tar.gz: Mash version 1.0 codebase (93KB)
SRR2671867.BaAmes.poretools.fastq.gz: Nanopore 1D + 2D sequences generated by poretools (157MB)
SRR2671868.Bc10987.poretools.fastq.gz: Nanopore 1D + 2D sequences generated by poretools (250MB)
Mash Screen: High-throughput sequence containment estimation for genome discovery
- Mash Sketch databases for RefSeq release 88:
art.fastq.gz: Simulated reads for Shakya experiment
Screen of SRA metagenomes vs. RefSeq
These files have a line for each RefSeq genome listing all metagenomic SRA runs (as of August 2018) with Mash Containment Scores above the specified threshold. They are provided for two screen modes:
nucl: Genomic RefSeq sequences
prot: Proteomic RefSeq sequences (combined amino acid sequences per organism). NOTE: The protein tables are not p-value filtered, so large runs (over roughly 50Gb) may have spurious hits. They also do not contain plasmids. Updates coming soon!
…and at two thresholds:
95idy: 95% Mash Containment Score, any coverage. Useful for finding runs containing a specific genome.
80idy_3x: 80% Mash Containment Score, at least 3x median k-mer multiplicity. Useful for finding related, but novel, sequences.
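The two thresholds can be illustrated with a minimal Python sketch. It is not the pipeline used to build the tables. It assumes raw `mash screen` output with the tab-separated fields identity, shared-hashes, median-multiplicity, p-value, query-ID and query-comment, and it assumes the Mash Containment Score corresponds to the identity column. The function name `filter_screen_hits` and the sample rows are hypothetical.

```python
import csv

def filter_screen_hits(lines, min_identity, min_multiplicity=0):
    """Yield (query_id, identity, multiplicity) for rows passing both cutoffs.

    Assumed column order (mash screen output):
    identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment.
    Cutoffs are applied inclusively (>=); that choice is an assumption.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or malformed rows
        identity = float(row[0])
        multiplicity = float(row[2])
        if identity >= min_identity and multiplicity >= min_multiplicity:
            yield row[4], identity, multiplicity

# Hypothetical rows, for illustration only.
rows = [
    "0.99\t980/1000\t1\t0\tgenomeA\tcomment",
    "0.85\t300/1000\t5\t0\tgenomeB\tcomment",
    "0.82\t200/1000\t1\t0\tgenomeC\tcomment",
]

hits_95idy = list(filter_screen_hits(rows, 0.95))        # any coverage
hits_80idy_3x = list(filter_screen_hits(rows, 0.80, 3))  # at least 3x multiplicity
```

With these sample rows, the 95idy filter keeps only genomeA, while the 80idy_3x filter keeps only genomeB, because genomeA and genomeC fall below 3x median k-mer multiplicity.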
The files are tab separated, with each line beginning with a RefSeq assembly accession, followed by SRA accessions, for example:
GCF_000001215.4 SRR3401361 SRR3540373
GCF_000001405.36 SRR5127794 ERR1539652 SRR413753 ERR206081
GCF_000001405.38 SRR5127794 ERR1539652 ERR1711677 SRR413753 ERR206081
We also provide simple scripts for searching these files: search.tar
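For readers who prefer to work with the tables directly, the following is a minimal Python sketch of how such a file could be parsed and searched in both directions. It is not the contents of search.tar. The function names `load_table` and `invert` are hypothetical, and the sample lines reuse the example accessions shown above.

```python
from collections import defaultdict

def load_table(lines):
    """Map each RefSeq assembly accession to its list of SRA run accessions."""
    genome_to_runs = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        # First field is the assembly accession; the rest are SRA runs.
        genome_to_runs[fields[0]] = [f for f in fields[1:] if f]
    return genome_to_runs

def invert(genome_to_runs):
    """Map each SRA run accession to the assemblies it was reported for."""
    run_to_genomes = defaultdict(list)
    for genome, runs in genome_to_runs.items():
        for run in runs:
            run_to_genomes[run].append(genome)
    return dict(run_to_genomes)

# Sample lines built from the example above (tab separated).
sample = [
    "GCF_000001215.4\tSRR3401361\tSRR3540373\n",
    "GCF_000001405.36\tSRR5127794\tERR1539652\tSRR413753\tERR206081\n",
    "GCF_000001405.38\tSRR5127794\tERR1539652\tERR1711677\tSRR413753\tERR206081\n",
]

table = load_table(sample)
by_run = invert(table)
```

Here `table["GCF_000001215.4"]` lists the runs reported for that assembly, and `by_run["SRR5127794"]` lists the assemblies reported for that run. For a real table, pass an open file handle to `load_table` instead of the sample list.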
Public data sources
nr database was downloaded from
HMP data were downloaded from
ftp://public-ftp.ihmpdcc.org/, reads from the
and coding sequences from the
HMGI/ directory. Within these folders, sample SRS015937 resides in
tongue_dorsum/ and SRS020263 in
SRA runs were downloaded with the SRA Toolkit.
RefSeq genomes were downloaded from the
genomes/refseq/ directory of
Public data products
Quebec Polyomavirus has been submitted to GenBank as BK010702.