Supporting Data¶
Mash: fast genome and metagenome distance estimation using MinHash¶
RefSeqSketches.msh.gz: Mash sketch database (k=16, s=400) for RefSeq release 70 (48MB)
RefSeqSketchesDefaults.msh.gz: Mash sketch database (k=21, s=1000) for RefSeq release 70 (255MB)
Escherichia.tar.gz: Names and accessions for 500 selected Escherichia genomes, pairwise ANI, and pairwise Jaccard indexes for various k-mer and sketch sizes (24MB)
mash-1.0.tar.gz: Mash version 1.0 codebase (93KB)
SRR2671867.BaAmes.poretools.fastq.gz: Nanopore 1D + 2D sequences generated by poretools (157MB)
SRR2671868.Bc10987.poretools.fastq.gz: Nanopore 1D + 2D sequences generated by poretools (250MB)
Mash Screen: High-throughput sequence containment estimation for genome discovery¶
Custom scripts and intermediate data:¶
Data files:¶
- Mash sketch databases for RefSeq release 88:
  - RefSeq88n.msh.gz: Genomes (k=21, s=1000), 1.2Gb uncompressed
  - RefSeq88p.msh.gz: Proteomes (k=9, s=1000), 1.1Gb uncompressed
- art.fastq.gz: Simulated reads for Shakya experiment
Screen of SRA metagenomes vs. RefSeq¶
- sra_meta_nucl_95idy.tsv.gz (2.3Gb uncompressed)
- sra_meta_nucl_80idy_3x.tsv.gz (6.7Gb uncompressed)
- sra_meta_prot_95idy.tsv.gz (2.1Gb uncompressed)
- sra_meta_prot_80idy_3x.tsv.gz (8.3Gb uncompressed)
These files have a line for each RefSeq genome listing all metagenomic SRA runs (as of August 2018) with Mash Containment Scores above the specified threshold. They are provided for two screen modes:
nucl
: Genomic RefSeq sequences
prot
: Proteomic RefSeq sequences (combined amino acid sequences per organism). NOTE: Protein tables above are not p-value filtered and thus large (> ~50Gb) runs may have spurious hits. They also do not contain plasmids. Updates coming soon!
…and at two thresholds:
95idy
: 95% Mash Containment Score, any coverage. Useful for finding runs containing a specific genome.
80idy_3x
: 80% Mash Containment Score, at least 3x median k-mer multiplicity. Useful for finding related, but novel, sequences.
The files are tab separated, with each line beginning with a RefSeq assembly accession, followed by SRA accessions, for example:
GCF_000001215.4 SRR3401361 SRR3540373
GCF_000001405.36 SRR5127794 ERR1539652 SRR413753 ERR206081
GCF_000001405.38 SRR5127794 ERR1539652 ERR1711677 SRR413753 ERR206081
We also provide simple scripts for searching these files: search.tar
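The line format described above can also be read directly. The following is a minimal Python sketch and is not the contents of search.tar; the function names `find_runs` and `find_runs_in_file` are hypothetical. It assumes each line holds one assembly accession followed by tab-separated SRA accessions, as stated above.

```python
import gzip
import io


def find_runs(lines, assembly_accession):
    """Return the SRA accessions listed for one RefSeq assembly accession.

    `lines` is any iterable of text lines in the tab-separated format
    described above. Returns an empty list if the accession is absent.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == assembly_accession:
            return fields[1:]
    return []


def find_runs_in_file(path, assembly_accession):
    """Stream a gzip-compressed table from disk without loading it whole."""
    with gzip.open(path, "rt") as handle:
        return find_runs(handle, assembly_accession)


# In-memory sample built from the example lines above (tabs assumed).
sample = io.StringIO(
    "GCF_000001215.4\tSRR3401361\tSRR3540373\n"
    "GCF_000001405.36\tSRR5127794\tERR1539652\tSRR413753\tERR206081\n"
)
print(find_runs(sample, "GCF_000001215.4"))
# prints ['SRR3401361', 'SRR3540373']
```

With a downloaded table, a call such as `find_runs_in_file("sra_meta_nucl_95idy.tsv.gz", "GCF_000001215.4")` would scan the file line by line, which keeps memory use low for the multi-gigabyte tables.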
Public data sources¶
The BLAST nr database was downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*.
HMP data were downloaded from ftp://public-ftp.ihmpdcc.org/, reads from the Ilumina/ directory and coding sequences from the HMGI/ directory. Within these folders, sample SRS015937 resides in tongue_dorsum/ and SRS020263 in right_retroauricular_crease/.
SRA runs were downloaded with the SRA Toolkit.
RefSeq genomes were downloaded from the genomes/refseq/ directory of ftp.ncbi.nlm.nih.gov.
Public data products¶
Quebec Polyomavirus has been submitted to GenBank as BK010702.