Simple distance estimation

Download example E. coli genomes:


mash dist genome1.fna genome2.fna

Each result line is tab-delimited and lists the Reference-ID, Query-ID, Mash-distance, P-value, and Matching-hashes:

genome1.fna   genome2.fna     0.0222766       0       456/1000
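The distance and the Matching-hashes column are two views of the same estimate. As a rough check, the Mash distance can be recomputed from the shared-hash fraction j as D = -(1/k) ln(2j / (1 + j)). The sketch below assumes a k-mer size of 21 (Mash's default, which is not stated in this section); the function name is hypothetical.

```shell
# Recompute a Mash distance from the Matching-hashes column.
# Assumption: k-mer size 21 unless a third argument is given.
mash_distance_from_hashes() {
  # $1 = shared hashes, $2 = sketch size, $3 = k-mer size (default 21)
  awk -v shared="$1" -v total="$2" -v k="${3:-21}" 'BEGIN {
    j = shared / total                 # Jaccard estimate from the sketches
    if (j == 0) { print 1; exit }      # no shared hashes: maximum distance
    if (j == 1) { print 0; exit }      # identical sketches: zero distance
    printf "%.6g\n", -log(2 * j / (1 + j)) / k
  }'
}

mash_distance_from_hashes 456 1000   # prints 0.0222766
```

With 456 of 1000 hashes shared, this reproduces the distance reported in the example line above.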

Saving time by sketching first

mash sketch genome1.fna
mash sketch genome2.fna
mash dist genome1.fna.msh genome2.fna.msh

Pairwise comparisons with compound sketch files

Download an additional example E. coli genome:

Sketch the first two genomes to create a combined archive, use mash info to verify its contents, and estimate pairwise distances:

mash sketch -o reference genome1.fna genome2.fna
mash info reference.msh
mash dist reference.msh genome3.fna

This estimates the distance from each query (one here) to each reference (two in the sketch file):

genome1.fna   genome3.fna     0       0       1000/1000
genome2.fna   genome3.fna     0.0222766       0       456/1000
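When a sketch file holds many references, it can help to reduce this table to the closest reference per query. The following is a minimal sketch that relies only on the column order described earlier; the function name closest_reference is hypothetical.

```shell
# Report the closest reference for each query in `mash dist` output.
# Columns: 1=reference, 2=query, 3=distance, 4=p-value, 5=matching hashes.
closest_reference() {
  awk -F '\t' '
    # Keep the row with the smallest distance seen so far for this query.
    !($2 in best) || $3 + 0 < best[$2] { best[$2] = $3 + 0; ref[$2] = $1; dist[$2] = $3 }
    END { for (q in best) printf "%s\t%s\t%s\n", q, ref[q], dist[q] }
  ' "$@"
}

# Usage with the two rows shown above (reads standard input when no file is given)
printf 'genome1.fna\tgenome3.fna\t0\t0\t1000/1000\ngenome2.fna\tgenome3.fna\t0.0222766\t0\t456/1000\n' \
  | closest_reference
```

For the example rows this reports genome1.fna as the closest reference to genome3.fna, at distance 0.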

Querying read sets against an existing RefSeq sketch

Download and gunzip the pre-sketched RefSeq archive (reads not provided here; 10x-100x coverage of a single genome with any sequencing technology should work):


Concatenate the paired ends (to save space, the reads could instead be piped to mash, gzipped or not, by specifying - to read from standard input):

cat reads_1.fastq reads_2.fastq > reads.fastq
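The piped alternative mentioned above might look like the following. This is a sketch under assumptions: the wrapper function is hypothetical, and -o is used to name the output because standard input has no file name from which to derive one. The sketching options mirror the ones used in the next step.

```shell
# Stream both read files into mash instead of writing reads.fastq to disk.
# Assumption: -o sets the output prefix, so the sketch is written to <prefix>.msh.
sketch_paired_reads() {
  # $1, $2 = paired FASTQ files; $3 = output prefix
  cat "$1" "$2" | mash sketch -m 2 -k 16 -s 400 -o "$3" -
}

# Usage (requires mash on PATH):
# sketch_paired_reads reads_1.fastq reads_2.fastq reads.fastq
```

Using reads.fastq as the prefix keeps the sketch name, reads.fastq.msh, the same as in the file-based workflow.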

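Sketch the reads, using -m 2 to improve results by ignoring single-copy k-mers, which are more likely to be erroneous. The -k 16 and -s 400 options set the k-mer size and sketch size; the k-mer size must match that of the reference sketches: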

mash sketch -m 2 -k 16 -s 400 reads.fastq

Run mash dist with the RefSeq archive as the reference and the read sketch as the query:

mash dist RefSeqSketches.msh reads.fastq.msh > distances.tab

Sort the results to see the top hits and their p-values:

sort -gk3 distances.tab | head
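To look at p-values and distances together, the sorted list can first be restricted to hits below a p-value cutoff. The following is a minimal sketch; the function name and the 1e-10 cutoff in the usage line are illustrative assumptions, not recommended values.

```shell
# Keep hits at or below a p-value cutoff, sorted by ascending distance.
significant_hits() {
  # $1 = mash dist output file, $2 = maximum p-value, $3 = number of hits (default 10)
  awk -F '\t' -v max_p="$2" '$4 + 0 <= max_p + 0' "$1" \
    | sort -t "$(printf '\t')" -gk3,3 \
    | head -n "${3:-10}"
}

# Usage:
# significant_hits distances.tab 1e-10 5
```

As in the command above, sort -g is used so that distances and p-values written in scientific notation order correctly.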

Building a custom RefSeq database

To create the RefSeq Mash database, genomes were downloaded from NCBI (ftp.ncbi.nlm.nih.gov/refseq/release/complete; the "genomic" FASTA sequence and GenBank annotation files), and the refseqCollate utility was used to collate contigs and chromosomes into one FASTA file per genome. Groups of these files were sketched in parallel, and the resulting sketches were then combined with mash paste. The same process could be repeated to build more current or custom databases.