Software for genomic discovery

Biology is a highly data-intensive discipline, particularly in fields that use genomics technologies. Generating large-scale experimental datasets is no longer the challenge; the obstacle is now analyzing and interpreting them. This requires computational advances that handle the scale and complexity of the data and distinguish biological signal from experimental noise. We craft open-source software that simplifies complex analyses and enables hypothesis testing in genomics.

bedtools

📖 Documentation

📜 Manuscript

Collectively, the bedtools utilities are a Swiss Army knife of tools for a wide range of genomics analysis tasks. The most widely used tools enable genome arithmetic, that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF, and VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
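To make "set theory on the genome" concrete, here is a minimal pure-Python sketch of three of the operations named above (merge, intersect, complement). It is not how bedtools is implemented and does not call it; the function names are illustrative, and it assumes half-open BED-style intervals on a single chromosome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED (assumption)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or book-ended intervals into single spans."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous span instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping portion of every (a, b) pair that overlaps."""
    out = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                out.append((start, end))
    return sorted(out)


def complement(intervals: List[Interval], chrom_length: int) -> List[Interval]:
    """Return the parts of [0, chrom_length) not covered by any interval."""
    gaps = []
    cursor = 0
    for start, end in merge(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps


genes = [(100, 200), (150, 300), (500, 600)]  # toy data
peaks = [(180, 520)]                          # toy data
print(merge(genes))
print(intersect(merge(genes), peaks))
print(complement(genes, 1000))
```

Chaining the functions, as in `intersect(merge(genes), peaks)`, mirrors the idea of combining simple operations into a larger analysis. The quadratic pairwise loop in `intersect` is chosen for readability only.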

mosdepth

📜 Manuscript

Fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing. mosdepth can output…

per-base depth about 2x as fast as samtools depth, about 25 minutes of CPU time for a 30X genome.
mean per-window depth given a window size, as would be used for CNV calling.
mean per-region depth given a BED file of regions.
the mean or median per-region cumulative coverage histogram given a window size
a distribution of proportion of bases covered at or above a given threshold for each chromosome and genome-wide.
quantized output that merges adjacent bases as long as they fall in the same coverage bin, e.g. (10-20)
threshold output to indicate how many bases in each region are covered at the given thresholds.
a summary of mean depths per chromosome and within specified regions per chromosome.
a D4 file (better than BigWig)
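Two of the outputs listed above, mean per-window depth and quantized output, can be illustrated on a toy per-base depth array. This is a sketch of the concepts only, not mosdepth's implementation. The function names and the representation of bins as a sorted list of lower edges are assumptions made for the example, and depths are assumed to be no smaller than the first edge.

```python
from bisect import bisect_right
from itertools import groupby


def window_means(depths, window):
    """Mean depth of each consecutive window (the last one may be shorter)."""
    means = []
    for i in range(0, len(depths), window):
        chunk = depths[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means


def quantize(depths, edges):
    """Merge adjacent bases whose depth falls in the same bin.

    `edges` holds sorted bin lower bounds; the last bin is open-ended.
    Returns (start, end, bin_low, bin_high) with half-open coordinates
    and bin_high set to None for the open-ended bin.
    """
    runs = []
    pos = 0
    for bin_index, group in groupby(depths, key=lambda d: bisect_right(edges, d) - 1):
        length = len(list(group))
        low = edges[bin_index]
        high = edges[bin_index + 1] if bin_index + 1 < len(edges) else None
        runs.append((pos, pos + length, low, high))
        pos += length
    return runs


depths = [0, 0, 3, 5, 12, 15, 11, 30]  # toy per-base depths
print(window_means(depths, 4))
print(quantize(depths, [0, 1, 10, 20]))
```

Quantizing trades per-base resolution for a much shorter output: eight bases collapse into four runs here.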

cyvcf2

cyvcf2 is a cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files.

covviz

Demo

The aim of covviz is to highlight regions of significant (passing the user’s z-score threshold) and sustained (beyond a user-specified distance) deviation of coverage depth from the majority of samples. Significance is determined using z-scores, based on the median absolute deviation, for all samples at all points. For a region to be highlighted, its points must be consecutively significant across a user-specified distance.
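The two criteria described above, significant and sustained, can be sketched as follows. This is an illustration under stated assumptions rather than covviz's actual code: the function names are hypothetical, the 1.4826 factor that scales the median absolute deviation to a standard-deviation equivalent is a common convention assumed here, and a zero deviation is treated as "no signal".

```python
from statistics import median

MAD_TO_SD = 1.4826  # assumed scaling; makes MAD comparable to a standard deviation


def robust_z(values):
    """Z-scores across samples at one position, using median and MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [0.0 for _ in values]  # assumption: no spread means no outliers
    return [(v - med) / (MAD_TO_SD * mad) for v in values]


def sustained_regions(positions, z_scores, z_threshold, min_distance):
    """Runs of consecutive significant points spanning at least min_distance."""
    regions = []
    run_start = run_end = None
    for pos, z in zip(positions, z_scores):
        if abs(z) >= z_threshold:
            if run_start is None:
                run_start = pos
            run_end = pos
        else:
            if run_start is not None and run_end - run_start >= min_distance:
                regions.append((run_start, run_end))
            run_start = None
    # Close a run that reaches the end of the data.
    if run_start is not None and run_end - run_start >= min_distance:
        regions.append((run_start, run_end))
    return regions


# One position, five samples: the last sample has roughly double coverage.
print(robust_z([0.9, 1.0, 1.1, 1.0, 2.0]))

# One sample along a chromosome: only the 1000-3000 run is long enough.
positions = [0, 1000, 2000, 3000, 4000, 5000]
z_for_sample = [0.1, 4.0, 5.0, 4.5, 0.2, 6.0]
print(sustained_regions(positions, z_for_sample, z_threshold=3.0, min_distance=2000))
```

The isolated significant point at 5000 is dropped because it is not sustained, which is the behaviour the description calls for.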

peddy

📜 Manuscript

peddy is a python library for querying, QC’ing, and manipulating pedigree files.

somalier

📜 Manuscript

Extract informative sites, evaluate relatedness, and perform quality control on BAM, CRAM, BCF, VCF, and GVCF files. somalier makes it easy to check any number of samples for identity directly from the alignments.

slivar

📜 Manuscript

slivar is a set of command-line tools that enables rapid querying and filtering of VCF files. It facilitates operations on trios and groups and allows arbitrary expressions using simple JavaScript.

ggd

📖 Documentation

📜 Manuscript

Search and install genomic data packages. Build and check new ggd data packages. ggd provides easy access to processed genomic data. It removes the difficulty of finding and processing the datasets and annotations germane to your experiments and analyses. You can quickly search for and install data packages using ggd. ggd also offers tools to create and contribute new data packages.

samplot

samplot is a command-line tool for rapid, multi-sample structural variant visualization. samplot takes SV coordinates and BAM files and produces high-quality images that highlight any alignment and depth signals that substantiate the SV.

indexcov

📜 Manuscript

Crazy fast genome coverage estimates! The BAM and CRAM formats provide a supplementary linear index that facilitates rapid access to sequence alignments in arbitrary genomic regions. Comparing consecutive entries in a BAM or CRAM index allows one to infer the number of alignment records per genomic region, which serves as an effective proxy for sequence depth in that region. Based on these properties, we have developed indexcov, an efficient estimator of whole-genome sequencing coverage that rapidly identifies samples with aberrant coverage profiles, reveals large-scale chromosomal anomalies, recognizes potential batch effects, and infers the sex of a sample.
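The idea of comparing consecutive index entries can be sketched with toy numbers. The BAM linear index stores one file offset per 16,384-base window; the sketch below assumes those offsets have already been read and reduced to plain byte positions, which glosses over the virtual-offset encoding and compression of a real index. The function name and the median normalization are illustrative assumptions, not indexcov's implementation.

```python
from statistics import median

TILE_SIZE = 16_384  # bases per linear-index window in a BAM index


def scaled_depth(offsets):
    """Estimate relative depth per tile from consecutive file offsets.

    The amount of file between two consecutive offsets approximates how
    much alignment data falls in that tile; dividing by the median makes
    a typical tile come out near 1.0.
    """
    sizes = [later - earlier for earlier, later in zip(offsets, offsets[1:])]
    typical = median(sizes)
    if typical == 0:
        return [0.0 for _ in sizes]
    return [size / typical for size in sizes]


# Toy offsets (hypothetical): tiles 3 and 4 hold half as much data as the rest.
offsets = [0, 100, 200, 300, 350, 400, 500, 600]
for tile, depth in enumerate(scaled_depth(offsets)):
    print(tile * TILE_SIZE, (tile + 1) * TILE_SIZE, depth)
```

Because only the index is read and never the alignments themselves, an estimate like this needs a tiny fraction of the I/O of a per-base depth calculation, at the cost of 16 kb resolution.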

STRling

📖 Documentation

📜 Manuscript

STRling (pronounced like “sterling”) is a method to detect large STR expansions from short-read sequencing data. It is capable of detecting novel STR expansions, that is, expansions where the reference genome has no STR at that position (or has a different repeat unit). It can also detect STR expansions that are annotated in the reference genome. STRling uses k-mer counting to recover mis-mapped STR reads. It then uses soft-clipped reads to precisely locate the STR expansion in the reference genome.
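As a rough illustration of how k-mer counting can flag a read as STR-rich, the sketch below counts every k-mer in a read, treats rotations of a repeat unit as the same unit, and reports the share of k-mers belonging to the most common one. This is a simplified stand-in, not STRling's algorithm: the function names are hypothetical, reverse complements are ignored, and any cutoff applied to the fraction would be an assumption.

```python
from collections import Counter


def canonical_rotation(kmer):
    """Smallest rotation of a k-mer, so CAG, AGC and GCA count as one unit."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))


def repeat_fraction(read, k):
    """Most common repeat unit of length k and the fraction of k-mers it explains."""
    n_kmers = len(read) - k + 1
    if n_kmers <= 0:
        return "", 0.0
    counts = Counter(canonical_rotation(read[i:i + k]) for i in range(n_kmers))
    unit, count = counts.most_common(1)[0]
    return unit, count / n_kmers


pure_repeat = "CAG" * 10                 # toy read made entirely of CAG repeats
partial_repeat = "CAG" * 8 + "TTACGGATC"  # toy read with a non-repeat tail
print(repeat_fraction(pure_repeat, 3))
print(repeat_fraction(partial_repeat, 3))
```

A read dominated by one repeat unit scores close to 1.0 regardless of where it was mapped, which is why a count-based signal can recover STR reads that the aligner placed incorrectly.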

LUMPY

📜 Manuscript

LUMPY is a novel and general probabilistic SV discovery framework that naturally integrates multiple SV detection signals, including those generated from read alignments or prior evidence, and that can readily adapt to any additional source of evidence that may become available with future technological advances.

vcfanno

📖 Documentation

📜 Manuscript

vcfanno annotates a VCF with any number of sorted and tabixed input BED, BAM, and VCF files in parallel. It does this by finding overlaps as it streams over the data and applying user-defined operations on the overlapping annotations.
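The streaming approach described above can be sketched as a single pass over two sorted inputs. This is a simplified illustration, not vcfanno's implementation: it handles one chromosome, treats variants as single positions, runs serially, and the function name and tuple layout are assumptions for the example.

```python
def annotate(variant_positions, annotations, op):
    """Annotate sorted variant positions from sorted annotation intervals.

    `annotations` holds (start, end, value) tuples with half-open
    coordinates, sorted by start. `op` is the user-defined operation
    applied to the values of all overlapping annotations.
    """
    results = []
    active = []  # annotations that may still overlap upcoming variants
    stream = iter(annotations)
    pending = next(stream, None)
    for pos in variant_positions:
        # Pull in every annotation that starts at or before this variant.
        while pending is not None and pending[0] <= pos:
            active.append(pending)
            pending = next(stream, None)
        # Drop annotations that end before this variant; input is sorted,
        # so they cannot overlap anything later either.
        active = [a for a in active if a[1] > pos]
        values = [a[2] for a in active]
        results.append((pos, op(values) if values else None))
    return results


variants = [5, 15, 25, 40]                                # toy positions
scores = [(0, 10, 1.0), (12, 30, 2.0), (14, 20, 4.0)]     # toy annotations
print(annotate(variants, scores, max))
```

Because both inputs are sorted, each annotation is read once and discarded as soon as the stream moves past it, so memory use depends on the local overlap depth rather than on file size.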

d4

📜 Manuscript

We sought to improve on existing formats such as BigWig and compressed BED files by creating the Dense Depth Data Dump (D4) format and tool suite. The D4 format is adaptive in that it profiles a random sample of aligned sequence depth from the input BAM or CRAM file to determine an optimal encoding that minimizes file size while also enabling fast data access. We show that D4 uses less disk space for both RNA-Seq and whole-genome sequencing and offers 3- to 440-fold speed improvements over existing formats for random access, aggregation, and summarization, enabling scalable downstream analyses that would otherwise be intractable.
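One way to picture an adaptive encoding is as a choice of bit width made from a sample of depths. The sketch below assumes a two-part layout: a dense table that stores every position in a fixed number of bits, plus a sparse table for the few values that do not fit. The cost model, the `outlier_cost_bits` parameter, and the function names are hypothetical and are not taken from the D4 tool suite.

```python
import random


def choose_bit_width(depths, sample_size=1000, outlier_cost_bits=64, max_bits=16, seed=0):
    """Pick the dense-table bit width that minimizes an estimated size."""
    rng = random.Random(seed)
    if len(depths) <= sample_size:
        sample = list(depths)
    else:
        sample = rng.sample(depths, sample_size)  # profile a random sample only
    best_bits, best_cost = 0, float("inf")
    for bits in range(max_bits + 1):
        capacity = 1 << bits  # values 0 .. capacity - 1 fit in the dense table
        outliers = sum(1 for d in sample if d >= capacity)
        cost = len(sample) * bits + outliers * outlier_cost_bits
        if cost < best_cost:
            best_bits, best_cost = bits, cost
    return best_bits


def encode(depths, bits):
    """Split depths into a dense table and a sparse table of overflow values."""
    capacity = 1 << bits
    dense = [d if d < capacity else 0 for d in depths]
    sparse = {i: d for i, d in enumerate(depths) if d >= capacity}
    return dense, sparse


wgs_like = [30] * 990 + [500] * 10  # toy data: mostly ~30x with a few spikes
bits = choose_bit_width(wgs_like)
dense, sparse = encode(wgs_like, bits)
print(bits, len(sparse))
```

For the toy data, five bits cover the bulk of the values and the ten spikes are pushed to the sparse table, which is cheaper under this cost model than widening every entry to nine bits.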

seqcover

seqcover is a tool for viewing and evaluating depth-of-coverage with the following aims…

show a global view where it’s easy to see problematic samples and genes
offer an interactive gene-wise view to explore coverage characteristics of individual samples within each gene
not require a server (single html page)
be responsive for up to 20 samples * 200 genes and be useful for a single sample
highlight outlier samples based on any number of (summarized) background samples