Software for genomic discovery
Biology is a highly data-intensive discipline, particularly in fields that use genomics technologies. Generating large-scale experimental datasets is no longer the challenge; the obstacle is now the analysis and interpretation of the data. This requires computational advances to handle the scale and complexity of the data and to differentiate biological signals from experimental noise. We craft open-source software that simplifies complex analyses and enables hypothesis testing in genomics.
bedtools
Source
Documentation
Manuscript
Collectively, the bedtools utilities are a Swiss Army knife of tools for a wide range of genomics analysis tasks. The most widely used tools enable genome arithmetic, that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF, and VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
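To make "set theory on the genome" concrete, here is a minimal Python sketch of two interval operations, merge and intersect, chained together. It is an illustration of the idea only, not bedtools code: the function names, the tuple layout, and the example coordinates are assumptions made for this sketch, and the nested loop in `intersect` favors clarity over speed.

```python
from typing import List, Tuple

# (chrom, start, end) with 0-based, half-open coordinates, as in BED.
Interval = Tuple[str, int, int]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or directly adjacent intervals on each chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping portion of every pair of intervals from a and b."""
    overlaps: List[Interval] = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # a non-empty overlap
                overlaps.append((chrom_a, start, end))
    return sorted(overlaps)


# Hypothetical inputs: gene models and peaks.
genes = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
peaks = [("chr1", 180, 250), ("chr2", 70, 90)]

# Combine two simple operations: merge the genes, then intersect with the peaks.
print(intersect(merge(genes), peaks))
```

Each function does one simple thing; the analysis comes from composing them, which mirrors how individual bedtools operations are combined on the command line.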
mosdepth
Source
Manuscript
Fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing. mosdepth can output:
- per-base depth about 2x as fast as samtools depth (about 25 minutes of CPU time for a 30X genome).
- mean per-window depth given a window size, as would be used for CNV calling.
- the mean depth per region given a BED file of regions.
- the mean or median per-region cumulative coverage histogram given a window size.
- a distribution of proportion of bases covered at or above a given threshold for each chromosome and genome-wide.
- quantized output that merges adjacent bases as long as they fall in the same coverage bin, e.g. (10-20).
- threshold output to indicate how many bases in each region are covered at the given thresholds.
- a summary of mean depths per chromosome and within specified regions per chromosome.
- a d4 file (better than bigwig).
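The quantized output described in the list above can be sketched in a few lines of Python. This is an illustration of the merging idea, not mosdepth's implementation: the function name, the `lo:hi` label format, and the assumption that the first cutoff is 0 and depths are non-negative are all choices made for this sketch.

```python
from bisect import bisect_right
from typing import List, Tuple


def quantize(depths: List[int], cutoffs: List[int]) -> List[Tuple[int, int, str]]:
    """Merge adjacent bases whose depth falls in the same coverage bin.

    depths  : per-base depth for consecutive positions, starting at position 0
    cutoffs : sorted lower bin edges; assumed to start at 0 (hypothetical input)
    returns : (start, end, bin_label) runs with half-open coordinates
    """

    def bin_label(depth: int) -> str:
        i = bisect_right(cutoffs, depth) - 1  # index of the bin's lower edge
        lower = cutoffs[i]
        upper = cutoffs[i + 1] if i + 1 < len(cutoffs) else "inf"
        return f"{lower}:{upper}"

    runs: List[Tuple[int, int, str]] = []
    for pos, depth in enumerate(depths):
        label = bin_label(depth)
        if runs and runs[-1][2] == label:
            runs[-1] = (runs[-1][0], pos + 1, label)  # extend the current run
        else:
            runs.append((pos, pos + 1, label))  # start a new run
    return runs


per_base = [0, 0, 3, 5, 12, 15, 11, 25]
print(quantize(per_base, [0, 1, 10, 20]))
```

Eight per-base values collapse into four runs here, which is why quantized output is so much smaller than per-base output for regions of steady coverage.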
cyvcf2
Source
cyvcf2 is a Cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files.
covviz
Source
Demo
The aim of covviz is to highlight regions of significant (passing the user's z-score threshold) and sustained (beyond a user-specified distance) deviation of coverage depth from the majority of samples. Significance is determined using z-scores for all samples at all points using median absolute deviation. In order for regions to be highlighted, points must be significant consecutively throughout a user-specified distance.
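The two conditions above, significant and sustained, can be sketched as follows. This is a minimal illustration under stated assumptions, not covviz's code: the 0.6745 scaling is the conventional constant for a MAD-based (modified) z-score and may not match what covviz uses, treating a zero MAD as "no signal" is a simplification, and all names and example values are hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple


def robust_z_scores(values: List[float]) -> List[float]:
    """Z-scores based on the median and the median absolute deviation (MAD)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Simplification for this sketch: no spread means nothing is flagged.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def significant_regions(
    positions: List[int],
    depths_by_sample: Dict[str, List[float]],
    sample: str,
    z_threshold: float,
    min_distance: int,
) -> List[Tuple[int, int]]:
    """Find runs of consecutive significant points spanning at least min_distance."""
    names = list(depths_by_sample)
    target = names.index(sample)
    regions: List[Tuple[int, int]] = []
    run_start = None
    last_pos = None
    for i, pos in enumerate(positions):
        # Score this point against all samples, then look at the target sample.
        z = robust_z_scores([depths_by_sample[n][i] for n in names])[target]
        if abs(z) >= z_threshold:
            if run_start is None:
                run_start = pos
            last_pos = pos
        else:
            if run_start is not None and last_pos - run_start >= min_distance:
                regions.append((run_start, last_pos))
            run_start = None
    if run_start is not None and last_pos - run_start >= min_distance:
        regions.append((run_start, last_pos))
    return regions


positions = [0, 1000, 2000, 3000, 4000, 5000]
depths = {
    "A": [1.0, 1.0, 1.1, 0.9, 1.0, 1.0],
    "B": [1.1, 0.9, 1.0, 1.0, 1.1, 0.9],
    "C": [0.9, 1.1, 0.9, 1.1, 0.9, 1.1],
    "D": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "E": [1.0, 2.0, 2.0, 2.0, 1.0, 2.0],
}
print(significant_regions(positions, depths, "E", z_threshold=3.5, min_distance=2000))
```

Sample E deviates at 1000 through 3000 and again at 5000, but only the first stretch is reported, because the isolated point at 5000 does not persist over the required distance.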
peddy
Source
Manuscript
peddy is a Python library for querying, QC'ing, and manipulating pedigree files.
somalier
Source
Manuscript
Extract informative sites, evaluate relatedness, and perform quality control on BAM, CRAM, BCF, VCF, and GVCF. somalier makes it easy to check any number of samples for identity directly from the alignments.
slivar
Source
Manuscript
slivar is a set of command-line tools that enables rapid querying and filtering of VCF files. It facilitates operations on trios and groups and allows arbitrary expressions using simple JavaScript.
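As an illustration of what an expression over a trio might check, here is a small Python sketch of a de novo filter. It is written in Python rather than the JavaScript that slivar evaluates, and it does not use slivar's own field names or syntax: the `Sample` class, its attributes, and the thresholds are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Per-sample genotype summary for one variant (hypothetical field names)."""
    alts: int   # number of alternate alleles: 0 hom-ref, 1 het, 2 hom-alt
    depth: int  # read depth at the site
    gq: int     # genotype quality


def is_de_novo(kid: Sample, mom: Sample, dad: Sample,
               min_depth: int = 10, min_gq: int = 20) -> bool:
    """True when the child is heterozygous and both parents are confidently hom-ref."""
    trio = (kid, mom, dad)
    well_called = all(s.depth >= min_depth and s.gq >= min_gq for s in trio)
    return well_called and kid.alts == 1 and mom.alts == 0 and dad.alts == 0


print(is_de_novo(Sample(1, 30, 99), Sample(0, 28, 99), Sample(0, 31, 99)))
```

The same predicate would be applied to every variant and every trio in the file, keeping only the variants for which it returns true.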
ggd
Documentation
Manuscript
Source
Search and install genomic data packages. Build and check new ggd data packages. ggd provides easy access to processed genomic data. It removes the difficulties and complexities of finding and processing the data sets and annotations germane to your experiments and/or analyses. You can quickly and easily search for and install data packages using ggd. ggd also offers tools to easily create and contribute data packages to ggd.
samplot
Source
samplot is a command-line tool for rapid, multi-sample structural variant visualization. samplot takes SV coordinates and BAM files and produces high-quality images that highlight any alignment and depth signals that substantiate the SV.
indexcov
Source
Manuscript
Crazy fast genome coverage estimates! The BAM and CRAM formats provide a supplementary linear index that facilitates rapid access to sequence alignments in arbitrary genomic regions. Comparing consecutive entries in a BAM or CRAM index allows one to infer the number of alignment records per genomic region for use as an effective proxy of sequence depth in each genomic region. Based on these properties, we have developed indexcov, an efficient estimator of whole-genome sequencing coverage to rapidly identify samples with aberrant coverage profiles, reveal large-scale chromosomal anomalies, recognize potential batch effects, and infer the sex of a sample.
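The idea of comparing consecutive index entries can be sketched as below. This is a simplified illustration, not indexcov's implementation: the offsets are treated as plain integers, whereas real BAM index entries are virtual file offsets that would need decoding, and the function names, the diploid baseline, and the example numbers are assumptions made for this sketch.

```python
from statistics import median
from typing import List


def tile_sizes(offsets: List[int]) -> List[int]:
    """Differences between consecutive index entries, one value per genomic tile.

    A larger difference means more alignment data was stored for that tile,
    which serves as a proxy for sequencing depth.
    """
    return [nxt - cur for cur, nxt in zip(offsets, offsets[1:])]


def copy_number(chrom_sizes: List[int], baseline_sizes: List[int],
                ploidy: int = 2) -> float:
    """Estimate chromosome copies relative to a baseline assumed to be diploid."""
    return ploidy * median(chrom_sizes) / median(baseline_sizes)


# Hypothetical index entries for an autosome and for chromosome X.
autosome_offsets = [0, 100, 205, 300, 398, 500]
x_offsets = [0, 50, 102, 149, 200]

x_copies = copy_number(tile_sizes(x_offsets), tile_sizes(autosome_offsets))
print(round(x_copies))  # one X copy suggests a male sample in this toy example
```

Because only the index is read, never the alignments themselves, an estimate like this costs a tiny fraction of what a full depth calculation would.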
STRling
Source
Documentation
Manuscript
STRling (pronounced like "sterling") is a method to detect large STR expansions from short-read sequencing data. It is capable of detecting novel STR expansions, that is, expansions where there is no STR in the reference genome at that position (or a different repeat unit from what is in the reference). It can also detect STR expansions that are annotated in the reference genome. STRling uses k-mer counting to recover mis-mapped STR reads. It then uses soft-clipped reads to precisely discover the position of the STR expansion in the reference genome.
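One way k-mer counting can reveal that a read is mostly STR sequence is sketched below. This is a generic illustration and should not be read as STRling's actual algorithm: the function names, the use of the smallest rotation as a canonical repeat unit, and the example read are assumptions made for this sketch, and reverse complements are ignored for brevity.

```python
from collections import Counter
from typing import Optional, Tuple


def canonical_rotation(kmer: str) -> str:
    """Pick the alphabetically smallest rotation so CAG, AGC and GCA count together."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))


def dominant_repeat(read: str, k: int) -> Tuple[Optional[str], float]:
    """Return the most common repeat unit of length k and the fraction of k-mers it explains."""
    n_windows = len(read) - k + 1
    if n_windows <= 0:
        return None, 0.0
    counts = Counter(canonical_rotation(read[i:i + k]) for i in range(n_windows))
    unit, count = counts.most_common(1)[0]
    return unit, count / n_windows


# Hypothetical read: ten CAG repeats followed by non-repetitive sequence.
read = "CAG" * 10 + "TTGACC"
unit, fraction = dominant_repeat(read, k=3)
print(unit, round(fraction, 2))
```

A read dominated by a single repeat unit in this way is a candidate STR read no matter where the aligner placed it, which is what makes mis-mapped reads recoverable.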
LUMPY
Source
Manuscript
LUMPY is a novel and general probabilistic SV discovery framework that naturally integrates multiple SV detection signals, including those generated from read alignments or prior evidence, and that can readily adapt to any additional source of evidence that may become available with future technological advances.
vcfanno
Source
Documentation
Manuscript
vcfanno annotates a VCF with any number of sorted and tabixed input BED, BAM, and VCF files in parallel. It does this by finding overlaps as it streams over the data and applying user-defined operations on the overlapping annotations.
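Finding overlaps while streaming over sorted data can be sketched as a single pass that keeps only the annotations that may still overlap the current query. This is a minimal single-chromosome illustration, not vcfanno's implementation: the function name, the tuple layouts, and the use of Python's built-in `max` as the user-defined operation are assumptions made for this sketch.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Query = Tuple[int, int]              # (start, end), half-open
Annotation = Tuple[int, int, float]  # (start, end, value), half-open


def annotate(
    queries: Iterable[Query],
    annotations: Iterable[Annotation],
    op: Callable[[List[float]], float],
) -> List[Tuple[int, int, Optional[float]]]:
    """Stream two start-sorted inputs once and reduce overlapping values with op."""
    results: List[Tuple[int, int, Optional[float]]] = []
    active: List[Annotation] = []
    ann_iter = iter(annotations)
    pending = next(ann_iter, None)
    for q_start, q_end in queries:
        # Pull in every annotation that starts before this query ends.
        while pending is not None and pending[0] < q_end:
            active.append(pending)
            pending = next(ann_iter, None)
        # Annotations ending at or before this query's start can never match again.
        active = [a for a in active if a[1] > q_start]
        hits = [a[2] for a in active if a[0] < q_end and a[1] > q_start]
        results.append((q_start, q_end, op(hits) if hits else None))
    return results


variants = [(100, 101), (250, 251), (400, 401)]
frequencies = [(50, 150, 0.01), (90, 120, 0.05), (240, 260, 0.2)]
print(annotate(variants, frequencies, max))
```

Because both inputs are sorted, each annotation is read once and discarded as soon as the stream has moved past it, so memory use stays small regardless of file size.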
d4
Source
Manuscript
We sought to improve on existing formats such as BigWig and compressed BED files by creating the Dense Depth Data Dump (D4) format and tool suite. The D4 format is adaptive in that it profiles a random sample of aligned sequence depth from the input BAM or CRAM file to determine an optimal encoding that minimizes file size, while also enabling fast data access. We show that D4 uses less disk space for both RNA-Seq and whole-genome sequencing and offers 3- to 440-fold speed improvements over existing formats for random access, aggregation, and summarization, enabling scalable downstream analyses that would otherwise be intractable.
seqcover
Source
seqcover is a tool for viewing and evaluating depth-of-coverage with the following aims:
- show a global view where it's easy to see problematic samples and genes
- offer an interactive gene-wise view to explore coverage characteristics of individual samples within each gene
- not require a server (single html page)
- be responsive for up to 20 samples * 200 genes and be useful for a single sample
- highlight outlier samples based on any number of (summarized) background samples