This post describes how to leverage functionality now availability in GEMINI to study genetic variation from any build of the human genome, as well as any other species.
As we’ve described in a previous post, we have been working on making gemini much more flexible so that researchers have much greater control over the genome annotations that are integrated into the resulting database. We have also tried to provide fine scale control over which attributes from each annotation file are added (e.g., which INFO field(s) do you want to include from the ExAC or ClinVar VCF files?).
Our solution depends upon an entirely new workflow, consisting of the following steps:
vcf2dbuses the INFO fields in the VCF to drive which columns are created in the GEMINI database. In this workflow, new INFO tags are added by
vcfannofor each annotation file and attribute.
vcf2dbsteps make any assumptions about the species or genome build. Therefore, this is a general solution to variant annotation and exploration.
At this point, we have some additional work to do with helping to vet quality annotation files for various species about build, but it’s possible to use this improved workflow today to support.
Below, I’ll demonstrate its use on human (GRCh37) annotations but it
can just as easily be used for any other build (e.g., GRCh38) or non-human species.
All that is required is the researcher collect the annotation files relevant to the
species and build of interest and create a
vcfanno configuration file that dictates
the exact annotation files and attributes that are desired.
We recognize that curating annotation files is a challenge itself — we will discuss our general solution to this problem in a future post.
We assume that you have a working gemini installation. First, you’ll need to download
vcfanno binary for your system. Even if you have a previous version of
need to download the version
here. If you’re on 64-bit linux
you can grab just the binary.
You’ll also need the new loader vcf2db:
git clone https://github.com/quinlan-lab/vcf2db cd vcf2db conda install -y gcc snappy # install the C library for snappy conda install -c nlesc python-snappy=0.5 pip install -r requirements.txt
Here, we assume that you’ve already annotated with VEP or snpEff and
that vcfanno-0.0.11-beta is on your path as
You’ll then need to identify the path in which you installed the gemini
annotation files. It is a directory that contains the ExAC VCF,
the clinvar VCF, and the other annotation files used by gemini. The example
vcfanno configuration file that we use here assumes that you have already installed gemini
annotation files. You can easily add new files to this configuration file, or replace
them with the annotations relevant to your genome build or species. This merely serves
as a guide.
Here is an example of an annotation entry for the ClinVar VCF in the
rare-disease.conf configuration file we will provide to
[[annotation]] file="clinvar_20160203.tidy.vcf.gz" fields=["CLNSIG", "CLNDBN"] names=["clinvar_pathogenic", "clinvar_disease_name"] ops=["self", "self"] # convert 5 to 'pathogenic', 255 to 'unknown', etc. [[postannotation]] fields=["clinvar_pathogenic"] op="lua:clinvar_sig(clinvar_pathogenic)" name="clinvar_sig" type="String"
The first “annotation” block tells
vcfanno to extract the
CLNDBN attributes from the
clinvar_20160203.tidy.vcf.gz file. These columns will be created as INFO attributes
in the resulting VCF and renamed as
second “postannotation” block creates a
derived INFO attribute called clinvar_sig
whose value is computed by a Lua function clinvar_sig()
provided in rare-disease.lua`.
This function converts ClinVar pathogenicity codes to human readable descriptions.
In the following code block, we download the complete
vcfanno configuration and Lua files,
vcfanno on your input VCF using these configurations. Note that the
environment variable should be set to the location of the installed gemini annotation files.
You will see some error messages, but those simply occur when a variant in the VCF doesn’t overlap with an annotation in the annotation files.
wget https://raw.githubusercontent.com/brentp/vcfanno/master/example/rare-disease.conf wget https://raw.githubusercontent.com/brentp/vcfanno/master/example/rare-disease.lua GEMINI_ANNO=/path/to/gemini/annos vcfanno -p 4 -base-path $GEMINI_ANNO -lua rare-disease.lua \ rare-disease.conf $VCF \ | bgzip -c > anno.vcf.gz
Now, assuming that you have a ped file describing the samples in your study, you can load the VCF from the above annotation steps using the
vcf2db.py script. By default, this creates a SQLite database, but you may also create a MySQL or Postgres database (see the docs. Note: if you have gemini from github, you can remove the
--legacy-compression tag to get a nice speedup in both loading and querying along with a decrease in database size):
PED=/path/to/ped/file python vcf2db.py --legacy-compression anno.vcf.gz $PED my.db
At this point, the database is gemini-compatible though some of the columns are named differently. For rare disease research, a useful query is:
gemini auto_rec \ --filter "max_aaf_all < 0.005 AND impact_severity != 'LOW'" \ --columns "chrom,start,end,gene,impact,impact_severity,clinvar_pathogenic,clinvar_disease _name" \ my.db
The new loader can load about 1350 variants / second on a single-core. This means it can load an exome trio with 20K variants in about 15 seconds.
I ran a test load on a dataset we have with 16.4 million variants in 78 samples. It loads with a single core in about 3 hours and the database is half the size of the one currently produced by gemini (that took much longer with multiple CPUs to load).
The querying is also dramatically faster because of the new genotype storage format used
by the new loader and supported in gemini master branch. Running
de novo for a single trio
out of the family of the 78 runs in 10 seconds on the new database vs 47 seconds before. This
is a worst-case scenario for gemini and we’ll have further improvements here soon.
These will also translate to more modest improvements in query speed for small databases with a few samples.
Note that while we have used human and current gemini annotations as an example, this will work for any diploid organism provided that:
We are very eager to get feedback about this workflow for other species and genome builds (e.g., GRCh38). In a future post, we will demonstrate: