GEMINI is a software framework for exploring and prioritizing genetic variation in the context of human disease. This post describes our software development efforts to extend GEMINI to other species and to current and future builds of the human genome.
When I started my lab at the University of Virginia in 2011, I had the great fortune to start a collaboration with Pat Concannon (now at the University of Florida) to study the genetic basis of hypersensitivity to ionizing radiation (IR). Building on years of effort by Pat and his longtime collaborator Richard Gatti, we set out to identify new genes underlying IR hypersensitivity among patients who were IR hypersensitive, yet had no loss-of-function variants in known IR sensitivity genes such as ATM or NBN. The basic plan was to exome sequence each of ~90 such patients, find the causal variants in each, and in so doing, find new DNA damage repair genes. Simple, right?
In hindsight, what we quickly found is now obvious: this is not simple at all. We started out with ad hoc scripts written by me and Uma Paila, the first postdoctoral scientist in my lab and partner in crime in conceiving the early ideas behind GEMINI. Because of an unfocused analysis approach that can be summarized as “VCF files plus a bunch of bedtools commands”, our efforts were scattered, difficult to reproduce, and therefore rather unproductive initially. I quickly realized that we desperately needed a standardized framework to integrate genetic variation with the wealth of genome annotations and reference databases that are so crucial for prioritizing genes and variants in the context of disease (e.g., imagine rare disease research without the ExAC or ClinVar resources!).
This now obvious insight led to the creation of our GEMINI software in 2011 (manuscript). As outlined in the figure below, by placing genetic variants, sample phenotypes and genotypes, as well as genome annotations into an integrated database framework, GEMINI provides a simple, flexible, and powerful system for exploring genetic variation in the context of disease.
We ended up using the GEMINI framework to identify our first new gene underlying IR sensitivity (MTPAP; manuscript). We have been working pretty much non-stop on GEMINI since 2012 and it has benefited from substantial contributions from our user community. Notable effort comes from Uma Paila, Brad Chapman, and Rory Kirchner (all authors on the original paper), as well as Bjorn Gruning, Daniel Gaston, and Colby Chiang. We have also benefited from tremendously valuable feedback and contributions from Jessica Chong and the entire University of Washington Center for Mendelian Genomics, which uses GEMINI as part of the analysis pipeline to solve Mendelian disorders.
Last June, I moved my lab to the University of Utah to be a part of the USTAR Center for Genetic Discovery, along with Mark Yandell, Lynn Jorde and Gabor Marth. I was incredibly fortunate to convince Brent Pedersen to join the lab. Brent has done yeoman’s work to overhaul GEMINI to be faster, leaner, and more powerful. Heretofore, however, GEMINI has been strictly limited to build 37 of the human genome. That is, it solely supports human research. Worse still, it is constrained to a single build of the human genome.
The first exciting advance that Brent has developed is a very powerful new tool called vcfanno, which allows one to annotate a VCF file with one or more annotations from many diverse annotation files in BED, VCF, GFF, or BAM format. One specifies a simple configuration file that defines which annotation files should be used and which fields or operations should be used from those annotation files to yield new annotations in the INFO field of the resulting VCF file (for more details about this process, see the vcfanno documentation and our previous blog post). For example, the following figure illustrates how vcfanno converts a “naked” VCF file to a VCF where each variant is fully dressed with the alternate allele frequency (AF) and the number of heterozygotes (AC_Het) from the ExAC VCF, as well as the mean GERP score observed for bases overlapping each variant. Vcfanno automatically updates the header and packs the new annotations into the INFO fields of the resulting VCF.
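To make the idea of "dressing" a variant concrete, here is a toy Python sketch of the kind of transformation described above. It is not vcfanno's implementation or its API. The lookup tables, the `annotate` helper, and every value in them are hypothetical and made up for illustration. It copies AF and AC_Het from an ExAC-like lookup and computes a mean score over the reference bases the variant spans.

```python
from statistics import mean

# Hypothetical ExAC-like lookup keyed by (chrom, pos, ref, alt); values are made up.
EXAC = {("1", 100, "ATG", "A"): {"AF": 0.0012, "AC_Het": 14}}

# Hypothetical per-base conservation scores keyed by (chrom, 1-based position).
GERP = {("1", 100): 2.0, ("1", 101): 4.0, ("1", 102): 3.0}


def annotate(chrom, pos, ref, alt, info="."):
    """Return an INFO string extended with AF, AC_Het, and gerp_mean when available."""
    # Copy fields from the matching ExAC-like record, if one exists.
    new = dict(EXAC.get((chrom, pos, ref, alt), {}))

    # Average the scores of every reference base the variant overlaps.
    scores = [GERP[(chrom, p)] for p in range(pos, pos + len(ref)) if (chrom, p) in GERP]
    if scores:
        new["gerp_mean"] = mean(scores)

    if not new:
        return info  # nothing to add; leave the INFO field untouched

    added = ";".join(f"{key}={value}" for key, value in new.items())
    # A "naked" variant has "." as its INFO field, so replace it rather than append.
    return added if info in (".", "") else f"{info};{added}"


annotated = annotate("1", 100, "ATG", "A")
print(annotated)  # AF=0.0012;AC_Het=14;gerp_mean=3.0
```

A real annotation run also has to add matching INFO definitions to the VCF header, which this sketch leaves out.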
More recently, we have created vcf2db, a new tool that takes advantage of SQLAlchemy to allow a GEMINI-compatible database to be created DIRECTLY FROM A VCF FILE! Why am I so excited about this? Well, the beauty is that so long as one can annotate one’s VCF with the relevant annotations using vcfanno, one can create a GEMINI database for subsequent analysis. The vcf2db script uses the data types defined for the INFO fields in the VCF file’s header to infer what the corresponding data types should be for the annotation columns that will be created in the database file. In other words, the VCF header defines, in part, the database schema!
As a quick example, consider the three attributes that vcfanno places in the INFO field in the example figure above: the ExAC alternate allele frequency, the ExAC heterozygote count, and gerp_mean. Vcfanno also defines those attributes in the VCF header as a Float, Integer, and Float, respectively. The vcf2db script will therefore create three new columns in the GEMINI database’s variants table, one per attribute (gerp_mean among them), using the datatypes that are appropriate to the database backend that you are using. This is where SQLAlchemy comes in: currently, GEMINI only reads a SQLite database, but in the future, it will work with MySQL, PostgreSQL, and other databases supported by SQLAlchemy.
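The header-to-schema idea can be sketched in a few lines of Python. This is a minimal illustration, not vcf2db's actual code. It uses the standard library's sqlite3 module instead of SQLAlchemy so that it runs without extra dependencies. The type mapping, the helper names, the header lines (including the use of AF and AC_Het as INFO IDs), and the core column names are all assumptions made for the example.

```python
import re
import sqlite3

# Assumed mapping from VCF INFO types to SQLite column types (illustrative only).
VCF_TO_SQL = {"Float": "REAL", "Integer": "INTEGER", "Flag": "INTEGER",
              "String": "TEXT", "Character": "TEXT"}

INFO_RE = re.compile(r"##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,>]+)")


def info_columns(header_lines):
    """Return (column name, SQL type) pairs inferred from ##INFO header lines."""
    columns = []
    for line in header_lines:
        match = INFO_RE.match(line)
        if not match:
            continue  # not an INFO definition
        name, number, vcf_type = match.groups()
        # Single-valued fields and flags keep their type; anything else is stored as text.
        sql_type = VCF_TO_SQL.get(vcf_type, "TEXT") if number in ("0", "1") else "TEXT"
        columns.append((name, sql_type))
    return columns


def create_variants_table(conn, columns):
    """Create a variants table with a few core columns plus one column per INFO field."""
    core = ['"chrom" TEXT', '"pos" INTEGER', '"ref" TEXT', '"alt" TEXT']
    extra = [f'"{name}" {sql_type}' for name, sql_type in columns]
    conn.execute(f"CREATE TABLE variants ({', '.join(core + extra)})")


# Hypothetical header lines for the three annotations in the example.
header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=AF,Number=1,Type=Float,Description="ExAC alternate allele frequency">',
    '##INFO=<ID=AC_Het,Number=1,Type=Integer,Description="ExAC heterozygote count">',
    '##INFO=<ID=gerp_mean,Number=1,Type=Float,Description="Mean GERP score">',
]

conn = sqlite3.connect(":memory:")
columns = info_columns(header)
create_variants_table(conn, columns)

# Inspect the resulting schema: each row is (cid, name, type, notnull, default, pk).
schema = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(variants)")]
print(schema)
```

With SQLAlchemy, the same mapping would point at backend-neutral column types instead of SQLite type names, which is what makes it possible to target other database backends later.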
And here is the really exciting part: since vcf2db doesn’t care about how the VCF was annotated (so long as it meets the VCF specification), it can create GEMINI databases for ANY SPECIES OR ANY GENOME BUILD! Of course, not all batteries are included, since you need to collect the relevant annotations for your species or build of interest, but once you have done that, you are off and running!
In summary, by decoupling the annotation of a VCF file from the loading of the GEMINI database, we are very close to a new future for GEMINI that supports both non-human species and researchers using GRCh38 (and beyond) of the human genome. Our focus over the coming months will be to simplify and integrate all of this flexibility directly into GEMINI. For example, we plan to provide vcfanno configuration files that are pre-populated for different species and builds. This will simplify the hardest aspect of all of this: finding, downloading, and standardizing all of the annotations that are germane to one’s analysis. Stay tuned, as we are very excited about the future of GEMINI. Now to find funding to further support its future…