National Institutes of Health Award to Support Integrated Pathogen and Vector Genomics Information Resources


A team led by scientists at the University of Pennsylvania and the University of Georgia provides thousands of researchers around the world with access to the Eukaryotic Pathogen Genomics Database (EuPathDB), a collection of resources for analyzing large-scale datasets associated with microbial pathogens. These include the parasites responsible for malaria, sleeping sickness, and toxoplasmosis; the fungi responsible for thrush, aspergillosis, and Valley Fever; and the agents of many other important diseases. In parallel, a team led by investigators at the University of Notre Dame has been responsible for similar resources covering invertebrate vectors of disease (VectorBase), including the mosquitoes transmitting malaria, Zika, and yellow fever, the ticks responsible for Lyme disease and Rocky Mountain Spotted Fever, and others.

To ensure that this important work continues, the National Institute of Allergy and Infectious Diseases, a part of the National Institutes of Health, has awarded a new contract, worth up to $7.2 million in 2019-2020, to integrate these resources. The five‐year award for this project, now rebranded VEuPathDB (The Eukaryotic Pathogen, Host & Vector Genomics Resource), could total as much as $38.4 million if all associated options are exercised.

This award is based at Penn and directed by David S. Roos, E. Otis Kendall Professor of Biology in the School of Arts and Sciences. Key subcontracts include the University of Georgia (Joint PI Jessica C. Kissinger, Distinguished Professor of Genetics and Bioinformatics in the Franklin College of Arts and Sciences) and the University of Notre Dame (Joint PI Mary Ann McDowell, Associate Professor of Biological Sciences at the Eck Institute for Global Health). Additional co-investigators include Professors Christian Stoeckert of Penn’s Perelman School of Medicine, Mark Caddick of the University of Liverpool, George K. Christophides of Imperial College London, and Paul Flicek, Associate Director of the EMBL-EBI (European Bioinformatics Institute).

In recent years, advances in technology have dramatically reduced the cost and effort required for genome sequencing and related analyses. It is now routine for biologists to collect immense datasets: not just individual genome sequences, but also patterns of expression for every gene encoded by the genome and population-level analyses of genome diversity for thousands of individuals. The patterns revealed by such “Big Data” provide insight into important diseases, permit the development of diagnostic methods, and define drug and vaccine targets. But to be useful, these immense datasets must be sensibly organized and made conveniently accessible to researchers worldwide. The integrated VEuPathDB database hosts data on thousands of genomes, representing hundreds of species, along with extensive information on isolate provenance, gene function, and more.

This project began more than 20 years ago, with support from the Burroughs Wellcome Fund. The Plasmodium Genome Database was released in 2000, providing access to the first malaria parasite genome sequences and functional genomics data. Eight Bioinformatics Resource Centers for Infectious Disease (BRCs) were funded by NIH in 2004, but over time, succeeding contracts have supported fewer, larger centers, learning from what works and ensuring cost-effective economies of scale. VEuPathDB represents the merger of EuPathDB and VectorBase. A second contract supports bacterial and viral pathogens.

Frank Collins, George & Winifred Clark Professor of Biology at the University of Notre Dame and the originator of VectorBase, sees many advantages in the new architecture. “Each contributing group brings strengths that can be leveraged to generate a single resource serving the host, vector and eukaryotic pathogen communities,” he says, “improving the efficiency of data processing, and providing enhanced functionality for viewing & interrogating genomic data.” Entomologist Dan Masiga, Principal Scientist and Head of Animal Health at the International Centre for Insect Physiology & Ecology in Nairobi, concurs: “By ensuring reliable web-based access to important genomic-scale datasets, VEuPathDB helps to provide a level playing field for international public health research, engaging talented scientists around the world.”

The new NIH contract complements Resource Grants from the Wellcome Trust to the University of Liverpool (United Kingdom) emphasizing trypanosomatid parasites and fungal pathogens, and support from the Bill & Melinda Gates Foundation designed to ensure that key epidemiological datasets are captured from major clinical studies around the world. Aggregate expenditure on these resources over more than 20 years exceeds $100 million, but the investment has proved remarkably cost-effective in expediting biological discovery and translational research worldwide: an average of 69,000 unique users from more than 100 countries access these databases every month. The impact of these resources has been immense, with more than twenty thousand citations in the scientific literature to date.

“Innumerable investigators, including my own laboratory, rely on daily access to the high-quality genomic and functional datasets made available by the VEuPathDB Project,” says Keith Gull, Professor of Molecular Microbiology at Oxford University. “Sustainable support for such resources is imperative if we are to capitalize on the promise of modern technologies for scientific discovery and translational application.” Joe Heitman, James B. Duke Professor and Chair of Molecular Genetics & Microbiology at Duke University, agrees: “Inclusion of fungal pathogens under the BRC umbrella has greatly enhanced our ability to study important human mycoses. Cross-species comparisons provide insights into the biology and pathogenesis of these fascinating organisms, which can be deadly, but can also serve as workhorses for valuable biotechnology development.”

Dyann Wirth, Richard Pearson Strong Professor of Infectious Diseases at Harvard’s T.H. Chan School of Public Health, is excited to see the integration of genomic and epidemiology resources. “Multifactorial analysis of integrated host, vector and pathogen datasets will be increasingly important for addressing complex public health problems, such as the challenge of malaria eradication.” And Steve Kern, Deputy Director for Quantitative Sciences at the Bill & Melinda Gates Foundation, is convinced that “robust, reliable, and sustainable resources integrating diverse datatypes will be critical for ensuring that important datasets are Findable, Accessible, Interoperable, and Reusable [FAIR], facilitating evidence-based decision making in international public health.”

This project is supported in part by Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under contract 75N93019C00077.