RNA performs a variety of functions in cells, helping with everything from regulating genes to building proteins. In recent years, it has become clear that chemical modifications to RNA help guide these functions, but only a handful of these modifications have been identified in plants.
On July 24, Brian Gregory, an associate professor in Penn’s Biology Department in the School of Arts & Sciences, and collaborators received a $2,022,004 award from the National Science Foundation to identify and infer the functional significance of dozens of different types of RNA modifications in 15 diverse model and crop species. Resources developed by the project will make it easier for plant scientists to use and expand upon the discoveries. The project also places a strong focus on building undergraduate curricula teaching biology as a data-driven science.
The potential of the data generated by the project is vast, Gregory says. “Hopefully, this large-scale resource will allow us and others to focus on the RNA modification sites that are truly important to crop plant stress responses, in turn allowing us to utilize the knowledge for future crop improvement.”
Gregory is joined in this effort by three project co-leaders: Andrew Nelson, an assistant professor at the Boyce Thompson Institute (BTI), the institution leading the project; Rebecca Murphy, an associate professor of biology at Centenary College of Louisiana; and Eric Lyons, an associate professor of plant sciences at the University of Arizona.
The first step will be led by Nelson at BTI. His team will map more than a petabyte of publicly available RNA sequence data from at least 15 different species, including important crops like corn, rice, wheat, and cotton, back to their respective genomes. For perspective, a petabyte is approximately the same amount of data it would take to stream a playlist of music for 2,600 years.
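The streaming comparison can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a decimal petabyte (10^15 bytes) and a 365.25-day year, neither of which the comparison specifies, and the function name is purely illustrative. Under those assumptions, the implied streaming rate comes out a bit under 100 kilobits per second, which is a plausible rate for compressed audio.

```python
# Back-of-the-envelope check: what streaming bitrate makes 1 PB last 2,600 years?
# Assumptions (not stated in the article): decimal petabyte, 365.25-day year.
PETABYTE_BYTES = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def implied_bitrate_kbps(total_bytes, years):
    """Return the constant bitrate (kilobits/second) that consumes
    total_bytes over the given number of years of nonstop streaming."""
    seconds = years * SECONDS_PER_YEAR
    return total_bytes * 8 / seconds / 1000


if __name__ == "__main__":
    rate = implied_bitrate_kbps(PETABYTE_BYTES, 2_600)
    print(f"Implied streaming rate: {rate:.1f} kbps")
```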
“The amount of publicly available RNA sequencing data for these 15 plants has tripled in the last two years,” Nelson says. “It’s an incredible resource.”
After the data are processed, Gregory’s team will run them through two different algorithms. The first, called HAMR, was developed by the Gregory lab. HAMR capitalizes on flaws in RNA sequencing technologies, and can identify up to 45 different modifications based on the pattern of mistakes. The second algorithm, called PEA, identifies two important RNA marks that HAMR cannot detect.
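The general idea of reading modifications out of sequencing mistakes can be illustrated with a toy example. The sketch below is not HAMR itself: it is a simplified stand-in that only flags positions where reads disagree with the reference genome more often than a fixed background error rate would explain, using a one-sided binomial test. The function names, the 1% error rate, the significance cutoff, and the read counts are all hypothetical. A real tool would also need to rule out genetic variants and, as described above, use the pattern of substituted bases to suggest which modification is present.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flag_candidate_sites(pileup, reference, error_rate=0.01, alpha=1e-3):
    """Flag positions whose mismatch count is unlikely under sequencing error alone.

    pileup:    {position: {base: number of reads showing that base}}
    reference: {position: reference base}
    error_rate and alpha are illustrative defaults, not values from any real tool.
    """
    flagged = []
    for pos, counts in pileup.items():
        depth = sum(counts.values())
        mismatches = depth - counts.get(reference[pos], 0)
        if depth and binomial_tail(mismatches, depth, error_rate) < alpha:
            flagged.append((pos, mismatches, depth))
    return flagged


# Made-up read counts for two positions, both with reference base "A".
toy_pileup = {
    101: {"A": 98, "G": 2},            # 2% mismatches: consistent with noise
    102: {"A": 60, "G": 25, "T": 15},  # 40% mismatches: far above background
}
toy_reference = {101: "A", 102: "A"}

if __name__ == "__main__":
    print(flag_candidate_sites(toy_pileup, toy_reference))  # [(102, 40, 100)]
```

In this toy input, only position 102 is flagged, because 40 mismatches in 100 reads is far more than a 1% error rate would produce by chance.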
“Our computational approach allows us to speed up the process of identifying RNA modifications transcriptome-wide. It gives us a very broad and speedy look at this important layer of post-transcriptional regulation,” Gregory says.
Once the modifications have been identified, Nelson will develop a pipeline for identifying the context in which they occur. Do the RNA modifications show up only in roots? Are they present on the same gene in many related species? Do certain genes get modified by a specific mark only under drought conditions? By answering questions like these, he hopes to identify specific RNA modifications that underlie critical cellular processes.
All of these data, as well as the workflow used to process them, will be made available to scientists and the public. This effort, along with additional data analysis and management, will be headed by Lyons.
Lyons explains, “We are going to release our data as a curated list that researchers can use to generate hypotheses. In addition, we will be releasing our code and workflows for others to replicate and reuse our work. One of the key challenges of this project will be to process approximately 1 petabyte of raw data. The data processing systems, which will use a combination of local and national cyberinfrastructure resources such as CyVerse and XSEDE, will be useful for others wanting to process biological data on scales rarely reached by individual research groups.”
“If these RNA modifications have the impact that we think they will,” explains Nelson, “researchers will be able to do some very targeted gene editing in their favorite species and potentially make more stress-tolerant crops, which is becoming increasingly important because of the effects of climate change.”
Undergraduate involvement will be a key element of the project. Murphy will introduce students to bioinformatics, RNA sequencing, and genomics through coursework at her primarily undergraduate institution. During the summer, a number of these students will travel to BTI to participate in immersive bioinformatics training as well as in vivo biomolecular work.
“Students will be able to hone their computational and data analysis skills while making real contributions to cutting-edge science,” says Murphy.
Teaching coding skills to undergraduates is imperative, Nelson adds: “Bioinformatics used to play a supporting role in plant biology. Now it is actually driving much of the discovery.”
Gregory stresses that collaborative funding opportunities such as those offered by the NSF make ambitious projects like this practical, adding, “This project will hopefully uncover the modifications we should be focusing on from a crop-improvement perspective. More generally, our project serves as a model for large-scale data reanalysis projects that should be a major focus in the future given the incredible data resources that are already available for reanalysis.”