The world’s largest repository of raw genomic sequences from wild plants, animals and fungi is missing critical data necessary to monitor and protect the Earth’s biological diversity, according to a new study.
The missing data includes the time and location at which each organism’s sample was collected, information that is needed for monitoring the genetic diversity of populations.
The research was published recently in the journal Proceedings of the National Academy of Sciences.
“A lot of money is pumped into generating these genomic data, yet most are not useful for biodiversity monitoring due to a lack of metadata,” says Michelle Gaither, an assistant professor in UCF’s Department of Biology and co-author of the study. “The lost investment from missing spatiotemporal metadata totals tens of millions of U.S. dollars and this amount will only grow.”
The repository, the Sequence Read Archive (SRA) of the International Nucleotide Sequence Database Collaboration, is the leading collection of raw genomic sequences. It contains over 600 terabytes of data from wild species of plants, animals and fungi from across the globe. Scientists continue to deposit genomic data into the SRA at an exponential rate.
“However, without time and location metadata we can’t monitor changes in genetic diversity,” Gaither says.
For the study, the researchers searched the publicly available data in the SRA. They evaluated the potential use of the SRA data for monitoring biodiversity and found that most archived genomic datasets lacked the time and spatial metadata necessary for genetic biodiversity surveillance. Only 14% of the SRA datasets contained information about when and where the organisms were sampled.
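As a rough illustration of the kind of completeness check described above, the Python sketch below counts how many records in a small list carry both a collection date and a location. It is not the researchers' actual method. The field names `collection_date` and `lat_lon`, the list of placeholder strings and the toy records are all assumptions made for this example.

```python
# Placeholder strings treated as "no value". This list is an assumption
# made for the example, not a rule taken from the study.
MISSING = {"", "missing", "not collected", "not applicable",
           "not provided", "na", "n/a", "unknown"}


def has_value(record, field):
    """Return True if the field is present and is not a placeholder."""
    value = record.get(field)
    return value is not None and str(value).strip().lower() not in MISSING


def spatiotemporal_share(records, date_field="collection_date",
                         location_field="lat_lon"):
    """Fraction of records that carry both a collection date and a location."""
    if not records:
        return 0.0
    complete = sum(
        1 for r in records
        if has_value(r, date_field) and has_value(r, location_field)
    )
    return complete / len(records)


# Toy records for illustration only; these are not real SRA entries.
records = [
    {"collection_date": "2015-06-01", "lat_lon": "21.3 N 157.8 W"},
    {"collection_date": "missing", "lat_lon": "28.6 N 81.2 W"},
    {"collection_date": "2018", "lat_lon": ""},
    {"lat_lon": "not collected"},
]

print(spatiotemporal_share(records))  # prints 0.25
```

In this toy list only the first record has both pieces of information, so the share is one in four. A real survey would also need to decide how to treat partial dates and place names without coordinates.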
The researchers followed up with a labor-intensive scouring of more than 800 datasets from wild populations for which latitude and longitude coordinates were missing in order to fill in that missing data. They even reached out to scientists who made the contributions to the repository in their quest to collect the data. Despite these efforts, the team could only obtain geospatial coordinates and collection years for about 33% of the SRA datasets.
Gaither says the goal of the article is to bring the issue to the forefront and make researchers aware, since the time to plug the metadata gap is now.
“Really we are asking for community standards that preserve not only genetic sequence data but also the invaluable metadata tied to each sequence,” she says.
Gaither says the work came about as research plans were being adjusted during the COVID-19 pandemic.
“Last summer, with our research plans in jeopardy due to the pandemic, we gathered a group of graduate students and researchers from across the United States, Australia and New Zealand and started working to address the problem via an online, remote ‘datathon,’” she says. “Over about three months, students scoured the primary research articles and reached out to lead authors and researchers to fill in the missing data.”
UCF biology graduate students who took part in the study include doctoral students Emily Farrell and Maryam Ghoojaei and alum Thienthanh Trinh ’20MS.
Gaither received her doctorate in zoology from the University of Hawaii at Manoa and joined UCF’s Department of Biology, part of UCF’s College of Sciences, in 2017. She is also a member of UCF’s Genomics and Bioinformatics Cluster.