Biological dark data in times of viral spillover

The world changed forever in 1989 when Tim Berners-Lee invented the World Wide Web. Today, immediate access to massive amounts of content, information, photos, videos, opinions and daily narration is a normal aspect of everyday life. 

Berners-Lee later said of his invention that “it is the unexpected re-use of information, which is the value added by the web.” 

Scientists and researchers around the world found that to be truer than ever when March 2020 lockdowns due to the COVID-19 pandemic effectively cut off access to laboratories, physical libraries and collections. 

This stanching of the flow of information highlighted an issue of growing concern in scientific and academic communities — the staggering volume of human knowledge that is digitally inaccessible, and therefore effectively useless for synthetic analysis.

Nathan Upham, assistant research professor in Arizona State University's School of Life Sciences, along with other scientists from across the world, recently published a position paper in The Lancet Planetary Health highlighting the importance of this issue, and the obstacles it posed to researchers combating COVID-19. 

“The COVID-19 pandemic shows that siloed science does not serve society as efficiently as does open, interconnected science,” Upham said. 

Serving public health

In April 2020, Upham joined a task force formed by the Consortium of European Taxonomic Facilities (CETAF) and the Distributed System of Scientific Collections (DiSSCo). 

The goal was to bring together experts from both the biological informatics and collections communities with those in virology and public health to lend perspective on the wild host organisms implicated in the COVID-19 pandemic, and to identify infrastructure to prepare for future outbreaks.

Upham led a subgroup of the task force through ASU’s Biodiversity Knowledge Integration Center (BioKIC). Early information on the coronavirus SARS-CoV-2 suggested it originated from a bat. But which bat? 

“We didn’t know a lot about it, not in terms of taxonomy nor ecology,” Upham said. “Not all bats are created equal.”

There are over 1,400 species of bats worldwide, and they can be found on every continent except Antarctica.

What’s more, all of the roughly 6,500 described species of mammals, as well as many birds, can host viruses that could potentially be passed to humans in what is called a "spillover" event. 

The task force immediately saw that the wide diversity of mammals was not receiving the same attention as viruses and their health consequences for humans.

Viruses are primarily studied by immunologists and virologists, with relatively little input from zoologists and disease ecologists. As a result, knowledge about host mammals represents a significant data gap.

ASU School of Life Sciences Associate Professor Nate Upham conducts population genetics fieldwork.

“The evidence of the species — its DNA, preserved specimens, tissues and geography — should be linked to the evidence of the virus,” Upham said. “And that’s the link that’s been missing.” 

Virginia M. Ullman Professor of Ecology and biocollections director Nico Franz said, “Data science innovations can resolve tensions between evolving taxonomic knowledge and societal needs to reliably integrate information on mammalian viral vectors.

“Dr. Upham is leading the way towards building a data language that better prepares us for fully expected future disruptions in taxonomic knowledge.”

The National Institutes of Health recently awarded ASU a $300,000 grant for “intelligently predicting viral spillover risks from bats and other wild mammals,” which will continue until May 2023. 

Upham is the principal investigator on the NIH project, leading a team from the School of Life Sciences that includes Franz, Associate Professor Beckett Sterner and Professor Arvind Varsani, who also works with the Biodesign Center for Fundamental and Applied Microbiomics.

“We realized that the public health people needed to know more about mammals, and they didn’t have the most accurate information,” Upham said. 

To further complicate matters, large quantities of existing data on mammal taxonomy — including about bats — remain inaccessible. 

Biological dark data

Physicists use the term "dark matter" for matter that has not been directly observed but is thought to account for much of the mass of the universe. Biodiversity scientists have adopted a similar name, "dark data," for scientific data that, though published, is cut off from digital knowledge resources. 

Some data is inaccessible for the simple reason that it is not digitized. This includes old, rare physical collections or archives, as well as printed publications, also called “gray literature.” Other data is technically in digital format but is just as inaccessible, locked behind paywalls or trapped in unstructured formats, such as PDFs. 

“The old goal of publishing was to print it — send it out,” said Upham. “We need a new publishing paradigm where the objective is to connect pieces of data to form knowledge.”

In order for data to form digital knowledge, Upham and his task force associates agree it needs to conform to FAIR Data Principles — a set of guiding principles proposed by a consortium of scientists and organizations to support the reusability of digital assets, as published in Scientific Data in 2016. 

According to the consortium’s proposal, data is “FAIR” when it is Findable, Accessible, Interoperable and Reusable. 

However, even when data is FAIR, it can still be behind paywalls and inside PDFs, rather than digitally connected and ready for analysis. Consequently, even though these guidelines have been adopted by many research institutions worldwide since 2016, illuminating the zoonotic origins of COVID-19 is exactly the kind of unexpected reuse of data that biodiversity science was not prepared for at the start of the pandemic. 

Upham and his associates outline several efforts that are underway to address the issue, emphasizing that the needed solution is two-pronged: 

  • First, existing biological data needs to be brought out of the darkness: printed publications need to be digitized, and their key findings need to be extracted and connected in openly accessible formats across all taxonomic groups, beginning with groups of immediate biomedical concern.
  • Second, steps should be taken to stop contributing to the problem by switching to open-access publishing formats that adhere to FAIR data principles and by digitally tagging data types that are relevant for reuse, especially those related to host-pathogen and host-host relationships. 
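
As a rough illustration of what a digitally tagged host-pathogen finding might look like, the sketch below stores one finding as a structured record and writes it out as JSON, a format that software can read directly, unlike a paragraph inside a PDF. This is only a minimal sketch: the record type, the field names and every value are hypothetical placeholders chosen for illustration, not a schema proposed by Upham or the task force.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class HostPathogenRecord:
    """One published host-virus finding (all field names are hypothetical)."""
    host_name: str           # scientific name of the host as published
    host_name_source: str    # taxonomic treatment that defines that name
    specimen_id: str         # identifier of the preserved voucher specimen
    pathogen_name: str       # name of the virus detected in the host
    locality: str            # where the host animal was collected
    source_publication: str  # where the finding was published


def to_tagged_json(record: HostPathogenRecord) -> str:
    """Serialize a record so each piece of evidence sits under a labeled key."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Placeholder values only; none of this is real data.
record = HostPathogenRecord(
    host_name="Genus species (placeholder)",
    host_name_source="Example checklist (placeholder)",
    specimen_id="EXAMPLE-0001",
    pathogen_name="Example virus (placeholder)",
    locality="Example locality (placeholder)",
    source_publication="Example citation (placeholder)",
)

tagged = to_tagged_json(record)

# Reading the JSON back recovers the same record, with no manual re-keying.
restored = HostPathogenRecord(**json.loads(tagged))
print(tagged)
```

Because the host, the specimen and the virus each sit under a labeled key, another program could search or combine such records without a person reading the original publication.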

Taxonomists, ecologists, data scientists and policymakers all hold essential roles in this paradigm shift toward digital knowledge.  

Evolving terminology 

“People think taxonomy is boring, that it’s known,” Upham said. “But it’s not known. It’s a constantly evolving knowledge space that needs to be treated like its own science.”

Taxonomy is the scientific study of naming, defining and classifying groups of organisms based on shared characteristics. And while it may seem simple, straightforward and possibly even boring, it can be anything but.

Clear and consistent terminology in digital metadata is a key element in ensuring findable, accessible, interoperable and reusable data. But as our understanding of biodiversity expands over time, new traits are observed, classifications grow more specific, species are split (or lumped) and new language is introduced.

This complicates matters because the majority of data records are arranged by species name, and these names often change or shift in meaning over time. 

“A good example would be the North American deer mouse, Peromyscus maniculatus,” Upham said. “Until a few years ago that single name encompassed what is now five species. With the species being split, the old name doesn’t just go away, it takes on a new meaning, which in this case is for deer mice east of the Mississippi River.”  
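
One way to picture the problem Upham describes is to treat a name and the meaning of that name as separate things. The sketch below is an assumption-laden illustration, not the data language his team is building: it pairs each name with the taxonomic treatment that defines it, so the same name can carry a broad meaning before the split and a narrow one after. The treatment labels and the four additional species names are placeholders, not real names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaxonConcept:
    """A species name paired with the treatment that defines what it covers."""
    name: str          # scientific name as written on the record
    according_to: str  # taxonomic treatment that gives the name its meaning


# Same name, two meanings. The treatment labels are hypothetical.
BROAD = TaxonConcept("Peromyscus maniculatus", "treatment before the split")
NARROW = TaxonConcept("Peromyscus maniculatus", "treatment after the split")

# The broad concept now corresponds to five species. The four names besides
# the retained one are placeholders, not real species names.
SPLIT_INTO = {
    BROAD: [NARROW] + [
        TaxonConcept(f"Peromyscus sp. {label} (placeholder)",
                     "treatment after the split")
        for label in "BCDE"
    ],
}


def current_candidates(concept: TaxonConcept) -> List[TaxonConcept]:
    """Return the present-day concepts an older record could belong to."""
    return SPLIT_INTO.get(concept, [concept])


# A record filed under the old, broad meaning could now belong to any of
# the species produced by the split.
for candidate in current_candidates(BROAD):
    print(candidate.name)
```

A record filed only under the bare name cannot tell the two meanings apart, while a record that also states which treatment it follows can be mapped to its possible present-day species.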

Nathan Upham and colleagues on a research outing in the Great Basin Desert in June 2006. Photo courtesy of Nate Upham

Upham got his start in biocollections and fieldwork in the Great Basin of Nevada, driving out to the desert to do population genetic surveys on the weekends while working on his undergraduate degree in Los Angeles. He then moved on to graduate school, where he studied fossils and DNA of the rodent group that includes South American capybaras — the world’s largest living rodent. 

“I really like the field-to-lab aspect of doing research,” he said. “It opens your eyes to nature in a different way — it’s your job to be out there, not just because you happen to be camping or something. And you take those insights back to the lab.” 

He came to ASU in February 2020 to work with BioKIC and the ASU Natural History Collections, excited for the opportunity to work on “frameworks for moving taxonomy to a next-generation space, where it is no longer an obstacle but rather the main tool we use for linking together biodiversity observations,” he said.  

Dominique Perkins
ddperki2@asu.edu