
New paper on how artifacts drive statistical correlation in environmental samples

September 14, 2023

A new preprint on environmental sampling has been published. Crits-Christoph and colleagues describe how metatranscriptomic data can be used to identify animals and animal viruses in the Huanan Seafood Wholesale Market at the beginning of the COVID-19 pandemic. Zach helped analyze statistical correlations in the data, demonstrating Carr et al.’s claim that it can be difficult, if not impossible, to infer a direct association between abundances from statistical association alone. The supplemental figure here shows that combining samples from two different days, collected with two different sampling strategies, can transform a lack of correlation into an apparent, unrealistically strong correlation. Co-corresponding author Florence Débarre noted that this is a demonstration of Simpson’s Paradox.
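To make the pooling effect concrete, here is a minimal sketch in Python using made-up numbers, not the market data or the preprint's analysis. It assumes two hypothetical sampling days on which two signals are drawn independently, with the second day's sampling strategy shifting both signals upward by the same amount. The names `simulate_day` and `pearson`, and the shift of 6 units, are illustrative assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_day(rng, n_samples, mean_level):
    """Draw two signals independently around a shared day-level mean.

    Hypothetical model: values stand in for log abundances, and the
    day-level mean stands in for the effect of that day's sampling strategy.
    """
    x = [rng.gauss(mean_level, 1.0) for _ in range(n_samples)]
    y = [rng.gauss(mean_level, 1.0) for _ in range(n_samples)]
    return x, y


rng = random.Random(0)  # fixed seed so the sketch is reproducible
day1_x, day1_y = simulate_day(rng, 200, mean_level=0.0)
day2_x, day2_y = simulate_day(rng, 200, mean_level=6.0)

# Within each day the two signals are independent by construction.
r_day1 = pearson(day1_x, day1_y)
r_day2 = pearson(day2_x, day2_y)

# Pooling the days lets the day-level shift masquerade as co-variation.
r_pooled = pearson(day1_x + day2_x, day1_y + day2_y)

print(f"day 1 r = {r_day1:.2f}")
print(f"day 2 r = {r_day2:.2f}")
print(f"pooled r = {r_pooled:.2f}")
```

Under these assumptions the within-day correlations should sit near zero, while the pooled correlation should be strongly positive: the between-day shift contributes a covariance of 9 against a total variance of 10, so the expected pooled value is about 0.9.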

Artifactual correlation

Zach first applied an early version of this analysis to another data set to examine what drives apparent positive and negative correlations between nucleic acid abundances in environmental sequencing. As he discussed for an article, the appearance of unrealistic correlation between viral RNA and species that aren’t susceptible to SARS-CoV-2 is not a reason to dismiss these data entirely. Rather, correlation in the data set largely arises from clear artifacts of combining samples from different days with different sampling strategies. This is the largest, but not the only, artifact affecting these data. As discussed by Saccenti et al., correlated, multiplicative error also plays a role, as do nucleic acid degradation over time and other factors.
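The effect of correlated, multiplicative error can be illustrated with a small simulation as well. The following is a sketch under stated assumptions, not the model used by Saccenti et al. or in the preprint: two true abundances are drawn independently, and each sample's measurements are both multiplied by one shared, hypothetical per-sample factor (standing in for something like the amount of material recovered). All names and parameter values are illustrative.

```python
import random
from math import exp, log, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the sketch is reproducible
n_samples = 500

# Independent "true" abundances for two targets (log-normal, assumed).
true_a = [exp(rng.gauss(0.0, 0.5)) for _ in range(n_samples)]
true_b = [exp(rng.gauss(0.0, 0.5)) for _ in range(n_samples)]

# One multiplicative factor per sample, shared by both measurements.
shared_factor = [exp(rng.gauss(0.0, 1.0)) for _ in range(n_samples)]

observed_a = [a * f for a, f in zip(true_a, shared_factor)]
observed_b = [b * f for b, f in zip(true_b, shared_factor)]

# Compare correlations on the log scale, before and after the shared error.
r_true = pearson([log(v) for v in true_a], [log(v) for v in true_b])
r_observed = pearson([log(v) for v in observed_a], [log(v) for v in observed_b])

print(f"true abundances r = {r_true:.2f}")
print(f"observed abundances r = {r_observed:.2f}")
```

With these assumed variances, the shared factor contributes a log-scale covariance of 1 against a total variance of 1.25, so the observed correlation should be close to 0.8 even though the underlying abundances are unrelated.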

Perhaps we can move beyond assigning blame for the pandemic to learning from it to reduce the likelihood of the next one. This paper analyzes data from the epicenter of the COVID-19 pandemic and identifies additional risks posed by the wildlife industry there that could be considered by countries drafting the WHO Pandemic prevention, preparedness and response accord. An equivalent of what happened at the end of 2019 could have happened, and still could happen again, in many places in the world and with many different pathogens, so it’s worth thinking about realistic, low-cost ways to reduce the likelihood that this happens.

