‘Correlation does not imply causation’ is a statistical mantra. Most good high school and undergraduate statistics courses teach this, and most good science bloggers, journalists and scientists repeat it over and over again. But when and how far does that mantra extend into regression model territory? And what of the no-man’s-land surrounding this mysterious terra statistica?
Causal language refers to definitive statements that describe a cause and effect between two variables. It is in the same camp as the active voice, which is increasingly being promoted as the ‘way to write’ for scientists. Passive voice and non-directional language, once the standard of scientific writing, are now seen by some as vague, ambiguous and open to misinterpretation. But in our rush to be active, confident and ‘own’ our research results, are we risking misinterpretation and misunderstanding of science at the other end of the scale? “Building more roads increases bee abundance” might sound dramatic, convincing and galvanising…but it doesn’t mean quite the same thing as “Bee abundance was associated with the number of grassy road verges in the landscape”.*
I’ve just reviewed a manuscript (for a pretty good general science journal) in which the authors made a simple mistake that most of us have made at some point: they measured a handful of small-scale interacting variables at various sites across an ecosystem type, ran a generalised linear model with those variables, and then implied causation across broadly relevant landscapes, animal communities and ecosystem functions. They used a sampling method that was not specifically designed to measure the variable they were interested in, so they were missing some of the information they needed to make the claims they wanted to make. And the handful of variables they did measure were based on ecological interactions that are known to be influenced by multiple other, often more significant, environmental drivers that weren’t considered in their study.
I am a relative academic novice, but this is not the first manuscript I’ve reviewed that has taken this approach. The general topic and study aims are always interesting and topical, and the Introduction well-written and reasonable, so I usually get halfway through the Methods with positive thoughts before alarm bells ring. The Discussion then becomes a wild affair, claiming the moon and stars with little or no evidence to back up the connection.
Am I crazy, or being too picky? I don’t believe so. Like most peer reviewers, I am not a statistician, and I don’t claim to be any kind of expert in advanced statistics or modelling techniques. But I know the basic tenets of ecological data analysis, and implying causation from correlation is not one of them. Why do apparently experienced researchers submit papers that do this, especially where multiple confounding factors are involved, but not acknowledged by the study design or analysis methods? And why do editors send these papers out for review?
Is it because this is how we are taught science from an early age? When I was studying for my undergraduate science degree, I wrote lots of ‘cause and effect’ papers as assignments. Simple, controlled experiments are the most effective way to teach science at this level, where results need to be obtained within a 3-hour practical session, or a few weeks of the teaching semester. Light affects seedling growth in a greenhouse, oxygen levels affect fish-tank communities, etc. But this isn’t always reality, and it rarely applies to field ecology studies. Like many young researchers, I learned this the hard way when I started my PhD.
Data collected through field ecology studies are rarely controlled, for many reasons, including time, money and the impossibility of a single researcher being in multiple places at once. This is completely fine (as long as the data are analysed appropriately), and it is part of what makes ecology so exciting. But the study design and data analyses need to acknowledge these limitations and factor them into the study. Even then, it can still be difficult to make definitive statements.
Is it because of the modern obsession with predictive modelling? As the number of published papers based on data modelling increases, could newer researchers assume that they need to use similar modelling techniques to remain ‘current’ or contribute to ecological theory and knowledge? This type of modelling is not suited to all data types or all research questions, but few contemporary modelling-focused studies clarify this. If you do a quick Google search of regression and causality, you will find many apparently reputable blogs and statistical sites that appear to promote the use of regression to make ‘strong’ assumptions about research data, without clarifying the dirty details.
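The confounding problem behind all of this is easy to demonstrate with simulated data. In the hypothetical sketch below (the variable names echo the fictional bee example above and are purely illustrative), a hidden environmental driver produces a strong regression slope between two variables that have no causal link to each other at all:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hidden confounder: say, overall landscape quality, which drives
# both road-verge area and bee abundance (names are illustrative).
landscape_quality = rng.normal(size=n)

# Neither variable causally affects the other; both simply respond
# to the confounder, plus independent noise.
verge_area = 2.0 * landscape_quality + rng.normal(size=n)
bee_abundance = 3.0 * landscape_quality + rng.normal(size=n)

# A naive regression of bees on verges finds a strong, 'significant'
# slope and a high correlation...
slope, intercept = np.polyfit(verge_area, bee_abundance, 1)
r = np.corrcoef(verge_area, bee_abundance)[0, 1]
print(f"slope = {slope:.2f}, r = {r:.2f}")

# ...but 'building more verges increases bees' would be wrong here:
# once the confounder's contribution is removed, the association vanishes.
residual_x = verge_area - 2.0 * landscape_quality
residual_y = bee_abundance - 3.0 * landscape_quality
r_partial = np.corrcoef(residual_x, residual_y)[0, 1]
print(f"r after removing the confounder = {r_partial:.2f}")
```

The regression itself is doing nothing wrong; it faithfully reports the association. The error only enters when the slope is written up in causal language without the unmeasured driver being acknowledged.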
Or is it because of the modern obsession with novelty and certainty in science? Although there are great journals that don’t use novelty as the main criterion for accepting a paper, the current culture of popular science suggests otherwise. It is ingrained in researchers, through media, academia, politics and institutional goals, that we need to produce ‘wow’ research to get ahead and have an impact. This article suggests that this might have unintended consequences by encouraging researchers to claim novelty where there is none, simply to meet this criterion.
Is it a combination of all of these factors? Most likely. Because we live in a complex world, and very rarely is one isolated factor ever the single cause of anything. Of course, it’s nigh impossible to identify every interaction and causal factor in a single system. But ecology benefits from studies that acknowledge limitations, state what they didn’t study and identify peripheral factors that need further attention. Too much generalisation and overuse of ‘active’ causal language can increase misunderstanding about the process of scientific research and reduce emphasis on proven causal relationships.
* This is a fictional statement based on fact.
© Manu Saunders 2015