On this Groundhog Day, I should be focused on how many weeks are left in winter. Instead, I still find myself in that period of reflection spurred on by the arbitrary change of number associated with our calendar. As for many, 2020 was marked by much change in my life. I attribute much of the positive to attempts to build various mindfulness practices into my life. As a statistician, I often find myself in a tension between caring and not caring about results of research on such practices. Why the tension?
As I navigate the information out there about mindfulness, meditation, awareness, theories of consciousness, etc., I am struck again and again by how often the phrase “research shows…” is used as justification for potential benefits – often without much to back it up. Phrases such as “evidence based,” “significant effect,” and “proven” are also tossed around as selling points. And this is where the tension starts to build in me. I consider myself a scientist – I value research and I care about quality of that research and the information it may provide. I am all for studying mindfulness-based interventions (in their various forms) that hold potential for positive effects on overall health for individuals.
However, using common research methodology to study effects of mindfulness training on outcomes such as “overall wellness,” “stress,” or “happiness” feels to me like wading into a murky and complex environment to measure something with tools designed for a simpler and far less turbid environment. We all know, scientists or not, how complex human behaviors and perceptions can be – and as consumers of news and science, it can’t hurt to remind ourselves of that complexity … often.
I want to be very clear that my message here is not anti-research, but it is a message of caution about assuming research can provide quick answers, particularly in complex social science contexts. “Research shows…” should not automatically be interpreted as “the answer is …” There are many paths we could take to examine the problems and potential implications of such statements. For example, they are often made in reference to the conclusion from a single study – which always has limitations. For this post, I want to focus on a few issues that I don’t see getting enough attention in the context of human-related research.
Early research about interventions, such as mindfulness techniques, tends to be framed under the goal of deciding “whether or not there is an effect.” I expect the phrase to sound familiar to you, as variations show up in peer-reviewed scientific writing, as well as science journalism. Statistical methodology and reasoning are often assumed to provide tools to help answer the question – which is why my job often leads me to obsess about the phrase and its two parts: “whether or not” and “has an effect.” What do they imply and how are they interpreted (even subconsciously)? I try to be aware, but still catch myself lapsing, even if momentarily, into comfortable acceptance of some of the implications!
Whether or not
Framing a research question as “whether or not” severely oversimplifies a problem. The phrase implies all or nothing: it either works for everyone or it does not work for anyone. While it often makes an attractive title or headline, such “yes or no” wording may also lead us to trust a headline (whether positive or negative) more than we would otherwise, maybe because it leaves so little room or invitation for follow-up questions. It’s nice when at least the ‘how’ is included with the ‘whether or not’ – like in this Washington Post article.
From a Statistics viewpoint, much has been written about the mistakes we (including scientists) make with our tendency to dichotomize (e.g., significant or not significant, effect or no effect, assumptions met or not met, etc.). We seem to naturally steer away from having to deal with gray area and uncertainty. Categorizing and simplifying aren’t always bad – they can be done with purpose and justification – but, we best beware of how they can take attention away from deeper questions. “Is the situation as simple as whether or not?” “What criteria are used to decide between the ‘whether’ and the ‘not’?” “Are ‘maybe’ or ‘it depends’ possible conclusions?” “How well does the instrument measure what we’re after?” “I wonder how effects might differ across individuals and why?”
Has an effect
I find the implications hidden in the phrase “has an effect” to be more subtle and harder to convey. In my life as a statistician, the “whether or not” started to bother me very early on, but the issues with “has an effect” were slower to show themselves — I think because it’s so deeply embedded in how statistical methodology is typically taught. There are many questions that now immediately surface for me when I see the phrase. The first is usually the “causal” implication of the word “effect,” but I will focus on a couple of other issues for this post.
Do results apply to me?
The first issue is perhaps more about what is not said than what is said. The phrase “an effect” with no qualifiers implies the treatment is expected to apply in the same way to anyone and everyone. At the very least, it encourages sweeping a huge question about the results under the rug: “Who, if anyone, might the conclusion apply to and why?” When we are trying to decide what to take away from the information for our own lives, this is a crucial question.
Egos are strong and I don’t always find my first reaction to be “What are the reasons this conclusion might not apply to me?” I guess you can test this on your own by checking your immediate reaction upon reading such headlines. The consequences of not asking the question certainly depend on risks associated with assuming it does apply to you. Risks are minimal regarding meditation for most people, but that is certainly not the case for medical treatments with possible side effects, such as hormone replacement therapy for women in menopause. And the varied effects of COVID-19 across individuals have made us all painfully aware of individual differences and how hard it can be to predict or explain them.
Does describing a single effect make sense?
There are other more subtle issues related to “an effect” that have deep connections to the use of Statistics. We often refer to an effect; we try to measure it on individuals, we try to estimate it with our models, and we report research as if we captured it. This all assumes there actually is some true underlying common effect of the treatment on individuals that can be measured. But, what if the premise of being able to learn about an effect is too oversimplified and provides a shaky foundation? What if we miss important parts of the story because of a default mindset to pursue a common, or overall, effect?
For many (most?) mind-based interventions it seems pretty safe to expect that people will respond differently to a treatment depending on the prior conditions of their lives – and differences can be substantial – not reasonably chalked up to “random errors.” And things get murkier when we also acknowledge the challenges of measuring what we want to measure and designing controlled experiments on humans.
The assumption of “a common effect” is deeply embedded into current research culture — in both language and methods. And, it is integrally connected to our addiction to using averages. I tend to put a good sized chunk of blame onto pressure to use popular (or at least historically widespread) statistical methods – even in the context of trying to describe complex human feelings.
The statistical methods we are first introduced to depend on the common effect assumption being a reasonable one – yet we’re not really taught to grapple with the reasonableness of that assumption. Before walking through a hypothetical example to make things more tangible, it’s worth asking – Why do we rely so heavily on common effects and averages?
People are not plants
There is a strong historical connection between agriculture experiments and the history of common text-book statistical methods (e.g., t-tests). Methods that make sense for capturing effects in the context of plants may not extend well to the study of things like human behavior and feelings. Put simply — people are not plants. While this statement may not seem very profound, I believe it’s worth a little consideration.
There may be value in borrowing methods that work well in the context of agricultural experiments, but there is also value in reflecting on how different a group of human participants is from a collection of plants growing in a field or greenhouse. For genetically similar plants in a controlled environment, it is much easier to envision the reasonableness of assuming a common effect to something like a fertilizer. With plants, the focus is on physical characteristics that are straightforward to measure and averages often do say something meaningful about a “typical” plant. I think we can all agree that measuring the height of a plant is far different than attempting to measure something like stress or happiness in a human.
There is a lot to say on this topic and in lieu of more words in an already long post, I leave you with two contrasting images I hope might stick in your mind more than just the phrase “people are not plants.”
First image: A few sunflower plants are growing near each other. They are facing one direction, except one plant whose flower points in a noticeably different direction. Is this surprising to you? Why?
Second image: A few humans, maybe even genetically similar, are sitting next to each other in the same environment. They are facing different directions and focusing on different aspects of the environment – and likely feeling and thinking different things. Are you surprised? Why?
A hypothetical example
To make things a bit more tangible, let’s walk through a hypothetical example. Suppose a researcher wants to study potential changes in stress associated with 4 weeks of guided daily meditations. The researcher chooses a popular survey instrument designed to attempt to measure overall stress by translating the survey responses into a numeric score between 0 and 100. Participants are volunteers who say they have never meditated before; they are randomly assigned to either receive the daily meditations via an app on their phone (treatment group) or a “control” group of no meditation (for ethical reasons, they will be given access to the app after the end of the study!).
All participants are given the survey before and after the 4 week period. Before the study, researchers and practitioners agreed that – for an individual – a change in score of about 10 points is considered meaningful because it represents enough change that an individual typically notices it in their lives. (Note – even this gets tricky because it assumes a 10 point change is meaningful regardless of where the person starts … but I’m dodging that rabbit hole, along with others, for the sake of a shorter long post).
Now – to the point — suppose that about half the people in the meditation group had changes in their scores close to or greater than 10 points, while the other half had changes between about -3 and 3 (consistent with changes observed among people in the control group). If the researcher follows data analysis norms (STAT 101 methods), they will focus on comparing the average score change in the meditation group to the average score change in the “control” group (relative to observed variability among scores – thank you Statistics). The average for the meditation group might be around 5 after combining the scores near zero and the scores near 10 — a number that doesn’t do a good job describing the response of any individual in the study.
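The arithmetic behind that average can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score changes below are numbers invented to match the description above (about half of the meditation group near 10, the rest between -3 and 3), not data from any study, and the variable names are hypothetical.

```python
from statistics import mean

# Hypothetical score changes, invented for illustration only.
# Meditation group: six changes near 10, six changes between -3 and 3.
meditation = [11, 9, 12, 10, 8, 10, 1, -2, 0, 2, -1, 0]
# Control group: all changes between -3 and 3.
control = [2, -3, 0, 1, -1, 3, -2, 0, 1, -1, 2, -2]

meditation_mean = mean(meditation)
control_mean = mean(control)

# How far is the closest meditation participant from their own group average?
closest_gap = min(abs(change - meditation_mean) for change in meditation)

print(f"meditation average: {meditation_mean:.1f}")
print(f"control average: {control_mean:.1f}")
print(f"closest individual is {closest_gap:.1f} points from the meditation average")
```

With these made-up numbers the meditation average is 5, yet no participant's change is within 3 points of it.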
What does the average even represent for the meditation group? Is it a meaningful summary if it doesn’t describe a score change observed by any of the individuals in the study? Does the criterion of a 10-point change developed for individuals hold when thinking about the average over a group of people? What are the broader implications if readers (or worse, researchers) don’t ever look at the individual responses enough to recognize the two groups of outcomes in the data – because use of averages is so entrenched in our methodology and expectations?
It’s not unrealistic to follow this hypothetical scenario further and imagine the researcher using statistical methods to conclude there “is no effect” of the meditation program, which then may show up as a headline “Research shows meditation does not lessen stress” (I will dodge another rabbit hole here, but note they should avoid confusing a lack of evidence for an effect with evidence of no effect).
The often hidden assumption that if the intervention works it should work the same way on all individuals ends up misleading readers. What happens to the valuable information contained in the fact it did appear to work (according to the criterion) for about 1/2 of the participants? Sometimes such information is conveyed, but from personal experience working with researchers, I suspect it is lost in many cases — particularly if the raw data are never plotted and aren’t publicly shared (yet more rabbit holes I will dodge).
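One way to keep that information from getting lost is to tabulate individual changes alongside any average. The sketch below is a minimal illustration, again using invented numbers; the cutoff of -3 to 3 for "looks like the control group" is an assumption taken from the hypothetical description above, not an established rule.

```python
from statistics import mean

# Hypothetical score changes for the meditation group, invented for illustration.
meditation = [11, 9, 12, 10, 8, 10, 1, -2, 0, 2, -1, 0]

# Assumed range of changes that resemble the control group (about -3 to 3).
CONTROL_LIKE_LOW, CONTROL_LIKE_HIGH = -3, 3


def looks_like_control(change):
    """Return True when a change falls inside the assumed control-like range."""
    return CONTROL_LIKE_LOW <= change <= CONTROL_LIKE_HIGH


unchanged = [c for c in meditation if looks_like_control(c)]
changed = [c for c in meditation if not looks_like_control(c)]

share_changed = len(changed) / len(meditation)

print(f"share with a change outside the control-like range: {share_changed:.0%}")
print(f"average among those participants: {mean(changed):.1f}")
print(f"average among the rest: {mean(unchanged):.1f}")
```

Splitting participants after seeing the data is exploratory, so a summary like this is better read as a prompt for follow-up questions than as a conclusion.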
I think it is worth considering the consequences of never digesting the information. A person who could have benefited from the treatment may log the headline away as a reason not to spend the time. Researchers may put time and effort into designing a larger study with more participants (with the stated goal of increasing statistical power) or they may work to change the meditation treatment before trying again to detect a difference based on averages.
What would happen if, instead, we decided to go beyond just a focus on averages and spent time speculating about the split in outcomes observed for the participants in the treatment group? There are many potential explanations for why the treatment seemed to elicit meaningful change in the score for some participants and not others.
Perhaps it reveals something as boring (though important!) as a fundamental problem with measurement – maybe the instrument was picking up stressful daily events for some participants rather than any effect of the treatment. Or, maybe because the participants knew the researchers hoped they would benefit from the treatment (no blinding in the design), some participants unknowingly answered the survey in a way consistent with expectations. Or, maybe upon asking follow-up questions, the researcher realizes the split coincides with self-reports of how often individuals seriously engaged with the meditations, so that it might reflect a sort of “dose”. Or, maybe the split is actually reflective of how different individuals might respond to the 4 weeks of meditation for yet unknown reasons. And I could go on. The point is – letting go of tunnel vision on averages and common effects opens up so much for consideration and future research.
I hope awareness that “an effect” relies on assumptions to be questioned might invite something different – from readers, scientists, and journalists. We might choose not to downplay (or completely ignore) individual differences when choosing words to summarize research. We don’t have to be stuck using averages and common effects – there is room for creativity and more nuanced interpretations. I notice New York Times journalists inserting words like “can” and “may” into headlines; small changes I believe can make a big difference in what our brains first take away. Even those three-letter words provide a subtle invitation to ask follow-up questions.
Back to meditation
I started writing this now long post in response to hearing and reading the phrase “research shows …” followed by words implying an effect that applies to anyone. I am really interested in meditation research, but my motivation is more a desire to understand how practices of the mind are associated with physical changes in the body, than trying to use research to decide if the practice is worth doing for myself. It is not a medical decision with the potential for serious negative risks – in fact, the risk of not trying it may be greater than any risk of trying it.
Would a few headlines with the flavor “research shows no effect of meditation on …” affect my decision to spend time practicing? Definitely not. I know my positive experiences and I know how much we, by necessity, simplify complex human behavior to carry out research. This is a setting where I doubt the ability to measure what we’re really after and question how meaningful a reported estimate of “an effect” is. The potential effects of mindfulness practices (like many other human interventions) are complicated and likely vary substantially across individuals and even over time within the same individual! There’s a lot of uncertainty, and I’m okay with that.
4 comments on this post
A couple of comments:
STAT 101 methods –> outside the US and maybe the UK, no one is getting what it means. But I agree with the point.
As a martial art practitioner (26 years in 2021) I can tell that a practice of the mind and the body may totally change your mind. However, as a statistician (almost 30 years in 2021) I thought often about how to design an experiment to verify my experience in a general population. I never came up with a proper solution.
Thanks nice post
Thanks for your comment and catching the issue with the STAT 101 reference. Just to clarify — I am referring to a typically one-semester introductory level stats course, typically in a university/college setting.
Thanks, great post, although as a biologist I must report that I found measuring the height of a plant more difficult than measuring the height of a human. And I guess my colleagues in plant physiology would stress that measuring stress in plants is rather complex! They all react differently to their stressors.
Definitely good points – didn’t mean to imply all things on plants are easy to measure! And, I just left out non-human animals altogether. I just hope the phrase and image provide a little reminder of the challenges inherent in carrying statistical methods commonly used in agriculture over into social-science-like settings.