Around the middle of July 2020, the Italian National Institute of Health released statistics on the Covid-19 situation in the country, and it appeared that a substantial number of confirmed cases were concentrated in 19 to 50 year olds, making up 47% of the total, and the average age of those who tested positive was 46 years old — compared to 61 at the beginning of the pandemic. Should we conclude younger people are more likely to get infected today? Possibly yes, but it depends.
A not so good random sample
In October 1936, the magazine Literary Digest ran a poll on people’s voting intentions for the upcoming presidential election. The poll predicted that the Republican Alfred M. Landon would get 57% of the votes against the incumbent President Roosevelt’s 43%. Nevertheless, the Democrat Roosevelt won the election with about 60% of the vote. The magazine was not even close to the actual result. At the same time, the statistician George Gallup had managed to correctly predict the result of the same election and, to add insult to injury, with a much smaller sample. For the useless poll run by the Literary Digest, 10 million people were polled, as compared to 50 thousand by Gallup. How could this be? The magazine randomly selected its sample from telephone and automobile registries: a list that inevitably included more Republicans, who, at the time, tended to be richer and more likely to own a telephone or an automobile.
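The effect can be reproduced with a small simulation. The sketch below is only illustrative: the electorate size, the share of voters on the registries and the voting probabilities are all made-up assumptions, not historical figures. It compares a large random sample drawn only from the registries with a small random sample drawn from everyone.

```python
import random

random.seed(1936)  # fixed seed so the sketch is reproducible

# Hypothetical electorate: each voter is (on_registry, votes_roosevelt).
# All probabilities are illustrative assumptions, not historical data.
def make_voter():
    on_registry = random.random() < 0.30         # owns a telephone or a car
    p_roosevelt = 0.40 if on_registry else 0.69  # richer voters lean Republican
    return on_registry, random.random() < p_roosevelt

population = [make_voter() for _ in range(200_000)]

def roosevelt_share(voters):
    """Fraction of the given voters who support Roosevelt."""
    return sum(vote for _, vote in voters) / len(voters)

true_share = roosevelt_share(population)

# Large sample, random, but drawn only from the registries (biased list).
registry = [voter for voter in population if voter[0]]
big_biased = random.sample(registry, 20_000)

# Small sample, random, drawn from the whole electorate.
small_random = random.sample(population, 1_000)

print(f"true share of Roosevelt voters: {true_share:.3f}")
print(f"20,000 voters from registries:  {roosevelt_share(big_biased):.3f}")
print(f"1,000 voters from everyone:     {roosevelt_share(small_random):.3f}")
```

Under these assumptions the large sample lands far from the true share while the small one lands close to it: a bigger sample only shrinks the random error, not the error built into the list it was drawn from.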
All that we cannot see
The Literary Digest poll is often cited as an example of what can go wrong when only the size of the sample is taken into account, and not the other characteristics that determine its quality. In statistics, we often aim to learn about a population from a part of it: the sample. It therefore seems only reasonable to pay more attention to how that sample is obtained.
Let us imagine having a large urn filled with pink and green balls; we want to know how many balls are pink and how many are green. There is only one way to know it for sure: empty the urn and count all of them. On the other hand, if we have limited time and are satisfied with an estimate (basically an educated guess), we may take a handful of balls (a sample) and estimate the proportion of green and pink balls in the urn (what we cannot see) from the proportion we observe in our hand (what we can see).
The simplest case would be to assume that the composition of the urn is the same as that observed in the sample. This assumption is critical; what we cannot see might describe a different reality compared to what we observe. The quality of the estimate depends on the characteristics of the sample. What if the pink balls were added last and not mixed, and a handful was taken from the top (or bottom) of the urn? What if the balls were thoroughly mixed before grabbing a handful? The reasonableness of the above assumption depends on such details.
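The two scenarios can be sketched in a few lines. The urn below is hypothetical (700 green balls poured in first, 300 pink balls added last, a handful of 50), and the names are chosen only for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical urn: green balls first, pink balls added last.
# The end of the list plays the role of the top of the urn.
urn = ["green"] * 700 + ["pink"] * 300
true_pink = urn.count("pink") / len(urn)

def pink_share(handful):
    """Proportion of pink balls in a handful."""
    return handful.count("pink") / len(handful)

# Scenario 1: a handful of 50 taken from the top, without mixing.
unmixed_handful = urn[-50:]

# Scenario 2: the same handful size after mixing thoroughly.
mixed_urn = urn[:]
random.shuffle(mixed_urn)
mixed_handful = mixed_urn[-50:]

print(f"true proportion of pink:   {true_pink:.2f}")
print(f"estimate, unmixed handful: {pink_share(unmixed_handful):.2f}")
print(f"estimate, mixed handful:   {pink_share(mixed_handful):.2f}")
```

The unmixed handful contains only pink balls, so it suggests an urn that is entirely pink. The mixed handful will not match the urn exactly, but its error comes from chance alone and shrinks as the handful grows.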
Often, we cannot see something for reasons that are not entirely due to chance and can even be related to what we are trying to learn about. For instance, in the story Silver Blaze by Sir Arthur Conan Doyle, Sherlock Holmes solves the mystery in part by recognizing that the watchdog did not bark during the night.
“To the curious incident of the dog in the night-time”, Holmes says.
“The dog did nothing in the night-time.”
“That was the curious incident.”
The fact that the dog did not bark was informative in itself, implying that the evildoer was not a stranger, but someone the dog recognized. In the same way, in a survey about drug abuse, we are more likely to see responses from people who do not use drugs; if you are using illegal substances you may not want to say it aloud for fear of being prosecuted.
Why do fewer younger people appear in the early statistics?
While it is entirely possible that the reported increase in the number of young people with positive tests is due to a real increase in the number infected in the population, it may also be due to a change in the sample of people being tested. As the number of people in intensive care and dying due to Covid-19 decreased, the priorities of who to test changed and the capabilities to test expanded.
Not only are patients with symptoms now tested to confirm the presence of the virus, but more healthy people are also tested to confirm their health and return to a normal life. This means that the demographic of people who were tested at the beginning of the epidemic is different to that of the people tested today. During the emergency, those who experienced severe symptoms, in most cases older people, were more likely to get tested. Today, testing is also extended to those who do not experience any symptoms. As younger people are more likely to be asymptomatic or only mildly symptomatic, we can reason that the decrease in the reported average age of those testing positive does not mean that younger people were not infected to begin with; rather, they tended not to present severe symptoms and so were less likely to be tested.
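A rough simulation shows how a change in testing policy alone can lower the average age of confirmed cases. Everything in this sketch is an assumption made for illustration: the uniform age distribution, the linear link between age and severe symptoms, and the 40% testing rate for people without severe symptoms. The group of infected people is the same under both policies.

```python
import random

random.seed(2020)  # fixed seed so the sketch is reproducible

# Hypothetical infected population: ages uniform between 0 and 90.
infected_ages = [random.uniform(0, 90) for _ in range(50_000)]

def has_severe_symptoms(age):
    # Assumption: the chance of severe symptoms grows linearly with age.
    return random.random() < age / 100

severe = [has_severe_symptoms(age) for age in infected_ages]

def mean(values):
    return sum(values) / len(values)

# Early policy: only people with severe symptoms are tested.
early_cases = [age for age, s in zip(infected_ages, severe) if s]

# Later policy: severe cases plus 40% of everyone else (assumed rate).
late_cases = [age for age, s in zip(infected_ages, severe)
              if s or random.random() < 0.40]

print(f"average age, all infected:            {mean(infected_ages):.1f}")
print(f"average age, confirmed (early policy): {mean(early_cases):.1f}")
print(f"average age, confirmed (later policy): {mean(late_cases):.1f}")
```

The average age of confirmed cases drops between the two policies even though not a single infection was added or removed. Only the way cases were selected for testing changed.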
It is also reasonable to assert that young people may be more likely to get sick now than during the lockdown. When everything was closed, the virus spread mostly in hospitals and nursing homes, where people are likely to be older. Today, as people have returned to face-to-face socialization, it is increasingly likely that the virus will pass through younger people because they tend to have more social contact. It is hard to conclude from the data available, though, that the number of young people infected increased after the lockdown, as we do not really know how many were infected at that time given the selective testing (like grabbing a handful of balls from just the top of the urn).
It would be risky, and scientifically wrong, to argue that either of these claims can be deduced from the available data, which are samples drawn from two demographically different populations. Such an attempt would not be too far from the one made by a fortune teller who claims, on the basis of a few cursory questions, to predict our present and future life.
How we choose to administer the swabs, and to whom, can shape our perception of the reality we live in. The American President Trump understood that and said “Don’t forget, we have more cases than anybody in the world. But why? Because we do more testing. When you test, you have a case. When you test, you find something is wrong with people. If we did not do any testing, we would have very few cases.” But cases do not increase with testing; when we test more, we only see more of them. If there is actually a relationship between testing and cases, it is in the opposite direction. The purpose of testing is not just to estimate the number of infected people, but also to monitor and isolate the infected to stop the spread of the virus: the more testing we do, the fewer cases we will expect to have.
Today, we see younger people to be more affected than before, since more young people have been tested. There is nothing wrong with this way of reasoning, as long as we understand that the sample we select does not affect the true number of positive cases or the true number of young people infected, only the perception we might have of it. So, as President Trump should also know, even if we did less testing, the United States would still hold the record for number of positive cases; we just would not know it.
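The difference between the cases we confirm and the cases that exist can be made concrete with a short sketch. The population size and the 5% infection rate are invented for illustration, and the tests are assumed to be perfectly accurate and given to a simple random sample of people.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 people, about 5% of them infected.
# True means infected, False means not infected.
population = [random.random() < 0.05 for _ in range(100_000)]
true_cases = sum(population)

def confirmed_cases(n_tests):
    # Assumption: tests are perfectly accurate and go to a random sample.
    return sum(random.sample(population, n_tests))

for n_tests in (1_000, 10_000, 50_000):
    print(f"tests: {n_tests:>6}  confirmed: {confirmed_cases(n_tests):>5}  "
          f"true cases: {true_cases}")
```

The number of confirmed cases grows with the number of tests, while the number of true cases stays fixed throughout. Doing no tests at all would give zero confirmed cases and leave the true count exactly where it was.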
Thinking about what we cannot see
The real problem is to describe what we cannot see, for example the number of infected people who have not been tested. We might have some intuitions from what we observe (that the average age of infected people has decreased since the beginning of the pandemic), but testing their plausibility is not always so easy, because we do not know how different the sample observed today is from the one observed in early March. And, even when our intuitions appear rational, we should remember that a claim is not credible simply because it seems right on the basis of what we see; we have to use our ability and technical skill to objectively describe what we do not observe. We cannot say that younger people are more likely to get infected today than before just because that is what we observe in the sample. We should also ask ourselves how many infected young people we do not see now and did not see before.
That is to say that when we base our results on the data as-they-are, without considering the mechanism by which they were selected, we are only calculating numbers without any grounds for trusting their validity. Justifying their validity means rethinking and taking into account the whole process of inference, which includes both sampling and estimation.
We could make a guess, and in many cases common sense is enough to make a reasonable one. The role of the statistician is to justify the scientific validity of the intuition behind this guess. As Laplace said:
“The theory of probabilities is at bottom nothing but common sense reduced to calculus; it enables us to appreciate with exactness that which accurate minds feel with a sort of instinct for which ofttimes they are unable to account.”
This article was originally written for the Facebook page Coronavirus – Dati e Analisi Scientifiche.