The Human Operating Manual

Understanding Statistics

Contents

I. Start With the Kind of Study It Is

II. The p-value

III. Significance Is Not Importance

IV. Confidence Intervals

V. Sample Size and Power

VI. Relative versus Absolute Risk

VII. Confounding

VIII. How Honest Findings Go Wrong

IX. A Field Guide to Reading a Claim

X. Cross-Links

How to read a study without being fooled by the numbers.

 

By the end of this page, you should be able to pick up almost any research claim, in a paper, a headline, a podcast, a supplement advertisement, and ask the handful of questions that separate a finding you can lean on from one that is dressed as knowledge. The previous page explained the logic of how science establishes things. This one is about the layer where most real-world deception and self-deception happen: the numbers and the ways they mislead, by accident and on purpose.

Statistics is genuinely counterintuitive, and being fooled by it is not a sign of stupidity. Surveys repeatedly find that practising researchers, doctors, and statistics teachers misinterpret basic measures like the p-value. One study found the correct definition of a p-value appears in only around one in ten introductory psychology textbooks. So if some of what follows overturns what you thought these numbers meant, you are in large and well-credentialed company. The goal here is not to make you a statistician. It is to give you a small set of reliable habits of suspicion that catch the large majority of misleading claims.

 

I. Start With the Kind of Study It Is

Before any individual number, the first question is what kind of evidence you are looking at, because study design sets a ceiling on how much the numbers can tell you. Evidence-based medicine ranks study types in a rough hierarchy, and while the hierarchy is not absolute, it is the right place to start.

Near the bottom sit anecdotes, case reports, and mechanistic or “it makes sense” reasoning: a story about one person who recovered, or an argument that a treatment should work because of some biological pathway. These are not worthless; they generate hypotheses, and a single well-documented case can occasionally overturn a confident belief. But they cannot establish that something works because they have no comparison and no control for the dozens of other things that could explain the outcome. Animal studies and test-tube (“in vitro”) studies sit here too: useful for working out mechanisms, but a long way from showing that something helps an actual human, since most things that work in a mouse or a petri dish do not pan out in people.

Above these come observational studies, which watch what happens to people without intervening: cohort studies that follow groups over time, case-control studies that compare people who have an outcome with those who do not. These can examine large populations and questions you could never ethically run an experiment on, and they are the source of much of what we know about nutrition and lifestyle. But they carry an unavoidable weakness that Section VII returns to: confounding. Because the researcher did not assign who got the exposure, the groups being compared usually differ in many other ways, and untangling which difference caused the outcome is genuinely hard.

Higher still sits the randomised controlled trial, the closest thing science has to a fair test in messy human reality. By randomly assigning who gets the treatment and who gets a control or placebo, an RCT does something powerful: randomisation tends to balance out not only the confounding factors you know about but the ones you have never even thought of, distributing them roughly evenly between the groups. This is why a well-run RCT can support a causal claim (“the treatment caused the improvement”) that an observational study usually cannot. At the top sit systematic reviews and meta-analyses, which gather all the studies on a question, appraise their quality, and (in a meta-analysis) statistically pool their results, increasing the effective sample size and smoothing out the flukes of any single study.
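The balancing effect of randomisation can be seen in a toy simulation. This is a minimal sketch under invented assumptions: 10,000 hypothetical participants, 3,000 of whom carry some trait the researchers never measured, split at random into two equal arms. The function name and all numbers are illustrative, not taken from any real trial.

```python
import random

def hidden_trait_share_by_arm(n_people=10_000, n_with_trait=3_000, seed=0):
    """Randomly split people into two equal arms and return the share
    carrying an unmeasured trait in each arm (treated, control)."""
    rng = random.Random(seed)
    # True marks a person with the hidden trait; the trial never records it.
    people = [True] * n_with_trait + [False] * (n_people - n_with_trait)
    # Random assignment: after shuffling, the first half is treated
    # and the second half is the control group.
    rng.shuffle(people)
    half = n_people // 2
    treated, control = people[:half], people[half:]
    return sum(treated) / len(treated), sum(control) / len(control)

treated_share, control_share = hidden_trait_share_by_arm()
print(f"hidden trait in treated arm: {treated_share:.3f}")
print(f"hidden trait in control arm: {control_share:.3f}")
```

Both arms should end up with close to 30 percent of people carrying the trait, even though nobody sorted on it. With much smaller groups the balance is looser, which is one reason small trials are less reassuring.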

Two cautions keep this from being a mechanical ranking. First, the hierarchy is not absolute: a large, well-conducted observational study can be more trustworthy than a small, sloppy RCT, and a meta-analysis is only as good as the studies it pools (pool rubbish, get rubbish, now laundered through impressive-looking maths and a tidy forest plot). The modern GRADE approach reflects this by rating the actual certainty of evidence, starting from the study design but adjusting up or down for quality, consistency, and other factors, into ratings of high, moderate, low, or very low. Second, the right design depends on the question; you cannot run a placebo-controlled trial on whether smoking causes cancer, and some questions about human experience need qualitative methods entirely. But as a first filter, asking “what kind of study is this, and what can that kind of study actually establish?” eliminates a huge amount of nonsense before you even reach the numbers. When a dramatic health claim rests on a mouse study or a single observational correlation, you already know how much weight it can bear.

 

II. The p-value

The p-value is the most cited and most misunderstood number in science, and getting it right is the single biggest upgrade to your statistical literacy. Here is what it actually is: assuming there is genuinely no real effect (the “null hypothesis”), the p-value is the probability of getting a result at least as extreme as the one observed. That is all. By a long-standing and somewhat arbitrary convention, a p-value below 0.05 is called “statistically significant,” meaning that a result this strong would happen less than 5 percent of the time if there were really nothing going on.
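To make the definition concrete, here is a small sketch using a coin rather than a drug. The null hypothesis is that the coin is fair, and the p-value is the probability of a head count at least as far from half as the one observed. The function name and the example counts are illustrative assumptions.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """Exact probability, for a fair coin, of a head count at least as far
    from flips / 2 as the observed count (in either direction)."""
    expected = flips / 2
    observed_distance = abs(heads - expected)
    # Count every outcome at least as extreme as the one observed.
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    )
    return extreme_outcomes / 2 ** flips

p_60 = two_sided_p_value(60, 100)
p_61 = two_sided_p_value(61, 100)
print(f"60 heads in 100 flips: p = {p_60:.3f}")
print(f"61 heads in 100 flips: p = {p_61:.3f}")
```

The two results come out at about 0.057 and 0.035. One extra head moves the result across the conventional 0.05 line, although the evidence about the coin has barely changed.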

What the p-value is not: It is not the probability that the result is due to chance. It is not the probability that the null hypothesis is true. It is not the probability that the finding is a fluke, and one minus the p-value is not the probability that the finding is real. These misreadings are everywhere, including in the mouths of scientists, and they all make the same error of flipping the conditional: the p-value tells you the probability of the data given that there is no effect, not the probability that there is no effect given the data. Those are very different things, the way “the probability of wet ground if it rained” differs from “the probability it rained if the ground is wet.” A useful corrective is to think of the p-value not as a verdict but as a measure of how compatible your data are with the hypothesis of no effect. A low p-value says “this data would be surprising if nothing were going on.” It does not say “therefore my preferred explanation is true.”
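The flipped conditional can be put into numbers. The sketch below assumes a hypothetical field in which 10 percent of tested hypotheses are real, studies have 80 percent power, and the threshold is 0.05. It works with the significance threshold rather than an individual p-value, and every input is an invented assumption, but it shows how far "the probability of a significant result given no effect" can sit from "the probability of no effect given a significant result."

```python
def share_of_significant_results_with_no_effect(prior_real, power, alpha=0.05):
    """Among results that reach significance, the fraction that come from
    hypotheses with no real effect (Bayes' rule on the two conditionals)."""
    true_positives = prior_real * power          # real effect and detected
    false_positives = (1 - prior_real) * alpha   # no effect but significant
    return false_positives / (true_positives + false_positives)

share = share_of_significant_results_with_no_effect(prior_real=0.10, power=0.80)
print(f"significant results with no real effect: {share:.0%}")
```

Under these assumptions a significant result has a 5 percent chance of appearing when nothing is going on, yet 36 percent of significant results come from nothing going on. Change the assumed share of real hypotheses and the answer moves a long way, which is why the p-value alone cannot give it to you.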

The 0.05 threshold gets treated as a magic line between real and not-real, and it is nothing of the sort. A result at p = 0.04 and one at p = 0.06 are almost identical in what they tell you, yet one gets called “significant” and published and the other gets buried. Worse, the threshold invites a specific abuse, covered below, where researchers nudge their analysis until they cross the line. Treat “statistically significant” not as “true” but as “this passed one weak, gameable filter,” and you will be ahead of most readers of the scientific literature.

 

III. Significance Is Not Importance

Here is the error that statistical significance invites, and it is everywhere in health reporting. Statistical significance tells you, roughly, whether the data would be surprising if there were no effect at all. It tells you nothing about whether the effect is big enough to care about. These are completely separate questions, and conflating them is one of the most common ways numbers mislead.

The key concept is effect size: not “is there a difference?” but “how big is the difference?” With a large enough sample, almost any difference becomes statistically significant, including differences far too small to matter to a human being. A study of a hundred thousand people might find that some drug lowers your risk of an outcome by a statistically rock-solid amount that, in real terms, means almost nothing for any individual. The p-value will look spectacular; the effect size will be trivial. Whenever you see a “significant” finding, the next question is always: significant, fine, but how big is the effect, and is it big enough to change anything? A great deal of supplement marketing and diet-study hype rests on real but minuscule effects, technically significant and practically pointless.
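A quick sketch shows how sample size alone can turn the same small difference from unremarkable into "significant." It uses a standard two-proportion z-test with a normal approximation. The event counts are invented for illustration: in both studies the outcome occurs in 10.0 percent of one group and 9.5 percent of the other.

```python
from math import erfc, sqrt

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value for a difference between two proportions,
    using the pooled z-test (normal approximation)."""
    rate_a, rate_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / se
    # erfc(|z| / sqrt(2)) is the two-sided tail area of the standard normal.
    return erfc(abs(z) / sqrt(2))

# Identical rates (10.0% versus 9.5%) at two very different sample sizes.
p_small = two_proportion_p_value(100, 1_000, 95, 1_000)
p_large = two_proportion_p_value(5_000, 50_000, 4_750, 50_000)
print(f"1,000 per group:  p = {p_small:.3f}")
print(f"50,000 per group: p = {p_large:.4f}")
```

The effect size is the same half a percentage point in both cases. Only the p-value changes, which is why the p-value cannot tell you whether the effect matters.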

 

IV. Confidence Intervals

If a single number like a p-value is a blunt instrument, the confidence interval is a sharper and more honest one, and learning to read it is worth the small effort. Rather than a yes/no verdict, a confidence interval gives you a range: the study’s estimate of the effect, plus a sense of how precisely it was pinned down. A 95 percent confidence interval is, loosely, the range of values reasonably compatible with the data. More precisely, if the same study were repeated many times, about 95 percent of the intervals built this way would contain the true value.

Two things to read off it. First, the width tells you about precision: a narrow interval means the study has pinned the effect down tightly, while a wide one means there is a lot of uncertainty and the true effect could be anywhere across a broad range, a sign of a small or noisy study whose headline number should not be trusted too far. Second, what the interval includes tells you about significance in a more informative way than a p-value: if a confidence interval for a difference includes zero (no effect), the result is, in the usual terms, not statistically significant, but you can also see how far to each side it stretches, which a bare p-value hides. A study reporting “a 30 percent improvement” sounds impressive until you notice the confidence interval runs from “barely any improvement” to “enormous improvement,” which really means the study was too small to know. The confidence interval keeps the uncertainty visible. Prefer sources that report it, and be wary of those that report only a single triumphant number with no sense of the range around it.
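The effect of study size on the interval is easy to compute. The sketch below uses the simple Wald interval for a difference between two proportions, which is only a rough approximation in small samples, and reads "a 30 percent improvement" as a gap of 30 percentage points. The counts are invented: 65 percent of treated people improve against 35 percent of controls, first with 20 people per group and then with 1,000.

```python
from math import sqrt

def difference_with_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Difference in proportions with an approximate 95% Wald interval.
    Returns (difference, lower bound, upper bound)."""
    p_t = events_treated / n_treated
    p_c = events_control / n_control
    diff = p_t - p_c
    se = sqrt(p_t * (1 - p_t) / n_treated + p_c * (1 - p_c) / n_control)
    return diff, diff - z * se, diff + z * se

small = difference_with_ci(13, 20, 7, 20)
large = difference_with_ci(650, 1_000, 350, 1_000)
for label, (diff, low, high) in (("20 per group", small), ("1,000 per group", large)):
    print(f"{label}: difference {diff:.2f}, 95% CI {low:.2f} to {high:.2f}")
```

Both studies report the same headline difference. The small one is compatible with anything from almost no improvement to an enormous one, while the large one pins the effect down to within a few percentage points.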

 

V. Sample Size and Power

Much of what determines whether a study is worth anything comes down to size. Small studies are unreliable in a specific, under-appreciated way: they are not just less certain, they are actively prone to producing dramatic, misleading results. A tiny study is at the mercy of chance and a few unusual individuals, and small studies that happen by luck to show a big effect are exactly the ones that get published and make headlines, while the equally small studies showing nothing quietly vanish. This is why so many exciting findings from small studies evaporate when someone runs a larger one.

The technical concept is statistical power: a study’s ability to detect a real effect if one exists. Underpowered studies (too small to reliably find the effect they are looking for) plague whole fields, and they cause two opposite errors at once. They miss real effects, and, more insidiously, when they do hit significance, they tend to overestimate the size of the effect, sometimes wildly. As a rough habit: a striking result from a small sample is a hypothesis, not a finding. Wait for it to be repeated at scale before rearranging your life around it.
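Both errors show up in a simulation. This is a sketch under simplifying assumptions: a real but modest effect of 0.2 standard deviations, 20 people per group, outcomes drawn from a normal distribution with a known standard deviation of 1, and a z-test at the usual 1.96 cut-off. All of these numbers are illustrative choices, not values from any actual study.

```python
import random
from math import sqrt

def simulate_small_studies(true_effect=0.2, n_per_arm=20, n_studies=5_000, seed=0):
    """Run many small two-arm studies of a real effect. Returns the share
    reaching significance (the power) and the estimates from those studies."""
    rng = random.Random(seed)
    # Standard error of the difference in means when each arm has sd 1.
    se = sqrt(2 / n_per_arm)
    significant_estimates = []
    for _ in range(n_studies):
        treated = [rng.gauss(true_effect, 1) for _ in range(n_per_arm)]
        control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        estimate = sum(treated) / n_per_arm - sum(control) / n_per_arm
        if abs(estimate) / se > 1.96:
            significant_estimates.append(estimate)
    return len(significant_estimates) / n_studies, significant_estimates

power, significant_estimates = simulate_small_studies()
smallest_significant = min(abs(e) for e in significant_estimates)
print(f"share of studies reaching significance: {power:.1%}")
print(f"smallest 'significant' estimate: {smallest_significant:.2f} (true effect: 0.20)")
```

Most of these studies miss the real effect entirely. The ones that do reach significance can only do so with an estimate of roughly 0.62 or more in size, about three times the true effect, because nothing smaller clears the bar at this sample size.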

 

VI. Relative versus Absolute Risk

Imagine a drug that cuts your risk of some disease “by 50 percent.” Impressive. But 50 percent of what? Suppose that without the drug, 2 people in 1,000 get the disease, and with the drug, 1 person in 1,000 does. That is, truthfully, a 50 percent reduction, the relative risk reduction. But in absolute terms, your risk dropped from 0.2 percent to 0.1 percent: an absolute risk reduction of one tenth of one percentage point. The same fact stated two ways: “halves your risk!” or “helps one person in a thousand.” Both are accurate. One is marketing.

This gap between relative and absolute risk is everywhere, and the pattern is consistent: benefits are advertised in relative terms (the bigger-sounding number) while risks, when mentioned, are often given in absolute terms (the smaller-sounding number), a mismatched framing that makes interventions look better than they are. Study abstracts and news reports overwhelmingly favour the relative figure, and research has shown this framing changes the decisions of patients and doctors alike. The defence is a single reflex question, to be asked every single time you hear a percentage improvement: “of what?” A relative figure without its baseline is close to meaningless, because a 50 percent reduction can be a huge deal (if the risk goes from 80 percent to 40 percent) or trivial (if it goes from 0.002 percent to 0.001 percent), and you cannot tell which from the relative number alone.

A closely related tool is the number needed to treat: how many people have to take a treatment for one person to benefit. It is simply one divided by the absolute risk reduction (expressed as a proportion), and it is concrete. A number needed to treat of 100 means 100 people take the drug for 1 to be helped, while the other 99 get no benefit and whatever side effects come along. Drugs in wide use have numbers needed to treat in the dozens or hundreds for many outcomes, which is not an argument against them, but is a very different picture from “cuts your risk in half.” When a source gives you relative risk but hides the absolute risk and the number needed to treat, that is not an accident, and it tells you something about the source.
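The three figures in this section are a few lines of arithmetic. The sketch below applies them to the two cases already used above: risk falling from 2 in 1,000 to 1 in 1,000, and risk falling from 80 percent to 40 percent. The function name is an illustrative choice.

```python
def risk_summary(control_risk, treated_risk):
    """Return (relative risk reduction, absolute risk reduction,
    number needed to treat) for risks given as proportions."""
    absolute_reduction = control_risk - treated_risk
    relative_reduction = absolute_reduction / control_risk
    number_needed_to_treat = 1 / absolute_reduction
    return relative_reduction, absolute_reduction, number_needed_to_treat

rare = risk_summary(control_risk=0.002, treated_risk=0.001)
common = risk_summary(control_risk=0.80, treated_risk=0.40)
for label, (rrr, arr, nnt) in (("rare outcome", rare), ("common outcome", common)):
    print(f"{label}: relative {rrr:.0%}, absolute {arr:.1%}, treat {nnt:.1f} to help one")
```

Both cases share the same 50 percent relative reduction. The number needed to treat is about 1,000 in the first and 2.5 in the second, which is the difference the relative figure hides.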

 

VII. Confounding

Statistics is where you learn to ask why a given correlation might not be causal, and the main answer is confounding: some third factor driving both things you are looking at.

The classic pattern: a study finds that people who do X are healthier, and the headline declares X healthy. But the people who do X might differ from those who do not in a hundred other ways: they may be wealthier, younger, more health-conscious, or more able to afford and access X in the first place, and any of those could be the real cause of the better health. The pattern where people who take up the supposed cause are already healthier, or more health-conscious in general, is so common in nutrition and lifestyle research that it has earned the name “healthy user bias.” Observational studies try to correct for confounders statistically, but they can only adjust for the ones they measured and thought of; the unknown and unmeasured ones remain. This is precisely the weakness that randomisation in an RCT overcomes by scattering all confounders, known and unknown, roughly evenly across the groups. So when an observational study reports that some food or habit is associated with better health, the right reflex is not “X is good for me” but “X is associated with better health in this group, and I wonder what kind of person does X and whether that, rather than X itself, is the real story.” A staggering amount of nutritional advice has been built on confounded observational correlations later contradicted by trials.
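A made-up table shows how a confounder can manufacture an association out of nothing. In the sketch below, X has no effect at all: within each group, people who do X and people who do not are healthy at exactly the same rate. But health-conscious people are both more likely to do X and more likely to be healthy. All counts are invented for illustration.

```python
# Each entry is (number healthy, number of people). Within each stratum the
# healthy rate is identical whether or not people do X.
strata = {
    "health-conscious":     {"does_x": (720, 800), "no_x": (180, 200)},
    "not health-conscious": {"does_x": (100, 200), "no_x": (400, 800)},
}

def healthy_rate(groups):
    """Pooled healthy rate across a list of (healthy, total) pairs."""
    healthy = sum(h for h, _ in groups)
    total = sum(t for _, t in groups)
    return healthy / total

# Crude comparison: ignore the strata, as a naive headline would.
crude_difference = (
    healthy_rate([s["does_x"] for s in strata.values()])
    - healthy_rate([s["no_x"] for s in strata.values()])
)

# Stratified comparison: compare like with like inside each group.
stratum_differences = {
    name: healthy_rate([s["does_x"]]) - healthy_rate([s["no_x"]])
    for name, s in strata.items()
}

print(f"crude difference in healthy rate: {crude_difference:.2f}")
for name, diff in stratum_differences.items():
    print(f"difference among the {name}: {diff:.2f}")
```

The crude comparison shows people who do X ahead by 24 percentage points, while the comparison inside each group shows no difference. Stratifying only works here because the confounder was measured, and real studies cannot stratify on what they never recorded.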

 

VIII. How Honest Findings Go Wrong

Two mechanisms quietly corrupt the published record without anyone necessarily lying, and knowing them explains why so much of what gets published later fails to hold up.

The first is p-hacking, sometimes called the garden of forking paths. When researchers analyse data, they face countless small choices: which outliers to exclude, which subgroups to look at, which variables to control for, when to stop collecting data. If they keep adjusting those choices, consciously or not, until the result crosses the magic p = 0.05 line, they can manufacture a “significant” finding out of pure noise, and they will often genuinely believe they did nothing wrong. Run enough analyses on enough variables, and something will cross the line by chance alone. A particular warning sign is the study that found its result in a subgroup (“the drug did not work overall, but it worked in left-handed women over 60”), which is very often the residue of slicing the data many ways until something turned up. The modern defence is pre-registration: researchers publicly commit to their hypothesis and analysis plan before collecting data, so the choices cannot be quietly tuned after the fact. The presence of pre-registration is now one of the better signs that a finding is trustworthy.
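The arithmetic behind "something will cross the line by chance alone" is short. The sketch below assumes the analyses are independent of one another. That is a simplification, since real forking-path analyses on one dataset are usually correlated, but it gives a feel for the scale.

```python
def chance_of_false_positive(n_analyses, alpha=0.05):
    """Probability that at least one of n independent analyses of pure noise
    comes out 'significant' at the given threshold."""
    return 1 - (1 - alpha) ** n_analyses

for n in (1, 5, 10, 20):
    print(f"{n:>2} analyses: {chance_of_false_positive(n):.0%} chance of a false positive")
```

With one analysis, the chance of a spurious hit is the advertised 5 percent. With twenty, it is closer to two in three.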

The second is publication bias, the file-drawer problem. Studies that find an exciting positive result get published; studies that find nothing tend to sit in a drawer, unpublished and unseen. The consequence is that the published literature is a biased sample of all the research actually done, systematically over-representing positive findings. If twenty teams test a useless treatment, on average one will get a “significant” result by chance, and that one gets published while the nineteen null results disappear, leaving a literature that “shows” the treatment works. This is why a single positive study, even in a good journal, means relatively little on its own, and why the systematic review (which tries to find the unpublished and null results too) and independent replication matter so much. A finding is only as trustworthy as its ability to show up again when someone else, ideally someone sceptical, goes looking for it.
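The file drawer can be simulated directly. This sketch makes simplifying assumptions: 2,000 hypothetical teams each test a treatment that does nothing, with 50 people per group and outcomes drawn from a normal distribution with a known standard deviation of 1, and only results past the 1.96 cut-off are "published." The numbers are invented for illustration.

```python
import random
from math import sqrt

def run_null_study(rng, n_per_arm=50):
    """One trial of a useless treatment. Returns (difference, significant)."""
    treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    difference = sum(treated) / n_per_arm - sum(control) / n_per_arm
    se = sqrt(2 / n_per_arm)  # known sd of 1 in each arm (simplifying assumption)
    return difference, abs(difference) / se > 1.96

rng = random.Random(1)
studies = [run_null_study(rng) for _ in range(2_000)]
published = [diff for diff, significant in studies if significant]

share_published = len(published) / len(studies)
mean_size_all = sum(abs(diff) for diff, _ in studies) / len(studies)
mean_size_published = sum(abs(diff) for diff in published) / len(published)

print(f"share of studies published: {share_published:.1%}")
print(f"average effect size, all studies: {mean_size_all:.2f}")
print(f"average effect size, published only: {mean_size_published:.2f}")
```

Roughly one study in twenty reaches the journals, and every one of them reports a sizeable effect for a treatment that has none. A reader who sees only the published studies has no way to tell this from a real finding, which is the case for hunting down the null results.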

 

IX. A Field Guide to Reading a Claim

Run any research claim, including every claim in this manual, through these.

  • What kind of study is it? Anecdote or mechanism or mouse study (hypothesis only), observational (can show association, struggles with causation), or RCT and systematic review (can support causation). Match the strength of the claim to the strength of the design.
  • How big was it, and is the effect size meaningful? Small studies overstate; large samples make trivial effects “significant.” Ask not just “is it real?” but “is it big enough to matter?”
  • Relative or absolute? When you hear a percentage improvement, ask “of what?” Demand the absolute risk and, ideally, the number needed to treat. This one question defuses most health hype.
  • What is the confidence interval? Is the estimate precise (narrow) or basically a guess (wide)? Does it include “no effect”?
  • What might be confounding it? Especially for observational findings: what kind of person does this, and could that, rather than the thing itself, explain the result? Watch for healthy-user bias.
  • Could this be p-hacked or cherry-picked? Is the headline result from a subgroup? Was it pre-registered? Is this one positive study, or a replicated and reviewed body of evidence?
  • Who benefits from this framing, and what is not being shown? Follow the incentive. Missing absolute numbers, missing confidence intervals, missing null results, and missing conflicts of interest are all tells.
  • Has it replicated? A single study is a candidate for truth, not truth. The finding that survives other people trying and failing to knock it down is the one to trust.

None of this requires advanced mathematics. It requires the habit of asking these questions before believing, and the willingness to sit with “the evidence here is weaker than the headline suggests,” which is the honest verdict on a great deal of what gets published. This is exactly the discipline the manual has tried to apply to itself throughout, marking what is well supported, what is contested, and what is overblown, and these are the tools that discipline is built from. Turn them on this manual as readily as on anything else; that is what they are for.

 

X. Cross-Links

Resources

  • Wasserstein, R.L., & Lazar, N.A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133.
  • Greenland, S., et al. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350.
  • Gigerenzer, G. (2018). Statistical rituals: The replication delusion and how we got there. Advances in Methods and Practices in Psychological Science, 1(2), 198–218.
  • Schünemann, H., et al. (2013). GRADE handbook for grading quality of evidence and strength of recommendations. The GRADE Working Group.
  • Ioannidis, J.P.A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
  • Goldacre, B. (2008). Bad science. Fourth Estate.
  • Reinhart, A. (2015). Statistics done wrong: The woefully complete guide. No Starch Press.