Scientific Methodology Cheat Sheet

I. The Reasoning Core

II. What Kind of Study Is It?

III. The Bias Catalogue

IV. The Numbers: What They Actually Mean

V. The Risk and Effect-Measure Decoder

VI. Base Rates and Diagnostic Tests

VII. How Good Findings Get Manufactured or Spun

VIII. The Mind’s Own Errors

IX. The Systems-Thinking Toolkit

X. The Pseudoscience and Quackery Detector

XI. Running Your Own Experiments (n-of-1)

XII. Reading a Paper, in Order

XIII. Quick Rules of Thumb

XIV. The Whole Thing in One Line

XV. Cross-Links

Run any claim, including every claim in this manual, through the relevant parts below.

I. The Reasoning Core

You are the easiest person to fool: Every other tool exists to stop you doing it. Begin from the assumption that you are biased and build the checks in.
Seek disproof, not confirmation: Ask “what would prove this wrong, and have I genuinely looked?” not “what supports it?” A belief you have never tried to break is attachment, not knowledge.
Falsifiable, or not even wrong: If no possible observation could ever count against a claim, it is empty, not strong. “Works, but only when untested” is the signature of something false.
Extraordinary claims need extraordinary evidence: The weight of evidence required scales with how much the claim contradicts what is already well established. Prior plausibility counts.
Isolate the variable: Change one thing, hold the rest constant, compare against a group that did not get the change. No comparison, no finding.
Correlation is not causation: Co-movement can mean A→B, B→A, C→both, or coincidence. Default to “I don’t yet know which,” then look for what would tell you.
Risky predictions beat safe ones: A claim that predicts something specific and surprising and survives has earned something. One that explains any outcome after the fact has earned nothing.
Reductionism and systems thinking are two lenses (see Part IX): Isolate parts when a question can be cleanly isolated; watch the whole when the behaviour lives in the interactions. Using the wrong lens is a main source of bad thinking.
Hold conclusions provisionally: Every finding is the current best account, not a final truth. Strong enough to act on, loose enough to update.
The process beats the person: Trust replication and the self-correcting machinery over any single study, scientist, institution, or consensus. See The History of Science.

II. What Kind of Study Is It? (Design Taxonomy)

Design sets the ceiling on what a study can possibly show. Identify it first.

Strength of evidence, roughly low to high:

Anecdote/case report: One story. Generates hypotheses; proves nothing. No comparison group.
Mechanistic/“it makes sense” reasoning: A plausible biological pathway. Necessary but not sufficient; most mechanisms that should work don’t pan out in people.
In-vitro (test tube) and animal studies: Essential for working out mechanism; a long way from a human result. “Cures cancer in a dish” is not a health claim.
Cross-sectional study: A snapshot of a population at one moment. Shows associations, cannot establish time order (did the cause precede the effect?).
Case-control study: Starts with people who have the outcome and compares them with those who don’t, looking backwards at exposures. Efficient for rare diseases; prone to recall bias; yields an odds ratio, not relative risk.
Cohort study: Follows groups over time, comparing exposed with unexposed. Can establish time order; still observational, so confounding looms. The backbone of nutrition and lifestyle evidence, and its weakness.
Randomised controlled trial (RCT): Randomly assigns who gets the intervention. Randomisation balances confounders both known and unknown, which is why an RCT can support a causal claim an observational study usually can’t. Watch for: blinding, allocation concealment, intention-to-treat analysis, dropout.
Systematic review/meta-analysis: Gathers and appraises all the studies, and (in a meta-analysis) pools them statistically. Strongest, when done well, but only as good as the studies inside it. Pool rubbish, get laundered rubbish.

Two cautions: the hierarchy is not absolute (a large, clean cohort beats a small, sloppy RCT), and the right design depends on the question (you cannot run a placebo trial on whether smoking causes cancer). Use GRADE thinking: start from the design, then rate the actual certainty as high, moderate, low, or very low by adjusting for study quality, consistency across studies, directness, and precision.

III. The Bias Catalogue

Where findings go wrong without anyone lying.

Selection bias (who got in, who stayed):

Sampling/selection: The people studied aren’t representative of the people the claim is about.
Healthy-user bias: Those who take up a treatment or habit are healthier and more conscientious to begin with; the habit gets credit that belongs to the kind of person who adopts it.
Attrition bias: People drop out unevenly between groups, and dropouts differ from those who stay.
Survivorship bias: You only see the survivors (the planes that came back, the funds still open, the patients still enrolled), so you misjudge the whole.
Immortal time bias: A subtle, common killer in drug studies. A stretch of follow-up during which the outcome could not have happened by design (for example, the time before a patient’s first prescription) gets credited to the treated group, manufacturing an illusion of benefit.

Information bias (how things were measured):

Recall bias: People with the outcome remember exposures differently (usually more) than those without. Plagues case-control and any survey of the past.
Detection/observation bias: One group is watched, tested, or diagnosed more closely, so more is found in it.
Misclassification: Exposure or outcome is recorded wrongly, blurring or inflating effects.
Performance bias: Groups are treated differently (beyond the intervention) because people know who is in which group. This is what blinding prevents.

Confounding (a lurking third cause):

A factor linked to both the exposure and the outcome creates a spurious association. Age, wealth, baseline health, and lifestyle are the usual suspects. Observational studies can only adjust for confounders they measured and thought of; randomisation handles the unmeasured ones too.

Screening-specific traps:

Lead-time bias: Earlier diagnosis makes survival look longer even if death comes at the same time.
Length-time bias: Screening preferentially catches slow, indolent cases, making the screened group look better.
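
Lead-time bias is easy to see with plain arithmetic. The sketch below uses hypothetical ages (the numbers 63, 67, and 70 are assumptions chosen for illustration, not data): the patient dies at the same age in both scenarios, yet the "five-year survival" label flips.

```python
def years_survived_after_diagnosis(age_at_diagnosis: int, age_at_death: int) -> int:
    """Survival time as usually reported: counted from diagnosis, not from birth."""
    return age_at_death - age_at_diagnosis


def survives_five_years(age_at_diagnosis: int, age_at_death: int) -> bool:
    """True if the patient counts as a 'five-year survivor'."""
    return years_survived_after_diagnosis(age_at_diagnosis, age_at_death) >= 5


AGE_AT_DEATH = 70             # hypothetical, identical in both scenarios
DIAGNOSED_BY_SYMPTOMS = 67    # hypothetical: found when symptoms appear
DIAGNOSED_BY_SCREENING = 63   # hypothetical: found earlier by screening

for label, age in (("symptoms", DIAGNOSED_BY_SYMPTOMS),
                   ("screening", DIAGNOSED_BY_SCREENING)):
    years = years_survived_after_diagnosis(age, AGE_AT_DEATH)
    print(f"diagnosed via {label}: {years} years after diagnosis, "
          f"five-year survivor = {survives_five_years(age, AGE_AT_DEATH)}")
```

Nobody lived a day longer in the screened scenario; only the starting point of the clock moved. This is why mortality (deaths per population) is a safer yardstick for screening than survival after diagnosis.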

System-level:

Publication bias/the file drawer: Positive results get published; null results vanish. The literature over-represents “it works.”
Funding/sponsorship bias: Industry-funded studies disproportionately favour the funder’s product. See The History of Science.

IV. The Numbers: What They Actually Mean

The core statistics, decoded. Full version with worked examples on Understanding Statistics.

p-value: The probability of a result at least this extreme if there were genuinely no effect. Not the probability the result is a fluke, not the probability the claim is false. “p < 0.05” = “passed one weak, gameable filter,” not “true.”
Statistical significance ≠ importance: Significant means “unlikely to arise from chance alone if there were no effect,” not “big enough to matter.” Big samples make trivial effects significant. Always ask the next question: how big?
Effect size: The number that matters: how big is the difference, and is it big enough to care about?
Confidence interval (CI): The range compatible with the data. Narrow = precise; wide = the study barely knows. If it crosses “no effect” (zero for differences, 1 for ratios), the result is shaky.
Statistical power: A study’s ability to detect a real effect. Underpowered (too-small) studies both miss real effects and, when they do hit significance, overstate the size. A dramatic result from a small sample is a hypothesis to test, not a finding.
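
A rough sketch of how sample size alone can turn a trivial effect "significant." It assumes a simple two-group z-test with a known standard deviation of 1 in both groups; the function names and the effect size of 0.02 are illustrative choices, not taken from any particular study.

```python
import math


def two_sided_p(d: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means of size d between two
    equal-sized groups, assuming a known standard deviation of 1 (z-test)."""
    z = d * math.sqrt(n_per_group / 2)       # d divided by its standard error
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def ci95(d: float, n_per_group: int) -> tuple[float, float]:
    """Approximate 95% confidence interval for the difference in means."""
    half_width = 1.96 * math.sqrt(2 / n_per_group)
    return (d - half_width, d + half_width)


TRIVIAL_D = 0.02  # assumed effect: one fiftieth of a standard deviation

for n in (100, 10_000, 100_000):
    p = two_sided_p(TRIVIAL_D, n)
    low, high = ci95(TRIVIAL_D, n)
    print(f"n per group = {n:>7}: p = {p:.3g}, 95% CI = ({low:.3f}, {high:.3f})")
```

The effect is the same tiny size in every row. Only the sample grows, and with it the p-value shrinks and the interval narrows until it excludes zero. Significance answers "is it distinguishable from nothing?"; the effect size answers "does it matter?"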

V. The Risk and Effect-Measure Decoder

The single richest source of deception, especially in health. Learn to convert everything into absolute terms.

Absolute risk: The actual chance of the outcome (e.g. 2 in 1,000). The number that tells you what to expect.
Relative risk (RR)/risk ratio: The ratio of risk between two groups. From cohort studies and RCTs. “Twice as likely” tells you nothing about the underlying size.
Relative risk reduction (RRR): The proportional drop. The big-sounding number used to sell benefits.
Absolute risk reduction (ARR): The actual percentage-point drop. The honest number, usually much smaller.
The reflex: “of what?” A “50% reduction” can be a 40-point drop (risk 80%→40%, huge) or a 0.1-point drop (0.2%→0.1%, trivial). You cannot tell from the relative figure alone. Always demand the absolute.
The mismatch tell: Benefits quoted in relative terms while harms are quoted in absolute terms is a deliberate framing trick. Watch for it.
Number needed to treat (NNT): How many must be treated for one to benefit. NNT = 1 ÷ ARR, with ARR written as a proportion (a 1-point drop gives 1 ÷ 0.01 = 100). An NNT of 100 means 1 helped, 99 not, all exposed to side effects. Concrete and hard to spin.
Number needed to harm (NNH): Same maths for adverse effects. Compare NNT against NNH to weigh a treatment honestly.
Odds ratio (OR): Ratio of odds, not risk. The standard measure from case-control studies, which select people by outcome and so cannot measure risk directly. Approximates relative risk only when the outcome is rare; for common outcomes it sits further from 1 than the relative risk and so exaggerates. Treat a scary OR for a common event with suspicion.
Hazard ratio (HR): The relative risk at any given instant, from time-to-event (survival) studies. HR 0.5 = at any moment, half the event rate of the control group.
Cohen’s d: A standardised effect size for differences in means. Rough reading: ~0.2 small, ~0.5 medium, ~0.8 large. Lets you compare effects measured on different scales.
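
The conversions above can be sketched in a few lines. This is a minimal illustration using the two "50% reduction" cases from the text (80%→40% and 0.2%→0.1%); the function and key names are assumptions made for the example.

```python
def odds(p: float) -> float:
    """Convert a risk (probability) into odds."""
    return p / (1 - p)


def risk_measures(control_risk: float, treated_risk: float) -> dict[str, float]:
    """Effect measures for one outcome, with risks given as proportions (0 to 1).

    A negative ARR/NNT means the treated group did worse, i.e. the figure
    is a number needed to harm rather than to treat.
    """
    arr = control_risk - treated_risk
    if arr == 0:
        raise ValueError("risks are equal: ARR is zero, so NNT is undefined")
    return {
        "ARR": arr,                                   # absolute risk reduction
        "RRR": arr / control_risk,                    # relative risk reduction
        "RR": treated_risk / control_risk,            # relative risk
        "NNT": 1 / arr,                               # number needed to treat
        "OR": odds(treated_risk) / odds(control_risk),  # odds ratio
    }


for control, treated in ((0.80, 0.40), (0.002, 0.001)):
    m = risk_measures(control, treated)
    print(f"{control:.1%} -> {treated:.1%}: "
          f"RRR = {m['RRR']:.0%}, ARR = {m['ARR'] * 100:.1f} points, "
          f"NNT = {m['NNT']:.1f}, RR = {m['RR']:.2f}, OR = {m['OR']:.2f}")
```

Both cases share the same relative figure (RRR of 50%, RR of 0.5), but the NNT is 2.5 in one and 1,000 in the other. The odds ratio also shows its behaviour: close to the relative risk when the outcome is rare, much further from 1 when the outcome is common.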

VI. Base Rates and Diagnostic Tests

Where intuition fails hardest, and where a lot of medical and screening anxiety is manufactured.

The base rate is everything: How common is the thing to begin with? Most errors in judging probability come from ignoring it. A symptom “strongly associated” with a rare disease still usually means you don’t have it.
Sensitivity: Of those who truly have the condition, the fraction the test catches. A property of the test.
Specificity: Of those who truly don’t, the fraction the test correctly clears. Also a property of the test.
Positive predictive value (PPV): Of those who test positive, the fraction who truly have it. This is what you actually care about, and it depends on prevalence, not just the test.
The trap: Apply even a very good test to a rare condition and most positives are false positives, because the few true cases are swamped by the larger pool of healthy people generating occasional false alarms. This is why mass screening for rare diseases produces floods of frightening, wrong results. Sensitivity and specificity stay fixed; PPV collapses when the base rate is low.
The fix: Always ask: how common is this in someone like me (the prior), before I read the test result? Then update from there.
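
The trap can be checked directly with Bayes’ rule. The sketch below assumes a hypothetical test with 99% sensitivity and 99% specificity; those figures and the prevalences are illustrative, not taken from any real test.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value: P(condition | positive test), via Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


SENSITIVITY = 0.99  # hypothetical test: catches 99% of true cases
SPECIFICITY = 0.99  # hypothetical test: clears 99% of healthy people

for prevalence in (0.001, 0.01, 0.10, 0.50):
    value = ppv(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:>6.1%}: a positive result is right {value:.1%} of the time")
```

With the condition present in 1 person per 1,000, roughly nine out of ten positives from this very good test are false alarms. The test did not change between rows; only the base rate did.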

VII. How Good Findings Get Manufactured or Spun

The honest-researcher and dishonest-press failure modes.

p-hacking/the garden of forking paths: Slicing data many ways (which outliers, which subgroups, when to stop collecting) until something crosses p < 0.05. Often unconscious. Defended against by pre-registration (committing to the analysis before seeing the data).
The subgroup warning: “It didn’t work overall, but it worked in left-handed women over 60” is usually the residue of slicing, not a real effect.
HARKing: Hypothesising After the Results are Known, that is, dressing up a fishing-expedition finding as if it were the original prediction.
Multiple comparisons: Test enough outcomes and some hit “significance” by chance. Twenty independent tests at a 0.05 threshold yield about one false positive on average, even when no real effect exists anywhere.
Overfitting: A model tuned so tightly to one dataset that it captures its noise and fails on new data.
Spin: The conclusion claims more than the results show. Check that the stated takeaway matches what was actually measured (a surrogate marker is not a real outcome; “may,” “linked to,” and “in mice” are doing heavy lifting).
Meta-analysis red flags: High heterogeneity (the pooled studies disagree wildly, so the average is meaningless), an asymmetric funnel plot (a sign of missing null studies, i.e. publication bias), and “garbage in, garbage out” (pooling weak studies doesn’t create strong evidence).
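
The multiple-comparisons arithmetic is short enough to verify by hand. The sketch below assumes every tested hypothesis is truly null and that the tests are independent, which real analyses rarely satisfy exactly; the function names are illustrative.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of 'significant' results when no real effect exists."""
    return n_tests * alpha


def chance_of_at_least_one(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent null tests crosses alpha."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 20, 100):
    print(f"{n:>3} tests: expect {expected_false_positives(n):.2f} false positives, "
          f"chance of at least one = {chance_of_at_least_one(n):.0%}")
```

At twenty tests the chance of at least one spurious hit is already about 64%, which is why a lone significant subgroup or secondary outcome deserves suspicion unless it was specified in advance.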

VIII. The Mind’s Own Errors

The biases and fallacies you bring to every claim before any statistics enter.

Cognitive biases:

Confirmation bias: Seeking and weighting evidence that fits what you already believe.
Motivated reasoning: Reasoning harder to reach the conclusion you want. The beliefs that flatter you, your tribe, or your purchase deserve the most scrutiny.
Anchoring: Over-relying on the first number or idea encountered.
Availability: Judging likelihood by how easily examples come to mind (vivid ≠ common).
Dunning-Kruger: The least skilled are often the most confident, because the skill needed to do the thing is the same skill needed to see you can’t.
Survivorship (cognitive version): Studying only the winners and inferring how to win.
Sunk cost: Continuing because of what you’ve already spent, not what’s still worth it.

Logical fallacies that wreck claims:

Appeal to authority: “An expert said so” is not evidence; the evidence is. (Nor is the reverse, “they’re hiding it,” evidence.)
The Galileo gambit: “They mocked me, they mocked Galileo, so I’m right.” Being doubted is not being correct; most of the doubted are doubted because they’re wrong. See The History of Science.
Ad hominem/genetic fallacy: Judging a claim by its source rather than its content. (Funding is a reason to scrutinise, not to dismiss outright.)
Appeal to nature: “Natural” therefore good or safe. Arsenic is natural.
Appeal to antiquity/novelty: “Ancient wisdom” or “cutting-edge” as a substitute for evidence.
False dichotomy: Two options presented when more exist.
Straw man: Refuting a weakened version of the opposing claim.
Gish gallop: Burying a debate under so many weak claims that none can be answered in time.
Moving the goalposts: Raising the bar each time the evidence is met.
Begging the question/circularity: The conclusion smuggled into the premise.

IX. The Systems-Thinking Toolkit

The complement to reductionism, for when the thing you care about lives in the interactions (bodies, ecosystems, economies, minds). See Emergence & Complexity.

Map the whole, not just the part: Ask what the system is, where its boundary is, and what’s inside versus outside.
Find the feedback loops: Balancing loops resist change and stabilise (a thermostat, hunger, homeostasis). Reinforcing loops amplify (compound interest, addiction, viral spread). Most system behaviour is loops, not lines.
Expect delays: Cause and effect are often separated in time, which fools the eye and invites the wrong fix (and overcorrection).
Watch for non-linearity: Doses, thresholds, and tipping points mean small changes can have large effects and vice versa. Straight-line intuition misleads.
Emergence: The whole has properties no part has. Studying the part in isolation can destroy the very thing you wanted to understand.
Downward causation: The system shapes its parts, not only the reverse. A clean bottom-up, single-variable story often misses half the causation. (See Noble, in Emergence & Complexity.)
Beware the single magic variable: Reducing a feedback-rich condition to one cause because that’s what’s cleanly testable is a classic error. So is the opposite: vague “it’s all connected, balance your energy” hand-waving to dodge a question a clean test could settle. Match the lens to the problem.
Look for unintended consequences: Intervene in a coupled system, and effects ripple, compensate, and rebound. Ask “and then what?”

X. The Pseudoscience and Quackery Detector

Hallmarks that something is dressed as science but isn’t. Any one is a yellow flag; several together, a red one.

Unfalsifiable: No possible result would change the claim. Failures get explained away rather than counted.
Reverses the burden of proof: “Prove it doesn’t work.” The claimant owes the evidence.
Cherry-picked and anecdote-driven: Testimonials and selected hits; the misses dropped.
Stuck and isolated: Doesn’t build, doesn’t connect to the wider body of knowledge, hasn’t moved in decades, avoids independent testing.
Conspiracy as shield: Rejection by experts is reframed as proof of a cover-up (“they don’t want you to know”).
Persecution as proof: Leans on the Galileo gambit.
Jargon as decoration: Borrows the words of real science (quantum, frequency, energy, detox, immune-boosting) without their content.
Sells something: The claim and the product are the same entity. Follow the money.
All upside, no limits: Cures many unrelated conditions, has no side effects, works for everyone. Real interventions have boundaries and trade-offs.
Pathological science: Real, well-meaning scientists fooling themselves: an effect always at the edge of detectability that never gets cleaner as methods improve. Catches even Nobel laureates.

XI. Running Your Own Experiments (n-of-1)

For personal decisions you are the only subject who matters, which makes self-experiment useful and dangerous in equal measure.

Protocol:

One variable at a time: Three changes in a week teach you nothing about which did what.
Baseline first: Track the thing for a week or two before changing anything.
Pre-commit to duration: Decide the trial length up front so you can’t stop on a good day and call it a win.
Measure objectively and daily: Recorded numbers beat end-of-month memory, which rewrites itself to match what you expected.
On-off-on if you can: Add it, remove it, re-add it. A real effect tends to track the intervention; coincidence usually won’t.
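
One way to read an on-off-on trial, sketched under loud assumptions: the readings below are made-up hours of sleep, and the 0.3-hour threshold is an arbitrary choice you would fix before starting. The check asks whether both "on" phases sit on the same side of the "off" phase by at least that margin.

```python
from statistics import mean


def tracks_intervention(on_first: list[float], off: list[float],
                        on_second: list[float], min_shift: float) -> bool:
    """True if both 'on' phases differ from the 'off' phase in the same
    direction by at least min_shift (a threshold chosen in advance)."""
    shift_in = mean(on_first) - mean(off)
    shift_back = mean(on_second) - mean(off)
    same_direction = shift_in * shift_back > 0
    big_enough = min(abs(shift_in), abs(shift_back)) >= min_shift
    return same_direction and big_enough


# Hypothetical daily readings (hours of sleep), for illustration only.
follows = tracks_intervention(
    on_first=[7.4, 7.6, 7.5, 7.7],
    off=[6.9, 7.0, 6.8, 7.1],
    on_second=[7.5, 7.4, 7.6, 7.5],
    min_shift=0.3,
)
drifts = tracks_intervention(
    on_first=[6.5, 6.6, 6.4, 6.5],   # a steady upward trend across all phases
    off=[7.0, 7.1, 6.9, 7.0],
    on_second=[7.5, 7.6, 7.4, 7.5],
    min_shift=0.3,
)
print("outcome follows the intervention:", follows)
print("steady drift mistaken for an effect:", drifts)
```

A steady drift fails the check because the outcome keeps moving regardless of what you switch on or off. Passing it still does not rule out placebo or a confounder that happened to change on the same dates.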

Traps that will fool you:

Placebo: Expecting improvement causes improvement. Real, but it means a positive result doesn’t prove the thing worked.
Regression to the mean: You start when things are worst; worst tends to improve on its own, and you credit whatever you were doing.
Confounding: Season, workload, sleep, mood all changed too. You can’t randomise yourself.
Recall bias: Memory bends to the story. Write it down in the moment.

What it can and can’t say: suggests what works for you, now; cannot establish a general truth or a mechanism. A pile of “it worked for me” stories is still not data. Treat a positive as “worth continuing,” not “discovered.”

XII. Reading a Paper, in Order

The sequence that catches most of what’s wrong before you believe the abstract.

What kind of study? (Part II.) Match the strength of the claim to the strength of the design.
Who funded it, and who profits? Check before the findings. Then read the declared conflicts of interest.
Methods before results: The abstract is the sales pitch; the methods are the product. What did they do, to whom, measuring what, for how long?
Who was studied? Does the sample resemble the people the claim is about, or you? (Mice, 22-year-old men, the very sick?)
How big, and how long? Small or short → treat drama with suspicion.
Effect size and confidence interval? Meaningful, and precise, or a wide guess?
Relative or absolute? Recompute into absolute terms and NNT. Ask “of what?”
What could confound or bias it? Run Part III.
Pre-registered? Replicated? A single, unreplicated, unregistered positive is a candidate for truth, not truth.
Does the data support the stated conclusion? Check the takeaway against what was actually measured. Watch for surrogate outcomes and spin.
Who disagrees, and why? Find the strongest criticism before accepting the strongest claim.

XIII. Quick Rules of Thumb

The compression of the compression. When you have ten seconds.

“Of what?” (always, for any percentage).
“Compared to what?” (no comparison, no claim).
“How would I know if I were wrong?” (about your own belief).
“What would change my mind?” (if the answer is “nothing,” you hold it on faith, not evidence).
“Who paid, and who profits?”
“Has anyone else reproduced it?”
One study proves little. One anecdote proves nothing.
Significant ≠ large. Large ≠ important. Important ≠ true for me.
Absence of evidence isn’t evidence of absence (but it isn’t evidence of presence either).
The more you want it to be true, the harder you check.
Trust the process, not the person, the institution, or the consensus as such.

XIV. The Whole Thing in One Line

Assume you are biased; seek what would prove you wrong; ask “how big,” “of what,” “compared to what,” and “who paid”; identify the design and its biases; convert everything to absolute terms; trust the replicated process over any single study or person; match reductive and systems lenses to the problem; and hold every conclusion firmly enough to act and loosely enough to drop. That is the method. It works on health headlines, supplement labels, your own convictions, and this manual alike. Turn it on all of them.

XV. Cross-Links

Science for the section overview
The Scientific Method for the full logic
Understanding Statistics for the numbers with worked examples
The History of Science for why the process beats the person
The Science Rabbit Hole for the deeper questions
Resources for the reading list

The connections across the manual:

Mental Models for the wider thinking toolkit
Emergence & Complexity for the systems-thinking lens
Discovery for the curiosity that drives it all