Häggström hävdar: The empirical sciences' desperate need for statistical competence

Monday, 2 April 2012

The empirical sciences' desperate need for statistical competence

In recent years I have become increasingly convinced that incompetent use of statistical methods is a major obstacle to good research across a broad spectrum of sciences, perhaps even the majority of the empirical sciences. This is why the subject of mathematical statistics needs to advance its position in the university world, and why I have on two occasions during the past year, at KVVS and at the Department of Political Science at the University of Gothenburg, given talks with the same title as this blog post. I have now also written a paper on these matters, titled Why the empirical sciences need statistics so desperately and intended for publication in the English-language scientific literature. As an appetizer, I serve up the paper's introductory section below.

* * *

What is science? Despite what some adherents of Popperian falsificationism [25] may claim, it seems unlikely that we can find a single short definition of science that captures all important aspects. See, e.g., Haack [13] for a sensible discussion on some of its many facets. The complexity notwithstanding, I hope most of us can agree on the somewhat vague statement that science consists of systematic attempts by us humans to extract reliable information about the world around us.

Since science is carried out by humans, it is in practice dependent on our cognitive capacities. Evolution has equipped us with impressive abilities to observe and draw conclusions about the world around us, necessary for finding food and sexual partners and to avoid predators. On the other hand, since Darwinian evolution by natural selection is not a perfect optimization algorithm, it should not come as a huge surprise that we have some striking cognitive biases. Some of these form serious obstacles to the scientific endeavor. In particular the following two spring to mind.

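(a) We tend to see patterns in what is actually just noise; see Hinson and Staddon [16] and Wolford et al. [28].

(b) We tend to be overconfident in our conclusions; see Alpert and Raiffa [2] and Yudkowsky [29].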

Because of these cognitive biases and several others, we need, in order to perform good science, to set up various safeguards against our spontaneous tendency towards faulty and overconfident conclusions. Randomized, double-blind and placebo-controlled clinical trials are a typical example of a formalized protocol for precisely this purpose. The theory of statistical inference offers plenty of others, including a variety of important techniques for telling pattern from noise and for quantifying the amount of confidence in a given conclusion that a given data set warrants – that is, for circumventing biases (a) and (b) above.

Statistical techniques are indispensable for doing high-quality and trustworthy science. Fortunately, the use of such techniques is widespread, to the point of permeating the empirical sciences. Unfortunately, they are often used in erroneous ways and in situations where they simply do not apply, leading to unwarranted conclusions.

In Section 2, I will try to argue the seriousness of the situation by pointing out some indications – some of them quite shocking – about how widespread this misuse is. In Sections 3 and 4 I offer a couple of concrete examples of erroneous application and interpretation of statistical arguments. In an unabashed attempt to catch the readers’ attention, I take them from two of the most hotly debated (in public discourse) research areas: climate science and gender studies. Then, in Section 5, I will exemplify how the lack of statistical expertise in many empirical sciences has given room to a population of self-proclaimed and mostly self-taught statistical “experts” giving erroneous advice to their colleagues. Finally, in Section 6, I will offer a few thoughts on how it might be possible to improve the situation in the future.

Continue reading here!

6 comments:

Emil Karlsson, 2 April 2012 at 14:48
A much-needed article in many respects, but I would like to raise the following complications.

1. You choose medicine as the example in your discussion of the fallacy of interpreting the p value as the probability that the null hypothesis is true given the observed data. Perhaps this was meant as one example among many, but is medicine really a typical NHST discipline? Isn't it more common there to report effect sizes and confidence intervals?

2. You choose not to go as far as some critics of NHST, reasoning instead that

"While they are absolutely right that single-minded focus on statistical significance is bad practice, throwing out the use of significance tests would be a mistake, because it is a crucial tool for concluding with confidence that what we see really is a pattern, as opposed to just noise (cf. item (a) in Section 1). To be able to conclude that we have reasonable evidence in favor of an important deviation from the null hypothesis, we need both statistical and subject-matter significance."

Can you think of anything one can get out of a p value that one cannot get out of, e.g., a confidence interval? I have a hard time finding important examples. Can you think of anything one gets out of a confidence interval that one cannot (directly) get out of a p value? Here we can surely find plenty of examples.

One can carry out a significance test using a confidence interval, but that is a bit like using a gold bar as a paperweight. Sure, it works, but it seems wasteful. Moreover, it does not matter much whether the confidence interval just barely covers, e.g., the unknown but fixed population parameter or just barely misses it, because the plausibility ("relative likelihood") varies within a confidence interval and the difference between the two cases is very small.

At the end of the day we will need to decide whether our result should be regarded as an important deviation from the null hypothesis, but shouldn't such a conclusion be based on as much evidence as possible?

If NHST often gives misleading guidance, is plagued by misunderstandings, leads to black-and-white thinking and publication bias, etc., while things like effect sizes and confidence intervals present almost all the relevant available evidence and avoid most of the shortcomings of NHST, is there really any point in clinging to NHST?

3. Only very low p values, e.g. *** results, can reasonably be said to give useful information. Larger values of p (even if they are, say, < 0.05) give almost no relevant information and are a poor basis for statistical inference. This is because the variability of the p value under replication is relatively large. Cumming (2008) states that "In one simulation of 25 repetitions of a typical experiment, p varied from <.001 to .76, thus illustrating that p is a very unreliable measure. This article shows that, if an initial experiment results in two-tailed p = .05, there is an 80% chance the one-tailed p value from a replication will fall in the interval (.00008, .44), a 10% chance that p <.00008, and fully a 10% chance that p >.44. Remarkably, the interval, termed a p interval, is this wide however large the sample size. p is so unreliable and gives such dramatically vague information that it is a poor basis for inference".

Cumming, G. (2008). Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better. Perspectives on Psychological Science, 3(4), 286-300. doi: 10.1111/j.1745-6924.2008.00079.x
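
For readers who want to check the quoted interval themselves, here is a minimal Python sketch. It is not Cumming's own code. It assumes a simple z test and assumes that, given the initial z score, the replication z score is normally distributed around it with standard deviation sqrt(2), because both experiments scatter with unit variance around the same unknown true effect. The function name `p_interval` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_interval(p_two_tailed, coverage=0.80):
    """Return (low, high) bounds for the one-tailed p value of a replication."""
    # z score of the initial experiment implied by its two-tailed p value.
    z_obs = STD_NORMAL.inv_cdf(1 - p_two_tailed / 2)
    # Assumption: the replication z score is N(z_obs, sqrt(2)).
    z_rep = NormalDist(mu=z_obs, sigma=sqrt(2))
    tail = (1 - coverage) / 2
    z_low = z_rep.inv_cdf(tail)
    z_high = z_rep.inv_cdf(1 - tail)
    # A large z gives a small p, so the bounds swap places.
    return 1 - STD_NORMAL.cdf(z_high), 1 - STD_NORMAL.cdf(z_low)


low, high = p_interval(0.05)
print(f"80% p interval: ({low:.5f}, {high:.2f})")
```

Under these assumptions the sample size never enters the calculation, which is one way to see why the interval stays equally wide however large the sample is.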
Olle Häggström, 2 April 2012 at 17:34
Thanks, Emil, for your comment!

1. I regard medicine as a "typical NHST discipline". This does not preclude other very common statistical concepts and methods in medical statistics, and of course practice may vary between subdisciplines.

2. You are right that much would be gained if practitioners of statistics could in many cases be persuaded to move from NHST to confidence intervals, or to confidence sets, as they become in the more general setting. But it is no universal solution. The confidence set is admittedly strictly more informative than the p value, but it cannot always be computed. In simple textbook examples, such as sample-based estimation of the mean of a one-dimensional normal distribution, computing it is in principle as easy as computing the p value, but in more complicated situations it can be next to infeasible, since it requires computing (or at least estimating) the distribution of the test statistic not only at the parameter point corresponding to the null hypothesis but at all points. Moreover, it does happen that one tests a null hypothesis in a setting where it has not even been embedded in a parameter space, and then the notion of a confidence set is not applicable. Finally, it may be the case (especially in multidimensional parameter spaces) that even if a confidence set is in principle well defined, it is so messy that it cannot be presented in any easily understandable way.

3. I largely agree with you (although you express yourself a bit too sweepingly in the first two sentences). That p<0.05 in itself would be a strong indication that the null hypothesis fails is a (sadly very widespread) misunderstanding. Events that have 5% probability happen all the time. A discussion a couple of years ago that I briefly dived into illustrates how strongly many people overestimate the power of p<0.05. My concluding comment in that discussion, which unfortunately seemed to make little impression on the others, read as follows:

"Jane, you are absolutely right that the appropriate choice of significance level is context-dependent. However, it is very rarely the case that translating p<.05 into 'beyond reasonable doubt' is appropriate. As a general translation, something like 'data suggest something might be going on here, worth investigating further' would be better. At what point I’d be prepared to use language like 'beyond reasonable doubt' again depends on circumstances (how much is at stake, what do we have prior reasons to expect, etc), but typically perhaps around p<0.0001."
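
To put rough numbers on how often events with 5% probability happen, here is a small Python sketch. It assumes a set of independent tests in which every null hypothesis is true, which is an idealization chosen for this illustration, and the function name is hypothetical. It computes the chance that at least one test comes out "significant" by chance alone, at the conventional 0.05 level and at the stricter 0.0001 level mentioned in the quote.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent true-null tests gives p < alpha."""
    # Each test stays above alpha with probability (1 - alpha); all of them
    # do so with probability (1 - alpha) ** num_tests.
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 5, 20, 100):
    loose = prob_at_least_one_false_positive(num_tests, alpha=0.05)
    strict = prob_at_least_one_false_positive(num_tests, alpha=0.0001)
    print(f"{num_tests:>3} tests: alpha=0.05 -> {loose:.3f}, alpha=0.0001 -> {strict:.4f}")
```

With twenty such tests, a spurious p<0.05 somewhere is more likely than not, whereas at the 0.0001 level it remains rare.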
Dan Simpson, 2 April 2012 at 18:42
As a statistician who is currently "for my sins" working in an Ecology department, I really enjoyed this paper!

My only problem with it is the last sentence of Section 5, where Bayesian statistics is somewhat thrown under a bus! It has always puzzled me why Bayesian statistics always seems to have an "Achilles heel" (prior specification), while frequentist inference has "underlying assumptions that are needed to justify the procedure". (Can you tell this is a pet hate?)

On the upside, you also decimated one of my other pet hates, namely the use of idiotic prior distributions to induce artificial Bayesian/Frequentist "paradoxes".

This reminds me slightly of a conversation I had with a (mathematical) statistician last week who seemed surprised when I suggested that presence/absence of priors was probably not the most important difference between Bayesian and frequentist statistics (in that, for simple cases, there are straightforward weakly informative priors), but rather the interpretation of results. I've always (always!) felt that the choice of inferential framework should be, at least partially, driven by the question that you're trying to answer and the data that you have.
Olle Häggström, 2 April 2012 at 18:54
Thanks, Dan!

Sharing the pragmatic view on statistics that you express in the final sentence, I have almost never felt any urge to join either side of the old Bayesian vs frequentist quarrel.