“McCloskey and Ziliak have been pushing this very elementary, very correct, very important argument through several articles over several years and for reasons I cannot fathom it is still resisted. If it takes a book to get it across, I hope this book will do it. It ought to.” —Thomas Schelling, Distinguished University Professor, School of Public Policy, University of Maryland, and 2005 Nobel Prize Laureate in Economics
“With humor, insight, piercing logic and a nod to history, Ziliak and McCloskey show how economists—and other scientists—suffer from a mass delusion about statistical analysis. The quest for statistical significance that pervades science today is a deeply flawed substitute for thoughtful analysis. . . . Yet few participants in the scientific bureaucracy have been willing to admit what Ziliak and McCloskey make clear: the emperor has no clothes.” —Kenneth Rothman, Professor of Epidemiology, Boston University School of Public Health
The Cult of Statistical Significance shows, field by field, how “statistical significance,” a technique that dominates many sciences, has been a huge mistake. The authors find that researchers in a broad spectrum of fields, from agronomy to zoology, employ “testing” that doesn’t test and “estimating” that doesn’t estimate. The facts will startle the outside reader: how could a group of brilliant scientists wander so far from scientific magnitudes? This study will encourage scientists who want to know how to get the statistical sciences back on track and fulfill their quantitative promise. The book shows for the first time how wide the disaster is, and how bad for science, and it traces the problem to its historical, sociological, and philosophical roots.
Stephen T. Ziliak is the author or editor of many articles and two books. He currently lives in Chicago, where he is Professor of Economics at Roosevelt University. Deirdre N. McCloskey, Distinguished Professor of Economics, History, English, and Communication at the University of Illinois at Chicago, is the author of twenty books and three hundred scholarly articles. She has held Guggenheim and National Humanities Fellowships. She is best known for How to Be Human* Though an Economist (University of Michigan Press, 2000) and her most recent book, The Bourgeois Virtues: Ethics for an Age of Commerce (2006).
Recently in another GR review I made a remark about the absence of a coherent aesthetic in statistics. In response I received several messages criticising my choice of words. ‘What does aesthetics have to do with the practical use of statistical analysis?’ was the polite form of the question. The answer is ‘quite a bit really.’ In fact it is the lack of aesthetic coherence which makes statistical analysis so dangerous as well as wrong. McCloskey’s book, by demonstrating the utter arbitrariness of statistics, captures part of the problem from the point of view of economics. But the issue is actually more profound than she sets out.
Many years ago I had a teacher, Russell Ackoff, at the University of Pennsylvania. Early in his career he had written a textbook called Scientific Method: Optimizing Applied Research Decisions. In it he laconically criticised social scientists for using tests of statistical significance which were unrelated to the cost of error in their analysis. His point was obvious to all but statisticians: tiny errors affect consequential decisions far more than large errors affect trivial ones. If you don’t know what the consequences of being wrong are, you can’t conduct responsible statistical analysis.
No one paid much attention to Ackoff in 1962. And I doubt whether many have paid much attention to McCloskey since her book’s publication in 2008. One possible reason for the lack of impact is that both Ackoff and McCloskey present statistical method as an economic issue. While the issue is indeed in part economic, it is not one that the discipline of economics chooses to address. To do so would undermine most of the discipline’s empirical research of the last century. Besides, any kind of cost-benefit analysis of statistics would have to be carried out with the very statistical techniques being held to account, by those who are most invested in them. Prospects for a breakthrough have therefore always been slim.
A fundamental term in all economics is ‘value.’ Classical economics derives value from what it calls preferences, typically expressed in terms of what it calls ‘utility’ (a transparently aesthetic concept), or in the case of business, ‘profit’ (less obviously so, but also an aesthetic idea). In order to make its scholastic logic work, it fixes preferences and its derived factors as some sort of divinely revealed and protected order of things, regardless of the obvious fact that preferences change continuously and no one actually has any definite idea of what constitutes corporate profits.
In fact ‘value’ is not fundamentally an economic concept but an aesthetic one. That is to say, economic value is a sub-species of the aesthetic. More specifically, value is a criterion of aesthetic choice as it is employed in economics. An ‘aesthetic’ is the more general term for such a criterion and applies not just to economics but to all choices of consequence. An aesthetic is more than a preference: it is a rule of choice about what is more important, more desirable, more inherently valuable. It is also articulated to some degree, though not necessarily by the one employing it; it is not merely a response to the presentation of alternative courses of action. And through its articulation an aesthetic evolves continuously.
From an aesthetic perspective, therefore, the statistical problem of economics is no longer one of some sort of complex, quite likely impossible, cost/benefit analysis. Rather it is one of aesthetic compatibility - or lack of it. This is easy to understand without complex analysis about the significance of significance tests or the costs of being wrong. Only the most elementary understanding of what statistics are is necessary.
A statistic is a partial description of some set of numbers, usually the result of some sort of social science observation or experiment. I say ‘partial’ because there are an infinite number of statistics that might be used to describe any such set of numbers. When it is used to judge, that is to evaluate, a set of numbers, a statistic is an aesthetic.
The most common statistic is the mean, or average, of the set. The mean is the first moment of the distribution. The second moment gives the variance (whose square root is the standard deviation), a measure of the spread within the set of numbers. The greater the variance, the less certainty there is about the mean of the set. Much of statistical analysis is directed toward establishing just how certain, how significant, a given statistic is.
But it is crucial to recognise that the mean and the variance are only two possible statistics. Many others may be relevant to a meaningful description of the set of numbers. The third moment, skew, for example, captures the asymmetry of the set, how much of its weight sits in one tail. Skew is highly relevant for estimating things like ‘worst case’ conditions, what happens if things really go wrong. The fourth moment, kurtosis, is not something one reads about on the back of a cereal box, but it is important for understanding how heavy the tails of the set are, how often genuinely extreme values turn up.
There are fifth, sixth, seventh, and higher moments too, each giving slightly more information about the set of numbers in hand, and each of potential relevance in arriving at a decision about what to do. The mean expected return on a business venture, for example, might be very high. Its variance might be low as well, so it looks like a winner. But if its downside extreme, its skew, could bankrupt the whole company, perhaps it’s not such a good idea.
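The venture example is easy to make concrete. Here is a minimal Python sketch (the yearly returns are invented purely for illustration) computing the first three moments of a stream of returns whose average looks attractive but whose downside is ruinous:

```python
def moments(xs):
    """Mean, variance, and standardised skewness of a list of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    sd = var ** 0.5
    skew = sum(((x - mean) / sd) ** 3 for x in xs) / n
    return mean, var, skew

# Hypothetical yearly returns: a decent average and modest spread,
# but one ruinous year that mean and variance alone understate.
returns = [0.12, 0.10, 0.11, 0.13, 0.12, 0.10, 0.11, -0.60]
mean, var, skew = moments(returns)
print(f"mean={mean:.3f}  variance={var:.3f}  skew={skew:.2f}")
```

The mean is positive and the variance modest, but the strongly negative skew is exactly the statistic that could bankrupt the company, and nothing in statistics says how to weigh it against the other two.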
And that’s the nub of the aesthetic problem in statistics: how much mean is worth how much variance, is worth how much skew, etc., etc.? There is certainly no rational, objective, universal answer, no criterion for equating the various statistics. Nor is there even a criterion, a rule of judgement, for establishing which descriptive statistics are relevant at all. Each moment is a distinct aesthetic, completely independent of and often contrary to every other. Nothing in the science of statistics suggests even the possibility that these contrary aesthetics are theoretically or practically reconcilable. And no statistician or economist has ever been bold enough to propose that there might be such a reconciliation. They choose instead to duck the issue completely.
Furthermore, from my own experience I can attest that any attempt to establish subjective preferences for ‘trading off’ the various statistics against each other, in the manner of a utility function for example, is doomed to failure. Not only do people not experience life (including business) in terms of statistical categories; any such preferences are altered as soon as they are articulated. The end result is contradictory and nonsensical. No one recognises, much less believes, the resulting aesthetic.
In short, the insignificance of statistical significance can be most effectively demonstrated not by the fact that the measures of significance are unrelated to the cost of error, but by the fact that the statistics used in analysis have no significance to each other. I challenge any statistician or economist to show that this conclusion is unwarranted. Nevertheless I defer to Deirdre if she insists on making the case for cost.
Postscript: The approach outlined above is, I believe, also one that is compatible with the growing trend called Behavioural Economics. Traditional or Classical Economics starts out from a set of first principles and then logically derives its conclusions as virtual economic ‘laws.’ One law for example is that consumers and corporate executives are ‘rational’, that is, they act according to the implications of first principles. If they are empirically observed not to act this way, they are irrational and will eventually be forced to conform to rational norms or they will be economically punished.
This method of traditional economics has a marked resemblance to the medieval scholastic method. It is not rational but rationalistic. It presumes the validity of its own analysis and has no real way of testing itself or learning. Behavioural Economics is the equivalent of the Baconian revolution in medieval thought. It looks first at the way in which people actually conduct themselves economically, and then poses the question ‘why?’ In other words, Behavioural Economics presumes that there is a possibly undisclosed rationale, a purpose, and then tries to articulate what that purpose might be.
It is perhaps in the area of Financial Economics that the use of statistics is most abused and most dangerous. One consequence was the financial crash of 2008, created directly by the use of statistical models built on fundamentally incompatible aesthetics. As far as I am aware, many people are still paid a great deal of money to continue doing the same thing. This behaviour is driven by the aesthetic of the Goldman Sachses of the world, who have persuaded many others that they are acting in their interests. They aren’t.
Have you encountered one of those books where the authors convey one idea so clearly at the beginning that the rest of the book is overkill? This is one of those books. The idea is simple: significance tests are dangerous and useless, essentially because they do not actually measure anything; they only tell you whether the result you obtained is likely to be due to chance. In science, measurement is essential to decision making; thus the authors advocate the use of effect sizes and confidence intervals instead of significance tests.
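The alternative they advocate is easy to sketch. A minimal Python example (all numbers invented for illustration) reports an effect size with a 95% confidence interval, under a normal approximation, instead of a bare significant/insignificant verdict:

```python
import math

def ci95(effect, sd, n):
    """95% confidence interval for a mean effect (normal approximation)."""
    half = 1.96 * sd / math.sqrt(n)
    return effect - half, effect + half

# Hypothetical study: effect size 0.8 units, sd 2.0, n = 100 subjects.
lo, hi = ci95(effect=0.8, sd=2.0, n=100)
print(f"effect = 0.8, 95% CI = ({lo:.3f}, {hi:.3f})")
```

The interval conveys both the magnitude of the effect and the uncertainty around it, which a bare p-value does not.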
Not the earliest critique - that honour belongs to Meehl or Freedman or Gosset himself - but the most readable. You don't necessarily need to read past page 100; the rest is recapitulation.
They make excellent arguments against the importance of statistical significance and show that it is an abstraction commonly used to misrepresent data. The book highlights the danger and corruption that flow from the overwhelming importance placed upon statistical significance, using examples from fields such as pharmaceuticals, law, and forensics. Another interesting aspect is the paean to Gosset at the beginning of the book.
Interesting to read the background on Gosset (inventor of Student's t) and Fisher. Some chapters were less interesting and could easily have been excluded, but the book finishes well, with some suggestions of methods to use instead.
This one goes on my list of books whose purpose is almost exclusively to tell the reader how brilliant the author is, and how stupid everyone else is. (Other items on the list are Fooled by Randomness and The God Delusion -- in fact, almost anything by Richard Dawkins.) It seemed like a good idea -- there certainly are fundamental problems with the notion of statistical significance, and huge confusion about its application. But the authors seem to have just one hobbyhorse -- statistical significance ignores the size of effects -- and they ride it for chapter after chapter, heaping up anecdotes about colleagues and predecessors who supposedly didn't get it, instead of actually developing an argument. I gave up after about 50 pages.
Highly academic, but basically boils down to: "statistically significant" does not mean the same thing as "meaningful" or "material". Just because something is "statistically significant" doesn't mean we necessarily care about the outcome; a statistically significant +1% impact is less interesting than a less predictable +50-100% impact (the latter of which may be discarded because of high variance in outcomes).
In other words: do things that might matter, rather than things that are certain or repeatable.
Though based on an important problem in applied statistical research, it offers nothing by way of understanding, instead brainwashing the reader with gross misrepresentations, emotional appeals, and other rhetorical tricks. Harmful to anyone without sufficient background to spot the fallacies in the presented arguments. A prepared reader, however, may gain a lot of insight by dissecting those fallacies, if he can tolerate the combination of arrogant presentation and nonsensical content.
An illuminating read on null hypothesis significance testing and how it causes many of us to lose focus on what is truly important: substantive significance or "oomph." That being said, the point was a bit belabored with the majority of the book devoted to describing the problem and very few pages were dedicated to the question of how to fix it.
The book is stylistically consistent, which is saying something. I found their puns somewhat annoying and hard to decipher. But they are allowed that joy of their profession: to speak directly about the aspects of their industry which annoy them most, namely sizelessness and the neglect of oomph.
As the justifiers of decision making, they wish to connect the meaning of their studies to the world beyond insular economic language. This humanizes their profession by drawing on outside expertise, which is exactly what they claim was the personal failing of Fisher, the father of the .05 significance threshold. They portray him (blaming a bad upbringing) as a man with personality issues who refused to talk with anyone outside his own camp. He was so successful at perpetuating his view of meaning, significance as a merely formal indicator, that today we miss the real significance of statistical studies and make bad decisions. Worse than ignorance, we are mistaken about which factors have the greatest impact on real-world situations.
This is a good primer on understanding statistics beyond routine.
This is a must-read book for every human being on the planet.
That said, I wish they'd made their point with more efficiency and less self-indulgent stream of consciousness.
Also, it's stronger on the "what's wrong" than on the "what should we do to fix it," but the "what's wrong" part is absolutely vital for everyone to know.
The authors' attempt at thoroughness makes this book a bit exhausting at times, but it is still a must-read for anyone using statistics in real-world contexts.
Suppose I told you I had found irrefutable scientific evidence that a CEO’s golf handicap affects his or her company’s stock performance. Better golfers imply higher stock returns, I’d say. As evidence I would produce a huge regression model, covering thousands of companies, where golf handicap as explanation of differing stock returns was found to be statistically significant at the 1 percent level. Now promise me you will never trust any such mock science.
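The golf-handicap trap is easy to reproduce. In this hedged Python sketch (all data simulated; the candidate "explanations" are pure noise), screening enough meaningless predictors against the same stock returns reliably produces some that come out "statistically significant" by chance alone:

```python
import math
import random

random.seed(0)  # fixed seed so the demonstration is reproducible

def corr_p(x, y):
    """Pearson correlation and a two-sided p-value (normal approximation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return r, math.erfc(abs(t) / math.sqrt(2))

n = 100
stock_returns = [random.gauss(0, 1) for _ in range(n)]

# 200 candidate "explanations" that are pure noise, like a CEO's golf handicap.
p_values = [corr_p(stock_returns,
                   [random.gauss(0, 1) for _ in range(n)])[1]
            for _ in range(200)]

false_hits = sum(p < 0.05 for p in p_values)
print(false_hits, "of 200 noise predictors look 'significant' at the 5% level")
```

At a 5% threshold, roughly one in twenty pure-noise predictors clears the bar, which is how a golf handicap ends up "explaining" stock returns.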
Stephen Ziliak and Deirdre McCloskey, two economics professors with a keen sense of the right and wrong uses of statistics, have written a truly eye-opening book. Without resorting to equations, they show that the practice known as hypothesis testing or tests of statistical significance is utterly flawed. Unfortunately, most scientists are unaware of this fact and scientific journals of the highest standing continue using it anyway.
Let it be said at once, Ziliak and McCloskey’s book is not an enjoyable read. Not only for its uncomfortable implications but just as much for its poor editing. On the good side the prose is sometimes beautiful and often funny. But its points are rarely well argued, the quotations and examples are too many and too lengthy and the endless repetition of its central polemic soon becomes tedious. However, I am willing to accept these flaws as the book is too important to be written off on the grounds of style.
The authors take a searing look at the practice of labelling factors either statistically significant or insignificant. The method is usually associated with the statistician Ronald A. Fisher. But the authors show how Fisher virtually abducted the thinking of William Sealy Gosset, head brewer at the Guinness brewery, corrupting Gosset’s original ideas and presenting them in a simplistic and flawed version. Gosset is the man behind the famous pseudonym Student, which gave its name to Student’s t.
The biggest flaw in significance testing is that it asks only whether there is a relation, not how large the assumed relation is. Some factors can come out statistically significant while having negligible impact (like the golf handicap?), while others can have a large impact while never achieving statistical significance. The outcome is to some extent in the hands of the researcher. Choose a big enough sample size and almost anything becomes statistically significant; choose a small enough one and nothing does. The risk of manipulation is obvious, as researchers can pick and choose between significance and insignificance through the choice of sample size.
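The sample-size point can be checked in a few lines. This Python sketch (effect sizes and standard deviations invented for illustration) uses the textbook t-statistic and a normal approximation to the two-sided p-value:

```python
import math

def t_stat(effect, sd, n):
    """t statistic for a one-sample test of a mean difference."""
    return effect / (sd / math.sqrt(n))

def p_value(t):
    """Two-sided p-value under a normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2))

# A negligible effect (0.1% against a sd of 10%) becomes "significant"
# once the sample is huge...
p_tiny = p_value(t_stat(0.001, 0.10, 200_000))
# ...while a big effect (5%) stays "insignificant" in a small sample.
p_big = p_value(t_stat(0.05, 0.10, 10))
print(p_tiny, p_big)
```

The effect sizes are held fixed; only the sample sizes differ, yet the two verdicts reverse.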
Hypothesis testing even asks the wrong question. It asks how likely it would be to observe the data you have collected, assuming the null hypothesis is true. But the more relevant question concerns the likelihood of the hypothesis being true, given the observed data. This is equivalent to confusing the probability of a person being dead given that he was hanged with the probability of him being hanged given that he is dead. Quite a difference, I would say.
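The hanged/dead confusion is just Bayes’ rule with the conditionals swapped. A small Python sketch with hypothetical, illustrative probabilities (not real mortality data):

```python
# Hypothetical, illustrative probabilities (not real mortality data).
p_hanged = 1e-5                 # P(hanged): being hanged is very rare
p_dead_given_hanged = 0.99      # P(dead | hanged): hanging is usually fatal
p_dead = 0.01                   # P(dead): overall chance of dying in the period

# Bayes' rule: P(hanged | dead) = P(dead | hanged) * P(hanged) / P(dead)
p_hanged_given_dead = p_dead_given_hanged * p_hanged / p_dead
print(p_dead_given_hanged, "versus", p_hanged_given_dead)
```

One conditional is near certainty, the other near zero; a p-value answers only the first kind of question while researchers routinely act as if it answered the second.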
The chilling implication of Ziliak’s and McCloskey’s book is that it casts doubt on almost everything we regard as knowledge in society. If papers published in eminent journals like the American Economic Review make such elementary errors, how are we to trust any scientific findings at all? Let alone any of the causes and effects communicated to us by less rigorous media?
This book is wildly repetitive, but makes a very important statistical point.
An example to illustrate the point. Drugs are often marketed on the basis of effectiveness vs. placebo. A typical research argument is that users benefit from taking the drug by a statistically "significant" margin, i.e., one that would not occur by chance more often than n% of the time (where n, in the scientific community, is usually 5).
So what's wrong with that? Well, the problem is, this methodology throws out anything to do with "how much." It could be that most or all of the cases in which NewDrug performed better than placebo were marginal. But it could also be that most or all of the cases where NewDrug did worse than placebo were disastrous, even fatal. The drug company's "good" cases may outweigh the "bad" cases numerically, to a statistically significant degree (and they'll trumpet that in the ads); but in common sense terms, if the bad cases result in so much death or disaster, bad outweighs good even though good cases outnumber bad to an arbitrarily "significant" degree. Ziliak's point is that the bottom line verdict in this example should not be the scientific conclusion that NewDrug is "effective," but rather the common sense one that NewDrug is "disastrous." In other words, you should look past so-called (statistical) "significance" to see what's really happening. Get it?
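A toy Python sketch of the NewDrug story (outcomes invented purely for illustration): the average benefit over placebo is positive, and with enough patients would register as "significant", yet the worst case is catastrophic:

```python
# Hypothetical outcome changes versus baseline (positive = improvement);
# the numbers are invented purely to illustrate the point.
new_drug = [3, 3, 3, 3, 3, 3, 3, 3, 3, -20]   # nine small wins, one disaster
placebo = [0] * 10

mean_benefit = sum(new_drug) / len(new_drug) - sum(placebo) / len(placebo)
worst_case = min(new_drug)
print(mean_benefit, worst_case)
```

The average favours the drug; the tail is ruinous. Which number matters is a judgement about magnitudes, not a p-value.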
Hiding common sense conclusions can be a lucrative way of marketing products. Or of furthering ideological aims, as the book shows economics can do. Throwing statistical significance around can make an economist look and feel like a wizard. But at the expense of common sense.
It's a worthy read. Practice in this area remains appalling, as they demonstrate. I do wonder if it's a bit overlong; there is only one simple thesis in the whole book, and it drags a bit towards the end. There's a nice couple of chapters on history (Fisher, Gosset, etc.) to round it off, which are great fun.
The book feels a bit repetitive and is very aggressive in its various attacks. Yet, it would still be preferable to encounter this book early on in one's career as a researcher. It can be eye-opening for the grad student or early career researcher raised in the NHST tradition. And also full of useful and fun information. Definitely worth a read even if you disagree with the tone or content.
I didn't finish this book, but I read the first several chapters. They make a good point. It's just that they keep making the same point over and over. Effect size and real-world consequences are more important than statistical significance. Doesn't really require 300 pages to say that.
A brilliant book about what has gone wrong in the use of statistical methods, and why! On the latter question the book lapses somewhat into criticism of Fisher as a person. Otherwise an excellently recommendable book for anyone interested in statistics!