Because statistical methods are a means of accounting for the epistemic role of measurement error and uncertainty, the “statistics wars” (at least on the frequentist versus Bayesian front) are best described as a dispute about the nature and origins of probability: whether it comes from “outside us” in the form of uncontrollable random noise in observations, or “inside us” as our uncertainty given limited information on the state of the world. — *location: 71* ^ref-27009 --- The safeguard, missing completely from the standard template, is the prior probability for the hypothesis, meaning the probability we assign it before considering the data, based on past experience and what we consider established theory. — *location: 106* ^ref-30432 --- It is impossible to “measure” a probability by experimentation. Furthermore, all statements that begin “The probability is …” commit a category mistake. There is no such thing as “objective” probability. — *location: 149* ^ref-44374 --- “Rejecting” or “accepting” a hypothesis is not the proper function of statistics and is, in fact, dangerously misleading and destructive. The point of statistical inference is not to produce the right answers with high frequency, but rather to always produce the inferences best supported by the data at hand when combined with existing background knowledge and assumptions. — *location: 158* ^ref-19859 --- Science is largely not a process of falsifying claims definitively, but rather assigning them probabilities and updating those probabilities in light of observation. This process is endless. No proposition apart from a logical contradiction should ever get assigned probability 0, and nothing short of a logical tautology should get probability 1. — *location: 162* ^ref-39143 --- Sampling probabilities go from hypothesis to data: Given an assumption, what will we observe, and how often? Inferential probabilities go from data to hypothesis: Given what we observed, what can we conclude, and with what certainty? Sampling probabilities are fundamentally predictive; inferential probabilities are fundamentally explanatory. — *location: 402* ^ref-34354 --- The problem with Bernoulli’s answer to his question is that the arrow is pointing in the wrong direction. He wanted to answer a question about the probability of a hypothesis, but he did so by thinking only about the probability of an observation. The confusion of the two—and the general idea that one can settle questions of inference using only sampling probabilities—is what I call Bernoulli’s Fallacy. — *location: 410* ^ref-24975 --- The main idea is this: probability theory is logical reasoning, extended to situations of uncertainty. We’ll show how this flexible definition includes both the sampling and the inferential types of probability in Bernoulli’s problem and how it allows us to both solve the problem and describe how Bernoulli’s attempt was only part of a complete answer. — *location: 429* ^ref-58079 --- Like any human institution, statistics is and was largely a product of its times. In the late 19th and early 20th centuries, scientific inquiry demanded a theory free from even a whiff of subjectivity, which led its practitioners to claim inference based solely on data without interpretation was possible. They were mistaken. But their mistake was so powerfully appealing, and these first statisticians so prolific and domineering, that it quickly took hold and became the industry standard. 
— *location: 536* ^ref-60014 --- The key idea was a Dutch book, a portfolio of bets that would earn a sure profit (in modern language, we would call this an arbitrage strategy). He showed that unless a person’s probability assignments—that is, betting prices—followed the rules of probability, a Dutch book was always possible. For example, suppose my probability of it raining tomorrow were 30 percent, meaning I’d pay $0.30 for a chance to win $1 if it rained, but my probability of it not raining were 80 percent, so I’d pay $0.80 to win $1 if it didn’t rain. Someone could make a Dutch book against me by selling me both bets for a total price of $1.10 with the certain profit of $0.10, since they would have to pay out only $1 in either case. — *location: 1131* ^ref-7333 --- In Jaynes’s view, probability was entirely about information—specifically, the degree of certainty a rational person should have given incomplete information about an event or process. — *location: 1276* ^ref-40725 --- Jaynes’s probabilities were subjective in the sense of being dependent on the assumptions brought to a problem by a particular person, but they were objective in the sense that any two people reasoning rationally (following Cox’s rules for plausible reasoning) from the same starting assumptions would necessarily arrive at the same answer. — *location: 1433* ^ref-33074 --- At this point in the book, if things have gone according to plan, you should be convinced of two inescapable propositions: (1) probability is best understood as the plausibility of a statement given some assumed information, not just the frequency of occurrence of events, and (2) the correct process of probabilistic inference involves a faithful accounting of background information and assumptions, which is conceptually tricky and enormously difficult. — *location: 1700* ^ref-29879 --- Bernoulli offered this bargain: “Precise estimates, high certainty, or small samples. Pick two.” — *location: 1778* ^ref-31969 --- Beginning with his 1930 paper “Inverse Probability,” the statistician Ronald Fisher referred to this method of inference as “fiducial.”10 The term fiducial, meaning “faithful,” has origins in astronomy and land surveying, describing a fixed point of reference for establishing position or distance. Fisher’s idea, just like Bernoulli’s, was that, when trying to learn about an unknown quantity from a sample, either the unknown quantity could be taken as fixed and the sample shown to be close to it, or the sample value could be taken as fixed with the underlying quantity therefore necessarily being close to it. From either perspective, close was close. — *location: 1860* ^ref-48253 --- If we cast the problem in terms of Bayes’ theorem, we can begin to grasp the real magnitude of what’s going on here. For any hypothesis H and observed data D, the theorem tells us P[H | D] = P[H] × P[D | H] / P[D]. The inferential probability is the one on the left-hand side (How probable is the hypothesis given the data?). The sampling probability that Bernoulli focused on exclusively is only the numerator of the second term on the right-hand side (How probable is the data given the assumed hypothesis?). So how the sampling probability affects the inferential one will depend on the other terms in the equation. In particular, we need to know P[H] (How probable do we consider the hypothesis without knowing the data?) and P[D] (How probable is the data, considering all possible alternative hypotheses together?) — *location: 1870* ^ref-31878 >Frequentists ignore P(D) & P(H) ---
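A minimal numerical sketch of that last highlight (my own Python, with invented numbers): the same sampling probability P[D | H] yields very different inferential probabilities P[H | D] once the prior P[H] and the total probability P[D] over the alternatives enter the calculation.

```python
# Hedged sketch (invented numbers, not from the book): Bayes' theorem with a
# single alternative hypothesis "not H", showing that the sampling probability
# P[D | H] alone does not determine the inferential probability P[H | D].

def posterior(prior_h, p_d_given_h, p_d_given_not_h):
    """P[H | D] = P[H] * P[D | H] / P[D], with P[D] summed over H and not-H."""
    p_d = prior_h * p_d_given_h + (1 - prior_h) * p_d_given_not_h
    return prior_h * p_d_given_h / p_d

# Same sampling probability P[D | H] = 0.95 in both cases ...
print(posterior(prior_h=0.5, p_d_given_h=0.95, p_d_given_not_h=0.05))    # ~0.95
print(posterior(prior_h=0.001, p_d_given_h=0.95, p_d_given_not_h=0.05))  # ~0.019
# ... but very different P[H | D], because the prior and the plausibility of the
# data under the alternative hypothesis differ.
```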
This is what I will refer to from now on as Bernoulli’s Fallacy: the mistaken idea that sampling probabilities are sufficient to determine inferential probabilities. The Bayesian analysis reveals that this way of thinking misses out on two other essential ingredients: (1) what available hypotheses we may have to explain the data some other way, with their own associated sampling probabilities, and (2) what prior probabilities we assign to the various hypotheses in play. — *location: 1942* ^ref-4476 --- Bernoulli’s mistake was not just confusing the sampling statement in his theorem and the inferential statement he wanted to make, although that implies Bernoulli’s Fallacy as a consequence. His real mistake was thinking he had all the necessary information in the first place. — *location: 1946* ^ref-22592 --- Persi Diaconis and Frederick Mosteller called this the “law of truly large numbers”: given a large enough sample size, any outrageous thing is bound to happen.12 — *location: 1988* ^ref-4481 --- For example, here is a way to produce an outcome almost certainly never seen before in human history and nearly certain never to be repeated: shuffle a deck of cards. The resulting permutation, if the cards are shuffled correctly, should occur only about once every 52 factorial shuffles—that is, 52 · 51 · 50 · … · 2 · 1—because this is the number of possible shuffles, all of which should be equally likely. This number is on the order of 10^68, or one hundred million trillion trillion trillion trillion trillion. Every person on earth could shuffle cards once every nanosecond for the expected lifetime of the universe and not even put a dent in that number. — *location: 1991* ^ref-20519 --- For example, under the assumption that a coin is fair, the sequences of 20 flips HHTHTTHTHHHTTTTHTHHT and HHHHHHHHHHHHHHHHHHHH have exactly the same probability: (1/2)^20, or about 1 in 1 million, which is a pretty small chance. But only the latter is suggestive of an alternative hypothesis: that the coin is biased or even has heads on both sides, which would make the observed outcome certain. — *location: 2000* ^ref-16869 --- This is why particular forms of unlikely occurrences are so noteworthy. They carry an enormous potential energy like a coiled-up spring that could be released to launch an unlikely alternative hypothesis, such as the idea that something other than chance is at work, into the heights of near certainty. — *location: 2019* ^ref-29295 --- Unlikely events happen all the time, as in the story Richard Feynman sardonically recounted in a lecture about the scientific method: “You know, the most amazing thing happened to me tonight … I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight? Amazing!” — *location: 2121* ^ref-22051 --- In legal circles, the argument Meadow presented in these cases—that, under an assumption the suspect is innocent, the facts of the case would be incredibly unlikely, and, therefore, the suspect is unlikely to be innocent—is known as the prosecutor’s fallacy.
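A small sketch tying the coin-flip highlight to the prosecutor’s-fallacy point (my own illustration, not the book’s): an outcome’s tiny probability under the null means little by itself; what matters is how it compares under an alternative, weighted by the prior.

```python
# Hedged sketch (my own illustration, not the book's): both 20-flip sequences
# have the same sampling probability under "fair coin", but only the all-heads
# sequence moves a two-headed-coin alternative from negligible to plausible.

seq_mixed = "HHTHTTHTHHHTTTTHTHHT"
seq_heads = "H" * 20

def p_seq(seq, p_heads):
    """Sampling probability of an exact flip sequence given P(heads)."""
    prob = 1.0
    for flip in seq:
        prob *= p_heads if flip == "H" else (1 - p_heads)
    return prob

def posterior_two_headed(seq, prior_two_headed=1e-6):
    """P[two-headed coin | sequence], with a fair coin as the only alternative."""
    numerator = prior_two_headed * p_seq(seq, 1.0)
    denominator = numerator + (1 - prior_two_headed) * p_seq(seq, 0.5)
    return numerator / denominator

print(p_seq(seq_mixed, 0.5), p_seq(seq_heads, 0.5))  # identical: both ~9.5e-07
print(posterior_two_headed(seq_mixed))               # 0.0: a single tails rules it out
print(posterior_two_headed(seq_heads))               # ~0.51, even from a one-in-a-million prior
```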
— *location: 2290* ^ref-42074 --- The failure to appreciate (1) the need for alternative hypotheses and (2) the difference between sampling probabilities and inferential probabilities is what makes Bernoulli’s argument incorrect and what has led us in the present day to numerous instances of mistaken medical advice, wrongful assessment of risk, and miscarriages of justice. — *location: 2362* ^ref-46130 --- Defending the logic of this approach, Fisher wrote, “A man who ‘rejects’ a hypothesis provisionally, as a matter of habitual practice, when the significance is at the 1 percent level or higher [that is, when data this extreme could be expected only 1 percent of the time] will certainly be mistaken in not more than 1 percent of such decisions. For when the hypothesis is correct he will be mistaken in just 1 percent of these cases, and when it is incorrect he will never be mistaken in rejection.”35 However, that argument obscures a key point. To understand what’s wrong, consider the following completely true summary of the facts in the disease-testing example (no false negatives, 1 percent false positive rate): Suppose we test one million people for the disease, and we tell every person who tests positive that they have it. Then, among those who actually have it, we will be correct every single time. And among those who don’t have it, we will be incorrect only 1 percent of the time. So overall our procedure will be incorrect less than 1 percent of the time. Sounds persuasive, right? But here’s another equally true summary of the facts, including the base rate of 1 in 10,000: Suppose we test one million people for the disease, and we tell every person who tests positive that they have it. Then we will have correctly told all 100 people who have the disease that they have it. Of the remaining 999,900 people without the disease, we will incorrectly tell 9,999 people that they have it. Therefore, of the people we identify as having the disease, about 99 percent will have been incorrectly diagnosed. — *location: 2377* ^ref-33666 --- Finally, suppose the people who received positive test results and a presumptive diagnosis in our example were tested again by some other means. We would see the majority of the initial results fail to repeat, a “crisis of replication” in diagnoses. That’s exactly what’s happening in science today. Virtually every area of experimental science that uses statistics is now being forced to confront the fact that many of their established results are not reproducible. — *location: 2397* ^ref-28134 --- the more important a question is to society, the more fierce the objection will be to using probability to answer that question, and the more the users of probability will retreat to frequency as a justification. — *location: 2818* ^ref-3560 --- Reflecting on his career in statistics in 1934, Pearson said his great epiphany had been that “there was a category broader than causation, namely correlation, of which causation was only the limit, and that this new conception of correlation brought psychology, anthropology, medicine and sociology in large parts into the field of mathematical treatment. 
It was Galton who first freed me from the prejudice that sound mathematics could only be applied to natural phenomena under the category of causation.”20 — *location: 3200* ^ref-6313 --- He also compiled his most important statistical work, the book Statistical Methods for Research Workers (1925), which contained a collection of practical methods for scientists, especially biologists, to use in problems of inference when dealing with small sample sizes—Fisher’s specialty since his days at Cambridge. — *location: 3423* ^ref-13607 --- For an experimental scientist without advanced mathematical training, the book was a godsend. All such a person had to do was find the procedure corresponding to their problem and follow the instructions. As a result, Statistical Methods for Research Workers was enormously successful. It went through 14 editions between 1925 and 1970, and it became such the industry standard that anyone not following one of Fisher’s recipes would have a hard time getting results published. — *location: 3437* ^ref-32370 --- Both were extremely ambitious men possessed of colossal egos, and both wielded tremendous influence over the next generation through their writing and teaching. As a result, any student of statistics these days will know the names Pearson and Fisher. The correlation coefficient, now a standard calculation applied to almost any data containing two variables, is Pearson’s rho. Pearson gets credit for the multivariate normal distribution, contingency tables, the chi-squared test, the method of moments, and principal component analysis, all standard tools. He also invented significance testing and the p-value, which is the most common measure of statistical significance, as we’ll describe in the next chapter. From Fisher we get the F-test, the idea of a sufficient statistic, the method of maximum likelihood, the concept of a parameter, linear discriminant analysis, ANOVA (analysis of variance), Fisher information, and Fisher’s exact test, among others. — *location: 3640* ^ref-27929 --- In the next generation, Egon Pearson and Neyman introduced a different, more mathematical approach to hypothesis testing by means of decision theory. That is, they viewed the results of a statistical test in terms of the decision to accept or reject a hypothesis in favor of an alternative, with penalties for making the wrong choice. From this mode of thinking came such concepts as unbiased estimators, statistical power, Type I and Type II errors, and confidence intervals. — *location: 3652* ^ref-44029 --- While these feuds continued, it was up to the community of research scientists, journal editors, and statistics textbook authors to decide which techniques would become industry standards. Because of the great influence of both the Fisher and the Neyman-Pearson schools, the answer was that working scientists took a hybrid approach and combined ideas from both camps. Most notably, the current standard procedure of null hypothesis significance testing is the result of cramming Fisher’s p-value measure of significance into Neyman and Pearson’s hypothesis testing framework. — *location: 3675* ^ref-53912 --- The result of those arguments, one triggered by the hint of Bayesianism in Fisher’s work, is that orthodox statistics may be an unholy hybrid no single author would recognize, but it is an entirely frequentist one, as we’ll see worked out in more detail in the next chapter. 
— *location: 3698* ^ref-24927 --- So, in the end, we mostly have Fisher to blame for modern statistics being frequentist, but he was in many ways just carrying out a line of thought begun nearly a century earlier by many others—arguably stretching all the way back to Bernoulli—and consistent with Galton’s and Pearson’s views of statistics. In answer to the question of why the statistical methods they ultimately produced were frequentist, the simplest and most correct answer, then, is that Fisher thought they could be. His predecessors may have desired for all inference to rely solely on observable facts because it would have fit their overall scientific philosophy, but Fisher was the one who brewed the mathematical snake oil with that as its promise. He was further emboldened by critiques like those of Boole and Bertrand, who had shown that different meanings of ignorance could lead to inconsistent prior probability assignments and that a uniform probability distribution couldn’t be justified for all problems. — *location: 3799* ^ref-1468 --- The implicit claim in the use of strictly frequentist statistical methods is that every probability in the calculations is objectively measurable and therefore so are all the conclusions emanating from those calculations. But that claim has always involved some sleight of hand. All statistical estimates of anything—for example, the correlation between skull size and measured intelligence or the mean difference in disease incidence between different races—are explainable in more than one way. The estimation process may be perfectly objective and its associated frequencies reliably measurable, and for problems with weak prior information, the estimates may even come out sensibly. But then it’s up to the scientist to decide what conclusions to make from those estimates: whether they represent a real association or a difference worth caring about, or whether they might be the products of some unobserved variable or infected with bias in the ways the relevant quantities were measured. Galton, Pearson, and Fisher all demonstrated this flexibility abundantly by interpreting the same statistical results one way or another, depending on what conclusion suited their agenda. The agenda guiding their inferences was the unacknowledged subjective element, while their flashy statistical calculations were meant to provide misdirection. — *location: 4022* ^ref-16974 --- Considering the names of all the various properties of estimators and tests in the statistical literature, it’s hard not to notice that they all seem to embed value judgments, suggesting that this particular estimator or test is good. We have already mentioned unbiased and consistent estimators. There are also efficient estimators, admissible estimators, dominant estimators, robust estimators, uniformly most powerful tests, and, surely the best example, the best linear unbiased estimator. It seems nearly certain that all this normativity is a by-product of the political infighting and jockeying for position between various camps within the world of frequentist statistics over the course of the last century. As different factions fought for legitimacy and for acceptance of their methods as standard, they must have thought it advantageous to give their methods virtuous-sounding names. Who would want to be seen as being in favor of bias, inconsistency, inefficiency, inadmissibility, subordination, frailty, powerlessness, or … worst-ness? 
— *location: 4379* ^ref-30468 --- With Bayesian techniques, every analysis has the potential to be a meta-analysis because we are free to take the posterior probabilities from someone else’s work as our prior probabilities for the start of our own. Bayes’ theorem guarantees that we will reach the same conclusions as a meta-analysis combining our results, since, mathematically, for any two propositions A and B, we have P[H  | (A and B) and X] = P[H | A and (B and X)] — *location: 4904* ^ref-11492 --- Dr. Joseph Berkson at the Mayo Clinic wrote in the Journal of the American Statistical Association that the logic of significance testing was flawed because it always interpreted unlikely observations as evidence against a hypothesis. He argued that this could make sense only if there was a cogent alternative: There is no logical warrant for considering an event known to occur in a given hypothesis, even if infrequently, as disproving the hypothesis…. Suppose I said, “Albinos are very rare in human populations, only one in fifty thousand. Therefore, if you have taken a random sample of 100 from a population and found in it an albino, the population is not human.” This is a similar argument but if it were given, I believe the rational retort would be, “If the population is not human, what is it?”1 — *location: 5322* ^ref-17677 --- So why weren’t the critics more successful at dislodging these methods? At least a good portion of the answer is given in Andreski’s argument: statistical methods gave researchers in the “softer” sciences a feeling of objectivity, which they desperately desired as a way to lend quantitative legitimacy to their work. As we saw in chapter 3, that need had been felt in the discipline since the days of Adolphe Quetelet’s “social physics” and had been a key motivation throughout the development of statistics in the 19th and 20th centuries. Objectivity was what frequentism promised, and it found hungry consumers in the worlds of social science. — *location: 5399* ^ref-13590 --- The first real bombshell, though, came in 2005, when John Ioannidis, a professor at Stanford University’s School of Medicine and its Department of Statistics, laid the replication problem at the feet of the orthodox statistical methods, primarily NHST. In an article titled “Why Most Published Research Findings Are False,” he showed in a straightforward Bayesian argument that if a relationship, such as an association between a gene and the occurrence of a disease, had a low prior probability, then even after passing a test for statistical significance, it could still have a low posterior probability of being true.17 — *location: 5464* ^ref-42858 --- In general, assuming a prior probability p for any theory and putting the assumed false positive rate (α) and the assumed false negative rate (β) in an inference table, we would have the results shown in table 6.2, with the observation D being “The observed effect is statistically significant.” TABLE 6.2 General inference given a statistically significant result — *location: 5478* ^ref-9300 --- So an effect that had passed the significance test would have less than a 50 percent chance of being true if the second pathway was less probable than the first. 
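A numerical sketch of the claim just quoted (my own code, not the book’s); the same two-pathway arithmetic also reproduces the disease-testing example from earlier.

```python
# Hedged sketch (not the book's code): the two-pathway calculation behind the
# claim just quoted; the same arithmetic reproduces the earlier disease-testing
# example. p is the prior probability the hypothesis (effect, disease) is real,
# alpha the false positive rate, beta the false negative rate (power = 1 - beta).

def posterior_given_positive(p, alpha, beta):
    true_pathway = p * (1 - beta)      # hypothesis true and the test fires
    false_pathway = (1 - p) * alpha    # hypothesis false and the test fires anyway
    return true_pathway / (true_pathway + false_pathway)

# Disease test from earlier: prior 1 in 10,000, alpha = 0.01, no false negatives.
print(posterior_given_positive(p=1 / 10_000, alpha=0.01, beta=0.0))  # ~0.0099

# Research findings with alpha = 0.05 and power = 0.5, as in the passage:
for p in (0.5, 0.2, 0.09, 0.01):
    print(p, round(posterior_given_positive(p, alpha=0.05, beta=0.5), 3))
# 0.909, 0.714, 0.497, 0.092: the posterior drops below one half exactly when
# p / (1 - p) < alpha / (1 - beta) = 0.1, the "less than 10 percent" threshold.
```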
In terms of the quantities in table 6.2, this would happen if p × (1 − β) < (1 − p) × α, or equivalently if the prior odds p/(1 − p) were less than α/(1 − β). Since α was usually taken to be 5 percent for most significance tests and a typical test might have a false negative rate around 50 percent, this meant most published research findings would be false if the prior ratio of true to false effects was anything less than 10 percent. — *location: 5483* ^ref-18497 --- In other words, collecting data samples is supposed to lead us toward the conclusion we would reach if, ideally, we had access to the whole population. But without doing any research at all, we know from the beginning that certain kinds of null hypotheses, when applied to the whole population, are almost surely false. So what’s the point of doing the sample? — *location: 5520* ^ref-31926 --- In 1968, David Lykken of the University of Minnesota called this the “ambient noise level of correlations.”20 He and Meehl demonstrated it with an analysis of 57,000 questionnaires that had been filled out by Minnesota high school students. The survey included a wide range of questions about the students’ families, leisure activities, attitudes toward school, extracurricular organizations, etc. The two found that, of the 105 possible cross-tabulations of variables, every single association was statistically significant, and 101 (96 percent) of them had p-values less than 0.000001. So, for example, birth order (oldest, youngest, middle, only child) was significantly associated with religious views and also with family attitudes toward college, interest in cooking, membership in farm youth clubs, occupational plans after school, and so on. Meehl called this the “crud factor,” meaning the general observation that “in psychology and sociology everything correlates with everything.”21 But as Meehl emphasized, these were not simply results obtained purely by chance: “These relationships are not, I repeat, Type I errors. They are facts about the world, and with N = 57,000 they are pretty stable. Some are theoretically easy to explain, others more difficult, others completely baffling. The ‘easy’ ones have multiple explanations, sometimes competing, usually not. Drawing theories from a pot and associating them whimsically with variable pairs would yield an impressive batch of H0-refuting ‘confirmations.’”22 That is, any one of these 105 findings could, according to standard practice, be wrapped in a theory and published in a journal. — *location: 5526* ^ref-20470 --- As larger samples become easier to collect, these kinds of small-effect results can be expected more and more. For example, a 2013 study on more than 19,000 participants showed that people who met their spouses online tended to have higher reported rates of marital satisfaction than those who met in person, with a tiny p-value of 0.001. It sounds like an impressive, and very topical, result until you see that the observed difference was minuscule: an average “happiness score” of 5.64 versus 5.48 on a 7-point scale—that is, less than a 3 percent relative improvement.23 — *location: 5541* ^ref-45387 >Importance of reporting effect size --- All of this confusion serves to underscore the point that hypothesis testing is meaningless without alternatives against which to test. When the hypothesis that a population correlation is exactly 0 or that a population proportion is exactly 1/2 is tested against its simple negation—that the correlation is not 0 or that the proportion is not 1/2—the null hypothesis will always lose if the amount of data is large enough.
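A quick sketch of that last point (illustrative, not from the book): test the point null of exactly 1/2 against a true proportion of 0.51 and the p-value collapses as the sample grows, even though the effect is trivial.

```python
# Hedged sketch (illustrative, not from the book): a point null of exactly 1/2
# always loses eventually if the true proportion is even slightly off (0.51 here),
# once the sample is large enough.
import math

def two_sided_p_value(observed_p, n, null_p=0.5):
    """Approximate z-test p-value when the sample proportion comes out at observed_p."""
    se = math.sqrt(null_p * (1 - null_p) / n)
    z = abs(observed_p - null_p) / se
    return math.erfc(z / math.sqrt(2))  # two-sided normal tail probability

for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(0.51, n))
# n = 100:       p ~ 0.84  (nowhere near "significant")
# n = 10,000:    p ~ 0.046 (just crosses the 0.05 line)
# n = 1,000,000: p ~ 5e-89 (overwhelming "significance" for a trivial 1% effect)
```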
But that’s no surprise because these hypotheses should have basically 0 prior probability anyway. Instead, we need to give hypotheses a fighting chance by stating them such that their prior probability is not 0 or, even better, to treat hypotheses on a continuum and assign prior and posterior probability distributions. — *location: 5564* ^ref-60170 --- That is, as first articulated by psychologist Edwin Boring in 1919,25 a scientific hypothesis is never just a statistical hypothesis—that two statistics in the population are different from each other, that two variables are correlated, that a treatment has some nonzero effect—but also an attempt at explaining why, by how much, and why it matters. Forgetting this is what Stephen Ziliak and Deirdre McCloskey in The Cult of Statistical Significance (2008) called “the Error of the Third Kind.” As they put it, “Statistical significance is not a scientific test. It is a philosophical, qualitative test. It does not ask how much. It asks ‘whether.’ Existence, the question of whether, is interesting. But it is not scientific.”26 — *location: 5572* ^ref-8396 --- So the common practice, in force from about 1930 to the present, of judging whether research findings were publication worthy based only on statistical significance created the possibility that two kinds of bad scientific research could enter the literature. One was a simple Type I error, where, by a fluke of random sampling, data was obtained that passed a threshold of significance despite there being no real effect present; this could be expected more often as researchers were sifting through many possible associations until they found one that worked, per Ioannidis’s dire predictions. The other possibility was a Type III error, where the effect was real in a statistical sense but did not actually support the scientific theory it was supposed to, perhaps because the sample was so large that the procedure found a tiny effect of little to no scientific value. It could be that another factor, something specific to that experiment and not thought of by the researcher, could explain the finding in a way that made it of no practical use to anyone else. Older and younger siblings in Minnesota high schools in 1966 might have had genuinely different feelings about college when asked a certain way, but that’s only scientifically meaningful if the result generalizes beyond that one particular time and place. — *location: 5583* ^ref-53474 --- In August 2015, Nosek and his team of 270 collaborators released the results of their psychology replication project. Of the 100 papers they studied, 97 had originally claimed to discover a significant effect. The replication studies used the original materials when possible and large enough sample sizes that they would have high power (at least 80 percent) to detect the effects that were claimed. The experimental protocols were all reviewed and approved by the original authors. They found they were able to replicate only 35 of these 97 results (36 percent), which they defined as achieving a statistically significant effect in the same direction as the original.55 Of those effects they did replicate, they found the average size of the effect to be about half the original. — *location: 5929* ^ref-53883 --- In March 2019, an article in Nature cosigned by more than 800 research scientists called for an end to the concept of statistical significance altogether. 
In the authors’ words, “Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.”81 It was, by their estimates, an incredibly widespread problem. A tally of 791 articles in five journals showed that roughly half had wrongfully interpreted a lack of significance as confirmation of the null. — *location: 6062* ^ref-8555 --- In March 2019, the editors of the American Statistician, including Ron Wasserstein, the executive director of the ASA, published a special issue of the journal titled “Statistical Inference in the 21st Century: A World Beyond p < 0.05” with even stronger words of warning. In the introduction to the issue, the editors wrote, “The ASA Statement on P-Values and Statistical Significance stopped just short of recommending that declarations of ‘statistical significance’ be abandoned. We take that step here. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term ‘statistically significant’ entirely. Nor should variants such as ‘significantly different,’ ‘p < 0.05,’ and ‘nonsignificant’ survive, whether expressed in words, by asterisks in a table, or in some other way.”83 — *location: 6071* ^ref-16882 --- What this means potentially is that there’s an altogether unseen crisis of replication hiding in the shadows: the failure of failed experiments to fail again. The (bad) replication crisis we know about is of Type I; there might be a wholly different (good) crisis of Type II. There might be a rich vein of results just waiting to be mined. — *location: 6093* ^ref-1246 --- These failures of replication are compounded by the fact that effect sizes even among the findings that do replicate are generally found to be substantially smaller than what the original studies claimed. — *location: 6120* ^ref-62014 --- Among the many causes of the replication crisis is the problem of multiple comparisons—the ability of one or more researchers to sort through the possible associations present in the data until one pops out as being significant just by chance—but Bayesian analysis has a built-in safeguard: the prior probability. — *location: 6174* ^ref-61059 --- Bayesian analysis also takes care of the “crud factor,” the tiny background correlations that tend to exist between any two measured variables and can show up as statistically significant in large enough samples. Instead of just classifying results as significant or insignificant, the Bayesian posterior distribution always properly reports the likely size of any claimed effect or association. — *location: 6183* ^ref-61757 --- Bayesian analysis does not chase after the question of “whether” but stays grounded in “how much” and “how likely.” — *location: 6187* ^ref-27284 --- The better, more complete interpretation of probability is that it measures the plausibility of a proposition given some assumed information. 
— *location: 6273* ^ref-2236 --- This extends the notion of deductive reasoning—in which a proposition is derivable as a logical consequence of a set of premises—to situations of incomplete information, where the proposition is made more or less plausible, depending on what is assumed to be known. Deductive reasoning, in this framework, is probability with 1s and 0s. Or viewed another way, probability is deductive reasoning with uncertainty. — *location: 6274* ^ref-47782 --- But the good news is that we don’t have to evaluate every expression analytically if a numerical approximation will suffice. Modern computational techniques make this kind of thing a snap. If we need to compute a tough integral, we can either divide up the parameter space into some number of grid points and use them to approximate the integral, or if that proves inadequate, we can use Markov Chain Monte Carlo methods to simulate random variables with the appropriate distributions. — *location: 6590* ^ref-27828 --- What we can do is try (and most often fail) to be honest about the factors that influence us and avoid serving any unjust masters who would push us toward whatever research conclusions suit them best. The eugenics movement, for example, should be understood as a cautionary tale about the dangers of failing to do this introspection while attempting to sail under the flag of objectivity. In other words, we should try to be objective—not in the impossible sense that Galton, Pearson, and Fisher claimed granted them unquestionable authority but in the way they failed to be when they let the political interests of the ruling class dictate the outcome of their research before it began. — *location: 6629* ^ref-46503 --- Even if we express our theories in the most precise technical language and back them with the most exact measurements, we cannot escape the fact that all science is a human enterprise and is therefore subject to human desire, prejudice, consensus, and interpretation. — *location: 6627* ^ref-21181 ---
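A minimal sketch of the grid-approximation route mentioned in the computational highlight above (my own toy example, not the book’s):

```python
# Hedged sketch (my own toy example, not the book's): a grid approximation of a
# posterior, the simpler of the two numerical routes mentioned above. Problem:
# infer a coin's heads probability theta from 7 heads in 10 flips, uniform prior.

GRID = [i / 1000 for i in range(1001)]  # grid over theta in [0, 1]

def likelihood(theta, heads=7, flips=10):
    """Sampling probability of the data given theta (binomial, constant dropped)."""
    return theta ** heads * (1 - theta) ** (flips - heads)

unnormalized = [1.0 * likelihood(t) for t in GRID]   # uniform prior times likelihood
total = sum(unnormalized)                            # grid stand-in for the integral P[D]
posterior = [u / total for u in unnormalized]

mean = sum(t * w for t, w in zip(GRID, posterior))
p_theta_above_half = sum(w for t, w in zip(GRID, posterior) if t > 0.5)
print(round(mean, 3), round(p_theta_above_half, 3))  # ~0.667 and ~0.89
```

When the parameter space is too large to enumerate this way, Markov Chain Monte Carlo plays the same role by drawing samples from the posterior rather than tabulating it.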