## Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors

> [!Abstract]-
> Statistical power analysis provides the conventional approach to assess error rates when designing a research study. However, power analysis is flawed in that a narrow emphasis on statistical significance is placed as the primary focus of study design. In noisy, small-sample settings, statistically significant results can often be misleading. To help researchers address this problem in the context of their own studies, we recommend design calculations in which (a) the probability of an estimate being in the wrong direction (Type S [sign] error) and (b) the factor by which the magnitude of an effect might be overestimated (Type M [magnitude] error or exaggeration ratio) are estimated. We illustrate with examples from recent published research and discuss the largest challenge in a design calculation: coming up with reasonable estimates of plausible effect sizes based on external information.

> [!Cite]-
> Gelman, Andrew, and John Carlin. “Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors.” _Perspectives on Psychological Science_ 9, no. 6 (November 2014): 641–51. [https://doi.org/10.1177/1745691614551642](https://doi.org/10.1177/1745691614551642).

## Notes
%% begin notes %%

%% end notes %%

%% begin annotations %%

## Imported: 2024-08-15 6:48 am

In this article, we examine some critical issues related to power analysis and the interpretation of findings arising from studies with small sample sizes. We highlight the use of external information to inform estimates of true effect size and propose what we call a design analysis—a set of statistical calculations about what could happen under hypothetical replications of a study—that focuses on estimates and uncertainties rather than on statistical significance.

In general, the larger the effect, the higher the power; thus, power calculations are performed conditionally on some assumption of the size of the effect. We suggest that design calculations be performed after as well as before data collection and analysis.

We frame our calculations not in terms of Type 1 and Type 2 errors but rather Type S (sign) and Type M (magnitude) errors, which relate to the probability that claims with confidence have the wrong sign or are far in magnitude from underlying effect sizes (Gelman & Tuerlinckx, 2000).

Design calculations, whether prospective or retrospective, should be based on realistic external estimates of effect sizes. This is not widely understood because it is common practice to use estimates from the current study’s data or from isolated reports in the literature, both of which can overestimate the magnitude of effects. The idea that published effect-size estimates tend to be too large, essentially because of publication bias, is not new (Hedges, 1984; Lane & Dunlap, 1978; for a more recent example, also see Button et al., 2013). One practical implication of realistic design analysis is to suggest larger sample sizes than are commonly used in psychology.

Effect-size estimates based on preliminary data (either within the study or elsewhere) are likely to be misleading because they are generally based on small samples, and when the preliminary results appear interesting, they are most likely biased toward unrealistically large effects (by a combination of selection biases and the play of chance; Vul, Harris, Winkielman, & Pashler, 2009).

We have implemented these calculations in an R function, retrodesign(). The inputs to the function are D (the hypothesized true effect size), s (the standard error of the estimate), α (the statistical significance threshold; e.g., .05), and df (the degrees of freedom). The function returns three outputs: the power, the Type S error rate, and the exaggeration ratio, all computed under the assumption that the sampling distribution of the estimate is t with center D, scale s, and df degrees of freedom.
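A minimal R sketch consistent with this description (a reconstruction for experimentation, not necessarily the authors' published code), assuming a positive hypothesized effect D, a default of infinite degrees of freedom (normal sampling distribution), and a simulation-based estimate of the exaggeration ratio:

```r
retrodesign <- function(D, s, alpha = 0.05, df = Inf, n.sims = 10000) {
  z <- qt(1 - alpha / 2, df)           # critical value on the scale of the estimate
  p.hi <- 1 - pt(z - D / s, df)        # P(estimate significantly positive | true effect D)
  p.lo <- pt(-z - D / s, df)           # P(estimate significantly negative | true effect D)
  power <- p.hi + p.lo
  type.s <- p.lo / power               # Type S: wrong sign, given statistical significance
  estimate <- D + s * rt(n.sims, df)   # simulate estimates from the assumed t sampling distribution
  significant <- abs(estimate) > s * z
  exaggeration <- mean(abs(estimate[significant])) / D   # Type M: exaggeration ratio
  list(power = power, type.s = type.s, exaggeration = exaggeration)
}
```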
Whether considering study design and (potential) results prospectively or retrospectively, it is vitally important to synthesize all available external evidence about the true effect size. In the present article, we have focused on design analyses with assumptions derived from systematic literature review. In other settings, postulated effect sizes could be informed by auxiliary data, meta-analysis, or a hierarchical model. When it is difficult to find any direct literature, a broader range of potential effect sizes can be considered. For example, heavy cigarette smoking is estimated to reduce life span by about 8 years (see, e.g., Streppel, Boshuizen, Ocke, Kok, & Kromhout, 2007). Therefore, if the effect of some other exposure is being studied, it would make sense to consider much lower potential effects in the design calculation. For example, Chen, Ebenstein, Greenstone, and Li (2013) reported the results of a recent observational study in which they estimated that a policy in part of China has resulted in a loss of life expectancy of 5.5 years with a 95% confidence interval of (0.8, 10.2). Most of this interval (certainly the high end) is implausible and is more easily explained as an artifact of correlations in their data having nothing to do with air pollution.

We have provided a tool for performing design analysis given information about a study and a hypothesized population difference or effect size. Our goal in developing this software is not so much to provide a tool for routine use but rather to demonstrate that such calculations are possible and to allow researchers to play around and get a sense of the sizes of Type S errors and Type M errors in realistic data settings. Our recommended approach can be contrasted to existing practice in which p values are taken as data summaries without reference to plausible effect sizes.

In this article, we have focused attention on the dangers arising from not using realistic, externally based estimates of true effect size in power/design calculations. It is not sufficiently well understood that “significant” findings from studies that are underpowered (with respect to the true effect size) are likely to produce wrong answers, both in terms of the direction and magnitude of the effect. There is a range of evidence to demonstrate that it remains the case that too many small studies are done and preferentially published when “significant.”

There is a common misconception that if you happen to obtain statistical significance with low power, then you have achieved a particularly impressive feat, obtaining scientific success under difficult conditions. However, that is incorrect if the goal is scientific understanding rather than (say) publication in a top journal. In fact, statistically significant results in a noisy setting are highly likely to be in the wrong direction and invariably overestimate the absolute values of any actual effect sizes, often by a substantial factor.

%% end annotations %%

%% Import Date: 2024-08-15T06:49:10.109-06:00 %%
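A worked illustration of the retrospective design analysis described in the annotations above, using the Chen, Ebenstein, Greenstone, and Li (2013) numbers: the reported 95% interval of (0.8, 10.2) implies a standard error of roughly (10.2 - 0.8) / (2 * 1.96) ≈ 2.4 years, and the smoking benchmark suggests entertaining much smaller true effects than the 5.5-year estimate. The hypothesized true effect of 1 year below is an illustrative assumption, not a value taken from the article.

```r
# Standard error implied by the reported 95% CI (0.8, 10.2), using a normal approximation
se <- (10.2 - 0.8) / (2 * 1.96)   # roughly 2.4 years

# Hypothesized true effect of 1 year (illustrative assumption, not from the article)
retrodesign(D = 1, s = se)
# Returns the power, Type S error rate, and exaggeration ratio under this assumption
```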