Background: Statistical testing in phase III clinical trials is subject to chance errors, which can lead to false conclusions with substantial clinical and economic consequences for patients and society. Methods: We collected summary data for the primary endpoints of overall survival (OS) and progression-related survival (PRS) (eg, time to other type of event) for industry-sponsored, randomized, phase III superiority oncology trials from 2008 through 2017. Using an empirical Bayes methodology, we estimated the number of false-positive and false-negative errors in these trials and the errors under alternative P value thresholds and/or sample sizes. Results: We analyzed 187 OS and 216 PRS endpoints from 362 trials. Among 56 OS endpoints that achieved statistical significance, the true efficacy of experimental therapies failed to reach the projected effect size in 33 cases (58.4% false-positives). Among 131 OS endpoints that did not achieve statistical significance, the true efficacy of experimental therapies reached the projected effect size in 1 case (0.9% false-negatives). For PRS endpoints, there were 34 (24.5%) false-positives and 3 (4.2%) false-negatives. Applying an alternative P value threshold and/or sample size could reduce false-positive errors and slightly increase false-negative errors. Conclusions: Current statistical approaches detect almost all truly effective oncologic therapies studied in phase III trials, but they generate many false-positives. Adjusting testing procedures in phase III trials is numerically favorable but practically infeasible. The root of the problem is the large number of ineffective therapies being studied in phase III trials. Innovative strategies are needed to efficiently identify which new therapies merit phase III testing.
Submitted September 14, 2020; final revision received November 29, 2020; accepted for publication November 30, 2020.
Published online June 21, 2021.
Author contributions: Study design: Shen. Data analysis: Shen, Xu. Data interpretation: Shen, Ferro. Writing – original draft: Shen, Ferro. Writing – review and editing: Shen, Ferro, Kramer, Patell, Kazi. Critical feedback for further interpretation of results: Shen, Ferro, Kramer, Patell, Kazi. Final approval: All authors.
Disclosures: The authors have disclosed that they have not received any financial consideration from any person or organization to support the preparation, analysis, results, or discussion of this article.