The New England Journal of Medicine Revises Its Statistical Reporting Guidelines, Downgrading the Role of P Values

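In its latest issue, the editors of the New England Journal of Medicine published an editorial, "New Guidelines for Statistical Reporting in the Journal," announcing a reduced role for P values in multiple comparisons.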

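P values have been widely disputed in recent years. Fisher, a founder of modern biostatistics, introduced the concept in the 1920s: if, assuming the null hypothesis is true, the probability of observing the current statistic or something more extreme is less than one in twenty, we reject the null hypothesis. Then, almost by accident, it became mainstream. The concept and its application are now nearly a century old, and they dominate the vast majority of applied statistical research.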

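But challenges to the P value have never stopped. Why must the threshold be 0.05? Does a result of 0.06 carry no statistical significance? Is 0.05 too lenient? The most representative recent examples are the American Statistical Association's 2016 statement on P values and statistical significance, and its 2019 editorial in The American Statistician, "Moving to a World Beyond 'p < 0.05'," which recommended abandoning the label "statistically significant." In addition, Daniel Benjamin and colleagues published a paper arguing for lowering the significance threshold to 0.005, and Rothman KJ, a leading contemporary epidemiologist and author of Modern Epidemiology, recommends replacing P values with confidence intervals.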

The P value is facing its hundred-year test.

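Broadly, epidemiologists and biostatisticians have reached a basic consensus: a P value on its own does not carry enough information to convey the effect observed in a medical study. The alternatives are to report confidence intervals alongside P values, or to use Bayesian methods that yield a probability.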

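The New England Journal of Medicine has now responded to these challenges. It revised its requirements for statistical analysis and reduced the role of the P value. The editorial of July 17, 2019, specifically states: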

“The new guidelines discuss many aspects of the reporting of studies in the Journal, including a requirement to replace P values with estimates of effects or association and 95% confidence intervals when neither the protocol nor the statistical analysis plan has specified methods used to adjust for multiplicity.”

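In other words: for clinical research, and clinical trials in particular, if no method for handling multiple comparisons was specified when the protocol was drawn up, then comparisons made after the fact no longer carry P values; point estimates and confidence intervals are reported instead.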

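The Journal's revised policy on P values rests on several premises. First, where an analysis plan was specified in advance, adhering to it matters. Second, using statistical thresholds to claim an effect or association should be limited to analyses for which the analysis plan specified a method for controlling type I error. Third, evidence about the benefits or harms of a treatment or exposure should include point estimates and their margins of error, that is, confidence intervals.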

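These three points tell us that a study design must spell out its statistical analysis plan; that without such a plan, multiple comparisons should not carry P values; and that the assessment of efficacy should include the point estimate and its confidence interval. In articles published by the Journal, then, the P value no longer holds the position it once did.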

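To give readers the full picture of the Journal's statistical guidelines, I translated the latest version by hand. Each passage of the original is followed by my rendering, with occasional comments, for anyone interested.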

The Methods section of all manuscripts should contain a brief description of sample size and power considerations for the study, as well as a brief description of the methods for primary and secondary analyses.

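The Methods section of every manuscript should briefly describe the sample size and power considerations for the study, and briefly describe the methods used for the primary and secondary analyses.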

The Methods section of all manuscripts should include a description of how missing data have been handled. Unless missingness is rare, a complete case analysis is generally not acceptable as the primary analysis and should be replaced by methods that are appropriate, given the missingness mechanism. Multiple imputation or inverse probability case weights can be used when data are missing at random; model-based methods may be more appropriate when missingness may be informative. For the Journal’s general approach to the handling of missing data in clinical trials please see Ware et al (N Engl J Med 2012;367:1353–1354).

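The Methods section of every manuscript should state how missing data were handled. Unless missingness is rare, a complete-case analysis is generally not acceptable as the primary analysis; it should be replaced by a method suited to the missingness mechanism. Multiple imputation or inverse probability weighting can be used when data are missing at random. When missingness may be informative (for example, not missing at random), model-based methods may be more appropriate. For the Journal's approach to missing data, see its 2012 methodology article, Ware et al (N Engl J Med 2012;367:1353–1354).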

Significance tests should be accompanied by confidence intervals for estimated effect sizes, measures of association, or other parameters of interest. The confidence intervals should be adjusted to match any adjustment made to significance levels in the corresponding test.

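Significance tests should be accompanied by confidence intervals for effect sizes, measures of association, or other parameters of interest. The confidence level should be adjusted to match the test and is not tied to 95% (for example, in pairwise comparisons, when the significance level is adjusted, the confidence interval must be adjusted as well).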
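As a rough illustration of matching the confidence level to an adjusted significance level, the sketch below widens a normal-approximation interval under a Bonferroni adjustment. The function name `adjusted_ci`, the choice of Bonferroni, and the normal approximation are all assumptions made for the example; the guideline does not prescribe a particular adjustment.

```python
from statistics import NormalDist


def adjusted_ci(estimate, std_error, n_comparisons=1, alpha=0.05):
    """Normal-approximation CI whose level matches a Bonferroni-adjusted alpha.

    Hypothetical helper for illustration only: with m comparisons each test
    uses alpha / m, so the matching interval has level 1 - alpha / m.
    """
    adjusted_alpha = alpha / n_comparisons
    z = NormalDist().inv_cdf(1 - adjusted_alpha / 2)  # two-sided critical value
    lower = estimate - z * std_error
    upper = estimate + z * std_error
    return lower, upper, 1 - adjusted_alpha


# One comparison gives the usual 95% interval; three comparisons give a wider one.
single = adjusted_ci(1.0, 0.5, n_comparisons=1)
triple = adjusted_ci(1.0, 0.5, n_comparisons=3)
print(single)
print(triple)
```

With three comparisons the interval is reported at roughly the 98.3% level rather than 95%, which is why it is wider for the same estimate and standard error.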

Unless one-sided tests are required by study design, such as in noninferiority clinical trials, all reported P values should be two-sided. In general, P values larger than 0.01 should be reported to two decimal places, and those between 0.01 and 0.001 to three decimal places; P values smaller than 0.001 should be reported as P<0.001. Notable exceptions to this policy include P values arising from tests associated with stopping rules in clinical trials or from genome-wide association studies.

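Apart from a few study designs, such as noninferiority trials, all reported P values should generally be two-sided. In general, P values above 0.01 should be reported to two decimal places (I think three is fine too; it keeps tables tidy), those between 0.01 and 0.001 to three decimal places, and those below 0.001 as P<0.001. There are exceptions, such as P values tied to stopping rules in clinical trials or P values from genome-wide association studies.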
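The rounding rule above is mechanical enough to write down. Here is a minimal sketch; the function name `format_p` and the "P=" prefix are my own choices, and the exceptions the guideline mentions (stopping rules, genome-wide association studies) are not handled.

```python
def format_p(p):
    """Format a two-sided P value following the rounding rule quoted above.

    Hypothetical helper: below 0.001 -> "P<0.001"; from 0.001 up to 0.01 ->
    three decimal places; otherwise two decimal places.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a P value must lie between 0 and 1")
    if p < 0.001:
        return "P<0.001"
    if p < 0.01:
        return f"P={p:.3f}"
    return f"P={p:.2f}"


for value in (0.0004, 0.0042, 0.037, 0.25):
    print(format_p(value))
```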

Results should be presented with no more precision than is of scientific value and is meaningful given the available sample size. For example, measures of association, such as odds ratios, should ordinarily be reported to two significant digits. Results derived from models should be limited to the appropriate number of significant digits.

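Results should be presented with no more precision than is scientifically valuable and meaningful for the available sample size. For example, measures of association such as odds ratios should ordinarily be reported to two significant digits (an odds ratio of 1.5 rather than 1.4732). Results derived from models, such as regression coefficients, standard errors, and confidence limits, should likewise be limited to an appropriate number of significant digits.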
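Rounding to significant digits is different from rounding to decimal places, so a small sketch may help. The helper `round_sig` is hypothetical and only illustrates the arithmetic.

```python
from math import floor, log10


def round_sig(x, digits=2):
    """Round x to the given number of significant digits (illustrative helper)."""
    if x == 0:
        return 0.0
    exponent = floor(log10(abs(x)))  # position of the leading digit
    return round(x, digits - 1 - exponent)


print(round_sig(1.4732))   # an odds ratio reported to two significant digits
print(round_sig(0.04321))  # a small coefficient
print(round_sig(123.4))    # a large value
```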

For clinical trials (additional requirements):

Original and final protocols and statistical analysis plans (SAPs) should be submitted along with the manuscript, as well as a table of amendments made to the protocol and SAP indicating the date of the change and its content.

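The original and final protocols and the corresponding statistical analysis plans (SAPs) should be submitted with the manuscript, together with a table of the amendments made to the protocol and SAP during the study, giving the date and content of each change.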

The analyses of the primary outcome in manuscripts reporting results of clinical trials should match the analyses prespecified in the original protocol, except in unusual circumstances. Analyses that do not conform to the protocol should be justified in the Methods section of the manuscript. The editors may ask for additional analyses that are not specified in the protocol.

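The primary-outcome analyses in a clinical trial manuscript should follow the analyses prespecified in the original protocol, except in unusual circumstances. Any analysis that departs from the protocol should be justified in the Methods section. The editors may also ask for additional analyses that were not specified in the protocol.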

When comparing outcomes in two or more groups in confirmatory analyses, investigators should use the testing procedures specified in the protocol and SAP to control overall type I error — for example, Bonferroni adjustments or prespecified hierarchical procedures. P values adjusted for multiplicity should be reported when appropriate and labeled as such in the manuscript. In hierarchical testing procedures, P values should be reported only until the last comparison for which the P value was statistically significant. P values for the first nonsignificant comparison and for all comparisons thereafter should not be reported. For prespecified exploratory analyses, investigators should use methods for controlling false discovery rate described in the SAP — for example, Benjamini–Hochberg procedures.

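In confirmatory analyses that compare two or more groups, investigators should use the procedures specified in the protocol and SAP to control the overall type I error, such as Bonferroni adjustments or prespecified hierarchical procedures (for example, fixed-sequence testing). P values adjusted for multiplicity should be reported when appropriate and labeled as such. With a hierarchical procedure, P values should be reported only up to the last comparison that was statistically significant; the P value for the first nonsignificant comparison, and for every comparison after it, should not be reported. (What does this mean? In a confirmatory trial, pairwise comparisons may follow a prespecified order. With three groups, say, group 1 is first compared with group 2; if that is significant, group 1 is compared with group 3; if that is not significant, group 2 is not compared with group 3 at all. P values are therefore reported only up to the last significant comparison.) For prespecified exploratory analyses, investigators should control the false discovery rate using the method described in the SAP, for example the Benjamini–Hochberg procedure.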
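The three procedures named in this passage can be sketched in a few lines each. The code below is a simplified illustration, not the Journal's method: the function names are hypothetical, the fixed-sequence version assumes every hypothesis is tested at the full alpha in its prespecified order, and a real trial would follow whatever the SAP defines.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted P values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def fixed_sequence_reportable(p_values, alpha=0.05):
    """P values that may be reported under a fixed-sequence hierarchy.

    p_values must be in the prespecified testing order. Reporting stops at
    the first nonsignificant comparison, which is itself not reported.
    """
    reportable = []
    for p in p_values:
        if p > alpha:
            break
        reportable.append(p)
    return reportable


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


print(bonferroni([0.01, 0.02, 0.5]))
print(fixed_sequence_reportable([0.01, 0.03, 0.20, 0.001]))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

In the fixed-sequence example the fourth P value (0.001) is never reported, even though it is small, because the third comparison in the sequence failed.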

When no method to adjust for multiplicity of inferences or controlling false discovery rate was specified in the protocol or SAP of a clinical trial, the report of all secondary and exploratory endpoints should be limited to point estimates of treatment effects with 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.

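If the protocol or SAP of a clinical trial did not specify a method for adjusting for multiplicity or for controlling the false discovery rate, then all secondary and exploratory endpoints should be reported only as point estimates of treatment effects with 95% confidence intervals. In that case, the Methods section should state that the interval widths have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses. (This is the key change in the latest version of the Journal's guidelines: for analyses whose statistical method was not prespecified, reporting P values is no longer recommended.)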
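To make "point estimate with a 95% confidence interval, no P value" concrete, here is a minimal sketch for a difference in event proportions between two arms. The function name and the example counts are made up, and the Wald (normal-approximation) interval is an assumption chosen for brevity; other interval methods may suit small samples better.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(events_a, n_a, events_b, n_b, level=0.95):
    """Risk difference (arm A minus arm B) with an unadjusted Wald interval."""
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    difference = risk_a - risk_b
    std_error = sqrt(risk_a * (1 - risk_a) / n_a + risk_b * (1 - risk_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return difference, difference - z * std_error, difference + z * std_error


# Hypothetical secondary endpoint: 30/200 events versus 45/200 events.
estimate, lower, upper = risk_difference_ci(30, 200, 45, 200)
print(f"Risk difference {estimate:.3f} (95% CI, {lower:.3f} to {upper:.3f})")
```

The interval is deliberately left unadjusted; under the guideline the Methods section would say so, and no P value would accompany it.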

Please see Wang et al (N Engl J Med 2007;357:2189–2194) on recommended methods for analyzing subgroups. When the SAP prespecifies an analysis of certain subgroups, that analysis should conform to the method described in the SAP. If the study team believes a post hoc analysis of subgroups is important, the rationale for conducting that analysis should be stated. Post hoc analyses should be clearly labeled as post hoc in the manuscript.

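See Wang et al (N Engl J Med 2007;357:2189–2194) for recommended methods of subgroup analysis. When the SAP prespecifies an analysis of certain subgroups, that analysis must follow the method described in the SAP. If the study team considers a post hoc subgroup analysis important, it must state the rationale, and such analyses must be clearly labeled as post hoc in the manuscript.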

Forest plots are often used to present results from an analysis of the consistency of a treatment effect across subgroups of factors of interest. Such plots can be a useful display of estimated treatment effects across subgroups, and the editors recommend that they be included for important subgroups. If subgroups are small, however, formal inferences about the homogeneity of treatment effects may not be feasible. A list of P values for treatment by subgroup interactions is subject to the problems of multiplicity and has limited value for inference. Therefore, in most cases, no P values for interaction should be provided in the forest plots.

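Forest plots are commonly used to show how consistent a treatment effect is across subgroups. They are a useful display of the estimated effects in each subgroup, and the editors recommend including them for important subgroups. If subgroups are small, however, formal inference about the homogeneity of treatment effects may not be feasible. A list of P values for treatment-by-subgroup interactions runs into the multiplicity problem and has limited value for inference, so in most cases no interaction P values should be shown in forest plots.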

If significance tests of safety outcomes (when not primary outcomes) are reported along with the treatment-specific estimates, no adjustment for multiplicity is necessary. Because information contained in the safety endpoints may signal problems within specific organ classes, the editors believe that the type I error rates larger than 0.05 are acceptable. Editors may request that P values be reported for comparisons of the frequency of adverse events among treatment groups, regardless of whether such comparisons were prespecified in the SAP.

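If significance tests of safety outcomes (when these are not primary outcomes) are reported alongside the treatment-specific estimates, no adjustment for multiplicity is needed. Because safety endpoints may signal problems within specific organ classes, the editors consider type I error rates above 0.05 acceptable; a somewhat higher false-positive rate can be tolerated. The editors may request P values for comparisons of adverse-event frequencies between treatment groups, whether or not those comparisons were prespecified in the SAP.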

When possible, the editors prefer that absolute event counts or rates be reported before relative risks or hazard ratios. The goal is to provide the reader with both the actual event frequency and the relative frequency. Odds ratios should be avoided, as they may overestimate the relative risks in many settings and be misinterpreted.

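Where possible, the editors prefer that absolute event counts or rates be reported before relative risks or hazard ratios, so that the reader sees both the actual event frequency and the relative frequency. Odds ratios should be avoided, because in many settings they overestimate the relative risk and may be misinterpreted.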
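A small worked example shows why the odds ratio can mislead when events are common. The counts and the function name `summarize_two_arms` are invented for illustration.

```python
def summarize_two_arms(events_treated, n_treated, events_control, n_control):
    """Absolute risks first, then the relative risk and the odds ratio."""
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    odds_treated = events_treated / (n_treated - events_treated)
    odds_control = events_control / (n_control - events_control)
    return {
        "risk_treated": risk_treated,
        "risk_control": risk_control,
        "relative_risk": risk_treated / risk_control,
        "odds_ratio": odds_treated / odds_control,
    }


# Hypothetical trial: 40/100 events in one arm, 20/100 in the other.
summary = summarize_two_arms(40, 100, 20, 100)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the risks are 40% and 20%, so the relative risk is 2.0, yet the odds ratio is about 2.67. A reader who takes the odds ratio for a relative risk would overstate the effect, and the gap grows as the event becomes more common.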

Authors should provide a flow diagram in CONSORT format. The editors also encourage authors to submit all the relevant information included in the CONSORT checklist. Although all of this information may not be published with the manuscript, it should be provided in either the manuscript or a supplementary appendix at the time of submission. The CONSORT statement, checklist, and flow diagram are available on the CONSORT website.

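Authors should provide a flow diagram in CONSORT format. The editors also encourage authors to submit all the relevant information in the CONSORT checklist. Not all of it need be published with the article, but it should be provided at submission, either in the manuscript or in a supplementary appendix. The CONSORT statement, checklist, and flow diagram are available on the CONSORT website.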

For observational studies (additional requirements):

The validity of findings from observational studies depends on several important assumptions, including those relating to sample selection, measured and unmeasured confounding, and the adequacy of methods used to control for confounding. The Methods section of observational studies should describe how these and other relevant issues were managed in the design and analysis.

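The validity of findings from observational studies rests on several important assumptions, including those about sample selection, measured and unmeasured confounding, and the adequacy of the methods used to control confounding. The Methods section should therefore describe how these and other relevant issues were handled in the design and analysis.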

If an observational study included a prespecified SAP with a description of hypotheses to be tested, a signed and dated version of that plan should be included with the manuscript submission. The Journal encourages authors to deposit SAPs for observational studies in one of the online repositories designed for this purpose.

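If an observational study had a prespecified SAP describing the hypotheses to be tested, a signed and dated version of that plan should be submitted with the manuscript. The Journal encourages authors to deposit SAPs for observational studies in an online repository designed for the purpose.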

When appropriate, observational studies should use prespecified accepted methods for controlling family-wise error rate or false discovery rate when multiple tests are conducted. In manuscripts reporting observational studies without a prespecified method for error control, summary statistics should be limited to point estimates and 95% confidence intervals. In such cases, the Methods section should note that the widths of the intervals have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. No P values should be reported for these analyses.

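Where appropriate, observational studies that conduct multiple tests should use prespecified, accepted methods to control the family-wise error rate or the false discovery rate. If no method of error control was prespecified, summary statistics should be limited to point estimates and 95% confidence intervals, and the Methods section should note that the interval widths have not been adjusted for multiplicity and that the inferences drawn may not be reproducible. Again, no P values should be reported for these analyses.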

If no prespecified analysis plan exists, the Methods section should provide an outline for the planned method of analysis, including

o Eligibility criteria for the selection of cases and method of sampling from the data, with a diagram as appropriate.

o A description of the association or causal effect to be estimated and the rationale for this choice.

o The prespecified method of analysis to draw inference about treatment or exposure effect or association.

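If there was no prespecified analysis plan, the Methods section should outline the planned method of analysis, including:

o Eligibility criteria for selecting cases and the method of sampling from the data, with a diagram where appropriate.

o A description of the association or causal effect to be estimated and the rationale for this choice.

o The prespecified method of analysis used to draw inferences about the treatment or exposure effect or association.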

Studies reporting the effect of a treatment or exposure should show the distribution of potential confounders and other variables, stratified by exposure or intervention group. When the analysis depends on the confounders being balanced by exposure group, differences between groups should be summarized with point estimates and 95% confidence intervals when appropriate.

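Studies reporting the effect of a treatment or exposure should show the distribution of potential confounders and other variables, stratified by exposure or intervention group. When the analysis depends on the confounders being balanced between exposure groups, the between-group differences should be summarized with point estimates and 95% confidence intervals where appropriate.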

Complex models and their diagnostics can often be best described in a supplementary appendix. Authors are encouraged to conduct an analysis that quantifies potential sensitivity to bias from unmeasured confounding; absent that, authors must provide a discussion of potential biases induced by unmeasured confounders.

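Complex models and their diagnostics are often best described in a supplementary appendix. Authors are encouraged to run an analysis that quantifies how sensitive the results are to bias from unmeasured confounding; failing that, they must discuss the potential biases introduced by unmeasured confounders.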
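The guideline does not name a specific sensitivity analysis. One widely used option, offered here only as an assumption about what such an analysis might look like, is the E-value of VanderWeele and Ding: the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both exposure and outcome to fully explain away an observed risk ratio. The function name is hypothetical.

```python
from math import sqrt


def e_value(risk_ratio):
    """E-value for an observed risk ratio (point estimate only).

    Uses E = RR + sqrt(RR * (RR - 1)) for RR >= 1; protective estimates
    (RR < 1) are inverted first so the same formula applies.
    """
    if risk_ratio <= 0:
        raise ValueError("a risk ratio must be positive")
    if risk_ratio < 1:
        risk_ratio = 1 / risk_ratio
    return risk_ratio + sqrt(risk_ratio * (risk_ratio - 1))


print(f"{e_value(2.0):.2f}")  # observed RR of 2.0
print(f"{e_value(0.5):.2f}")  # the mirror-image protective effect
```

A larger E-value means a stronger unmeasured confounder would be required to account for the finding; the same calculation can be applied to the confidence limit closer to 1.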

Authors are encouraged to retest findings in a similar but independent study or studies to assess the robustness of their findings.

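Authors are encouraged to retest their findings in a similar but independent study or studies to assess how robust the findings are.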
