U.S. Department of Health and Human Services
Agency for Healthcare Research and Quality
www.ahrq.gov

Use of Bayesian Techniques in Randomized Clinical Trials: A CMS Case Study

Disposition of Comments

Project ID: STAB0508


The Agency for Healthcare Research and Quality's (AHRQ) Technology Assessment (TA) Program supports and is committed to the transparency of its review process. Therefore, invited peer review comments and public review comments are publicly posted on the TA Program Web site at http://www.ahrq.gov/clinic/techix.htm within 3 months after the associated final report is posted on this Web site.

This document presents the peer review comments and public review comments sent in response to the draft report, Use of Bayesian Techniques in Randomized Clinical Trials: A CMS Case Study, posted on the AHRQ Web site from June 25 to July 17, 2009. The final version of the report is available online.

Select for Table 1: Invited Peer Reviewer Comments (PDF file, 84 KB).
Select for Table 2: Public Review Comments (PDF file, 97 KB).


Contents

Select for Table 1: Invited Peer Reviewer Comments

Select for Table 2: Public Review Comments


Table 1: Invited Peer Reviewer Comments


Reviewer¹ | Section² | Reviewer Comments | Author Response³
Peer1 General This is an excellent and thoughtful assessment of the value of Bayesian methods to CMS policymaking. The report is generally readable and often eloquent. The issues raised are important ones; particularly as both public and private-sector policymakers frequently face critical tasks of synthesizing disparate sources of information into coherent policy choices. The example of the implantable cardioverter-defibrillator is entirely apt and is likely to be repeated. This report provides both an excellent historical summary of the ICD coverage "story," as well as a highly illustrative example of how data from the numerous ICD clinical trials could have been used in a Bayesian approach to better inform the CMS coverage decisions of 2003 and 2005. We thank the reviewer for their comment.
Peer1 General In terms of potential general improvements to the report, I first of all wonder if the authors would consider reducing the number and complexity of their tables and figures. While they have done well to strive for completeness, there is a tremendous volume of information in the tables/figures to digest. Perhaps some of the more detailed tables (e.g. Table 7) could be reserved for the Appendix, and some of this tabular data could be presented graphically? Perhaps some of the graphs (e.g., Fig. 11) could be simplified to line graphs? I am also left wondering about the nuts-and-bolts aspect of how CMS, FDA, and other agencies would actually use Bayesian methods to achieve the suggested aims proposed by the authors. As mentioned in the text, some of the barriers to implementation of Bayesian analyses involve access to all relevant sources of data, expertise in the relevant statistical methods and software, consensus regarding prior distributions, and consensus regarding interpretation of posteriors. Assuming CMS agreed to implement Bayesian analyses for future "high stakes" coverage decisions, how (politics aside) might such a process work?

We agree with the reviewer that the information in Chapter 5 is more complex than the rest of the report. We felt that this level of complexity was needed to accurately portray the use of Bayesian techniques in the CMS context. Based on previous feedback we have moved much of the detail from the case study to the Appendix and a statistical manuscript. In the current Chapter 5 we attempt to ease the burden on the reader by including "key points" and clinical questions/answers. Following reviewers' suggestions, however, we have further revised and simplified our figures (for example, the Kaplan-Meier curves are now lines, and the figures with estimates and CIs of hazard ratios are now all oriented vertically).

We also agree with the reviewer's concerns about the nuts-and-bolts aspect of how CMS will use Bayesian methods to achieve the suggested aims. The purpose of the report was to provide an overview of the Bayesian approach and its application to the CMS policymaking context. We will share the reviewer's comments with CMS and welcome further discussions with CMS and stakeholders about next steps in possible implementation of Bayesian approaches in the coverage process.

Peer1 General The authors are to be congratulated for a superb report. We thank the reviewer for their comment.
Peer1 Executive Summary The E.S. Results (2 pages) have too much detail compared to the E.S. Methods (5 sentences), thus the Results appear somewhat out of context. In particular, the detailed ICD simulation results (p.2-3) need to either be accompanied with more details in the E.S. Methods, or they should be greatly abbreviated here. Some of the Executive Summary's conclusions appear a bit too strong. For example, the authors seem to be saying that subgroup analyses are always compromised by small sample sizes and the tendency toward excessive post-hoc subgroup testing (p.4). This is not universally true, although these are frequent problems with subgroup analyses. Some of the authors' recommendations are likewise a bit vague. For example, how can an investigator tell if a subgroup effect is "likely to be strong" (p.4)? How can a policymaker tell when trial-based data are "sufficient" (p.5)? There is also Bayesian jargon used (e.g. "the assumed priors," p.5) that isn't defined for the reader until p.6.

Being sensitive to our policymaking audience, we wanted the executive summary not to burden the reader with the technical methods described in Chapter 5 and the Appendix concerning the case study and simulations. We therefore purposely provide a brief description of the methods focusing instead on the findings.

As noted above, our goal in the executive summary was to transmit general principles, rather than discuss exceptions and details of implementation. We acknowledge that there is additional future research needed by CMS and others in terms of implementing Bayesian approaches into the CMS policymaking process.

Peer1 Chap 1 This is an extremely well-written, accessible introduction to Bayesian analysis, particularly for a clinical audience. I will consider using this chapter (with permission) in teaching these methods to our Masters students. We thank the reviewer for their comment. Use of the chapter for teaching is encouraged once a final version of the report has been published.
Peer1 Chap 2 This section begins well but seems to meander a bit at the end. The last paragraph seems to state a number of obvious and general truths that aren't particularly relevant to why use of Bayesian analysis would be good for CMS. We felt that this last paragraph discusses important topics of stakeholder engagement and transparency and have therefore left it unchanged.
Peer1 Chap 3 The clarity of this chapter could be improved by standardizing the organization and style of the 4 sub-chapters on literature themes. The first, "Advantages and Disadvantages...," reads smoothly as a narrative literature review. The second, "Use of Bayesian Techniques...," seems to digress on page 41 into an overly detailed description of the Bayesian vs. Frequentist debate. The style also departs from the objectivity of the first sub-chapter to a more editorializing perspective (e.g., the statements in favor of using skeptical priors on pp. 47-48). The third sub-chapter barely references the literature at all (2 footnotes), and reads more like a tutorial, although it is well written. The final sub-chapter returns to the style of an objective narrative literature review. I'd favor the use of a single style throughout, preferably the style used in sub-chapters 1 and 4. Numbered subheadings might further help the reader navigate. We recognize that the organization and style of the 4 subsections in Chapter 3 are not uniform, but we judged that the differences were appropriate to the material being reviewed in each subsection and the intended audience.
Peer1 Chap 4 This is a clear and concise description of the ICD coverage "story." It may be worthwhile adding on page 68 that Medicare's coverage of Cardiac Resynchronization Therapy defibrillators (CRT-D) in 2005 reduced the number of NYHA Class IV CHF patients who would not be covered for defibrillator implantation by CMS. We now include this additional detail in the ICD "story" on page 39.
Peer1 Chap 5 - Appendix Chapter 5/Appendix. While this is a great example and it is nicely presented in both Chapter 5 and the Appendix, I'm a little concerned with the overall strategy of combining primary prevention (e.g. MADIT-II, SCD-HeFT) and secondary prevention (e.g. CASH, AVID) ICD trials. On clinical grounds, one could argue that these are entirely distinct patient populations and thus the findings from one group of trials would not be expected to inform the findings in the other group. From a policy perspective, CMS was not necessarily wrong to focus their attention on the primary prevention trial data only (were they?) when deliberating on the coverage expansions of 2003 and 2005. I understand that Bayesian methods relax the assumptions of homogeneous effects across trials, but if these are 2 fundamentally different patient populations, it's not clear how even Bayesian methods would permit analytic aggregation (i.e., wouldn't all the priors derived from the secondary prevention trials be non-informative for primary prevention trials)? I certainly could be missing a key point (maybe Bayesian analysis really enables the grouping of proverbial apples and oranges), but if so I would like to see a clearer description of the authors' non-intuitive decision to aggregate these trial data in this manner. Would statistical tests for homogeneous effects (the bread-and-butter of classical meta-analysis) be inappropriate in a Bayesian context? If so, why?

When planning our simulations and case study we discussed within our team and with our technical expert panel the advantages and disadvantages of combining the secondary and primary prevention data. It was felt that the analysis of the complete data set with the four prognostic characteristics would provide substantial differentiation of the patient population and exploration of the similarities and differences among the trials.

We agree, however, that exploring the patient-level data taking into account the information about the prevention type is interesting. We therefore now include discussion of these findings (supporting our approach of combining the data from all trials) on page 48 (with reference to a more extended treatment in the Appendix).

Peer1 Chap 6

Similar to my comments on the Executive Summary, I think some of these findings need to be tempered a bit. While I agree with the general principles stated here, occasionally a subgroup effect should be acted upon by policymakers if the welfare issues to the affected sub-population are profound. For example, if a prescription drug was shown in a randomized clinical trial to have 50 times the expected mortality rate in a particular subgroup with an interaction effect p-value < 0.001, it may be highly appropriate for the manufacturer and FDA to restrict the use of the drug in that subgroup, rather than deem the evidence "exploratory."

I have the same comments here as above in the Executive Summary regarding the use of the words "strong" and "sufficient" in findings 3 and 4. These are highly subjective assessments, and I think it would be more helpful for policymakers if the authors quantified what these terms mean operationally. For example, should CMS never accept a subgroup analysis that has a non-informative prior (i.e., not "strong")? Does "sufficient" mean that the null hypothesis is excluded with some probability (i.e., outside the posterior 95% credible interval)?

As noted above, our goal in the executive summary and Chapter 6 was to transmit general principles, rather than discuss exceptions and details of implementation. We acknowledge that there is additional future research needed by CMS and others in terms of implementing Bayesian approaches into the CMS policymaking process.
Peer1 Pg 59 "A costly intervention to the Medicare community" - do you mean costly to payers? We have modified the bullet to clarify that the ICD is potentially a costly intervention to the Medicare program (page 34).
Peer1 Pg 108 "Ventricular tachycardi" is missing an "a" The correction has been made (page 64).
Peer1 Fig 1 & 3-10 Figures 1 and 3-10 appear to have been done in low-resolution graphics. It would be better for presentation purposes if these were upgraded to the quality of Figures 2 and 11-14. We have edited Figures 1-10 to improve resolution/clarity.
Peer1 Tables 4-7 Tables 4-7 and Appendix Tables A1-A4 appear to be identical, and Appendix Tables A23, A25, A26, and A27 duplicate Tables 8-11. Some of the Figures are duplicated, too. Is this duplication necessary (is the Appendix designed to stand alone)? Also, Tables 4 and A1 both would be more legible with fewer significant digits. The Appendix is designed to stand alone and so we have left the duplicative figures/tables.
Peer1 Minor Comment The 2nd and 3rd sentences of the Key Point on page A-166 seems to be overly simplistic and outside the scope of this paper. I'm sure the authors would agree that there are lots of complicated reasons, beyond differences in patient prognosis, why randomized trials often have dissimilar outcomes to observational registries. Probably better to omit these sentences. We have omitted these sentences from the final report (pages 46 and A-116).
Peer1 NOTE I'm a little surprised to see analyses of ICD/NCDR data included in this report (p. 81, p. A-165, Table 10, Appendix Table A26). I'm a member of the Research and Publications Subcommittee for the American College of Cardiology/National Cardiovascular Data Registry (ACC/NCDR), which oversees use and publication of analyses of ICD/NCDR data. I may be misinformed, but I do not recall any of the authors on this report being principal investigator of an approved project to use ACC/NCDR data. I do recall that Dr. Sana Al-Khatib of DCRI is an approved PI, but she is not listed as a co-author here, and I don't think that her NCDR-approved project was for the purpose of this technology assessment. In any event, publication of analyses using ICD/NCDR data must be approved by the ACC/NCDR prior to submission for publication. Have the appropriate data use and publication permissions been obtained? We are sorry for the confusion. The data cited in the report were obtained directly from CMS and reflect the Medicare patients within the ICD ACC/NCDR registry rather than the larger ACC/NCDR ICD registry. We now clarify this in the text on page 46 and throughout the document when we reference the registry participants. Note that because the ACC/NCDR registry does not currently contain long-term outcomes, we did not use the registry data for our case study but instead used the MUSTT registry participants.
Peer2 General Overall I found this to be an extremely well-written report presented in a very scholarly and well-balanced fashion. The authors have skillfully avoided entering into the realm of polemics and have rather elegantly demonstrated both the benefits and disadvantages of a Bayesian analysis compared to the more frequently employed Frequentist paradigm. We thank the reviewer for their comment.
Peer2 General I have no major concerns about this report although I did find the example in Chapter 5 somewhat difficult to follow, related primarily to formatting issues. It is [difficult to] smoothly and seamlessly read this chapter when there are constant references to figures, tables or appendix tables without any accompanying page numbers. This makes for very disjointed reading and impinged on the clarity of the example, especially compared to the earlier chapters. Also the apparent desire to keep the example as simple as possible by moving much of the methods to the appendix had the paradoxical effect of rendering the example more difficult to follow. I would personally favor integrating the current appendix directly into Chapter 5. We thank the reviewer for their comment. In previous drafts of the report, we included the Appendix material in the main text. However, readers of these previous versions found that the additional detail was distracting and so we've moved it to the Appendix.
Peer2 Pg 17, ln 2 The word "better" might be more appropriately replaced with the word "more". The suggested change has been made (page 10).
Peer2 Pg 18, ln 12 I would add "...shrunk toward the mean of the posterior distribution, in this case toward the null value of 0." The suggested insertion has been made (page 10).
Peer2 Pg 19, ln 6 Sequential meta-analysis is also routinely referred to as a cumulative meta-analysis and this could be mentioned. The suggested change has been made (page 11).
Peer2 Pg 19, ln 9 Strong beliefs - this may be a limitation to Bayesian analysis, but of course this depends on the origins of these strong beliefs. If they come from well-randomized clinical trials then this would not be considered a worst-case scenario. We now clarify on page 11 that we assume that strong beliefs are primarily based on intuition, rather than objective information such as a previous meta-analysis, and that data-based priors are superior to opinion-based priors.
Peer2 Pg 20, ln 22 "...and similar examples of statistical esoterica" - this has a rather pejorative connotation and a better word may be desirable. We have removed this phrase as suggested (page 12).
Peer2 Pg 21, ln 9 "...given the observed data and the prior information." The suggested change has been made (page 13).
Peer2 Pg 82, ln 1-5 Not sure of the origin of these points. On pages 46 and A-116, we now differentiate between what we observed and what we concluded based on our analyses.
Peer2 Pg 84, ln 1-5 It is unclear what the two prior beliefs represent. The two priors represent beliefs of no treatment effect. Both priors are centered around no treatment effect. We describe prior 2 as being more informative in the sense that it places heavier mass around no treatment effect. We now clarify this in the text (page 47).
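The distinction between a less and a more informative prior centered at no treatment effect can be made concrete with a small conjugate-normal sketch. All numbers below are hypothetical (they are not the report's ICD data): a normal prior on the log hazard ratio centered at 0 is combined with a normal approximation to a trial likelihood, and the more informative prior pulls the posterior hazard ratio toward 1.

```python
import numpy as np

def posterior_log_hr(log_hr_hat, se, prior_sd):
    """Conjugate-normal posterior for a log hazard ratio.

    Prior: N(0, prior_sd^2), centered at no treatment effect (log HR = 0).
    Likelihood: normal approximation N(log_hr_hat, se^2) from the trial.
    """
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * data_prec * log_hr_hat  # prior mean is 0
    return post_mean, np.sqrt(post_var)

# Hypothetical trial result: HR = 0.75 with 95% CI (0.60, 0.94)
log_hr_hat = np.log(0.75)
se = (np.log(0.94) - np.log(0.60)) / (2 * 1.96)

for label, prior_sd in [("prior 1 (less informative)", 1.0),
                        ("prior 2 (more informative)", 0.2)]:
    m, s = posterior_log_hr(log_hr_hat, se, prior_sd)
    print(f"{label}: posterior HR {np.exp(m):.2f}, "
          f"95% CrI {np.exp(m - 1.96 * s):.2f} to {np.exp(m + 1.96 * s):.2f}")
```

With these illustrative inputs, the heavier mass that prior 2 places around no treatment effect shrinks the posterior hazard ratio from about 0.75 toward about 0.80.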
Peer2 Pg 85, ln 14 The authors have taken a hazard ratio of 0.8 to indicate clinical significance. The justification for choosing one arbitrary dichotomous cut-point for clinical significance merits perhaps more reflection and discussion. Indeed, one of the disadvantages of the frequentist paradigm is the dichotomous p < 0.05, and one of the advantages of a Bayesian approach is the possibility, without a type one error penalty, to look at several different cut-points. Perhaps at least for the overall results the results with various cut-points of HR 0.9, 0.8 and 0.7 might be illuminating. We based the hazard ratio threshold for clinical significance of 0.8 on feedback from our technical expert panel. We now clarify this and provide for the reader results using other cut-points as well, namely, 0.7, 0.8 and 0.9 (pages 45 and 48).
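Once a posterior for the log hazard ratio is in hand, reporting several cut-points rather than one dichotomous threshold is a one-line calculation, which is the reviewer's point about the Bayesian approach carrying no type I error penalty for additional cut-points. A minimal sketch, again with hypothetical posterior values rather than the report's results:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical normal posterior for the log hazard ratio
post_mean, post_sd = np.log(0.78), 0.08

for cut in (0.9, 0.8, 0.7):
    # Pr(HR < cut) = Pr(log HR < log cut) under the normal posterior
    p = norm.cdf(np.log(cut), loc=post_mean, scale=post_sd)
    print(f"Pr(HR < {cut}) = {p:.2f}")
```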
Peer2 Table 6 A table of p values is bound to ignite the ire of some (not only Bayesians) and is of limited decision-making utility. Perhaps at least for a small subset the effect size and 95% CI should be reported. We now include for each cell in Table 6 the hazard ratio and 95% confidence interval (unadjusted for multiple testing) comparing survival by treatment in the subgroups of interest. Missing entries indicate unavailable data for the particular subgroup. Entries highlighted in red indicate significant results at the unadjusted significance level of 5%.
Peer2 Table 7 I would like to see the cumulated number of patients and events for each of the 48 subgroups. This could also be presented in Table 9. We agree that this information is useful to the reader and have modified Table 7 to include the overall number of patients and events.
Peer3   I applaud the authors and CMS for addressing an interesting and timely topic with an in-depth report that attempts to tailor itself to a non-technical audience. The report has several overall strengths including the tutorial on Bayesian methods, literature review, application to an area of interest to CMS, and provision of a detailed statistical appendix. As one might expect, the strengths are also potential weaknesses given the length of the report and the tendency to gloss over what might be important details. Nevertheless, the overall result is quite interesting and relevant. I expect that it will be a useful document for CMS and ultimately for the clinical trials and policy communities in larger context. We thank the reviewer for their comment.
Peer3 Page 2 Simulation studies are discussed in part by saying "they often have low power to detect differential treatment effects". There is a general lack of rigor in the document when discussing this concept. The essential point is that studies designed to detect main effects (as almost all are) will have little power to detect treatment-covariate interaction. Sometimes the report refers to this concept explicitly and in other places it is termed "differential treatment effect". I think it would be helpful to refer to it always as treatment-covariate interaction. Furthermore, it should be generally acknowledged and articulated that main-effect designs will necessarily leave minimal power for detection of these (or other) interactions. As a very general rule, it takes roughly four times the sample size to detect interactions compared to main effects, assuming that we are discussing effects of approximately the same magnitude. Obviously, large magnitude effects can be detected more easily. As suggested, we have globally replaced references to "differential treatment effects" with "treatment-covariate interaction."
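The reviewer's "roughly four times" rule can be checked with a normal-approximation sample-size calculation. In a balanced design, a contrast with unit coefficients across k equally sized cells has variance k²σ²/N, so a 2x2 treatment-covariate interaction (k = 4) needs about four times the total N of a main effect (k = 2) of the same magnitude. A sketch with illustrative values (not taken from the report):

```python
from scipy.stats import norm

def total_n(delta, sigma, n_cells, alpha=0.05, power=0.80):
    """Total N for a two-sided z-test of a +/-1 contrast over n_cells
    equally sized cells; the contrast variance is n_cells**2 * sigma**2 / N."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return (z * sigma / delta) ** 2 * n_cells ** 2

delta, sigma = 0.5, 1.0
n_main = total_n(delta, sigma, n_cells=2)  # treatment main effect
n_int = total_n(delta, sigma, n_cells=4)   # 2x2 treatment-covariate interaction
print(f"main effect N = {n_main:.0f}, interaction N = {n_int:.0f}, "
      f"ratio = {n_int / n_main:.1f}")     # ratio = 4.0
```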
Peer3 Page 3 and 4 In addition, it might be helpful to distinguish between qualitative and quantitative treatment-covariate interactions. As a general rule, we would be interested in interactions that are large in magnitude whether they are qualitative or quantitative. However, small quantitative interactions (those interactions that have the same direction but quantitative differences in magnitude) are generally of little or no interest. This is because there is no therapeutic implication. If treatment X is better in both subsets, it is the recommended treatment even if treatment X is slightly better for one subset than it is for the other. In contrast, qualitative interactions (interactions that show treatment helping one subset but harming another subset) always carry therapeutic importance. Treatment X is appropriate for one subset but treatment Y is appropriate for the other. These qualitative interactions may require more power to detect because the statistical test can be less efficient. I think the report and the methodology in general might need to be more respectful of these concepts because of the therapeutic relevance and the fact that we may not need to fuss very much over many quantitative interactions. We have now added a discussion of heterogeneity and concepts of clinical and statistical significance to the Tutorial (page 12).
Peer3 Page 3 Another concept introduced on Page 3 that is of potential concern is the use of patient-level versus aggregate data. It is stated that "the analysis of aggregate data may be more sensitive to priors". I can imagine that this may be the case; however, there may be additional issues with the analysis of aggregate data. In particular, the analysis of aggregate data can represent a type of "ecological fallacy" if the underlying means are subject to confounding. The analysis, whether frequentist or Bayesian, would then represent a kind of incomplete analysis of covariance, and can be biased or even yield the wrong algebraic sign. In any case, there may be serious pitfalls with aggregate data. We include a discussion of confounding and its potential impact on findings in Chapter 6, item 7 (page 51).
Peer3   The tutorial on Bayesian methodology will likely be appreciated by many consumers of this report. I don't think it is necessary to draw differences between frequentist and Bayesian methods that emphasize potential friction or controversies. Nevertheless, I think it would be helpful for non-statistical audiences to appreciate some of the fundamental inferential differences that may be consequential for the way that they think about interpretation of data and policy decisions. A key in my opinion is the difference between the frequentist and Bayesian perspective on the parameters of the underlying data model. To the frequentist, parameters are fixed constants of nature. In this case, sample variation must be expressed in the familiar but somewhat awkward framework of hypothetical repetitions of an identical experiment. This is the awkwardness that causes us to misinterpret confidence intervals and p-values. To the Bayesian, model parameters are random variables, and therefore sampling variation is manifest as probability distributions for those parameters. This view also requires the Bayesian to specify probability distributions for parameters. Hence, the notion of "credible intervals". Although these are superficially analogous to confidence intervals and p-values, the difference is quite fundamental and important to understand. It is useful not to place value judgments on these differences but helpful to articulate clearly the different inferential frameworks that are required for Bayesian versus frequentist analysis, and in particular to explain the different perspectives on model parameters. We thank the reviewer for their comment and agree with their view that the Tutorial was vague about some of the more technical points. As the reviewer mentions, the Tutorial's goal was to be appreciated by the consumers of this report (policymakers and regulators) and therefore the vagueness of the Tutorial regarding certain technical points was intentional.
Peer3   Another fundamental issue that I believe the report in general and the tutorial specifically does not deal with effectively is the prior distribution. The report does a fine job of illustrating generic types of prior distributions including uninformative, skeptical, and optimistic. No definition is offered for the later concept of "genuine priors", but this may not be essential. My concern deals with the construction of prior distributions in the absence of firm statistical evidence. (I don't think there would be widespread disagreement among statistical experts if prior distributions are constructed from actual data). An essential question is whether or not subjective belief in the absence of actual data is appropriately represented by a probability distribution. This issue is close to the heart of the general criticisms about Bayesian methodology. I don't know that the report deals with it directly enough. Absent data, is a probability distribution the appropriate way to represent our ignorance regarding a model parameter? As indicated, we do describe "genuine" priors on page 33 of the literature review. In the Tutorial we attempted to keep our discussion of priors and Bayesian approaches as simple as possible to allow the Tutorial to be understood by the target audience. We do now emphasize the role of objective and subjective priors on page 8 of the Tutorial.
Peer3   I also found myself wondering a little bit about the precision with which non-informative priors are described. The confidence interval approach to the discussion of prior distributions is a useful one. However, I need some additional convincing that a wide confidence interval could be taken literally as a "non-informative prior". I imagine the confidence interval as being approximately a normal distribution whereas a non-informative prior should be an improper uniform distribution. I don't know if this point is essential but I would want to be sure that our non-statistical colleagues are not misled. We have removed the use of the confidence interval in the discussion of non-informative priors and now include a brief note about such priors having an interval over the entire real line (page 10).
Peer3   It may also be helpful to point out that apart from the specific differences that I have mentioned above; every frequentist procedure has an essentially equivalent Bayesian one. To me, this means that a frequentist based conclusion can be shown to be essentially the same as a Bayesian conclusion where the Bayesian statistician uses a prior distribution that the frequentist was seemingly unaware of. In a crude sense, the frequentist might be seen as using a prior distribution without recognizing it. We thank the reviewer for their comment.
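The reviewer's equivalence point can be illustrated numerically: under a normal model, as the prior becomes arbitrarily diffuse, the 95% credible interval converges to the 95% confidence interval, so the frequentist answer coincides with a Bayesian answer under a near-flat prior. A minimal sketch with hypothetical numbers:

```python
import numpy as np

theta_hat, se = 0.30, 0.10  # hypothetical estimate and standard error

lo, hi = theta_hat - 1.96 * se, theta_hat + 1.96 * se
print(f"frequentist 95% CI: ({lo:.3f}, {hi:.3f})")

for prior_sd in (0.5, 2.0, 100.0):  # N(0, prior_sd^2) priors, widening
    post_prec = 1 / se ** 2 + 1 / prior_sd ** 2
    m = (theta_hat / se ** 2) / post_prec   # posterior mean
    s = np.sqrt(1 / post_prec)              # posterior SD
    print(f"prior sd {prior_sd:6.1f}: 95% CrI "
          f"({m - 1.96 * s:.3f}, {m + 1.96 * s:.3f})")
```

As the prior standard deviation grows, the credible interval matches the confidence interval to three decimal places, which is one way of reading the reviewer's remark that the frequentist "uses a prior distribution without recognizing it."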
Peer3   The literature review seems quite useful, although I personally was not too impressed with the formal aspects of the Medline search. The reference list is most helpful. The specific problems of subgroup analysis seem quite appropriate for illustrating the differences between Bayesian and frequentist approaches. The general strengths of Bayesian methodology in this setting make it a useful example. The discussion regarding fixed-effect models, random-effects models, and random-effects models with information from outside the study is a helpful framework. I was not convinced by the discussion regarding biological heterogeneity. The presence or absence of such heterogeneity is an unknown characteristic of nature. In a very real sense, the approach to heterogeneity is based on a pure assumption, and not surprisingly it appears that our ability to deal with the consequences of that assumption is in part affected by the methodology. The reason this issue is so problematic is because the assumption of heterogeneity is derived only partly from what we think we know about biology and derived much too influentially by fashions of the day, including politics. For example, we might be looking for heterogeneity on the basis of non-biological constructs masquerading as biological ones. Race, for example, is at best a surrogate and may not be a biological construct whatsoever. In many cases, sex is also not a relevant biological factor. These issues aside, I would want to be sure that the methodology chosen does not yield a foregone conclusion derived from pure assumption. Although we acknowledge the reviewer's comments, we note that we are not arguing that biological heterogeneity is always present, but rather that it is often part of the philosophical rationale behind the Bayesian approach. We have also expanded our discussion of heterogeneity in the Tutorial (page 12).
Peer3   The discussion on the resistance to the use of Bayesian methods was most helpful. I was unfamiliar with many of the concepts there. I would agree wholeheartedly with the list of items mentioned on Page 56. We thank the reviewer for their comment.
Peer3   The extensive discussion of the implantable cardioverter defibrillator is clearly an appropriate illustration of Bayesian methodology. However, I found myself struggling to extract the most relevant lessons from the example. Although this section of the report reflects a scholarly approach that might be useful for a peer-reviewed journal, I wonder if, for this audience, the report should be restructured in a way that presents the implications and lessons in a more digestible format. I would like to see the data displays be more friendly and informative. Most people will not try to digest the tables, especially the ones (inappropriately) full of p-values. Some of the figures have odd anomalies, including being essentially obscured by censoring indicators in the case of survival curves, and the curious switching between horizontal and vertical formats for confidence interval plots. I am fearful that much more time will be required to make the technical details of this report less voluminous and more accessible to the intended audience. We acknowledge that the case study is at a different level of detail than the other chapters, but thought that this was needed in order to convey accurately the use of Bayesian approaches. We tried, however, to simplify the chapter for readers through the inclusion of "key points" and specific clinical questions and answers. In addition, following reviewers' suggestions, we have revised several tables and figures. Table 6 now has point and interval estimates for the hazard ratios for the subgroups of interest. Kaplan-Meier figures now omit the censoring indicators. Moreover, figures with estimates of hazard ratios in the combined analysis now all use the same orientation.
Peer3 Pages 3 and 4 Another general issue with the report is illustrated by some of the bullets and conclusions here. There is the concept that observed interactions may not be the same across all trials. This implies that we may need to account for random effects in interaction estimates or three-way interactions (the third factor being an effect of the trial). It would seldom be worthwhile to design for this effect. I wonder if the general concept of designing studies to detect the interactions of interest, as opposed to simply designing the analysis, should be discussed at a deeper level. The design question touches not only on our ability to detect interactions but also on the potential biases in registry data that are discussed briefly on Page 3. It is well known that patients in registries may carry different prognoses from those studied in clinical trials. What may be under-appreciated is that treatment effect estimates that derive from registries are typically biased (perhaps confounded by indication), whereas relative treatment effect estimates from randomized trials are probably more likely to generalize across patient subsets. We acknowledge the reviewer's comment concerning the use of Bayesian approaches in the design of clinical trials. We now clarify (on page 1) that although we address (and acknowledge the importance of) the use of Bayesian approaches both for clinical trial design and analysis, we focus our report on their use for clinical trial analysis, as this was most applicable to the CMS policymaking context.

¹ Peer reviewers are not listed in alphabetical order.
² Page and line numbers refer to the draft report.
³ Page and line numbers refer to the final report.


Table 2: Public Review Comments

Reviewer¹ | Reviewer Affiliation² | Section³ | Reviewer Comments | Author Response⁴
Jose Ma. J. Alvir, DrPH Pfizer, Inc Methodology

The TA should further clarify how Bayesian models are applied differently as compared to classical models.

The TA provides a thoughtful analysis of the benefits and drawbacks of using a Bayesian analytic approach versus a classical (frequentist) statistical approach to data analysis. The TA concludes that "direct comparisons of meta-analysis between frequentist and Bayesian approaches do not always yield consistent results."[AHRQ. "Use of Bayesian Techniques in Randomized Clinical Trials: A CMS Case Study." Accessed at http://www.ahrq.gov/clinic/ta/bayesian.pdf on July 6, 2009. Page 52.] We recommend that AHRQ provide greater clarification and discussion of why results from classical and Bayesian approaches can differ and what the implications would be for policy decisions. If CMS is to adopt Bayesian analysis in its decision-making, the agency will need guidance on how to consider and incorporate results that are conflicting, in addition to assessing the appropriateness of model selection in each case.

We also recommend the TA clarify the terms under which a fixed-effect versus random-effects model should be used when incorporating prior data in Bayesian models. Currently, interpretation of statements in the TA could imply that fixed-effect models are preferable because they give greater weight to larger studies. However, this implication does not necessarily account for potential heterogeneity, which requires downweighting of the larger studies. To help clarify these issues, the TA should better address the issue of heterogeneity related to both classical and Bayesian model use.

We now include in the Tutorial a new discussion of fixed- and random-effects models (under frequentist and Bayesian approaches), heterogeneity, and interaction (page 12).
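As a concrete companion to that discussion, the sketch below pools hypothetical log hazard ratios with an inverse-variance fixed-effect model and a DerSimonian-Laird random-effects model. It is illustrative only (not the method used in the report), but it shows the behavior the reviewer describes: once between-study heterogeneity (τ²) is estimated to be nonzero, the largest study's relative weight drops.

```python
import numpy as np

def pool(effects, ses, random_effects=True):
    """Inverse-variance pooling; DerSimonian-Laird tau^2 when random_effects."""
    y, v = np.asarray(effects, float), np.asarray(ses, float) ** 2
    w = 1 / v
    fixed = np.sum(w * y) / np.sum(w)
    tau2 = 0.0
    if random_effects:
        q = np.sum(w * (y - fixed) ** 2)         # Cochran's Q
        c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
        tau2 = max(0.0, (q - (len(y) - 1)) / c)  # DL estimate of tau^2
    w_star = 1 / (v + tau2)
    est = np.sum(w_star * y) / np.sum(w_star)
    return est, np.sqrt(1 / np.sum(w_star)), w_star / np.sum(w_star)

# Hypothetical log hazard ratios: one large trial, two smaller discordant ones
effects, ses = [-0.30, 0.10, -0.60], [0.05, 0.20, 0.25]
for re_flag in (False, True):
    est, se, wts = pool(effects, ses, random_effects=re_flag)
    print(f"{'random' if re_flag else 'fixed '}: estimate {est:+.3f} "
          f"(SE {se:.3f}), weight on large trial {wts[0]:.2f}")
```

With these illustrative inputs, the large trial carries about 91% of the weight under the fixed-effect model but only about 50% once heterogeneity is allowed for.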
Jose Ma. J. Alvir, DrPH Pfizer, Inc Implementation

Appropriate infrastructure and workforce are needed to implement use of Bayesian statistics in CMS decision-making.

We agree with the TA's discussion of some of the potential disadvantages of Bayesian method implementation, such as the lack of statistical and computational expertise and unfamiliarity with Bayesian methods on the part of policymakers. Researchers and stakeholders need advanced technical understanding to be able to discern the relative quality of different Bayesian methods. Few such individuals exist, creating the risk that conclusions from Bayesian analyses will be misrepresented to key stakeholders and misapplied in policy decisions. These shortcomings in adoption of Bayesian methods were echoed by members of the MEDCAC at the June meeting, during which members expressed a high level of confidence in Bayesian methods but highlighted the intensity of training that would be required to familiarize the future and current workforce of clinicians, policymakers, and others with the methods.

These issues should be investigated more fully to ensure that an appropriate infrastructure and workforce exists to support high quality Bayesian analysis and interpretation. Without adequate infrastructure and resources, the potential exists for poorly designed, misrepresented, or misinterpreted studies to serve as the basis for policy decisions. Researchers and analysts at CMS and organizations conducting research need to understand the strengths and weaknesses of Bayesian methods, and how and why there are differences in studies using classical or Bayesian analyses.

We thank the reviewer for their thoughtful comment and will pass it on to CMS for their consideration.
Jose Ma. J. Alvir, DrPH Pfizer, Inc Implementation We agree that the articles by Sheingold [Sheingold, S. "Can Bayesian Methods Make Data and Analyses More Relevant to Decision Makers?" International Journal of Technology Assessment in Health Care. 2001; 17(1):114-22.] and Winkler [Winkler, R. "Why Bayesian Analysis Hasn't Caught on in Health Care Decision-Making." International Journal of Technology Assessment in Health Care. 2001; 17(1):56-66.] referenced in the TA provide several helpful suggestions for making Bayesian methods more accessible, such as more training materials and software, established test cases for using Bayesian analysis, and clear demonstrations of the value of this type of analysis in healthcare decision-making. We encourage both AHRQ and CMS to adopt policies and programmatic support to address the gaps in analytical skill and understanding and to prepare the workforce to handle studies with Bayesian methods. We thank the reviewer for their thoughtful comment and will pass it on to CMS for their consideration.
Jose Ma. J. Alvir, DrPH Pfizer, Inc Implementation, continued

Use of Bayesian methods in CMS decision-making should be transparent and well-defined. 

We suggest that CMS' use of Bayesian studies include two types of transparency. The first layer of transparency should occur in the processes CMS uses to review and evaluate Bayesian studies. CMS should involve a range of relevant stakeholders (e.g., methodologists, clinical and other scientific experts, and patient and caregiver representatives) in the evolving discussion around use and interpretation of Bayesian methods in decision-making, to capture all viewpoints. CMS then should be transparent in defining instances in which Bayesian approaches will be considered and used in decision-making. CMS should also outline the criteria to assess Bayesian studies and be transparent in communicating to stakeholders about the decision-making process.

The second layer of transparency should be at the research study level. By their nature, Bayesian methods afford transparency into research study design. Given the acknowledged subjective nature of inputs into Bayesian analyses, researchers must prospectively design studies and specify the prior distribution, which requires planning and consideration of how the results will relate back to the study inputs. CMS should consider developing processes, including communication strategies, to make the methods from studies used in decision-making as transparent and accessible to outside stakeholders as possible. For example, CMS could explore options to standardize reporting of methods and results for Bayesian analyses submitted for use in decision-making and assure that these reports are publicly available.

We thank the reviewer for their thoughtful comment and will pass it on to CMS for their consideration.
Jose Ma. J. Alvir, DrPH Pfizer, Inc Implementation, continued

CMS should build on other agencies' experiences with Bayesian methods.

Future discussions should include an assessment of how CMS can build on the FDA's efforts to implement Bayesian methods. The FDA has worked with Bayesian methods for several years. The agency's 2006 Draft Guidance on the Use of Bayesian Statistics in Medical Device Clinical Trials helped to define parameters on the use of Bayesian methods specifically for devices. In adopting Bayesian analysis, we support efforts by CMS to adopt best practices and consider lessons learned from other agencies' experiences.

The FDA has also recognized the need for education on Bayesian methods, both internally and externally. In recent presentations, Dr. Greg Campbell, Director of the FDA's Division of Biostatistics at the Center for Devices and Radiological Health, has spoken about FDA's implementation of Bayesian methods in trials. His remarks highlighted the educational efforts the agency instituted, with internal courses and seminars to educate staff members and public forums to discuss Bayesian methods. [Campbell, G. "Bayesian Statistics at the FDA: The Pioneering Experience with Medical Devices." Presented at Florida State University Conference "Statistics, the Next 50 Years" on April 17, 2009. Accessed at http://www.stat.fsu.edu/Campbell.ppt on July 6, 2009.] CMS will need to adopt similar programs to educate staff members who will be analyzing Bayesian trial data.

We thank the reviewer for their thoughtful comment and will pass it on to CMS for their consideration.
Jose Ma. J. Alvir, DrPH Pfizer, Inc Conclusion

Pfizer appreciates the opportunity to provide AHRQ and CMS with comments on the draft TA on the use of Bayesian methods in randomized clinical trials. This report should serve as the beginning of a larger discussion on how to implement Bayesian methods and ensure that they are used appropriately.

Pfizer welcomes any opportunity to discuss our comments and recommendations in further detail. Please feel free to contact me at 212-733-2051 with any questions, or if you need additional information on our above comments.

We thank the reviewer for their thoughtful comments.
Richard Chapell Merck & Co., Inc. General Thank you for allowing us the opportunity to comment on the draft document. We have reviewed it thoroughly and believe that it describes a viable way forward for the use of Bayesian techniques. We especially applaud the way in which the twin dangers of Type 1 and Type 2 errors are highlighted in the discussion of subgroup analysis. We hope that the more methodical approach described in the document will come to replace the data-mining techniques that are commonly used at present. We thank the reviewer for their comment.
Richard Chapell Merck & Co., Inc. Pg 8, ln 5 "combine" should be "combines" The suggested change has been made (page 5).
Richard Chapell Merck & Co., Inc. Pg 9, ln 1 "where" should be "at which" The suggested change has been made (page 5).
Richard Chapell Merck & Co., Inc. Pg 9, ln 6 Numeral "777" should be removed The deletion has been made (page 5).
Richard Chapell Merck & Co., Inc. Pg 9, ln 20 Please add a period to the end of the sentence. A period has been inserted (page 6).
Richard Chapell Merck & Co., Inc. Pg 16, ln 17 "around" should be "centered at" The suggested change has been made (page 9).
Richard Chapell Merck & Co., Inc. Pg 20, ln 3 Please remove extraneous period. The deletion has been made (page 11).
Richard Chapell Merck & Co., Inc. Pg 22, ln 13 Please remove extraneous comma and capitalize "It". The suggested changes have been made (page 13).
Richard Chapell Merck & Co., Inc. Pg 24, ln 18 Please remove space between the final word in the sentence and the period. The deletion has been made (page 14).
Richard Chapell Merck & Co., Inc. Pg 54, ln 8 "a cost-effectiveness decision models" please remove either the initial "a" or the final "s" We have removed the "a" as suggested (page 31).
Richard Chapell Merck & Co., Inc. Pg 62, ln 3 Please remove the apostrophe from "it's" The apostrophe has been removed (page 35).
Richard Chapell Merck & Co., Inc. Pg 67, ln 7 "Many of these questions are hoped to be explored." should be "It is to be hoped that many of these questions will be explored." Begin a new sentence with "Others". We have restructured the sentence as requested (page 38).
Richard Chapell Merck & Co., Inc. Pg 80, ln 12 "is" should be "are" The suggested change has been made (page 45).
Richard Chapell Merck & Co., Inc. Pg 80, ln 18 "regarded" should be "considered" or "regarded as" The suggested change ("considered") has been made (page 46).
Richard Chapell Merck & Co., Inc. Pg 91, ln 1-5 A block of text appears to be missing. We have added in the missing text (page 51).
Christine Fletcher, MSc Amgen Ltd General I do not agree with some of the general points, but there are schools of thought that would agree with the position: for instance, the idea of taking epidemiological data and treating it as a prior to clinical data by weighting it so as to "have the same weight" as the clinical data. David Spiegelhalter points out that adding a fixed amount to the variability in the prior is better than increasing the variance by a proportional amount. We believe that the reviewer's comment is referring to the last paragraph on page 53 of the draft report (Case 2: Dissimilar Information). We have modified this paragraph to indicate that the weight assigned to prior information and the clinical data can be given more, less, or equal importance. We also now reference Spiegelhalter's work within this paragraph (pages 29-30 of the final report).
Christine Fletcher, MSc Amgen Ltd General Also: Bayesian methods seem to be being used because they are trendy. There is confusion between random-effects hierarchical models and Bayesian methods, and which is the aspect that allows us to do certain things. We agree with the reviewer and now note, when Bayesian hierarchical models are first introduced (page 45), that hierarchical models are not limited to the Bayesian paradigm but are particularly natural within that way of thinking.
Christine Fletcher, MSc Amgen Ltd General Could further clarification be given to the following points: Direct comparisons of meta-analyses between frequentist and Bayesian approaches (e.g., Bloom et al.[34]) do not always yield consistent results - in particular, sometimes the results of the two approaches are similar and sometimes they are different. [If they are different it is crucially important to understand why]. However, some observations do appear to be reasonably consistent. In general, in situations where extensive data are available, it is less likely that the conclusions of Bayesian and frequentist analyses will differ substantially. The bulleted points listed on page 30 detail additional situations which might result in either similar or different findings from meta-analytic approaches.
Christine Fletcher, MSc Amgen Ltd General Estimates of efficacy from random-effects models have less precision than estimates of efficacy from fixed-effect models. This has nothing to do with Bayesian/frequentist, but is due to one model allowing for an extra level of variation (study to study) that is ignored in the other case. Both can be done within a frequentist paradigm. We now include a discussion of fixed- and random-effects models (under both frequentist and Bayesian approaches) in a new section of the Tutorial ("Technical Note on Fixed- and Random-Effects Models, Heterogeneity, and Interaction"), page 12.
Christine Fletcher, MSc Amgen Ltd General Fixed-effect models give greater weight to larger studies than do random-effects models. This will be mis-read. It is technically correct as a relative statement (with or without random effects), but it will be read in absolute terms ("fixed are better as they give more weight to these important studies"). What is true is that if there is heterogeneity between studies then we need to allow for that heterogeneity by downweighting the big studies and taking more account of the spectrum of different values across the studies. See response to immediately preceding comment. We now include a new discussion of fixed- and random-effects models (under both frequentist and Bayesian approaches), heterogeneity, and interaction on page 12.
Christine Fletcher, MSc Amgen Ltd General Both approaches struggle a bit when the number of studies is small to moderate. In the fixed-effect model, this is reflected by a test for heterogeneity that has low power. In the random-effects models, this is reflected by the tendency for the results to be sensitive to the estimate (model B) or assumptions (model C) about τ. What they are saying is that with small numbers of studies it is difficult to estimate the variability from study to study. As such, using methods that work well with small amounts of data in random-effects models is important (e.g., the Kenward-Roger (KR) adjustment). We agree with the reviewer that both approaches struggle a bit when the number of studies is small to moderate, and this is reflected within the text (page 30, bullet point 3).
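The sensitivity the reviewer describes can be shown directly: with only a few trials, fixing the between-study standard deviation τ at several assumed values changes both the pooled estimate and, especially, its standard error. A minimal sketch with illustrative numbers only (not the report's data):

```python
import numpy as np

effects = np.array([-0.35, -0.10, -0.45])  # hypothetical log hazard ratios
ses = np.array([0.15, 0.12, 0.20])         # from only three trials

for tau in (0.0, 0.1, 0.2, 0.4):           # assumed between-study SD
    w = 1 / (ses ** 2 + tau ** 2)          # random-effects weights given tau
    est = np.sum(w * effects) / np.sum(w)
    se = np.sqrt(1 / np.sum(w))
    print(f"tau = {tau:.1f}: pooled estimate {est:+.3f} (SE {se:.3f})")
```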
Christine Fletcher, MSc Amgen Ltd General The results of the fully Bayesian analysis are most likely to differ from others when relatively little information is available from the data. This is, in general, the most dangerous circumstance for drawing definitive conclusions - which phenomenon should be illustrated by a careful sensitivity analysis. Correct! The most promising circumstance to apply a fully Bayesian approach occurs when the type of information available to the analyst is sufficiently disparate as to call into question the other two models. I think this means when there is heterogeneity between studies. If so, then I do not agree. We agree with the reviewer about the most dangerous circumstance for drawing conclusions about a Bayesian analysis, and also about those circumstances in which a Bayesian analysis is promising. To clarify that we were not discussing the heterogeneity between studies in the last bullet on page 54 of the draft report, we have removed this bullet concerning the "most promising" circumstance and instead now describe how a situation where RCT data are modest and external information is available allows for a particularly natural application of Bayesian techniques (page 30, last bullet point).
Christine Fletcher, MSc Amgen Ltd General From attending several meetings with the academic health economists driving the HTA process within NICE, it's clear that the Bayesian view is not only dominant, it is almost unanimously held; with some individuals this goes to the point at which they don't even see a Bayesian/Classical debate anymore. I know this is anecdotal, but we (industry) have got to get the same level of organizational cognizance about the detailed techniques as we currently have with the, largely classical, regulatory requirements. The reviewer's comment is noted, and we agree that the acknowledgement of potential applications of Bayesian methods is becoming more widespread within certain communities. However, from our own experience, the use of Bayesian methods is still not widely understood and applied. This unfamiliarity led CMS/AHRQ and the Duke EPC to work on this report.
Christine Fletcher, MSc Amgen Ltd General I don't like the debate being cast as Bayesian/Frequentist. I see frequentism as an approach to probability, not inference. It is perfectly possible for both prior and posterior functions to be frequency based. Indeed I got a Bayesian to admit to this once when he confirmed that, if agreement about a prior could not be found, a vote could be taken - which is about as frequentist as you can get! I prefer the term "Classical." In this report we refer to Bayesian versus frequentist approaches where the common term "frequentist" encompasses "classical" approaches. The essential interchangeability of these two terms for the purpose of this report is indicated on page 5.
Christine Fletcher, MSc Amgen Ltd General The distinction between classical multi-level error models and Bayesian models needs to be well defined. I have, again anecdotally, seen people miss the difference between the two. We agree with the reviewer and now note, when Bayesian hierarchical models are first introduced (page 45), that hierarchical models are not limited to the Bayesian paradigm but are particularly natural within that way of thinking.
Christine Fletcher, MSc Amgen Ltd General There is full support on the conclusions and recommendations regarding subgroups, and the report should be commended for tackling a variety of current statistical issues pertaining to areas surrounding evidence synthesis, the methodological considerations for incorporating randomized and observational data, and the specific issues dealing with subgroup analyses. The report includes practical examples and the simulations appropriately investigate methodological aspects analysts have to deal with in this area. We thank the reviewer for their comment.
Karen Lynn Price, PhD Eli Lilly and Co. General Expert opinion - when considering an expert opinion, it is prudent to be wary of potential biases that are related to expert opinions (too enthusiastic, span too narrow of the parameter space) - we recommend that the use of Bayesian methods to assess benefit/risk be considered. For example, Bayesian methods can be used to calculate the probability that, for example, the benefit exceeds the risk. We now remind the reader of the importance of sensitivity analyses on the prior distributions based on expert opinion (page 20). Although this is important, because it is not the focus of our discussion, and in order to preserve clarity, we have not addressed the use of Bayesian methods to assess benefits and risks.
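The reviewer's benefit-risk suggestion amounts to a simple posterior computation: given posterior draws for benefit and harm on a common scale, Pr(benefit > harm) is just the proportion of draws in which benefit exceeds harm. A sketch with hypothetical posterior distributions (in practice the draws would come from the fitted Bayesian model, not be simulated as here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# Hypothetical posterior draws, both per 1,000 patients treated
benefit = rng.normal(30, 8, n_draws)  # e.g., deaths averted
harm = rng.normal(12, 5, n_draws)     # e.g., serious device complications

print(f"Pr(benefit > harm) = {np.mean(benefit > harm):.3f}")
```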
Karen Lynn Price, PhD Eli Lilly and Co. Pg 52 Any suggestions regarding what is considered small to moderate? (3rd bullet) Just as with sample size calculations, what is considered small to moderate would depend on the variability of the outcome of interest within and between trials and the goals of the analysis. We now clarify this on page 29.
Karen Lynn Price, PhD Eli Lilly and Co. Pg 90-91 It appears that the thought was not finished. We have clarified this sentence (page 51).
Karen Lynn Price, PhD Eli Lilly and Co. Pg 90-91 There should be some mention of what to do if one trial appears aberrant. If a trial appears aberrant, one may adopt a cross-validation approach, considering the analysis with and without the data arising from that particular trial. This would allow us to assess how influential the trial might be in the overall conclusions. We now add this explanation into the text on page 51.
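The cross-validation idea in this response can be sketched as a leave-one-out sensitivity analysis: re-pool the trials omitting one at a time and see how far the estimate shifts. Illustrative numbers only, with a simple fixed-effect pool standing in for the full model:

```python
import numpy as np

def fixed_effect(effects, ses):
    """Inverse-variance pooled estimate."""
    w = 1 / np.asarray(ses, float) ** 2
    return np.sum(w * effects) / np.sum(w)

effects = np.array([-0.25, -0.30, -0.20, 0.40])  # last trial looks aberrant
ses = np.array([0.10, 0.12, 0.15, 0.15])

full = fixed_effect(effects, ses)
for i in range(len(effects)):
    loo = fixed_effect(np.delete(effects, i), np.delete(ses, i))
    print(f"without trial {i + 1}: {loo:+.3f} (shift {loo - full:+.3f})")
```

A large shift when one particular trial is dropped flags it as influential and worth the with-and-without presentation the authors describe.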

¹ Names are alphabetized by last name. Those who did not disclose name are labeled "Anonymous Reviewer 1," "Anonymous Reviewer 2," etc.
² Affiliation is labeled "NA" for those who did not disclose affiliation.
³ Page and line numbers refer to the draft report.
⁴ Page and line numbers refer to the final report.


Current as of December 2009


Internet Citation:

Use of Bayesian Techniques in Randomized Clinical Trials: A CMS Case Study. Disposition of Comments. December 2009. Rockville, MD: Agency for Healthcare Research and Quality. http://www.ahrq.gov/clinic/ta/comments/bayesian/


 
