The declining efficacy of antipsychotic medication in randomized clinical trials has
led to major concern. Over the last decade, the number of failed phase II trials rose
by 15%. In the search for the causes of the apparent declining potency of antipsychotic
medication, it has been suggested that the explanation may be found in inadequate
rating procedures.1
For several decades now, Helena Kraemer has stressed the fundamental importance of inter-rater
reliability (IRR) for randomized clinical trials,2 in particular for the rating of psychotic
symptoms, since these measurements depend largely on observational instruments that require
acceptable reliability. In fact, in the absence of training procedures, reliability scores
are generally low (<0.6) for the observational instruments commonly applied in psychosis
research.3
Unreliable assessments can have a major impact on the interpretation of study outcomes.
Firstly, low reliability of data leads to underpowered studies, and therefore to more
false-negative findings and attenuated effect sizes.4
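To make this mechanism concrete: under classical test theory (a standard result, not an
analysis from the trials reviewed here), an observed standardized effect size is attenuated
by the square root of the reliability of the outcome measure,

$$ d_{\mathrm{obs}} = d_{\mathrm{true}} \sqrt{\rho}, $$

so with a reliability of $\rho = 0.6$, a true effect of $d = 0.5$ is observed as roughly
$0.5 \times \sqrt{0.6} \approx 0.39$, and a correspondingly larger sample is required to
detect it.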
Secondly, unsatisfactory training procedures and unreliable assessments, combined with
rater expectation biases and time pressure to complete inclusion, may lead to inflated
baseline severity scores. After inclusion, severity scores may then decline rapidly,
as the true severity scores of these participants are actually lower. Such inflated
baseline severity scores are associated with higher placebo responses, making it more
difficult to identify real effects in the intervention condition.5
Moreover, after the selection procedure and without controlling for reliability, rater
drift may occur, leading to increased measurement error with subsequent regression
to the mean.
Although the value of training procedures and reliability assessment is abundantly
clear, reporting in these areas is inconsistent. About 20 years ago, 2 papers found
that only 9.5% of the included manuscripts reported training procedures and that only
19% and 35%, respectively, of the included papers reported reliability measurements.6,7
However, these reviews did not provide precise information about training procedures
and IRR coefficients in antipsychotic medication trials, and we wondered whether there
had been any improvement over the last 20 years.
We therefore conducted a new review to determine the proportion of papers with and
without reported training procedures or IRR coefficients in double-blind randomized
controlled trials (RCTs) with antipsychotic medication during the past 2 decades.
To this end, we searched Medline for double-blind RCTs of antipsychotic medication
for the treatment of schizophrenia spectrum disorders between January 2000 and January
2019. We also selected all double-blind RCTs of antipsychotic medication from 4 large
meta-analyses published since 2000. Two authors (S.B. and L.V.), working independently,
retrieved the following information from the published manuscripts and supplements:
the presence of an actual IRR coefficient, i.e., an intraclass correlation coefficient
(ICC), Cohen's kappa, Krippendorff's alpha, or agreement coefficient 1 (AC1). Further,
we recorded whether correlation coefficients or a minimum percentage agreement were
used as IRR measures, whether central raters were employed, and any reported training of raters.
The details of our approach can be found in the supplementary material, parts 1.1
to 1.5.
We identified 207 double-blind RCTs: 34.8% (N = 72) reported training for raters and
11.1% (N = 23) reported an actual IRR coefficient. Of the 23 RCTs reporting an IRR
coefficient, 78.3% (N = 18) used the ICC and 21.7% (N = 5) used Cohen's kappa as a
measure of IRR. In addition, 6.8% (N = 14) of all RCTs reported that the reliability
of assessments was determined but did not report an IRR coefficient; 1.9% (N = 4)
reported a correlation coefficient; 2.4% (N = 5) reported a percentage agreement;
and only 2.9% (N = 6) used central raters. We found no significant differences
between industry-sponsored and non-industry-supported trials in the reporting of
training variables or reliability measures.
Inappropriate measures of IRR, such as percentage agreement and correlation coefficients,
were reported in 4.3% of the RCTs. These measures, as well as Cronbach's alpha, are
unsuitable for evaluating IRR: percentage agreement is not chance corrected, and
correlation coefficients merely quantify the association between raters without
accounting for absolute agreement between them. Such measures provide a false impression
of sufficient IRR. A correct analysis of IRR applies the ICC, Cohen's kappa, or
Krippendorff's alpha.
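To illustrate the difference (a minimal numerical sketch with hypothetical ratings, not
data from the reviewed trials): if a second rater scores every patient exactly 2 points
higher than the first, the Pearson correlation is a perfect 1.0, while an agreement-based
ICC, here ICC(2,1) computed from the standard two-way ANOVA mean squares, is far lower.

import numpy as np

# Hypothetical ratings: rater B scores every subject 2 points above rater A.
rater_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rater_b = rater_a + 2.0
x = np.column_stack([rater_a, rater_b])  # rows: subjects, columns: raters
n, k = x.shape

# Pearson correlation is perfect despite the systematic offset.
r = np.corrcoef(rater_a, rater_b)[0, 1]

# ICC(2,1): two-way random effects, absolute agreement (Shrout & Fleiss).
grand = x.mean()
ms_rows = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # between subjects
ms_cols = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # between raters
ss_err = ((x - x.mean(axis=1, keepdims=True)
             - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
ms_err = ss_err / ((n - 1) * (k - 1))
icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

print(f"Pearson r = {r:.2f}")  # 1.00
print(f"ICC(2,1)  = {icc:.2f}")  # ~0.56

The correlation is blind to the constant 2-point offset, whereas the agreement-based
ICC penalizes it.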
Despite strong recommendations in the literature concerning the inclusion of IRR
coefficients and training procedures, no improvement was observed over the past
2 decades. The finding that differences between antipsychotic medication and placebo
have become smaller in recent decades may be attributed in part to the lack of
training procedures and to shortcomings in reliability.
The descriptions of training procedures that we found in the reviewed RCTs varied strongly,
from detailed accounts of repeated training procedures to the mere statement that raters
were trained. The bottom line is that the value of high-quality training procedures
for accurate signal detection has been widely recognized for decades, yet we still seem
to bury our heads in the sand and ignore its vital importance.
The neglect of training procedures could be caused by the preconception that clinically
experienced raters conduct reliable assessments and that training is not required.
However, several studies have indicated that even experienced clinicians cannot make
reliable assessments of at least one-third of the individual PANSS items.3
Additionally, it is possible, albeit highly unlikely, that some authors actually did
implement training procedures or reliability estimations without reporting them. In
the more likely event that there was actually no training, this may have been due
to the perception that rater training is too costly, time-consuming or difficult to
implement in large multi-national trials.
Nevertheless, significant savings can be made by improving reliability, since better
reliability improves power, meaning that smaller sample sizes are needed to demonstrate
effectiveness. To illustrate: improving reliability from 0.7 to 0.9 reduces the required
sample size by 22%.4
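The arithmetic behind this figure follows if, as in the classical power calculation for
an outcome measured with error, the required sample size scales inversely with reliability
(an assumption stated here for clarity):

$$ \frac{N_{0.9}}{N_{0.7}} = \frac{0.7}{0.9} \approx 0.78, $$

a reduction of roughly 22%.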
Using central raters could result in major improvements in the areas discussed here:
central raters are independent of the study design, highly trained, and achieve high
IRR scores. It has been shown that the use of central raters results in significantly
less baseline-score inflation in studies of antidepressant medication.5
Several changes in procedure could be considered. Firstly, training procedures should
include a course on interview skills and video-taped interviews followed by reliability
assessment. Independent interviews of the same patient by several raters would be ideal;
however, we consider such a procedure problematic to implement in multicenter projects.
Secondly, assessments during clinical trials could be recorded and reevaluated for
reliability and rater drift. Inadequate ratings can then be adjusted, and raters who
persistently produce insufficient ratings may receive additional training; ultimately,
raters could even be removed from the trial if their assessments persistently fail.
Thirdly, each assessment could be reevaluated by several raters and the average score
used as the outcome measure; as a result, reliability would increase, and with it power
and effect sizes.
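The expected gain from averaging several raters can be quantified with the Spearman-Brown
formula (a standard psychometric result; the numbers below are illustrative): for
single-rater reliability $\rho$, the reliability of the mean of $k$ raters is

$$ \rho_k = \frac{k\rho}{1 + (k-1)\rho}, $$

so a modest single-rater ICC of 0.6 would rise to $3(0.6)/(1 + 2 \times 0.6) \approx 0.82$
when the scores of 3 raters are averaged.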
In conclusion, training procedures and IRR coefficients are still often neglected
in double-blind RCTs with antipsychotic medication. Despite urgent recommendations,
there has been no improvement in the reporting, and probably the implementation, of
training procedures and reliability assessment over the last 2 decades. Editors of psychiatric
journals could contribute to improvement in the future by imposing strict and detailed
requirements for reporting on training procedures and IRR coefficients in manuscripts.
Furthermore, the use of central raters could provide major benefits in terms of reliability,
the prevention of baseline-score inflation, and accurate study outcomes.
Funding
No external funding or financial resources have been used for this project.
Supplementary Material