A Primer on the Use of Equivalence Testing for Evaluating Measurement Agreement

Abstract

Purpose

Statistical equivalence testing is more appropriate than conventional tests of difference for assessing the validity of physical activity (PA) measures. This paper presents the underlying principles of equivalence testing and gives three examples from PA and fitness assessment research.

Methods

The three examples illustrate different uses of equivalence tests. Example 1 uses PA data to evaluate an activity monitor's equivalence to a known criterion. Example 2 illustrates the equivalence of two field-based measures of physical fitness with no known reference method. Example 3 uses regression to evaluate an activity monitor's equivalence across a suite of 23 activities.

Results

The examples illustrate the appropriate reporting and interpretation of results from equivalence tests. In the first example, the mean criterion measure is significantly within ±15% of the mean PA monitor. The mean difference is 0.18 METs and the 90% confidence interval of [−0.15, 0.52] is inside the equivalence region of [−0.65, 0.65]. In the second example, we chose to define equivalence for these two measures as a ratio of mean values between 0.98 and 1.02. The estimated ratio of mean VO2 values is 0.99, which is significantly (p=0.007) inside the equivalence region. In the third example, the PA monitor is not equivalent to the criterion across the suite of activities. The estimated regression intercept and slope are −1.23 and 1.06. Neither confidence interval is within the suggested regression equivalence regions.

Conclusions

When the study goal is to show similarity between methods, equivalence testing is more appropriate than traditional statistical tests of differences (e.g., ANOVA and t-tests).
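
The decision rule used in the first example (a confidence interval lying entirely inside a pre-specified equivalence region) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `ci_within_region` is hypothetical, the first call uses the interval and region reported in the abstract, and the second call uses a made-up interval to show a failing case.

```python
def ci_within_region(ci, region):
    """Return True when the confidence interval lies strictly inside the region.

    ci and region are (lower, upper) tuples on the same scale
    (e.g. a mean difference in METs, or a ratio of means).
    """
    ci_low, ci_high = ci
    region_low, region_high = region
    if ci_low > ci_high or region_low > region_high:
        raise ValueError("each interval must be given as (lower, upper)")
    return region_low < ci_low and ci_high < region_high


# Example 1 from the abstract: 90% CI for the mean difference (METs)
# compared with the reported equivalence region.
example1_equivalent = ci_within_region(ci=(-0.15, 0.52), region=(-0.65, 0.65))

# Hypothetical interval (not from the paper) that crosses the upper bound.
hypothetical_equivalent = ci_within_region(ci=(-0.15, 0.70), region=(-0.65, 0.65))

print(example1_equivalent, hypothetical_equivalent)
```

The same check applies to the second example if the interval and the region [0.98, 1.02] are both expressed as ratios of means; the abstract reports only the p-value for that case, so no interval is filled in here.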

Most cited references 18


A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability.

Donald Schuirmann (1987)

The statistical test of the hypothesis of no difference between the average bioavailabilities of two drug formulations, usually supplemented by an assessment of what the power of the statistical test would have been if the true averages had been inequivalent, continues to be used in the statistical analysis of bioavailability/bioequivalence studies. In the present article, this Power Approach (which in practice usually consists of testing the hypothesis of no difference at level 0.05 and requiring an estimated power of 0.80) is compared to another statistical approach, the Two One-Sided Tests Procedure, which leads to the same conclusion as the approach proposed by Westlake based on the usual (shortest) 1 − 2α confidence interval for the true average difference. It is found that for the specific choice of α = 0.05 as the nominal level of the one-sided tests, the two one-sided tests procedure has uniformly superior properties to the power approach in most cases. The only cases where the power approach has superior properties when the true averages are equivalent correspond to cases where the chance of concluding equivalence with the power approach when the true averages are not equivalent exceeds 0.05. With appropriate choice of the nominal level of significance of the one-sided tests, the two one-sided tests procedure always has uniformly superior properties to the power approach. The two one-sided tests procedure is compared to the procedure proposed by Hauck and Anderson.
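
The link described above between the two one-sided tests and the 1 − 2α confidence interval can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not Schuirmann's procedure as published: it works on paired differences, it substitutes a large-sample normal approximation for Student's t so that only the standard library is needed, and the function name `tost_paired` and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(diffs, low, high, alpha=0.05):
    """Two one-sided tests on paired differences (normal approximation).

    Equivalence is concluded when both one-sided null hypotheses
    (true mean <= low, true mean >= high) are rejected at level alpha.
    """
    n = len(diffs)
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(n)
    z = NormalDist()

    # Test 1: H0 "true mean difference <= low", rejected for large statistics.
    p_lower = 1.0 - z.cdf((d_bar - low) / se)
    # Test 2: H0 "true mean difference >= high", rejected for small statistics.
    p_upper = z.cdf((d_bar - high) / se)
    p_value = max(p_lower, p_upper)

    # Matching (1 - 2*alpha) confidence interval, e.g. 90% when alpha = 0.05.
    crit = z.inv_cdf(1.0 - alpha)
    ci = (d_bar - crit * se, d_bar + crit * se)
    return {"mean_diff": d_bar, "p_value": p_value, "ci": ci,
            "equivalent": p_value < alpha}


# Hypothetical paired differences (monitor minus criterion), not real data.
diffs = [0.3, -0.1, 0.4, 0.0, 0.2, -0.2, 0.5, 0.1, 0.3, 0.2]
wide = tost_paired(diffs, low=-0.65, high=0.65)
narrow = tost_paired(diffs, low=-0.20, high=0.20)
print(wide["equivalent"], narrow["equivalent"])
```

With either region, the test rejects both one-sided hypotheses exactly when the 1 − 2α interval falls inside the region, which is why the two approaches lead to the same conclusion. For small samples a t-based version would be the more usual choice.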


Validity of consumer-based physical activity monitors.

Gregory J Welk, Jung-Eun Lee, Youngwon Kim (2014)

Many consumer-based monitors are marketed to provide personal information on the levels of physical activity and daily energy expenditure (EE), but little or no information is available to substantiate their validity. This study aimed to examine the validity of EE estimates from a variety of consumer-based, physical activity monitors under free-living conditions. Sixty (26.4 ± 5.7 yr) healthy males (n = 30) and females (n = 30) wore eight different types of activity monitors simultaneously while completing a 69-min protocol. The monitors included the BodyMedia FIT armband worn on the left arm, the DirectLife monitor around the neck, the Fitbit One, the Fitbit Zip, and the ActiGraph worn on the belt, as well as the Jawbone Up and Basis B1 Band monitor on the wrist. The validity of the EE estimates from each monitor was evaluated relative to criterion values concurrently obtained from a portable metabolic system (i.e., Oxycon Mobile). Differences from criterion measures were expressed as a mean absolute percent error and were evaluated using 95% equivalence testing. For overall group comparisons, the mean absolute percent error values (computed as the average absolute value of the group-level errors) were 9.3%, 10.1%, 10.4%, 12.2%, 12.6%, 12.8%, 13.0%, and 23.5% for the BodyMedia FIT, Fitbit Zip, Fitbit One, Jawbone Up, ActiGraph, DirectLife, NikeFuel Band, and Basis B1 Band, respectively. The results from the equivalence testing showed that the estimates from the BodyMedia FIT, Fitbit Zip, and NikeFuel Band (90% confidence interval = 341.1-359.4) were each within the 10% equivalence zone around the indirect calorimetry estimate. The indicators of the agreement clearly favored the BodyMedia FIT armband, but promising preliminary findings were also observed with the Fitbit Zip.
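
As a rough illustration of the error metric mentioned above, the sketch below computes a mean absolute percent error in Python. It assumes one common definition (the average of per-observation absolute percent errors relative to the criterion); the abstract's parenthetical describes group-level errors, so the study's exact computation may differ. The function name and the energy expenditure values are hypothetical.

```python
def mean_absolute_percent_error(estimates, criteria):
    """Average of |estimate - criterion| / criterion, expressed in percent.

    Criterion values must be positive (e.g. energy expenditure in kcal).
    """
    if len(estimates) != len(criteria) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(c <= 0 for c in criteria):
        raise ValueError("criterion values must be positive")
    errors = [abs(e - c) / c * 100.0 for e, c in zip(estimates, criteria)]
    return sum(errors) / len(errors)


# Hypothetical energy expenditure values in kcal, not data from the study.
monitor_kcal = [330.0, 360.0, 410.0, 295.0]
criterion_kcal = [350.0, 340.0, 400.0, 320.0]
mape = mean_absolute_percent_error(monitor_kcal, criterion_kcal)
print(round(mape, 2))
```

Because the absolute value is taken before averaging, over- and underestimates do not cancel, so this metric can be large even when the mean difference between monitor and criterion is close to zero.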


Is Open Access

Statistical Methods Used to Test for Agreement of Medical Instruments Measuring Continuous Variables in Method Comparison Studies: A Systematic Review

Rafdzah Zaki, Awang M Bulgiba, Roshidi Ismail … (2012)

Background: Accurate values are a must in medicine. An important parameter in determining the quality of a medical instrument is agreement with a gold standard. Various statistical methods have been used to test for agreement. Some of these methods have been shown to be inappropriate. This can result in misleading conclusions about the validity of an instrument. The Bland-Altman method is the most popular method judging by the many citations of the article proposing this method. However, the number of citations does not necessarily mean that this method has been applied in agreement research. No previous study has been conducted to look into this. This is the first systematic review to identify statistical methods used to test for agreement of medical instruments. The proportion of various statistical methods found in this review will also reflect the proportion of medical instruments that have been validated using those particular methods in current clinical practice.

Methodology/Findings: Five electronic databases were searched between 2007 and 2009 to look for agreement studies. A total of 3,260 titles were initially identified. Only 412 titles were potentially related, and finally 210 fitted the inclusion criteria. The Bland-Altman method is the most popular method with 178 (85%) studies having used this method, followed by the correlation coefficient (27%) and means comparison (18%). Some of the inappropriate methods highlighted by Altman and Bland since the 1980s are still in use.

Conclusions: This study finds that the Bland-Altman method is the most popular method used in agreement research. There are still inappropriate applications of statistical methods in some studies. It is important for a clinician or medical researcher to be aware of this issue because misleading conclusions from inappropriate analyses will jeopardize the quality of the evidence, which in turn will influence quality of care given to patients in the future.