The Wilcoxon-Mann-Whitney test is usually appropriate for data that are not normally distributed. However, if the hypothesis is that a randomly selected sample from one group will be larger than a randomly selected sample from another group, then the right statistic to calculate is the closely related WMWodds (Wilcoxon-Mann-Whitney odds). (Image source: Thinkstock)

In the published literature on analgesic studies, the Wilcoxon-Mann-Whitney (WMW) rank sum test (or, for paired data, the Wilcoxon signed rank test) is frequently used to analyze outcomes such as morphine consumption in the PACU, antiemetic dose, or a Likert-scored outcome such as pain or nausea. These outcomes share the common characteristic of being non-normally distributed. As we all learned in our introductory statistics classes, ranked variables are not normally distributed, they are common in biostatistics, and they are analyzed by nonparametric tests such as the WMW test.

In the September 2013 issue of Anesthesia & Analgesia, Dr. George Divine, Public Health Sciences, Henry Ford Hospital, Detroit, Michigan, and colleagues address some of the misconceptions about Wilcoxon tests in “A Review of Analysis and Sample Size Calculation Considerations for Wilcoxon Tests.” The accompanying editorial by Dr. Frank Dexter, Department of Anesthesia, Division of Management Consulting, University of Iowa, Iowa City, Iowa, “Wilcoxon-Mann-Whitney Test Used for Data That Are Not Normally Distributed,” explains why we should interpret these tests cautiously. There are a few major take-away messages from these two important articles.

First, the WMW test is a rank sum test but NOT a test of medians. As shown in the examples from Dexter’s editorial, even though the medians of two groups may be nearly identical, the WMW test result could be statistically significant because the two groups have different rank sums.
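
To make this point concrete, here is a minimal Python sketch using made-up data (these are not the examples from Dexter’s editorial). Both groups are constructed to have a median of exactly 50 and no tied values. The helper name `wmw_test` is hypothetical, and the P-value comes from a plain normal approximation with a continuity correction rather than from an exact method.

```python
import math
from statistics import median


def wmw_test(x, y):
    """Return (U, two-sided P) for x versus y.

    U counts the pairs in which the x value exceeds the y value (a tie
    counts one half). The P-value uses the large-sample normal approximation
    with a continuity correction and no tie correction, so it assumes
    untied data.
    """
    n1, n2 = len(x), len(y)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = max(abs(u - mean_u) - 0.5, 0.0) / sd_u
    p_two_sided = math.erfc(z / math.sqrt(2))
    return u, p_two_sided


# Illustrative data only: group A sits just below 50 or far above it,
# group B sits far below 50 or just above it.
group_a = list(range(40, 50)) + [51] + list(range(80, 89))
group_b = list(range(10, 19)) + [49.5, 50.5] + list(range(52, 61))

u_stat, p_value = wmw_test(group_a, group_b)
print(median(group_a), median(group_b))  # both medians are 50
print(u_stat, round(p_value, 3))
```

In this constructed example, a value from group A exceeds a value from group B in 281 of the 400 possible pairs, so the rank sums differ and the approximate P-value falls below 0.05 even though the medians coincide.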

Second, an appropriate effect size is necessary when comparing two groups using the WMW test. Estimates of effect size are needed to judge the practical or theoretical importance of a clinical effect, and they are required for power analysis when planning a new study. The CONSORT (CONsolidated Standards of Reporting Trials) 2010 statement emphasizes the importance of always reporting the size of the effect together with the P-value in any randomized clinical trial. When data are not normally distributed, the median and interquartile range are routinely reported as summary statistics, so it seems natural to report the median difference and its 95% confidence interval (CI) as a measure of effect size. When the two groups can be assumed to have distributions of the same shape, the Hodges-Lehmann estimate and its confidence interval can be computed from the ranked data and can be insightful. As discussed earlier, however, the WMW test can be significant when the two medians are equal. In this situation, the median difference and its 95% CI would be inappropriate or irrelevant to report.
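
As a rough illustration of the Hodges-Lehmann idea, the two-sample point estimate of the shift between groups is the median of all pairwise differences. The sketch below uses invented numbers and a hypothetical helper name, `hodges_lehmann_shift`. It computes only the point estimate, not the confidence interval.

```python
from statistics import median


def hodges_lehmann_shift(x, y):
    """Median of all pairwise differences x_i - y_j (two-sample estimate)."""
    return median(a - b for a in x for b in y)


# Invented values, for illustration only.
treatment = [12, 15, 18, 20]
control = [10, 11, 14]

shift = hodges_lehmann_shift(treatment, control)
median_difference = median(treatment) - median(control)
print(shift, median_difference)  # 4.5 5.5
```

The two numbers differ because the Hodges-Lehmann estimate is the median of the differences, not the difference of the medians.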

Divine and colleagues describe a different type of summary measure for use with the WMW test, WMWodds (O’Brien, Ralph G., and John Castelloe. “Exploiting the link between the Wilcoxon-Mann-Whitney test and a simple odds statistic.” Proceedings of the Thirty-First Annual SAS Users Group International Conference, San Francisco, CA, March 2006). WMWodds measures the odds that a randomly selected observation from one group is larger than a randomly selected observation from the other group. Statistical inference on whether WMWodds = 1 therefore corresponds exactly to whether the null hypothesis of the WMW test is rejected, which makes WMWodds a clinically interpretable summary measure. It should be noted that there are other situations where summary statistics other than WMWodds (e.g., the mean) might be more relevant and preferred. For testing a mean difference, Divine and colleagues recommend the distribution-free permutation test, which allows a valid conclusion when data are not normally distributed. When the sample size is large (usually n ≥ 30), the parametric t-test with unequal variances can also be used to compare two groups. However, more appropriate methods for analyzing the means of skewed distributions, with or without extra zeros, have been developed and are described in detail elsewhere (see Ledolter J, Dexter F; Friedrich J, Adhikari N, Beyene J; Chen YH, Zhou XH).
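
The WMWodds idea can be sketched in a few lines of Python. The function name `wmw_odds` and the data are hypothetical, and counting each tied pair as half a "win" is an assumption made here for illustration. The sketch gives only the point estimate and leaves out the confidence interval and the test of WMWodds = 1.

```python
import math


def wmw_odds(x, y):
    """Return (probability, odds) that a random x value exceeds a random y.

    Assumption for this sketch: each tied pair counts as half a win.
    """
    n_pairs = len(x) * len(y)
    wins = sum(a > b for a in x for b in y)
    ties = sum(a == b for a in x for b in y)
    prob = (wins + 0.5 * ties) / n_pairs
    odds = math.inf if prob == 1 else prob / (1 - prob)
    return prob, odds


# Invented scores, for illustration only.
group_1 = [3, 5, 7, 9]
group_2 = [2, 4, 5, 6]

prob, odds = wmw_odds(group_1, group_2)
print(prob, round(odds, 2))  # 0.71875 2.56
```

Of the 16 possible pairs, group 1 is larger in 11 and tied in 1, giving a probability of 11.5/16 and odds of about 2.56 to 1. An odds value of 1 would mean that neither group tends to be larger.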

Third, Divine and colleagues suggest several simple formulae for sample size calculation for WMW tests. The approach is based on logistic regression. Sample SAS code for the calculation is available, and its use is described in footnote d of the editorial.
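
For readers without SAS, the flavor of such a calculation can be sketched with Noether's well-known approximation, which is swapped in here for illustration. It is not necessarily one of the formulae Divine and colleagues propose, and it is not a translation of their SAS code. The function name `noether_total_n` and the planning values are hypothetical. The input is the probability that an observation from one group exceeds one from the other, which equals WMWodds / (1 + WMWodds).

```python
import math
from statistics import NormalDist


def noether_total_n(wmw_prob, alpha=0.05, power=0.80, fraction_group1=0.5):
    """Approximate total sample size for a two-sided WMW test (Noether).

    N = (z_{1-alpha/2} + z_{power})^2 / (12 * c * (1 - c) * (p - 0.5)^2),
    where p is wmw_prob and c is the fraction of patients in group 1.
    """
    if not 0 < wmw_prob < 1 or wmw_prob == 0.5:
        raise ValueError("wmw_prob must lie in (0, 1) and differ from 0.5")
    normal = NormalDist()
    z_sum = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    c = fraction_group1
    n_total = z_sum ** 2 / (12 * c * (1 - c) * (wmw_prob - 0.5) ** 2)
    return math.ceil(n_total)


# Hypothetical planning value: WMWodds of 13/7, i.e. a probability of 0.65.
planned_odds = 13 / 7
planned_prob = planned_odds / (1 + planned_odds)
total_n = noether_total_n(planned_prob)
print(total_n)  # 117
```

With equal allocation, a two-sided alpha of 0.05, and 80% power, this approximation suggests about 117 patients in total, which in practice would be rounded up to 59 per group.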

In general, nonparametric tests are statistically more powerful than approaches that assume normal distributions when the underlying distributions depart from normality, for example through heavy tails or strong skewness. When data are normally distributed, nonparametric tests are less powerful, though the power loss is not substantial, especially when the sample size is not small. When the sample size is extremely small, simple graphs (like those displayed in Divine and colleagues’ article) convey more information than formally applying a statistical test such as the WMW test.

The WMW test is a simple and reliable statistical tool for simple hypothesis testing. However, it cannot account for covariate effects on the outcome. When data come from clinical trials in which randomization makes the patient groups comparable, direct statistical inference is simple to derive. However, in observational studies (not uncommon in analgesic research), where confounding factors cannot be controlled by study design, the groups must first be made comparable (i.e., confounding factors must be adjusted for) before applying any simple statistic such as the WMW test. In this regard, stratification by levels of the confounders or regression adjustment should be considered.