Chapter 4 Mann-Whitney

The Mann-Whitney test is closely related to the Wilcoxon rank-sum test: the two test statistics are linearly related and yield the same p-value (see Section 4.3).

We make the following assumptions:

  1. Observations from the two groups are independent of one another
  2. Both population distributions are continuous (not categorical / discrete)

4.1 How It Works

Suppose we have \(m\) observations in sample \(X\) and \(n\) observations in sample \(Y\). We consider every possible pair of observations, one from each sample. The test statistic \(U\) is simply the number of pairs for which \(X_i < Y_j\) (equivalently, one can count the pairs with \(X_i > Y_j\); that choice only flips which tail of the permutation distribution counts as extreme). The smallest possible value of \(U\) is \(0\), and the largest is the total number of pairs, \(mn\).
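As a minimal sketch of this counting in R (the function name `mann_whitney_U` is just illustrative):

# U = number of (X_i, Y_j) pairs with X_i < Y_j
mann_whitney_U <- function(x, y) {
  sum(outer(x, y, FUN = "<"))   # length(x) * length(y) pairwise comparisons
}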

Let’s say we have two samples, and want to see if sample 1 has a lower location than sample 2. Here’s our raw data:

Sample 1:  31  33  46  40
Sample 2:  39  49  55  57

We look at every possible pair of values, one from each sample, and check whether the Sample 1 value is greater than the Sample 2 value:

Is Sample 1 > Sample 2? (columns: Sample 1 values; rows: Sample 2 values)

        31   33   40   46
  39     N    N    Y    Y
  49     N    N    N    N
  55     N    N    N    N
  57     N    N    N    N

Our test statistic \(U_{obs}\) is the number of pairs where the Sample 1 value is greater than the Sample 2 value, indicated by the Y's in the matrix. So \(U_{obs}=2\). Because our alternative is that Sample 1 has the lower location, small values of \(U\) are what count as extreme.
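As a sketch, the comparison matrix and \(U_{obs}\) can be reproduced in R with `outer()`; the object names below are just illustrative:

sample1 <- c(31, 33, 46, 40)
sample2 <- c(39, 49, 55, 57)

# Rows are Sample 2 values, columns are (sorted) Sample 1 values
comparisons <- outer(sample2, sort(sample1), FUN = function(s2, s1) s1 > s2)
dimnames(comparisons) <- list(sample2, sort(sample1))

comparisons        # TRUE where Sample 1 > Sample 2 (the Y's above)
sum(comparisons)   # U_obs = 2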


If we randomly permuted the observations across the two labels, we would need our observed test statistic to be extreme relative to the \(U\) statistics from those permutations before concluding that the null hypothesis doesn't hold. Why? The null hypothesis says that the observations from both samples come from the same distribution, so we need sufficient evidence that this isn't the case in order to reject it.

Our observed sample assignment is just one of \(\binom{8}{4}=70\) possible permutations of the values between the two sample labels. Let's go ahead and find the corresponding test statistic \(U^*\) for each permutation:

Permutation   U*
     1         0
     2         1
     3         1
    ...       ...
    70        16
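Here is a sketch of how this enumeration might be done in R, using `combn()` to list every way of assigning 4 of the 8 pooled values to the Sample 1 label (the object names are illustrative):

values <- c(31, 33, 46, 40, 39, 49, 55, 57)   # the 8 observations pooled together

# Each column of 'assignments' is one way to pick which 4 values get the Sample 1 label
assignments <- combn(8, 4)                    # choose(8, 4) = 70 permutations

# U* for each permutation: number of pairs where the Sample 1 value > the Sample 2 value
U_star <- apply(assignments, 2, function(idx) {
  g1 <- values[idx]    # relabeled "Sample 1"
  g2 <- values[-idx]   # relabeled "Sample 2"
  sum(outer(g2, g1, FUN = function(s2, s1) s1 > s2))
})

table(U_star)            # permutation distribution of U (values 0 through 16)
sum(U_star <= 2) / 70    # 4/70, the p-value computed below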

The p-value is the proportion of permutations whose \(U^*\) is as extreme as or more extreme than \(U_{obs}\), here those with \(U^* \leq U_{obs}\). Since 4 of the 70 possible permutations have \(U^* \leq U_{obs}\), we get a p-value of \(\frac{4}{70} \approx 0.057\).

The intuition here is fairly straightforward: we expect to see a test statistic \(U\) as or more extreme than \(U_{obs}\) about 5.7% of the time when the null hypothesis is true, that is, when there really is no difference between the two distributions.

Formal Definitions

Here \(U\) counts the pairs with \(X_i < Y_j\), where the \(X\)'s are the \(m\) observations from population 1 and the \(Y\)'s are the \(n\) observations from population 2.

For a two-sided test: \[ H_0: F_1(x) = F_2(x) \\ H_a: F_1(x) \neq F_2(x) \\ ~ \\ p\text{-value}_{two\ sided} = \frac{\text{# of } U\text{'s at least as far from } \frac{mn}{2} \text{ as } U_{obs}}{\binom{m+n}{m}} \]

For an upper tail test (population 1 tends toward larger values): \[ H_0: F_1(x) = F_2(x) \\ H_a: F_1(x) \leq F_2(x) \\ ~ \\ p\text{-value}_{upper} = \frac{\text{# of } U \leq U_{obs}}{\binom{m+n}{m}} \]

For a lower tail test (population 1 tends toward smaller values): \[ H_0: F_1(x) = F_2(x) \\ H_a: F_1(x) \geq F_2(x) \\ ~ \\ p\text{-value}_{lower} = \frac{\text{# of } U \geq U_{obs}}{\binom{m+n}{m}} \]
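As a hedged sketch of these definitions (not a function from any package), the exact permutation p-values could be computed by brute-force enumeration; `perm_mw_test()` and its arguments are made up for illustration and are only practical for small samples:

perm_mw_test <- function(x, y, alternative = c("two.sided", "upper", "lower")) {
  alternative <- match.arg(alternative)
  pooled <- c(x, y)
  m <- length(x)
  n <- length(y)

  # U for a given assignment of positions to "population 1"
  U_of <- function(idx) sum(outer(pooled[idx], pooled[-idx], FUN = "<"))

  U_obs  <- U_of(seq_len(m))                   # observed assignment
  U_star <- apply(combn(m + n, m), 2, U_of)    # all choose(m + n, m) assignments

  center <- m * n / 2
  p <- switch(alternative,
    two.sided = mean(abs(U_star - center) >= abs(U_obs - center)),
    upper     = mean(U_star <= U_obs),   # H_a: F_1(x) <= F_2(x)
    lower     = mean(U_star >= U_obs))   # H_a: F_1(x) >= F_2(x)

  list(U_obs = U_obs, p_value = p)
}

# perm_mw_test(c(31, 33, 46, 40), c(39, 49, 55, 57), "lower")   # p_value = 4/70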

4.2 Code

sample1 <- c(31, 33, 46, 40)
sample2 <- c(39, 49, 55, 57)

# wilcox.test() runs the Wilcoxon rank-sum / Mann-Whitney test.
# alternative = 'less' tests whether sample1 is shifted below sample2.
# It reports W = 2 (equal to U_obs above) and an exact p-value of 4/70 ≈ 0.057.
wilcox.test(sample1, sample2, alternative='less')

4.3 Note

The Wilcoxon rank-sum statistic \(W\) is linearly related to the Mann-Whitney \(U\), so the two tests yield the same p-value. In particular, if \(W_2\) is the sum of the ranks of the \(Y\) observations in the combined sample, then \(W_2 = \frac{n(n+1)}{2} + U\).

Proof

Label the \(Y\)'s so that \(Y_1 < Y_2 < \cdots < Y_n\), and let \(R(Y_j)\) be the rank of \(Y_j\) in the combined sample of \(m+n\) observations. The rank of \(Y_j\) is then \(j\) plus the number of \(X\)'s below it, so

\[ \begin{array}{l} W_{2} \\ =\sum_{j=1}^{n} R\left(Y_{j}\right) \\ =R\left(Y_{1}\right)+R\left(Y_{2}\right)+\cdots+R\left(Y_{n}\right) \\ =\left[1+\left(\text{number of } X\text{'s} \leq Y_{1}\right)\right]+\left[2+\left(\text{number of } X\text{'s} \leq Y_{2}\right)\right]+\cdots \\ =[1+\cdots+n]+\left[\left(\text{number of } X\text{'s} \leq Y_{1}\right)+\cdots+\left(\text{number of } X\text{'s} \leq Y_{n}\right)\right] \\ =[1+\cdots+n]+U \\ =\frac{n(n+1)}{2}+U \end{array} \]
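As a quick numeric check of this identity on the example data (a sketch, with made-up object names):

x <- c(31, 33, 46, 40)   # Sample 1 (the X's)
y <- c(39, 49, 55, 57)   # Sample 2 (the Y's)
n <- length(y)

W2 <- sum(rank(c(x, y))[(length(x) + 1):(length(x) + n)])   # rank sum of the Y's: 24
U  <- sum(outer(x, y, FUN = "<"))                           # pairs with X_i < Y_j: 14

W2 == n * (n + 1) / 2 + U    # TRUE: 24 = 10 + 14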