Chapter 3 Wilcoxon Rank-Sum
The Wilcoxon rank-sum test is great for small samples and data with outliers, since it uses the rank of each observation rather than the value itself.
Only one assumption: both population distributions should be continuous (not categorical or discrete), so ties occur with probability zero.
3.1 How It Works
The goal here is to use ranks, not actual values, to identify differences in location. Why? Ranks are far more resistant to outliers: a single extreme observation simply receives the maximum rank, no matter how far above the rest it is in absolute value.
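A quick sketch of that robustness in Python (the snippet and its data are illustrative, not from the text), using `scipy.stats.rankdata`:

```python
from scipy.stats import rankdata

# An outlier just becomes "the largest" -- its rank is capped at the max,
# so a rank-based statistic is unchanged no matter how extreme it gets.
print(rankdata([31, 32, 33, 47]))    # [1. 2. 3. 4.]
print(rankdata([31, 32, 33, 4700]))  # [1. 2. 3. 4.]  (same ranks)
```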
We can pool the observations and compare the ranks assigned to sample 1 against those assigned to sample 2. If the average rank of sample 1 is lower than that of sample 2 by a large enough margin, we can conclude that the values of population 1 tend to fall below those of population 2.
We’ll use the following as an example:
| Sample 1 | 31 | 32 | 33 | 47 |
|----------|----|----|----|----|
| Sample 2 | 46 | 48 | 49 | 51 |
We’ll first calculate our test statistic \(W_{obs}\). To do so, we pool both samples together and rank them, assigning a rank of 1 to the smallest observation and \(m+n\) to the largest (since there are now \(m+n\) observations in the pooled group).
| Values | 31 | 32 | 33 | 47 | 46 | 48 | 49 | 51 |
|--------|----|----|----|----|----|----|----|----|
| Ranks  | 1  | 2  | 3  | 5  | 4  | 6  | 7  | 8  |
\(W_{obs}\) is simply the sum of the ranks of sample 1’s observations: \(W_{obs} = 1+2+3+5 = 11\).
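As a sketch in code (variable names are my own, not from the text), the pooling, ranking, and \(W_{obs}\) steps look like this:

```python
import numpy as np
from scipy.stats import rankdata

sample1 = [31, 32, 33, 47]
sample2 = [46, 48, 49, 51]

# Pool both samples and rank them; rank 1 goes to the smallest value.
pooled = np.array(sample1 + sample2)
ranks = rankdata(pooled)  # [1. 2. 3. 5. 4. 6. 7. 8.]

# W_obs is the sum of the ranks that landed on sample 1's observations.
m = len(sample1)
w_obs = ranks[:m].sum()
print(w_obs)  # 11.0
```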
Why use this instead of comparing mean ranks? The rank sum of a sample is just its mean rank multiplied by the fixed sample size \(m\), so there’s a 1:1 correspondence between the two. The sum just simplifies the computation.
Under the null hypothesis, the two samples come from the same distribution, so every assignment of ranks to the two groups is equally likely. If we were to randomly switch around (permute) the observations across the samples, our observed test statistic \(W_{obs}\) shouldn’t look unusual under \(H_0\). In other words, random chance could have just as easily produced \(W_{obs}\) as the treatment we gave.
Let’s make that idea a little more quantitative: if sample 1 has \(m\) observations and sample 2 has \(n\) observations, there are \(\binom {m+n}{m} = \frac {(m+n)!}{m!\,n!}\) ways to pool together our observations and reassign them to the two groups. We can calculate a test statistic \(W^*\) for each one of these reassignments.
Our p-value is then just the fraction of reassignments whose test statistic \(W^*\) is as or more extreme than the observed \(W_{obs}\).
In our example, there are \(\binom {8}{4}=70\) possible ways we could’ve obtained the four observations in sample 1 from a total of eight values. Only two of them, ranks \(\{1,2,3,4\}\) (\(W^*=10\)) and ranks \(\{1,2,3,5\}\) (\(W^*=11\)), give a rank sum as small as \(W_{obs}=11\), so the lower-tail p-value is \(2/70 \approx 0.028\). We could also look up the significance from a table, or read off the p-value from the test output.
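Here’s a minimal sketch of that full enumeration (my own code, assuming the lower-tail alternative used above):

```python
from itertools import combinations

import numpy as np
from scipy.stats import rankdata

sample1 = [31, 32, 33, 47]
sample2 = [46, 48, 49, 51]

ranks = rankdata(np.array(sample1 + sample2))
m = len(sample1)
w_obs = ranks[:m].sum()  # 11.0

# Enumerate all C(m+n, m) = 70 ways to hand m of the pooled ranks to
# sample 1, computing the rank sum W* for each reassignment.
w_star = [sum(combo) for combo in combinations(ranks, m)]

# Lower-tail p-value: fraction of reassignments with W* <= W_obs.
p_lower = sum(w <= w_obs for w in w_star) / len(w_star)
print(len(w_star), p_lower)  # 70 0.02857...
```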
3.2 Formal Definitions
For a two-sided test: \[ H_0: F_1(x) = F_2(x) \\ H_a: F_1(x) \neq F_2(x) \\ ~ \\ p\text{-value}_{two\ sided} = \frac{\text{# of } W^* \text{ at least as extreme as } W_{obs} \text{ across both tails}}{\binom {m+n}{m}} \] For an upper tail test (population 1 shifted up, so \(F_1\) sits below \(F_2\) and large values of \(W\) are extreme): \[ H_0: F_1(x) = F_2(x) \\ H_a: F_1(x) \leq F_2(x) \\ ~ \\ p\text{-value}_{upper} = \frac{\text{# of }W^*\geq W_{obs}}{\binom {m+n}{m}} \]
For a lower tail test (population 1 shifted down, so small values of \(W\) are extreme): \[ H_0: F_1(x) = F_2(x) \\ H_a: F_1(x) \geq F_2(x) \\ ~ \\ p\text{-value}_{lower} = \frac{\text{# of }W^*\leq W_{obs}}{\binom {m+n}{m}} \]
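To sanity-check the arithmetic, here’s a hedged sketch using `scipy.stats.mannwhitneyu` (assuming SciPy ≥ 1.7, where `method='exact'` is available), which tests the same hypotheses via the equivalent statistic \(U = W - m(m+1)/2\) and enumerates the same \(\binom{m+n}{m}\) reassignments:

```python
from scipy.stats import mannwhitneyu

sample1 = [31, 32, 33, 47]
sample2 = [46, 48, 49, 51]

# alternative='less' is the lower tail test: sample 1 shifted below sample 2.
# method='exact' enumerates the permutation distribution, as done by hand above.
res = mannwhitneyu(sample1, sample2, alternative="less", method="exact")
print(res.statistic)  # U = W - m(m+1)/2 = 11 - 10 = 1
print(res.pvalue)     # 2/70 ≈ 0.0286
```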
Interpretation: Given a p-value of 0.028, there is a 2.8% chance of observing a difference as extreme as ours under the hypothesis that these samples come from populations with the same distribution.
Because our p-value is less than our significance level \(\alpha\) of \(0.05\), we reject the null hypothesis that \(F_1(x) = F_2(x)\), and conclude that the location of population 1 is below that of population 2.