Being on the receiving end of a wrong call from a referee can be quite disappointing, especially during those nail-biting moments when the score is 14-14. There’s often a sinking feeling that the bout could have ended in victory if not for the referee’s blunder. Luckily, USA Fencing has adopted a referee rating system for national tournaments designed to lessen the odds of such unfortunate incidents. The system assigns referees to bouts based on their rating, which is updated at the end of each year. Though some subjectivity and politics may come into play, a higher-rated referee is generally expected to make better calls than a lower-rated one.
The hypothesis
My hypothesis was that the rating of the referee officiating a bout could influence the likelihood of an upset. Lower-rated referees, lacking confidence and experience, might favor the higher-ranked fencer on ambiguous calls, or their erroneous calls might tip the scale towards the lower-seeded fencer.
Explanation of referee rating system
If you already understand the referee rating system, you can skip this section.
The referee rating system comprises seven levels, each indicating a referee’s competence and the events they can officiate. A few years ago, the system was a 1-10 scale, with 1 being the best referees and 10 the newest and least experienced. My dataset used the older 1-10 scale, so here’s a mapping for clarity:
- L2 (formerly 9-10) – Beginning referee working at local level events
- L1 (formerly 8) – Referee with demonstrated competency to work higher level local events
- R2 (formerly 7) – Referee with demonstrated competency to work at regional level events
- R1 (formerly 6) – Referee with demonstrated competency to work higher levels at regional events
- N2 (formerly 4-5) – Referee with demonstrated competency to work national level events
- N1 (formerly 1-3) – Referee with demonstrated competency to work highest level national events
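For anyone working with data stored on the old numeric scale, the conversion above is just a small lookup table. Here’s a minimal sketch in Python (the dictionary and function names are purely illustrative):

```python
# Mapping from the older 1-10 numeric ratings to the current letter codes.
# FIE B internationals sit outside this numeric scale and are handled
# separately in my data.
OLD_TO_NEW = {
    1: "N1", 2: "N1", 3: "N1",
    4: "N2", 5: "N2",
    6: "R1",
    7: "R2",
    8: "L1",
    9: "L2", 10: "L2",
}

def to_new_rating(old_rating: int) -> str:
    """Convert an old-scale numeric referee rating to its current letter code."""
    return OLD_TO_NEW[old_rating]
```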
In my graphs, I referred to high-level international referees rated FIE B as just B.
I excluded referees rated 7-10 in my analysis because they typically aren’t hired for national events due to their lower ratings.
First results
In my dataset (excluding Veteran and Div 1 bouts), I examined the referee’s rating and whether the lower-seeded fencer won. I used this to calculate the probability of an upset based on the rating of the referee who officiated the bout. Here are the results:
Although there are some statistically significant differences between referee ratings (that is, differences unlikely to be due to chance alone), it’s hard to draw anything conclusive because there is no apparent trend as you go up or down the ratings. I computed a few of the trendlines, and the slopes were less than 0.001, suggesting no substantial relationship between referee rating and the chance of an upset.
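For the curious, the probabilities and trendlines boil down to a group-by followed by a straight-line fit. Here’s a rough sketch, assuming the bouts sit in a pandas DataFrame with hypothetical `referee_rating` and `upset` columns:

```python
import pandas as pd
from scipy.stats import linregress

# bouts: one row per DE bout, excluding Veteran and Div 1 events.
# Hypothetical columns: 'referee_rating' (old numeric scale) and
# 'upset' (True if the lower-seeded fencer won).
bouts = pd.read_csv("bouts.csv")

# Probability of an upset for each referee rating.
by_rating = (
    bouts.groupby("referee_rating")["upset"]
         .agg(upset_rate="mean", n_bouts="size")
)
print(by_rating)

# Trendline: a straight-line fit of upset rate against rating.
# A slope near zero means rating tells us little about upset probability.
fit = linregress(by_rating.index.astype(float), by_rating["upset_rate"])
print(f"slope = {fit.slope:.4f}, p-value = {fit.pvalue:.3f}")
```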
Removing confounders
To account for possible confounding variables, such as higher-rated referees being assigned more challenging bouts, or bouts in later rounds having a greater propensity for upsets, I restricted the analysis to the second round of DEs in Cadet and Junior competitions. In the early rounds of these DEs, referee assignments tend to be fairly random, with both experienced and less experienced referees officiating difficult bouts; I skipped the first round because of BYEs, leaving the second round as the earliest usable one. However, despite this refined approach, the final analysis, like its predecessor, was inconclusive.
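Continuing the earlier sketch, that restriction amounts to a simple row filter (the event and round column names here are again hypothetical):

```python
# Keep only second-round DE bouts from Cadet and Junior events, where
# referee assignments are still close to random. The first round is
# skipped because of BYEs.
second_round = bouts[
    bouts["event_category"].isin(["Cadet", "Junior"])
    & (bouts["de_round"] == 2)
]
```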
Unexpectedly, referees rated 3 (quite experienced) and 6 (least experienced on the national stage) recorded similar outcomes in Foil, and referees rated 5 and 1 showed a similar pattern in Epee. This is puzzling and runs counter to the hypothesis.
I further examined these trends by creating trendlines, but once again, no significant relationship emerged. I expected referees with different ratings to be least consistent in Saber, where their decisions have the most influence, but paradoxically, the least consistency was observed in Foil.
Looking at individual referees
To pinpoint whether specific referees were biased towards higher- or lower-seeded fencers, I analyzed individual referees. For each referee who had officiated at least 30 bouts, I calculated the percentage of their bouts that ended in an upset, and placed every referee on a histogram:
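The tally behind that histogram is a short group-by with a cutoff at 30 bouts; a sketch, continuing with the same bout table and a hypothetical `referee_id` column:

```python
import matplotlib.pyplot as plt

# Upset percentage for every referee with at least 30 officiated bouts.
per_ref = bouts.groupby("referee_id")["upset"].agg(["mean", "size"])
per_ref = per_ref[per_ref["size"] >= 30]

plt.hist(per_ref["mean"] * 100, bins=20)
plt.xlabel("Percent of bouts won by the lower-seeded fencer")
plt.ylabel("Number of referees")
plt.show()
```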
Upon inspection, one might assume that referees on the far left or right of the histogram were the ones displaying bias. However, this distribution did not indicate any referees showing significant bias.
To illustrate this, I wrote code that simulated 100 people each flipping 100 coins, and plotted the distribution:
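A minimal version of that simulation looks something like this (the plotting details are incidental):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# 100 people each flip 100 fair coins; record how many heads each person got.
flips = rng.integers(0, 2, size=(100, 100))
heads_per_person = flips.sum(axis=1)

plt.hist(heads_per_person, bins=range(35, 66))
plt.xlabel("Heads out of 100 flips")
plt.ylabel("Number of people")
plt.show()
```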
Just as the outliers in this coin-flipping simulation are not cheating or using unbalanced coins, referees at the extremes of the histogram are most likely there due to chance rather than bias. These referees were probably just assigned more bouts in which the lower-seeded fencer outperformed their opponent, rather than favoring one fencer over the other.
I also ran a chi-square test for each weapon and found no statistically significant variation among individual referees.
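Roughly, such a test can be set up as a referee-by-outcome contingency table for each weapon; a sketch for Foil, with the same hypothetical column names as before:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Referee-by-outcome contingency table: does the number of upsets per
# referee vary more than chance alone would allow?
foil = bouts[bouts["weapon"] == "Foil"]
table = pd.crosstab(foil["referee_id"], foil["upset"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"Foil: chi2 = {chi2:.1f}, p = {p_value:.3f}")
```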
Conclusion
The compiled data suggests that the likelihood of an upset broadly remains the same, irrespective of the referee’s rating. This, however, does not imply that referees are immune to making mistakes. It’s plausible that lower-rated referees make more errors than their higher-rated counterparts, but these errors occur with equal frequency against both lower and higher-seeded fencers, so overall, upsets happen as often as they should. Referees are unlikely to exhibit specific bias for or against a fencer based on their seeding.