Visualising grade (un)reliability

dennis2045
Dec 6, 2018

Updated: Apr 2, 2024

On 27th November 2018, Ofqual published some important measures of the reliability of GCSE and A level examinations in 14 subjects - “reliability” being defined by Ofqual as “the probability that the grade as awarded is the same as the grade that would be awarded had that script been marked by the senior examiner”. The grade corresponding to the mark given by the senior examiner is by definition ‘right’, and so any other grade is necessarily ‘wrong’.

In principle, all exams should be 100% reliable, with every candidate awarded the ‘right’ grade. As the Ofqual document shows, however, the reliability of grades varies from 96% for maths to 52% for a combined exam in English language and literature; other results include biology, 85%; economics, 74%; geography, 65%; English language, 61%; history, 56%. Ofqual gives no overall average across all subjects, but when the published data for each subject is weighted by the corresponding number of certifications for that subject, the result is an average grade reliability of about 75%. Or, in plain language, on average, across all subjects at both GCSE and A level, about 1 grade in every 4 is wrong.

To make that real, in the summer of 2017, the total number of GCSE and A level grades awarded across England, Wales and Northern Ireland was 7,278,425. Of these, over 1,800,000 were wrong. You might like to read that again. Over 1.8 million.
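As a quick check on that figure, the arithmetic can be reproduced in a few lines of Python. The total and the 75% average reliability are the figures quoted above; the variable names are purely illustrative.

```python
# Figures quoted in the text above.
total_grades = 7_278_425      # GCSE and A level grades awarded, summer 2017
average_reliability = 0.75    # approximate weighted average across subjects

# A grade counts as 'wrong' when it differs from the senior examiner's grade,
# so the expected number of wrong grades is the unreliable share of the total.
wrong_grades = total_grades * (1 - average_reliability)

print(f"Estimated wrong grades: {wrong_grades:,.0f}")
```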

Importantly, this unreliability is not caused by mistakes in marking. Rather, it happens because marking is not precise, in that – to use Ofqual’s own words – “it is possible for two examiners to give different but appropriate marks to the same answer”. So, one examiner might give a script 54 marks, whilst another might give 55. If grade 6 is defined as all marks between 51 and 56, then both marks correspond to grade 6, so the grade does not depend on which examiner marks the script, or on which might be the senior examiner. But if the 6/7 grade boundary is 55, then a mark of 54 corresponds to grade 6, whilst 55 is grade 7. A single mark makes all the difference. The grade actually awarded is therefore unreliable, for it depends on which examiner marks the script, and on which examiner is the senior examiner. Oh dear.
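The 54-versus-55 example can be sketched in code. This is only an illustration: the function `grade_for` and the two sets of boundaries are hypothetical names and values chosen to mirror the example above (in the second scenario the lower edge of grade 6 is assumed to stay at 51, which the text does not state).

```python
from bisect import bisect_right

def grade_for(mark, lowest_marks, grades):
    """Return the grade whose band contains `mark`.

    lowest_marks[i] is the lowest mark that earns grades[i];
    both lists are sorted in ascending order.
    """
    index = bisect_right(lowest_marks, mark) - 1
    if index < 0:
        return None  # mark falls below the lowest band listed
    return grades[index]

GRADES = [6, 7]
# Scenario A: grade 6 spans 51 to 56, so grade 7 starts at 57.
SCENARIO_A = [51, 57]
# Scenario B: the 6/7 boundary is 55 (lower edge of grade 6 assumed to be 51).
SCENARIO_B = [51, 55]

for name, lows in (("A", SCENARIO_A), ("B", SCENARIO_B)):
    first = grade_for(54, lows, GRADES)    # one examiner gives 54 marks
    second = grade_for(55, lows, GRADES)   # another examiner gives 55 marks
    verdict = "same grade" if first == second else "different grades"
    print(f"Scenario {name}: 54 -> grade {first}, 55 -> grade {second} ({verdict})")
```

In scenario A the one-mark difference is invisible in the grade; in scenario B it straddles the boundary, so the grade depends on which examiner happened to mark the script.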

I have written some spreadsheets that simulate all this, and here is one way of visualising grade (un)reliability, using data for 2017 GCSE English Language, which was taken by 632,419 candidates.

The areas of the ‘bubbles’ along the bottom represent the numbers of candidates actually awarded the corresponding grade. The columns show what happens when all the scripts given an original grade are re-marked by the senior examiner, and then graded accordingly. The green bubbles represent the number of candidates for whom the original grade is the same as the senior examiner’s; the blue bubbles, the number of candidates for whom the senior examiner’s grade is higher than the original grade; the yellow bubbles, those for whom the senior examiner’s grade is lower. It is not an exaggeration to describe those candidates represented by the blue bubbles as ‘disadvantaged’, for they would have received a higher grade had their scripts been marked by the senior examiner; by the same token, the yellow bubbles represent candidates who have been ‘lucky’.

For a highly reliable exam, the sizes of the blue and yellow bubbles are very small, and in essence not visible. For the English language examination used for this chart, the average reliability is 61%, and, as can be seen, the blue and yellow bubbles are of considerable sizes, and there are also some ‘satellite’ bubbles, two - and even three - grades adrift.

A rather different visualisation, also for 2017 GCSE English Language, is this chart, which shows how the probability that the originally-awarded grade is right varies according to the original mark. As can be seen, scripts marked very high or very low result in reliable grades; scripts given intermediate marks have much lower reliability, and if the script is marked very close to a grade boundary, you might as well toss a coin. Oh dear.
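To see why reliability dips near a boundary, here is a minimal sketch. It is not the spreadsheet model behind the charts, and every number in it is an assumption: the senior examiner's mark is treated as normally distributed around the original mark with a hypothetical standard deviation of 3 marks, and the grade band of 51 to 56 is borrowed from the earlier example.

```python
from statistics import NormalDist

def prob_grade_unchanged(mark, band_low, band_high, sd):
    """Probability that the senior examiner's mark lands in the same band.

    Assumes (hypothetically) that the senior examiner's mark is normally
    distributed around the original mark with standard deviation `sd`.
    The +/- 0.5 treats whole-number marks as covering a unit interval.
    """
    senior = NormalDist(mu=mark, sigma=sd)
    return senior.cdf(band_high + 0.5) - senior.cdf(band_low - 0.5)

# Hypothetical values, for illustration only.
BAND_LOW, BAND_HIGH = 51, 56   # marks that earn the grade in question
SD = 3.0                       # assumed spread between examiners, in marks

for mark in range(BAND_LOW, BAND_HIGH + 1):
    p = prob_grade_unchanged(mark, BAND_LOW, BAND_HIGH, SD)
    print(f"original mark {mark}: P(same grade) = {p:.2f}")
```

Under these assumed values, the marks at the edges of the band come out close to one half, while marks in the middle of the band fare better; a narrower band or a larger spread between examiners pushes every probability down.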

This is dire, and the grade reliability problem needs to be fixed. As I stated towards the start, in summer 2017, over 1.8 million GCSE and A level grades were wrong. And the numbers were similar in 2016, 2015..., and will be similar in 2018, 2019... Unless Ofqual changes its policy.

In fact, this problem is very easy to fix (which is a story for another time, but some ideas are to be found elsewhere on this website). But it will only be fixed if Ofqual can be influenced to take sensible action. If you think this is important, if you care, please contact me – and if you would like to use the spreadsheets that produced these results, please contact me too. Thank you.