How to make GCSE and A level grades reliable

dennis2045
Aug 21, 2017

Updated: Apr 2, 2024

Grade misallocation is real...

My blog How reliable are GCSE and A level grades? featured a chart, published by the exam regulator Ofqual, showing, for six subjects, the probability that a GCSE or A level candidate is awarded the right grade: so, for example, for physics, this probability is about 85%, for history about 60%. Of more importance is the interpretation of the data the other way around: currently, out of every 100 candidates in physics, 15 are given the wrong grade; similarly, for history, 40 candidates. 'Wrong' is wrong both ways, so for history, of those 40 candidates with wrong grades, 20 receive a grade higher than they merit, and so are 'lucky', whilst 20 receive a grade lower than they merit, and so are 'disadvantaged' - and are possibly denied important life chances as a consequence. This is most unfair.

Another way of representing this is shown in Figure 1, which represents what happens across a grade boundary:

Figure 1: Grade misallocation across the C/B grade boundary.

At every grade boundary, in every subject, at both GCSE and A level, the current system results in (at least) four populations: two comprising those who are awarded the grade they merit; one comprising those who are awarded a higher grade ('lucky' candidates); and one comprising those who are awarded a lower grade ('disadvantaged' candidates). Grade misallocation is real. And, in principle, other (much smaller) populations are possible too, such as 'doubly lucky' (two grades higher than merited) and 'doubly disadvantaged' (two grades lower), as might happen if grade widths are particularly narrow.

...and is not resolved by the appeals process

You might think that the appeals process resolves this. It doesn't. And here's why:

■ Firstly, 'lucky' candidates who have been awarded a higher grade than they might have expected say "Great!". They have no reason to appeal - why should they? The population of those candidates who were originally 'lucky' therefore remains unchanged.

■ Secondly, many of those who have been awarded a lower grade are quite likely to shrug their shoulders and say, "Oh dear, I hadn't done as well as I had hoped". They trust the 'system', and blame themselves. As a result, they don't appeal.

■ Thirdly, making an appeal costs money - money that is refunded if the appeal results in a re-grade, but not otherwise. Many people, and many schools, can't afford to take the risk. As a result, most appeals are made by wealthier people, and wealthier schools - another manifestation of unfairness.

The overall result of the appeals process - which requires the 'victim' to shout 'this hurts' (which, of itself, is pernicious, as discussed further in my blog Are regulators doing the wrong thing?) - is therefore to make the muddle even worse, with some originally 'disadvantaged' candidates having their error corrected, whilst a number of candidates who were correctly awarded grade C become 'lucky', and are up-graded to a B, as illustrated in Figure 2:

Figure 2: The current appeal system does not resolve the original grade misallocation.

Why grade misallocation happens

Most importantly, grade misallocation is NOT, repeat NOT, the result of mistakes or negligence, such as the failure of a marker to mark a particular item, non-compliance with the marking scheme, or some sort of operational foul-up. These things can - and do - happen, but the Exam Boards, and the regulator, take considerable trouble to prevent them, and to correct them when detected.

Rather, grade misallocation is an inevitable consequence of the structure of our exams. Instead of being a sequence of 'pub-quiz' questions, our exams invite candidates to express themselves through essays, or to demonstrate how they go about solving problems, with marks being awarded for method, as well as for the final result. Such questions do not have unambiguous right/wrong answers, so it is quite possible, and legitimate, for one marker to give an essay, say, 14 marks out of 20, and another 16/20. Indeed, this variability - technically known as 'tolerance' - is built into the quality control procedures used by all the Exam Boards. Whilst marking is taking place, and to ensure quality, the mark given by a senior examiner to a particular question will be compared to the mark given to the same question by a randomly selected marker: if the two marks are within the defined 'tolerance', that's fine; if the marker's mark is different from the senior examiner's mark by a number greater than the 'tolerance', then the marker's work is scrutinised, and appropriate action taken.

As a result of 'tolerance', a script of several questions might be given a total mark of, say, 64, or perhaps 66. Neither mark is 'right' or 'wrong'; neither is the result of 'hard' marking or 'soft'; neither indicates the presence of an 'error'. Both marks are equally valid.

In practice, each script is marked once, and given a single mark. As we have just seen, this mark might be 64/100, or it might be 66/100. If grade B is defined as all marks from 63 to 68 inclusive, the candidate is awarded grade B in both cases. But suppose that grade B is defined as all marks from 65 to 69, and grade C from 60 to 64. A script marked 64 results in grade C, whereas the same script, given the equally valid mark 66, would be graded B. The grade awarded depends not on the candidate’s ability, but on the lottery as to whether the script was marked 64 or 66. Furthermore, if a script marked 64 is appealed, it is possible that a re-mark might give the same mark 64, or perhaps 63 or even 62, so confirming grade C. But the re-mark might be 65, or 66, in which case the candidate is up-graded from C to B. And a script marked 66, grade B, might be re-marked 64 - and down-graded to C.

As this example demonstrates, the mark given to any script is not a precise number, 64; rather, the mark is better represented as a range, for example, from 62 to 66. It is this range that is the root cause of grade unreliability, of grade misallocation: if the range straddles one or more grade boundaries, then the grade that the candidate receives is determined by good luck if the grade is higher, or bad luck if lower.

This is unfair. Grades should not be determined by luck.
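The example of the script marked 64 can be sketched in a few lines of Python. This is only an illustration: the grade bands are the ones used in the example above (B from 65 to 69, C from 60 to 64), whilst the function names and the idea of listing every grade reachable within the range are my own assumptions, not part of any Exam Board procedure.

```python
# Grade bands from the example in the text (inclusive mark ranges).
GRADE_BANDS = {"B": (65, 69), "C": (60, 64)}


def grade_for(mark):
    """Return the grade whose band contains the mark, or None if no band does."""
    for grade, (low, high) in GRADE_BANDS.items():
        if low <= mark <= high:
            return grade
    return None


def possible_grades(mark, spread):
    """List the grades reachable by any mark in [mark - spread, mark + spread]."""
    grades = set()
    for candidate_mark in range(mark - spread, mark + spread + 1):
        grade = grade_for(candidate_mark)
        if grade is not None:
            grades.add(grade)
    return sorted(grades)


print(grade_for(64), grade_for(66))   # C B
print(possible_grades(64, 2))         # ['B', 'C']: the range straddles the boundary
print(possible_grades(62, 2))         # ['C']: the range stays inside one grade
```

Whenever the list contains more than one grade, the grade the candidate actually receives depends on which of the equally valid marks happened to be given.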

How to make grades reliable

There are at least two ways in which grades could be made more reliable.

The first is to change the structure of all our exams, away from open-ended essay-style questions, to multiple choice right/wrong answers. Each question therefore has a single, unambiguous answer, which the candidate either identifies, or not. This eliminates the variability in marking, implying the single mark given to the script is independent of the marker, will be the same no matter how many times the script might be re-marked, and corresponds to a specific, totally reliable grade. But as well as eliminating all variability in marking, this eliminates something else too - all the values we have about the importance of general learning and broad understanding, and the encouragement of self-expression. Not a good trade-off.

The second is to recognise that variability in marking is real, and to change the way in which the 'raw' mark given to a script is used to determine the candidate's grade.

To show how this might work, consider the example of a script marked 64, which, under the current system, would result in grade C (for grade boundaries defined such that all marks from 65 to 69 are grade B; marks from 60 to 64, grade C).

We know, however, that the mark '64' is more realistically represented as the range from 62 to 66, or 64 ± 2, where the '2' represents the variability in marking.

Suppose, then, that the grade is determined not by the 'raw' mark 64, but by the 'adjusted' mark 64 + 2 = 66, where the additional 2 marks take the variability of marking into account. The candidate is now awarded grade B, not grade C. Furthermore, if the script is re-marked - for example, on appeal - it is almost certain that any re-mark will not exceed 66, and so the originally-awarded grade will be confirmed, no matter how many times the script might be re-marked.

Robustness under appeal is critical in establishing trust and confidence in the examination system. According to Ofqual’s annual statistics (see, for example, Ofqual's Statistical Release for the summer 2016 exams), over the last several years the number of appeals has been steadily increasing, whilst the number of exam submissions has been about the same, with some 18% of appeals (that’s about 1 in 6) resulting in an up-grade. Such a high probability of “winning” is both a strong incentive for those who can afford to appeal (which is one of the reasons that the number of appeals has been increasing), as well as an indictment of the current system: if so many appeals result in an up-grade, why were so many scripts marked incorrectly in the first place?

A new 'rule' for determining grades

So here is a general rule for determining reliable grades:

■ For a script given a 'raw' mark m (in the example, 64), and if the variability of marking for that examination is represented by f (in the example, 2), the grade is determined by the 'adjusted' mark m + f (in the example, 64 + 2 = 66).

■ Suppose that an appeal is made, and that the re-mark is m*. If m* lies within the range from m – f to m + f inclusive, then the original grade is confirmed. But if the re-mark m* is greater than m + f (or less than m – f), then the script is re-graded on the basis of m* + f, which may result in a grade change, depending on whether or not the new 'adjusted' mark m* + f is on the other side of a grade boundary as compared to the original 'adjusted' mark m + f.
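The two parts of the rule can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the B and C boundaries are those of the example above, the A and D boundaries are assumptions added purely so that every mark maps onto some grade, and the function names are hypothetical. The confirmation test is read as m – f ≤ m* ≤ m + f, as in the later discussion of marking errors.

```python
# Lowest mark for each grade, highest grade first. The B and C boundaries come
# from the example in the text; the A and D entries are assumptions added so
# that every mark maps onto some grade.
BOUNDARIES = [(70, "A"), (65, "B"), (60, "C"), (0, "D")]


def grade_for(mark):
    """Map a mark onto a grade using the lower boundaries above."""
    for lowest, grade in BOUNDARIES:
        if mark >= lowest:
            return grade
    raise ValueError(f"mark {mark} is below every boundary")


def award(raw_mark, f):
    """Grade awarded under the m + f rule."""
    return grade_for(raw_mark + f)


def appeal(raw_mark, remark, f):
    """Grade after an appeal in which the script is re-marked as `remark`.

    A re-mark inside [m - f, m + f] confirms the original grade. A re-mark
    outside that range is treated as evidence of a marking error, and the
    script is re-graded on the basis of m* + f.
    """
    if raw_mark - f <= remark <= raw_mark + f:
        return award(raw_mark, f)
    return award(remark, f)


print(award(64, 2))        # B  (64 + 2 = 66)
print(appeal(64, 62, 2))   # B  (re-mark within 62..66, grade confirmed)
print(appeal(64, 61, 2))   # C  (re-mark outside the range, 61 + 2 = 63)
print(award(64, 0))        # C  (f = 0 gives the grade based on the raw mark)
```

Setting f to zero reproduces grading on the 'raw' mark alone, so the same functions describe both the current system and the proposed one.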

Some features of the m + f rule

■ There is an assumption that the 'adjustment' f , which is a measure of the variability in marking, is a property of an examination rather than of an individual script. If this is the case, then f can (I think!) be determined by a simple statistical analysis of a sample of scripts (this needs to be explored empirically by studying examination data). Once f has been determined for any particular examination, then the same value of f can be used for all candidates, so that all candidates have their 'raw' marks adjusted by the same amount. This ensures that all candidates are treated fairly.

■ It is very likely that the value of f is meaningful in the context of different examination subjects: I would expect f to be a smaller number for subjects such as maths and French, and a larger number for subjects such as history and art.

■ By assigning a candidate's grade based on the 'adjusted' mark m + f , the candidate is being given the "benefit of the doubt", for the grade is based on a statistically sensible estimate of the highest mark likely to be given to that script. That is why any re-mark m* is almost certain to be less than, or equal to, the 'adjusted' mark m + f, so ensuring that the grade, as originally awarded, is confirmed on appeal, and not changed. This is very important in building confidence in the entire examination system.

■ This rule does not eliminate the variability in marking: this is still present. Rather, the rule manages the way in which this variability impacts the populations of 'lucky' and 'disadvantaged' candidates. As illustrated in Figure 3, the effect of grading based on the 'adjusted' mark m + f is to reduce the population of 'disadvantaged' candidates to close to zero, whilst simultaneously increasing the population of 'lucky' candidates:

Figure 3: Grading according to m + f ensures that almost no candidates are 'disadvantaged'.

■ You might now be thinking, "Ah! More 'lucky' candidates! Grade inflation!!!" Well, no. Adopting the rule that grades are based on the 'adjusted' mark m + f does indeed increase the number of 'lucky' candidates as compared to basing grades on the 'raw' mark m, but this does not drive 'grade inflation'. 'Grade inflation' is, by definition, a phenomenon that occurs repeatedly year after year, and is totally dependent on how the regulator sets the grade boundaries. The m + f rule makes no statement about grade boundaries - it is solely a rule as to how to map marks onto grades. So this rule does not drive 'grade inflation'. If the grade boundaries are unchanged, and a comparison is made between grading according to m + f, and grading according to m, then the top grade will have more candidates, and the lowest grade fewer, with the intermediate grades having about the same populations - but this happens only once, when the change is made from grading based on m to grading based on m + f: thereafter, the system is stable. This is very similar to the re-calibration that takes place from time to time when, for example, a change is made to the 'basket of goods' that comprises the retail price index. And the smart idea is to introduce grading according to m + f at the same time that the grading structure changes from A*, A, B... to 9, 8, 7... - this disrupts all the grade boundaries anyway.

■ An assumption throughout this discussion has been that the original mark m is valid, and not a 'marking error' as might happen if the marker failed to comply with the marking scheme, or as a result of an operational problem. In practice, marking errors can occur, and it is important that they are identified and resolved, ideally by internal quality control before the exam results are published, or as the result of an easily-accessible appeals process afterwards. That is why the suggested appeals process asks the question "Does the re-mark m* lie within the range m ± f ?". If it does, this confirms the original grade, for grading on the basis of m + f gives the candidate the "benefit of the doubt" at the outset. But if the re-mark m* is less than m – f , or greater than m + f , this suggests that the original mark m was a marking error, so allowing this error to be corrected.
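The shift in the 'lucky' and 'disadvantaged' populations described above can be explored with a small simulation. Everything in this sketch is an assumption made for illustration: grades are taken to be 5 marks wide, each candidate is given a 'merited' mark drawn at random, and the marker's mark differs from it by a whole number of marks chosen uniformly between –f and +f. Real marking variability will not be this tidy, and because the variability here never exceeds f, the number of 'disadvantaged' candidates under m + f comes out as exactly zero rather than "close to zero".

```python
import random


def grade_index(mark, width=5):
    """Hypothetical grading: every `width` marks is one grade; higher is better."""
    return mark // width


def simulate(f, candidates=10_000, seed=1):
    """Count 'lucky' and 'disadvantaged' candidates when grading on m and on m + f."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    counts = {
        "raw": {"lucky": 0, "disadvantaged": 0},
        "adjusted": {"lucky": 0, "disadvantaged": 0},
    }
    for _ in range(candidates):
        merited = rng.randint(40, 89)        # assumed 'merited' mark
        raw = merited + rng.randint(-f, f)   # assumed variability in marking
        for rule, mark in (("raw", raw), ("adjusted", raw + f)):
            awarded = grade_index(mark)
            if awarded > grade_index(merited):
                counts[rule]["lucky"] += 1
            elif awarded < grade_index(merited):
                counts[rule]["disadvantaged"] += 1
    return counts


result = simulate(f=2)
print("grading on m:    ", result["raw"])
print("grading on m + f:", result["adjusted"])
```

Under these assumptions, grading on m produces both 'lucky' and 'disadvantaged' candidates, whereas grading on m + f produces no 'disadvantaged' candidates and a larger number of 'lucky' ones.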

A final thought...

Basing grades on m + f might appear to be a new, even alarming, idea.

But in fact, this is the formula that is currently being used - under the (undeclared) assumption that the value of the parameter f is zero. The assumption that f = 0 implies that there is no variability in marking, which has been known to be false for years: Figure 4, for example, shows the variability in marking for 30 GCSE history scripts, when each script is marked by 40 different markers, as published in an Ofqual research paper from 2010:

Figure 4: The marks given by 40 different markers to 30 different GCSE history scripts.

Figure 3.5 from Component reliability in GCSE and GCE, Sandra Johnson and Rod Johnson, Ofqual, November 2010,

http://webarchive.nationalarchives.gov.uk/20140402200706/http://ofqual.gov.uk/documents/component-reliability-gcse-gce/all-versions/

There might, and probably will, be a vigorous debate about what f means, how it might be used, and how difficult it is to measure. But one thing on which everyone will agree is that the one value f definitely does not have is zero. Yet this is the value that has been assumed, and used, for a very long time indeed.

Some further documents

Here are some further documents which you are welcome to download:

■ The Great Grading Scandal - A description of the misallocation problem.

■ How to make grading fair - A description of the solution.

■ How to determine f - An exploration of some different possibilities for measuring f.

■ Identifying marking 'errors' - A suggestion as to how marking errors might be identified.

■ The statistics of examination marking and grading - For those who would like to study the maths.

■ The 'cliff-edge' problem - What happens when a script is marked f + 1 marks below a grade boundary.

And...

...thank you for reading this far! I think this is important: why have grades if they depend not on the candidate's ability but on the lottery of marking, if there is a high likelihood that they can be changed on appeal, if they are unreliable? Indeed, why have grades at all? Why not declare each candidate's 'raw' mark m, as associated with the corresponding value of f for that examination? But if grades remain, they must be reliable - and I think that grading according to m + f might be one way of doing this.

But there might be others. So let's get the debate going, and please do contact me if you wish.