Contact us

The Silver Bullet Machine Manufacturing Company Limited
Barnsdale Grange, The Avenue, Exton, Rutland, LE15 8AH, UK

+44 (0) 1572 813 690
+44 (0) 7715 047 947
dennis@silverbulletmachine.com

How to make GCSE and A level grades reliable

Grade misallocation

The news item The Great Grading Scandal featured a chart, published by the exam regulator Ofqual, showing, for each of twelve GCSE subjects, the probability that a candidate is awarded the right grade: so, for example, for GCSE physics, this probability is about 85%; for GCSE history, about 60%. Of more importance is the interpretation of the data the other way around: currently, the probability that a candidate is awarded the wrong grade is 15% for GCSE physics, and 40% for GCSE history - or, out of every 100 candidates in physics, 15 are given the wrong grade; similarly, for history, 40 candidates in every 100 are given the wrong grade. 'Wrong' is wrong both ways, so for history, of those 40 candidates with wrong grades, about 20 receive a grade higher than they truly merit, and so are 'lucky', whilst about 20 receive a grade lower than they truly merit, and so are 'disadvantaged'. This is most unfair.

Another way of representing this is shown in Figure 1, which represents what happens across a grade boundary:

Figure 1: Grade misallocation across the C/B grade boundary.

At every grade boundary, in every subject, at both GCSE and A level, the current system results in (at least) four populations: two comprising those who are awarded the grade they merit; one comprising those who are awarded a grade higher than they merit ('lucky' candidates); and one comprising those who are awarded a grade lower than they merit ('disadvantaged' candidates), and so may be denied important life chances. Grade misallocation is real. And, in principle, other (sparser) populations are possible too, such as 'doubly lucky' (two grades higher than merited) and 'doubly disadvantaged' (two grades lower), as might happen if grade widths are narrow.

You might think that the appeals system resolves this. It doesn't. And here's why:

■  Firstly, 'lucky' candidates who have been awarded a higher grade than they might have expected say "Great!". They have no reason to appeal - why should they? The population of originally 'lucky' candidates therefore remains unchanged.

■  Secondly, those who have been awarded a lower grade are quite likely to shrug their shoulders and say, "Oh dear, I didn't do as well as I had hoped". They trust the system, and blame themselves. As a result, they don't appeal.

■  Thirdly, making an appeal costs money - money that is refunded if the appeal results in a re-grade, but not otherwise. Many people, and many schools, can't afford to take the risk. As a result, most appeals are made by wealthier people, and wealthier schools - another manifestation of unfairness.

The overall result of the appeals process - which requires the 'victim' to shout 'this hurts' (itself pernicious, as discussed further in my blog Are regulators doing the wrong thing?) - is therefore to make the muddle even worse: some originally 'disadvantaged' candidates have their error corrected, whilst a number of candidates who were correctly awarded grade C become 'lucky', and are up-graded to a B, as illustrated in Figure 2:

Figure 2: The current appeal system does not resolve the original grade misallocation.

 

Why grade misallocation happens

Most importantly, grade misallocation is NOT, repeat NOT, the result of mistakes or negligence, such as the failure of a marker to mark a particular item, non-compliance with the marking scheme, or some sort of operational foul-up. These things can - and do - happen, but the Exam Boards, and the regulator, take considerable trouble to prevent them from happening, and to correct them when detected.

Rather, grade misallocation is an inevitable consequence of the structure of our exams. Instead of being a sequence of 'pub-quiz' questions, our exams invite candidates to express themselves through essays, or to demonstrate how they go about solving problems, with marks being awarded for method, as well as the final result. Such questions do not have unambiguous right/wrong answers, so it is quite possible, and legitimate, for one marker to give an essay, say, 14 out of 20 marks, and another 16/20. Indeed, this variability - technically known as 'tolerance' - is built into the quality control procedures used by all the Exam Boards. Whilst marking is taking place, and to ensure quality, a senior marker will mark a randomly selected question, and then compare that mark with the mark given to the same question by a more junior marker: if the two marks are within the defined 'tolerance', that's fine; if the junior marker's mark is different from the senior marker's mark by a number greater than the 'tolerance', then the junior marker's work is scrutinised closely, and appropriate action taken.

As a result of 'tolerance', a script of several questions might be given a total mark of, say, 64, or perhaps 66. Neither mark is 'right' or 'wrong'; neither is the result of 'hard' marking or 'soft'; neither indicates the presence of an 'error'. Both marks are equally valid.

 

In practice, each script is marked just once, and given a single mark. As we have just seen, this mark might be 64/100, or it might be 66/100. If grade B is defined as all marks from 63 to 68 inclusive, the candidate is awarded grade B in both cases. But suppose that grade B is defined as all marks from 65 to 69, and grade C from 60 to 64. A script marked 64 results in grade C, whereas the same script, given the equally valid mark 66, would be graded B. The grade awarded depends not on the candidate’s ability, but on the lottery as to whether the script was marked 64 or 66. Furthermore, if a script marked 64 is appealed, it is possible that a re-mark might give the same mark 64, or perhaps 63 or even 62, so confirming grade C. But the re-mark might be 65, or 66, in which case the candidate is up-graded from C to B. And a script marked 66, grade B, might be re-marked 64 - and down-graded to C.
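The grading 'lottery' in this worked example can be sketched in a few lines of code. The boundaries (grade B: 65 to 69, grade C: 60 to 64) are those of the example above; the function name is illustrative, not part of any real Exam Board system.

```python
# A minimal sketch of the grading 'lottery' described above, using the
# example boundaries (grade B: 65-69, grade C: 60-64). Illustrative only.

def grade(mark: int) -> str:
    """Map a raw mark to a grade using the example boundaries."""
    if 65 <= mark <= 69:
        return "B"
    if 60 <= mark <= 64:
        return "C"
    return "other"

# Two equally valid marks for the same script straddle the boundary:
print(grade(64))  # C
print(grade(66))  # B
```

The same script, given two equally valid marks, receives two different grades - which is precisely the problem.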

 

As this example demonstrates, the mark given to any script is not a precise number, such as 64; rather, the mark is better represented as a range - for example, from 62 to 66. It is this range that is the root cause of grade unreliability, of grade misallocation: if the range straddles one or more grade boundaries, then the grade that the candidate receives is determined by luck - good luck if the grade awarded is the higher one, bad luck if the lower.

 

This is unfair. Grades should not be determined by luck.

How to make grades reliable

There are at least two ways in which grades could be made more reliable.

The first is to change the structure of all our exams, away from open-ended essay-style questions, to multiple-choice right/wrong answers. Each question then has a single, unambiguous answer, which the candidate either identifies, or not. This eliminates the variability in marking, implying that the single mark given to the script is independent of the marker, will be the same no matter how many times the script might be re-marked, and corresponds to a specific, totally reliable grade. But as well as eliminating all variability in marking, this eliminates something else too - all the values we hold about the importance of general learning and broad understanding, and the encouragement of self-expression. Not a good trade-off.

The second is to recognise that variability in marking is real, and to change the way in which the 'raw' mark given to a script is used to determine the candidate's grade.

To show how this might work, consider the example of a script marked 64, which, under the current system, would result in grade C (for grade boundaries defined such that all marks from 65 to 69 are grade B, marks from 60 to 64, grade C). 

We know, however, that the mark '64' is more realistically represented as the range from 62 to 66, or 64 ± 2, where the '2' represents the variability in marking.

Suppose, then, that the grade is determined not by the 'raw' mark 64, but by the 'adjusted' mark 64 + 2 = 66, where the additional 2 marks take the variability of marking into account. The candidate is now awarded grade B, not grade C. Furthermore, if the script is re-marked - for example, as the result of an appeal - it is almost certain that any re-mark will not exceed 66, and so the grade is robust under appeal.
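As a sketch, the 'adjusted' mark rule changes only the final step: the grade is looked up at m + f rather than at m. The boundaries and the value f = 2 are taken from the worked example; the function names are illustrative.

```python
# Sketch of grading by the 'adjusted' mark m + f, using the example
# boundaries (B: 65-69, C: 60-64) and f = 2. Illustrative only.

def grade(mark: int) -> str:
    """Map a mark to a grade using the example boundaries."""
    if 65 <= mark <= 69:
        return "B"
    if 60 <= mark <= 64:
        return "C"
    return "other"

def adjusted_grade(m: int, f: int) -> str:
    """Award the grade implied by the 'adjusted' mark m + f."""
    return grade(m + f)

print(grade(64))              # C under the current rule
print(adjusted_grade(64, 2))  # B: the benefit of the doubt
```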

A new 'rule' for determining grades

So here is a general rule for determining reliable grades:

 

■  For a script given a 'raw' mark m (in the example, 64), and if the variability of marking for that examination is represented by f (in the example, 2), the grade is determined by the 'adjusted' mark m + f (in the example, 64 + 2 = 66).

■  If an appeal is made, and the re-mark is m*, then if m* lies within the range m – f to m + f, the original grade is confirmed. But if m* is greater than m + f, or less than m – f, the script is re-graded on the basis of m* + f, which may result in a grade change, depending on whether or not the new 'adjusted' mark m* + f is on the other side of a grade boundary from the original 'adjusted' mark m + f.
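The rule above can be sketched as a single function: a re-mark within m ± f confirms the original 'adjusted' grade, while a re-mark outside that range triggers re-grading on m* + f. The boundaries and f = 2 are from the worked example; the names are illustrative.

```python
# Sketch of the suggested appeal rule. A re-mark m_star within m +/- f
# confirms the original grade; outside that range, the script is
# re-graded on the basis of m_star + f. Boundaries (B: 65-69, C: 60-64)
# and f = 2 are from the worked example; illustrative only.

def grade(mark: int) -> str:
    if 65 <= mark <= 69:
        return "B"
    if 60 <= mark <= 64:
        return "C"
    return "other"

def grade_after_appeal(m: int, m_star: int, f: int) -> str:
    """Return the grade awarded after an appeal with re-mark m_star."""
    if m - f <= m_star <= m + f:
        return grade(m + f)      # original 'adjusted' grade confirmed
    return grade(m_star + f)     # suggests a marking error: re-grade

print(grade_after_appeal(64, 65, 2))  # within 64 +/- 2: grade B confirmed
print(grade_after_appeal(64, 58, 2))  # outside: re-graded on 58 + 2 = 60, grade C
```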

Some features of the m + f  rule

■  There is an assumption that the 'adjustment' f, which is a measure of the variability in marking - as fundamentally determined by the policy on 'tolerance' adopted by the Exam Board - is a property of an examination rather than of an individual script. If this is the case, then f can (I think!) be determined by a simple statistical analysis of a sample of scripts (this needs to be explored empirically by studying examination data). Once f has been determined for any particular examination, the same value of f can be used for all candidates, so that all candidates have their 'raw' marks adjusted by the same amount. This ensures that all candidates are treated fairly.
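One way f might be estimated - purely a hedged sketch, assuming access to a sample of double-marked scripts - is from the spread of the differences between the two marks given to the same script. The sample data and the choice of estimator (the standard deviation of the differences) are both illustrative; other definitions of f are possible, and the question deserves proper empirical study.

```python
# A hedged sketch of estimating f from double-marked scripts: pairs of
# marks given to the same script by two different markers. The data are
# invented, and taking f as the (rounded) standard deviation of the
# mark differences is only one of several possible definitions.

import statistics

double_marks = [(64, 66), (70, 69), (55, 58), (62, 61), (48, 51)]

diffs = [a - b for a, b in double_marks]
f = round(statistics.stdev(diffs))
print(f)  # 2 for this invented sample
```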

■  It is very likely that the value of f will vary from subject to subject: I would expect f to be smaller for subjects such as maths and French, and larger for history and art.

■  By assigning a candidate's grade based on the 'adjusted' mark m + f, the candidate is being given the "benefit of the doubt", for the grade is based on a statistically sensible estimate of the greatest mark that might be given to that script. That is why any re-mark m* is almost certain to be less than, or equal to, the 'adjusted' mark m + f, so ensuring that the grade, as first published, is robust under appeal. This is very important in building confidence in the 'system', and in alleviating the current anxiety that a re-mark has a relatively high probability of an up-grade (for the last several years, about 18% of appeals - that's rather more than 1 in every 6 - have resulted in an up-grade).

■  This rule does not eliminate the variability in marking: this is still present. Rather, the rule manages the way in which this variability impacts the populations of 'lucky' and 'disadvantaged' candidates. As illustrated in Figure 3, the effect of grading based on the 'adjusted' mark m + f  is to reduce the population of 'disadvantaged' candidates to close to zero, whilst simultaneously increasing the population of 'lucky' candidates:

Figure 3: Grading according to m + f  ensures that almost no candidates are 'disadvantaged'.

■  You might now be thinking, "Ah! More 'lucky' candidates! Grade inflation!!!" Well, no. Adopting the rule that grades are based on the 'adjusted' mark m + f does indeed increase the number of 'lucky' candidates as compared to basing grades on the 'raw' mark m, but this does not drive 'grade inflation'. 'Grade inflation' is, by definition, a phenomenon that occurs repeatedly year after year, and is totally dependent on how the regulator sets the grade boundaries. The m + f rule makes no statement about grade boundaries - it is solely a rule as to how to map raw marks onto grades. So this rule does not drive 'grade inflation'. If the grade boundaries are unchanged, and a comparison is made between grading according to m + f and grading according to m, then the top grade will have more candidates, and the lowest grade fewer, with the intermediate grades having about the same populations - but this is a once-only effect, which occurs when the change is made from grading based on m to grading based on m + f: thereafter, the system is stable. This is very similar to the re-calibration that takes place from time to time when, for example, the 'basket of goods' that comprises the retail price index changes. And the smart idea is to introduce grading according to m + f as the grading structure changes from A*, A, B... to 9, 8, 7... - this change disrupts all the grade boundaries anyway.
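The once-only effect can be illustrated with a small simulation. Using an invented five-grade ladder and one candidate per mark (both purely for the sketch), grading by m + f with fixed boundaries moves only those candidates within f marks below a boundary: the top grade gains, the bottom grade loses, and the intermediate grades are unchanged.

```python
# Illustration of the once-only shift when moving from grading on m to
# grading on m + f with boundaries held fixed. The grade ladder and the
# uniform spread of marks are invented for this sketch.

from collections import Counter

def grade(mark: int) -> str:
    for floor, g in [(80, "A"), (70, "B"), (60, "C"), (50, "D"), (0, "E")]:
        if mark >= floor:
            return g

marks = range(45, 95)  # one candidate per mark, for clarity
f = 2

before = Counter(grade(m) for m in marks)      # grading on m
after = Counter(grade(m + f) for m in marks)   # grading on m + f

print(before["A"], after["A"])           # 15 17 : top grade gains
print(before["E"], after["E"])           # 5 3   : bottom grade loses
print(before["C"] == after["C"])         # True  : middle grades unchanged
```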

■  An assumption throughout this discussion has been that the original mark m is valid, and not a 'marking error' as might happen if the marker failed to comply with the marking scheme, or as a result of an operational problem. In practice, marking errors can occur, and it is important that they are identified and resolved - ideally by internal quality control before the exam results are published, or by an easily-accessible appeals process afterwards. That is why the suggested appeals process asks the question "Does the re-mark m* lie within the range m ± f?". If it does, this confirms the original grade, for grading on the basis of m + f gives the candidate the "benefit of the doubt" at the outset. But if the re-mark m* is less than m – f, or greater than m + f, this suggests that the original mark m was a marking error.

A final thought...

Basing grades on m + f  might appear to be a new, even alarming, idea.

But in fact, this is the formula that is currently being used - under the (undeclared) assumption that the value of the parameter  f  is zero. The assumption that  f = 0  implies that there is no variability in marking, which has been known to be false for years: Figure 4, for example, shows the variability in marking for 30 GCSE history scripts, when each script is marked by 40 different markers:

Figure 4: The marks given by 40 different markers to 30 different GCSE history scripts.

 

Figure 3.5 from Component reliability in GCSE and GCE, Sandra Johnson and Rod Johnson, Ofqual, November 2010,

http://webarchive.nationalarchives.gov.uk/20140402200706/http://ofqual.gov.uk/documents/component-reliability-gcse-gce/all-versions/

There might, and probably will, be a vigorous debate about what f means, how it might be used, and how difficult it is to measure. But one thing on which everyone will agree is that the only value which f definitely is not is zero. But that is the value that has been assumed for a very long time indeed...

Some further documents

Here are some further documents which you are welcome to download:

■  The Great Grading Scandal -  A description of the misallocation problem.

■  How to make grading fair -  A description of the solution.

■  How to determine f - An exploration of some different possibilities for measuring f.

■  Identifying marking 'errors' - A suggestion as to how marking errors might be identified.

■  The statistics of examination marking and grading - For those who would like to study the detail.

■  The 'cliff-edge' problem - What happens when a script is marked f + 1 marks below a grade boundary.

And...

...thank you for reading this far! I think this is important: why have grades if they depend not on the candidate's ability but on the lottery of marking, if there is a high likelihood that they can be changed on appeal, if they are unreliable? Indeed, why have grades at all? Why not declare each candidate's 'raw' mark m, together with the corresponding value of f for that examination? But if grades remain, they must be reliable - and I think that grading according to m + f might be one way of doing this.

But there might be others. So let's get the debate going, and please do contact me if you wish.