
Education Futures Studio

UK exam algorithm game

Increasingly, algorithms are being used in education, including allocating student grades. If you were in charge, what decisions would you make for a fairer grading system? How would you design an algorithm? Play our ‘algorithm game’ and we’ll guide you through some of the fairness challenges based on the 2020 UK exam controversy. An algorithm game invites you to learn and think about how algorithms work, such as how different inputs lead to different outputs. Our game also aims to provide insights into the complexity of fairness issues in using algorithms. Let’s start.

The questions below are part of a research project on algorithms. No identifying data is being collected, including no IP addresses (we are GDPR friendly!). Please note, there is no need to complete the questions to play the game. But if you answer them and click on the blue arrow after each question, we can use your answers in our research. Thanks!


Question 1

The controversy

In England, Wales and Northern Ireland, 2019 was largely like any other year for high school students. These students, in their final year of secondary school, sat their exams. About six months earlier, roughly 50% of them had applied to universities. For most, some of the universities they applied to made conditional offers of a place to study, based on their predicted grades. Students then picked their first choice, and a backup, from these offers. They sat their exams, enjoyed the summer, and hoped their results would meet the conditions so they could start in September. In August, students received their grades, university places nationally were roughly what was expected, and students set off on life after school.

But 2020 was completely different. Just like 2019, universities spent the first few months of 2020 sending out offers to students, conditional on the results of exams they would sit from mid-May. But then COVID-19 struck. On March 23rd, the first lockdown went into effect. By April there had been 26,000 deaths. The exams were meant to start in mid-May.

But the decision had been made in March that these exams could not be held. That left students with a problem: all their university places were dependent on getting particular grades in these exams. The grades were essential, but the exams were impossible. Solving this problem fell at the feet of Ofqual, the UK exam regulator.

Ofqual needed some way of allocating grades to students without getting them to sit exams. They decided to use an algorithm with some automated features. The algorithm used information about individual students and schools to determine individual grades, while standardising across cohorts of students. Ofqual was trying to avoid problems that they saw with the information they had, particularly around grade inflation, and potential for variation in accuracy of predictions from individual teachers or schools.

The algorithm went to work. But once the grades were announced in August, there was a huge public backlash. Students from across the UK hit the streets. Students and teachers demanded the government take immediate action. The official response from the government was:

“no U-turn, no change”

Two days later Gavin Williamson, the Secretary of State for Education, announced that the algorithm would be completely scrapped.

So why did they decide to use an algorithm? Was it fairer than the alternative of having teacher allocated grades? Or were the students who protested correct that it was deeply unfair? Let’s begin with a deeper dive into the problems Ofqual faced and what they tried to do about it.

The problem Ofqual faced

Ofqual had access to one key piece of information about students who were supposed to sit their exams in 2020: their predicted grades for each subject. These grades were ones teachers estimated that students would get in their exams. Such predictions were often useful, in particular in helping with the process of university applications. It wouldn’t be wise for a student to apply to a university that typically gives conditional offers of 3 A*s, when the student is predicted 3 Cs. And it would be prudent of a university who made a conditional offer of 3 Cs to expect that a student predicted 3 A*s would meet their offer. However, there is a known problem with teacher predictions: they are highly inaccurate.

Ofqual knew this because they had access to the predicted grades and actual exam results for students from previous years. Teacher predictions were routinely higher than exam results. In fact in 2019, 79% of university entrants did not get their full set of predicted grades.

Ofqual could therefore predict a serious problem with accepting teacher predictions alone in place of exams: the grades would be inflated. This posed some problems.

Exams are supposed to provide a fair measure of students’ abilities, at a point in time. They are also supposed to differentiate students, so that decisions can be made about university entry and employment (we know there are problems with exams but we will leave these aside for the purposes of this algorithm game). Grade inflation posed a problem for the goals of exams.

Given that there is a ceiling (no one can get higher than an A*), pushing students up and up means that they cannot be differentiated (we can leave aside for the moment whether the purpose of education is to differentiate students).

Additionally, grade inflation posed a serious practical problem for university admissions. Entrance would be based on offers conditional on grades, but those offers had been made on the assumption that the grades students were likely to get would be similar to previous years'. If grades were markedly inflated across the board, the system would be in disarray.

We can illustrate the predicament for universities by looking at the subject Mathematics for the 2019 cohort and the predicted teacher grades for 2020. The key thing to see in the below graph is that if the 2020 teacher predicted grades were accepted, universities would have 9% more A*/A grades, or 8,735 students, to deal with. This would be great for students but would pose challenges for universities.


Question 2

What do you think?

Using only teacher predicted grades also posed a problem for fairness. If the grades were inflated then they were not an accurate representation and thus, in one sense, unfair as a measure of student ability. Furthermore, it is well-known that teacher predictions are influenced by a range of social factors that call their fairness into question. For example, disadvantaged students are more likely to have their grades underpredicted than advantaged students.

Problem 1: Grade inflation

If we use teacher predicted grades, we know they’re likely to be inflated, and that there may be biases in this. Grade inflation introduces unfairness because it may devalue grades, and make it harder to compare grades across cohorts.

Solution 1: Use historical data

Use an algorithm that standardises students’ assigned grades by drawing on historical data to limit inflation.

In a normal year, exam standardisation and moderation were mechanisms to try to ensure that grades reflected abilities and differentiated students (for example, if an exam is easier one year, the mark required to achieve a top grade could be increased). So what to do?

Teachers take account of lots of information when they predict student grades at an individual level, but of course a single teacher doesn’t assign grades across the country, and teachers don’t necessarily weigh the ingredients in the same way for everyone. Ofqual had a responsibility to consider the impacts across all students. So, Ofqual needed to try to create an algorithm that accounted for these known issues.

Ofqual tried to use a range of information in order to provide standardisation and moderation of the teacher predictions. Using an algorithm that used historical data was their best bet in trying to avoid these problems. In what follows we will dig deeper into this solution.

So now we know why Ofqual used an algorithm. But what is an algorithm? And how could one overcome the problems Ofqual faced?

The algorithm game (an interpretation of the Ofqual algorithm)

What is an algorithm? It is a set of rules: step-by-step instructions that are followed to get a result. A really common illustrative example is a recipe.

In what follows we will outline some simplified parts of the Ofqual algorithm (the inputs and procedure) and what it means for fairness when it was used to allocate grades (the output).

This algorithm game has three students. The algorithm game focuses on showing the different impact of the algorithm on students, and on how seemingly similar data inputs can lead to different outputs.

Our students

These three students – Phil, Jill and Will – are similar in two key ways:

  • Their teachers have predicted they will get the same grades – an A – in a subject called ‘Critical Data Studies’. We will assume that the three students have teachers as good at predicting grades as each other. So, if they think someone is going to get an A, they tend to get an A.

  • All the students had the same prior achievement – also an A – in ‘Critical Data Studies’

As these students are all the same, we should expect the algorithm to give them the same exam results too, right?

But, remember, Ofqual is worried that these predictions may not be accurate, and that they might be inflated. So, they use some historic data about the students (their prior achievement) and about their testing centres (the schools).

What makes up our algorithm?

The actual Ofqual algorithm accounted for multiple features (like how grades are distributed across the country, and how accurate teacher predictions tend to be), but in our algorithm game we’re just going to have two inputs:
  • Ranking of students by teachers

  • Using historical grades of a school

INPUT

Ranking of students by teacher

This part of our algorithm does not include the grade from the teacher, but the teacher ranking of the student in their school in Critical Data Studies.

This is also what Ofqual did. They used a ranking of the student within all the students at that school for that subject as they thought that was more accurate than grade predictions.

As Ofqual put it:

“We also asked teachers to provide a rank order of students for each grade for each subject … we know from research evidence … that teachers’ judgements tend to be more accurate when they are ranking students rather than estimating their future attainment”

INPUT

Using historical grades of a school

Ofqual looked at the school’s prior data for 2017, 2018 and 2019. That’s because they wanted to check whether or not the school predictions were broadly in line with the results those schools normally receive. This was based on the idea that schools tend to get the same kinds of students over time, and so the results a school’s students achieved over the previous three years should be pretty similar to those it achieves the next year.

In this algorithm game, for our subject, ‘Critical Data Studies’ (CDS), we have created distributions for each of the students’ three schools, based on imaginary average data for the previous years. We can use this to see where our three students sit compared to those historic grade distributions.
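To make these two inputs concrete, here is a minimal sketch of how they could be represented for a single student. The rank, school and grade list below are made up for illustration; they are not taken from the game or from Ofqual.

```python
# Hypothetical representation of the game's two inputs for one student
# (illustrative values only, not the game's actual data).
teacher_rank = 3  # the teacher ranks this student 3rd in their school's CDS cohort

# The school's Critical Data Studies grades from recent years, listed best to worst
# (in the real algorithm this was based on the centre's 2017-2019 results).
school_history = ["A*", "A", "A", "B", "B", "C"]
```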

Question 3

We have two inputs. Which do you think is most important?

PROCEDURE

Our algorithm takes our two inputs and applies a procedure to allocate a grade. We’re expecting that our students will all get the same A grades as they all have the same teacher prediction, plus the same previous results. Right?

OUTPUT

The below shows what happens when the algorithm is applied. You will see our students move from their ranking in their school to their grade.

Let’s see what happens.

What is going on here!?

Well, remember, the algorithm combines the prior achievement data (identical for our students) and the teacher ranking (also identical here) with each school’s previous grade distribution, and that is where the unexpected results come from.

Phil is moved down because, although they are a high achieving student, in their school no other student has historically achieved an A. The previous four highest average grades were B, B, B, C. As Phil is ranked 4/50 they receive a C.

Jill also moves down. Although they are quite high achieving, the previous average grades in the school were A*, A, A, B. As Jill is ranked 4/50 they receive a B.

Will stays the same. They are in a small school with only five students doing Critical Data Studies. The previous average grades were A*, A*, A*, A, E. Will is ranked 4 out of 5 so keeps their A grade.

The algorithm treated our students very differently because it combined a teacher input (ranking, which was the same for all the students) with historical data from the different schools.
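One way to see why identical-looking students end up with different grades is to sketch the game’s procedure in code. This is our simplified reading of the game, not Ofqual’s actual model: the student ranked k-th in their school simply receives the k-th best grade in that school’s historical distribution.

```python
# A minimal sketch of the game's procedure (a simplification, not the real Ofqual model):
# the student ranked k-th receives the k-th best grade in their school's
# historical distribution for that subject.

def allocate_grade(teacher_rank: int, school_history: list[str]) -> str:
    """school_history lists the school's historical grades from best to worst."""
    return school_history[teacher_rank - 1]

# Historical distributions from the walkthrough above (only the top slots are
# needed for the two larger schools; Will's school has just five slots in total).
school_histories = {
    "Phil": ["B", "B", "B", "C"],             # no A grades at this school in recent years
    "Jill": ["A*", "A", "A", "B"],
    "Will": ["A*", "A*", "A*", "A", "E"],
}

for name, history in school_histories.items():
    print(name, allocate_grade(4, history))   # all three students are ranked 4th
# Phil C, Jill B, Will A: same prediction and prior attainment, different outputs
```

In this sketch the grade a student receives depends entirely on what grades their school happened to get in previous years, which is exactly the tension the game is highlighting.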

In attempting to solve the problem of grade inflation the algorithm appears to have created another problem.

Problem 2: Historical data

Students with the same prior achievement and predicted grades have been given different assigned grades by the algorithm.

Solution 2: Defence of relevance

Keep using historical data because school track records do matter.

The testing centre problem

Ok, this is raising lots of fairness issues. And Solution 2 also raised some other problems.

We’d like to introduce another student to highlight another fairness issue.

Meet Will’s twin, Sal, who goes to the same school and is also doing ‘Critical Data Studies’. Like Will, Sal is predicted by their teacher to receive a grade of A. Sal is ranked 5 out of 5 in Critical Data Studies.

When the algorithm is applied to Sal, can you guess what happens to their grade? (remember the algorithm takes the ranking and grades from the previous cohorts).

Did you figure out what happened?

Over the course of the previous years there had been some students who received Es in Critical Data Studies, meaning that the distribution of grades used was 4 x A*/A + E.

Will, who was ranked 4 out of 5, received an A. Their sibling Sal, ranked 5 out of 5, received an E.

This seems very unfair.

It arises from another issue Ofqual uncovered in applying this algorithm across all schools: results were highly variable where a school’s cohort for a subject was small. Because the algorithm relies on previous years’ data, in contexts where there was very little data the results might swing wildly.
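A short snippet, using the same simplified rank-to-historical-grade procedure sketched above, shows how brittle the output is when a school has only a handful of historical slots:

```python
# Will and Sal's school: only five historical grade slots, the lowest an E.
history = ["A*", "A*", "A*", "A", "E"]
print(history[4 - 1])  # Will, ranked 4 of 5 -> "A"
print(history[5 - 1])  # Sal, ranked 5 of 5 -> "E": one rank place, four grade bands apart
```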

So what can be done?

As a result, the decision was made to exclude small testing centres (<5 students) completely from the algorithm, and to adjust how much influence the algorithm had on grades for mid-size testing centres (5-15 students). So, while those from larger schools had their grades algorithmically moderated, those from small schools did not. They received the original teacher predicted grades.
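The text above gives the thresholds but not the exact adjustment Ofqual used, so the sketch below is purely illustrative: a hypothetical weight describing how much the algorithm’s grade counts relative to the teacher’s prediction, depending on the size of the testing centre.

```python
# Hypothetical centre-size rule. The <5 and 5-15 thresholds come from the text above;
# the sliding scale for mid-size centres is our illustration, not Ofqual's actual formula.
def algorithm_weight(cohort_size: int) -> float:
    """How much the algorithm's grade counts, from 0.0 (teacher grade only) to 1.0."""
    if cohort_size < 5:
        return 0.0                        # small centres: teacher prediction stands
    if cohort_size > 15:
        return 1.0                        # large centres: fully moderated by the algorithm
    return (cohort_size - 5) / 10         # mid-size centres: partial moderation

print(algorithm_weight(3), algorithm_weight(10), algorithm_weight(30))  # 0.0 0.5 1.0
```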

But there’s a further fairness issue. While there are a range of reasons that small testing centres exist, they are more likely to be private schools. As a result, there were concerns that private schools were being unfairly exempted from having their grades algorithmically moderated, and that this might benefit students who might already have various structural benefits.

Question 4

So, what would you do?

What happened?

We know that applying the algorithm to grades received a massive backlash.

Students – already anxious about exam results, Covid, and their university admissions – were left in the dark about how their grades were determined. They were also left in the dark for too long about how they could appeal results.

Because the algorithm was designed to combat grade inflation, a far higher proportion of students were “downgraded” via the algorithmic moderation than were upgraded. As a result, there were protests, with “Fuck the algorithm” placards, and concerns about the fairness of ranking students, and the ‘postcode lottery’ of using historic school data to determine current student outcomes.

As a result, as we noted in the introduction, on August 15th, two days after the results had come out, Gavin Williamson, the Secretary of State for Education, said there would be no change: “no U-turn, no change”. By the 17th he had announced that the algorithm would be completely scrapped.

Now that might be it. No algorithm, no issue. Though, maybe not.

Problem 3: Small changes, big effects

Small changes (in students’ predicted grade or historic data) make a bigger difference for some than others.

Solution 3: Account for class size

Don’t apply algorithm to small class sizes where small changes have the biggest impact.

Problem 4: Privilege and bias

Algorithm applied to some students but not to others (and correlation with privilege).

Solution 4: Abandon algorithm

So, in the end in the face of the problems we’ve stepped through, public pressure, and the need for immediate action, the decision was made to revert to teacher predictions…

Problem 1 and so on…

Question 5

Each of the solutions considered was trying to address a (very real) moral problem of fairness. However, each one introduced new moral problems. The final place we end up seems better from the point of view of most individual students (their grades are likely to be higher than if they had sat the exam), but this benefit is not necessarily evenly distributed. It also introduces a host of problems at a societal, systemic level, in terms of university placements and the function of exams, which is in part to differentiate students based on ability.

And in fact, we saw some of these problems come to fruition, with huge leaps in the numbers of students admitted at some unis, and drops at others, raising concerns about the financial and resource impacts. This has resulted in concerns that there will be less space for the 2021 cohort, that universities are increasing their grade entry requirements to compensate, and that parents – disproportionately those of children in private schools – are pressuring teachers to increase grade predictions.

In 2021 we’re back to the first fairness problem: grade inflation!

Learning lessons

Now, on the one hand, we could leave it here as a particular example of algorithms in education. We could see the use of an algorithm in the Ofqual case as a unique instance. It is different from other algorithmic instances in that it was not deployed to improve an existing system or to make some small gain in efficiency, but to make up for something that couldn’t be done at all. A one-off instance (leaving aside that the UK is facing a similar problem in 2021).

The algorithm was a complex balancing act – an attempt to do something (accurately predict grades), that could not be done in the usual way (exams), while also trying to address issues of fairness.

On the other hand, while we’ve tried to explore fairness in this specific context, a key takeaway is that we need to consider the wider system. Neither humans nor algorithms are going to be able to make a completely fair set of choices. Indeed, the use of algorithmic decision making still involves human decision making. A focus of explainable AI and fairness in AI is often on how we build systems to be fair. But as we’ve shown, there are often irreconcilable tensions in fairness.

What is less often discussed in such work is where we should choose not to deploy algorithms, and the wider context and implications of these technologies (such as PESTLE: Political, Economic, Social, Technological, Legal and Environmental factors). The fairness tradeoffs we outlined above span multiple algorithmic decision-making instances. Furthermore, if there is an identified public need and value for algorithms, then participatory research is vital. This can include co-design and co-production with stakeholders who will be impacted by algorithmic decision making.

Thanks for reading (and playing). We hope it has tested your thinking about the use of algorithms in education, and you’ve enjoyed thinking through the ways different uses of data can lead to different outcomes.

We’d love to hear from you if you learned from our algorithm game.

Acknowledgements

Thanks to Gemma Campbell (Gradient Institute Summer Scholar, 2020) who did analysis that informed the testing centre size section. Thanks also to Jett Maximo who created the algorithm recipe graphic.

This algorithm game was created through a co-design process. Our thanks to the participants in the co-design workshops.

Contact details and the different groups who helped design this algorithm game are below.