While curriculum has undoubtedly been the hottest educational topic of the past few years, its close sibling — assessment — has also been the subject of much interesting debate. Specifically, the topic of grading has come under some scrutiny, with excellent blogs by Becky Allen and Matthew Benyohai adding to this older, but no less relevant, piece by Alfie Kohn on the potential folly of grading. I’ve found little to disagree with in these blogs, and yet we at Ark still grade some assessments. This raises an obvious question — why? I’m going to attempt to answer it, while also addressing two other important and related questions — when and how?
At Ark, we believe that student outcomes can be improved by our teachers and leaders taking informed actions. We also believe that these actions can be better informed through insightful analysis. But these things can only happen if our teachers and leaders have accurate data to analyse and act upon.
Assessment is a key source of data, but not all assessments serve the same purpose. On a day-to-day basis, the most important assessment ‘data’ in our schools comes from formative assessment — i.e. the information our teachers collect by setting tasks or simply asking students questions. These tasks and questions check for highly specific knowledge and skills, so the resultant ‘data’ helps show what each student can and can’t do at a granular level, informing teachers’ immediate and longer-term next steps. And yet we do not grade or even record this vital ‘data’ — at least not in any prescribed way. After all, as Ark’s former Head of Assessment, Daisy Christodoulou, illustrated so vividly in her book “Making Good Progress?”, grading formative assessment would be like a marathon runner measuring their weight-training in hours rather than kilograms. At best, pointless, and — at worst — grossly misleading.
But summative assessment serves a different purpose. Summative questions are multi-faceted and check for the retention and application of a broader knowledge and skills base from throughout a course. This is more like a marathon runner timing their practice races, which is why we do believe it’s useful to record and, yes, ‘grade’ this data. Grading is our means of approximating where a student resides along the national distribution. In isolation, this is not that useful, but it becomes increasingly useful when we can compare it with other related data points. For example, if a student appears to move significantly upwards or downwards along this national distribution, that is useful to know. Similarly, if a student’s position on the national distribution appears to diverge significantly between different subjects, that is also useful information.
Of course, individual student data can be very ‘noisy’, but this noise becomes increasingly attenuated as data gets aggregated across multiple students. As such, approximating a class, year or school’s average position along the national distribution can help inform school and network leaders’ periodic decisions around where to invest the scarce resources at their disposal.
But we can only perform this approximation of each student’s position along the national distribution if our summative assessments are as valid, reliable and comparable as possible. This is why we have committed to network-wide summative assessments wherever feasible. Constructing high quality tests requires significant expertise, so we try to maximise test quality through the network-wide consolidation of this expertise. The other obvious benefit of this approach is that it enables like-for-like comparisons between all of our students, regardless of which Ark school they attend. However, this comparability is dependent on us ensuring consistent test conditions as well as aligning our curriculum across all participating schools.
This hopefully helps explain why we collect summative assessment data, but I still haven’t justified why we choose to encode this data as ‘grades’ (e.g. 9–1 at KS3/4 or A*-E at KS5). The most honest answer is pragmatism. The national distribution already gets broken down into these well recognised (but somewhat arbitrary) segments at the end of these key stages, so it makes things less convoluted if we use the same conventions beforehand. To be even more brutally honest, if we didn’t encode this data as ‘grades’, schools and teachers would just do it anyway. As such, by doing so at a network-level, we are ensuring consistency, so when two Heads of Maths talk about their ‘Grade 9’ students, they are both talking about students that we believe to be performing within the top 3% nationally.
However, one final caveat (and perhaps cop-out) is that everything described above relates to data that is analysed and discussed by teachers and school/network leaders. I make no attempt to justify what is or isn’t shared with students or parents (which is a big focus of the blogs listed above) because, to be perfectly frank, this is something that I have zero experience of. All I can say is that we believe that teachers and leaders can make better decisions when armed with summative assessment data, and that they can have less convoluted discussions if we also encode this data as ‘grades’.
I joined Ark during the death throes of national curriculum levels. Back then, it was still relatively common for teachers to award sub-level ‘grades’ for student work on a lesson-by-lesson basis. Meanwhile, summative ‘grades’ — based on a smorgasbord of teacher judgements, bespoke tests and aggregations of formative data-points — were recorded at least six times per year in most schools.
Since then, our efforts to increase the quality and comparability of our summative assessments have had the additional impact of reducing their frequency. Happily, we now only record student ‘grades’ once or twice a year. This shift has certainly been driven by a desire to reduce teacher workload, but it’s also been about reducing opportunity cost more broadly, since testing windows inevitably eat into teaching time and disrupt the flow of the school year.
Another reason for reducing summative assessment frequency has been the extent (and limits) of curriculum alignment across our network. At Ark, we talk about curriculum at three broad levels: macro-curriculum (i.e. the time spent teaching each subject); design architecture (i.e. the knowledge and skills covered within each subject); and delivery architecture (i.e. the resources and approaches used to teach each subject). We try to maximise alignment of macro-curriculum and design architecture, since we see these as fundamental expressions of student entitlement. However, we allow much more school/teacher-specificity around delivery architecture. In practice, this gives us confidence that the same content will be taught in each school by the end of each year, but not by the end of any given week or term. This naturally limits how often we administer our network-wide assessments, since they are only meaningful when we are comparing like with like.
I’ve described summative data as an approximation of a student’s position along the national distribution. But how can we arrive at this approximation? This is where we leverage consistency, scale and technology.
If all students across the network take the same blind test, at the same time, under the same conditions, having covered the same content, we have a sample of >3,000 students per assessment (e.g. for Year 9 Geography). We’ve developed systems to capture raw marks for each of these students, providing us with a network distribution curve.
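To make this concrete, here is a minimal sketch of how a raw-mark distribution could be turned into a network percentile ranking. The function name and the cohort figures are illustrative assumptions for this post, not Ark’s actual systems:

```python
from bisect import bisect_left

def network_percentile(raw_marks, student_mark):
    """Share of the network cohort scoring strictly below this student,
    expressed as a percentile (0 = bottom scorer, approaching 100 = top)."""
    marks = sorted(raw_marks)
    below = bisect_left(marks, student_mark)  # count of strictly lower marks
    return 100 * below / len(marks)

# Illustrative cohort of raw marks out of 60
cohort = [12, 25, 31, 31, 40, 47, 52, 58]
print(network_percentile(cohort, 47))  # 62.5: five of eight students scored lower
```

With >3,000 students per assessment, these rankings give a reasonably smooth network distribution curve to work from.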
Next, we use a student’s position on our network distribution to approximate their position along the national distribution. We do this by:
- Starting with the final grade distribution from a relevant historic network cohort (e.g. last year’s Year 11 GCSE Geography)
- Breaking this historic cohort’s grade distribution down by prior attainment (e.g. KS2 or, better still, a nationally standardised test taken during KS3)
- Re-weighting the grade distribution of the historic cohort (i.e. last year’s Year 11 GCSE Geography) using the prior attainment of the cohort that is now being assessed (i.e. Year 9 Geography)
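The three steps above can be sketched in a few lines of code. All the figures below are made up for illustration (the real calculation uses actual historic cohort data), and the grade bands are coarser than in practice:

```python
def reweight_grade_distribution(hist_by_band, new_band_shares):
    """Approximate a new cohort's grade distribution by re-weighting a
    historic cohort's grade outcomes according to prior-attainment band.

    hist_by_band: {band: {grade: share of that band achieving the grade}}
    new_band_shares: {band: share of the NEW cohort in that band}
    Returns {grade: approximate share of the new cohort}."""
    approx = {}
    for band, band_share in new_band_shares.items():
        for grade, grade_share in hist_by_band[band].items():
            approx[grade] = approx.get(grade, 0.0) + band_share * grade_share
    return approx

# Illustrative: how each prior-attainment band of last year's Year 11
# cohort distributed across broad GCSE grade segments
hist = {
    "high":   {"7-9": 0.40, "4-6": 0.55, "1-3": 0.05},
    "middle": {"7-9": 0.08, "4-6": 0.62, "1-3": 0.30},
    "low":    {"7-9": 0.01, "4-6": 0.29, "1-3": 0.70},
}
# Illustrative prior-attainment profile of the Year 9 cohort now being assessed
new_cohort = {"high": 0.30, "middle": 0.50, "low": 0.20}

print(reweight_grade_distribution(hist, new_cohort))
# {'7-9': 0.162, '4-6': 0.533, '1-3': 0.305}
```

Because each band’s grade shares sum to 1 and the band weights sum to 1, the re-weighted distribution also sums to 1, as a sanity check.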
For example, if last year’s Year 11 GCSE Geography cohort had the following grade distribution:
And this distribution broke down by prior attainment as follows:
But the Year 9 Geography cohort that we are now assessing has the following (higher) prior attainment profile:
Re-weighting last year’s Year 11 GCSE Geography grade distribution using our Year 9 Geography cohort’s prior attainment profile yields the following approximate grade distribution:
In other words, the top 10% of last year’s Year 11 GCSE Geography cohort achieved a Grade 7 or better, but our Year 9 Geography cohort has slightly higher prior attainment than they did, so we can approximate that the top 13% of this new cohort are positioned within the Grade 7/8/9 segment of the national bell-curve. Similarly, we can approximate that the next highest scoring 53% are positioned within the Grade 4/5/6 segment of the national curve, and so on for the remaining segments.
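The segment arithmetic above amounts to walking down cumulative shares. A hypothetical sketch, using the worked example’s figures (the function name is mine, not a real system’s):

```python
def grade_segment(rank_from_top, segments):
    """Map a student's network ranking (percentile measured from the top,
    so 0 = highest scorer) onto a grade segment via cumulative shares.

    segments: ordered (label, share_of_cohort_%) pairs, best grades first."""
    cumulative = 0.0
    for label, share in segments:
        cumulative += share
        if rank_from_top < cumulative:
            return label
    return segments[-1][0]  # bottom segment catches any rounding leftovers

# Shares from the worked example: top 13% in 7/8/9, next 53% in 4/5/6
segments = [("7/8/9", 13), ("4/5/6", 53), ("1/2/3", 34)]
print(grade_segment(5, segments))   # 7/8/9
print(grade_segment(40, segments))  # 4/5/6
```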
We can refine this approach by interpolating further grade breakdowns within each segment (i.e. Grade 4 vs 5 vs 6), and even make appropriate adjustments for tiered papers and/or inter-related subjects like Combined and Triple Science. We can also make adjustments to account for any anticipated sample biases (though the act of re-weighting based on prior attainment already addresses this to some extent). But the general approach is as I have described above and is essentially very similar to what exam boards call ‘comparable outcomes’.
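One crude way to picture the within-segment refinement is a proportional split of each segment’s share using the historic cohort’s grade mix, though a proper implementation would interpolate against the underlying mark distributions. The figures here are, again, invented for illustration:

```python
def split_segment(segment_share, historic_within_segment):
    """Divide a broad segment's cohort share (in %) across its individual
    grades, in proportion to the historic cohort's grade mix within it."""
    total = sum(historic_within_segment.values())
    return {grade: segment_share * share / total
            for grade, share in historic_within_segment.items()}

# Illustrative: 53% of the new cohort sits in the 4-6 segment, and the
# historic cohort's grades within that segment split 25/35/40 across 4/5/6
print(split_segment(53, {"4": 0.25, "5": 0.35, "6": 0.40}))
# {'4': 13.25, '5': 18.55, '6': 21.2}
```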
For the avoidance of any doubt, this approach relies on the assumption that network-wide performance within a given subject will remain stable over time (after adjusting for changes in prior attainment profiles). This is generally true at a national level, but is often not the case at individual school level. As such, the extent to which this assumption holds for any given network will depend on its overall size and general stability. Ark is pretty big and fairly stable (at a network-level), so it seems to hold relatively well in our case.
However, the validity of this assumption is actually less critical than some might assume. While it would of course be satisfying to get these network-to-national approximations spot on, this wouldn’t necessarily change our insights or the actions that follow from them. I asserted above that the purpose of this summative data is to “inform school and network leaders’ periodic decisions around where to invest the scarce resources at their disposal”. In other words, this data is fundamentally about prioritisation, which is a distinctly relativist exercise. As such, so long as we can understand how students, classes and schools are performing relative to each other, it is not necessarily that important for us to know their exact positions along the national curve.
This means that the most useful summative assessment data we have at our disposal is a student’s network percentile ranking, while the ‘grades’ that we derive from these rankings are simply the pragmatic but imperfect translation of these rankings into the common language of GCSE (or A-Level, SATs etc) grades. While this statement may surprise or even concern some, it need not be a problem, so long as we all understand what these ‘grades’ are and, perhaps more importantly, what they are not.
These ‘grades’ are consistently derived approximations of our students’ current performance relative to their national peers. They are a way to quantify differences between class or school averages in a language that most teachers are familiar with. They are even potential indicators of large differences between individual students’ current performance and/or large swings in an individual student’s performance over time.
However, these ‘grades’ are not accurate predictions of any individual student’s future performance. Nor could they ever be, since student progress trajectories are famously variable and even final GCSE grades are not as reliable as many assume them to be. They are not even a precise way to quantify differences between individual students’ current performance or an individual student’s performance over time, since underlying differences can either be amplified or hidden depending on their proximity to grade boundaries. And this is before we even consider the various forms of measurement error that might influence any individual student’s apparent performance. (N.B. As a crude rule of thumb, I read individual student grades with a +/-1 range, but class/school averages with more precision.)
But most of all, these ‘grades’ do not tell teachers what their students have and haven’t learned. And this is why I’d reiterate that “the most important assessment ‘data’ in our schools comes from formative assessment”. So, while network and school leaders’ time may be well spent analysing summative data, teachers’ time is likely better spent collecting, analysing and acting on high quality formative information. This ‘data’ doesn’t need to reside in any systems; it just needs to reflect the myriad interactions they have with their students each day. Oh, and did I mention they shouldn’t grade it(?!)
N.B. Most of the examples used in this blog refer to secondary assessments, but the ‘Why’ and ‘When’ apply equally to primary. The ‘How’ is potentially a bit easier for primary, since nationally standardised tests (that align with the network’s curriculum) could provide direct national percentile rankings without needing to adjust any network-specific rankings. However, these tests are often repeated each year, so beware of teachers unconsciously teaching to the test.