Assessment Dependability

The assessment concept that I think is most useful for teachers to know about

Matthew Benyohai
4 min read · Aug 1, 2018

Context

Everyone knows about classroom tests, mini-quizzes and questioning in lessons, but for too long the properties of assessments have been largely ignored by practising teachers. End-of-topic tests ‘work’: the students who do better are the ones we would expect to, and questions tell us what students do and don’t understand.

However, assessment is far more complicated than this. Complicated, but interesting; so much so that I decided to do a master’s on it.

It appears I’m not alone in my opinion. In recent years a lot has been written about assessment, much of it discussing its technical properties.

Key Concepts

I think there are certainly elements of assessment theory that classroom teachers should be aware of; however, I am concerned that there is an element of overkill in the current push for ‘better’ assessment. You can numerically calculate measures of validity and reliability, but should we bother?

This post attempts to lay out what I think about assessment as a classroom practitioner. It is not an academic thesis; it’s a pragmatic approach to a subject that could take up your whole working day if you let it.

Dependability

Validity is a measure of how well supported the inferences we make from assessment results are. For example, it may be valid for me to say that a student who achieved an A in GCSE Physics has a better understanding of physics than a student who got a D. It would be invalid for me to say they are better at Chemistry based on the same results.

Reliability is a measure of how consistently we can make those inferences. Would two teachers marking the same paper give the same mark? Would a student achieve the same result if they sat the assessment next week?
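
If you wanted to put a rough number on either of those questions, the simplest tool is a correlation between the two sets of marks. Here is a minimal sketch in Python; the marks are invented, and Pearson correlation is only one of several consistency measures used in practice:

```python
# Two quick reliability checks, each expressed as a correlation
# between two lists of marks. All marks are invented.
import numpy as np

# Inter-rater reliability: two teachers mark the same five papers.
teacher_a = np.array([42, 35, 51, 28, 47])
teacher_b = np.array([40, 37, 50, 30, 45])
print("inter-rater:", np.corrcoef(teacher_a, teacher_b)[0, 1])

# Test-retest reliability: the same students sit the paper a week apart.
week_1 = np.array([42, 35, 51, 28, 47])
week_2 = np.array([45, 31, 49, 33, 44])
print("test-retest:", np.corrcoef(week_1, week_2)[0, 1])
```

A correlation near 1 means the two mark lists rank students almost identically; the further it falls, the less consistently we can make inferences from the marks.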

Reliability is a prerequisite for validity. However, rather annoyingly, attempts to make an assessment more reliable tend to make it less valid. For example, an authentic assessment of practical work in physics would be to ask students to perform an investigation and have a teacher judge its quality. However, this would probably have low reliability (it depends on the teacher, the equipment, etc.). To make the assessment highly reliable I could instead design a multiple-choice test on practical work, but this bears no resemblance to practical work.

A measure of how well this conflict is resolved is dependability: an assessment is dependable when its balance of validity and reliability suits the inferences we want to draw from it.

In practice

In a lesson, I may ask a question to determine whether students have ‘got’ what I’ve just taught them. I need this question to have high validity, but I’m not too worried about reliability because it won’t be the only question I ask, and I’m not going to make any high-stakes decisions based on the result. This is a dependable use of this assessment.

At the end of a topic, I want to make judgements about how students have done in relation to one another and to other classes. I’ll need this to be more reliable, and I may have to sacrifice some validity. I’ll design a test using past exam questions, which are valid enough but offer some hope of higher reliability (it is worth noting that this is the situation these questions were designed for: the high-stakes context of GCSEs, where high reliability is needed and validity has been sacrificed slightly). This is a dependable use of this assessment.

I’ve seen lots of question-level analysis (QLA) spreadsheets that people use to identify gaps from end-of-topic tests: assigning red to topics where students didn’t pick up all the marks, green where they did. This is an example of an invalid use of the assessment data. Although a student not achieving a mark on a question may reliably tell you that they couldn’t do that question, it doesn’t tell you why, or how they would do on other questions in the topic. Because subtopic scores correlate so highly with one another, the overall score is a better indicator of ability in each subtopic. QLA is not a dependable use of this assessment.
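
A toy simulation shows why. Suppose, for the sake of argument, that a single underlying ability drives every question and each subtopic contributes only three marks. All the numbers below (1,000 students, 10 subtopics, 3 questions each) are invented for illustration:

```python
# Toy model: one latent ability drives every question, so a subtopic's
# three marks are a much noisier signal of ability than the full total.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_subtopics, qs_per_subtopic = 1000, 10, 3

ability = rng.normal(size=n_students)                 # latent ability per student
noise = rng.normal(size=(n_students, n_subtopics * qs_per_subtopic))
marks = ability[:, None] + noise                      # one column per question

subtopic_a = marks[:, :qs_per_subtopic].sum(axis=1)   # three marks on subtopic A
total = marks.sum(axis=1)                             # whole-test score

print("subtopic A vs ability:", np.corrcoef(subtopic_a, ability)[0, 1])
print("total vs ability:     ", np.corrcoef(total, ability)[0, 1])
```

Under these assumptions the subtopic score correlates with ability at roughly 0.87, while the total correlates at roughly 0.98: the whole test is the better gauge of any individual subtopic.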

In the past I have correlated end-of-year exam results with GCSE outcomes in order to determine predictive validity, and calculated Cronbach’s alpha to determine the internal reliability of our end-of-topic tests. Should classroom teachers be doing this in order to justify their inferences? Absolutely not; just have an appreciation of why you are doing the assessment and make sure what you use is dependable for that purpose. If you don’t feel the inferences being made off the back of an assessment are justified, don’t make them.
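
For the curious, here is roughly what those two calculations look like, sketched in Python on invented marks; if nothing else, it shows this is more effort than most classroom decisions justify:

```python
# Sketch of the two calculations named above, on invented data:
# Cronbach's alpha for internal reliability, and a simple correlation
# between exam results and later outcomes for predictive validity.
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: array of shape (students, items)."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(1)

# 200 students answering 20 items that all tap one ability (invented).
items = rng.normal(size=(200, 1)) + rng.normal(scale=0.8, size=(200, 20))
print("internal reliability (alpha):", cronbach_alpha(items))

# Predictive validity: correlate end-of-year marks with (numeric) GCSE
# outcomes for the same 50 students (again, all invented).
end_of_year = rng.normal(size=50)
gcse = 0.7 * end_of_year + rng.normal(scale=0.5, size=50)
print("predictive validity:", np.corrcoef(end_of_year, gcse)[0, 1])
```

A common rule of thumb treats alpha above about 0.7–0.8 as acceptable internal reliability, but as I argue above, calculating it is rarely the best use of a classroom teacher’s time.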
