Why the scoring of psychological tests isn’t helpful in making a clinical diagnosis

vanessa.woroniak
Published in Psyc 406–2016
5 min read · Mar 22, 2016

The Diagnostic and Statistical Manual of Mental Disorders (DSM) is the tool used by clinicians, other healthcare providers, and related parties to classify mental disorders. Notably, it describes the symptoms associated with each mental disorder, which serve as indicators for making a diagnosis. Although the DSM is widely used and praised for giving practitioners a common language, it is also a subject of controversy and criticism. Since its first publication by the American Psychiatric Association (APA) in 1952, it has been revised several times, with the most recent version, the DSM-5, having come out in 2013. Each revision goes through a lengthy and rigorous review involving various committees, which add or remove mental disorders and refine the diagnostic criteria for existing ones.

In the process of making a diagnosis, many clinicians use psychological tests to aid them. These tests are also developed meticulously and are tested thoroughly for validity and reliability before being used for consequential decisions, such as making a diagnosis. A scale is then established that associates the final score with a specific category or label, with different scores representing different subsets of the population. For example, the Patient Health Questionnaire (PHQ-9), used to screen for, monitor, and help diagnose depression, treats scores at or above 10 as possible cases of major depression (MD), with severity bands that increase in 5-point increments (i.e., moderate 10–14, moderately severe 15–19, severe 20–27). So my question is: what is so different between people scoring a 9 and those scoring a 10? Do these scores actually demonstrate different degrees of severity in the construct? Does the scoring of the test really allow for the discrete categorization of people?
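To make the 9-versus-10 question concrete, here is a minimal sketch in Python. It assumes only the scoring described here: nine items each rated 0 to 3 (the PHQ-9 format), summed, with totals at or above 10 flagged as possible major depression. The function names and the two example respondents are my own, for illustration only.

```python
CUTOFF = 10  # totals at or above this are flagged as possible major depression


def phq9_total(item_ratings):
    """Sum nine item ratings, each expected to be an integer from 0 to 3."""
    if len(item_ratings) != 9 or any(r not in (0, 1, 2, 3) for r in item_ratings):
        raise ValueError("expected nine ratings between 0 and 3")
    return sum(item_ratings)


def is_flagged(total, cutoff=CUTOFF):
    """Apply the cutoff: a single point decides which side a person lands on."""
    return total >= cutoff


# Two made-up respondents who differ by one point on one item.
person_a = [1, 1, 1, 1, 1, 1, 1, 1, 1]  # total 9
person_b = [2, 1, 1, 1, 1, 1, 1, 1, 1]  # total 10

for name, ratings in (("A", person_a), ("B", person_b)):
    total = phq9_total(ratings)
    print(name, total, "flagged" if is_flagged(total) else "not flagged")
```

The two respondents give nearly identical answers, yet only one of them crosses the line into the flagged category.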

Most psychological tests are interpreted in a norm-referenced manner. This means that an individual’s score is compared with the distribution of scores in the population, estimated from a sample, which is usually expected to resemble a bell curve. The scores obtained by different subsets of the population therefore set the cut-offs of the test’s scoring, which at times allow a person to be classified into a specific category, depending on the construct being tested. Although I personally cannot come up with a better way to score tests and still have the end result be meaningful, I feel that one should be devised, for several reasons.
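As a rough illustration of what norm-referenced interpretation means, the sketch below places a raw score on a bell curve. The norm mean of 50 and standard deviation of 10 are made-up values chosen for the example, not the norms of any real test, and the function names are my own.

```python
from statistics import NormalDist


def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norm mean, in standard deviations."""
    return (raw - norm_mean) / norm_sd


def percentile_rank(raw, norm_mean, norm_sd):
    """Share of the norm group expected to score at or below `raw`,
    assuming the scores follow a normal (bell-shaped) distribution."""
    return 100 * NormalDist(mu=norm_mean, sigma=norm_sd).cdf(raw)


# Hypothetical norms: mean 50, standard deviation 10.
print(z_score(60, 50, 10))                     # 1.0
print(round(percentile_rank(60, 50, 10), 1))   # about 84.1
```

The score only gains meaning relative to the norm group, so any category boundary is ultimately a line drawn somewhere along this curve.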

First of all, different symptoms of the same disorder do not necessarily carry the same weight when it comes to the patient’s distress. Moreover, the subjective perception of these symptoms and the different possible combinations of symptoms also alter the individual’s experience of them. This means that not all symptoms should be weighted equally when they are tallied in psychological tests. Continuing with the example of the PHQ-9, where items are rated on a scale of 0 to 3 (Not at All to Nearly Every Day), here are two of its items: “Thoughts that you would be better off dead or of hurting yourself in some way”, “Poor appetite or overeating”. I think we can all agree that both statements represent symptoms of depression, but answering 3 (nearly every day) to one probably doesn’t represent the same degree of distress and impairment as answering 3 to the other. My point is that the intervals (of severity, in this case) are presumed to be consistent between items, when maybe they shouldn’t be. Therefore, the same score of X can mean two completely different things for two people, depending on the symptoms driving that score.
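The sketch below shows how two people can share a raw score while differing once items are weighted. The weights are invented purely for illustration; they are not clinically derived and are not part of the PHQ-9, and the item keys are shorthand labels of my own.

```python
# Hypothetical weights, for illustration only (not clinically validated).
HYPOTHETICAL_WEIGHTS = {
    "appetite": 1.0,
    "self_harm_thoughts": 3.0,
}


def raw_score(answers):
    """Unweighted sum: every item counts the same."""
    return sum(answers.values())


def weighted_score(answers, weights):
    """Weighted sum: each rating is multiplied by its item's weight."""
    return sum(weights[item] * rating for item, rating in answers.items())


# Two made-up respondents, each answering 3 on a different item.
person_a = {"appetite": 3, "self_harm_thoughts": 0}
person_b = {"appetite": 0, "self_harm_thoughts": 3}

print(raw_score(person_a), raw_score(person_b))  # 3 3
print(weighted_score(person_a, HYPOTHETICAL_WEIGHTS),
      weighted_score(person_b, HYPOTHETICAL_WEIGHTS))  # 3.0 9.0
```

Under equal weighting the two profiles are indistinguishable. Under these assumed weights they separate, which is the kind of distinction I am arguing for.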

Another reason I feel that test scoring isn’t necessarily appropriate for clinical assessment is that it doesn’t take into account the interaction between symptoms. As noted above, symptoms that present simultaneously do not act on the individual independently of one another. They interact with each other and also with the biological and environmental baggage of the individual. Again with the example of the PHQ-9, the following two items would most likely present through their interaction with each other: “Feeling tired or having little energy”, “Trouble falling asleep, staying asleep, or sleeping too much”. In another example, the next two items, presenting concurrently and interacting, could possibly predict worse outcomes than either would alone: “Feeling bad about yourself — or that you’re a failure or have let yourself or your family down”, “Thoughts that you would be better off dead or of hurting yourself in some way”. Thus, scoring items independently of one another doesn’t capture the full scope of what the symptoms mean to the individual, because it ignores crucial factors such as the interaction between them and the person’s history. So again, two similar scores, even when similar items are answered the same way, still do not give a proper picture of the syndrome.
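One simple way to picture an interaction is to add an extra term that only counts when two symptoms are present together. The sketch below does this with a product term. The pairing, the coefficient of 0.5, and the item keys are assumptions made up for the example, not an established scoring rule.

```python
# Hypothetical interaction: the pair and its coefficient are illustrative only.
HYPOTHETICAL_INTERACTIONS = {
    ("worthlessness", "self_harm_thoughts"): 0.5,
}


def additive_score(answers):
    """Standard scoring: items are summed independently of one another."""
    return sum(answers.values())


def score_with_interactions(answers, interactions):
    """Additive score plus coefficient * rating_a * rating_b for each listed pair.
    The extra term is zero unless both symptoms in the pair are endorsed."""
    extra = sum(
        coefficient * answers.get(item_a, 0) * answers.get(item_b, 0)
        for (item_a, item_b), coefficient in interactions.items()
    )
    return additive_score(answers) + extra


# Two made-up profiles with the same additive score of 6.
profile_x = {"worthlessness": 3, "self_harm_thoughts": 3, "appetite": 0}
profile_y = {"worthlessness": 3, "self_harm_thoughts": 0, "appetite": 3}

print(additive_score(profile_x), additive_score(profile_y))  # 6 6
print(score_with_interactions(profile_x, HYPOTHETICAL_INTERACTIONS))  # 10.5
print(score_with_interactions(profile_y, HYPOTHETICAL_INTERACTIONS))  # 6.0
```

A product term is only one possible choice; the point is that co-occurring symptoms can be made to count for more than the sum of their parts.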

So, long story short, the scoring methods of psychological tests should, in my view, be revised to include differential weighting of items and the interactions between items/symptoms. This would lead to more representative and comprehensive scoring of tests. I do want to add a few caveats, though. I am fully aware that some psychological tests do indeed do this. Tests are also not scored by robots, which allows human “wisdom” and experience to intervene and provide insight and judgment. There is also patient-physician contact, which is probably even more important than the screening devices used and allows for a more individualized and patient-based diagnosis and treatment plan. Lastly, the PHQ-9 has been thoroughly validated and tested for reliability; I chose this test because it is known by many and is short and simple. My point is that revising scoring mechanisms could benefit patients by making tests more specific and personalized. Linking the test scores to a DSM diagnosis would also take less time, and it would most likely be more reliable and standardized, leading to fewer chances of faulty labeling. Faulty labeling is associated with stigmatization, from others but also from the self, making this an issue of prime importance.
