
Are Schools Preparing Our Children for Life?


Lack of basic education seriously hampers a person’s ability to lead a decent life and precludes employment that pays a livable wage. In recent decades, we have relied on standardized tests of reading and mathematics to assess how well schools are preparing our children for that life. Today, we need to assess much broader goals, even as existing tests show that basic reading and math learning remains inadequate in some of our schools. To be equitable, we need to identify all the major competence goals for our schools and develop more useful measures of how well children are gaining the skills they need.

A couple of years ago, I wrote a book[1] that took a first pass at this, and I think much of what it said is still worth considering. Since then, folks at McKinsey have developed a much more detailed and validated list of the needed skills,[2] while the K-12 world has made only minimal and informal efforts to adapt to these emerging needs. In this essay, I consider what we can learn from the McKinsey effort and how schools need to change to address an urgent need. My sense is that the best schools, and many after-school and summer opportunities serving children from privileged backgrounds, implicitly provide many of these skills, but that public school systems are not organized to do so and therefore fail to do so. The schooling and extracurricular activities that wealthy parents seek for their children are substantially less available to children from less privileged families. If we care about educational equity, we need to address this gap.

Whether via schools or through extracurricular activity, providing strong and equitable preparation for full participation in our democracy and economy is a public responsibility, regardless of how much of that preparation currently is in the private sector.

The skills, which the McKinsey team calls Distinct Elements of Talent (DELTAs), are listed in Figure 1. While some tests, such as those developed in response to the Common Core State Standards, implicitly assess a few of the listed elements,[3] schools generally are neither focused on teaching these skills to all their students nor measuring whether their students are acquiring them.[4] Pieces of these skills may be acquired through extracurricular and cocurricular activities such as maker spaces, but there is no serious effort to assure that maker activity works for all children, nor are there rigorous checks to see how students with access to maker spaces are progressing toward acquiring the skills they will need to live well. Moreover, these extracurricular activities are less available to students from less privileged families. So, how do we get to the point where school allows all children to learn the needed “elements of talent” and assesses how well our children are succeeding in learning needed skills beyond traditional academic subjects?

Figure 1. McKinsey Distinct Elements of Talent.[5]

The Common Core State Standards[6] tried to do some of this by altering subject matter curricula to include elements of some of these skills, such as embedding structured problem solving in the math curriculum. However, most of the skills in Figure 1 are not tied to a specific subject matter, and each subject matter has a long tradition of focusing testing on its core conceptual and specific performance elements, not on broader competences. The out-of-school activities some children experience come closer, but they lack a solid approach. Specifically, teachers and coaches who lead such activities are not given explicit goals and strategies for achieving those goals, nor are there tools for assessing whether the goals are achieved. For example, the listed skill of “courage and risk-taking” can be built, I suspect, from a range of experiences such as Junior Achievement[7] or school-based youth entrepreneur programs, but we lack specific outcome goals for those experiences, clear and well-documented procedures for designing and delivering them, and validated tools for assessing student progress in attaining them. This is true of just about all the DELTAs on the McKinsey list. We need to rethink assessment of learning to assure that the needed DELTAs are acquired by all our children.

Accountability is not the place to start. Our country’s earlier efforts at improving schooling assumed that teachers and school leaders knew what to do and simply needed to be held accountable for doing it. So, we focused on standardized tests of basic reading and mathematics. Principals turned assessment into a stressful activity for teachers, and teachers passed that stress on to students. Colleagues of mine personally witnessed a principal at a school assembly telling students to boo their teachers because test scores had not gone up from the previous year. Curriculum narrowed to focus solely on drilling the specific kinds of items that appear on standardized tests, with arts, civics, social studies, physical education, and other “distractions” from the test score drive being eliminated. Not only did this fail to boost test scores, but it also produced children who are overweight, underactive, anxious, depressed, lacking any understanding of how government and the citizenry work, and generally averse to continued learning. We must do better.

Facing not only the continued need to boost basic literacy but a much broader set of goals for schooling, we need to work differently. Years ago, someone (I forget who) observed that if NASA had begun the moon mission the way we approach schooling, then by 1970 it might have gotten only as far as building a large telescope, so that if someone did reach the moon, we would know precisely when and where they landed; no one would have created the systematic enterprise that actually brought two astronauts to the moon. NASA did do a lot of assessment, but it came after they knew what needed to be assessed and how to produce what needed to be assessed. The initial efforts at developing all the design elements for getting to the moon focused on how to build a system that worked. That system included over 10,000 contractors, each with specific tasks to complete. Each first focused on developing its components and then on proving that those components worked perfectly. We need to get the system working in education before focusing on high-stakes testing.

Steering Schooling

Assessment specialists talk about validity and reliability as the two essentials of tests. A test must measure what it purports to measure (validity), and it must produce the same score for the same level of performance every time (reliability). Testing is hard, and often validity is sacrificed in the service of perfect reliability. Too much of educational assessment has aimed at measuring the overall effects of schools and programs. Such measurements are important, but when we are unsure of exactly how to adapt to each student’s needs and when teachers may lack expertise in teaching all the new skills and may even lack some of those skills themselves, a different kind of assessment approach is needed.

I have previously suggested an analogy that might help us understand what is needed.[8] Consider the task of steering a car down the freeway. We really understand that task, so we can specify clear criteria for successful steering. One should not hit anything, and the car should stay within the lane it occupies. Most likely, we could develop a scoring system that counted the number of lane marker crossings per hour, perhaps also attending to which lane crossings are most hazardous. Such measures would be highly reliable; the same driver probably steers well or poorly on every occasion when they know their performance is being assessed. However, such a measure would do little to help the driver steer better. The results come way too late to guide performance.

As it turns out, the car I drive provides more useful assessment to help me stay on track. It beeps when I cross the lane markings. Thus, I get immediate feedback that can shape my performance. Now, it is not perfectly reliable, but it arrives fast enough to be useful, and it is readily correlated with my self-assessments as I steer. My continued sampling of my car’s location is even less reliable, as it occurs with minimal attention as I simultaneously answer questions from other car occupants, watch for the exit I need to take, and consider what I want for lunch. But, those sketchy observations keep me on the road. And, for a new driver, the correlation between such glances and the car’s beeping can help train the glances and responding wheel movements to be more effective.

This analogy illustrates three kinds of assessment. The first, which I suspect is built into some of the apps that insurance companies offer as part of good-driver discount programs, tells us whether a person generally is steering adequately. The second, far less reliable but more useful, provides continual information on whether the car is being steered well. The third is the actual pattern of observations that the driver makes to steer. We call the first type summative assessment and the second type formative assessment. I suggest that we call the third type, which might be done partly by students and partly by teachers, steering assessment.

In education, we are very good at the first kind of assessment, at least for school subjects. We can detect that a child has learned something in the traditional curriculum, though it often takes weeks or months for test results to get back to the classroom. The best teachers are very good at the third form of assessment, at least for traditional schooling goals. They can listen to a child reading aloud and decide what coaching or practice the child needs next. We are beginning to do better on the second form of assessment, though we do it only in some schools and often not in schools serving children from less privileged families. For example, teacher coaches will watch a class session and provide immediate advice, perhaps linked to a video of the class, on how a teacher might be more effective. We do not yet have a well-established technology of steering assessment, although some of the best groups coaching teachers in the classroom are beginning to codify their practices. Much work is left to be done, but it is quite doable.

Assessment of “Soft Skills”

A quick comparison of the skills listed in Figure 1 to what standardized tests measure suggests that we do not yet have a clear technology of soft-skill assessment. However, examples of approaches to formative and steering assessment are starting to appear. These involve the development of scoring rubrics for various student activities. Below, I provide a few examples of such rubrics and then briefly discuss how data mining approaches might be used to develop summative measures based on them. I focus on one DELTA from the McKinsey list, Teamwork Effectiveness: Collaboration. Of the many other items on the list, some may be easier to assess and some harder.

Rubrics

As it turns out, many organizations have developed rubrics for assessing collaboration in team projects. Some may be too indirect to be trusted, such as those that ask people to describe effective teamwork but do not assess the actual occurrence of teamwork in a project situation.[9] On the other hand, several universities have developed rubrics that seem credible and likely to lead to useful steering of the course of group projects to develop teamwork skills. Brigham Young University, for example, has developed such a rubric.[10] In their rubric, under the heading of teamwork communication, they give the following description of the highest level of performance:

Team members communicate openly and treat one another with respect. All members listen to ideas. The work of each person is acknowledged. Members feel free to seek assistance and information, share resources and insights, provide advice, or ask questions of each other.

At the other extreme, here is their description of the lowest level:

Communication is limited among group members (information is not shared with one another and/or important topics are not discussed among the group because a climate of open communication has not been established).

And, here’s their intermediate level description:

There is a general atmosphere of respect for team members, but some members may not be heard as much as others. Some members may not feel free to turn to others for help. Members may avoid discussing some topics for fear of disrupting the group’s work and/or hurting someone’s feelings.

If you accept the understanding of the BYU group that developed the rubric, these kinds of descriptions provide plenty of useful steering assessment. Moreover, one can develop a library of strategies to recommend to instructors for intervening to move teams, or their individual members, to higher levels of performance.
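
To make the idea of steering with such a rubric concrete, here is a minimal sketch in Python of how an observer's rating on a three-level communication rubric could be turned into immediate coaching feedback. The level summaries and suggested interventions are hypothetical placeholders of my own, not BYU's materials.

```python
# Minimal sketch of turning a three-level teamwork-communication rubric
# into a steering tool. The level summaries and coaching suggestions are
# illustrative placeholders, not BYU's actual materials.

RUBRIC_LEVELS = {
    1: "Communication is limited; information is not shared openly.",
    2: "General respect, but some members are not heard or avoid topics.",
    3: "Open, respectful communication; all members listen and share.",
}

COACHING_SUGGESTIONS = {
    1: "Establish norms for sharing information; try a structured check-in.",
    2: "Rotate a facilitator role so quieter members are heard.",
    3: "Maintain current practices; have members self-assess periodically.",
}

def steer(observed_level: int) -> str:
    """Return immediate feedback for the level an observer just recorded."""
    description = RUBRIC_LEVELS[observed_level]
    suggestion = COACHING_SUGGESTIONS[observed_level]
    return f"Observed: {description}\nTry next: {suggestion}"

if __name__ == "__main__":
    print(steer(2))
```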

Of course, BYU is not the only group to develop this kind of rubric.[11] While individual rubrics have been evaluated for reliability and sometimes for validity, each is a little different, especially in how the subskills of teamwork are broken down. One could undertake a project to produce the very best single standardized test of teamwork collaboration. This would be a bit like the telescope metaphor mentioned above. It could be used to hold colleges or high schools accountable, but it might not lead to improved performance, and its validity might not readily be established if the goal is to prepare students for participation in valued life activities.

For purposes of steering and formative assessment, it makes sense to encourage teachers to search through some existing rubrics, pick one that seems helpful, and start using it to focus and personalize the instruction of their students. For this to work well, though, there needs to be some way to establish the validity of different rubrics, along with tools for training teachers to use them well.

Establishing validity. There are two ways to establish the validity of rubrics for steering the course of teamwork training: predictive validation and expert evaluation. Predictive validation is costly. It involves collecting not only evaluations of performances in teams but also, for the same participants, data on some future criterion performance that is a compelling example of teamwork. Then, a rubric is valid if the scores it produces correlate with scores on the later performance. This approach is used regularly to validate standardized tests. For example, the SAT is considered valid because it predicts freshman college grades.

For the DELTA skills identified by McKinsey, the ideal criterion measures would be performance on teams in successful enterprises like businesses, nonprofits, or government roles. However, time becomes a major factor in such validation studies. A rubric for middle school teamwork could in principle be validated against subsequent performance in an adult job years later, but the cost of tracking people long enough to get the data would be high and the results not very timely. One also could take current jobholders, measure their teamwork skills in their job through various means including team goal achievement, and then apply the rubric to exercises given to the jobholders at a training retreat. Costs in time and money would be much lower, but the rubric would have been validated for judging adults, not middle school students, and that could create problems.
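
Whichever criterion measure is chosen, the statistical core of predictive validation is straightforward: correlate rubric scores with the later criterion scores. A minimal sketch, using invented numbers rather than real data:

```python
# Sketch of predictive validation with invented data: rubric scores from a
# team exercise are correlated with a later criterion measure of team
# performance. A strong positive correlation supports the rubric's validity.
from scipy.stats import pearsonr

rubric_scores    = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 1.5, 4.5]  # exercise ratings
criterion_scores = [55,  70,  40,  85,  60,  68,  45,  90]   # later team-performance ratings

r, p_value = pearsonr(rubric_scores, criterion_scores)
print(f"validity coefficient r = {r:.2f} (p = {p_value:.3f})")
```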

The alternative approach is expert validation: letting experts in team building judge performances and using their judgments to develop a rubric or a set of scoring rules. This turns out to be an efficient way to validate some scoring schemes, and my colleagues and I have shown that the approach, which we call expert policy capture, is a helpful tool in designing training in such skills as complex problem solving.[12] The approach works as follows. First, experts are asked to suggest several examples of tasks that require the skill in question. Then, students are given those tasks to perform, and their performance is recorded. Experts are asked to rank-order the student performances based on these recordings and then to justify why one performance ranked higher than another. These justifications are used to develop scoring rules that assign a point score to a performance if it exhibits a particular characteristic. The overall score for a performance is the sum of the points assigned by the rules derived from the expert justifications for their rankings.

This approach can be refined by applying the scoring rules to a new set of performances and then asking experts to critique the rankings of those performances based upon the draft rules. Discrepancies are then overcome by making small changes in the point values of specific performance characteristics or adding new rules to make the derived scores better match the rank ordering by experts of the new sample. We found this approach to be highly effective in producing reliable scoring of performances, and often the rules can be used to develop rubrics for performance evaluation.
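
To make the mechanics of expert policy capture concrete, here is a minimal sketch with invented rule names, point values, and performances. It scores recorded performances by summing the points from whichever justification-derived rules apply, then checks the resulting ordering against the experts' ranking:

```python
# Minimal sketch of expert-policy-capture scoring with invented rules.
# Each rule awards points if a recorded performance exhibits a characteristic
# that experts cited when justifying their rankings; a performance's score is
# the sum of the points from the rules that apply.
from scipy.stats import spearmanr

# (characteristic experts mentioned, points awarded) -- hypothetical values
RULES = [
    ("stated_the_problem_before_acting", 3),
    ("invited_input_from_every_member", 2),
    ("checked_intermediate_results", 2),
    ("acknowledged_others_contributions", 1),
]

def score(characteristics: set[str]) -> int:
    """Sum the points for every rule whose characteristic was observed."""
    return sum(points for name, points in RULES if name in characteristics)

# Recorded performances, coded for which characteristics each exhibited.
performances = {
    "student_A": {"stated_the_problem_before_acting", "checked_intermediate_results"},
    "student_B": {"invited_input_from_every_member"},
    "student_C": {"stated_the_problem_before_acting", "invited_input_from_every_member",
                  "acknowledged_others_contributions"},
}

rule_scores = {name: score(chars) for name, chars in performances.items()}
print(rule_scores)  # e.g. {'student_A': 5, 'student_B': 2, 'student_C': 6}

# Check the derived scores against the experts' rank ordering (1 = best);
# large disagreement signals that point values need adjusting or rules adding.
expert_rank = {"student_C": 1, "student_A": 2, "student_B": 3}
names = list(performances)
rho, _ = spearmanr([rule_scores[n] for n in names], [-expert_rank[n] for n in names])
print(f"agreement with expert ranking (Spearman rho) = {rho:.2f}")
```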

However, for some new problem situations, expert evaluation may not be perfect. When I was building intelligent training systems to coach technicians on how to fix chip-making machines, my colleague Marty Nahemow, the brilliant inventor of the screw-in fluorescent lightbulb, looked hard at the performance scoring rules we derived from expert policy capture and pointed out several ways in which the experts themselves were not performing optimally. We revised the rubrics we had developed to reflect better performance than the experts were demonstrating and calling for.[13] In a world of rapid change, expertise may not be based upon thousands of hours of experience, since the situations experts must handle may not have existed long enough to afford those thousands of hours.[14] The plain truth is that for some situations, we may not yet know what ideal performance is, though we likely can get close enough to help people do better than they currently can.

Using Rubric Validity Assessment to Drive Instructional Strategy

In the approach just sketched, since each rule assigns value to an aspect of performance, the rules can be used to coach performance. The simplest approach would be to identify which scoring rule, had it applied to the student’s performance, would have most boosted the student’s score and then use that rule’s antecedent condition to focus coaching. That works for the rules related to the actual problem solution, but what about the soft skills, the DELTAs, mentioned above? To figure out which soft skills to coach, we need to pay attention to the probabilistic relationships among different aspects of collaborative activity and to stretch the scoring rules to include scoring of being a “team player.”

Then, we can use the combined scoring rules for actual problem solving and for teamwork to develop a Bayesian network whose nodes are scorable performance aspects and whose links carry the probability of getting points for one characteristic given that one has received points for another. Such a network might be used to identify which scorable characteristics a student is likely to acquire quickly given where the student’s performance stands presently. For it to work, two things are needed. One is the Bayesian network for the target population of trainees. The other, for each individual trainee, is the current probability of each skill being possessed by that person.

Having these two data sources allows coaches to come up with good strategies for deciding which aspects of skill to focus on in providing immediate coaching. Consider any one specific performance situation. In that situation, there may be several aspects of performance that could be improved, and it might be best to pick one of those aspects for coaching. So, which one should be picked? There are a few strategies that seem worth pursuing. One is to pick a skill element that was not demonstrated in the current performance but was predicted by the Bayesian network to have a reasonably high probability of being present given what is known about the trainee’s overall skill set. This likely will be a skill that is partly developed but not yet strong enough to be exhibited reliably under high cognitive load.[15] Coaching might remind the trainee of that skill and hence strengthen it.

Another possibility is that all but one of the needed skills were exhibited in the performance. This might be a good setting for coaching the development of the one missing skill, especially if the other needed skills are already fairly automated and overlearned. While there remains a need for more testing of which of these strategies to use in which circumstances, what we know about learning suggests that both are worth considering and likely will pay off in improved coaching efficacy.
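
A full treatment would require a proper Bayesian inference engine; the sketch below deliberately simplifies by averaging pairwise conditional probabilities, and all of the skill names and probabilities are invented. It illustrates both strategies: coach the one missing skill when only one is missing, and otherwise coach the absent skill the network suggests is closest to emerging.

```python
# Sketch of network-guided coaching with invented skills and probabilities.
# links[(a, b)] approximates P(points on skill b | points on skill a).
# This is a deliberate simplification of full Bayesian inference: for each
# skill not demonstrated in the current performance, we average the
# conditional probabilities from the skills the trainee did demonstrate.

SKILLS = ["shares_information", "invites_quiet_members",
          "summarizes_decisions", "checks_for_agreement"]

links = {  # hypothetical conditional probabilities
    ("shares_information", "invites_quiet_members"): 0.55,
    ("shares_information", "summarizes_decisions"): 0.40,
    ("shares_information", "checks_for_agreement"): 0.35,
    ("invites_quiet_members", "summarizes_decisions"): 0.60,
    ("invites_quiet_members", "checks_for_agreement"): 0.50,
    ("summarizes_decisions", "checks_for_agreement"): 0.70,
}

def predicted_probability(target: str, demonstrated: set[str]) -> float:
    """Rough estimate of P(target | demonstrated): mean of the pairwise links."""
    probs = [links.get((d, target), links.get((target, d), 0.0)) for d in demonstrated]
    return sum(probs) / len(probs) if probs else 0.0

def pick_focus(demonstrated: set[str], needed: set[str]) -> str:
    """Choose one skill to coach for the current performance."""
    missing = needed - demonstrated
    if len(missing) == 1:          # strategy two: only one needed skill is missing
        return missing.pop()
    # strategy one: the absent skill the network says is closest to emerging
    return max(missing, key=lambda s: predicted_probability(s, demonstrated))

demonstrated = {"shares_information", "invites_quiet_members"}
print(pick_focus(demonstrated, set(SKILLS)))  # -> "summarizes_decisions"
```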

To summarize, supporting the learning of many of the DELTAs might be done by developing a series of training tasks that afford opportunities for building up the various skills, using expert policy capture to develop a series of rules for evaluating performance of those tasks, building a Bayesian network of the probabilities of having each DELTA given possession of other DELTAs, and then developing and validating strategies for deciding how to coach performances of the various learning tasks given scores generated for the various DELTA components by the scoring rules.

Data Mining

This brings us to the question of how to decide which skill elements should be part of this diagnostic process. As noted above, for emergent tasks, even experts may not be fully aware of the range of skills needed or the optimal way to perform the task, though they will know more than non-experts. Given the skills list in Figure 1, there is a lot of work left to do. Some of the skills in Figure 1 may not have been a focus of schooling at all, and some teachers may not have expertise in them. Moreover, while there is soft evidence for each of these skills, they may not all have been specified or measured very well, and we may not know all the antecedent subskills on which they depend. Overall, much remains to be done if the kind of Bayesian network just discussed is to be built for the entire skills list. Beyond that, we may discover that some specific variants of these skills are especially important. But that requires studying many work performances and figuring out what else happens besides what the scoring rules derived from expert policy capture predict. The scale of the needed exploration is large enough that data mining might be a sensible approach.

A serious data mining effort, though, requires a substantial database of content to mine. This does not really exist today in full form, but it might be possible to amass such a database quickly. A good starting point might be some of the online team-based games. It might be possible to gather transcripts of teams playing such games and then use rubrics to score for evidence of the DELTAs and also to have game experts score individual moves as well as final outcomes. This might be a good basis for developing many of the probabilistic relations needed for a Bayesian network of DELTA competences. Once preliminary estimates are available, coaching based on the preliminary network can be productive while simultaneously allowing further updating of the network in situations more relevant to the work of one or another enterprise.
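
As a toy illustration of the kind of mining involved, the sketch below estimates the conditional probabilities such a network would need by counting co-occurrences of rubric-detected indicators across scored transcripts. The indicator names and the handful of "transcripts" are fabricated.

```python
# Toy sketch of mining rubric-scored game transcripts for the conditional
# probabilities a DELTA network would need. Each record lists the indicators
# the rubric detected in one scored transcript; the estimate of
# P(b | a) is count(a and b) / count(a). Names and data are invented.
from itertools import permutations
from collections import Counter

scored_transcripts = [
    {"shares_information", "invites_quiet_members"},
    {"shares_information", "summarizes_decisions", "checks_for_agreement"},
    {"shares_information"},
    {"invites_quiet_members", "summarizes_decisions"},
    {"shares_information", "invites_quiet_members", "checks_for_agreement"},
]

single = Counter()
pair = Counter()
for indicators in scored_transcripts:
    single.update(indicators)
    pair.update(permutations(indicators, 2))   # ordered pairs (a, b)

conditional = {(a, b): pair[(a, b)] / single[a] for (a, b) in pair}

for (a, b), p in sorted(conditional.items()):
    print(f"P({b} | {a}) = {p:.2f}")
```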

I do not expect that building such a predictive network will be easy or even that it will go well the first time it is tried. At the same time, all we have now is the intuitions of leadership coaches and related experts. Those intuitions are likely valid in part, but they are not enough to drive a major effort to begin developing the DELTAs in school and after-school activities that serve all students. If we really want to prepare all our children for life in a world where the DELTAs are critical determiners of readiness for a pleasant and socially productive life, we need to start somewhere.

Currently, we are not trying the best available ideas. We expose wealthy children to experiences that experts think are productive of the DELTAs, fail to see if those experiences work, fail to provide them to less privileged families, and then select people for good roles in our society based on whether they had those experiences. We can be more systematic and fairer, and we can do better. However, the status quo does afford some opportunities.

The world of education has favored randomized controlled trials (RCTs) to test instructional ideas. While not inherently a bad approach, relying solely on RCTs has a serious limitation: it is virtually never possible to design an RCT that tells us not only whether a learning opportunity works better than some alternative in general but also who it works for and who it does not. Rather than being the culmination of an instructional improvement effort, perhaps an RCT should be seen as a strong validation that a particular approach to learning is worth the investment of further exploration. Then, the approach can be tried with differing populations of learners, differing learning goals, and different school contexts. This way, we can learn more about how well it works, for whom it works, and what tuning is needed to make it work for varying populations of learners. This amounts to a strategy that uses both randomized controlled trials and natural experiments to validate and better understand how to make school afford all students the opportunity to learn the DELTAs.

Now that a Nobel prize has been awarded for work on natural experiments,[16] the ideas from economics should be more readily stretchable to education and training. Because of the stratification of our education system tied to wealth and ethnicity, the natural experiments on teaching the DELTAs are already under way. More privileged children participate in more team sports, in more team activities like high school musicals, and in more online multiplayer games that require payment to participate. It is time to systematically study some of these group activities, to do some data mining on the relationship between experiences early in them and performance later, to use expert appraisals of a subset of student performances in such situations to train AI systems to automatically score larger datasets, and, more broadly, to search for ways to extend what works for the privileged to work for all in our society.
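
One plausible way to implement the step of using expert appraisals to train an automatic scorer is sketched below, using scikit-learn and invented snippets of team talk; a real effort would need far more labeled data and careful validation against the experts' judgments.

```python
# Sketch of training an automatic scorer from a small expert-labeled subset of
# transcripts, then applying it to unscored ones. Transcripts and labels are
# invented; a real effort would need far more data and careful validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_transcripts = [
    "let's hear everyone's idea before we decide",
    "just do what I say, we don't have time",
    "good point, can you explain how that would work?",
    "stop talking, your idea won't work",
]
expert_labels = [1, 0, 1, 0]   # 1 = strong collaboration evidence, 0 = weak

scorer = make_pipeline(TfidfVectorizer(), LogisticRegression())
scorer.fit(labeled_transcripts, expert_labels)

unscored = ["what does everyone think we should try next?",
            "quiet, I'll handle it myself"]
for text, prob in zip(unscored, scorer.predict_proba(unscored)[:, 1]):
    print(f"{prob:.2f}  {text}")
```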

Some of this work might be done through government-supported research projects, but the available resources could be extended more quickly if providers of Internet-based collaborative activities did a little machine learning and data mining on their own. It would be in their interest to find ways to make more people able to collaborate in both real-world and virtual-world teamwork.

Notes

[1] Lesgold, A. (2019). Learning for the Age of Artificial Intelligence: Eight Education Competences. New York: Routledge.

[2] Dondi, M., Klier, J., Panier, F., & Schubert, J. (2021). Defining the skills citizens will need in the future world of work. https://www.mckinsey.com/industries/public-and-social-sector/our-insights/defining-the-skills-citizens-will-need-in-the-future-world-of-work

[3] See, for example, Tamayo, J. R., Jr. (2010). Assessment 2.0: “Next-generation” comprehensive assessment systems. An analysis of proposals by the Partnership for the Assessment of Readiness for College and Careers and SMARTER Balanced Assessment Consortium. Aspen Institute.

[4] But see Peppler, K., Keune, A., Xia, F., & Chang, S. (2017). Survey of assessment in makerspaces. Open Portfolio Project. Retrieved from https://makered.org/wp-content/uploads/2018/02/MakerEdOPP_RB17_Survey-of-Assessments-in-Makerspaces.pdf

[5] Exhibit from “Defining the skills citizens will need in the future world of work”, June 2021, McKinsey & Company, www.mckinsey.com. Copyright © 2021 McKinsey & Company. All rights reserved. Reprinted by permission.

[6] National Governors Association. (2010). Common core state standards. Washington, DC. Accessed at http://www.corestandards.org/read-the-standards/

[7] See https://jausa.ja.org/ for a description of Junior Achievement.

[8] See Lesgold, A. (2008). Assessment to steer the course of learning: Dither in testing. In E. L. Baker, J. Dickieson, W. Wulfeck, & H. O’Neil (Eds.), Assessment of problem solving using simulations. New York: Erlbaum. Also, Lesgold, A. (1988). The integration of instruction and assessment in business/military settings. Proceedings of the 1987 ETS Invitational Conference. Princeton, NJ: Educational Testing Service.

[9] See, for example, https://www.answersbuddy.com/interprofessional-collaboration-and-teamwork-assessment/

[10] https://ctl.byu.edu/sites/default/files/Teamwork_Rubric.pdf

[11] See, for example, https://www.aacu.org/sites/default/files/files/VALUE/Teamwork.pdf and Ohland, M. W., Loughry, M. L., Woehr, D. J., Bullard, L. G., Felder, R. M., Finelli, C. J., … & Schmucker, D. G. (2012). The comprehensive assessment of team member effectiveness: Development of a behaviorally anchored rating scale for self- and peer evaluation. Academy of Management Learning & Education, 11(4), 609–630. Accessed at https://provost.uni.edu/sites/default/files/documents/the_comprehensive_assessment_of_team_member_effectiveness.pdf

[12] See Pokorny, R. A., Lesgold, A. M., Haynes, J. A., & Holt, L. S. (in press). Expert policy capture in simulation-based training and games. In H. F. O’Neil, E. L. Baker, R. S. Perez, & S. E. Watson (Eds.), Using cognitive and affective metrics in education-based simulations and games (Vol. 2). New York, NY: Routledge/Taylor and Francis.

[13] See Lesgold, A., & Nahemow, M. (2001). Tools to assist learning by doing: Achieving and assessing efficient technology for learning. In S. M. Carver & D. Klahr (Eds.), Cognition and instruction: Twenty-five years of progress. Mahwah, NJ: Lawrence Erlbaum Associates.

[14] Most jobs involve about 2,000 hours a year of work, so the roughly 10,000 hours often associated with deep expertise take about five years to accumulate. Even if all a person did was repair broken equipment, many devices need repair before they have been in existence for five years, so there will be no traditional experts for quite a while after a device is invented. Of course, experience with prior devices can be helpful, as can the kind of combination of deep understanding of physics and experience with a lot of machines that Marty Nahemow had.

[15] For an exposition of cognitive load theory, see Paas, F., Renkl, A., & Sweller, J. (2003). Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 38(1), 1–4.

[16] The Committee for the Prize in Economic Sciences in Memory of Alfred Nobel. (2021, October 11). Answering Causal Questions Using Observational Data. Downloaded from https://www.nobelprize.org/uploads/2021/10/advanced-economicsciencesprize2021.pdf.


Published in About Work

Policy, data, and useful ways to support workers to be effective agents of their own careers and to prepare for or recover from job displacements amidst the volatility of the fourth industrial age and its persistent economic and social inequities and disruptions.


Written by Alan Lesgold

Emeritus professor of education, psychology, and intelligent systems and former education dean at the University of Pittsburgh.
