If you’ve read the American Statistical Association’s position on the dangers of evaluating teacher performance based on the “Value-Added Model,” you’re probably wondering how they arrived at this very sobering conclusion. As Albert Einstein was alleged to have stated, “Not everything that counts is countable, and not everything that is countable counts.” In this case, AMSTAT took that advice to heart and so strongly inveighed against VAM that they essentially labeled it a form of statistical malpractice.
Associate Professor C. Kirabo Jackson, the most understated hero to decimate VAM.
In this post, I’m going to examine one of the studies that no doubt had a profound impact on the members of AMSTAT and led them to this radical (but self-evident) conclusion. In 2012, the researcher C. Kirabo Jackson at Northwestern University published a “working paper” for the National Bureau of Economic Research, a private, nonprofit, nonpartisan research organization dedicated to promoting a greater understanding of how the economy works (I’m quoting here from their website). The paper, entitled “Non-Cognitive Ability, Test Scores, and Teacher Quality: Evidence from 9th Grade Teachers in North Carolina,” questions the legitimacy of evaluating a teacher based on his or her students’ test scores. Actually, it is less about “questioning” and more about “decimating” and “annihilating” the practice of VAM.
I downloaded the paper and have been reading it for the past few days. Jackson has clearly done his homework, and this paper is extremely dense in statistical analysis rooted in data collected by the National Education Longitudinal Study of 1988 (NELS:88), which began with 8th graders who were surveyed on a range of educational issues, as described below:
On the questionnaire, students reported on a range of topics including: school, work, and home experiences; educational resources and support; the role in education of their parents and peers; neighborhood characteristics; educational and occupational aspirations; and other student perceptions. Additional topics included self-reports on smoking, alcohol and drug use and extracurricular activities. For the three in-school waves of data collection (when most were eighth-graders, sophomores, or seniors), achievement tests in reading, social studies, mathematics and science were administered in addition to the student questionnaire.
To further enrich the data, students’ teachers, parents, and school administrators were also surveyed. Coursework and grades from students’ high school and postsecondary transcripts are also available in the restricted use dataset – although some composite variables have been made available in the public use file.
The survey was followed up in 1990, 1992, 1994 and 2000, which means that it began when students were just about to start high school, caught up with them again when most were in 10th and 12th grades, and then followed them through post-high school, college and postgraduate life. It is one of the most statistically valid sample sets of educational outcomes available.
What should be noted is that Jackson is not an educational researcher, per se. Jackson was trained in economics at Harvard and Yale and is an Associate Professor of Human Development and Social Policy. His interest is in optimizing measurement systems, not taking positions on either side of the standardized testing debate. Although this paper should reek of indignation and anger, it makes its case in an almost understated tone and is filled with careful phrasing like “more than half of teachers who would improve long run outcomes may not be identified using test scores alone,” and “one might worry that test-based accountability may induce teachers to divert effort away from improving students’ non-cognitive skills in order to improve test scores.”
But let’s get to the meat of the matter, because this paper is 42 pages long and incorporates mind-boggling statistical techniques that account for every variable one might want to filter out to answer the question: are test scores enough to judge the effectiveness of a teacher? Jackson’s unequivocal conclusion: no, not even remotely.
The first thing Jackson does is review a model that divides the results of education into two dimensions: the cognitive effects, which can be measured by test results, and the non-cognitive effects, which are understood to be socio-behavioral outcomes. Combined, the two determine adult outcomes. To paraphrase the old Charlie the Tuna commercial, it’s more than whether we want adults that test good – we also want them to be good adults. Clearly, Jackson is aiming a little higher than those who would believe that test scores are the end result of “good teaching.” He’s focusing on the “non-cognitive” effects a teacher can have on students, which include things like diminishing their rates of truancy and suspensions, improving their grades (which are different from test scores) and helping increase the likelihood that they will attend college.
Which poses the less-than-obvious question: if teachers have an effect on both cognitive and non-cognitive outcomes, are those two effects correlated or independent? That is, if a teacher is effective in raising test scores, will that lead to less truancy, fewer suspensions, better grades and less grade retention? Even more interesting is the idea that teachers could be more effective on one scale while being low on the other: is it possible for a teacher to be very effective at improving a student’s non-cognitive functioning while not having an effect on his or her test scores?
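For anyone who wants to see what “correlated or independent” looks like in numbers, here is a minimal sketch using made-up data. It is not Jackson’s data or his method (his estimation is far more elaborate); the teacher effects, the `0.8`/`0.6` mixing weights and the sample size are all assumptions I chose for illustration. It simply computes a Pearson correlation for two imaginary worlds: one where a teacher’s non-cognitive effect has nothing to do with her test-score effect, and one where the two travel together.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)   # fixed seed so the toy example is repeatable
n_teachers = 2000        # hypothetical number of teachers

# Hypothetical per-teacher effect on test scores (in standard deviations).
test_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]

# World A: non-cognitive effect drawn independently of the test-score effect.
independent_noncog = [rng.gauss(0, 1) for _ in range(n_teachers)]

# World B: non-cognitive effect mostly tracks the test-score effect.
linked_noncog = [0.8 * t + 0.6 * rng.gauss(0, 1) for t in test_effect]

r_independent = pearson(test_effect, independent_noncog)
r_linked = pearson(test_effect, linked_noncog)
print("World A (independent):", round(r_independent, 2))
print("World B (linked):     ", round(r_linked, 2))
```

In World B, knowing a teacher’s test-score effect tells you a lot about her non-cognitive effect; in World A it tells you essentially nothing. The question is which world real teachers live in.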
By page 4, Jackson’s paper starts to draw blood: using the results of the NELS 1988, Jackson concludes that a standard deviation increase in non-cognitive ability in 8th grade is associated with fewer arrests and suspensions, more college-going and better wages than the same standard deviation improvement in test scores. It’s almost as if Jackson is telling us, “hey, 8th grade teachers: want to improve your students’ future lives? Spend less time on test prep and more time helping them show up at school, stay out of trouble and improve their actual grades.”
This alone would be enough of a takeaway, but this incredibly dense paper continues to hammer away at any thought that test scores are meaningful in any way: in the same paragraph, Jackson states that a teacher’s effect on college-going and wages may be as much as three times larger than predicted based on test scores alone. HFS! Oh, and just to make things more interesting, it is followed by this statement: “As such, more than half of teachers who would improve long run outcomes may not be identified using test scores alone.”
To summarize, we’re only in the middle of page 4 of this paper, and we’ve already learned the following:
a) Teachers have an effect on both the cognitive and the non-cognitive skills of their students. The first leads to higher test scores, the second leads to more college going, fewer arrests and better wages.
b) In 8th grade, non-cognitive achievement is a better predictor of college going and higher wages, as well as fewer arrests and suspensions, than test scores.
c) A teacher’s effect on college-going and wages may be as much as three times larger than test scores alone would predict, so more than half of the teachers who would improve long run outcomes may go unidentified.
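To get a feel for how test scores alone could miss so many good teachers, here is a toy simulation. Everything in it is an assumption of mine, not something taken from the paper: the two teacher effects are drawn independently, the long run payoff is a simple weighted sum, and `noncog_weight`, the 25% cutoff and the teacher count are arbitrary knobs. It counts what share of the teachers who are best for long run outcomes would not show up among the best by test scores.

```python
import random


def share_missed(n_teachers, noncog_weight, top_fraction=0.25, seed=1):
    """Share of top long-run teachers who are NOT in the top group by test scores.

    Hypothetical model: long-run effect = test effect + noncog_weight * non-cognitive
    effect, with the two effects drawn independently from a standard normal.
    """
    rng = random.Random(seed)
    test_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    noncog_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    long_run = [t + noncog_weight * n for t, n in zip(test_effect, noncog_effect)]

    k = int(n_teachers * top_fraction)  # size of the "top" group
    ids = range(n_teachers)
    top_by_test = set(sorted(ids, key=lambda i: test_effect[i], reverse=True)[:k])
    top_by_long_run = set(sorted(ids, key=lambda i: long_run[i], reverse=True)[:k])

    # Teachers who really are in the top group for long-run outcomes
    # but would be overlooked by a test-score ranking.
    return len(top_by_long_run - top_by_test) / k


for weight in (0.0, 0.5, 2.0):
    print("noncog_weight =", weight, "-> share missed:", round(share_missed(5000, weight), 2))
```

When the non-cognitive weight is zero, a test-score ranking misses nobody, because test scores are the whole story. As the weight grows, the two rankings drift apart and the share of overlooked teachers climbs. That is the mechanism behind Jackson’s statement, even though his actual numbers come from real students rather than a random number generator.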
But wait, there’s more!
Okay, I’m only just below the middle of page 4, and already I’ve read three conclusions that essentially kill off any legitimacy to judging a teacher’s effectiveness based on test scores, and the good stuff hasn’t even gotten started!
What Jackson is up to in his paper is something bigger, way bigger. It would be possible to argue at this point that cognitive and non-cognitive skills, while both responsible in some part for positive adult outcomes, are still correlated; that is, if you improve the test scores, the other non-cognitive stuff will come along as a bonus. This is where Jackson goes for the jugular, and, as is typical of research papers, he essentially “buries the lead.”
“This paper presents the first evidence that teachers have meaningful effects on non-cognitive outcomes that are strongly associated with adult outcomes and are not correlated with test scores.” (Emphasis mine, italics his, by the way.)
I have to stop with this blog post here (but I promise to do more deciphering of this paper in the next few days.) My only question at this point would be: why hasn’t anybody explained this to Arne Duncan, perhaps through the use of hand puppets and a mallet?