Here's a little pop quiz.
Multiple-choice tests are useful because:
A: They're cheap to score.
B: They can be scored quickly.
C: They score without human bias.
D: All of the above.
It would take a computer about a nanosecond to mark "D" as the correct answer. That's easy.
But now, machines are also grading students' essays. Computers are scoring long-form answers on anything from the fall of the Roman Empire to the pros and cons of government regulations.
Developers of so-called "robo-graders" say they understand why many students and teachers would be skeptical of the idea. But they insist that if computers can already handle jobs as complicated and fraught as driving cars, detecting cancer, and carrying on conversations, they can certainly handle grading students' essays.
"I've been working on this now for about 25 years, and I feel that ... the time is right and it's really starting to be used now," says Peter Foltz, a research professor at the University of Colorado, Boulder. He's also vice president for research for Pearson, the company whose automated scoring program graded some 34 million student essays on state and national high-stakes tests last year. "There will always be people who don't trust it ... but we're seeing a lot more breakthroughs in areas like content understanding, and AI is now able to do things which they couldn't do really well before."
Foltz says computers "learn" what's considered good writing by analyzing essays graded by humans. Then, the automated programs score essays themselves by scanning for those same features.
"We have artificial intelligence techniques which can judge anywhere from 50 to 100 features," Foltz says. That includes not only basics like spelling and grammar, but also whether a student is on topic, the coherence or the flow of an argument, and the complexity of word choice and sentence structure. "We've done a number of studies to show that the scoring can be highly accurate," he says.
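The approach Foltz describes can be sketched in miniature: extract surface features from an essay and combine them into a score. This is an illustrative toy only; the feature names and weights below are invented, whereas a production system like Pearson's learns its weights from thousands of human-graded essays.

```python
# Toy "feature-based" essay scorer, sketched after Foltz's description.
# Features and weights are invented for illustration; real systems
# fit their weights by regression against human-assigned scores.

def extract_features(essay: str) -> dict:
    words = essay.split()
    # Crude sentence split on terminal punctuation.
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return {
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "vocab_diversity": len({w.lower() for w in words}) / max(len(words), 1),
    }

def score(essay: str) -> int:
    f = extract_features(essay)
    # Hypothetical hand-picked weights standing in for learned ones.
    raw = (0.01 * f["word_count"]
           + 0.3 * f["avg_word_len"]
           + 0.05 * f["avg_sentence_len"]
           + 1.0 * f["vocab_diversity"])
    # Clamp to the 1-4 scale used in the demo below.
    return max(1, min(4, round(raw)))
```

Even this crude sketch hints at the weakness critics point to later in the piece: every feature is a surface count, and nothing in it knows what the essay means.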
To demonstrate, he takes a not-so-stellar sample essay, rife with spelling mistakes and sentence fragments, and runs it by the robo-grader, which instantly spits back a not-so-stellar score.
"It gives an overall score of two out of four," Foltz explains. The computer also breaks it down in several categories of sub-scores showing, for example, a one on spelling and grammar, and a two on task and focus.
Several states including Utah and Ohio already use automated grading on their standardized tests. Cyndee Carter, assessment development coordinator for the Utah State Board of Education, says the state began very cautiously, at first making sure every machine-graded essay was also read by a real person. But she says the computer scoring has proven "spot-on" and Utah now lets machines be the sole judge of the vast majority of essays. In about 20 percent of cases, she says, when the computer detects something unusual, or is on the fence between two scores, it flags an essay for human review. But all in all, she says the automated scoring system has been a boon for the state, not only for the cost savings, but also because it enables teachers to get test results back in minutes rather than months.
Massachusetts is among those now intrigued by the possibilities and considering jumping on the bandwagon to have computers score essays on its statewide Massachusetts Comprehensive Assessment System (MCAS) tests.
Commissioner of Elementary and Secondary Education Jeffrey C. Riley called the prospect "exciting" at a recent Board of Elementary and Secondary Education meeting outlining plans to look into the idea. "I'm suspending belief that this is possible," he said.
Department Of Education Deputy Commissioner Jeff Wulfson cited "huge advances in artificial intelligence in the last few years" and cracked, "I asked Alexa whether she thought we'd ever be able to use computers to reliably score tests, and she said absolutely."
But many teachers are unconvinced.
"The idea is bananas, as far as I'm concerned," says Kelly Henderson, an English teacher at Newton South High School just outside Boston. "An art form, a form of expression being evaluated by an algorithm is patently ridiculous."
Another English teacher, Robyn Marder, nods her head in agreement. "What about original ideas? Where is room for creativity of expression? A computer is going to miss all of that," she says.
Marder and Henderson worry robo-graders will just encourage the worst kind of formulaic writing.
"What is the computer program going to reward?" Henderson challenges. "Is it going to reward some vapid drivel that happens to be structurally sound?"
Turns out that's an easy question to answer, thanks to MIT research affiliate, and longtime critic of automated scoring, Les Perelman. He's designed what you might think of as robo-graders' kryptonite, to expose what he sees as the weakness and absurdity of automated scoring. Called the Babel ("Basic Automatic B.S. Essay Language") Generator, it works like a computerized Mad Libs, creating essays that make zero sense, but earn top scores from robo-graders.
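The "computerized Mad Libs" idea is simple enough to sketch: fill sentence templates with obscure vocabulary around a prompt keyword. Everything below (templates, word lists) is invented for illustration; Perelman's actual Babel Generator is far more elaborate.

```python
import random

# Toy gibberish generator in the spirit of the Babel Generator.
# Templates and word lists are invented placeholders.

FANCY_NOUNS = ["scrutinization", "amanuensis", "imaginativeness", "epistemology"]
FANCY_VERBS = ["ensconced", "promulgated", "assimilated", "repudiated"]
FANCY_ADVERBS = ["precipitously", "blithely", "inexorably", "quintessentially"]

TEMPLATES = [
    "{kw} by {noun} has not, and presumably never will be, {adverb} {verb}.",
    "Society will always encompass {noun}; the {kw} lies in the field of {noun2}.",
]

def babble(keyword: str, n_sentences: int = 4, seed=None) -> str:
    """Return n_sentences of grammatical-looking nonsense about `keyword`."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(n_sentences):
        template = rng.choice(TEMPLATES)
        sentences.append(template.format(
            kw=keyword,
            noun=rng.choice(FANCY_NOUNS),
            noun2=rng.choice(FANCY_NOUNS),
            adverb=rng.choice(FANCY_ADVERBS),
            verb=rng.choice(FANCY_VERBS),
        ))
    return " ".join(sentences)
```

The output stays on topic (the keyword appears in every sentence), uses long words and complex syntax, and means nothing at all — exactly the combination a surface-feature scorer rewards.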
To demonstrate, he calls up a practice question for the GRE exam that's graded with the same algorithms that actual tests are. He then enters three words related to the essay prompt into his Babel Generator, which instantly spits back a 500-word wonder, replete with a plethora of obscure multisyllabic synonyms:
"History by mimic has not, and presumably never will be precipitously but blithely ensconced. Society will always encompass imaginativeness; many of scrutinizations but a few for an amanuensis. The perjured imaginativeness lies in the area of theory of knowledge but also the field of literature. Instead of enthralling the analysis, grounds constitutes both a disparaging quip and a diligent explanation."
"It makes absolutely no sense," he says, shaking his head. "There is no meaning. It's not real writing."
But Perelman promises that won't matter to the robo-grader. And sure enough, when he submits it to the GRE automated scoring system, it gets a perfect score: 6 out of 6, which according to the GRE, means it "presents a cogent, well-articulated analysis of the issue and conveys meaning skillfully."
"It's so scary that it works," Perelman sighs. "Machines are very brilliant for certain things and very stupid on other things. This is a case where the machines are very, very stupid."
Because computers can only count, and cannot actually understand meaning, he says, facts are irrelevant to the algorithm. "So you can write that the War of 1812 began in 1945, and that wouldn't count against you at all," he says. "In fact it would count for you because [the computer would consider it to be] good detail."
Perelman says his Babel Generator also proves how easy it is to game the system. While students are not going to walk into a standardized test with a Babel Generator in their back pocket, he says, they will quickly learn they can fool the algorithm by using lots of big words, complex sentences, and a few key phrases that make some English teachers cringe.
"For example, you will get a higher score just by [writing] 'in conclusion,'" he says.
Gaming the system?
But Nitin Madnani, senior research scientist at Educational Testing Service (ETS), the company that makes the GRE's automated scoring program, says that's not exactly a hack.
"If someone is smart enough to pay attention to all the things that an automated system pays attention to, and to incorporate them in their writing, that's no longer gaming, that's good writing," he says. "So you kind of do want to give them a good grade."
GRE essays are still always scored by a human reader as well as a computer, Madnani says. So pure babble would never pass a real test.
But in places like Utah, where tests are graded by machines only, scampish students are giving the algorithm a run for its money.
"Students are genius, and they're able to game the system," notes Carter, the assessment official from Utah.
One year, she says, a student who wrote a whole page of the letter "b" ended up with a good score. Other students have figured out that they could do well writing one really good paragraph and just copying that four times to make a five-paragraph essay that scores well. Others have pulled one over on the computer by padding their essays with long quotes from the text they're supposed to analyze, or from the question they're supposed to answer.
But each time, Carter says, the computer code is tweaked to spot those tricks.
"We think we're catching most things now," Carter says, but students are "very creative" and the computer programs are continually being updated to flag different kinds of ruses.
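The tricks Carter describes — repeating one good paragraph, or padding an essay with text lifted from the prompt — are exactly the kind of patterns a scoring vendor can check for mechanically. A minimal sketch, with invented thresholds:

```python
# Sketch of the trick-detection Carter describes. The threshold
# below is an invented placeholder, not any vendor's actual value.

def flag_repeated_paragraphs(essay: str) -> bool:
    """True if any paragraph appears more than once (the copy-paste trick)."""
    paragraphs = [p.strip().lower() for p in essay.split("\n\n") if p.strip()]
    return len(set(paragraphs)) < len(paragraphs)

def flag_prompt_padding(essay: str, prompt: str, threshold: float = 0.5) -> bool:
    """True if too much of the essay's vocabulary is copied from the prompt."""
    essay_words = essay.lower().split()
    prompt_words = set(prompt.lower().split())
    if not essay_words:
        return False
    overlap = sum(w in prompt_words for w in essay_words) / len(essay_words)
    return overlap > threshold
```

Each check is cheap, which is why, as Carter notes, the vendors can keep adding new ones as students invent new ruses.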
"In this game of cat and mouse, the vendors have already identified [these] strategies," says Mark Shermis, dean and professor at the College of Education at the University of Houston, Clear Lake, who's an expert in automated scoring. As a safeguard, all essays get not only a score, but also a "confidence" rating. "So those essays will be scored with 'low confidence,' and [the computer] will say 'please have a human have a look at this,'" he says.
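The safeguard Shermis describes — pairing every score with a confidence rating and routing low-confidence essays to a person — can be sketched in a few lines. The threshold and field names here are assumptions for illustration, not any vendor's actual design:

```python
from dataclasses import dataclass

# Sketch of score-plus-confidence routing. The 0.7 threshold is an
# invented placeholder; real systems tune this value.

@dataclass
class GradedEssay:
    score: int
    confidence: float        # hypothetical model confidence, 0.0-1.0
    needs_human_review: bool

def grade(score: int, confidence: float, threshold: float = 0.7) -> GradedEssay:
    """Attach a human-review flag whenever the machine isn't sure."""
    return GradedEssay(score=score,
                       confidence=confidence,
                       needs_human_review=confidence < threshold)
```

This matches Carter's earlier figure: in Utah, roughly 20 percent of essays trip the flag and get a human read.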
Critics of robo-grading also worry it will change the way teachers teach. "If teachers are being evaluated on how well their students perform on [standardized tests that are machine-graded] and schools are being evaluated on how well they test, then teachers are going to be teaching to the test," says Perelman. "And teachers will be teaching students to produce the wrong thing."
"The facts are secondary"
Indeed, being a good writer is not the same thing as being a "higher-scoring GRE essay writer," says Orion Taraban, executive director of Stellar GRE, a tutoring company in San Francisco.
"Students really need to appreciate that they're writing for a machine ... [and when students] agonize over crafting beautiful, wonderfully logically coherent and empirically validated paragraphs, it's like pearls before swine. The computer can't appreciate what this person has done and they don't get the score that they deserve."
Instead, Taraban tutors students to give the computer what it wants. "I train them in fabricating evidence and fabricating fake studies, which is a lot of fun," he says, quickly adding, "but I also tell them not to do this in real life."
For example, when writing a persuasive essay, Taraban advises students to use a basic formula and get creative. It goes something like this:
"In a [pick any year] study by Professor [fill in any old name] at the [insert your favorite university], in which the authors analyze [summarize the crux of the debate here], researchers discovered that [insert compelling data here] ... and that [offer more invented, persuasive evidence here]. This demonstrates that [go to town boosting your thesis here!]"
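Taraban's formula is literally a fill-in-the-blank template. A minimal sketch of the same idea, with all slot names invented for illustration:

```python
# The tutoring "formula" as a format-string template. All field
# names are illustrative; this is the structure, not a real tool.

TEMPLATE = ("In a {year} study by Professor {name} at {university}, "
            "in which the authors analyze {debate}, researchers "
            "discovered that {data}. This demonstrates that {thesis}.")

def fill(year, name, university, debate, data, thesis) -> str:
    """Produce one formula-shaped 'evidence' sentence from the slots."""
    return TEMPLATE.format(year=year, name=name, university=university,
                           debate=debate, data=data, thesis=thesis)
```

Whatever goes in the slots, the output has the shape of evidence-backed argument — which, as the next paragraphs explain, is all a surface-level scorer can check.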
His students do this all the time, using the name of, say, their roommate, and citing that fake expert's fake research to bolster their argument. More often than not, they've been rewarded with great scores.
"Yeah, we see a lot of that," concedes Madnani, who works for ETS on the GRE automated scoring program. "But it's not the end of the world." Even human readers, who may have two minutes to read each essay, would not take the time to fact-check those kinds of details, he says. "But if the goal of the assessment is to test whether you are a good English writer, then the facts are secondary."
It's a different story on achievement tests that are meant to test a student's mastery of history, for example. In those cases it would matter if a student writes that the War of 1812 began in 1945. AI systems can check facts against a database, he says, but that only works on very narrow questions. "If you might have millions of facts that could come in, there's no possible way any automated system could verify all of them," he says. "So that's why we have humans in the loop."
Ultimately, he says, computer programs are doing what they were designed to do: to assess whether a student knows how to construct an essay, with a thesis, evidence, and a conclusion, all in good English. It's true that a transitional phrase like "in conclusion" signals to the algorithm that you've got one, just as "firstly," "secondly," and "thirdly" broadcast that a student is moving through a multifaceted argument. Purists may turn up their noses at that kind of formulaic writing, but as developers note, the computers learn what good writing is from teachers, and just mirror that. "Only if teachers think that writing 'in conclusion' is a good structure to use, then students will tend to be rewarded for it," says Foltz, from Pearson.
So, in conclusion, robo-grading technology may indeed be "demonstrating proficiency" and "learning new skills." But experts say, it's also still got plenty of "room for improvement."
The original version of this article incorrectly referred to Mark Shermis as David Shermis and referred to the College of Education as the School of Education.
SCOTT SIMON, HOST:
Little pop quiz now. Who writes our theme music - A, Snoop Dogg, B, Dolly Parton, C, Philip Glass or, D, B.J. Leiderman? A computer would quickly know D is the correct answer. But now computers are also starting to grade students' essays. As NPR's Tovia Smith reports, many teachers see that as a mistake.
TOVIA SMITH, BYLINE: Developers of the so-called robo-graders say they understand the skepticism. But they say if computers are already driving cars, detecting cancer and carrying on conversations, they can also handle grading a high school essay on, say, the fall of the Roman Empire.
PETER FOLTZ: Yeah, I've been working on this now for about 25 years, so I feel that it's something that - the time is right. And it's really starting to be used now.
SMITH: Peter Foltz is a professor at the University of Colorado and a researcher for Pearson, a company whose automated scoring program graded some 34 million student essays on state and national high-stakes tests last year. Foltz says computers learn what's considered good writing by analyzing essays graded by humans, and then they simply scan for those same features.
FOLTZ: We have artificial intelligence techniques which can judge anywhere from about 50 to a hundred features - whether a student is on topic, the coherence or the flow of an argument, the complexity of word choice. And we've done a number of studies to show that the scoring can be highly accurate.
SMITH: To demonstrate, he takes a not-so-stellar sample essay rife with spelling mistakes and sentence fragments, and he runs it by the robo-grader, which instantly spits back a not-so-stellar score.
FOLTZ: So it gives an overall score of two out of four on these different writing traits. And it gets a one on spelling and grammar. It gives a two on task and focus, and...
SMITH: Several states already use automated grading on their standardized tests. Utah, for example, started cautiously with human eyes backing up every computer score. But officials say the computers have proven spot-on, and now more states are considering it.
(SOUNDBITE OF ARCHIVED RECORDING)
JEFF WULFSON: I asked Alexa whether she thought we'd ever be able to use computers to reliably score student tests, and she said absolutely.
SMITH: Massachusetts Department of Education Deputy Commissioner Jeff Wulfson introduced the idea at a recent meeting. He's one of many now intrigued by the potential cost savings and the prospect of getting test results back in minutes rather than months. But many teachers are unconvinced.
KELLY HENDERSON: The idea is bananas as far as I'm concerned. An art form, a form of expression being evaluated by an algorithm is patently ridiculous.
ROBYN MARDER: Agreed.
SMITH: Kelly Henderson and Robyn Marder teach English at Newton South High School just outside Boston.
HENDERSON: What about original ideas? Where's room for creativity of expression? A computer's going to miss all of that.
SMITH: Even worse, Henderson worries robo-graders will encourage the worst kind of formulaic writing.
HENDERSON: What is a computer program going to reward? Is it going to reward some vapid drivel that happens to be structurally sound?
LES PERELMAN: That's a very easy question to answer. And that's what we'll see in the Babel Generator.
SMITH: MIT researcher Les Perelman designed his Babel Generator to expose what he sees as the absurdity of robo-scoring. It works like a computerized Mad Libs, creating essays that make zero sense but earn top scores from robo-graders.
PERELMAN: OK, so we'll generate an essay.
SMITH: To demonstrate, he gets an online practice question for the GRE exam that's graded with the same algorithms as actual tests. Then on his Babel Generator, he enters three words related to the essay prompt and presto, a 500-word wonder.
PERELMAN: Motive is a scrutinization that has not and no doubt never will be disrupting yet somehow assimilated.
SMITH: This is hilarious.
SMITH: It makes no sense.
PERELMAN: It makes absolutely no sense.
SMITH: But Perelman promises that won't matter to the robo-grader and submits the essay for a score.
SMITH: Big moment of truth.
PERELMAN: Six points - perfect score. It's so scary that it works.
SMITH: It proves, Perelman says, that real ideas and facts don't matter to the algorithm and how easy it is to game the system. Even without a Babel Generator, he says, students can fool the computer by just using lots of big words, complex sentences and some key phrases like "in conclusion." But Nitin Madnani, a researcher at ETS, the company that makes the GRE's robo-grader, says that's not exactly a hack.
NITIN MADNANI: If somebody is smart enough to pay attention to all the things that a - you know, an automated system pays attention to, to incorporate them in their writing, that's no longer gaming. That's good writing. So you kind of do want to give them a good grade.
SMITH: Madnani says actual GRE essays are always scored by a human reader as well as a computer, so pure babble would never pass a real test. And while other tests are graded only by machines, they're getting better at picking up student tricks and flagging them for human review. For example, some students have written one perfect paragraph and just repeated it four more times. Others have padded their essays with long quotes. Mark Shermis is a dean at the College of Education at the University of Houston, Clear Lake.
MARK SHERMIS: In this game of cat and mouse, the vendors have already identified that as a strategy. And so the essay will be scored with very low confidence, and it will say, please have a human rater take a look at this.
SMITH: So in conclusion, robo-grading technology may indeed be demonstrating proficiency, but experts say it's also still got plenty of room for improvement. Tovia Smith, NPR News. Transcript provided by NPR, Copyright NPR.