Scoring errors jeopardize tests: Poor oversight raises risk

Thousands of students’ scores wrong


This newspaper had reported extensively on cheating but hadn’t dug deeply into a more basic question: Just how good are the exams used to make critical decisions in schools?

No one, it turned out, had documented testing errors’ scope, causes or consequences since the 2001 No Child Left Behind Act.

Staff writer Heather Vogell decided to examine the issue during a prestigious Spencer Education Fellowship at the Columbia University Graduate School of Journalism in New York City, where she did most of the reporting.

She requested documents on testing from all 50 states and the District of Columbia, conducted scores of interviews and reviewed news stories and federal reports.

The AJC also reviewed the statistics for more than 90,000 test questions given on roughly 1,700 tests in 42 states and Washington, D.C. Vogell worked with Matthew Johnson, a testing expert at Teachers College, Columbia University, who is also the editor of the Journal of Educational and Behavioral Statistics.

The examination found that on nearly 9 percent of exams, at least one in 10 test items showed signs of potential problems, threatening the tests’ overall quality and raising questions about fairness. The finding was based on a statistic commonly referred to as “discrimination” (typically the “point biserial” or “item-total correlation”), which gauges whether students who do well on the test overall also tend to get a particular question right. Low-discrimination questions such as those the newspaper identified should often be revised or thrown out, experts say.
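
For readers curious about the mechanics, the discrimination statistic is simply a correlation between students’ success on a single question and their scores on the rest of the test. The sketch below illustrates the computation with invented numbers; the data, the 0.2 flagging cutoff (a common rule of thumb, though cutoffs vary) and all names are ours for the example, not drawn from the AJC’s actual analysis.

```python
import numpy as np

def point_biserial(item_correct, total_scores):
    """Correlation between a 0/1 item-response vector and test scores.

    A low value means the question fails to "discriminate": strong
    students are no more likely to answer it correctly than weak ones.
    """
    item_correct = np.asarray(item_correct, dtype=float)
    total_scores = np.asarray(total_scores, dtype=float)
    # "Corrected" item-total correlation: subtract the item's own point
    # from each total so the question isn't compared against itself.
    rest_of_test = total_scores - item_correct
    return np.corrcoef(item_correct, rest_of_test)[0, 1]

# Invented example: eight students' total scores, plus whether each
# answered two particular questions correctly.
totals = [55, 52, 48, 45, 40, 35, 30, 25]
item_a = [1, 1, 1, 1, 0, 0, 0, 0]   # tracks ability -> high discrimination
item_b = [0, 1, 0, 1, 1, 0, 1, 0]   # unrelated to ability -> low discrimination

for name, responses in (("item_a", item_a), ("item_b", item_b)):
    r = point_biserial(responses, totals)
    verdict = "flag for review" if r < 0.2 else "ok"
    print(f"{name}: discrimination = {r:.2f} ({verdict})")
```

On these numbers, item_a comes out around 0.87 while item_b comes out at 0.00 — the kind of question that, by the experts’ standard, should often be revised or thrown out.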

The statistical review and documents revealed persistent problems in the quality of questions on the standardized tests. The AJC explored those issues in a Sept. 15 story.

Today’s story is based on hundreds of pages of records on testing breakdowns from government agencies – including audits, reports, memos and correspondence between state agencies and test contractors. The newspaper scrutinized more than 100 testing failures across the country.

Tommy Parker knows that failing one of Mississippi’s high-stakes tests in high school can change a life.

It can mean the difference between college and a factory job; between scraping by and a chance for more. The former principal is still haunted by the few times he told parents their children wouldn’t receive a high school diploma because they had failed the exams.

So when Parker, now a school district superintendent, learned last year that three of his students had actually passed a state graduation test they’d long thought they’d failed, his disillusionment deepened.

“You put your faith and confidence in the state department of education and the companies they have contracts with to build and validate these tests,” he said. “To discover that something is wrong with it, it throws up a lot of doubts.”

Getting each child an accurate score is, in the end, the single most important task states undertake when they give standardized tests.

Yet a year-long investigation by The Atlanta Journal-Constitution found that testing companies and the education agencies that hire them continue to make scoring mistakes that can have dire consequences.

In Mississippi, a computer glitch on a test led high schoolers to drop out. A scoring miscalculation in Massachusetts nearly cost students college scholarships. In New York City, multiple errors caused thousands of children to be told they were ineligible for gifted programs when they had, in fact, qualified.

Such problems keep happening despite outrage from the public and repeat apologies from test contractors and state officials, hundreds of pages of documents show – including reports, audits, memos and correspondence between government agencies and testing contractors.

Testing executives and experts say the industry has improved its practices since the No Child Left Behind Act of 2001 made standardized testing a foundation of federal education policy.

Yet some also say the unrelenting push to speed up the return of test results, along with states’ ongoing budget problems, at times hampers quality control. The K-12 testing business, they say, is immensely fast-paced and complex. Pitfalls abound.

“The capacity at the state level is being minimized and the vendors are doing more and more and more,” said Gary Cook, a University of Wisconsin test expert who has been both a testing company executive and director of a state testing program.

“The fact that we don’t have a major blow-up every week is amazing,” he said.

Minnesota resident Karin Noren worked as a manual scorer of students’ handwritten answers for several testing companies over 12 years.

The pressure to work faster, she said, only intensified as time went on.

“Just speed them up, speed them up, speed them up,” said Noren, who began scoring in 2000. “You were supposed to just sit glued to that computer.”

***

To Melissa Fincher, there is no worse kind of mistake in testing.

“When a kid gets an incorrect score,” said Fincher, the associate superintendent over testing in Georgia, “that’s the cardinal sin.”

The potential for such errors has rattled policy makers and the public since NCLB made testing an annual rite for millions more children and attached stiffer consequences to scores.

As far back as 2002, Nevada state school board members worried that some of the 736 students mistakenly told that they had failed the state’s high school math test might have dropped out.

In Ohio, education officials in 2005 discovered that contractor Measurement Incorporated had made errors calculating final scores on a test mandatory for graduation. About 890 high school students thought they had flunked when they had actually passed.

Ohio officials were more dismayed when they discovered they had missed problems with MI’s work for several years, a state memo to the contractor shows. Luckily, the state caught the errors before anyone was denied a diploma.

By 2006, states were conducting testing on a magnitude — and with ramifications — never before seen in this country. That year, a mistake affecting the scores of 4,000 students who took the SAT, the pre-college test, sparked an outcry from testing critics.

Then-Secretary of Education Margaret Spellings called together executives from the biggest test companies. She asked whether they could handle the immense volume of work NCLB had created.

Their answer: an unequivocal yes. One company boasted it could handle triple the amount of business, according to news accounts.

The testing program continued as planned.

Three years later, in 2009, a federal inspector general published three alarming reports on the scoring of exams.

In Florida, the federal office found errors in how a scanner read students’ bubbled-in answer choices. What’s more, investigators could not initially find a sample of 14 completed answer sheets during their visit to a contractor’s warehouse.

Every year since 2006 in Wyoming, the federal inquiry found, the state had cut back the time local and state testing officials could take to verify test results — despite finding errors annually.

And in Tennessee, federal investigators said the state inadequately monitored the contractor in charge of scoring the state’s writing test.

Around the same time, public officials elsewhere found themselves contending with regular testing mistakes. For Georgia and Mississippi, slip-ups were so common that the states at times used money from penalties charged to testing contractors to help pay for the exams the next year.

Mississippi sought fines from test companies 10 times in a decade, records show. Problems included bad questions, test booklets with missing reading passages, delays, and incorrectly calculated student scores.

But no mistake was as dire as the hidden error discovered last year in a question on the state biology test students must pass to graduate.

In 2007, a subcontractor was working on an online version of the test for students who had failed prior attempts. While a graphic was being reformatted, the underlying computer programming became muddled, and the test began to record a correct choice on one question as incorrect.

As a result, for almost four years students were penalized when they selected the right answer.

Over that time, 126 students had flunked the test solely because of the problem question. Most tried again and passed. But five had dropped out without diplomas.

How could such a grave mistake slip through?

An audit found that a “crude” quality-control check used by the subcontractor was inadequate. Auditors criticized contractor Pearson’s proposals for preventing more such mistakes, saying the plans were “conspicuously missing” details.
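
One standard safeguard — and the kind of rigor the auditors found missing — is a “perfect paper” smoke test: before an online form goes live, the scoring engine is fed an answer sheet built straight from the item bank’s own key, and anything short of a perfect score means a key was scrambled somewhere along the way. The sketch below is hypothetical; the item IDs, keys and the flipped entry are invented to mirror the failure, not taken from Pearson’s or the subcontractor’s systems.

```python
# Hypothetical "perfect paper" smoke test. Item IDs and answer keys are
# invented; the flipped entry mimics the Mississippi-style defect.

item_bank_key = {"BIO-01": "C", "BIO-02": "A", "BIO-03": "D"}

def score(responses, key):
    """Count responses that match the answer key."""
    return sum(1 for item, choice in responses.items() if key.get(item) == choice)

# An answer sheet built directly from the item bank's key must score 100%.
perfect_sheet = dict(item_bank_key)

# Simulate the defect: reformatting the online form silently flipped one key.
online_form_key = dict(item_bank_key)
online_form_key["BIO-02"] = "B"

assert score(perfect_sheet, item_bank_key) == len(item_bank_key)  # passes

if score(perfect_sheet, online_form_key) != len(item_bank_key):
    print("Smoke test FAILED: the online form's key no longer matches the item bank")
```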

Walter Sherwood, president of state services for Pearson, called the mistake “awful.” The company has significant testing contracts in 19 states.

“That error should have been caught and corrected at its origin,” he said. “It’s just absolutely incumbent on us that we get this right. It’s inexcusable that we don’t provide completely accurate results.”

To compensate, Pearson offered the 126 students more than $600,000 in scholarships if they enrolled in college.

Mississippi Assessment Director James Mason said the five students who missed graduating only because of the bad question were awarded degrees. A few superintendents staged small ceremonies for families. The state fired the subcontractor and hired consultants to provide independent reviews of test data.

The response from the state, districts and Pearson was “as good an outcome as you can hope for” to a scoring debacle, he said. “It’s tough.”

Parker, the superintendent, said that for many students hurt by the error, the scholarships may have been an empty gesture. The fact that they were struggling with the test may have meant they never made it to college.

His district was only able to locate two of the three students affected. Parker said educators and students feel “totally at the mercy of the testing companies.”

“It just causes a lot of grief,” the Jones County Schools superintendent said. “You just wonder how many other times something like this could have happened.”

***

Mishaps continued even as testing became as ingrained in American schooling as winter break, a decade’s worth of state testing records show.

This spring, news headlines were littered with a hodgepodge of testing problems from New York to Oklahoma to Indiana. The stories helped fuel a growing backlash against testing. Some parents even refused to let their children sit for the exams.

Mistakes caught after scores are released have forced states to backtrack in very public ways, further eroding confidence in the tests.

Two data errors in 2009 in New Mexico were more than bureaucratic slip-ups: The state had already publicized the results and some schools had sent letters to parents telling them their children could change schools because theirs had failed to meet testing targets, known as adequate yearly progress.

“The school districts and the public will be astonished to learn that their AYP results were affected by data file errors,” Tom Dauphinee, a state official, wrote to contractor American Institutes for Research.

The state ordered AIR to send a letter of apology to each of the superintendents whose schools were involved.

School district officials, teachers, parents, and even students have all caught mistakes.

In Illinois, a fourth-grade Chicago Public Schools teacher noticed an entire class received zeros for responses to written-answer questions on the 2011 state test. The students were not alone: 144 students in five schools had wrongly received zeros.

The state sought nearly $1.7 million from contractor Pearson, which, according to a consultant’s report, added quality-control checks but could not explain exactly why the error occurred.

In Massachusetts that same year, high school sophomore Michael Safran was examining his scores on the state math test when he noticed something odd. He had lost only one point out of 60. Yet when that raw score was converted to a final, or “scale,” score, he’d lost four more points than expected.

Safran had done so well, he knew he’d still graduate with a state scholarship. But he was curious.

“If something doesn’t seem right, doesn’t make sense, I want to make sense of it,” he said.

He told his dad, a former state education official, who recommended looking at the state’s scoring table online. Safran did, then called the department to report what he’d seen.

It turned out his test results weren’t the only ones that were wrong: The final scores for more than 40,000 students statewide were incorrect. More than 3,200 students — all of whom had passed — would jump to the next performance level because of the mistake. The error could have derailed scholarships for some.

Contractor Measured Progress compensated the state $202,500 for the mistake. The company also sent money to schools — three of which had initially missed federal test-score goals because of the error.

That payment equaled 50 cents per child.

***

The burden of overseeing testing has fallen hard on state education departments, documents and interviews show, undermining their attempts to effectively monitor contractors.

For years, the federal government allocated just over $400 million a year to supporting states’ testing programs.

The grants didn’t cover all testing costs. Thomas Toch, of the Carnegie Foundation for the Advancement of Teaching, said it wasn’t enough, either, to support calls for more sophisticated, less flawed tests.

“It hasn’t taken into account inflation even,” he said, “let alone our desire to create more effective tests.”

In fact, by 2013, federal grants to states for testing had dropped to $369 million.

As the recession trudged on, states tightened up their own spending, too. That prompted many education departments to place “significantly more weight” on price, versus quality, than in the past when choosing test contractors, said Jon Cohen, executive vice president over testing for AIR.

That shift, Cohen said, “can hamstring a state.”

He said he thinks the industry can provide an excellent product at a low cost. But, he added, if states are unhappy with the trade-off between quality and price, all they need to do when seeking testing bids is change the formula to emphasize quality more.

Ultimately, state workers in testing divisions are dependent on how much money lawmakers direct to their departments.

“You just have to be aware that you can’t do for $10 a test the same amount of quality checking you can do for a $24 test or a $15 test,” said Robert Lee, chief analyst for Massachusetts’ testing program.

Tight budgets aggravate another challenge for state education agencies: finding and holding onto knowledgeable testing program staff.

Universities churn out few doctorates in the science of testing – a critical expertise needed for both test development and scoring. Not many of the 100 or so graduates who earn such degrees each year end up in state testing programs.

“There’s immense and growing competition among companies and states for talent,” Toch said. “In most instances, states are losing.”

That means fewer skilled eyes overseeing tests and less institutional knowledge in state departments, which often have high rates of turnover. A 2009 federal survey found 13 states had no Ph.D. on staff in their testing divisions.

Quality control suffers as a result, said Cook, the testing expert.

“Unless you have somebody covering your back, even as a vendor, you’re likely to make mistakes,” he said. “Caveat emptor. You cannot rely on the vendor to be completely responsible for the information that you’re providing … to your school, district and state.”

Massachusetts — a state with perhaps the most revered testing program nationwide — learned that lesson the hard way with the wide-reaching scoring error student Michael Safran discovered. Records show the mistake occurred after the state cut back oversight in hopes of posting test results earlier.

The state had reduced the time allowed for final quality-control checks – from one week to one day – and shifted the job from state employees to contractor Measured Progress, an internal report says.

Measured Progress said the error occurred when an employee calculating final scores grabbed the wrong table. Quality checks missed the problem.
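
A raw-to-scale conversion is, at bottom, a lookup table, which is why grabbing the wrong one can silently corrupt every score in a state. One defense is the sort of back-covering Cook describes: a second team independently rebuilds the table, and the two copies are compared entry by entry before results go out. A minimal sketch, with invented numbers — real conversion tables come from psychometric equating, not from code like this:

```python
# Hypothetical cross-check of a raw-to-scale conversion table. All values
# are invented for illustration.

production_table  = {57: 256, 58: 262, 59: 264, 60: 280}  # table the scorer loaded
independent_table = {57: 256, 58: 262, 59: 268, 60: 280}  # table rebuilt separately

def cross_check(prod, indep):
    """Compare two independently sourced conversion tables entry by entry."""
    for raw in sorted(set(prod) | set(indep)):
        if prod.get(raw) != indep.get(raw):
            yield raw, prod.get(raw), indep.get(raw)

for raw, got, expected in cross_check(production_table, independent_table):
    print(f"raw score {raw}: production table says {got}, independent rebuild says {expected}")
```

A single mismatch surfaced this way, before results are posted, is the difference between a quiet correction and recalling scores for more than 40,000 students.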

“This was my worst moment in testing,” said Lee, the state’s chief analyst for testing.

The breakdown, however, was far from atypical.

The use of incorrect scoring tables for two Georgia tests was among five exam-processing mistakes Pearson made there in less than a year – none of which became public at the time.

Fincher, Georgia’s testing director, said her staff struggled to keep tabs on the contractor. “They taxed our capacity pretty badly during that time period,” she said.

Pearson’s mistakes were largely preventable, a late 2009 audit found. Company officials say they made changes to keep such failures from occurring again.

After seeing other testing companies commit errors in recent years, too, Georgia accepted it must stay vigilant, Fincher said. “We feel like we are their quality control,” she said. “We continue to catch things.”

“Everything we do is high stakes,” she said. “It keeps me up at night.”

Echoes of Georgia’s problems can be found in Oklahoma, where records show the state identified 18 significant problems with Pearson’s tests in 2011 alone. Among them, Pearson had twice failed to follow its own quality-check processes, records show.

Sherwood, of Pearson, said the company fixed the problems the following year. Pearson and the state reached an $8 million settlement.

Deeper causes of testing problems – such as budget cutbacks – are hard to trace directly to specific failures, Sherwood said.

But, he added, the fact that states don’t carry out testing in a standard way adds to the complexity of the business, which in turn increases the risk of error.

“We operate,” Sherwood said, “in an environment that is inherently risky.”