Fighting the Gender Gap: Standardized Tests Are Poor Indicators of Ability in Physics
Women and underrepresented minorities typically score significantly lower than men on the standardized tests designed to predict performance in undergraduate and graduate physics and math courses, and are hence more likely to be disqualified during the initial admissions screening process. But according to speakers at a Friday afternoon session at the 1996 Joint APS/AAPT Meeting, standardized tests such as the SAT and GRE are in reality very poor indicators of students' success in these rigorous subject areas.
Anne Marie Zolandz, who works on test development for the Educational Testing Service (ETS), reported that in 1994-1995, women represented about 28 percent of the total population taking the SAT II physics test. However, while more women are taking the test than ever before, their scores continue to trail men's by about 50 points. African-Americans, Hispanics, and Native Americans also score consistently lower than white and Asian-American students. Fewer students take the GRE physics subject test, with women comprising only 16 percent of the sample, and a comparison of men's and women's scores reveals a gap of about 150 points, roughly one standard deviation. The GREs are primarily taken by white students, followed by Asian-Americans, and these groups typically score significantly higher than other minorities.
The gender gap that favors boys persists across all other demographic characteristics, including family income, parental education, grade point average, course work, and class rank, according to Pamela Zappardino, a professional psychologist and executive director of FairTest, a Cambridge, Massachusetts organization that focuses on assessment reform and works against the misuse and abuse of standardized testing. "I think there's a fallacy in the assumption that the SAT or GRE is actually telling us something," said Zappardino. "At best, the SAT only accounts for about 16 percent of the variance in first-year college grades. That isn't a great predictor, by anybody's yardstick." The SAT math test, for example, consistently underpredicts women's performance in college math courses.
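Zappardino's "16 percent of the variance" figure follows from a standard statistical identity: in a simple linear regression, the fraction of variance in the outcome explained by a predictor is the squared correlation between them. The sketch below illustrates the arithmetic; the correlation value r = 0.4 is an assumption chosen to match the quoted 16 percent figure, not a number from the session.

```python
# The fraction of variance in first-year grades "explained" by a test
# score is the squared test-grade correlation, r ** 2. A correlation of
# r = 0.4 (assumed here for illustration, consistent with the quoted
# 16 percent figure) explains only a small share of the variance.
r = 0.4
variance_explained = r ** 2
print(f"Variance explained: {variance_explained:.0%}")  # prints: Variance explained: 16%
```

Put the other way around: even a test-grade correlation of 0.4 leaves 84 percent of the grade variance unaccounted for, which is the sense in which the SAT "isn't a great predictor."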
An April 1995 study at the University of California, Berkeley, found that women with academic indexes identical to men's obtained higher grade point averages in every major on campus, including math and physical sciences. The report concluded that women should have about 140 points added to their index to compensate for the SAT's underprediction, and that non-test criteria, such as high school GPA, were much better predictors for women in all academically rigorous and male-dominated fields. David Morin, a physics graduate student at Harvard, last year conducted his own study of the correlation between GRE scores and performance in graduate school, focusing on Harvard students. He found that while there was a very slight correlation between GRE scores and graduate course grades, there was no correlation with other measures of success in graduate school, including oral exam scores and overall completion time for the Ph.D. degree.
The gaps in scores do not seem to arise from inherent gender or ethnic bias in the test itself. ETS has implemented numerous procedures in both test development and analysis to ensure the fairness of its tests. Specifically, it aims for broad representation on the test development committees, drawing members from colleges and high schools, public and private institutions, with wide geographic distribution and at least one minority and one female representative. All tests are subject to "sensitivity reviews" to eliminate any potentially offensive language or content, and are checked for sufficient references to minorities and women when the subject matter warrants it.
Statistical analysis is also performed to identify test items for which subgroups of the population may perform differently. For example, on biology tests, it was discovered that women generally performed better on questions concerning the reproductive system. ETS uses a method called differential item functioning (DIF) to identify potentially biased items. Items showing a large differential of 15 percent or more are reviewed and sometimes discarded. Surprisingly, some of those items with a high DIF are standard physics problems in kinematics, electrostatics, or optics, with no obvious pattern in terms of content or skill levels to explain the wide differentials.
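The basic idea behind DIF screening can be sketched in a few lines. The version below simply compares each item's proportion-correct between two groups and flags items whose gap exceeds a threshold; the 15 percent cutoff comes from the article, but the function name and the unmatched-groups simplification are assumptions of this sketch. ETS's actual procedure is more elaborate (it compares examinees matched on overall ability, e.g. via Mantel-Haenszel statistics).

```python
# Simplified illustration of differential item functioning (DIF) screening.
# Each group is a list of response vectors (1 = correct, 0 = incorrect),
# one vector per test taker, all the same length.

def dif_flagged_items(group_a, group_b, threshold=0.15):
    """Flag items whose proportion-correct differs between the two
    groups by at least `threshold`. Returns (item_index, difference)
    pairs, where difference = p_correct(group_a) - p_correct(group_b)."""
    n_items = len(group_a[0])
    flagged = []
    for i in range(n_items):
        p_a = sum(r[i] for r in group_a) / len(group_a)
        p_b = sum(r[i] for r in group_b) / len(group_b)
        diff = p_a - p_b
        if abs(diff) >= threshold:
            flagged.append((i, diff))
    return flagged
```

On real test data the flagged items would then go to a review committee, as the article describes; a large differential alone does not establish that an item is biased, only that it behaves differently across groups.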
So where does the problem lie? A joint study by the ETS and the College Board concluded that multiple choice formats favor men over women, partly because men are more willing to guess on tests when they don't know the answer. Men also perform better on timed tests. Another ETS study found that when the time limit was removed from SAT subtests, girls' scores improved markedly, while boys' scores changed very little. At present, there are no plans to alter the format of the tests. "The ETS is dedicated to developing tests that are as equitable as possible to all groups," explained Zolandz. "But we are operating under the constraints associated with administering large-scale tests at a reasonable cost, which presently means multiple choice questions."
According to Zappardino, gender differences can certainly be manipulated by selecting different test items. For example, for the first several years when the SAT was offered, boys scored higher than girls on the math section, while girls achieved higher scores on the verbal section. The ETS decided the verbal test needed to be balanced more in favor of boys, and added more questions pertaining to politics, business and sports. No similar efforts were made to balance the math section. "Since then, boys have outscored girls on both the math and verbal sections," said Zappardino. "So when girls show a superior performance, balancing is required; when boys show the superior performance, no adjustment is necessary."
Foreign students, especially those from China, also do well on the GRE subject test, although their performance in graduate school isn't any better or worse than that of their American colleagues. "That suggests to me that the physics subject test measures some specific skill that can be taught, and it is taught very effectively in China, but it is not at all clear how much this skill has to do with what we want to know about potential physics students," said Howard Georgi, who has been involved with graduate admissions at Harvard University for more than 20 years. Jennifer Siders, a recent physics Ph.D. from the University of Texas who is now at Los Alamos National Laboratory, took the GRE subject test four times to meet her department's minimum requirement of 700. She finally managed to raise her score by 200 points, not by learning more physics, but by learning how to take standardized tests, often at the expense of her actual coursework.
The standardized test format also seems to favor students Georgi describes as "idiot savants": those with strong mathematical skills who are very good at manipulating symbols without learning any of the real physics behind them, but who nevertheless tend to perform exceptionally well on the GREs. In contrast, two of his most impressive undergraduate physics students, both women with excellent undergraduate records, scored much lower than expected. Phyllis Rossiter, author of The SAT Gender Gap, concluded that, "This highly speeded test rewards the facile test taker, rather than the sophisticated, thoughtful thinker who gathers new information, organizes and evaluates and expresses original thoughts clearly and concisely."
The impact of gender gaps in standardized test scores can be devastating. Female students are twice as likely as males to be disqualified by minimum cutoff score requirements, even though their overall academic performance tends to be higher. Many talented women and minority students may be discouraged from applying to top institutions if they feel their scores are too low. In addition to lower self-confidence and career expectations, the gender gap may decrease women's chances of earning fellowships.
Zolandz emphasized that ETS policy dictates that test scores should never be the sole basis for an admissions decision, and also discourages the use of cut-off scores below which applicants are summarily rejected. "The test scores are only one piece of information about a student," she said. "They may help contribute to your decision, but they are never designed to be the sole indicator." However, institutions often ignore this dictum, as in the case of Siders' experiences with the University of Texas graduate admissions committee.
The reliance on standardized testing for admission is slowly beginning to change. FairTest compiles an annual list of four-year colleges that do not require standardized test scores, and there are currently 241, compared with 112 in 1989. While Harvard's graduate admissions committee requires both the GRE general and physics subject test, its members rely more heavily on letters of recommendation, the personal essay, and undergraduate records when deciding whom to admit.
While Siders emphatically believes that GRE requirements should be dropped for graduate admissions, Georgi favors a modified version of the GRE physics subject test that halves the number of questions, eliminating time pressure and giving students time to think through the answers, and that focuses instead on basic skills and knowledge. "As presently constituted, it's quite possible that the GRE physics subject test does more harm than good, and we should either fix it, or seriously consider getting rid of it altogether," he said.