Atenu Lab · Research review

The Language of Assessment: Performance Differences Between Mother-Tongue and English-Medium Items in STEM Subjects

By Atenu Educational Technology Lab · 22 May 2026 · ~14 min read

Filed under Digital Assessment & CBT Research.

Language of assessment
ESSLCE
Mother-tongue instruction
STEM achievement
Test validity
Computer-based testing
Ethiopia

Key questions this review answers

Does the language an exam is written in actually change a student's measured STEM achievement?

Yes, by educationally meaningful margins. In Ethiopia, students assessed in their mother tongue have scored on average 42.3% across STEM versus 36% for peers assessed in English. A 2024 Ethiopian study found English-medium assessment lowered Grade 7–8 mathematics scores by about 0.2–0.25 standard deviations. The same direction of effect appears in South Africa (TIMSS), across PISA countries, and in controlled item experiments in the United States.

Why is the language penalty larger in science than in mathematics or reading?

Science carries the heaviest load of academic-register language: dense technical vocabulary, long noun phrases, passive constructions, and logical connectors that compress reasoning into single sentences. Conversational fluency develops in about 2 years; the academic language register needed for science exams takes 5–7 years. A student can sound fluent and still lack the register the test demands.

What does this mean for Ethiopia's English-medium ESSLCE, now delivered as a CBT?

For a meaningful share of students, an ESSLCE STEM score is a blend of subject knowledge and English reading proficiency, which makes it a partly invalid measure of science or mathematics ability. The penalty falls hardest on rural and lower-income students, so a language-loaded examination can widen existing gaps rather than measure across them fairly. The CBT interface adds a separate, additive cognitive load that competes for the same attention.

Is the answer to switch the ESSLCE out of English?

No. English is the language of Ethiopian higher education and the labour market, and competence in it is a legitimate goal of secondary schooling. The argument is narrower: examinations should measure science and mathematics as accurately as possible, and unnecessary linguistic complexity in items works against that without serving any defensible purpose.

What is the most practical lever for fairer STEM assessment under an English-medium exam?

Item design. Reducing the non-essential linguistic complexity of a question (short sentences, active voice, common non-technical vocabulary, fewer embedded clauses) measurably raises scores for English learners while leaving the scores of higher-achieving native speakers essentially unchanged. This indicates the language load was a measurement artifact, not part of the construct.

What is Atenu Lab doing about this in the Ethiopian context?

We are designing a within-subjects item experiment: matched STEM item triples (English, mother-tongue, linguistically reduced English) randomly assigned on the Atenu platform, analysed with differential item functioning methods to produce a direct local estimate of the language penalty in current ESSLCE-style items.

Every assessment makes a quiet assumption: that the student can read the question. For monolingual students tested in their first language, that assumption is usually safe and easy to overlook. For the millions of students worldwide who learn and are examined in a second or third language, it is not safe at all.

Why the language of a test is not a neutral background detail

The problem is structural rather than incidental. A word problem in mathematics, or a multi-clause question in biology, requires the student to do two things at once. First, decode the language of the question and build an accurate mental model of what is being asked. Second, apply the relevant scientific or mathematical reasoning. When the language of the item is unfamiliar, the first task consumes cognitive resources that the student would otherwise spend on the second. The result is a lower score that reflects, in part, a reading difficulty rather than a gap in subject understanding (Abedi & Lord, 2001; Schleppegrell, 2004).

This matters for Ethiopia in a specific and immediate way. Most Ethiopian children begin school in their mother tongue, transition to English-medium instruction at some point in the primary or secondary years depending on region, and ultimately sit the Ethiopian Secondary School Leaving Certificate Examination (ESSLCE) in English. The ESSLCE is now delivered as a computer-based test (CBT). The question this review examines is straightforward to state and consequential to answer: when an Ethiopian student scores poorly on an English-medium STEM item, how much of that score reflects what they know about science, and how much reflects the language the question happened to be written in?

This is the central question of assessment validity. A test is valid to the extent that it measures the thing it claims to measure. If a chemistry item is partly measuring English reading proficiency, then it is, to that extent, an invalid measure of chemistry. The literature reviewed here suggests the language contribution is neither small nor confined to a few unusual cases.

What the Ethiopian evidence shows

Ethiopia is, in research terms, an unusually informative setting. Because different regions transitioned to English-medium instruction at different grades, and because the country has run national learning assessments for two decades, researchers have been able to compare achievement across regions and language regimes rather than relying on a single school or cohort.

The most influential body of evidence comes from the national study of medium of instruction commissioned by the Ethiopian Ministry of Education and conducted by Heugh and colleagues (Heugh, Benson, Bogale & Yohannes, 2007). The study team visited regions across the country, observed classrooms, surveyed teachers, and analysed results from the national learning assessments. Their central finding was consistent and quantitative. Across biology, mathematics, chemistry and physics, the average score was 42.3 percent in regions where instruction was delivered in the mother tongue, against 36 percent where it was not. In Grade 8 biology specifically, the gap widened to nearly 11 points. Biology is worth noting here, because it is the most vocabulary-dense of the school sciences, a pattern we will return to.

Figure 1 — Ethiopian national learning assessment: STEM scores by medium of instruction

Legend: mother-tongue medium; English / non-mother-tongue medium.

Mother-tongue settings outperform English-medium settings. Average score across biology, mathematics, chemistry and physics, and the wider gap observed in Grade 8 biology. Data: Heugh et al. (2007), analysis of Ethiopian national learning assessments.

More recent work has confirmed the direction of the effect with stronger identification. A 2024 study published in the International Journal of Educational Development, drawing on empirical data from Ethiopian schools, found that teaching and assessing Grade 7 and 8 students through English rather than their mother tongue reduced mathematics test scores by roughly 0.2 to 0.25 standard deviations. An effect of that size is educationally meaningful: it is comparable to a substantial fraction of a year’s expected learning growth. The same study reported a finding that is, at first glance, counterintuitive but turns out to be common in this literature. Students taught through English did not score any higher on English-language tests than students taught through their mother tongue. The expected payoff of earlier and heavier English exposure did not appear, while the cost in mathematics did.

The study also found that the timing of the transition matters. An early, well-supported transition to English carried positive effects for English outcomes without significantly harming mathematics. A late transition, by contrast, was associated with weaker mathematics outcomes and no offsetting English gain. This nuance is important for policy: the evidence does not argue against English, which is indispensable for Ethiopian students’ access to higher education and the labour market. It argues that the sequencing and support of the language transition shape how much STEM learning is lost along the way.

The UNICEF case study of language and learning in Ethiopia (UNICEF, 2016) reaches a compatible conclusion and emphasises that the policy is implemented unevenly across regions. That unevenness is itself a source of inequity, because two students with the same underlying ability can receive systematically different scores depending on where they happened to attend primary school.

The pattern is not unique to Ethiopia

A finding from a single country, even a well-designed one, invites the question of whether something specific to that setting is driving the result. The strongest reason to take the Ethiopian evidence seriously is that the same pattern recurs across education systems that share little except the structural feature of testing students in a non-home language.

South Africa offers the closest comparison. Howie’s analyses of the Trends in International Mathematics and Science Study (TIMSS) found that the language most frequently spoken in a pupil’s home was the single strongest predictor of science achievement, ahead of many factors commonly assumed to dominate (Howie, 2003). Pupils who usually spoke the language of the test at home scored on the order of 60 points higher than pupils who seldom did, a gap larger than that produced by many resource-based variables. Later work using TIMSS 2015 data confirmed that the alignment between home language and test language predicted achievement in both mathematics and science, and that the effect was stronger in science.

The OECD’s Programme for International Student Assessment (PISA) shows the same signal at global scale. An analysis of PISA 2009 found that students who spoke the test language at home substantially outperformed those who did not, with an effect size around Cohen’s d = 0.69 in reading, a large effect by conventional standards. Because PISA is administered across dozens of countries with different languages and curricula, a consistent test-language effect across that variation is difficult to explain by anything other than language itself.
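For readers less familiar with standardised effect sizes, the sketch below shows how a Cohen's d of this kind is computed: the difference between two group means divided by their pooled standard deviation. The function name and the scores are our own illustrative assumptions; the numbers are invented and are not PISA data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Invented illustrative scores, NOT data from PISA or any study cited here.
home_language = [52, 48, 55, 60, 47, 51]
other_language = [45, 50, 42, 49, 44, 46]

d = cohens_d(home_language, other_language)
print(f"Cohen's d = {d:.2f}")
```

A positive value means the first group scored higher on average. Because d is expressed in standard-deviation units, it can be compared across tests that use different score scales.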

Figure 2 — The home-language advantage in international assessments

Students tested in the language they speak at home score substantially higher. Score-point advantage on TIMSS science (South Africa) and PISA 2009 reading for home-language speakers of the test language. Data: Howie (2003); PISA 2009 test-language analysis (2015).

Research in the United States, focused on students classified as English learners, fills in the mechanism. A large-scale study of elementary science achievement found that multilingual learners began kindergarten roughly 0.59 standard deviations behind their English-only peers in science, a gap more than twice the size of the corresponding gap in mathematics and reading, and that the science gap narrowed as English proficiency developed across the elementary years (Curran et al., 2024). The fact that the gap is largest in science, and that it closes as language develops rather than as science instruction accumulates, is strong evidence that the gap is substantially linguistic in origin.

Figure 3 — The gap for multilingual learners is largest in science

Science carries the heaviest language load. Achievement gap between multilingual learners and English-only peers at kindergarten entry, in standard deviations. The science gap is more than double the mathematics and reading gaps. Data: Curran et al. (2024).

A 2024 systematic review of bilingualism and mathematical performance reached a similar synthesis: performance in mathematics among bilingual students depends jointly on language proficiency, the language used in the test, and the linguistic structure of the items themselves. And a 2024 study of multilingual assessment in science found that linguistic accommodations, changes to how items are presented rather than what they assess, measurably affected science achievement (Frontiers in Communication, 2024).

Four independent settings, four different methodologies, one direction of effect. That convergence is what gives the conclusion its weight.

Why science is hit hardest

Across nearly every study reviewed here, the language penalty is larger in science than in mathematics, and larger in mathematics word problems than in tasks that involve little reading. This ordering is not random, and understanding it is useful for anyone designing or interpreting STEM assessments.

The explanation lies in the distinction, introduced by Cummins (1979), between basic interpersonal communicative skills and cognitive academic language proficiency, often abbreviated BICS and CALP. Conversational fluency, the language of the playground and the marketplace, develops relatively quickly, often within two years of exposure. Academic language proficiency, the language of textbooks, examinations and formal reasoning, develops far more slowly, typically over five to seven years. A student can sound fluent in English and still lack the academic register that examinations demand. This gap between apparent fluency and academic readiness is one of the most consistently misjudged features of second-language education.

Science sits at the far end of the academic-language spectrum. A biology or chemistry item asks the student to handle a dense layer of technical terms such as “photosynthesis”, “covalent” and “homeostasis”, many of which have no everyday equivalent and must be learned as new concepts and new words simultaneously. Schleppegrell’s analysis of the language of schooling describes how scientific text also relies on grammatical features that rarely appear in conversation: long noun phrases that compress whole processes into a single subject, passive constructions, embedded clauses, and logical connectors such as “however”, “therefore” and “whereas” that carry the reasoning of the sentence (Schleppegrell, 2004). Mathematics word problems carry a lighter but still significant load, while a bare computation carries almost none. The penalty tracks the linguistic load of the subject, which is exactly what the cross-subject pattern shows.

This also explains why the penalty falls unevenly across students. The work of Abedi and Lord on linguistically modified mathematics items found that reducing the linguistic complexity of a question raised the scores of English learners and lower-achieving students, while leaving the scores of higher-achieving native English speakers essentially unchanged (Abedi & Lord, 2001; Abedi, 2006). The modified and unmodified items tested the same mathematics. The only thing that changed was the language, and the only students who benefited were the ones for whom language was a barrier. This is close to a clean demonstration that, for some students, a portion of the original score gap was a language artefact rather than a mathematics deficit.

What this means for the ESSLCE and computer-based testing

The ESSLCE is the gateway examination of the Ethiopian education system. It determines university placement, programme of study, and to a large degree the trajectory of a young person’s working life. It is administered in English, and Ethiopia has moved its delivery to a computer-based format. Both facts interact with the evidence above.

The first implication concerns validity. If English-medium STEM items carry a measurable language penalty, then ESSLCE STEM scores are, for a meaningful share of students, a blend of subject knowledge and English reading proficiency. For a student who is strong in physics but still developing academically in English, the score understates physics ability. The university-placement decision that follows is then based on a partly mismeasured quantity.

The second implication concerns equity. The language penalty is not distributed evenly. It is heaviest for students from regions where the transition to English came late or with little support, from rural schools with fewer English-proficient teachers, and from lower-income households with less out-of-school exposure to English. These are, broadly, the same students already facing the steepest disadvantages. A language-loaded examination does not merely add noise to the scores; it adds noise that is correlated with background, which means it can widen existing gaps rather than measure across them fairly.

The move to computer-based testing introduces a separate consideration. CBT changes how a student reads and navigates an examination: reading from a screen, scrolling, and moving between items differ from working with paper. For a student already managing the cognitive load of an English-medium science item, an unfamiliar interface adds a further demand on the same limited attention. The remedy is well established and practical: genuine familiarisation with the CBT format before the high-stakes sitting, so that the interface is automatic and the student’s attention is free for the science. The language penalty and the interface penalty are separate problems, but they compete for the same cognitive budget, and both are worth reducing.

None of this is an argument against English in the ESSLCE. English is the language of Ethiopian higher education and of much of the modern economy, and competence in it is a legitimate goal of secondary schooling. The argument is narrower and, we think, harder to dispute: the examination should measure science and mathematics as accurately as possible, and unnecessary linguistic complexity in the items works against that goal without serving any defensible purpose.

Practical recommendations: design the language out of the way

The most encouraging part of this literature is that the largest single lever is also the most controllable. It is not the language of the test, which is set by policy, but the linguistic complexity of the individual items, which is set by whoever writes them.

Reduce non-essential linguistic complexity

Keep technical vocabulary, because that is the science, but simplify everything around it. Use short sentences. Prefer the active voice. Replace low-frequency non-technical words with common ones. Break long, multi-clause questions into shorter statements. The difficulty of the item should lie entirely in the science, where it belongs.
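As a rough illustration of how these rules could support a first-pass screen, the sketch below flags sentences that are long, look passive, or stack several subordinating words. Everything in it is an assumption made for illustration: the 20-word threshold, the word list, and the crude passive-voice pattern are ours, not values from the literature, and a heuristic like this can only supplement a human reviewer.

```python
import re

MAX_WORDS = 20  # assumed threshold, chosen for illustration only
# Crude passive heuristic: a form of "be" followed by a word ending in -ed or -en.
PASSIVE = re.compile(
    r"\b(?:is|are|was|were|be|been|being)\s+\w+(?:ed|en)\b", re.IGNORECASE
)
# Assumed, incomplete list of words that often introduce embedded clauses.
SUBORDINATORS = {"which", "whereas", "although", "whereby", "that", "because", "while"}


def flag_item(text):
    """Return (sentence, reasons) pairs for sentences that may need review."""
    flags = []
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)
        reasons = []
        if len(words) > MAX_WORDS:
            reasons.append(f"long sentence ({len(words)} words)")
        if PASSIVE.search(sentence):
            reasons.append("possible passive voice")
        clause_words = sum(1 for w in words if w.lower() in SUBORDINATORS)
        if clause_words >= 2:
            reasons.append(f"{clause_words} subordinating words")
        if reasons:
            flags.append((sentence, reasons))
    return flags


# Hypothetical item stems written for this example.
original = (
    "The rate at which oxygen is released by a plant that is exposed to light "
    "was measured by a student, who recorded the values in a table."
)
revised = (
    "A student exposed a plant to light. "
    "The student measured how fast the plant released oxygen."
)

for sentence, reasons in flag_item(original):
    print(reasons)
print(flag_item(revised))
```

The two stems ask about the same science. Only the first is flagged, which is the intended behaviour: the technical content is untouched and the surrounding language is what gets reviewed.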

Separate the reading load from the reasoning load

A question that buries a straightforward calculation inside a long narrative is testing reading comprehension as much as mathematics. If the narrative is not part of the construct, trim it.

Make plain-language item review a standard step

Before an item enters a bank, it should be read by someone whose explicit task is to flag unnecessary linguistic difficulty. This is inexpensive and catches problems that subject-matter experts, fluent in the academic register, tend not to notice.

Provide bilingual support in practice, not in the high-stakes test

During preparation, key-term glossing or bilingual definitions of technical vocabulary can help students attach English scientific terms to concepts they may already understand in their mother tongue. This builds the academic register the examination will eventually demand.

Treat CBT familiarisation as part of test preparation

Repeated exposure to the computer-based format before the real examination removes the interface as a competing demand, freeing attention for the content.

Use item analysis to find language-loaded questions empirically

When practice data is available, items on which students of similar overall ability but different language backgrounds perform very differently are candidates for linguistic review. Differential item functioning analysis is the formal tool for this and is worth building into any serious item-development cycle.
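One widely used differential item functioning statistic is the Mantel-Haenszel common odds ratio, which compares two groups within strata of similar overall ability. The sketch below is a minimal version under stated assumptions: the group labels, the two ability strata and all counts are invented for illustration, and a real analysis would add a significance test and finer score strata.

```python
from collections import defaultdict
from math import log


def mantel_haenszel_dif(responses):
    """Mantel-Haenszel DIF for one item.

    responses: iterable of (group, stratum, correct), group in {"ref", "focal"}.
    Returns (common odds ratio, MH D-DIF on the delta scale).
    """
    # Per stratum: [correct, incorrect] counts for each group.
    tables = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for group, stratum, correct in responses:
        tables[stratum][group][0 if correct else 1] += 1

    num = den = 0.0
    for cell in tables.values():
        a, b = cell["ref"]    # reference group: correct, incorrect
        c, d = cell["focal"]  # focal group: correct, incorrect
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if num == 0 or den == 0:
        raise ValueError("not enough variation to estimate the odds ratio")

    alpha = num / den
    # Negative delta: the item is harder for the focal group at equal ability.
    return alpha, -2.35 * log(alpha)


def expand(group, stratum, n_correct, n_incorrect):
    """Turn counts into one record per student response."""
    return [(group, stratum, True)] * n_correct + [(group, stratum, False)] * n_incorrect


# Invented counts for a single item, NOT real Atenu or ESSLCE data.
responses = (
    expand("ref", "low", 6, 4) + expand("focal", "low", 3, 7)
    + expand("ref", "high", 9, 1) + expand("focal", "high", 6, 4)
)
alpha, delta = mantel_haenszel_dif(responses)
print(f"odds ratio = {alpha:.2f}, MH D-DIF = {delta:.2f}")
```

Here "ref" might be students whose schooling language matches the test language and "focal" those for whom it does not, with strata formed from total score on the other items. An odds ratio well above 1 marks an item worth sending to linguistic review; it does not by itself show that language is the cause.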

A research agenda for Atenu Lab

This review synthesises existing evidence. It does not, on its own, produce new evidence specific to the Ethiopian secondary STEM context under the current CBT regime, and that gap is worth naming honestly. The literature on Ethiopia is strongest at the primary level and predates the CBT transition. The natural next step for Atenu Lab is a study designed to measure the language penalty directly in our own setting.

The design we propose is a within-subjects item experiment. A sample of STEM items would be authored in matched sets of three, each set testing the identical concept at the same difficulty. One version is presented in English as such items are currently written. A second is presented in a major Ethiopian language of schooling, such as Amharic or Afaan Oromo, with technical terms handled consistently. A third, English with reduced linguistic complexity along the Abedi and Lord lines, would let us separate the effect of language from the effect of complexity. Each student on the Atenu platform would be randomly assigned one version of each item, so that every student meets all three versions across the item set. The resulting score differences, analysed with differential item functioning methods, would give a direct, local estimate of how much of an English-medium STEM score is language rather than science.
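One possible way to operationalise the random assignment is a rotated, counterbalanced plan in which each student sees every item once and meets the three versions equally often. The sketch below is an assumption about how this could be done, not a description of the Atenu platform; the function name, version labels and identifiers are hypothetical.

```python
import random

# Hypothetical labels for the three versions of each item.
VERSIONS = ("english", "mother_tongue", "reduced_english")


def assign_versions(student_ids, item_ids, seed=0):
    """Return {student: {item: version}} with versions balanced per student.

    Each student gets a random starting version and a random item order, then
    versions rotate through that order. When the number of items is a multiple
    of three, every student sees each version equally often.
    """
    rng = random.Random(seed)  # seeded so the plan can be reproduced
    plan = {}
    for student in student_ids:
        offset = rng.randrange(len(VERSIONS))
        order = list(item_ids)
        rng.shuffle(order)
        plan[student] = {
            item: VERSIONS[(offset + position) % len(VERSIONS)]
            for position, item in enumerate(order)
        }
    return plan


students = ["s1", "s2", "s3"]
items = [f"item{i}" for i in range(6)]
plan = assign_versions(students, items, seed=1)
for student, assignment in plan.items():
    print(student, assignment)
```

Shuffling the item order per student means that which item a student receives in which version is random, while the rotation keeps the within-student comparison balanced. No student sees the same item in two versions, which avoids practice effects on that item.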

Conclusion

The language a test is written in is a design decision, not a neutral backdrop. The evidence assembled here, from Ethiopia’s own national assessments, from South Africa’s TIMSS results, from PISA across dozens of countries, and from controlled item experiments, points in a single direction. Assessing STEM subjects in a language students have not fully mastered lowers their measured achievement, the effect is largest in the most vocabulary-dense subjects, and it falls hardest on the students who are already most disadvantaged.

For Ethiopia, with a high-stakes English-medium examination now delivered by computer, this is a question of measurement fairness that deserves deliberate attention. The reassuring conclusion is that the most powerful response is also the most achievable. We cannot quickly change which language the ESSLCE is written in, but item writers can, starting today, stop making questions harder than the science requires. A fairer measure of what Ethiopian students know in science and mathematics begins with the language of a single test item.

References

Abedi, J. (2006). Language issues in item development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of Test Development. Lawrence Erlbaum Associates.
Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in Education, 14(3), 219–234.
Cummins, J. (1979). Linguistic interdependence and the educational development of bilingual children. Review of Educational Research, 49(2), 222–251.
Cummins, J. (2000). Language, Power and Pedagogy: Bilingual Children in the Crossfire. Multilingual Matters.
Curran, F. C., Pacheco, M. B., Boza, L., Harris-Walls, K., Tan, T., & Deig, A. (2024). Multilingual learners and elementary science achievement: Exploring trends and heterogeneity across subgroups. AERA Open.
Heugh, K., Benson, C., Bogale, B., & Yohannes, M. A. G. (2007). Final report: Study on medium of instruction in primary schools in Ethiopia. Commissioned by the Ministry of Education, Ethiopia.
Howie, S. J. (2003). Language and other background factors affecting secondary pupils' performance in mathematics in South Africa. African Journal of Research in Mathematics, Science and Technology Education, 7(1), 1–20.
English medium instruction in multilingual contexts: Empirical evidence from Ethiopia. (2024). International Journal of Educational Development, 105.
On the relationship between bilingualism and mathematical performance: A systematic review. (2024). Education Sciences, 14(11), 1172.
Schleppegrell, M. J. (2004). The Language of Schooling: A Functional Linguistics Perspective. Lawrence Erlbaum Associates.
Solano-Flores, G. (2008). Who is given tests in what language by whom, when, and where? Educational Researcher, 37(4), 189–199.
Test language effect in international achievement comparisons: An example from PISA 2009. (2015).
The differing effect of language factors on science and mathematics achievement using TIMSS 2015 data: South Africa. (2018). Research in Science Education.
The dynamics of multilingual assessment: Exploring the impact of linguistic accommodations on science achievement. (2024). Frontiers in Communication.
UNICEF. (2016). Language and Learning: The Ethiopia Case Study.