NAPLAN writing test

AERO’s writing report is causing panic. It’s wrong. Here’s why.

October 24, 2022 | Australian Education Research Organisation, NAPLAN, teaching writing | AARE blog, AERO, Australian Educational Research Organisation, James Ladwig, NAPLAN, NAPLAN writing test

If ever there was a time to question public investment in developing reports using ‘data’ generated by the National Assessment Program, it is now, with the release of the Australian Education Research Organisation’s report ‘Writing development: What does a decade of NAPLAN data reveal?’

I am sure the report was meant to provide reliable diagnostic analysis for improving the function of schools.

It doesn’t. Here’s why.

There are deeply concerning technical questions about both the testing regime which generated the data used in the current report, and the functioning of the newly created (and arguably redundant) office which produced this report.

There are two lines of technical concern which need to be noted. These concerns reveal reasons why this report should be disregarded – and why the media response is a beat-up.

The first technical concern for all reports of NAPLAN data (and any large scale survey or testing data) is how to represent the inherent fuzziness of estimates generated by this testing apparatus.

Politicians and almost anyone outside of the very narrow fields reliant on educational measurement would like to talk about these numbers as if they are definitive and certain.

They are not. They are just estimates, as are all of the summary statistics in these reports.

The fact these are estimates is not apparent in the current report. There is NO presentation of any of the estimates of error in the data used in this report.

Sampling error is important and, as ACARA itself has noted (see, eg, the 2018 NAPLAN technical report), must be taken into account when comparing the different samples used for analyses of NAPLAN. This form of error is the estimate used to generate confidence intervals and calculations of ‘statistical difference’.

Readers who recall seeing survey results or polling estimates being represented with a ‘plus or minus’ range will recognise sampling error.

Sampling error is a measure of the probability of getting a similar result if the same analyses were done again, with a new sample of the same size, with the same instruments, etc. (I probably should point out that the very common way of expressing statistical confidence often gets this wrong – when we say we have X level of statistical confidence, that isn’t a percentage of how confident you can be with that number, but rather the likelihood of getting a similar result if you did it again.)
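As a minimal sketch of the ‘plus or minus’ idea, the snippet below computes a standard error and an approximate 95% interval for a mean. The scores are invented for illustration and the function name is hypothetical. It assumes simple random sampling and a normal approximation, which is a simplification and not the procedure used in NAPLAN's own technical reports.

```python
import math
import statistics

def mean_with_interval(scores, z=1.96):
    """Return (mean, standard error, lower, upper) for a simple random sample.

    z=1.96 gives an approximate 95% interval under a normal approximation.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(scores)
    # The standard error of the mean shrinks with the square root of n.
    se = sd / math.sqrt(n)
    return mean, se, mean - z * se, mean + z * se

# Invented scores, for illustration only.
sample = [520, 548, 561, 530, 575, 542, 558, 533, 566, 549]
mean, se, low, high = mean_with_interval(sample)
print(f"mean = {mean:.1f}, plus or minus {1.96 * se:.1f} ({low:.1f} to {high:.1f})")
```

Reporting the interval alongside the mean is the kind of presentation the paragraph above has in mind.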

In this case, we know about 10% of the population do not sit the NAPLAN writing exam, so we already know there is sampling error.

This is also the case when trying to infer something about an entire school from the results of a couple of year levels. The problem here is that we know the sampling error introduced by test absences is not random, and accounting for it can very much change trend analyses, especially for sub-populations.

So, what does this persuasive writing report say about sampling error?

Nothing. Nada. Zilch. Zero.
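To make the point about non-random absences concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the cohort is invented, and the second scenario (the lowest-scoring 10% being the ones absent) is deliberately extreme and is not a claim about who actually misses NAPLAN.

```python
import random
import statistics

random.seed(0)

# Hypothetical cohort of 10,000 students with invented "true" scores.
cohort = [random.gauss(550, 70) for _ in range(10_000)]
full_mean = statistics.fmean(cohort)

# Case 1: 10% are absent completely at random.
random_sitters = random.sample(cohort, k=9_000)

# Case 2: the lowest-scoring 10% are the ones who are absent.
nonrandom_sitters = sorted(cohort)[1_000:]

random_bias = statistics.fmean(random_sitters) - full_mean
nonrandom_bias = statistics.fmean(nonrandom_sitters) - full_mean

print(f"bias with random absence:     {random_bias:+.2f}")
print(f"bias with non-random absence: {nonrandom_bias:+.2f}")
```

With random absence the mean of those who sit the test stays close to the cohort mean. With patterned absence it is pushed upward, and if the pattern changes from year to year the apparent trend changes with it.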

Anyone who knows basic statistics knows that when you have very large samples, the amount of error is far less than with smaller samples. In fact, with samples as large as we get in NAPLAN reports, it would take only a very small difference to create enough ripples in the data to show up as being statistically significant. That doesn’t mean, however, the error introduced is zero – and THAT error must be reported when representing mean differences between different groups (or different measures of the same group).
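A rough sketch of why very large samples make tiny differences ‘statistically significant’ is below. The 2-point gap and the standard deviation of 70 are invented values, and the function assumes two independent groups of equal size with a common standard deviation.

```python
import math

def z_for_difference(diff, sd, n_per_group):
    """z statistic for a difference between two independent group means,
    assuming equal group sizes and a common standard deviation."""
    se_diff = sd * math.sqrt(2 / n_per_group)
    return diff / se_diff

# Invented values: a 2-point gap on a scale with a standard deviation of 70.
for n in (100, 1_000, 10_000, 100_000):
    z = z_for_difference(diff=2, sd=70, n_per_group=n)
    flag = "significant" if abs(z) > 1.96 else "not significant"
    print(f"n per group = {n:>7}: z = {z:.2f} ({flag} at the 5% level)")
```

The same 2-point gap moves from undetectable to clearly ‘significant’ purely because the sample grows, which is why the size of the error still needs to be reported next to the difference.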

Given the size of the sampling here, you might think it ok to let that slide. However, that isn’t the only short cut taken in the report. The second most obvious measure ignored in this report is measurement error. Measurement error exists any time we create some instrument to estimate a ‘latent’ variable – ie something you can’t see directly. We can’t SEE achievement directly – it is an inference based on measuring several things we can theoretically argue are valid indicators of that thing we want to measure.

Measurement error is by no means a simple issue, but it directly impacts the validity of any one individual student’s NAPLAN score and of any aggregate based on those individual results. In ‘classical test theory’ a measured score is made up of what is called a ‘true score’ and error (+/-). In more modern measurement theories error can become much more complicated to estimate, but the general conception remains the same. Any parent who has looked at NAPLAN results for their child and queried whether or not the test is accurate is implicitly questioning measurement error.
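In classical test theory terms, the size of that error for an individual score is usually summarised as the standard error of measurement, SEM = SD × √(1 − reliability). The sketch below applies that formula. The standard deviation of 70 and the reliability of 0.90 are invented values, and placing a band around an observed score in this way is a common simplification.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1 - reliability)

def score_band(observed, sd, reliability, z=1.96):
    """Approximate 95% band around one observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem

# Invented values: scale SD of 70 and a reliability of 0.90.
sem = standard_error_of_measurement(sd=70, reliability=0.90)
low, high = score_band(observed=550, sd=70, reliability=0.90)
print(f"SEM = {sem:.1f}; a score of 550 is roughly {low:.0f} to {high:.0f}")
```

Even with a reliability that would usually be regarded as high, the band around a single student's score is wide relative to the year-to-year differences being discussed.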

Educational testing advocates have developed many very mathematically complicated ways of dealing with measurement error – and have developed new testing techniques for improving their tests. The current push for adaptive testing is precisely one of those developments. In the local case it is rationalised on the grounds that adaptive testing (where the specific test item asked of the person being tested changes depending on prior answers) does a better job of differentiating those at the top and bottom ends of the scoring range (see the 2019 NAPLAN technical report for this analysis).

This bottom/top of the range problem is referred to as a floor or ceiling effect. When a large proportion of students either don’t score anything or get everything correct, there is no way to differentiate those students from each other – adaptive testing is a way of dealing with floor and ceiling effects better than a predetermined set of test items. This adaptive testing has been included in the newer deliveries of the online form of the NAPLAN test.

Two important things to note.

First, the current report claims the scores of high ‘performing’ students have shifted down – despite new adaptive testing regimes obtaining very different patterns of ceiling effect. Second, the tests are not identical for all students (they never have been).

The process used for selecting test items is based on ‘credit models’ generated by testers. Test items are determined to have particular levels of ‘difficulty’ based on the probability of correct answers being given from different populations and samples, after assuming population level equivalence in prior ‘ability’ AND creating difficulty scores for items while assuming individual student ‘ability’ measures are stable from one time period to the next. That’s how they can create these 800 point scales that are designed for comparing different year levels.

So what does this report say about any measurement error that may impact the comparisons they are making? Nothing.

One of the ways ACARA and politicians have settled their worries about such technical concerns as accurately interpreting statistical reports is by introducing the reporting of test results in ‘Bands’. Now these bands are crucial for qualitatively describing rough ranges of what the number might mean in curriculum terms – but they come with a big consequence. Using ‘Band’ scores is known as ‘coarsening’ data – when you take a more detailed scale and summarise it in a smaller set of ordered categories – and that process is known to increase any estimates of error. This latter problem has received much attention in the statistical literature, with new procedures being recommended for how to adjust estimates to account for that error when conducting group comparisons using that data.
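The effect of coarsening can be sketched with a small simulation. All values here are assumptions for illustration: the true scores and noise are invented, the band width of 52 points is a hypothetical constant, and replacing each score with its band midpoint is only one simple way of working with banded data.

```python
import math
import random

random.seed(1)

BAND_WIDTH = 52  # hypothetical band width in scale points

def to_band_midpoint(score, width=BAND_WIDTH):
    """Replace a score with the midpoint of the band it falls in."""
    return (math.floor(score / width) + 0.5) * width

def rmse(estimates, truths):
    """Root mean squared error of estimates against the true values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths))

# Invented data: true scores plus measurement noise.
true_scores = [random.gauss(550, 70) for _ in range(20_000)]
observed = [t + random.gauss(0, 20) for t in true_scores]
banded = [to_band_midpoint(o) for o in observed]

print(f"RMSE of scale scores vs true:   {rmse(observed, true_scores):.1f}")
print(f"RMSE of band midpoints vs true: {rmse(banded, true_scores):.1f}")
```

The banded version carries the original measurement noise plus the extra error introduced by rounding every student to a category, which is the additional error the paragraph above says needs to be accounted for.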

As before, the amount of reporting of that error issue? Nada.

This measurement problem is not something you can ignore – and yet the current report is worse than careless on this question.

It takes advantage of readers not knowing about it.

When the report attempts to diagnose which components of the persuasive writing task were of most concern, it does not bother reporting that each of the separate measures of those ten dimensions of writing has far more error than the total writing score, simply because the number of marks for each is a fraction of the total. The smaller the number of indicators, the more error (and less reliability).
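One standard way to see how reliability falls when a score rests on fewer marks is the Spearman-Brown formula. The sketch below is an illustration under strong assumptions: the total-score reliability of 0.90 is invented, and the formula treats the ten parts as equal-length, parallel pieces, which the actual writing criteria are not.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened or shortened.

    length_factor is new length divided by original length,
    so 0.1 means one tenth of the original marks.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Invented value: a total score with reliability 0.90, split into ten equal parts.
total_reliability = 0.90
part_reliability = spearman_brown(total_reliability, length_factor=1 / 10)
print(f"total: {total_reliability:.2f}, one tenth of the marks: {part_reliability:.2f}")
```

Under these assumptions a total score that looks dependable yields component scores with less than half the reliability, so trends in individual criteria need their own error estimates.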

Now all of these technical concerns simply raise the question of whether or not the overall findings of the report will hold up to robust tests and rigorous analysis – there is no way to assess that from this report – but there is an even bigger reason to question why it was given as much attention as it was. That is, for any statistician, there is always a challenge to translate the numeric conclusions into some form of ‘real life’ scenario.

To explain why AERO has significantly dropped the ball on this last point, consider its headline claim that Year 9 students have had declining persuasive writing scores, and its representation of that as a major new concern.

First note that the ONLY reporting of this using the actual scale values is a vaguely labelled line graph showing scores from 2011 until 2018 – skipping 2016 since the writing task that year wasn’t for persuasive writing (p 26 of the report has this graph). Of those year to year shifts, the only two that may be statistically significant, and are readily visible, are from 2011 to 2012, and then again from 2017 to 2018. Why speak so vaguely? From the report, we can’t tell you the numeric value of that drop, because there is no reporting of the actual number represented in that line graph.

Here is where the final reality check comes in.

If this data matches the data reported in the national reports from 2011 and 2018, the named mean values on the writing scale were 565.9 and 542.9 respectively. So that is a drop between those two time points of 23 points. That may sound like a concern, but recall those scores are based on 48 marks given for writing. In other words, that 23 point difference is no more than one mark difference (it could be far less since each different mark carries a different weighting in formulating that 800 point scale).

Consequently, even if all the technical concerns are sufficiently addressed and the pattern still holds, the realistic title of the Year 9 claim would be ‘Year 9 students in the 2018 NAPLAN writing test scored one mark less than the Year 9 students of 2011.’

Now assuming that 23 point difference has anything to do with the students at all, start thinking about all the plausible reasons why students in that last year of NAPLAN may not have been as attentive to details as they were when NAPLAN was first getting started. I can think of several, not least being the way my own kids did everything possible to ignore the Year 9 test – since the Year 9 test had zero consequences for them.

Personally, I find these reports troubling for many reasons, including the use of statistics to assert certainty without good justification, but also because saying student writing has declined belies the obvious fact that it hasn’t been all that great for decades. This is where I am totally sympathetic to the issues raised by the report – we do need better writing among the general population. But using national data to produce a report of this calibre, by an agency beholden to government, really does little more than provide click-bait and knee-jerk diagnosis from all sides of a debate we don’t really need to have.

James Ladwig is Associate Professor in the School of Education at the University of Newcastle. He is internationally recognised for his expertise in educational research and school reform. Find James’ latest work in Limits to Evidence-Based Learning of Educational Science, in Hall, Quinn and Gollnick (Eds) The Wiley Handbook of Teaching and Learning published by Wiley-Blackwell, New York. James is on Twitter @jgladwig

AERO’s response to this post

ADDITIONAL COMMENTS FROM AERO provided on November 9: For more information about the statistical issues discussed, a more detailed Technical Note is available from AERO.

On Monday, EduResearch Matters published the above post by Associate Professor James Ladwig, which critiqued the Australian Education Research Organisation’s Writing development: what does a decade of NAPLAN data reveal?

AERO’s response is below, with additional comments from Associate Professor Ladwig.

AERO: This article makes three key criticisms about the analysis presented in the AERO report, which are inaccurate.

Ladwig claims that the report lacks consideration of sampling error and measurement error in its analysis of the trends of the writing scores. In fact, those errors were accounted for in the complex statistical method applied. AERO’s analysis used both simple and complex statistical methods to examine the trends. While the simple method did not consider error, the more complex statistical method (referred to as the ‘Differential Item Analysis’) explicitly considered a range of errors (including measurement error, and cohort and prompt effects).

Associate Professor Ladwig: AERO did not include any of that in its report nor in any of the technical papers. There is no over-time DIF analysis of the full score – and I wouldn’t expect one. All of the DIF analyses rely on data that itself carries error (more below). There is no way for the educated reader to verify these claims without expanded and detailed reporting of the technical work underpinning this report. This is lacking in transparency, falls short of the standards we should expect from AERO and makes it impossible for AERO to be held accountable for its specific interpretation of its own results.

AERO: Criticism of the perceived lack of consideration of ‘ceiling effects’ in AERO’s analysis of the trends of high-performing students’ results omits the fact that AERO’s analysis focused on the criteria scores (not the scaled measurement scores). AERO used the proportion of students achieving the top 2 scores (not the top score), for each criterion, as the metric to examine the trends. Given only a small proportion of students achieved a top score for any criterion (as shown in the report statistics), there is no ‘ceiling effect’ that could have biased the interpretation of the trends.

Associate Professor Ladwig made his ‘ceiling effect’ comments while explaining how the NAPLAN writing scores are designed, not in relation to the AERO analysis.

AERO: The third major inaccuracy relates to the comments made about the ‘measurement error’ around the NAPLAN bands and the use of adaptive testing to reduce error. These are irrelevant to AERO’s analysis because the main analysis did not use scaled scores, it did not use bands, and adaptive testing is not applicable to the writing assessment.

Associate Professor Ladwig’s comment was about the scaling in relation to explaining the score development, not about the AERO analysis.

In relation to AERO’s use of NAPLAN criterion score data in the writing analysis, however, please note that those scores are created either through scorer moderation processes or (increasingly where possible) text interpretative algorithms. Here again any treatment of the reliability of these raw scores was absent, with one declared limitation noted, in AERO’s own terms:

Another key assumption underlying most of the interpretation of results in this report is that marker effects (that is, marking inconsistency across years) are small and therefore they do not impact on the comparability of raw scores over time. (p. 66)

This is where AERO has taken another short cut, with an assumption that should not be made. ACARA has reported the reliability estimates to include that in the scores analysis. It is readily possible to report those and use them for trend analyses.

AERO: A final point: the mixed-methods design of the research was not recognised in the article. AERO’s analysis examined the skills students were able to achieve at the criterion level against curriculum documents. Given the assessment is underpinned by a theory of language, we were able to complement quantitative with a qualitative analysis that specifically highlighted the features of language students were able to achieve. This was validated by analysis of student writing scripts.

Associate Professor Ladwig says this is irrelevant to his analysis. The logic of this is also a concern. Using multiple methods and methodologies does not correct for any that are technically lacking. In relation to the overall point of concern, we have a clear example of an agency reporting statistical results in a manner that elides external scrutiny, accompanied by an extreme media positioning. Any of the qualitative insights into the minutiae these numbers represent will probably be very useful for teachers of writing – but whether or not they are generalisable, big, or shifting depends on those statistical analyses themselves.

Surprising findings from new analysis of declining NAPLAN writing test results

November 16, 2020 | NAPLAN | Damon Thomas, decline in writing skills, NAPLAN test modifications, NAPLAN writing test, writing prompts

Despite the considerable annual investments of money and school resources to hold the NAPLAN tests, almost no research has sought to investigate patterns of student achievement in the NAPLAN writing test data over time. I wanted to know what the NAPLAN writing test results tell us about male and female student performance over time.

My research study found Year 9 males write at a similar standard to Year 7 females. There has been a rapid decline in student writing scores for both genders; however, the gap between male and female writing scores widens with every tested year level, to an equivalent of two years of learning by Year 9.

Most significantly I also found that the NAPLAN writing test’s design and the way we implement the test may be factors that make it difficult to trust the test’s outcomes over time.

Writing is a skill that is basic to the economy, to people’s wellbeing, and to their life trajectory. It underpins our activity and experiences in education, science, governance, law, the economy, religion, and cultural life. Writing is essential for the day-to-day operations of most employees across all global industries and services according to the US National Assessment Governing Board. A person’s success in education, the workplace, and broader society is strongly influenced by their capacity to write.

Lack of research into NAPLAN writing data

To ensure that Australian students are developing adequate skills in writing, reading, language conventions, and mathematics for adult life, the Ministerial Council on Education, Employment, Training and Youth Affairs introduced the National Assessment Program – Literacy and Numeracy (NAPLAN) tests in 2008. Since then, over one million students in Years 3, 5, 7, and 9 have completed the tests each year (the test was cancelled in 2020 due to COVID-19). However, almost no research has sought to investigate patterns of student achievement in the NAPLAN writing test data over time.

My research

So, what does the last decade of NAPLAN testing tell us about student writing outcomes? My research drew on the NAPLAN results provided by the Australian Curriculum, Assessment and Reporting Authority (ACARA) in annual NAPLAN reports for 2011-2018. According to ACARA (2016a), “in 2016, the narrative prompt was placed onto the existing persuasive writing scale, creating a NAPLAN writing scale comparable for both genres… [meaning] that the results can be compared and trends analysed in NAPLAN writing data from 2011 onwards but not for results before then” (para. 3). For this reason, my research compared male and female student performance on the writing tests between 2011 and 2018.

I also drew on the Grattan Institute’s Equivalent Year Levels approach, which calculates student progress using a different method to ACARA and which results in a cohort’s equivalent year level rather than the seemingly arbitrary and difficult to interpret numbers in the NAPLAN reports. For example, a NAPLAN achievement score of 536 would equate to an equivalent year level of 7.5 or halfway through Year 7. A cohort’s equivalent year level on the previous NAPLAN test can be subtracted from their equivalent year level on the current test to work out their progress in the two years between NAPLAN tests. If a cohort scored 536 in Year 7 and 548 when tested again in Year 9, they would be performing at the equivalent of a Year 8 standard, making approximately six months of progress in the two years between tests.
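The progress calculation in that example can be written out as a short sketch. The function name is hypothetical, and the mapping from NAPLAN scores to equivalent year levels is not reproduced here. The two year-level values are simply taken from the worked example, with ‘a Year 8 standard’ read as 8.0.

```python
def progress_in_months(eyl_earlier, eyl_later):
    """Progress between two tests, in months, from equivalent year levels."""
    return (eyl_later - eyl_earlier) * 12

# Values from the example in the text: 536 maps to 7.5 and 548 to 8.0.
# The score-to-year-level mapping itself is not reproduced here.
months = progress_in_months(eyl_earlier=7.5, eyl_later=8.0)
print(f"{months:.0f} months of progress over two years")
```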

I used the NAPLAN achievement scores and the equivalent year level approach to provide the first in-depth picture of how male and female students have performed on the writing test between 2011 and 2018.

Figure 1: Year 3 writing achievement by gender, 2011-2018

Figure 2: Year 5 writing achievement by gender, 2011-2018

Figure 3: Year 7 writing achievement by gender, 2011-2018

Figure 4: Year 9 writing achievement by gender, 2011-2018

My findings reveal a clear gender gap in writing outcomes for all four tested year levels. Year 3 male students’ scores were, on average, the equivalent of 8.16 months of learning behind female scores. The gender gap widened across the year levels, to 11.8 months of learning in Year 5, 20.1 months of learning in Year 7, and 24.1 months of learning in Year 9. Despite a considerable gender gap across all tested year levels, writing achievement declined rapidly for both genders over the selected eight years.

Test modifications make a difference

While these results paint a dismal picture of student progress with writing in a decade of testing, my research highlighted four modifications that ACARA has made to the writing test across the years that make it difficult to tell if the tests have been equally challenging for students.

Modification 1: Text type switching

Between 2011 and 2018, four NAPLAN writing tests required students to write narrative texts (stories), while seven required them to write persuasive texts (arguments). This is problematic because educational linguists have shown for decades that writing narratives involves very different linguistic and structural choices than persuasive writing. Even so, ACARA has treated the results of all NAPLAN writing tests as directly comparable, regardless of whether the focus in a given year was narrative or persuasive writing.

Modification 2: Age-appropriate writing prompts

Between 2008 and 2014, students in all year levels responded to one writing prompt each year. Because certain prompts were deemed too challenging for primary students or too simplistic for secondary students, from 2015, ACARA introduced separate, age-appropriate prompts for primary and secondary school students. The move to age-appropriate prompts altered the test conditions, yet scores over time are still treated as directly comparable.

Modification 3: Knowledge of the target genre focus prior to test

From 2008 to 2013, teachers and students were made aware of the genre focus (either narrative or persuasive text) before the test date. Since 2014, ACARA has not revealed the genre focus until the time of the test. The decision to reveal the focus genre at the time of the test aimed to prevent teachers from over-preparing students for one genre of writing; however, knowing the genre prior to the test gave those completing it before 2014 an advantage over those completing it since. Despite this change to the test conditions, ACARA treats all scores as directly comparable.

Modification 4: Shift to online testing

From 2008 to 2017, students completed paper-based writing tests. In 2018, 20% of students completed the test online. The 2018 test results were higher for those who completed it online, but despite this, online test results are directly compared with paper-based results.

Taken together, the modifications made to the NAPLAN writing test raise questions about whether each test has been equally challenging, and therefore whether the decline reported through the NAPLAN annual reports is real.

The future of NAPLAN writing tests

As the future of the NAPLAN writing test is debated, my research highlights two important points. First, any new version of the NAPLAN writing test should be designed and implemented carefully, learning from the current test’s history to avoid the need for modifications that call into question whether scores can be reliably compared over time.

Second, every NAPLAN writing test has found the same concerning gender gap that widens as students progress through school. While comparing NAPLAN writing scores year after year is clearly problematic, any single test on its own can be considered a valid measure of writing achievement for that point in time, so we can say with confidence that the gender gap does exist and widens across the school years.

To understand what is behind the writing gender gap, further research is needed into the personal and environmental factors that influence the writing development of male and female students. If we can understand what is happening, it is more likely we will be able to improve writing outcomes for all students.

Those interested can read more in ‘Rapid decline and gender disparities in the NAPLAN writing data’.

Damon Thomas is a Senior Lecturer in English Education at the University of Tasmania. His PhD investigated the persuasive writing choices made by primary and secondary school students who scored highly on the NAPLAN writing test and critiqued the test’s design. His research interests include reading and writing development and pedagogy, assessment, social semiotics and theories of persuasive communication.