NAPLAN

NAPLAN: Time to think differently

It’s not the results of NAPLAN that are the problem. It is NAPLAN testing itself. These standardised tests contribute to the maintenance of a deeply unequal system. 

The release of NAPLAN results in August prompted an avalanche of responses from politicians, commentators and researchers, all with a take on how to understand the continued ‘declining results’ in the national standardised testing program. 

The federal minister for education Jason Clare responded the morning the results were released, noting the inequities in the system: “There’s about one in ten children who sit these tests that are below what we used to call the minimum standard. But it’s one in three kids from poor families, one in three kids from the bush, one in three Indigenous kids. In other words, your parents’ pay packet, where you live, the colour of your skin affects your chances in life.”

He also said, “The results showed why school funding talks were crucial — not just to supply extra money, but to reform classroom practices.” Jordana Hunter and Nick Parkinson from the Grattan Institute agreed: NAPLAN results laid bare stark inequities within our education system. And “high quality teaching and support” leads to almost all students learning to read competently. 

Other perspectives

But there are other perspectives. Experienced education researcher Jim Tognolini warns “there is only so much ‘growth’ that can occur across one or two years of learning”. And Jennifer Gore, another experienced researcher, argues: “Students are more than their brains … they learn in social and emotional conditions that also need to be addressed.”

Calls for evidence-backed solutions to the problem have also abounded. While it is important not to dismiss the role of evidence in addressing these problems, there is also room to consider how a structural analysis, drawing on different theoretical lenses, might reveal different insights into the problems and point to different possible solutions. 

Take, for example, the decline in the mental health of young people. On the same morning (and on the same radio station) that Jason Clare responded to the NAPLAN results, Patrick McGorry, Executive Director of Orygen and lead author of the Lancet Psychiatry Commission on Youth Mental Health report, revealed findings that the ‘mental health of young people has been declining over the past two decades, signalling a warning that global megatrends and changes in many societies are increasing mental ill health.’

Correlations

It is worth noting that the global trend in standardised testing and comparison also emerged over the last two decades. The correlations among a number of these issues are significant: NAPLAN results have been declining; youth mental health has been declining; school exclusion and refusal have been increasing; the disruptive and distressing effects of global warming have been increasing; global inequality has been increasing; surveillance capitalism has been increasing; and we are currently watching a genocide live streamed to our phones while students and staff are discouraged from talking about it.

When viewed together, these trends point towards a global system of inequity, in which, as noted above, the unequal schooling system in Australia is but one component. This means the inequities in the education system cannot be fixed simply by providing more funding and supporting better quality teaching (although these things are, of course, incredibly important), but require a closer look at the broader system of inequality. And what we find when we look at that broader system is that it requires inequality. 

Capitalism and colonialism, systems that our societies and schools have grown out of and that continue to inform their operations, are based on maintaining gaps between the ‘haves’ and ‘have nots’, between those who ‘succeed’ and those who ‘fail’. 

The system must be reckoned with

While Jason Clare and others might be concerned about such NAPLAN achievement gaps for poor students, those who live rurally and First Nations students, the system that produces these gaps must be reckoned with if a solution is to be found. 

Anthropologist Jason Hickel, based in Barcelona, points out ‘capitalism is predicated on surplus extraction and accumulation; it must take more from labour and nature than it gives back…such a system necessarily generates inequalities and ecological breakdown.’ 

Further, he notes ‘what makes capitalism distinctive, and uniquely problematic, is that it is organised around, and dependent on, perpetual growth.’ And he shows how this perpetual growth has relied for centuries on colonial appropriation of land and resources, enclosure, enslavement and exploitation, and the cheapening of labour. 

This is the system that schooling sits within. Thus the children of the white, wealthy, urban families Jason Clare points to, who achieve on the NAPLAN test, demonstrate the colonial, capitalist system working as it is designed to.

Unequal by design

US education researcher Wayne Au argues high-stakes testing (such as NAPLAN) is unequal by design and operates to standardise inequality. Au explores how ‘the data produced by the tests are used as the metric for determining value, which in turn is used for comparison and competition in the educational marketplace.’ He also outlines how high-stakes, standardised tests ‘perpetuate institutionalized racism and white supremacy, and they are functionally weaponized against working-class communities of color’.

This leads, therefore, to a situation in which it’s not so much about which evidence-based teaching strategies are working but about which schools in the unequal market system have the capacity to extract test results from students that produce the greatest market value. That is, the results that get recorded on the MySchool website and enable a school to market itself as higher achieving. The recording of other attributes of the school community on the MySchool website also contributes to the institutionalised racism and classism that Au outlines.

I’d argue this then gives some schools more power in the market system and allows them to accumulate surpluses. Surpluses might take the form of more teachers wanting to teach at a school, which can lead to smaller class sizes (particularly during the teacher shortage), or more students wanting to attend, which leads to greater resourcing when resources are attached to student enrolments. More research is needed to understand this phenomenon.

Organised abandonment

Through these processes of extraction and accumulation, violence also occurs. North American theorist Ruth Wilson Gilmore argues that certain racialized and impoverished communities are subject to ‘group differentiated vulnerability to premature death.’ In other words, the capitalist state deliberately under-resources particular groups so that they are more vulnerable to premature death. Wilson Gilmore calls these practices ‘organised abandonment.’ 

There are welcome calls for better funding of disadvantaged schools. But the long-standing practice of under-funding public schools in poorer communities in Australia is an example of organised abandonment, and it entrenches inequality in ways that increased funding alone will have little chance of shifting.

A different possible solution to the problem is to abolish standardised testing and the MySchool website and undo the market-based system of schooling. If we are serious about addressing academic achievement, mental wellbeing, poverty, racial discrimination and global warming we must build an education system that is anti-colonial and anti-capitalist. This requires abolishing harmful systems of competition, extraction, accumulation and corporate growth and investing in systems of deep care for, in Jason Hickel’s words, ‘human needs (use-value) through de-accumulation, de-enclosure and de-commodification.’

 Sophie Rudolph is a senior research fellow at the Faculty of Education, University of Melbourne. Her research and teaching involves sociological and historical analyses and is informed by critical theories. She is currently working on a DECRA project investigating the history and politics of racialised school discipline and exclusion in Victoria.

NAPLAN: There is no need to panic

Jim Tognolini: What do the results really mean?

Every year when results for large-scale tests such as NAPLAN are released, there is a need to remind parents – and people in general – to reflect judiciously on what the results really mean. 

It is also very important to address the misconceptions that are promulgated by journalists who start off with a preconceived notion of what they want the results to say (for one reason or another) and then proceed to misinterpret and draw unsubstantiated conclusions that they argue support their notions.

This year’s NAPLAN is just another case in point.

Overall, the results are best summarised by the CEO of ACARA, Stephen Gniel when he says, “The data shows that while there were small increases and decreases across domains and year levels, overall the results were broadly stable.”

There are some good reasons for drawing this conclusion. The results, apart from some minor perturbations up-and-down in different domains, are indeed relatively stable. 

JT: Only so much “growth”

To be honest, this should be expected because there is only so much “growth” that can occur across one or two years of learning. The only other data we can compare with are the NAPLAN 2023 data, because the scale being used for comparison was only calibrated in 2023. 

A trend requires more than two points to be reliably interpreted. It is a relatively naïve view to expect that strategies introduced to address issues identified in the 2023 results would generate significant changes across a system in one year.

It is also important when reflecting on these results to stress several points. Firstly, there is some emotive language used to summarise performance which should not be allowed to go unchallenged. 

Students who have performed in the bottom two proficiency levels have been summarised as having “failed”.  However, when interpreting results like NAPLAN it is important to go beyond the “label” and look at what skills these students have displayed. The proficiency levels describe what students in these levels know and can do and an analysis of these skill sets suggests that they have a wide range of skills that will serve them well in later studies. 

JT: A sound springboard

The students in the bottom two levels have not “failed”. Knowledge and skills that students have displayed in the developing proficiency level are a sound springboard for learning within disciplines and through life.

 Let’s focus on what it is that students know and can do rather than jumping to labels that detract from the real meaning of the results.

While NAPLAN is a battery of psychometrically sound tests, they are only tests of literacy and numeracy (there is a lot more to schooling than a test result on literacy and numeracy alone). 

In addition, the results represent the outcomes on a particular day at a particular time. The key point here is that these results are only indicative. It is the trend data that are important at a system level. At an individual student level it is the accumulation of a range of data which provides the best evidence as to the overall performance of the student. The NAPLAN test scores must be interpreted by teachers using a wide range of data collected under different circumstances in the classroom. 

Parents who are concerned because the results are not consistent with what they expect from their child/children should seek clarification from the teachers.

Jennifer Gore: We know what to do. Let’s do it

These NAPLAN results are not new and not surprising. They reflect the results we saw last year with the new NAPLAN testing and reporting process and results we’ve seen for years. The fact that a third of students are not meeting proficiency standards is of great concern and the fact they disproportionately come from disadvantaged and other equity backgrounds reflects our nation’s failure to reduce educational inequality.

Education Minister Jason Clare is correct that we need reforms. The important thing is we get the reforms right.

First, we need to fully fund our public schools and end the political football over funding. Second, we need to support teachers to deliver excellent teaching. The current push for explicit teaching and synthetic phonics can only be part of the solution. Students are more than their brains. They learn in social and emotional conditions that also need to be addressed. For example, after a decade of explicit teaching and synthetic phonics, students in England are at an all-time low for enjoyment of reading, languishing toward the bottom of all OECD countries on this measure.

JG: You too can be like Cessnock

A decade of research at the University of Newcastle, including five randomised controlled trials, offers an alternative approach to school reform. Results from Cessnock High School, one of the most disadvantaged schools in NSW, show how our evidence-based approach to improving teaching quality, regardless of the instructional strategies used, can change lives. Cessnock High achieved the most improved NAPLAN growth from Year 7 to 9 in the Hunter region and ranked 11th overall in the state by engaging in whole school Quality Teaching Rounds. Simultaneously, teachers reported greater morale and improved school culture, which are critical factors in addressing the current teacher shortage crisis.

Thanks to funding from the Australian Government, thousands of teachers from across the country can now access this evidence-backed professional development for free.

Jim Tognolini is Professor and Director of the Centre for Educational Measurement and Assessment (CEMA) at the University of Sydney. Jennifer Gore is Laureate Professor and Director of the Teachers and Teaching Research Centre at the University of Newcastle. She developed the Quality Teaching Rounds.

Why you can’t identify gifted students using NAPLAN

Some schools rely on NAPLAN results to identify gifted students, a trend that is leading to many high-potential learners being overlooked and neglected. New research outlines the mistake of using this standardised assessment as the only identification tool for giftedness when it was never designed or intended for this purpose.

There are over 400,000 gifted students in Australia’s schools (approximately 10% of school students), but there are no national identification practices or national means of collecting information about Australian school students who are gifted. It has been over 20 years since the last national inquiry into the education of gifted and talented children in Australia. Despite two Senate inquiries (one in 1988 and one in 2001), there are no national initiatives aimed at reducing the impact of ongoing problems in identifying and supporting the needs of gifted learners. It is a national disgrace that gifted students are among some of the most underserved and neglected students in our schools.

The Contentious Belief in NAPLAN for Identifying Giftedness

In education, we constantly strive to uncover and nurture the gifts of our students and develop these into talents, hoping to unleash the full extent of their potential across their lifespan. In Australia, the National Assessment Program–Literacy and Numeracy (NAPLAN) plays a controversial role in evaluating student performance and guiding educational policies and practices. However, there exists a contentious belief that NAPLAN data alone can accurately identify high-potential gifted students. In this blog post, I delve into the fallacy of exclusively using NAPLAN data to identify gifted students. 

A Snapshot of NAPLAN

NAPLAN is a nationwide standardised assessment, conducted annually in Australia, designed to assess the proficiency of students in Years 3, 5, 7 and 9 in key learning areas, specifically reading, writing, language conventions, and numeracy. Its main goal is to gauge the effectiveness of the education system and pinpoint areas that may require improvement. NAPLAN was never designed, intended, or validated as a tool to identify giftedness. It was also never designed to create league tables for comparing schools.

What is giftedness?

Gifted students typically exhibit advanced cognitive abilities, exceptional problem-solving skills, and a high capacity for critical thinking. They often demonstrate creativity, strong motivation to learn (in areas of interest), and an insatiable curiosity. In Australia, the terms gifted and talented are often used as synonyms when in fact they have separate meanings. Giftedness is defined using Gagné’s Differentiated Model of Giftedness and Talent (DMGT). In this Model, gifted individuals are understood to have (sometimes as yet unidentified) potential to excel across various domains, including intellectual (e.g., general intelligence); creative (e.g., problem-solving); social (e.g., leadership); and motor control (e.g., agility).

On the other hand, the Model associates the term talent with performance, accomplishment or achievement, which is outstanding mastery of competencies in a particular field. The term talented is used to only describe individuals who are among the top 10 percent of peers (e.g., leading experts in their field) in any of nine competencies, including academic (e.g., mathematics); technical (e.g., engineering); science and technology (e.g., medical); the arts (e.g., performing); or sports (e.g., athletic talents).

Giftedness seems to be a misunderstood word in Australia. It is often incorrectly construed as referring to people who apparently ‘have it all’, whatever the elusive ‘it’ might be! Anyone who has any experience with giftedness would know that this is an elitist and unrealistic view of gifted learners and indeed, gifted education. In Australian education systems that are based on Gagné’s Model, giftedness focuses on an individual’s potential and ways to foster that potential through programs and practices that support the development of giftedness into talent.

Identifying Giftedness

The quest to identify gifted students has been a long-standing objective for education systems that seek to be genuinely inclusive. Research recommends that we should aim to identify exceptional potential as early as possible, providing tailored education to further nurture abilities. Naturally, the notion of using standardised test data, such as NAPLAN results, can be appealing because of its relative ease of implementation and data generated. But giftedness is not always demonstrated through achievement or performance. Rather, what NAPLAN may identify is some form of talent if we are using Gagné’s definitions.

Giftedness can coexist with other exceptionalities, such as disabilities, where a student is said to be twice-exceptional (or a gifted learner with disability). The twice-exceptionality stems from the two exceptionalities—individuals who are gifted (exceptional potential) and have coexisting disabilities (e.g., learning, physical, or emotional), and therefore require unique educational support that addresses both exceptionalities.

Why is Identification Important?

Many students can have their educational needs addressed in a typical classroom, but gifted learners often need specific interventions (e.g., extension, acceleration), or something different (e.g., specific curriculum differentiation), that engages their potential, in areas such as creativity, problem-solving, and curiosity, to develop these natural abilities into competencies and mastery.

There remains a persistent myth that gifted students are so clever that they will always do just fine on their own, without specific support. Yet, we would never expect a gifted tennis player, or a gifted violinist to do “just fine” on their own—the expectation would be for expert, tailored coaching along with extensive opportunities for practice and rehearsal to develop the student’s potential. Coaches focus on the individual needs of the student, rather than a standardised teaching program designed to suit most, but not all. Still, in Australia many claim to have misgivings about introducing anything ‘special’ for gifted students, while not having the same reservations with respect to athletically gifted or musically gifted students.

What Happens if Gifted Learners are Not Supported?

Failing to support the unique needs of gifted students at school can have significant and detrimental consequences on the students and on education systems and societies. Gifted students who are not appropriately challenged and supported may become disengaged and underachieve academically. Some researchers have estimated that 60%-75% of gifted students may be underachieving.

Becoming bored in the classroom can cause disruptive behaviour and a lack of interest in school, leading to problems such as school ‘refusal’ or ‘school can’t’, disengagement and school ‘drop out’ (estimated at up to 40% of gifted students). This perpetuates a cycle of missed opportunities and undeveloped potential. Furthermore, without appropriate support, gifted students may struggle with social and emotional challenges, feeling isolated from their peers because of their unique interests and abilities. This can lead to anxiety, depression, or other mental health issues.

When gifted students are not recognised and supported so that their giftedness can be transformed into talents, they may develop feelings of inadequacy or imposter syndrome. This can lead to decreased self-efficacy and self-confidence. Failing to identify and support gifted students means missing out on nurturing exceptional gifts that deprives the world of potential future leaders, innovators, medical researchers, and change-makers.

Gifted students from diverse backgrounds, including those from underrepresented or disadvantaged groups, may face additional barriers to identification and support. NAPLAN can be particularly problematic as a misused identification tool for underrepresented populations. Neglecting identification, and subsequently neglecting to address gifted students’ unique needs perpetuates inequity.

Societies and education systems that do not embrace inclusion and equity to the full extent risk continuing cycles of exclusion and inadequate support for giftedness. The OECD makes it clear that equity and quality are interconnected, and that improving equity in education should be a high priority. In Australia, priority equity groups never include giftedness or twice-exceptionality, and fail to recognise intersectionality of equity cohorts (e.g., gifted Aboriginal and Torres Strait Islander students), further compounding disadvantage. When schools fail to support gifted students, these learners can become disengaged and leave school prematurely, impacting social wellbeing and economic growth, and representing a missed opportunity for education environments to be truly inclusive. Inclusive education must mean that everyone is included, not everyone except gifted learners.

The Fallacy Unveiled: Limitations of NAPLAN Data to Identify Giftedness

While NAPLAN may have some merits as a standardised assessment tool, problems have been identified and there have even been calls to scrap the tests altogether. So, it is vital to recognise NAPLAN’s limitations, especially concerning the identification of high-potential gifted students. One key factor that contributes to the fallacy is the narrow assessment scope: NAPLAN primarily focuses on literacy and numeracy skills. While these are undoubtedly critical foundational skills, they do not encapsulate the full spectrum of giftedness. Moreover, the momentary snapshot NAPLAN provides of a student’s performance on a particular day may not accurately represent their true capabilities. Factors such as test anxiety, external distractions, or personal issues can significantly impact test outcomes, masking a student’s actual potential.

Giftedness often entails the capacity to handle complexity and to think critically across various domains. Standardised tests like NAPLAN do not effectively measure the multidimensionality of giftedness (from academic precocity, or potential to achieve academically, to creative thinking and problem solving). Relying solely on NAPLAN data to identify gifted students overlooks those who have potential to excel in non-traditional fields or those who possess such unique gifts.

Embracing Comprehensive Identification Practices

To accurately identify and cultivate giftedness, we must embrace a comprehensive and holistic approach for the purpose of promoting inclusive and supportive educational environments, and for developing talent. Using data from multiple sources to identify giftedness, including both objective and subjective measures (i.e., comprehensive identification), is the gold standard.

Comprehensive identification practices involve using multiple measures to identify giftedness, with the expectation that appropriate educational support follows. These identification practices should be accessible, equitable, and comprehensive to make sure identification methods are as broad as possible. Comprehensive identification may consist of student portfolios showcasing their projects, psychometric assessment, artwork, essays, or innovative solutions students have devised. This allows educators to gain a deeper understanding of a gifted student’s interests, passions, abilities, and potential.

Additionally, engaging parents, peers, and the student in the identification process can yield valuable perspectives on a student’s unique strengths and gifts, activities and accomplishments, which they may be involved in outside school. This may offer a more well-rounded evaluation. Experienced educators who have completed professional learning in gifted education could play a crucial role in recognising gifted traits in their students. 

By appropriately identifying, recognising, and addressing the needs of gifted students, we can create enriched educational settings that foster the development of gifted potential in education environments that are genuinely inclusive.

 

Michelle Ronksley-Pavia is a Special Education and Inclusive Education lecturer in the School of Education and Professional Studies, and a researcher with the Griffith Institute for Educational Research (GIER), Griffith University. She is an internationally recognised award-winning researcher working in the areas of gifted education, twice-exceptionality (gifted students with disability), inclusive education, learner diversity, and initial teacher education. Her work centres on disability, inclusive educational practices, and gifted and talented educational practices and provisions. 

NAPLAN: Where have we come from – where to from here?

With the shift to a new reporting system and the advice from ACARA that the NAPLAN measurement scale and time series have been reset, now is as good a time as any to rethink what useful insights can be gleaned from a national assessment program.

The 2023 national NAPLAN results were released last week, accompanied by more than the usual fanfare, and an overabundance of misleading news stories. Altering the NAPLAN reporting from ten bands to four proficiency levels, thereby reducing the number of categories students’ results fall into, has caused a reasonable amount of confusion amongst public commentators, and many excuses to again proclaim the demise of the Australian education system. 

Moving NAPLAN to Term 1, with all tests online (except Year 3 writing) seems to have had only minimal impact on the turnaround of results.

The delay between the assessments and the results has been a limitation to the usefulness of the data for schools since NAPLAN began. Added to this, there are compelling arguments that NAPLAN is not a good individual student assessment, shouldn’t be used as an individual diagnostic test, and is probably too far removed from classroom learning to be used as a reliable indicator of which specific teaching methods should be preferred. 

But if NAPLAN isn’t good for identifying individual students’ strengths and weaknesses, thereby informing teacher practices, what is it good for?

My view is that NAPLAN is uniquely powerful in its capacity to track population achievement patterns over time, and can provide good insights into how basic skills develop from childhood through to adolescence. However, it’s important that the methods used to analyse longitudinal data are evaluated and interrogated to ensure that conclusions drawn from these types of analyses are robust and defensible.

Australian governments are increasingly interested in students’ progress at school, rather than just their performance at any one time-point. The second Gonski review (2018) was titled Through Growth to Achievement. In a similar vein, the Alice Springs (Mparntwe) Education Declaration (2019) signed by all state, territory and federal education ministers, argued,

“Literacy and numeracy remain critical and must also be assessed to ensure learning growth is understood, tracked and further supported” (p.13, my italics)

Tracking progress over time should provide information about where students start and how fast they progress, and ideally, allow insights into whether policy changes at the system or state level have any influence on students’ growth.

However, mandating a population assessment designed to track student growth does not always translate to consistent information or clear policy directions – particularly when there are so many stakeholders determined to interpret NAPLAN results via their own lens.

One recent example of contradictory information arising from NAPLAN relates to whether students who start with poor literacy and numeracy results in Year 3 fall further behind as they progress through school. This phenomenon is known as the Matthew Effect. Notwithstanding widespread perceptions that underachieving students make less progress in their literacy and numeracy over their school years compared with higher achieving students, our new research found no evidence of Matthew Effects in NAPLAN data from NSW and Victoria.

In fact, we found the opposite pattern. Students who started with the poorest NAPLAN reading comprehension and numeracy test results in Year 3 had the fastest growth to Year 9. Students who started with the highest achievement largely maintained their position but made less progress.

Our results are opposite to those of an influential Grattan Institute Report published in 2016. This report used NAPLAN data from Victoria and showed that the gap in ‘years of learning’ widened over time. Importantly, this report applied a transformation to NAPLAN data before mapping growth overall, and comparing the achievement of different groups of students.

After the data transformation the Grattan Report found:

“Low achieving students fall ever further back. Low achievers in Year 3 are an extra year behind high achievers by Year 9. They are two years eight months behind in Year 3, and three years eight months behind by Year 9.” (p.2)

How do we reconcile this finding with our research? My conclusion is that these opposing findings are essentially due to different data analysis decisions.
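
To make that point concrete, here is a deliberately simplified sketch in Python. Every number is invented for illustration (these are not ACARA, state or Grattan figures), and the division by ‘typical annual growth’ is only a crude stand-in for a Grattan-style ‘years of learning’ conversion. It simply shows how the same pair of cohort gaps can narrow in raw scale points yet appear to widen once expressed in ‘years of learning’, because the divisor shrinks as growth decelerates.

```python
# A hypothetical illustration (not ACARA or Grattan figures) of how converting
# a scale-score gap into "years of learning" can flip the story when average
# annual growth decelerates between Year 3 and Year 9.

# Hypothetical cohort means on a NAPLAN-like scale
low_y3, high_y3 = 350, 500     # Year 3: gap of 150 scale points
low_y9, high_y9 = 520, 640     # Year 9: gap of 120 scale points (spread has shrunk)

# Hypothetical average growth per year of schooling at each stage
growth_per_year_y3 = 40        # points per year around Year 3 (fast early growth)
growth_per_year_y9 = 15        # points per year around Year 9 (decelerated growth)

gap_y3_points = high_y3 - low_y3
gap_y9_points = high_y9 - low_y9

# The same gaps expressed as "years of learning" at each stage
gap_y3_years = gap_y3_points / growth_per_year_y3   # 150 / 40 = 3.75 "years"
gap_y9_years = gap_y9_points / growth_per_year_y9   # 120 / 15 = 8.0 "years"

print(f"Gap in scale points: Year 3 = {gap_y3_points}, Year 9 = {gap_y9_points}")
print(f"Gap in 'years of learning': Year 3 = {gap_y3_years:.1f}, Year 9 = {gap_y9_years:.1f}")
# The raw-score gap narrows, but the 'years of learning' gap appears to widen,
# purely because the divisor (annual growth) shrinks as students get older.
```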

Without the transformation of data applied in the Grattan Report, the variance in NAPLAN scale scores at the population level decreases between Year 3 and Year 9. This means that there’s less difference between the lowest and highest achieving students in NAPLAN scores by Year 9. Reducing variance over time can be a feature of horizontally-equated Rasch-scaled assessments – and it is a limitation of our research, noted in the paper.

There are other limitations of NAPLAN scores outlined in the Grattan Technical Report. These were appropriately acknowledged in the analytic strategy of our paper and include modelling the decelerating growth curves, accounting for problems with missing data, allowing for heterogeneity in starting point and rate of progress, modelling measurement error, and so on. The latent growth model analytic design that we used is very suited to examining research questions about development, and the type of data generated by NAPLAN assessments.

In my view, the nature of the Rasch scores generated by the NAPLAN testing process does not require a score transformation to model growth in population samples. Rasch scaled scores do not need to be transformed into ‘years of progress’ – and indeed doing so may only muddy the waters.

For example, I don’t think it makes sense to say that a child is at a Year 1 level in reading comprehension based on NAPLAN because the skills that comprise literacy are theoretically different at Year 1 compared with Year 3. We already make a pretty strong assumption with NAPLAN that the tests measure the same theoretical construct from Year 3 to Year 9. Extrapolating outside these boundaries is not something I would recommend.

Nonetheless, the key takeaway from the Grattan report, that “Low achieving students fall ever further back” (p.2) has had far reaching implications. Governments rely on this information when defining the scope of educational reviews (of which there are many), and making recommendations about such things as teacher training (which they do periodically). Indeed, the method proposed by the Grattan report was that used by a recent Productivity Commission report, which subsequently influenced several Federal government education reviews. Other researchers use the data transformation in their own research, when they could use the original scores and interpret standard deviations for group-based comparisons.

Recommendations that are so important at a policy level should really be underpinned by robustly defended data analysis choices. Unfortunately, the limitations of an analytic strategy can often be lost because stakeholders want takeaway points, not statistical debates. What this example shows is that data analysis decisions can (annoyingly) lead to opposing conclusions about important topics.

Where to from here

Regardless of which interpretation is closer to the reality, NAPLAN 2023 represents something of a new beginning for national assessments in Australia. The key change is that from 2023 the time series for NAPLAN will be reset. This means that schools and states technically should not be comparing this year’s results with previous years. 

The transition to computer adaptive assessments is also now complete. Ideally this should ensure more precision in assessing the achievement of students at both ends of the distribution – a limitation of the original paper-based tests. 

Whether the growth patterns observed in the old NAPLAN will remain in the new iteration is not clear: we’ll have to wait until 2029 to replicate our research, when the 2023 Year 3s are in Year 9.  

Sally Larsen is a Lecturer in Learning, Teaching and Inclusive Education at the University of New England. Her research is in the area of reading and maths development across the primary and early secondary school years in Australia, including investigating patterns of growth in NAPLAN assessment data. She is interested in educational measurement and quantitative methods in social and educational research. You can find her on Twitter @SallyLars_27

How this oppressive test is killing the magic of childhood

NAPLAN is taking the fun out of early childhood learning. Early childhood learning encompasses education for children from birth to eight years of age and it is widely known that play-based programs planned with intentionality are the best way for teachers to engage young children in learning. Unfortunately, a focus on NAPLAN scores has resulted in many schools paying more attention to literacy and numeracy programs so that children in primary school perform better in the tests in Years 3 and 5. This is impacting on the learning engagement of children in the earlier years. 

Research over decades has shown that play is how young children learn. Through interacting with their environment and their peers, children are making sense of the world and their place in it. These ideals are reflected in the Early Years Learning Framework for Australia that sets out what children aged up to five should be engaging with: “Play-based learning with intentionality can expand children’s thinking and enhance their desire to know and to learn, promoting positive dispositions towards learning” (p.21). The Early Years Learning Framework document applies to all children in the early years of school across Australia, yet the focus on Literacy and Numeracy is narrowing the curriculum and taking away the opportunities for children and teachers to engage in play-based learning.

Although NAPLAN does not happen until Year 3, when children are about 8 years old, it has been identified that teachers in the lower grades are being asked to teach Literacy and Numeracy in more formal ways. The concentrated focus on these two subject areas has led to an increase in the use of whole school commercial programs, some of which are specifically scripted. This practice reduces the autonomy of teachers to make decisions about their teaching based not only on their training but their knowledge of the children in their class. This raises concerns for the teacher and their practice, as well as the engagement of the children through more formalised learning practices earlier in their school experience.

With the publishing of results on the MySchool website and other unintended consequences of the standardised tests, including principals’ performance in some states being measured by these results, NAPLAN has become high stakes. For school leaders, there is pressure to do well, and this is being transferred to teachers and sometimes children and their families, which may negatively impact on wellbeing. Even in schools where children traditionally perform well and there are programs focusing on wellbeing, some children are still feeling stressed about doing the tests and what the results will mean for them. This pressure is leading to children doing more formalised learning in literacy and numeracy from an earlier age and ‘play’ is often relegated to Friday afternoons if all other tasks are completed.

Play, or more specifically play-based learning, is often misunderstood within education, despite the evidence of its value. Play is often situated at one end of a continuum with learning at the other when, in fact, intentional teachers can implement programs across this continuum to engage children in learning across multiple and integrated subject areas. When children are enjoying their learning through play, they are often unaware that they are learning science, mathematics, and engineering in the block corner; or geography, history, and science when they are exploring gardening, including investigating how it was done by their grandparents.

Teachers who do not understand play and play-based learning approaches may be uncomfortable with the reduced control that comes through children learning in this way. Research conducted in both the science and technology domains, however, has shown that children often are more engaged and learn more than expected when they are interested in the learning and it is happening in a way that is authentic to their experience. Not only are the children in these research projects learning specific content, but they are also learning Literacy and Numeracy when they plan explorations, calculate results, represent findings, and use technology to create, research, record and share information. The multi-modal options that play facilitates ensure that all children can feel a sense of accomplishment and can learn from their peers as well as their teacher.

Children do need to be literate and numerate, but NAPLAN scores are not showing improvement despite the increased focus on these two specific learning areas over recent years. At the same time, children are becoming increasingly anxious and disengaged from school from an earlier age. Research in early childhood continues to identify that children engage with and learn through play-based approaches, and through the intentionality of the planning, teachers have autonomy over their programs to suit the needs of the children in their classroom. Perhaps it is time that the fun is brought back to classrooms, not only for children under five but for all children in schools, so that they can engage and enjoy their learning. Engaged children may be less likely to resort to negative behaviour to gain attention, and a reduction in the use of prescribed programs and a little more fun may also help teachers feel valued for their knowledge and expertise. The potential is there for broader approaches and happier children and teachers through increased fun, perhaps helping to bring some teachers back to the workforce – a win all around!

Pauline is a senior lecturer in the Early Childhood program at ECU. Her teaching and research are focused on a range of issues in early childhood education including assessment, curriculum, workforce and reflective practice.

Pausing NAPLAN did not destroy society – but new changes might not fix the future

NAPLAN is again in the news. Last week, it was the Ministers tinkering with NAPLAN reporting and timing. This week it is media league tables ranking schools and sectors, according to NAPLAN results, coinciding with the upload of latest school-level data to the ‘My School’ website. We are now about one month out from the new March test window so expect to hear a lot more in the coming weeks. Many schools will be already deeply into NAPLAN test preparation. 

NAPLAN and the My School website were initially introduced by then Education Minister Julia Gillard as levers for parental choice. Last week’s ACARA media release reiterates that their primary purpose is so parents can make ‘informed choices’ about their children’s schooling. Media analysis of NAPLAN results correctly identifies what researchers know only too well: that affluence skews educational outcomes to further advantage the already advantaged. 

The Sydney Morning Herald notes that “Public schools with and without opportunity classes, high-fee private institutions and Catholic schools in affluent areas have dominated the top 100 schools…” The reporters are careful to draw attention to a couple of atypical public schools, achieving better results than might be expected from their demographics. A closer look at the SMH table of Top Performing primary schools shows that most low ICSEA public schools ‘punching above their weight’ are very small regional schools. 

No doubt there is a lot to learn from highly effective and usually overlooked small rural schools, but few families can move to them from the city. Parental choice is constrained by income, residential address, work opportunities and a myriad of other factors. In any case, as Stewart Riddle reminds us, what makes a ‘good school’ is far more subtle and complex than anything that a NAPLAN can tell us. 

NAPLAN has gradually morphed into a diagnostic tool for individual students, though there are other tools more fit for this purpose. Notably, the pandemic-induced NAPLAN pause did not lead to the collapse of Australian education but was seen by many teachers as a relief when they were dealing with so many more important aspects of young people’s learning and well-being. 

Education Ministers’ adjustments to NAPLAN indicate that they are at last responding to some of the more trenchant critiques of NAPLAN. The creation of a teacher panel by ACARA as part of the process of setting standards hints that the professional expertise and voices of teachers are valued. Bringing NAPLAN testing forward will hopefully make it more useful where it really matters – in schools and classrooms.

The move to four levels of reporting will make results more accessible to parents. Pleasingly, the new descriptor for the lowest proficiency level – ‘Needs additional support’ – puts the onus on the school and school systems to respond to student needs.

Yet one of the keenest critiques of NAPLAN has not been addressed. There have been widespread calls from educators and academics for the NAPLAN writing test to be withdrawn. It has been found to have a narrowing effect on both the teaching of writing and students’ capacity to write. There is also a whole “how to do NAPLAN” industry of tutors and books pushing formulaic approaches to writing and playing on families’ anxieties.

The failure of the current round of changes to address the NAPLAN writing test leaves students writing like robots. Meanwhile, the release of ChatGPT means that students doing NAPLAN writing for no real purpose or audience of their own are wasting their time. Robots can do it better! These changes needed to map writing better to the National Curriculum, and endorse more meaningful, creative, multimodal and life-relevant writing practices.

 As a single point in time test, NAPLAN has always been just one source of data that teachers and schools can draw upon to design targeted interventions to support student learning. Nevertheless, earlier results will mean that schools will have robust evidence about their need for additional resources. Professional expertise in literacy, numeracy and inclusive education support must be prioritised. 

Parents might be able to resist the inclination to shuffle their children from school to school as a reaction to media headlines, school rankings, and promotional campaigns from the independent sector. Alliances might form between parents and schools to support greater action by state and federal Ministers to address the deeply entrenched divisions that have become baked into Australian schooling.

Attention to NAPLAN continues to mask serious ongoing questions about why Australian governments have created conditions where educational inequities, segregation and stratification are now defining characteristics of our education system. Numerous reports and inquiries have identified flaws and perverse effects from NAPLAN as high stakes testing, especially in relation to the writing test. There is a lot of work yet to be done if NAPLAN is to really be useful and relevant for Australian schools, teachers, parents and learners.

Professor Susanne Gannon is an expert in educational research across a range of domains and methodologies. Much of her research focuses on equity issues in educational policy and practices. Recent research projects include investigations of the impact of NAPLAN on the teaching of writing in secondary school English, young people’s experiences of school closures due to COVID-19 in 2020, and vocational education for students from refugee backgrounds in NSW schools.

Dr Lucinda McKnight is an Australian Research Council Fellow in Deakin University’s Research for Education Impact (REDI) centre. She is undertaking a three year national project examining how the conceptualisation of writing is changing in digital contexts. Follow her Teaching Digital Writing project blog or her twitter account @lucindamcknight8


The good, the bad and the pretty good actually

Every year headlines proclaim the imminent demise of the nation due to terrible, horrible, very bad NAPLAN results. But if we look at variability and results over time, it’s a bit of a different story.

I must admit, I’m thoroughly sick of NAPLAN reports. What I am most tired of, however, are moral panics about the disastrous state of Australian students’ school achievement that are often unsupported by the data.

A cursory glance at the headlines since the NAPLAN 2022 results were released on Monday shows several classics in the genre of “picking out something slightly negative to focus on so that the bigger picture is obscured”. 

A few examples (just for fun) include:

Reading standards for year 9 boys at record low, NAPLAN results show 

Written off: NAPLAN results expose where Queensland students are behind 

NAPLAN results show no overall decline in learning, but 2 per cent drop in participation levels an ‘issue of concern’ 

And my favourite (and a classic of the “yes, but” genre of tabloid reporting)

‘Mixed bag’ as Victorian students slip in numeracy, grammar and spelling in NAPLAN 

The latter contains the alarming news that “In Victoria, year 9 spelling slipped compared with last year from an average NAPLAN score of 579.7 to 576.7, but showed little change compared with 2008 (576.9). Year 5 grammar had a “substantial decrease” from average scores of 502.6 to 498.8.”

If you’re paying attention to the numbers, not just the hyperbole, you’ll notice that these ‘slips’ are in the order of 3 scale scores (Year 9 spelling) and 3.8 scale scores (Year 5 grammar). Perhaps the journalists are unaware that the NAPLAN scale runs from 1 to 1000? It might be argued that a change in the mean of 3 scale scores is essentially what you get with normal fluctuations due to sampling variation – not, interestingly, a “substantial decrease”. 

The same might be said of the ‘record low’ reading scores for Year 9 boys. The alarm is caused by a 0.2 score difference between 2021 and 2022. When compared with the 2008 average for Year 9 boys the difference is 6 scale score points, but this difference is not noted in the 2022 NAPLAN Report as being ‘statistically significant’ – nor are many of the changes up or down in means or in percentages of students at or above the national minimum standard.

Even if differences are reported as statistically significant, it is important to note two things: 

1. Because we are ostensibly collecting data on the entire population, it’s arguable whether we should be using statistical significance at all.

2. As sample sizes increase, even very small differences can be “statistically significant” when they are not practically meaningful, as the sketch below illustrates.
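
To illustrate the second point, here is a rough Python simulation. The cohort size of 60,000, the standard deviation of about 70 scale points and the 3-point gap are assumptions chosen for illustration only, not published NAPLAN parameters.

```python
# A rough illustration of point 2: with cohort-sized samples, a practically tiny
# difference in means comes out "statistically significant". The SD (~70) and
# cohort size (60,000) are assumptions for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sd = 60_000, 70
cohort_2021 = rng.normal(580, sd, n)   # hypothetical mean scale scores
cohort_2022 = rng.normal(577, sd, n)   # 3 scale points lower on a ~1000-point scale

t, p = stats.ttest_ind(cohort_2021, cohort_2022, equal_var=False)
d = (cohort_2021.mean() - cohort_2022.mean()) / np.sqrt((cohort_2021.var() + cohort_2022.var()) / 2)

print(f"p-value = {p:.2e}")        # far below 0.05
print(f"Cohen's d = {d:.3f}")      # around 0.04 -- a negligible effect size
```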

Figure 1. NAPLAN Numeracy test mean scale scores for nine cohorts of students at Year 3, 5, 7 and 9.

The practical implications of reported differences in NAPLAN results from year to year (essentially the effect sizes) are not often canvassed in media reporting. This is an unfortunate omission and tends to enable narratives of large-scale decline, particularly because the downward changes are trumpeted loudly while the positives are roundly ignored. 

The NAPLAN reports themselves do identify differences in terms of effect sizes – although the reasoning behind what magnitude delineates a ‘substantial difference’ in NAPLAN scale scores is not clearly explained. Nonetheless, moving the focus to a consideration of practical significance helps us ask: If an average score changes from year to year, or between groups, are the sizes of the differences something we should collectively be worried about? 

Interestingly, Australian students’ literacy and numeracy results have remained remarkably stable over the last 14 years. Figures 1 and 2 show the national mean scores for numeracy and reading for the nine cohorts of students who have completed the four NAPLAN years, starting in 2008 (notwithstanding the gap in 2020). There have been no precipitous declines, no stunning advances. Average scores tend to move around a little bit from year to year, but again, this may be due to sampling variability – we are, after all, comparing different groups of students. 

This is an important point for school leaders to remember too: even if schools track and interpret mean NAPLAN results each year, we would expect those mean scores to go up and down a little bit over each test occasion. The trick is to identify when an increase or decrease is more than what should be expected, given that we’re almost always comparing different groups of students (relatedly see Kraft, 2019 for an excellent discussion of interpreting effect sizes in education). 
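
As a back-of-envelope illustration of how much bounce to expect, the sketch below simulates a school whose underlying performance never changes. The cohort size of 60 students and score standard deviation of 70 are assumptions for the purpose of the example.

```python
# A back-of-envelope sketch of how much a school's cohort mean would be expected
# to bounce around from year to year purely because each year is a different
# group of students. Cohort size (60) and score SD (70) are assumptions.
import numpy as np

rng = np.random.default_rng(7)
cohort_size, sd, true_mean, n_years = 60, 70, 500, 10

yearly_means = [rng.normal(true_mean, sd, cohort_size).mean() for _ in range(n_years)]
print(np.round(yearly_means, 1))

# Standard error of a cohort mean of this size:
se = sd / np.sqrt(cohort_size)
print(f"Expected +/- range (roughly 2 SEs): about {2 * se:.0f} scale points")
# Swings of 15-20 points between years can occur with no change in teaching at all.
```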

Figure 2. NAPLAN Reading test mean scale scores for nine cohorts of students at Year 3, 5, 7 and 9.

Plotting the data in this way it seems evident to me that, since 2008, teachers have been doing their work of teaching, and students by-and-large have been progressing in their skills as they grow up, go to school and sit their tests in years 3, 5, 7 and 9. It’s actually a pretty good news story – notably not an ongoing and major disaster. 

Another way of looking at the data, and one that I think is much more interesting – and instructive – is to consider the variability in achievement between observed groups. This can help us see that just because one group has a lower average score than another group, this does not mean that all the students in the lower average group are doomed to failure.

Figure 3 shows just one example: the NAPLAN reading test scores of a random sample of 5000 Year 9 students who sat the test in NSW in 2018 (this subsample was randomly selected from data for the full cohort of students in that year, N=88,958). The red dots represent the mean score for boys (left) and girls (right). You can see that girls did better than boys on average. However, the distribution of scores is wide and almost completely overlaps (the grey dots for boys and the blue dots for girls). There are more boys at the very bottom of the distribution and a few more girls right at the top of the distribution, but these data don’t suggest to me that we should go into full panic mode that there’s a ‘huge literacy gap’ for Year 9 boys. We don’t currently have access to the raw data for 2022, but it’s unlikely that the distributions would look much different for the 2022 results.  

Figure 3. Individual scale scores and means for Reading for Year 9 boys and girls (NSW, 2018 data).
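
For readers without access to the raw files, the following sketch simulates two overlapping distributions in the same spirit as Figure 3. The means and spread are invented rather than the actual NSW values; the point is how much two wide distributions overlap even when their averages genuinely differ.

```python
# A simulated analogue of Figure 3 (hypothetical numbers, not the NSW data):
# two wide, almost completely overlapping distributions whose means differ a little.
import numpy as np

rng = np.random.default_rng(42)
sd = 65                                    # assumed spread of reading scale scores
girls = rng.normal(585, sd, 2500)          # assumed means: a small gap favouring girls
boys = rng.normal(570, sd, 2500)

d = (girls.mean() - boys.mean()) / sd      # standardised mean difference
boys_above_median_girl = np.mean(boys > np.percentile(girls, 50))

print(f"Cohen's d = {d:.2f}")
print(f"Share of boys scoring above the median girl: {boys_above_median_girl:.0%}")
# Even with a real average gap, a large share of boys outscore the 'typical' girl --
# the distributions overlap far more than the two red mean dots suggest.
```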

So what’s my point? Well, since NAPLAN testing is here to stay, I think we can do a lot better on at least two things: 1) reporting the data honestly (even when it’s not bad news), and 2) critiquing misleading or inaccurate reporting by pointing out errors of interpretation or overreach. These two aims require a level of analysis that goes beyond mean score comparisons to look more carefully at longitudinal trends (a key strength of the national assessment program) and variability across the distributions of achievement.

If you look at the data over time NAPLAN isn’t a story of a long, slow decline. In fact, it’s a story of stability and improvement. For example, I’m not sure that anyone has reported that the percentage of Indigenous students at or above the minimum standard for reading in Year 3 has stayed pretty stable since 2019 – at around 83%, up from 68% in 2008. In Year 5 it’s the highest it’s ever been, with 78.5% of Indigenous students at or above the minimum standard – up from 63% in 2008. 

Overall the 2022 NAPLAN report shows some slight declines, but also some improvements, and a lot that has remained pretty stable. 

As any teacher or school leader will tell you, improving students’ basic skills achievement is difficult, intensive and long-term work. Like any task worth undertaking, there will be victories and setbacks along the way. Any successes should not be overshadowed by the disaster narratives continually fostered by the 24/7 news cycle. At the same time, overinterpreting small average fluctuations doesn’t help either. Fostering a more nuanced and longer-term view when interpreting NAPLAN data, and recalling that it gives us a fairly one-dimensional view of student achievement and academic development would be a good place to start.

Sally Larsen is a Lecturer in Learning, Teaching and Inclusive Education at the University of New England. Her research is in the area of reading and maths development across the primary and early secondary school years in Australia, including investigating patterns of growth in NAPLAN assessment data. She is interested in educational measurement and quantitative methods in social and educational research. You can find her on Twitter @SallyLars_27

AERO’s writing report is causing panic. It’s wrong. Here’s why.

If ever there was a time to question public investment in developing reports using ‘data’ generated by the National Assessment Program, it is now with the release of the Australian Educational Research Organisation’s report ‘Writing development: What does a decade of NAPLAN data reveal?’ 

I am sure the report was meant to provide reliable diagnostic analysis for improving the function of schools. 

It doesn’t. Here’s why.

There are deeply concerning technical questions about both the testing regime which generated the data used in the current report, and the functioning of the newly created (and arguably redundant) office which produced this report.

There are two lines of technical concern which need to be noted. These concerns reveal why this report should be disregarded – and why the media response is a beat-up.

The first technical concern for all reports of NAPLAN data (and any large scale survey or testing data) is how to represent the inherent fuzziness of estimates generated by this testing apparatus.  

Politicians and almost anyone outside of the very narrow fields reliant on educational measurement would like to talk about these numbers as if they are definitive and certain.

They are not. They are estimates – all of the summary statistics in these reports are just estimates.

The fact these are estimates is not apparent in the current report.  There is NO presentation of any of the estimates of error in the data used in this report. 

Sampling error is important, and, as ACARA itself has noted, (see, eg, the 2018 NAPLAN technical report) must be taken into account when comparing the different samples used for analyses of NAPLAN.  This form of error is the estimate used to generate confidence intervals and calculations of ‘statistical difference’.  

Readers who recall seeing survey results or polling estimates being represented with a ‘plus or minus’ range will recognise sampling error. 

Sampling error is a measure of how much an estimate would vary if the same analysis were done again, with a new sample of the same size, with the same instruments, and so on. (I probably should point out that the very common way of expressing statistical confidence often gets this wrong – when we say we have X level of statistical confidence, that isn’t a percentage of how confident you can be in that number, but rather an indication of how likely you are to get a similar result if you did it again.)
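To make that concrete, here is a minimal simulation sketch (the population mean, spread and sample size are invented for illustration and are not NAPLAN figures): drawing repeated samples of the same size from the same population and watching how much the sample means vary is exactly what sampling error describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# An invented 'population' of scale scores (values chosen only for illustration).
population = rng.normal(loc=550, scale=70, size=100_000)

# Draw many samples of the same size and record each sample's mean.
sample_means = [rng.choice(population, size=5_000, replace=False).mean()
                for _ in range(1_000)]

# The spread of these means is the sampling error: any single estimate is
# just one draw from this distribution, which is what the 'plus or minus'
# range in polling is trying to communicate.
print(f"average of the sample means: {np.mean(sample_means):.1f}")
print(f"spread (standard error):     {np.std(sample_means):.2f}")
```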

In this case, we know about 10% of the population do not sit the NAPLAN writing exam, so we already know there is sampling error.  

This is also the case when trying to infer something about an entire school from the results of a couple of year levels. The problem here is that we know the sampling error introduced by test absences is not random, and accounting for it can very much change trend analyses, especially for sub-populations. So, what does this persuasive writing report say about sampling error?

Nothing. Nada. Zilch. Zero. 

Anyone who knows basic statistics knows that when you have very large samples, the amount of error is far less than with smaller samples. In fact, with samples as large as we get in NAPLAN reports, it would take only a very small difference to create enough ripples in the data to show up as being statistically significant. That doesn’t mean, however, that the error introduced is zero – and THAT error must be reported when representing mean differences between different groups (or different measures of the same group).
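A rough sketch of that point, with invented numbers: give two cohort-sized groups a ‘true’ difference of only two scale points and a conventional test will still flag it as highly significant, even though the difference is educationally trivial and the error is not zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two invented groups of ~80,000 'students' whose underlying means differ
# by only 2 scale points (roughly cohort-sized, but the scores are made up).
group_a = rng.normal(550, 70, 80_000)
group_b = rng.normal(552, 70, 80_000)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"observed difference: {group_b.mean() - group_a.mean():.2f} scale points")
print(f"p-value:             {p_value:.2e}")
# With samples this large the p-value is tiny, which says nothing about
# whether a 2-point gap matters in any practical sense.
```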

Given the size of the samples here, you might think it OK to let that slide. However, that isn’t the only shortcut taken in the report. The second most obvious measure ignored in this report is measurement error. Measurement error exists any time we create some instrument to estimate a ‘latent’ variable – i.e. something you can’t see directly. We can’t SEE achievement directly – it is an inference based on measuring several things we can theoretically argue are valid indicators of the thing we want to measure.

Measurement error is by no means a simple issue, but it directly impacts the validity of any one individual student’s NAPLAN score and of any aggregate based on those individual results. In ‘classical test theory’ a measured score is made up of what is called a ‘true score’ and error (+/-). In more modern measurement theories error can become much more complicated to estimate, but the general conception remains the same. Any parent who has looked at NAPLAN results for their child and queried whether or not the test is accurate is implicitly questioning measurement error.
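In classical test theory notation the idea is usually written like this (standard textbook notation, not anything specific to the AERO report):

```latex
X = T + E, \qquad
\operatorname{Var}(X) = \operatorname{Var}(T) + \operatorname{Var}(E), \qquad
\rho_{XX'} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}
```

where X is the observed score, T the unobservable ‘true score’, E the error, and ρ the test’s reliability – the closer the reliability is to 1, the smaller the error component.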

Educational testing advocates have developed many very mathematically complicated ways of dealing with measurement error – and have developed new testing techniques for improving their tests. The current push for adaptive testing is precisely one of those developments; in the local case it is rationalised on the grounds that adaptive testing (where the specific test items asked of the person being tested change depending on prior answers) does a better job of differentiating those at the top and bottom ends of the scoring range (see the 2019 NAPLAN technical report for this analysis).

This bottom/top of the range problem is referred to as a floor or ceiling effect. When a large proportion of students either don’t score anything or get everything correct, there is no way to differentiate those students from each other – adaptive testing is a way of dealing with floor and ceiling effects better than a predetermined set of test items can. This adaptive testing has been included in the newer deliveries of the online form of the NAPLAN test.

Two important things to note. 

First, the current report claims that the scores of high-‘performing’ students have shifted down – despite the new adaptive testing regime producing very different patterns of ceiling effect. Second, the test is not identical for all students (it never has been).

The process used for selecting test items is based on ‘credit models’ generated by testers. Test items are determined to have particular levels of ‘difficulty’ based on the probability of correct answers being given by different populations and samples, after assuming population-level equivalence in prior ‘ability’ AND creating difficulty scores for items while assuming individual student ‘ability’ measures are stable from one time period to the next. That’s how they can create the 800-point scales that are designed for comparing different year levels.
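In its simplest (dichotomous Rasch) form, the kind of model being described looks like the following – a generic textbook expression, not ACARA’s exact specification, which uses more elaborate partial-credit models for the writing criteria:

```latex
P(X_{pi} = 1 \mid \theta_p, b_i) \;=\; \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}
```

Here θ_p is student p’s estimated ‘ability’ and b_i is item i’s estimated ‘difficulty’, both expressed in logits; the reported scale scores are a linear transformation of those logits, which is what allows different year levels to be placed on the same scale.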

So what does this report say about any measurement error that may impact the comparisons they are making?  Nothing.

One of the ways ACARA and politicians have settled their worries about such technical concerns as accurately interpreting statistical reports is by introducing the reporting of test results in ‘Bands’. Now these bands are crucial for qualitatively describing rough ranges of what the number might mean in curriculum terms – but they come with a big consequence. Using ‘Band’ scores is known as ‘coarsening’ data – taking a more detailed scale and summarising it in a smaller set of ordered categories – and that process is known to increase any estimates of error. This latter problem has received much attention in the statistical literature, with new procedures being recommended for adjusting estimates to account for that error when conducting group comparisons using such data.
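A toy illustration of what coarsening does (all numbers and cut-points are invented; these are not the official NAPLAN band definitions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented 'true' scores plus measurement noise for 10,000 students.
true_scores = rng.normal(550, 70, 10_000)
observed = true_scores + rng.normal(0, 25, 10_000)

# Coarsen the continuous scale into a handful of ordered 'bands'
# (arbitrary cut-points chosen only for this illustration).
cuts = [450, 500, 550, 600, 650]
bands = np.digitize(observed, cuts)

# The banded version tracks the underlying scores less closely than the
# continuous scale does, i.e. coarsening adds error to group comparisons.
print(f"continuous score vs 'true': r = {np.corrcoef(observed, true_scores)[0, 1]:.3f}")
print(f"band score vs 'true':       r = {np.corrcoef(bands, true_scores)[0, 1]:.3f}")
```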

As before, the amount of reporting of that error issue? Nada.

 This measurement problem is not something you can ignore – and yet the current report is worse than careless on this question.

It takes advantage of readers not knowing about it. 

When the report attempts to diagnose which components of the persuasive writing task were of most concern, it does not bother reporting that each of the separate measures of those ten dimensions of writing carries far more error than the total writing score, simply because the number of marks for each is a fraction of the total. The smaller the number of indicators, the more error (and the less reliability).
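The standard way to express that relationship between the number of marks and reliability is the Spearman–Brown formula (a general psychometric result, not something taken from the report):

```latex
\rho_k \;=\; \frac{k\,\rho_1}{1 + (k-1)\,\rho_1}
```

where ρ₁ is the reliability of a one-mark ‘test’ and ρ_k the reliability of a score built from k comparable marks: drop from the 48 marks of the total writing score to the few marks available for a single criterion and the reliability of that criterion score falls accordingly.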

Now all of these technical concerns simply raise the question of whether or not the overall findings of the report would hold up to robust tests and rigorous analysis – there is no way to assess that from this report. But there is an even bigger reason to question why it was given as much attention as it was. That is, for any statistician, there is always a challenge in translating the numeric conclusions into some form of ‘real life’ scenario.

To explain why AERO has significantly dropped the ball on this last point, consider its headline claim that Year 9 students’ persuasive writing scores have declined, and the way that is represented as a major new concern.

First, note that the ONLY reporting of this using the actual scale values is a vaguely labelled line graph showing scores from 2011 until 2018 – skipping 2016, since the writing task that year wasn’t persuasive writing (p. 26 of the report has this graph). Of those year-to-year shifts, the only two that may be statistically significant, and are readily visible, are from 2011 to 2012, and then again from 2017 to 2018. Why speak so vaguely? Because from the report we can’t tell you the numeric value of that drop: there is no reporting of the actual numbers represented in that line graph.

Here is where the final reality check comes in.  

If this data matches the data reported in the national reports from 2011 and 2018, the reported mean values on the writing scale were 565.9 and 542.9 respectively. So that is a drop between those two time points of 23 points. That may sound like a concern, but recall those scores are based on 48 marks given for writing. In other words, that 23-point difference is no more than one mark’s difference (it could be far less, since each different mark carries a different weighting in the formulation of that 800-point scale).

Consequently, even if all the technical concerns were sufficiently addressed and the pattern still held, a realistic headline for the Year 9 claim would be: ‘Year 9 students in the 2018 NAPLAN writing test scored one mark less than the Year 9 students of 2011.’

Now, assuming that 23-point difference has anything to do with the students at all, start thinking about all the plausible reasons why students in that last year of NAPLAN may not have been as attentive to detail as they were when NAPLAN was first getting started. I can think of several, not least being the way my own kids did everything possible to ignore the Year 9 test – since the Year 9 test had zero consequences for them.

Personally, I find these reports troubling for many reasons, including the use of statistics to assert certainty without good justification, but also because saying student writing has declined belies the obvious fact that it hasn’t been all that great for decades. This is where I am totally sympathetic to the issues raised by the report – we do need better writing among the general population. But using national data to produce a report of this calibre, by an agency beholden to government, really does little more than provide click-bait and knee-jerk diagnosis from all sides of a debate we don’t really need to have.

James Ladwig is Associate Professor in the School of Education at the University of Newcastle.  He is internationally recognised for his expertise in educational research and school reform.  Find James’ latest work in Limits to Evidence-Based Learning of Educational Science, in Hall, Quinn and Gollnick (Eds) The Wiley Handbook of Teaching and Learning published by Wiley-Blackwell, New York. James is on Twitter @jgladwig

AERO’s response to this post

ADDITIONAL COMMENTS FROM AERO, provided on November 9: for information about the statistical issues discussed, a more detailed Technical Note is available from AERO.

On Monday, EduResearch Matters published the above post by Associate Professor James Ladwig, which critiqued the Australian Education Research Organisation’s report Writing development: what does a decade of NAPLAN data reveal?

AERO’s response is below, with additional comments from Associate Professor Ladwig. 

AERO: This article makes three key criticisms about the analysis presented in the AERO report, which are inaccurate.

Ladwig claims that the report lacks consideration of sampling error and measurement error in its analysis of the trends of the writing scores. In fact, those errors were accounted for in the complex statistical method applied. AERO’s analysis used both simple and complex statistical methods to examine the trends. While the simple method did not consider error, the more complex statistical method (referred to as the ‘Differential Item Analysis’) explicitly considered a range of errors (including measurement error, and cohort and prompt effects).

Associate Professor Ladwig: AERO did not include any of that in its report, nor in any of the technical papers. There is no over-time DIF analysis of the full score – and I wouldn’t expect one. All of the DIF analyses rely on data that itself carries error (more below). There is no way for the educated reader to verify these claims without expanded and detailed reporting of the technical work underpinning this report. This is lacking in transparency, falls short of the standards we should expect from AERO, and makes it impossible for AERO to be held accountable for its specific interpretation of its own results.

AERO: Criticism of the perceived lack of consideration of ‘ceiling effects’ in AERO’s analysis of the trends of high-performing students’ results omits the fact that AERO’s analysis focused on the criteria scores (not the scaled measurement scores). AERO used the proportion of students achieving the top two scores (not the top score) for each criterion as the metric to examine the trends. Given only a small proportion of students achieved a top score for any criterion (as shown in the report statistics), there is no ‘ceiling effect’ that could have biased the interpretation of the trends.

Associate Professor Ladwig made his ‘ceiling effect’ comments while explaining how the NAPLAN writing scores are designed, not in relation to the AERO analysis.

AERO: The third major inaccuracy relates to the comments made about the ‘measurement error’ around the NAPLAN bands and the use of adaptive testing to reduce error. These are irrelevant to AERO’s analysis because the main analysis did not use scaled scores, it did not use bands, and adaptive testing is not applicable to the writing assessment.

Associate Professor Ladwig’s comment was about the scaling in relation to explaining the score development, not about the AERO analysis.

In relation to AERO’s use of NAPLAN criterion score data in the writing analysis, however, please note that those scores are created either through scorer moderation processes or (increasingly, where possible) text-interpretative algorithms. Here again the reliability of these raw scores was not addressed, apart from one declared limitation, in AERO’s own terms:

Another key assumption underlying most of the interpretation of results in this report is that marker effects (that is, marking inconsistency across years) are small and therefore they do not impact on the comparability of raw scores over time. (p. 66)

This is where AERO has taken another shortcut, with an assumption that should not be made. ACARA has reported the reliability estimates needed to take this into account in analyses of the scores. It is readily possible to report those and use them for trend analyses.

AERO: A final point: the mixed-methods design of the research was not recognised in the article. AERO’s analysis examined the skills students were able to achieve at the criterion level against curriculum documents. Given the assessment is underpinned by a theory of language, we were able to complement quantitative with a qualitative analysis that specifically highlighted the features of language students were able to achieve. This was validated by analysis of student writing scripts.

Associate Professor Ladwig says this is irrelevant to his analysis. The logic of this is also a concern: using multiple methods and methodologies does not correct for any that are technically lacking. In relation to the overall point of concern, we have a clear example of an agency reporting statistical results in a manner that elides external scrutiny, accompanied by extreme media positioning. Any of the qualitative insights into the minutiae these numbers represent will probably be very useful for teachers of writing – but whether or not they are generalisable, big, or shifting depends on the statistical analyses themselves.

Is the NAPLAN results delay about politics or precision?

The decision announced yesterday by ACARA to delay the release of preliminary NAPLAN data is perplexing. The justification is that the combination of concerns around the impact of COVID-19 on children and the significant flooding that occurred across parts of Australia in early 2022 contributed to many parents deciding to opt their children out of participating in NAPLAN. The official account explains:

“The NAPLAN 2022 results detailing the long-term national and jurisdictional trends will be released towards the end of the year as usual, but there will be no preliminary results release in August this year as closer analysis is required due to lower than usual student participation rates as a result of the pandemic, flu and floods.”

The media release goes on to say that this decision will not affect the release of results to schools and to parents, which have historically occurred at similar times of the year. The question that this poses, of course, is why the preliminary reporting of results is affected, but student and school reports will not be. The answer is likely to do with the nature of the non-participation. 

The most perplexing part of this decision is that NAPLAN has regularly had participation rates below 90% at various times among various cohorts. That has never prevented preliminary results being released before.

What are the preliminary results?

Since 2008, NAPLAN has been a controversial feature of the Australian school calendar for students in Years 3, 5, 7 and 9. The ‘pencil-and-paper’ version of NAPLAN was criticised for how statistical error impacts its precision at the student and school level (Wu, 2016), the impact that NAPLAN has had on teaching and learning (Hardy, 2014), and the time it takes for the results to come back (Thompson, 2013). Since 2018, NAPLAN has gradually shifted to an online, adaptive design which ACARA claims is “better targeted to students’ achievement levels and response styles”, meaning that the tests “provide more efficient and precise estimates of students’ achievements than do fixed form paper based tests”. 2022 was the first year that the tests were fully online.

NAPLAN essentially comprises four levels of reporting. These are student reports, school level reports, preliminary national reports and national reports. The preliminary reports are usually released around the same time as the student and school results. They report on broad national and sub-national trends, including average results for each year level in each domain across each state and territory and nationally. Closer to the end of the year, a National Report is released which contains deeper analysis on how characteristics such as gender, Indigenous status, language background other than English status, parental occupation, parental education, and geolocation impact achievement at each year level in each test domain.

Participation rates

The justification given in the media release concerns participation rates. To understand this better, we need to understand how participation impacts the reliability of test data and the validity of inferences that can be made as a result (Thompson, Adie & Klenowski, 2018). NAPLAN is a census test. This means that in a perfect world, all students in Years 3, 5, 7 & 9 would sit their respective tests. Of course, 100% participation is highly unlikely, so ACARA sets a benchmark of 90% for participation. Their argument is that if 90% of any given cohort sits a test we can be confident that the results of those sitting the tests are representative of the patterns of achievement of the entire population, even sub-groups within that population. ACARA calculates the participation rate as “all students assessed, non-attempt and exempt students as a percentage of the total number of students in the year level”. Non-attempt students are those who were present but either refused to sit the test or did not provide sufficient information to estimate an achievement score. Exempt students are those exempt from  one or more of the tests on the grounds of English language proficiency or disability.

The challenge, of course, is that non-participation introduces error into the calculation of student achievement. Error is a feature of standardised testing. It doesn’t mean mistakes in the test itself; rather, it is an estimation of the various ways that uncertainty emerges in predicting how proficient a student is in an entire domain based on the relatively small sample of questions that makes up a test. The greater the error, the less precise (i.e. less reliable) the tests are. With regards to participation, the greater the non-participation, the more uncertainty is introduced into that prediction.
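The textbook way to express that test-based uncertainty is the standard error of measurement from classical test theory (a generic formula, not ACARA’s published machinery):

```latex
\mathrm{SEM} \;=\; \sigma_X \sqrt{1 - \rho_{XX'}}
```

where σ_X is the standard deviation of the scores and ρ the test’s reliability; the lower the reliability, the wider the plausible range around any individual reported score.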

The confusing thing in this decision is that NAPLAN has regularly had participation rates below 90% at various times among various cohorts. This participation data can be accessed here. For example, in 2021 the average participation rates for Year 9 students were slightly below the 90% threshold in every domain, yet this did not impact the release of the Preliminary Report.

Table 1: Year 9 Participation in NAPLAN 2021 (generated from ACARA data)

These 2021 results are not an anomaly; they are part of a trend that has emerged over time. For example, in pre-pandemic 2018 the jurisdictions of Queensland, South Australia, the ACT and the Northern Territory did not reach the 90% threshold in any of the Year 9 domains.

Table 2: Year 9 Participation in NAPLAN 2018 (generated from ACARA data)

Given the results above, the question remains: why has participation affected the reporting of the 2022 results when the Year 9 results in 2018 and 2021 were not similarly affected?

At the outset, I am going to say that there is a degree of speculation in answering this question. Primarily, this is because even if participation declines to 85%, this is still a very large sample with which to predict the achievement of the population in a given domain, so it must be that something has not worked when they have tried to model the data. I am going to suggest three possible reasons:

  1. The first is likely, given that it is hinted at in the ACARA press release. If we return to the relationship between participation, error and the validity of inferences, the most likely way that an 85% participation rate could be a problem is if non-participation is not randomly spread across the population. If non-participation was shown to be systematic – that is, heavily biased towards particular subgroups – then, depending upon the size of that bias, the ability to make valid inferences about achievement in different jurisdictions or amongst different sub-groups could be severely impacted (a simple simulation of this effect is sketched after this list). One effect of this is that it might become difficult to reliably equate 2022 results with previous years. This could explain why the lower than 90% Year 9 participation in 2021 was not a problem – that non-participation was relatively randomly spread across the sub-groups.
  2. Second, and related to the above, is that the non-participation has something to do with the material and infrastructural requirements for an online test that is administered to all students across Australia. There have long been concerns about the infrastructure requirements of NAPLAN online, such as access to computers, reliable internet connections and so on, particularly in regional and remote areas of Australia. If these were to influence results, such as through an increased number of students unable to attempt the test, this could also influence the reliability of inferences amongst particular sub-groups.
  3. The final possibility is political. It has been obvious for some time that various Education Ministers have become frustrated with aspects of the NAPLAN program. The most prominent example of this was the concern expressed by the Victorian Education Minister in 2018 about the reliability of the equating of the online and paper tests (see The Guardian’s report, ‘Education chiefs have botched Naplan online test, says Victoria minister’). During 2018, ACARA were criticised for showing a lack of responsible leadership in releasing results that seemed to show a mode effect – that is, a difference between students who sat the online versus the pen-and-paper test not related to their capacity in literacy and numeracy. It may be that ACARA has grown cautious as a result of the 2018 ministerial backlash and feels that any potential problems with the data need to be thoroughly investigated before jurisdictions are named and shamed based on their average scores.
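As flagged in the first possibility above, here is a minimal simulation of why it matters whether non-participation is random or systematic (all numbers are invented; the point is the pattern, not the magnitudes):

```python
import numpy as np

rng = np.random.default_rng(3)

# An invented cohort of 100,000 'students' with known scale scores.
cohort = rng.normal(550, 70, 100_000)

# Case 1: 15% of students miss the test completely at random.
random_miss = rng.random(cohort.size) < 0.15

# Case 2: the same overall 15% miss the test, but absence is concentrated
# among lower-scoring students (absence probability falls as scores rise).
p_absent = np.clip(0.30 - 0.001 * (cohort - 450), 0.02, 0.60)
p_absent *= 0.15 / p_absent.mean()          # rescale to ~15% overall absence
skewed_miss = rng.random(cohort.size) < p_absent

print(f"true cohort mean:          {cohort.mean():.1f}")
print(f"mean with random absences: {cohort[~random_miss].mean():.1f}")  # barely moves
print(f"mean with skewed absences: {cohort[~skewed_miss].mean():.1f}")  # shifted upwards
```

If non-participation looks like the second case, and looks different across jurisdictions or sub-groups, comparisons with earlier years become much harder to defend.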

Ultimately, this leads us to perhaps one of the more frustrating things: we may never know. Where problems emerge around NAPLAN, the tendency is for ACARA and/or the Federal Education Minister to whom ACARA reports to try to limit criticism by denying access to the data. In 2018, at the height of the controversy over the differences between the online and pencil-and-paper modes, I formed a team with two internationally eminent psychometricians to research whether there was a mode effect between the online and pencil-and-paper versions of NAPLAN. The request to ACARA to access the dataset was denied on the grounds that ACARA could not release item-level data for the 2018 online items, presumably because they were provided by commercial entities. In the end, we just have to trust ACARA that there was not a mode effect. If we have learnt anything from recent political scandals, perfect opaqueness remains a problematic governance strategy.

Greg Thompson is a professor in the Faculty of Creative Industries, Education & Social Justice at the Queensland University of Technology. His research focuses on the philosophy of education and educational theory. He is also interested in education policy, and the philosophy/sociology of education assessment and measurement with a focus on large-scale testing and learning analytics/big data.

Why appeasing Latham won’t make our students any more remarkable

Are our schools making the kids we think they should? The tussle between politics and education continues, and Latham is just the blunt end of what is now the assumed modus operandi of school policy in Australia.

Many readers of this blog will no doubt have noticed a fair amount of public educational discussion about NSW’s School Success Model (SSM) which, according to the Department flyer, is ostensibly new. For background NSW context, it is important to note that this policy was released in the context of a new Minister for Education who has openly challenged educators to ‘be more accountable’, alongside an entire set of parliamentary educational inquiries set up to appease Mark Latham, who chairs a portfolio committee with a very clear agenda motivated by the populism of his political constituency.

This matters because there are two specific logics used in the political arena that have been shifted into the criticisms of schools: public dissatisfaction leading to the accountability question (so there’s a ‘public good’ ideal somewhere behind this), and the general rejection of authorities and elitism (alternatively, easily labelled anti-intellectualism). Both of these political concerns are connected to the School Success Model. The public dissatisfaction is motivating the desire for measures of accountability that the public believes can be free of tampering, and that ‘matter’ – test scores dictate students’ futures, so they matter, and so on. The rejection of elitism is also embedded in the accountability issue. That is due to a (not always unwarranted) lack of trust. That lack of trust often gets openly directed at specific people.

Given the context, while the new School Success Model (SSM) is certainly well intended, it also represents one of the more direct links between politics and education we typically see. The ministerialisation of schooling is clearly alive and well in Australia. This isn’t the first time we have seen such direct links – the politics of NAPLAN was, after all, straight from the political intentions of its creators. It is important to note that the logic at play has been used by both major parties in government. Implied in that observation is that the systems we have live well beyond election cycles.

Now in this case, the basic political issue is how to ‘make’ schools rightfully accountable and, at the same time, push for improvement. I suspect these are at least popular sentiments, if not overwhelmingly accepted as a given by the vast majority of the public. So alongside general commitments to ‘delivering support where it is needed’ and ‘learning from the past’, the model is most notable for its main driver – a matrix of ‘outcome’ targets. The public document includes targets at the system level and the school level – aligned. NAPLAN, Aboriginal Education, HSC, Attendance, Student growth (equity), and Pathways are the main areas specified for naming targets.

But, like many of the other systems created with the same good intent before it, this one really does invite the growing criticism already noted in public commentary. Since, with luck, public debate will continue, here I would like to put some broader historical context around these debates and take a look under the hood of these measures, to show why they really aren’t fit for school accountability purposes without a far more sophisticated understanding of what they can and cannot tell you.

In the process of walking through some of this groundwork, I hope to show why the main problem here is not something a reform here or there will change. The systems are producing pretty much what they are designed to produce.

On the origins of this form of governance

Anyone who has studied the history of schooling and education (shockingly few in the field these days) would immediately see the target-setting agenda as a ramped-up version of scientific management (see Callahan, 1962), blended with a bit of Michael Barber’s methodology for running government (Barber, 2015), using contemporary measurements.

More recently, at least since the then labelled ‘economic rationalist’ radical changes brought to the Australian public services and government structures in the late 1980s and early 1990s, the notion of measuring the outcomes of schools as a performance issue has matured, in tandem with the past few decades of increasing dominance of the testing industry (which also grew throughout the 20th century). The central architecture of this governance model would be called neo-liberal these days, but it is basically a centralised ranking system based on pre-defined measures determined by a select few, and those measures are designed to be palatable to the public. Using such systems to instil a bit of group competition between schools fits very well with those who believe market logic works for schooling, or those who like sport.

The other way of motivating personnel in such systems is, of course, mandate, such as the now mandated Phonics Screening Check announced in the flyer.

The devil in the details

Now when it comes to school measures, there are many types, and we actually know a fair amount about most if not all of them – as most were generated from research somewhere along the way. There are some problems of interpretation that all school measures face, which relate to the basic problem that most measures are actually measures of individuals (and not the school), or vice-versa. Relatedly, we also often see school-level measures which are simply the aggregate of the individuals. In all of these cases, there are many good intentions that don’t match reality.

For example, it isn’t hard to make a case for saying schools should measure student attendance. The logic here is that students have to be at school to learn school things (aka achievement tests of some sort). You can simply aggregate individual students’ attendance to the school level and report it publicly (as on MySchool), because students need to be in school. But it would be a very big mistake to assume that the school-level aggregated mean attendance is at all related to school-level achievement. It is often the case that what is true for the individual is not true for the collective to which the individual belongs. Another case in point here is the policy argument that we need expanded educational attainment (which is ‘how long you stay in schooling’) because if more people get more education, that will bolster the general economy. Nationally that is a highly debatable proposition (among OECD countries there isn’t even a significant correlation between average educational attainment and GDP). Individually it does make sense – educational attainment and personal income, or individual status attainment, are generally quite positively related. School-level attendance measures that are simple aggregates are not related to school achievement (Ladwig and Luke, 2013). This may be why the current articulation of the attendance target is a percentage of students attending more than 90% of the time (surely a better articulation than a simple average – but still an aggregate of untested effect). The point is more direct – often these targets are motivated by a goal based on some causal idea, but the actual measures often don’t reflect that idea directly. The sketch below illustrates how an individual-level relationship can disappear at the school level.
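A toy simulation of that aggregation problem, with everything invented (it illustrates the statistical point, not real schools): individual attendance and achievement are positively related within every school, yet the school-level means end up essentially unrelated because the school-to-school drivers of each are different.

```python
import numpy as np

rng = np.random.default_rng(4)
n_schools, n_students = 200, 100

# School-level 'contexts': mean attendance and mean achievement are generated
# independently of each other (i.e. driven by different factors).
base_attend = rng.normal(88, 5, n_schools)      # school mean attendance (%)
base_achieve = rng.normal(550, 30, n_schools)   # school mean scale score

students, school_means = [], []
for s in range(n_schools):
    a = rng.normal(base_attend[s], 6, n_students)
    # Within a school, students who attend more score somewhat higher.
    y = base_achieve[s] + 3.0 * (a - base_attend[s]) + rng.normal(0, 25, n_students)
    students.append(np.column_stack([a, y]))
    school_means.append([a.mean(), y.mean()])

students = np.vstack(students)
school_means = np.array(school_means)

# Student level: a clear positive attendance-achievement relationship.
print(f"student-level r: {np.corrcoef(students[:, 0], students[:, 1])[0, 1]:.2f}")
# School level: the aggregated means are close to unrelated, because the
# school-level drivers were generated independently.
print(f"school-level r:  {np.corrcoef(school_means[:, 0], school_means[:, 1])[0, 1]:.2f}")
```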

Another general problem, especially for the achievement data, is the degree to which all of the national (and state) measures are in fact estimates, designed to serve specific purposes. The degree to which this is true varies from test to test. Almost all design options in assessment systems carry trade-offs. There is a big difference between an HSC score – where the HSC exams and syllabuses are very closely aligned and the student performance is designed to reflect that – and NAPLAN, which is designed not to be directly related to syllabuses but overtly to estimate achievement on an underlying scale derived from the population. For HSC scores, it makes some sense to set targets, but notice those targets come in the form of percentages of students in a given ‘Band’.

Now these bands are tidy and no doubt intended to make interpretation of results easier for parents (that’s the official rationale). However, both HSC Bands and NAPLAN bands represent ‘coarsened’ data, which means that they are calculated on the basis of some more finely measured scale (HSC raw scores, NAPLAN scale scores). There are two known problems with coarsened data: 1) in general it increases measurement error (almost by definition), and 2) the bands are not static over time. Of these two systems, the HSC would be much more stable over time, but even there much development occurs, and the actual qualitative descriptors of the bands change as syllabuses are modified. So these band scores, and the number of students in each, really need to be understood as far less precise than counting kids in those categories implies. For more explanation – and an example of one school which decided to change its spelling programs on the basis of needing one student to get one more test item correct, in order to meet its goal of having a given percentage of students in a given band – see Ladwig (2018).

There is a lot of detail behind this general description, but the point is made very clearly in the technical reports, such as when ACARA shifted how it calibrated its 2013 results relative to previous test years – where you find the technical report explaining that ACARA would need to stop assuming previous scaling samples were ‘secure’. New scaling samples have been drawn each year since 2013. When explaining why it needed to estimate sampling error in a test that was given to all students in a given year, ACARA was forthright and made it very clear:

‘However, the aim of NAPLAN is to make inference about the educational systems each year and not about the specific student cohorts in 2013’ (p. 24).

Here you can see overtly that the test was NOT designed for the purposes the NSW Minister wishes to pursue.

The slippage between any credential (or measure) and what it is supposed to represent has a couple of names. When it comes to testing and achievement measurements, it’s called error. There’s a margin within which we can be confident, so analysis of any of that data requires a lot of judgement, best made by people who know what and who is being measured. But that judgement cannot be exercised well without a lot of background knowledge that is not typically in the extensive catalogue of background knowledge needed by school leaders.

At a system level, the slippage between what’s counted and what it actually means is called decoupling. And any of the new school-level targets are ripe for such slippage. The number of Aboriginal students obtaining an HSC is clear enough – but does it reflect the increasing number of alternative pathways used by an increasingly wide array of institutions? Counting how many kids continue to Year 12 makes sense, but it is also motivation for schools to count kids simply for that purpose.

In short, while the public critics have spotted potential perverse unintended consequences, I would hazard a prediction that they’ve only scratched the surface. Australia already has ample evidence of NAPLAN results being used as the basis of KPI development, with significant problematic side effects – there is no reason to think this would be immune from misuse, and in fact it invites more (see Mockler and Stacey, 2021).

The challenge we need to take up is not how to make schools ‘perform’ better or teachers ‘teach’ better – any of those aims are well intended, but this is a good time to point out that common sense really isn’t sensible once you understand how the systems work. To me it is the wrong question to ask how we make this or that part of the system do something more or better.

In this case, it’s a question of how we can build systems in which schools and teachers are rightfully and fairly accountable, and in which schools, educators and students are all growing. And THAT question cannot be reached until Australia opens up bigger questions about curriculum that have been locked into what has been a remarkably resilient structure ever since the early 1990s attempts to create a national curriculum.

Figure 1: Taken from the NAPLAN 2013 Technical Report, p. 19

This extract shows the path from a raw score on a NAPLAN test to what eventually becomes a ‘scale score’ – per domain. It is important to note that the scale score isn’t a count – it is based on a set of interlocking estimations that align (calibrate) the test items. That ‘logit’ score is based on the overall probability of test items being correctly answered.
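In schematic form (generic item-response notation; the actual equating constants and partial-credit details are set out in the technical report itself), the chain runs from the probability of answering items correctly, to a location in logits, to the reported scale score:

```latex
P(\text{correct}) = \frac{e^{\theta_p - b_i}}{1 + e^{\theta_p - b_i}}
\quad\Longrightarrow\quad
\hat{\theta}_p \ \text{(logits)}
\quad\Longrightarrow\quad
\text{scale score}_p = a\,\hat{\theta}_p + c
```

where a and c are equating constants fixed so that results can be compared across years and year levels – which is why the scale score is an estimate anchored to a model, not a count of correct answers.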

James Ladwig is Associate Professor in the School of Education at the University of Newcastle and co-editor of the American Educational Research Journal.  He is internationally recognised for his expertise in educational research and school reform.  Find James’ latest work in Limits to Evidence-Based Learning of Educational Science, in Hall, Quinn and Gollnick (Eds) The Wiley Handbook of Teaching and Learning published by Wiley-Blackwell, New York (in press). James is on Twitter @jgladwig

References

Barber, M. (2015). How to Run A Government: So that Citizens Benefit and Taxpayers Don’t Go Crazy. Penguin Books Limited.

Callahan, R. E. (1962). Education and the Cult of Efficiency. University of Chicago Press.

Ladwig, J., & Luke, A. (2013). Does improving school level attendance lead to improved school level achievement? An empirical study of indigenous educational policy in Australia. The Australian Educational Researcher, 1-24. doi:10.1007/s13384-013-0131-y

Ladwig, J. G. (2018). On the Limits to Evidence-Based Learning of Educational Science. In G. Hall, L. F. Quinn, & D. M. Gollnick (Eds.), The Wiley Handbook of Teaching and Learning (pp. 639-658). New York: Wiley and Sons.

Mockler, N., & Stacey, M. (2021). Evidence of teaching practice in an age of accountability: when what can be counted isn’t all that counts. Oxford Review of Education, 47(2), 170-188. doi:10.1080/03054985.2020.1822794

Main image: Australian politician Mark Latham at the 2018 Church and State Summit. Date: 15 January 2018. Source: “Mark Latham – Church And State Summit 2018”, YouTube (screenshot). Author: Pellowe Talk YouTube channel (Dave Pellowe).