Utilizing large language models for EFL essay grading: An examination of reliability and validity in rubric-based assessments

dc.authorid0000-0003-2645-2710en_US
dc.authorid0000-0002-0300-9073en_US
dc.authorid0000-0003-1571-9686en_US
dc.contributor.authorYavuz, Fatih
dc.contributor.authorÇelik, Özgür
dc.contributor.authorÇelik, Gamze Yavaş
dc.date.accessioned2025-01-14T11:41:51Z
dc.date.available2025-01-14T11:41:51Z
dc.date.issued2024en_US
dc.departmentYüksekokullar, Yabancı Diller Yüksekokulu, Yabancı Diller Bölümüen_US
dc.descriptionÇelik, Özgür (Balikesir Author)en_US
dc.description.abstractThis study investigates the validity and reliability of generative large language models (LLMs), specifically ChatGPT and Google's Bard, in grading student essays in higher education based on an analytical grading rubric. A total of 15 experienced English as a foreign language (EFL) instructors and two LLMs were asked to evaluate three student essays of varying quality. The grading scale comprised five domains: grammar, content, organization, style & expression, and mechanics. The results revealed that the fine-tuned ChatGPT model demonstrated a very high level of reliability with an intraclass correlation (ICC) score of 0.972, the default ChatGPT model exhibited an ICC score of 0.947, and Bard showed a substantial level of reliability with an ICC score of 0.919. Additionally, a significant overlap was observed in certain domains when comparing the grades assigned by LLMs and human raters. In conclusion, the findings suggest that while the LLMs demonstrated notable consistency and potential for grading competency, further fine-tuning and adjustment are needed for a more nuanced understanding of non-objective essay criteria. The study not only offers insights into the potential use of LLMs in grading student essays but also highlights the need for continued development and research.

Practitioner notes

What is already known about this topic
- Large language models (LLMs), such as OpenAI's ChatGPT and Google's Bard, are known for their ability to generate text that mimics human-like conversation and writing.
- LLMs can perform various tasks, including essay grading.
- Intraclass correlation (ICC) is a statistical measure used to assess the reliability of ratings given by different raters (in this case, EFL instructors and LLMs).

What this paper adds
- The study makes a unique contribution by directly comparing the grading performance of expert EFL instructors with two LLMs, ChatGPT and Bard, using an analytical grading scale.
- It provides robust empirical evidence of the high reliability of LLMs in grading essays, supported by high ICC scores.
- It specifically highlights that the overall efficacy of LLMs extends to certain domains of essay grading.

Implications for practice and/or policy
- The findings open up potential new avenues for utilizing LLMs in academic settings, particularly for grading student essays, thereby possibly alleviating the workload of educators.
- The paper's insistence on the need for further fine-tuning of LLMs underlines the continual interplay between technological advancement and its practical applications.
- The results lay a foundation for future research in advancing the use of AI in essay grading.en_US
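The abstract reports inter-rater reliability as intraclass correlation (ICC) scores. As an illustration only, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) from a ratings matrix with plain NumPy; the exact ICC variant used in the study is not stated in this record, and the scores shown are hypothetical, not the study's data.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_targets, k_raters) matrix of scores.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    ssr = k * ((row_means - grand) ** 2).sum()  # between targets (essays)
    ssc = n * ((col_means - grand) ** 2).sum()  # between raters
    sst = ((ratings - grand) ** 2).sum()
    sse = sst - ssr - ssc                        # residual
    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical scores: 3 essays of varying quality rated by 4 raters.
scores = np.array([
    [88, 90, 86, 91],   # strong essay
    [70, 72, 68, 71],   # mid essay
    [50, 49, 52, 48],   # weak essay
], dtype=float)
print(round(icc2_1(scores), 3))
```

When raters rank the essays consistently and agree in absolute terms, as above, the ICC approaches 1, which is how the high scores reported for the LLM and human raters should be read.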
dc.identifier.doi10.1111/bjet.13494
dc.identifier.endpage17en_US
dc.identifier.issn0007-1013
dc.identifier.issn1467-8535
dc.identifier.issuejuneen_US
dc.identifier.scopus2-s2.0-85195119605
dc.identifier.scopusqualityQ1
dc.identifier.startpage1en_US
dc.identifier.urihttps://doi.org/10.1111/bjet.13494
dc.identifier.urihttps://hdl.handle.net/20.500.12462/15755
dc.identifier.volume2024en_US
dc.identifier.wosWOS:001237843800001
dc.identifier.wosqualityQ1
dc.indekslendigikaynakWeb of Science
dc.indekslendigikaynakScopus
dc.language.isoenen_US
dc.publisherJohn Wiley and Sonsen_US
dc.relation.ispartofBritish Journal of Educational Technologyen_US
dc.relation.publicationcategoryMakale - Uluslararası Hakemli Dergi - Kurum Öğretim Elemanıen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/3.0/us/*
dc.subjectAI-Based Gradingen_US
dc.subjectAutomated Essay Scoringen_US
dc.subjectGenerative AIen_US
dc.subjectLarge Language Modelsen_US
dc.subjectReliabilityen_US
dc.subjectRubric-Based Gradingen_US
dc.subjectValidityen_US
dc.titleUtilizing large language models for EFL essay grading: An examination of reliability and validity in rubric-based assessmentsen_US
dc.typeArticleen_US

Files

Original bundle

Showing 1 - 1 of 1
Name:
ozgur-celik.pdf
Size:
1.76 MB
Format:
Adobe Portable Document Format
Description:
Full Text

License bundle

Showing 1 - 1 of 1
Name:
license.txt
Size:
1.44 KB
Format:
Description:
Item-specific license agreed upon to submission