Utilizing large language models for EFL essay grading: An examination of reliability and validity in rubric-based assessments

dc.authorid0000-0003-2645-2710en_US
dc.authorid0000-0002-0300-9073en_US
dc.authorid0000-0003-1571-9686en_US
dc.contributor.authorYavuz, Fatih
dc.contributor.authorÇelik, Özgür
dc.contributor.authorÇelik, Gamze Yavaş
dc.date.accessioned2025-01-14T11:41:51Z
dc.date.available2025-01-14T11:41:51Z
dc.date.issued2024en_US
dc.departmentYüksekokullar, Yabancı Diller Yüksekokulu, Yabancı Diller Bölümüen_US
dc.descriptionÇelik, Özgür (Balikesir Author)en_US
dc.description.abstractThis study investigates the validity and reliability of generative large language models (LLMs), specifically ChatGPT and Google's Bard, in grading student essays in higher education based on an analytical grading rubric. A total of 15 experienced English as a foreign language (EFL) instructors and two LLMs were asked to evaluate three student essays of varying quality. The grading scale comprised five domains: grammar, content, organization, style & expression, and mechanics. The results revealed that the fine-tuned ChatGPT model demonstrated a very high level of reliability with an intraclass correlation (ICC) score of 0.972, the default ChatGPT model exhibited an ICC score of 0.947, and Bard showed a substantial level of reliability with an ICC score of 0.919. Additionally, a significant overlap was observed in certain domains when comparing the grades assigned by LLMs and human raters. In conclusion, the findings suggest that while the LLMs demonstrated notable consistency and potential for grading competency, further fine-tuning and adjustment are needed for a more nuanced understanding of non-objective essay criteria. The study not only offers insights into the potential use of LLMs in grading student essays but also highlights the need for continued development and research.

Practitioner notes

What is already known about this topic
- Large language models (LLMs), such as OpenAI's ChatGPT and Google's Bard, are known for their ability to generate text that mimics human-like conversation and writing.
- LLMs can perform various tasks, including essay grading.
- Intraclass correlation (ICC) is a statistical measure used to assess the reliability of ratings given by different raters (in this case, EFL instructors and LLMs).

What this paper adds
- The study makes a unique contribution by directly comparing the grading performance of expert EFL instructors with two LLMs, ChatGPT and Bard, using an analytical grading scale.
- It provides robust empirical evidence of the high reliability of LLMs in grading essays, supported by high ICC scores.
- It specifically highlights that the overall efficacy of LLMs extends to certain domains of essay grading.

Implications for practice and/or policy
- The findings open up potential new avenues for utilizing LLMs in academic settings, particularly for grading student essays, thereby possibly alleviating the workload of educators.
- The paper's insistence on the need for further fine-tuning of LLMs underlines the continual interplay between technological advancement and its practical applications.
- The results lay a foundation for future research in advancing the use of AI in essay grading.en_US
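The abstract reports inter-rater reliability as intraclass correlation (ICC) scores. As an illustration only, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) from a ratings matrix with plain NumPy; the exact ICC variant used in the study is not stated in this record, and the scores shown are hypothetical, not the study's data.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_targets, k_raters) matrix of scores.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    ssr = k * ((row_means - grand) ** 2).sum()  # between targets (essays)
    ssc = n * ((col_means - grand) ** 2).sum()  # between raters
    sst = ((ratings - grand) ** 2).sum()
    sse = sst - ssr - ssc                        # residual
    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical scores: 3 essays of varying quality rated by 4 raters.
scores = np.array([
    [88, 90, 86, 91],   # strong essay
    [70, 72, 68, 71],   # mid essay
    [50, 49, 52, 48],   # weak essay
], dtype=float)
print(round(icc2_1(scores), 3))
```

When raters rank the essays consistently and agree in absolute terms, as above, the ICC approaches 1, which is how the high scores reported for the LLM and human raters should be read.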
dc.identifier.doi10.1111/bjet.13494
dc.identifier.endpage17en_US
dc.identifier.issn0007-1013
dc.identifier.issn1467-8535
dc.identifier.issuejuneen_US
dc.identifier.scopus2-s2.0-85195119605
dc.identifier.scopusqualityQ1
dc.identifier.startpage1en_US
dc.identifier.urihttps://doi.org/10.1111/bjet.13494
dc.identifier.urihttps://hdl.handle.net/20.500.12462/15755
dc.identifier.volume2024en_US
dc.identifier.wosWOS:001237843800001
dc.identifier.wosqualityQ1
dc.indekslendigikaynakWeb of Science
dc.indekslendigikaynakScopus
dc.language.isoenen_US
dc.publisherJohn Wiley and Sonsen_US
dc.relation.ispartofBritish Journal of Educational Technologyen_US
dc.relation.publicationcategoryMakale - Uluslararası Hakemli Dergi - Kurum Öğretim Elemanıen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/3.0/us/*
dc.subjectAI-Based Gradingen_US
dc.subjectAutomated Essay Scoringen_US
dc.subjectGenerative AIen_US
dc.subjectLarge Language Modelsen_US
dc.subjectReliabilityen_US
dc.subjectRubric-Based Gradingen_US
dc.subjectValidityen_US
dc.titleUtilizing large language models for EFL essay grading: An examination of reliability and validity in rubric-based assessmentsen_US
dc.typeArticleen_US

Files

Original bundle

Showing 1 - 1 of 1
Name:
ozgur-celik.pdf
Size:
1.76 MB
Format:
Adobe Portable Document Format
Description:
Full Text

License bundle

Showing 1 - 1 of 1
Name:
license.txt
Size:
1.44 KB
Format:
Description:
Item-specific license agreed upon to submission