III Simposio de Postgrado 2025: Ingeniería, ciencia e innovación

Prompting Toulmin: A Methodology for Evaluating Ethical Argumentation with Large Language Models

Valentina Aravena¹*, Gustavo Zurita², Nelson Baloian¹, Claudio Álvarez³

¹ Departamento de Ciencias de la Computación, Universidad de Chile
² Facultad de Economía y Negocios, Universidad de Chile
³ Facultad de Ingeniería y Ciencias Aplicadas, Universidad de los Andes
*E-mail: varavena@dcc.uchile.cl

Módulo: Cs. de la Computación y Cs. de Datos e IA

Abstract

This study examines the potential of Generative Artificial Intelligence, through Large Language Models (LLMs), to automate the formative assessment of ethical arguments in higher education settings mediated by digital technologies. Toulmin's model of argumentation was used as the theoretical framework for analyzing argument structure, and a hard-prompting methodology was designed to improve the consistency of the responses generated by the models. The activity was implemented using EthicApp, an educational platform that promotes ethical learning through digital case-based dilemmas. The first methodological stage involved generating an optimized prompt to ensure consistent responses from the LLMs; to this end, ten random responses were evaluated consecutively to measure the stability of the models' assessments. In the second stage, 399 student arguments were analyzed using four state-of-the-art language models (GPT-4o, Gemini-1.5-pro, Claude-3.5-Sonnet, and DeepSeek-Reasoner), and their evaluations were compared to human judgments on a random sample of 71 arguments, validated through the Monte Carlo method. In the third stage, 700 arguments (350 from each phase of the activity) were evaluated, and a new validation was conducted with human raters on a sample of 108 responses. The results indicate that the models, especially GPT-4o and DeepSeek-Reasoner, tend to replicate human evaluative patterns when there is strong consensus among expert raters.
The pedagogical implications of using AI in formative feedback are discussed, along with design recommendations for future automated assessment systems. This work contributes to the development of replicable methodologies for integrating generative AI into digital educational environments, promoting more reflective, scalable, and student-centered learning experiences.
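The first-stage stability check described above (repeated evaluation of the same response to measure the consistency of a prompt) can be sketched as follows. This is an illustrative sketch, not the study's actual pipeline: `mock_evaluate` and the category labels stand in for real LLM calls (e.g., to GPT-4o or DeepSeek-Reasoner) and for the study's Toulmin-based rubric.

```python
import random
from collections import Counter

def stability(scores):
    """Fraction of repeated evaluations that match the modal score.

    scores: categorical scores returned by repeated LLM calls on the
    same student argument. 1.0 means the prompt yields perfectly
    consistent assessments across runs.
    """
    _, mode_count = Counter(scores).most_common(1)[0]
    return mode_count / len(scores)

# Hypothetical stand-in for an LLM evaluation call; a real pipeline
# would send the optimized prompt plus the argument to a model API.
def mock_evaluate(argument, rng):
    return rng.choice(["adequate", "adequate", "adequate", "weak"])

rng = random.Random(42)
argument = "Example student argument on an ethical dilemma."
# Ten consecutive evaluations of the same response, as in stage one.
scores = [mock_evaluate(argument, rng) for _ in range(10)]
print(f"stability = {stability(scores):.2f}")
```

A prompt whose stability stays near 1.0 across sampled arguments would be considered sufficiently consistent to proceed to the large-scale evaluation stages.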

RkJQdWJsaXNoZXIy Mzc3MTg=