Large Language Models and Legal Practice at the STF: A Comparative Study of Decisions in Interlocutory Appeals on Extraordinary Remedies (agravos em recurso extraordinário)
Keywords: Artificial Intelligence; Large Language Models; Federal Supreme Court; Judicial Decisions; Procedural Reliability.
This study assesses the legal-procedural reliability of large language models (LLMs) in drafting judicial decisions for the Brazilian Federal Supreme Court (STF) in interlocutory appeals on extraordinary remedies (agravos em recurso extraordinário, ARE). Developed within the Artificial Intelligence Innovation Project at the University of Brasília, the research aligns with the Brazilian AI Strategy (EBIA), the Brazilian AI Plan (PBIA), and the STF's institutional agenda on technology adoption. Given the widespread use of generative AI by judges, the study underscores the urgent need to establish secure parameters for its use in adjudication. Fifteen recent AREs were selected, covering a range of topics and levels of complexity. For each case, the GPT-4o model independently generated a draft decision based solely on the case records, without human intervention. The outputs were compared with the original STF decisions along four analytical axes: (i) factual and informational accuracy; (ii) adherence to the jurisdictional admissibility filters and the limited scope of review in AREs; (iii) structural conformity (report, reasoning, dispositive); and (iv) presence of hallucinations, defined as non-existent, imprecise, or incorrect statements or references. The findings are striking: hallucinations occurred in 80% of cases; undue engagement with the merits in 73.3%; non-compliance with the required opening format in 73.3%; and improper drafting of the dispositive in 86.7%. Only one of the fifteen AI-generated decisions (6.7%) fully matched the human ruling, while in 20% of cases the AI uncritically echoed the lower court's decision. The study concludes that, although LLMs can identify key facts with reasonable accuracy, their high rate of formal and substantive errors precludes their autonomous use in STF rulings. The paper recommends targeted model training for institutional demands, improved prompt engineering, and mandatory review by qualified professionals to validate AI-generated outputs. These empirical findings contribute to the debate on AI governance within the judiciary, pointing toward responsible integration among humans, technology, and the law.
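For concreteness, the minimal Python sketch below (not part of the original study) back-derives the raw case counts from the percentages reported above and reproduces the prevalence figures over the fifteen analyzed AREs. The counts and labels are inferred from the abstract alone and are illustrative, not taken from the authors' published materials.

```python
# Hypothetical reconstruction: with 15 cases, the reported percentages
# imply the integer counts below (e.g., 80% -> 12/15, 73.3% -> 11/15).
TOTAL_CASES = 15

counts = {
    "hallucinations": 12,             # 12/15 = 80.0%
    "undue merits engagement": 11,    # 11/15 ≈ 73.3%
    "opening-format non-compliance": 11,  # 11/15 ≈ 73.3%
    "improper dispositive": 13,       # 13/15 ≈ 86.7%
    "full match with human ruling": 1,    # 1/15 ≈ 6.7%
    "echoed lower-court decision": 3,     # 3/15 = 20.0%
}

for label, n in counts.items():
    # Prevalence as a percentage of the fifteen analyzed AREs.
    print(f"{label}: {n}/{TOTAL_CASES} = {100 * n / TOTAL_CASES:.1f}%")
```

Running the script prints each rate to one decimal place, matching the figures reported in the abstract.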