Evaluating LLMs Capabilities at Natural Language to Logic Translation: A Preliminary Investigation
Brunello A.; Ferrarese R.; Geatti L.; Marzano E.; Montanari A.; Saccomanno N.
2025-01-01
Abstract
Translating natural language (NL) into logical formalisms like First-Order Logic (FOL) has long been a challenge across multiple disciplines, including mathematics, computer science, and education. Traditional computational linguistics methods have struggled with this task due to the complexity and ambiguity of natural language. However, advancements in Natural Language Processing (NLP), particularly the introduction of Large Language Models (LLMs), have opened up new possibilities for tackling this challenge. Despite their potential, a systematic approach to evaluating the performance of LLMs in NL-to-FOL translation is still lacking. In this study, we take a first step towards filling this gap. We examine a large dataset based on students’ efforts in formalizing natural language statements from the book “Language, Proof, and Logic”. Based on this dataset, we propose a preliminary evaluation pipeline to assess LLM performance in NL-to-FOL translation tasks, considering both syntactic and semantic aspects. We then apply this pipeline to evaluate two recent LLMs, Meta’s Llama 3.1 (8B) and Google DeepMind’s Gemma 2 (9B). Our findings validate the proposed approach, revealing key similarities and differences between LLM-generated and student-produced formulas, and provide valuable insights into the current capabilities of LLMs in this domain.

File | Access | Type | License | Size | Format
---|---|---|---|---|---
paper13.pdf | open access | Editorial Version (PDF) | Creative Commons | 980.44 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.