Evaluating LLMs Capabilities at Natural Language to Logic Translation: A Preliminary Investigation

Brunello A.; Ferrarese R.; Geatti L.; Marzano E.; Montanari A.; Saccomanno N.
2025-01-01

Abstract

Translating natural language (NL) into logical formalisms like First-Order Logic (FOL) has long been a challenge across multiple disciplines, including mathematics, computer science, and education. Traditional computational linguistics methods have struggled with this task due to the complexity and ambiguity of natural language. However, advancements in Natural Language Processing (NLP), particularly the introduction of Large Language Models (LLMs), have opened up new possibilities for tackling this challenge. Despite their potential, a systematic approach to evaluating the performance of LLMs in NL-to-FOL translation is still lacking. In this study, we take a first step towards filling in this gap. We examine a large dataset based on students’ efforts in formalizing natural language statements from the book “Language, Proof, and Logic”. Based on this dataset, we propose a preliminary evaluation pipeline to assess LLM performance in NL-to-FOL translation tasks, considering both syntactic and semantic aspects. We then apply this pipeline to evaluate two recent LLMs, Meta’s Llama 3.1 (8B) and Google DeepMind’s Gemma 2 (9B). Our findings validate the proposed approach, revealing key similarities and differences between LLM-generated and student-produced formulas, and provide valuable insights into the current capabilities of LLMs in this domain.
File in this record:

File: paper13.pdf (open access)
Type: Publisher's version (PDF)
License: Creative Commons
Size: 980.44 kB
Format: Adobe PDF
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11390/1300826
Citations
  • Scopus: 0