This paper presents an empirical investigation into the ability of selected Large Language Models (LLMs) to understand and apply symbolic notation from the mathematical sciences and logical theories, such as Natural Deduction. The study evaluates 43 LLMs from diverse vendors and architectures, testing their capacity to ingest symbolic notation and to output results in that notation rather than in prose. Each model is presented with a prompt containing a formal type and inference system, contextual rules, and questions requiring symbolic reasoning. A Python-based testbed standardizes the evaluation, measuring both the correctness of responses and their adherence to symbolic output constraints. Results reveal significant variability across models: frontier models consistently outperform others in both correctness and symbolic output, while smaller or open-source models exhibit mixed results. This underscores the need for careful model selection in applications requiring strict formalism and minimal prose output.
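The two measurements named in the abstract, correctness and adherence to symbolic output, can be illustrated with a minimal sketch. This is not the paper's testbed: the function names, the prose heuristic (any token containing a run of four or more ASCII letters counts as prose), the substring-based correctness check, and the 0.2 threshold are all assumptions chosen for illustration.

```python
import re

# Assumed heuristic: a token containing 4+ consecutive ASCII letters is "prose".
PROSE_WORD = re.compile(r"[A-Za-z]{4,}")


def prose_ratio(response: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like prose."""
    tokens = response.split()
    if not tokens:
        return 0.0
    prose = sum(1 for token in tokens if PROSE_WORD.search(token))
    return prose / len(tokens)


def normalize(formula: str) -> str:
    """Remove all whitespace so spacing differences do not affect comparison."""
    return "".join(formula.split())


def score(response: str, expected: str, max_prose: float = 0.2) -> dict:
    """Score one response on two independent axes.

    correct:  the expected formula appears in the response (crude substring
              match after whitespace normalization; an assumption).
    symbolic: the share of prose-like tokens stays at or below max_prose.
    """
    return {
        "correct": normalize(expected) in normalize(response),
        "symbolic": prose_ratio(response) <= max_prose,
    }


if __name__ == "__main__":
    print(score("A ∧ B ⊢ A", "A∧B⊢A"))
    print(score("The answer is clearly that A follows from the conjunction", "A∧B⊢A"))
```

Keeping the two scores separate reflects the abstract's distinction: a model can give the right answer wrapped in prose, or produce purely symbolic output that is wrong. A real evaluation would likely need a proper parser for the notation rather than a substring match.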
Andreas Schmidt
Andreas Schmidt (Tue,) studied this question.
www.synapsesocial.com/papers/69aa70c8531e4c4a9ff5af19 — DOI: https://doi.org/10.5281/zenodo.18867317