May 29, 2026 · Open Access

Verification grounding for open-weights scientific code generation: a pilot study.

Key Points

Asked whether a small open-weights model writes better scientific code when it can consult a curated verification substrate.
Built a benchmark of 73 prompts across seven scientific fields (50 textbook prompts and 23 harder research-style prompts) and ran each prompt under a control condition (no tools) and a treatment condition (tool access to the substrate).
Compared the scores of the two arms with a paired t-test and tracked token usage per prompt.
Tested a second approach: sample N = 5 candidates at temperature 0.7 with no tools, score each with the verification engine after the fact, and return the best one.
On Llama 3.1 8B, the treatment arm lowered the mean score (delta = -0.220, p < 0.001) and used about 6.8x more tokens per prompt.
The sampling approach raised the mean score from 0.630 to 0.685, and its rerank agreed with the test-based oracle on 97.3% of prompts.
The findings suggest that the verification substrate works better as a post-generation verifier and reranker than as a tool the model calls during generation.
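
The paired comparison named in the key points can be sketched as follows. This is a minimal illustration that assumes one combined score per prompt in each arm. The function name `paired_t` and the toy scores are hypothetical and are not taken from the study's code or data.

```python
import math
import statistics


def paired_t(control, treatment):
    """Return (mean_delta, t_statistic, degrees_of_freedom) for paired scores.

    Each position in `control` and `treatment` must refer to the same prompt.
    The delta is treatment minus control, so a negative value means the
    treatment arm scored lower.
    """
    if len(control) != len(treatment):
        raise ValueError("both arms need exactly one score per prompt")
    deltas = [t - c for c, t in zip(control, treatment)]
    n = len(deltas)
    if n < 2:
        raise ValueError("need at least two prompts")
    mean_delta = statistics.fmean(deltas)
    sd = statistics.stdev(deltas)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all deltas are identical; t is undefined")
    t_stat = mean_delta / (sd / math.sqrt(n))
    return mean_delta, t_stat, n - 1


# Toy scores for four hypothetical prompts (not the study's data).
control_scores = [0.8, 0.6, 0.9, 0.7]
treatment_scores = [0.5, 0.5, 0.6, 0.6]
mean_delta, t_stat, dof = paired_t(control_scores, treatment_scores)
print(f"mean delta = {mean_delta:.3f}, t = {t_stat:.3f}, dof = {dof}")
```

The sketch stops at the t statistic. A p-value additionally requires the Student t distribution, which `scipy.stats.ttest_rel` provides if SciPy is available.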

Abstract

We test a simple question: if you give a small open-weights model access to a curated, schema-validated corpus of scientific "cards" (formulas, validation envelopes, conservation claims, declared limits), does it write better scientific code? We built a benchmark of 73 prompts across seven scientific fields (50 textbook prompts and 23 harder research-style prompts). For each prompt we ran the same model two ways: control (no tools) and treatment (the model can call into "Lemma", our verification substrate, to look up the relevant card, check dimensions, and cross-check claims). We ran each candidate against numeric test cases and against the verification engine, combined the results into a single score, and compared the two arms with a paired t-test. On Llama 3.1 8B, the treatment arm lowered the mean score (delta = -0.220, p < 0.001). It used about 6.8x more tokens per prompt. The harder prompts regressed more, not less, so the natural story "the substrate helps when the model lacks the formula" did not hold up. A mini-experiment with a self-router showed that even a perfect routing policy would beat plain control by only +0.031 on this prompt set: routing converts the loss into a wash, but does not unlock latent value. We then tried a different way to use the same substrate. We sampled N = 5 candidates from the model at temperature 0.7 with no tools, scored each candidate with the verification engine after the fact, and returned the best one. This version works: mean score goes from 0.630 to 0.685, the rerank agrees with the test-based oracle on 97.3% of prompts, and the result repeats at the same numbers on Mistral Nemo 12B, a different model from a different vendor. So the substrate is useful at sampling time even where it hurts at retrieval time, and the effect is not specific to one model. The takeaway for product design: "Lemma" should be a verifier-and-reranker, not a tool the model calls during generation.
The benchmark prompts and all five landmark files are released openly (github.com/artano-ai/humaneval-sci, archived at Zenodo: 10.5281/zenodo.20414774); the verification engine ships separately with the substrate's forthcoming platform paper.
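
The sample-then-rerank procedure described in the abstract can be sketched as below. This is an illustrative sketch only: the verification engine is not part of the open release, so `toy_verifier_score` is a hypothetical stand-in, and the candidates are hand-written strings, not model samples. The `agreement_rate` helper assumes one plausible reading of "agrees with the test-based oracle" (the reranker and the oracle pick the same candidate for a prompt); the abstract does not define the metric.

```python
from typing import Callable, Sequence


def rerank_best_of_n(candidates: Sequence[str],
                     score: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score.

    On ties, max() keeps the earliest candidate in the sequence.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def agreement_rate(reranker_picks: Sequence[int],
                   oracle_picks: Sequence[int]) -> float:
    """Fraction of prompts where reranker and oracle chose the same index.

    Assumed definition, for illustration only.
    """
    if len(reranker_picks) != len(oracle_picks) or not reranker_picks:
        raise ValueError("need one pick per prompt from each selector")
    matches = sum(r == o for r, o in zip(reranker_picks, oracle_picks))
    return matches / len(reranker_picks)


def toy_verifier_score(code: str) -> float:
    """Hypothetical stand-in for the verification engine.

    Scores a candidate by the fraction of simple textual checks it passes.
    A real verifier would check dimensions, limits, and conservation claims.
    """
    checks = ("def ", "return", "0.5 * m * v**2")
    return sum(check in code for check in checks) / len(checks)


# Hand-written candidates for a kinetic-energy prompt (not model output).
candidates = [
    "def ke(m, v): return m * v**2",
    "def ke(m, v): return 0.5 * m * v**2",
    "ke = m * v",
]
best = rerank_best_of_n(candidates, toy_verifier_score)
print(best)
```

In this arrangement the model never calls the substrate while generating; the verifier only orders finished candidates, which matches the verifier-and-reranker role the abstract recommends.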
