We test a simple question: if you give a small open-weights model access to a curated, schema-validated corpus of scientific "cards" (formulas, validation envelopes, conservation claims, declared limits), does it write better scientific code? We built a benchmark of 73 prompts across seven scientific fields (50 textbook prompts and 23 harder research-style prompts). For each prompt we ran the same model two ways: control (no tools) and treatment (the model can call into "Lemma", our verification substrate, to look up the relevant card, check dimensions, and cross-check claims). We ran each candidate against numeric test cases and against the verification engine, combined the results into a single score, and compared the two arms with a paired t-test. On Llama 3.1 8B, the treatment arm lowered the mean score (delta = -0.220, p < 0.001) and used about 6.8x more tokens per prompt. The harder prompts regressed more, not less, so the natural story, "the substrate helps when the model lacks the formula", did not hold up. A mini-experiment with a self-router showed that even a perfect routing policy would beat plain control by only +0.031 on this prompt set: routing converts the loss into a wash, but does not unlock latent value. We then tried a different way to use the same substrate. We sampled N = 5 candidates from the model at temperature 0.7 with no tools, scored each candidate with the verification engine after the fact, and returned the best one. This version works: the mean score rises from 0.630 to 0.685, the rerank agrees with the test-based oracle on 97.3% of prompts, and the result replicates with the same numbers on Mistral Nemo 12B, a different model from a different vendor. So the substrate is useful at sampling time even where it hurts at retrieval time, and the effect is not specific to one model. The takeaway for product design: "Lemma" should be a verifier-and-reranker, not a tool the model calls during generation.
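The sample-then-rerank procedure described above can be sketched roughly as follows. This is a minimal illustration, not the actual verification engine: `POOL`, `sample_candidates`, `verifier_score`, and `rerank` are hypothetical names, the candidate pool stands in for model samples at temperature 0.7, and the toy scorer (scaling checks plus one reference value for kinetic energy) stands in for the post-hoc verification score.

```python
import math
import random
from typing import Callable, List, Tuple

# A candidate is a label plus a function of (mass, speed).
Candidate = Tuple[str, Callable[[float, float], float]]

# Hypothetical pool standing in for code sampled from the model (no tools).
POOL: List[Candidate] = [
    ("0.5*m*v**2", lambda m, v: 0.5 * m * v**2),  # correct kinetic energy
    ("m*v**2", lambda m, v: m * v**2),            # wrong prefactor
    ("0.5*m*v", lambda m, v: 0.5 * m * v),        # wrong power of v
]


def sample_candidates(rng: random.Random, n: int = 5) -> List[Candidate]:
    """Draw n candidates; a stand-in for n independent model samples."""
    return [rng.choice(POOL) for _ in range(n)]


def verifier_score(f: Callable[[float, float], float]) -> float:
    """Toy post-hoc verifier (assumption): fraction of checks passed."""
    checks = [
        math.isclose(f(2.0, 3.0), 2 * f(1.0, 3.0)),  # linear in mass
        math.isclose(f(1.0, 6.0), 4 * f(1.0, 3.0)),  # quadratic in speed
        math.isclose(f(2.0, 3.0), 9.0),              # one reference value
    ]
    return sum(checks) / len(checks)


def rerank(candidates: List[Candidate],
           score: Callable[[Callable[[float, float], float]], float]
           ) -> Tuple[str, float]:
    """Score every candidate after generation and return the best one."""
    scored = [(score(fn), label) for label, fn in candidates]
    best_score, best_label = max(scored, key=lambda t: t[0])
    return best_label, best_score


if __name__ == "__main__":
    rng = random.Random(0)
    samples = sample_candidates(rng, n=5)
    label, best = rerank(samples, verifier_score)
    print("selected:", label, "score:", best)
```

The design point is that the model never sees the verifier during generation: the scorer is applied only to finished candidates, so it adds no tool-call tokens to the generation itself.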
The benchmark prompts and all five landmark files are released openly (github.com/artano-ai/humaneval-sci, archived at Zenodo: 10.5281/zenodo.20414774); the verification engine ships separately with the substrate's forthcoming platform paper.
Arsalan Akhtar (Wed,) studied this question.