Existing multimodal models (CLIP, ImageBind) align physical and linguistic representations in a single shared space, losing the structure of each modality in the process. We propose an alternative: keep the physical and linguistic planes separate, combining them through a reversible addition operation. The central claim is that if physics + language = combined, then the physical plane can be recovered from the combined embedding without any linguistic context, purely through subtraction. Experiments on synthetic data confirm the viability of this architecture: the physics recovery error was 0.0109, demonstrating zero-shot generalization through meaning rather than through language tokens.
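The additive identity behind the central claim can be illustrated with a minimal NumPy sketch. It is not the author's implementation. The abstract does not say how the subtracted term is obtained, so the sketch assumes a language-plane vector (or an estimate of it) is available as an embedding. The function names, the dimension of 64, the noise scale, and the mean-absolute-error metric are all hypothetical choices made for illustration, and the sketch makes no attempt to reproduce the reported error of 0.0109.

```python
import numpy as np


def combine(physics, language):
    """Reversible combination: element-wise addition of the two planes."""
    return physics + language


def recover_physics(combined, language_estimate):
    """Invert the addition by subtracting a language-plane vector."""
    return combined - language_estimate


def recovery_error(physics, recovered):
    """Mean absolute error (an assumed metric, not necessarily the paper's)."""
    return float(np.mean(np.abs(physics - recovered)))


rng = np.random.default_rng(0)
dim = 64  # hypothetical embedding size

physics = rng.normal(size=dim)   # synthetic physical-plane embedding
language = rng.normal(size=dim)  # synthetic linguistic-plane embedding
combined = combine(physics, language)

# Case 1: the exact language vector is known, so recovery is exact
# up to floating-point rounding.
exact = recover_physics(combined, language)
exact_error = recovery_error(physics, exact)

# Case 2: only a noisy estimate of the language vector is available,
# so the recovery error equals the error of that estimate.
noisy_language = language + rng.normal(scale=0.01, size=dim)
approx = recover_physics(combined, noisy_language)
approx_error = recovery_error(physics, approx)

print(f"exact recovery error:  {exact_error:.2e}")
print(f"approx recovery error: {approx_error:.2e}")
```

Under these assumptions, the recovery error is determined entirely by how well the subtracted term matches the language component that was added, which is why the quality of that estimate matters more than the addition itself.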
Artem Gorbunov
www.synapsesocial.com/papers/69c620d515a0a509bde19829 — DOI: https://doi.org/10.5281/zenodo.19218011