What question did this study set out to answer?

The study aims to measure and compare the intrinsic learning difficulty of different data modalities while controlling for confounding factors such as architecture choice, pretraining, dataset scale, and compute.

April 25, 2026 | Open Access

UniMod-MLD: Compute-Normalized Shared-Latent Transformer Training for Measuring Intrinsic Learning Difficulty Across 12 Modalities

Key Points

Set out to measure and compare the intrinsic learning difficulty of different data modalities while controlling for confounding factors.
Established a controlled estimation problem by fixing model family, optimization budget, and training scale, and varying only the modality interface.
Proposed ModalityRank, a simulation-based framework for estimating relative learning difficulty across 12 modalities, each standardized to 100K training samples and comparable token counts.
Defined a Modality Learning Difficulty (MLD) index based on compute-to-threshold across a balanced task taxonomy.
Compared the compute each modality requires to reach matched performance targets.
Introduced complementary diagnostics, including sample efficiency and convergence speed.
Mapped learning difficulty across modalities, highlighting disparities in compute efficiency.
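
The compute-to-threshold idea in the key points above can be illustrated with a minimal Python sketch. This page does not give the report's exact formula, so everything here is an assumption for illustration: the function names `compute_to_threshold` and `mld_index`, the use of the first checkpoint that meets the target, and the aggregation as a mean log2 ratio against a reference modality are all hypothetical choices, not the authors' definition.

```python
import math
from typing import Dict, Optional, Sequence


def compute_to_threshold(
    compute: Sequence[float],
    score: Sequence[float],
    threshold: float,
) -> Optional[float]:
    """Return cumulative compute at the first checkpoint whose score
    meets the threshold, or None if the target is never reached."""
    for c, s in zip(compute, score):
        if s >= threshold:
            return c
    return None


def mld_index(
    ctt_by_task: Dict[str, Optional[float]],
    reference_by_task: Dict[str, float],
) -> Optional[float]:
    """Hypothetical aggregate (an assumption, not the report's formula):
    mean log2 ratio of a modality's compute-to-threshold to that of a
    reference modality, averaged over tasks."""
    log_ratios = []
    for task, ctt in ctt_by_task.items():
        if ctt is None:
            # Target never reached on this task: leave the index undefined.
            return None
        log_ratios.append(math.log2(ctt / reference_by_task[task]))
    return sum(log_ratios) / len(log_ratios)


# Made-up learning curves: cumulative FLOPs and a task metric per checkpoint.
flops = [1e15, 2e15, 4e15]
text_ctt = compute_to_threshold(flops, [0.50, 0.72, 0.80], threshold=0.7)
eeg_ctt = compute_to_threshold(flops, [0.30, 0.50, 0.71], threshold=0.7)

# With text as the reference, EEG needs twice the compute on this toy task.
eeg_mld = mld_index({"classification": eeg_ctt}, {"classification": text_ctt})
```

Under this toy convention, an index of 0 means a modality matches the reference, and each additional unit means a doubling of the compute needed to reach the same target.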

Abstract

Comparing how difficult different data modalities are to learn is challenging because standard results are confounded by architecture choice, pretraining, dataset scale, and unequal compute. We reframe the question as a controlled estimation problem: hold model family, optimization budget, and training scale fixed; vary only the modality interface; and compare how much compute each modality requires to reach matched performance targets. We propose ModalityRank, a simulation-based framework for estimating relative learning difficulty across 12 modalities: text, image, video, audio, protein macromolecule, small molecule, IMU, tactile, gustatory, EEG, point cloud, and tabular data. Each modality is assigned 100K training samples, standardized to comparable token counts, and processed by the same Transformer encoder-decoder with identical hyperparameters. Models are trained from scratch, without pretrained weights, and are optionally aligned into a shared latent space through contrastive pretraining before downstream evaluation. We define a Modality Learning Difficulty (MLD) index based on compute-to-threshold across a balanced task taxonomy spanning four axes: discriminative vs. generative, symmetric vs. asymmetric alignment, intra-modal vs. cross-modal, and low-level vs. high-level tasks. We also introduce complementary diagnostics: sample efficiency, convergence speed, alignment quality, task completion rate, robustness under shift, and an MDL-inspired information-density estimate. Theoretical analysis is intentionally modest: we formalize MLD, discuss conditions under which token-count normalization can reduce tokenizer dependence, derive a rate-distortion-style lower bound under simplified assumptions, and relate the Gaussian special case to intrinsic dimension (Shannon, 1948; Cover and Thomas, 2006; Bishop, 2006).

UniMod-MLD technical report / preprint.
Existing OSF archival DOI: 10.17605/OSF.IO/EU5S9
Existing OSF archival page: https://osf.io/eu5s9/
Files include the technical report PDF and the LaTeX source tarball when available.
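
As background for the rate-distortion remark in the abstract, the standard textbook result for a Gaussian source under squared-error distortion (see Cover and Thomas, 2006) is sketched below. This is the classical expression, not the report's own bound, which is not reproduced on this page; the symbols $\sigma^2$, $D$, and $d$ are introduced here for illustration.

```latex
% Scalar Gaussian source X ~ N(0, \sigma^2), squared-error distortion D
R(D) = \max\!\left\{ 0,\ \tfrac{1}{2}\log\frac{\sigma^{2}}{D} \right\}

% Isotropic d-dimensional Gaussian with per-coordinate variance \sigma^2,
% total squared-error distortion D \le d\,\sigma^{2}
R_d(D) = \frac{d}{2}\,\log\frac{d\,\sigma^{2}}{D}
```

At a fixed per-coordinate distortion $D/d$, the required rate grows linearly in $d$, which is one sense in which the Gaussian case ties coding cost to dimension. How the report connects this to intrinsic dimension is not specified here.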
