UniMod-MLD technical report / preprint.

Comparing how difficult different data modalities are to learn is challenging because standard results are confounded by architecture choice, pretraining, dataset scale, and unequal compute. We reframe the question as a controlled estimation problem: hold the model family, optimization budget, and training scale fixed; vary only the modality interface; and compare how much compute each modality requires to reach matched performance targets.

We propose ModalityRank, a simulation-based framework for estimating relative learning difficulty across 12 modalities: text, image, video, audio, protein macromolecule, small molecule, IMU, tactile, gustatory, EEG, point cloud, and tabular data. Each modality is assigned 100K training samples, standardized to comparable token counts, and processed by the same Transformer encoder-decoder with identical hyperparameters. Models are trained from scratch, without pretrained weights, and are optionally aligned into a shared latent space through contrastive pretraining before downstream evaluation.

We define a Modality Learning Difficulty (MLD) index based on compute-to-threshold across a balanced task taxonomy spanning four axes: discriminative vs. generative, symmetric vs. asymmetric alignment, intra-modal vs. cross-modal, and low-level vs. high-level tasks. We also introduce complementary diagnostics: sample efficiency, convergence speed, alignment quality, task completion rate, robustness under shift, and an MDL-inspired information-density estimate.

The theoretical analysis is intentionally modest: we formalize MLD, discuss conditions under which token-count normalization can reduce tokenizer dependence, derive a rate-distortion-style lower bound under simplified assumptions, and relate the Gaussian special case to intrinsic dimension (Shannon, 1948; Cover and Thomas, 2006; Bishop, 2006).

OSF archival DOI: 10.17605/OSF.IO/EU5S9. OSF archival page: https://osf.io/eu5s9/.
Files include the technical report PDF and the LaTeX source tarball when available.
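
To make the compute-to-threshold idea behind the MLD index concrete, the following is a minimal sketch only. The abstract does not state the index formula, so everything here is an assumption for illustration: the function names `compute_to_threshold` and `mld_index`, the linear interpolation between checkpoints, the choice of a reference modality, the log2 compute ratio averaged over tasks, and the toy learning curves. A modality that never reaches the threshold on a task is excluded from the average and counted against a simple completion rate.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Curve = Tuple[Sequence[float], Sequence[float]]  # (compute checkpoints, scores)


def compute_to_threshold(compute: Sequence[float],
                         score: Sequence[float],
                         threshold: float) -> Optional[float]:
    """Compute at which the curve first reaches `threshold`.

    Linearly interpolates between the last checkpoint below the threshold
    and the first checkpoint at or above it. Returns None if the threshold
    is never reached.
    """
    prev_c: Optional[float] = None
    prev_s: Optional[float] = None
    for c, s in zip(compute, score):
        if s >= threshold:
            if prev_c is None:
                return float(c)  # already above threshold at first checkpoint
            frac = (threshold - prev_s) / (s - prev_s)
            return prev_c + frac * (c - prev_c)
        prev_c, prev_s = c, s
    return None


def mld_index(curves: Dict[str, Dict[str, Curve]],
              reference: str,
              threshold: float) -> Dict[str, Tuple[Optional[float], float]]:
    """Hypothetical index: mean log2 compute ratio against a reference modality.

    Returns {modality: (index or None, fraction of tasks reaching threshold)}.
    """
    out = {}
    for modality, tasks in curves.items():
        log_ratios = []
        reached = 0
        for task, (compute, score) in tasks.items():
            c_mod = compute_to_threshold(compute, score, threshold)
            if c_mod is not None:
                reached += 1
            ref_curve = curves[reference].get(task)
            c_ref = (compute_to_threshold(ref_curve[0], ref_curve[1], threshold)
                     if ref_curve is not None else None)
            if c_mod is not None and c_ref is not None:
                log_ratios.append(math.log2(c_mod / c_ref))
        index = sum(log_ratios) / len(log_ratios) if log_ratios else None
        completion = reached / len(tasks) if tasks else 0.0
        out[modality] = (index, completion)
    return out


# Toy curves (made-up numbers, not results from the report).
curves = {
    "text": {
        "classify": ([1, 2, 4, 8], [0.25, 0.5, 0.75, 1.0]),
        "generate": ([2, 6], [0.5, 1.0]),
    },
    "eeg": {
        "classify": ([4, 8, 16], [0.25, 0.5, 0.75]),
        "generate": ([8, 16], [0.25, 0.5]),  # never reaches the threshold
    },
}
result = mld_index(curves, reference="text", threshold=0.75)
```

In this toy setup an index of 0 means "as much compute as the reference modality", and each additional unit means a doubling of the compute needed to hit the same target. Reporting the completion rate alongside the index keeps unreached thresholds visible instead of silently dropping them.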
Haopeng Jin studied this question.