The rapid growth of Edge Intelligence (EI) and heterogeneous user demands has led to the widespread generation of multimodal data at the network edge. Multimodal Federated Learning (MFL) provides a promising solution for collaborative, privacy-preserving model training across distributed clients. However, existing MFL frameworks often assume homogeneous environments and fail to account for disparities in client data distributions, modality characteristics, and computational resources, which limits their effectiveness in real-world edge deployments. To address these challenges, we propose Multimodal Federated Edge Learning (MFEL), a flexible framework that supports resource-adaptive deployment through variable-capacity submodels. Building upon MFEL, we introduce MFEL-H2B, a heterogeneity-aware approach that integrates three core mechanisms: (1) Prototype Networks for cross-client modality alignment, which mitigate representation divergence caused by non-IID data and heterogeneous sensing conditions; (2) Rebalanced Modality Gradient Modulation (R-MGM), which adaptively amplifies the gradients of underrepresented modalities while suppressing those of dominant ones to alleviate intra-client modality imbalance; and (3) Ensemble Momentum-based Knowledge Distillation (E-MKD), which constructs a dynamic ensemble teacher from client predictions and uses a momentum mechanism to enable efficient and robust knowledge transfer among clients with heterogeneous model capacities. Extensive experiments on heterogeneous multimodal datasets demonstrate that MFEL-H2B consistently outperforms state-of-the-art baselines in accuracy, convergence speed, and training stability, while maintaining strong generalization across diverse client architectures and resource profiles.
Jiang et al. (Sun,) studied this question.