Nutrition informatics has undergone a significant paradigm shift in recent years. Approaches historically grounded in rule-based decision support and classical task-specific machine learning pipelines are increasingly being superseded by an ecosystem centered on large language models (LLMs) and multimodal vision-language foundation models. This review synthesizes research published between 2019 and 2025, with three objectives: clarifying the architectural patterns that enable nutrition-oriented perception and reasoning, summarizing advances and identifying gaps across major application scenarios, and outlining strategic directions for reliable translational research in clinical and public health practice. Based on a systematic analysis of 92 representative studies, we organize the current landscape into three interrelated research trajectories: (1) Vision and multimodal modeling for dietary perception, focusing on food recognition, ingredient parsing, portion estimation, and nutrient prediction from meal images and videos. Recent methodologies increasingly adopt Transformer-based encoders and explicit vision-language alignment, leveraging depth cues and scale calibration to improve robustness under complex real-world conditions. (2) LLM-based nutrition agents for interactive guidance, supporting dietary counseling, meal planning, and health coaching. To mitigate challenges such as hallucinations and numerical inconsistency, current research emphasizes domain adaptation, tool-augmented computation, and retrieval-augmented generation (RAG) to ground model responses in reliable nutrition databases and clinical guidelines. (3) Personalization-oriented hybrid systems, which combine foundation models with structured components, such as knowledge graphs and causal inference frameworks, while integrating individual-level multi-omics signals, biomarkers, and lifestyle data. 
These systems aim to generate and optimize meal plans under strict constraints of safety, clinical feasibility, and patient adherence. Across these trajectories, interpretability has transitioned from an optional feature to a core system requirement, driven by the needs of clinical accountability and risk auditing. Concurrently, evaluation protocols are expanding from image-centric datasets (e.g., Nutrition5k) to comprehensive benchmarking suites designed for multimodal reasoning. Despite rapid progress, limitations persist regarding model factuality, privacy preservation, and external validity across diverse cuisines and socioeconomic settings. We advocate for evidence-grounded pipelines, standardized multimodal datasets with clinical endpoints, and unified evaluation frameworks spanning accuracy, safety, and bias. Human-in-the-loop deployment remains essential to quantify benefit-risk profiles and facilitate the regulatory adoption of AI-driven nutrition services.
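As a rough illustration of the tool-augmented computation mentioned under trajectory (2), the sketch below moves nutrient arithmetic out of the language model and into a deterministic lookup, then checks a model-stated calorie figure against the computed total. It is a minimal sketch under stated assumptions, not a method taken from any of the reviewed studies: the table `NUTRIENTS_PER_100G`, the food names `food_a` and `food_b`, their values, and the helpers `meal_totals` and `check_claim` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical per-100 g nutrient table. The values are placeholders for
# illustration, not entries from a real food-composition database.
NUTRIENTS_PER_100G = {
    "food_a": {"kcal": 100.0, "protein_g": 10.0},
    "food_b": {"kcal": 250.0, "protein_g": 4.0},
}


@dataclass(frozen=True)
class Portion:
    food: str
    grams: float


def meal_totals(portions):
    """Sum nutrients deterministically from the lookup table."""
    totals = {"kcal": 0.0, "protein_g": 0.0}
    for p in portions:
        if p.food not in NUTRIENTS_PER_100G:
            # Refuse to guess: an unknown food is surfaced, not estimated.
            raise KeyError(f"no database entry for {p.food!r}")
        per_100g = NUTRIENTS_PER_100G[p.food]
        for name in totals:
            totals[name] += per_100g[name] * p.grams / 100.0
    return totals


def check_claim(claimed_kcal, portions, rel_tol=0.05):
    """Return True if a model-stated calorie figure is within rel_tol
    (relative) of the total computed from the table."""
    computed = meal_totals(portions)["kcal"]
    return abs(claimed_kcal - computed) <= rel_tol * computed


meal = [Portion("food_a", 150), Portion("food_b", 40)]
totals = meal_totals(meal)
```

In a setup like this, the model would only propose foods and portion sizes, while the totals it reports to the user come from `meal_totals`; a figure that fails `check_claim` could be regenerated or flagged for human review.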
ZHANG et al. (Thu,) studied this question.