Machine-learning potentials (MLPs) promise near-first-principles fidelity at scales relevant to heterogeneous catalysis, yet the key determinant of their reliability remains the quality of the training data. A paramount challenge lies in constructing training sets that capture both the system-specific structural domains and the broader chemical variability arising from the complex, dynamic nature of heterogeneous catalytic systems. In this perspective, we argue that the key question is no longer simply how to generate more data, but how to generate the right data. We reinterpret the main strategies used to construct MLP training sets for heterogeneous catalysis through the lens of scope, relevance, and coverage. We then examine the challenges of data generation in catalytic systems and how they have shaped prevailing structure-based sampling practices, which encode system relevance and targeted coverage directly into the data set. We also discuss emerging interaction-based sampling strategies that aim to broaden local-interaction support beyond narrowly predefined systems. We conclude by consolidating a possible hybrid data generation workflow that combines the strengths of both approaches, thereby bringing MLPs closer to the simulation of complex heterogeneous catalytic systems.
Xie et al. studied this question.