Abstract

Background: Food composition databases are fundamental for rigorous dietary assessment, yet they often include information only for generic foods.

Objective: This study aimed to estimate the full nutrient composition of packaged foods using natural language processing (NLP) and optimization modeling.

Methods: Nutrition Facts tables (NFTs) and ingredient lists for 5,371 packaged foods collected by the Food Quality Observatory across 17 food categories available in Québec, Canada, were used. First, an NLP algorithm matched individual ingredients from packaged foods to the closest equivalents in the Canadian Nutrient File (CNF) 2015, which contains full nutrient profiles for over 5,690 ingredients and foods in Canada. Match quality was assessed using cosine similarity scores. Second, an optimization model estimated the proportion of all ingredients (g/100 g) from the packaged foods, enabling the reverse-engineering of nutrient composition data found on the NFT. Model performance was assessed using relative errors comparing estimated versus known nutrient values reported on NFTs.

Results: Over 55% of ingredients were matched to the CNF with cosine similarity scores ≥ 0.9, indicating high-quality matches. Across all food categories combined, the median relative error for the estimates of energy and the 10 nutrients reported on NFT was [...]

Conclusions: A method based on NLP and optimization modeling can reliably estimate ingredient proportions of a wide variety of packaged foods, allowing for the generation of complete nutrient profiles.
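The ingredient-matching step can be illustrated with a minimal sketch. The abstract states only that matches were scored by cosine similarity; the text representation used here (character trigram counts), the function names, and the reference entries are assumptions made for illustration and are not the authors' implementation or actual CNF records.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, space-padded string.

    Character trigrams are an assumption; the study's text representation
    is not described in the abstract.
    """
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(ingredient, reference_names):
    """Return the reference name with the highest similarity, and its score."""
    query = char_ngrams(ingredient)
    scored = [
        (cosine_similarity(query, char_ngrams(name)), name)
        for name in reference_names
    ]
    score, name = max(scored)
    return name, score


# Hypothetical reference entries, for illustration only.
reference = ["wheat flour, white", "sugar, granulated", "salt, table"]
matched_name, matched_score = best_match("Wheat flour", reference)
print(matched_name, round(matched_score, 3))
```

A threshold such as the 0.9 cut-off mentioned in the Results could then be applied to `matched_score` to flag high-quality matches.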
Côté et al. (Mon,) studied this question.
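The second step, estimating ingredient proportions so that the implied nutrient values reproduce the NFT, can be sketched on a toy product. Everything specific here is an assumption: the abstract does not state the objective function, the constraints, or the solver. This sketch minimizes the sum of relative errors by exhaustive search over whole-gram proportions, and assumes ingredients are listed in descending order by weight. The nutrient values are made up for illustration.

```python
def descending_splits(n_parts, total, cap):
    """Yield non-increasing integer tuples of length n_parts that sum to total."""
    if n_parts == 1:
        if 0 <= total <= cap:
            yield (total,)
        return
    for first in range(min(total, cap), -1, -1):
        for rest in descending_splits(n_parts - 1, total - first, first):
            yield (first,) + rest


def estimate_nutrients(proportions, profiles):
    """Nutrient content per 100 g of product from g/100 g ingredient proportions."""
    nutrients = profiles[0].keys()
    return {
        k: sum(p * prof[k] / 100 for p, prof in zip(proportions, profiles))
        for k in nutrients
    }


def total_relative_error(estimated, target):
    """Sum of |estimated - target| / target over nutrients with a non-zero target."""
    return sum(
        abs(estimated[k] - target[k]) / target[k]
        for k in target
        if target[k] != 0
    )


def fit_proportions(profiles, target):
    """Exhaustive search (a stand-in for a real solver) for the best proportions."""
    return min(
        descending_splits(len(profiles), 100, 100),
        key=lambda p: total_relative_error(estimate_nutrients(p, profiles), target),
    )


# Made-up nutrient profiles per 100 g, in ingredient-list order.
profiles = [
    {"energy_kcal": 360.0, "protein_g": 10.0, "sugars_g": 0.0},   # flour
    {"energy_kcal": 400.0, "protein_g": 0.0, "sugars_g": 100.0},  # sugar
    {"energy_kcal": 900.0, "protein_g": 0.0, "sugars_g": 0.0},    # oil
]
# Made-up label values per 100 g of product.
nft = {"energy_kcal": 426.0, "protein_g": 6.0, "sugars_g": 30.0}

best = fit_proportions(profiles, nft)
error = total_relative_error(estimate_nutrients(best, profiles), nft)
print(best, error)
```

Exhaustive search is only practical for a handful of ingredients; a product with a long ingredient list would call for a proper constrained optimizer.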