Nutrition Prediction from Food Images Using Foundation Models

Emelianov, V.; Martinel, N.

doi:10.1109/ICME59968.2025.11210000

We consider the problem of nutrition prediction from food images. That is, given an RGB image of a dish composed of several food categories, the goal is to predict its mass, caloric intake, and macronutrient masses such as fats, carbohydrates, and proteins. In this work, we first show that nutrition prediction can be treated as a special case of a more general problem: predicting the volume of each food category. If these volumes are available (i.e., estimated), the nutrition values can be obtained from them through a linear mapping. Building on this result, we propose a framework for nutrition prediction based on food volume estimation. It consists of a modular pipeline composed of (i) a food category recognition module, (ii) a semantic segmentation module, and (iii) a depth estimation module, all of which address common computer vision tasks. Finally, we study the zero-shot performance of vision-language foundation models applied to these tasks. The code is provided at www.github.com/vitaly-emelianov/nutrition-fm.
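The linear mapping from food category volumes to nutrition values can be illustrated with a minimal sketch. It is not taken from the paper or its repository: the category names, the densities, the per-gram nutrient coefficients, and the function name `predict_nutrition` are all hypothetical placeholders chosen for illustration. The sketch assumes that each category has a fixed density and a fixed nutrient content per gram, so that every nutrition value is a weighted sum of the category volumes.

```python
import numpy as np

# Hypothetical food categories (illustrative only, not from the paper).
CATEGORIES = ["rice", "chicken", "broccoli"]

# Assumed density of each category in grams per cubic centimetre (placeholder values).
density = np.array([0.9, 1.0, 0.4])

# Assumed nutrient content per gram of each category (placeholder values).
# Columns: calories (kcal), fat (g), carbohydrates (g), protein (g).
per_gram = np.array([
    [1.30, 0.003, 0.28, 0.027],  # rice
    [1.65, 0.036, 0.00, 0.310],  # chicken
    [0.35, 0.004, 0.07, 0.024],  # broccoli
])

# Build the linear map W with one row per category and one column per target:
# [mass (g), calories (kcal), fat (g), carbohydrates (g), protein (g)].
# Mass per cm^3 is the density; each nutrient per cm^3 is density * per-gram content.
W = np.column_stack([density, density[:, None] * per_gram])


def predict_nutrition(volumes_cm3: np.ndarray) -> np.ndarray:
    """Map per-category volumes (cm^3) to the five nutrition targets."""
    return volumes_cm3 @ W


if __name__ == "__main__":
    # Example volumes, as they might come out of an upstream volume estimator.
    volumes = np.array([100.0, 50.0, 200.0])
    targets = ["mass_g", "kcal", "fat_g", "carbs_g", "protein_g"]
    for name, value in zip(targets, predict_nutrition(volumes)):
        print(f"{name}: {value:.1f}")
```

Under these assumptions the whole prediction reduces to one matrix product, so any error in the nutrition values comes either from the estimated volumes or from the chosen coefficients.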