Nutrition Prediction from Food Images Using Foundation Models

Emelianov V.; Martinel N.
2025-01-01

Abstract

We consider the problem of nutrition prediction from food images: given an RGB image of a dish composed of several food categories, the goal is to predict its mass, its caloric content, and the masses of its macronutrients (fats, carbohydrates, and proteins). In this work, we first show that nutrition prediction can be treated as a special case of a more general problem: predicting the volume of each food category in the dish. If these volumes are available (in practice, estimated), the nutrition values can be obtained from them through a linear mapping. Building on this result, we propose a framework for nutrition prediction based on food volume estimation. It is a modular pipeline composed of (i) a food category recognition module, (ii) a semantic segmentation module, and (iii) a depth estimation module, all of which correspond to common computer vision tasks. Finally, we study the zero-shot performance of vision-language foundation models applied to these tasks. The code is provided at www.github.com/vitaly-emelianov/nutrition-fm.
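
To make the volume-to-nutrition step concrete, the following is a minimal sketch of how per-category volumes could be mapped linearly to nutrition values. It is an illustration under stated assumptions, not the authors' implementation: the function names, the category list, and every density and per-gram coefficient are hypothetical placeholders, and the volume computation assumes a top-down view in which each pixel covers a constant area and a per-pixel food height has already been derived from the depth map.

```python
import numpy as np

# Hypothetical placeholder coefficients (NOT taken from the paper):
# density in g/cm^3, then per-gram kcal, fat (g), carbohydrates (g), protein (g).
CATEGORIES = ["rice", "chicken"]
DENSITY = np.array([0.8, 1.0])
PER_GRAM = np.array([
    [1.3, 0.003, 0.28, 0.027],
    [1.6, 0.036, 0.00, 0.310],
])


def nutrition_matrix(density, per_gram):
    """Build a (categories x 5) matrix of per-cm^3 values.

    Columns: mass (g), calories (kcal), fat (g), carbohydrates (g), protein (g).
    """
    mass_col = density[:, None]
    return np.hstack([mass_col, mass_col * per_gram])


def category_volumes(labels, height_cm, pixel_area_cm2, num_categories):
    """Estimate each category's volume as sum(height) * pixel area.

    labels: integer segmentation map, -1 for background (assumed convention).
    height_cm: per-pixel food height, assumed to be derived from a depth map.
    """
    volumes = np.zeros(num_categories)
    for k in range(num_categories):
        volumes[k] = height_cm[labels == k].sum() * pixel_area_cm2
    return volumes


def predict_nutrition(volumes, matrix):
    """Linear mapping from category volumes to the five nutrition values."""
    return volumes @ matrix


# Toy 2x2 example: two rice pixels, one chicken pixel, one background pixel.
labels = np.array([[0, 0], [1, -1]])
height_cm = np.array([[1.0, 2.0], [3.0, 4.0]])
volumes = category_volumes(labels, height_cm, 0.5, len(CATEGORIES))
nutrition = predict_nutrition(volumes, nutrition_matrix(DENSITY, PER_GRAM))
```

Because the mapping is linear, doubling every category volume doubles every predicted nutrition value, so any error in the volume estimates carries over proportionally to the nutrition estimates.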

Use this identifier to cite or link to this document: https://hdl.handle.net/11390/1322035