Existing methods dealing with object instance re-identification (OIRe-ID) look for the best visual features match of a target object within a set of frames. Due to the nature of the problem, relying only on the visual appearance of object instances is likely to provide many false matches when there are multiple objects with similar appearance or multiple instances of same object class present in the scene. We focus on a rigid scene setup and to limit the negative effects of the aforementioned cases, we propose to exploit the background information. We believe that this would be particularly helpful in a rigid environment with a lot of reoccurring identical models of objects since it would provide rich context information. We introduce an attention-based mechanism to the existing Mask R-CNN architecture such that we learn to encode the important and distinct information in the background jointly with the foreground features relevant to rigid real-world scenarios. To evaluate the proposed approach, we run compelling experiments on the ScanNet dataset. Results demonstrate that we outperform significantly compared to different baselines and SOTA methods.
Where Did i See It? Object Instance Re-Identification with Attention
Bansal V.;Foresti G. L.;Martinel N.
2021-01-01
Abstract
Existing methods dealing with object instance re-identification (OIRe-ID) look for the best visual features match of a target object within a set of frames. Due to the nature of the problem, relying only on the visual appearance of object instances is likely to provide many false matches when there are multiple objects with similar appearance or multiple instances of same object class present in the scene. We focus on a rigid scene setup and to limit the negative effects of the aforementioned cases, we propose to exploit the background information. We believe that this would be particularly helpful in a rigid environment with a lot of reoccurring identical models of objects since it would provide rich context information. We introduce an attention-based mechanism to the existing Mask R-CNN architecture such that we learn to encode the important and distinct information in the background jointly with the foreground features relevant to rigid real-world scenarios. To evaluate the proposed approach, we run compelling experiments on the ScanNet dataset. Results demonstrate that we outperform significantly compared to different baselines and SOTA methods.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.