Spatio-Temporal Attention for Cloth-Changing ReID in Videos
Bansal V., Micheloni C., Foresti G., Martinel N.
2023-01-01
Abstract
In the recent past, the focus of the person re-identification (ReID) research community has gradually shifted towards video-based ReID, where the goal is to identify and associate specific person identities across videos captured by different cameras at different times. A key challenge is to effectively model spatial and temporal information for a robust and discriminative video feature representation. Another challenge arises from the assumption that the clothing of the target persons remains consistent over long periods of time; as a result, most existing methods rely on clothing appearance for re-identification, which leads to errors in practical scenarios where clothing consistency does not hold. A further challenge stems from the limitations of existing methods that largely employ CNN-based networks: CNNs can only exploit local dependencies and lose significant information through their downsampling operations. To overcome these challenges, we propose a Vision-Transformer-based framework that exploits space-time self-attention to address long-term cloth-changing ReID in videos (CCVID-ReID). For a more discriminative representation, we believe that soft-biometric information such as gait features can be paired with the video features from the transformer-based framework. To obtain such rich dynamic information, we use an existing state-of-the-art model for 3D motion estimation, VIBE. To provide compelling evidence in favour of our approach of utilizing spatio-temporal information for CCVID-ReID, we evaluate our method on a variant of the recently published long-term cloth-changing ReID dataset, PRCC. The experiments demonstrate that the proposed approach achieves state-of-the-art results, which, we believe, will invite further focus in this direction.
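
The abstract describes two ingredients: divided space-time self-attention over frame tokens and the fusion of the resulting video feature with a gait descriptor. The following is a minimal sketch of that idea, not the authors' implementation: the module names, dimensions, and the placeholder gait_embedding tensor are illustrative assumptions, and the VIBE-based gait branch is not reproduced here.

# Minimal PyTorch sketch (assumptions: TimeSformer-style divided attention,
# a precomputed gait embedding standing in for VIBE-derived features).
import torch
import torch.nn as nn


class SpaceTimeBlock(nn.Module):
    """One divided space-time self-attention block over frame patch tokens."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, T, N, D) -- batch, frames, patch tokens per frame, embedding dim
        B, T, N, D = x.shape

        # Temporal attention: each patch position attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt = xt + self.temporal_attn(self.norm1(xt), self.norm1(xt), self.norm1(xt))[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # Spatial attention: each frame attends over its own patches.
        xs = x.reshape(B * T, N, D)
        xs = xs + self.spatial_attn(self.norm2(xs), self.norm2(xs), self.norm2(xs))[0]
        return xs.reshape(B, T, N, D)


def fuse_with_gait(video_tokens, gait_embedding, proj):
    """Pool space-time tokens and concatenate a soft-biometric gait descriptor."""
    appearance = video_tokens.mean(dim=(1, 2))        # (B, D) pooled video feature
    return proj(torch.cat([appearance, gait_embedding], dim=-1))


if __name__ == "__main__":
    B, T, N, D, G = 2, 8, 196, 768, 64                # G: assumed gait feature size
    block = SpaceTimeBlock(dim=D)
    tokens = block(torch.randn(B, T, N, D))
    gait = torch.randn(B, G)                          # placeholder for VIBE-derived gait features
    proj = nn.Linear(D + G, 512)                      # joint ReID embedding head
    print(fuse_with_gait(tokens, gait, proj).shape)   # torch.Size([2, 512])

The divided (temporal-then-spatial) attention keeps the cost linear in T + N rather than T x N, which is one common design choice for video transformers; the concatenation-plus-projection fusion is likewise only one of several plausible ways to combine appearance and gait cues.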