Single-Shot Person Re-Identification based on Unsupervised Saliency Information

León Guerra, Reynolds; García Reyes, Edel B.

Revista Cubana de Ciencias Informáticas

On-line version ISSN 2227-1899

Rev cuba cienc informat vol.14 no.2 Havana Apr.-Jun. 2020 Epub 01-Jun-2020

Original article

Single-Shot Person Re-Identification based on Unsupervised Saliency Information

Reynolds León Guerra¹^* (ORCID 0000-0003-3433-1740), Edel B. García Reyes¹

^¹ Advanced Technologies Application Center (CENATAV), Havana, Cuba.

ABSTRACT

Person re-identification is an important task in video surveillance for improving security in public places. In recent years there has been a great deal of research on this topic. However, the performance of these algorithms is affected by different problems in the scenes, for example, complex backgrounds and atmospheric conditions. Methods such as deep learning and saliency descriptors have been used to address these problems in the real world. In this paper, we develop a method that combines a convolutional neural network without fine-tuning and a saliency descriptor to weight all the information present in a person image. Feature maps are extracted from the last convolutional layer of a neural network and merged with another saliency map obtained in the spatial domain. Finally, different features are generated based on color histograms and local binary patterns. To verify the effectiveness of our proposal, the method is validated on the VIPeR dataset and compared with other state-of-the-art algorithms. The results show that our proposal is easy to implement and is comparable with other approaches in terms of the Cumulative Matching Characteristic curve.

Key words: Deep learning; Person re-identification; Saliency; Maps; Features

SUMMARY

Person re-identification is an important task in video surveillance for improving security in public areas. In recent years, research on this topic has grown considerably. However, the performance of these algorithms is affected by different problems present in the scenes, for example, complex backgrounds, atmospheric conditions and others. Methods such as deep learning and saliency descriptors have been used to counter these problems in the real world. In this article, a method is developed that combines convolutional neural networks without fine-tuning and a saliency descriptor to weight all the information present in the person image. The feature maps are extracted from the last convolutional layer of a neural network and combined with another saliency map obtained in the spatial domain. Finally, different features are generated based on a color histogram and local binary patterns. To verify the performance of the proposed method, it is validated on the VIPeR dataset and compared with other state-of-the-art algorithms. The results show that the proposed method is easy to implement and is comparable with other methods using the cumulative matching curve.

Keywords: Deep learning; Person re-identification; Saliency; Maps; Features

INTRODUCTION

Person re-identification (Re-id) aims to identify a person across a camera network with non-overlapping viewpoints and at different moments. Re-id is categorized as single-shot (only one image of a person per camera is used) or multiple-shot (multiple images of a person per camera are used). In the real world, a great variety of smart video surveillance centers use Re-id algorithms to increase security in train stations, markets, public places, etc. However, the conditions in these scenes affect the performance of computer vision algorithms. Generally speaking, Re-id is a challenging task. The usual problems are illumination changes (see Figure 1), pose, occlusion and low resolution (Kansal et al., 2019).

To address these issues, research in recent years has focused on three aspects. First, finding hand-crafted features that are robust to illumination or pose changes (Liao et al., 2015). Second, learning a non-Euclidean metric, such as a Mahalanobis distance, under which the inter-class difference increases and the intra-class difference decreases (Jia et al., 2017). Third, learning the features automatically with deep learning based on convolutional neural networks (Wang et al., 2017).
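For reference, a Mahalanobis-type distance of the kind mentioned above is usually written as below. The symbols x_i, x_j (feature vectors of two person images) and M (the learned matrix) are notation introduced here for illustration, not taken from the cited works.

```latex
d_M^2(x_i, x_j) = (x_i - x_j)^{\top} M \, (x_i - x_j), \qquad M \succeq 0
```

With M equal to the identity matrix this reduces to the squared Euclidean distance; metric learning methods estimate M from labelled pairs of images.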

Different saliency detection methods have recently been applied in Re-id algorithms. Saliency detection aims to detect relevant information and reject redundant information (Álvarez-Miranda et al., 2018). Zhao et al. (2013) use two independent methods to obtain salient regions in a person image, based on KNN (K-Nearest Neighbor) and one-class SVM (Support Vector Machine). Different soft features are used, such as color histograms and SIFT (Scale-Invariant Feature Transform). Martinel et al. (2014) proposed learning a distance function from a subset of multiple learned metrics over the features; a saliency map is obtained based on graph theory and a color histogram is weighted. Huo et al. (2015) proposed a weight for the saliency direction trained with an SVM. Li et al. (2018a) extract optimal regions with high similarity from the person image; the saliency maps are then built within these regions.

On the other hand, combining salient information obtained by traditional methods with deep learning methods gives good Re-id performance. Li et al. (2018b) proposed learning a Harmonious Attention Convolutional Neural Network for saliency maps with global and local features. Rahimpour et al. (2017) use a triplet deep learning architecture in which the first block obtains a saliency map of the person and the second block obtains the feature representation.

This paper differs from other works in that it develops a method for person re-identification based on the detection of salient regions combined with deep learning without fine-tuning, which yields a final unsupervised saliency map. Traditional methods are used for feature extraction. The major contribution of this work is a saliency map used to obtain a weighted person image (WPI) by combining two saliency maps, one from deep learning (VGG-F) and another from the FqSD algorithm.

COMPUTATIONAL METHODOLOGY

Our Approach

The method is shown in Figure 2. A convolutional neural network is used to obtain a filter that represents a salient region in the image. This filter is combined with another saliency map obtained from the FqSD algorithm. Finally, the resulting saliency map is used to weight the information in a person image.

Fig. 1 Samples of different pedestrians captured by two cameras in the VIPeR (Gray et al., 2007) and PRID2011 (Hirzer et al., 2011) datasets. The first row is camera A and the second row is camera B. Each column shows images of the same person.

Weighted Person Image

In deep learning it is well known that the last convolutional layer represents salient information of the image (Zeiler et al., 2011). VGG-F is a model (Chatfield et al., 2014) from Return of the Devil (CNN-F), which was trained on the ImageNet dataset (1.2 million images). During training the filters are learned; their values are called coefficients or weights. In general, these learned weights tend to capture semantic information present in an image. In person re-identification the datasets are homogeneous because they contain only person images. For this reason, the filters in the last convolutional layer (layer five in our experiments) generally represent the salient regions of a person.

Layer five of VGG-F has 64 filters. We observed that filter 38 in particular gives the best representation of saliency for a person image. The feature map (filter 38) is improved using the following transforms:

Fig. 2 Illustration of the different steps used to obtain the WPI.

(1)

(2)

(3)

where F₃₈ is filter 38 in layer five of VGG-F, Γ is a radial filter and (⋆) denotes the convolution product. SM_F38 is the saliency map obtained from filter 38.

The other method used to obtain a saliency map is the FqSD algorithm (Guerra et al., 2018). This algorithm works as follows: a Gaussian pyramid is applied to obtain several lower-resolution images, and all of these images are processed to build saliency maps using spatial and frequency information in the quaternion domain. The algorithm outputs a salient object; we modified it to obtain a saliency map of the whole person image. In the image fusion step of the FqSD algorithm, a normalization is applied as follows:

(4)

where SM_FqSD is the resulting saliency map and IF is the saliency map obtained in the image fusion step of the FqSD algorithm. To obtain an improved saliency map, the corresponding elements of SM_F38 and SM_FqSD are fused as shown in expression (5).

(5)

where m and n are the spatial coordinates of the image. Finally, SM_I(m,n) is multiplied with each channel of the original image to obtain a weighted person image (WPI); see Figure 3.
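The final weighting step can be sketched in a few lines of NumPy. This is only an illustration of the channel-wise multiplication described above: the function name `weight_person_image`, the assumption that the fused map is already scaled to [0, 1], and the toy rectangular map are all hypothetical, and the fusion that produces SM_I is not reproduced here.

```python
import numpy as np


def weight_person_image(image: np.ndarray, saliency_map: np.ndarray) -> np.ndarray:
    """Multiply every colour channel of `image` by a per-pixel saliency weight.

    image:        H x W x C array (e.g. RGB).
    saliency_map: H x W array, assumed to be already scaled to [0, 1].
    """
    if image.shape[:2] != saliency_map.shape:
        raise ValueError("image and saliency map must share height and width")
    # Broadcasting the H x W map over the channel axis applies the same
    # weight to all channels of a pixel.
    return image.astype(np.float64) * saliency_map[:, :, np.newaxis]


# Toy example with the VIPeR image size (128 rows x 48 columns).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(128, 48, 3)).astype(np.uint8)
saliency = np.zeros((128, 48))
saliency[32:96, 12:36] = 1.0  # hypothetical salient region around the torso
wpi = weight_person_image(image, saliency)
```

Pixels with a weight of zero are suppressed entirely, while pixels with a weight of one keep their original colour values.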

Experimental design

Our goal is to validate the performance of the developed method on a complex person dataset, applying different mechanisms for feature extraction. VIPeR dataset: it has 1264 images of 632 persons, captured by two cameras in an outdoor academic environment. Only one image of each person appears per camera (see Figure 1). The main challenges in this dataset are different poses, strong illumination changes and background clutter; all person images have a size of 48x128. The viewpoint of the captured images also varies, from 0 to 90 degrees (camera A) and from 90 to 180 degrees (camera B).

To validate the performance of our method, the following protocol is applied. The Cumulative Matching Characteristic (CMC) curve is used because it provides the rank-k recognition rate. For metric learning, the dataset is divided into training (316)/test (316) sets, and all experiments were run 10 times. To learn a Mahalanobis-based distance, the Keep It Simple and Straightforward Metric (KISSME) (Koestinger et al., 2012) and Cross-view Quadratic Discriminant Analysis (XQDA) (Liao et al., 2015) approaches are used. KISSME obtains a distance from statistical inference with equivalence constraints, and XQDA is an extension of KISSME based on cross-view quadratic discriminant analysis, where the metric is learned.
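As an illustration of how a rank-k recognition rate can be computed in the single-shot setting, the sketch below builds a CMC curve from a probe-by-gallery distance matrix. The function name `cmc_curve`, the convention that probe i matches gallery i, and the strict "closer than" tie handling are assumptions made for this example, not details from the paper.

```python
import numpy as np


def cmc_curve(dist: np.ndarray, max_rank: int = 20) -> np.ndarray:
    """Cumulative Matching Characteristic for a single-shot setting.

    dist[i, j] is the distance between probe i and gallery j, and the
    correct match of probe i is assumed to be gallery i.
    Entry k-1 of the result is the rank-k recognition rate.
    """
    n_probe = dist.shape[0]
    # 0-based rank of the true match: how many gallery items are strictly closer.
    true_dist = np.diag(dist)[:, np.newaxis]
    ranks = (dist < true_dist).sum(axis=1)
    hits = np.zeros(max_rank)
    for r in ranks:
        if r < max_rank:
            hits[r:] += 1  # a match found at rank r also counts for all larger ranks
    return hits / n_probe


# Three probes against three gallery images (made-up distances).
dist = np.array([[0.1, 0.5, 0.9],
                 [0.2, 0.6, 0.1],
                 [0.7, 0.3, 0.8]])
cmc = cmc_curve(dist, max_rank=3)
```

In this toy case only the first probe is matched at rank 1, so the rank-1 rate is 1/3 and the curve reaches 1.0 at rank 3. Under a protocol with repeated random splits, such curves would be averaged over the runs.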

Implementation details: A feature vector (FV₁) is formed from the feature maps in layer five of VGG-F. Before building FV₁, the maps are pre-processed as follows:

Fig. 3 Visual samples: original person images in the first row and the corresponding WPI in the second row.

(6)

(7)

where F_n are the feature maps in layer five, n ∈ {1,…,64}, and SM_Fn are the improved saliency maps. Each SM_Fn is then divided into eight horizontal rectangular strips and a score (saliency value) is obtained for each strip. FV₁ therefore has a dimension of 512 (8x64 = 512).
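The strip-based construction of FV₁ can be sketched as follows. The text does not say how the per-strip saliency score is computed, so the mean over the strip is used here purely as an assumption; the function name `strip_scores` and the 13 x 13 map size are likewise hypothetical.

```python
import numpy as np


def strip_scores(feature_maps: np.ndarray, n_strips: int = 8) -> np.ndarray:
    """Summarise each map by one score per horizontal strip.

    feature_maps: N x H x W array (N maps, e.g. 64).
    Returns a vector of length N * n_strips.
    The strip mean is an assumed scoring rule, not the paper's definition.
    """
    scores = []
    for fmap in feature_maps:
        # array_split tolerates heights that are not a multiple of n_strips.
        for strip in np.array_split(fmap, n_strips, axis=0):
            scores.append(strip.mean())
    return np.asarray(scores)


maps = np.random.default_rng(0).random((64, 13, 13))  # hypothetical map size
fv1 = strip_scores(maps)
```

With 64 maps and eight strips each, the resulting vector has 64 x 8 = 512 entries, matching the stated dimension of FV₁.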

A second feature vector, FV₂, is built using color histograms and Local Binary Patterns (LBP). The color spaces used are RGB, HSV, normalized RGB, Lab and YCbCr, giving 15 channels; a 256-bin histogram is built for each channel. For LBP, only the RGB, HSV and normalized RGB color spaces are used, and a 64-bin histogram is built per color channel. As for FV₁, the images are divided into eight regions, but only six are used (the first and eighth regions are rejected because they contain no important information, for example, head and feet). FV₂ has a dimension of 35 328. In the (WPI + Original) method, FV₁ and FV₂ are computed on both the original image and the WPI.
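A quick arithmetic check of the FV₂ size, assuming the per-region histograms are simply concatenated (the concatenation scheme is an assumption, since the text does not spell it out). Under that assumption the reported dimension of 35 328 equals the eight-region total, whereas six regions with the same layout would give 26 496; the source does not say how these two statements are reconciled.

```python
# Per-region histogram sizes as described in the text.
colour_channels = 5 * 3   # RGB, HSV, normalized RGB, Lab, YCbCr
colour_bins = 256
lbp_channels = 3 * 3      # RGB, HSV, normalized RGB
lbp_bins = 64

# Assumed layout: all colour and LBP histograms of a region are concatenated.
per_region = colour_channels * colour_bins + lbp_channels * lbp_bins

dim_six_regions = 6 * per_region
dim_eight_regions = 8 * per_region
print(per_region, dim_six_regions, dim_eight_regions)  # prints: 4416 26496 35328
```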

RESULTS AND DISCUSSION

As Table 1 shows, using only FV₁, which represents the salient information in each feature map of layer five of VGG-F, gives rank-1 values of 10.28% (KISSME) and 17.03% (XQDA). This result is made possible by the intrinsic characteristics of the last layer, which bears a similarity to the visual attention mechanism of the human brain. When FV₂ is concatenated with FV₁, the best result is obtained with the XQDA metric, with a value of 21.68%, whereas with KISSME the increase is not significant. This is because XQDA is more stable than KISSME for high-dimensional feature vectors, especially at the first rank. On the other hand, it is not always possible to ensure that the salient region is visible from every camera viewpoint. We address this using the (WPI + Original) combination: Table 1 shows that the results increase to 19.03% (KISSME) and 24.34% (XQDA). However, across ranks 1, 5, 10 and 20, XQDA gives the best result only at rank 1; the best results at the other ranks are obtained with KISSME. In other words, it is necessary to use information from both the original and the WPI person images to obtain features that are robust across cameras.

The comparison with other state-of-the-art algorithms shows that our proposal is competitive in terms of the CMC. Our results are among those of recent reports in Re-id. However, the best results are obtained with deep learning algorithms for person re-identification that use training. Note that our proposal uses deep learning without training. The main advantage of the proposed method is that it avoids long training times and a strong dependence on the amount of data used. The disadvantage is lower accuracy at the first rank of the CMC.

Table 1 Comparison of results among feature vectors and state-of-the-art algorithms on the VIPeR dataset. The values are expressed in percent at ranks 1, 5, 10 and 20 of the CMC.

CONCLUSIONS

Features for a person re-identification algorithm could be extracted with the VGG-F architecture without any training or fine-tuning. A final saliency map of the person image is obtained by combining the feature maps in layer five of VGG-F with the improved FqSD algorithm. Extracting features from both the original image and the WPI is the best option for increasing the performance of the algorithm on the VIPeR dataset. In future work, new combinations and feature descriptors will be explored to increase robustness in different scenes. Further experiments will also be carried out with other datasets and different deep learning architectures.

REFERENCES

Kansal, Kajal; Subramanyam, A. V. (2019). Hdrnet: Person Re-Identification Using Hybrid Sampling in Deep Reconstruction Network. IEEE Access, vol. 7, p. 40856-40865.

Liao, Shengcai, et al. (2015). Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. p. 2197-2206.

Jia, Jieru, et al. (2017). Multiple metric learning with query adaptive weights and multi-task re-weighting for person re-identification. Computer Vision and Image Understanding, vol. 160, p. 87-99.

Wang, Jiabao; Li, Yang; Miao, Zhuang. (2017). Siamese cosine network embedding for person re-identification. In CCF Chinese Conference on Computer Vision. Springer, Singapore. p. 352-362.

Álvarez-Miranda, Eduardo; Díaz-Guerrero, John. (2018). Multicriteria saliency detection: a (exact) robust network design approach. Annals of Operations Research, p. 1-20.

Zhao, Rui; Ouyang, Wanli; Wang, Xiaogang. (2013). Unsupervised salience learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. p. 3586-3593.

Martinel, Niki; Micheloni, Christian; Foresti, Gian Luca. (2014). Saliency weighted features for person re-identification. In European Conference on Computer Vision. Springer, Cham. p. 191-208.

Huo, Zhonghua; Chen, Ying; Hua, Chunjian. (2015). Person re-identification based on multi-directional saliency metric learning. In International Conference on Computer Vision Systems. Springer, Cham. p. 45-55.

Li, Tiezhu, et al. (2018a). Person re-identification using salient region matching game. Multimedia Tools and Applications, vol. 77, no. 16, p. 21393-21415.

Li, Wei; Zhu, Xiatian; Gong, Shaogang. (2018b). Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. p. 2285-2294.

Rahimpour, Alireza, et al. (2017). Person re-identification using visual attention. In 2017 IEEE International Conference on Image Processing (ICIP). IEEE. p. 4242-4246.

Gray, Douglas; Brennan, Shane; Tao, Hai. (2007). Evaluating appearance models for recognition, reacquisition, and tracking. In Proc. IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS). Citeseer. p. 1-7.

Hirzer, Martin, et al. (2011). Person re-identification by descriptive and discriminative classification. In Scandinavian Conference on Image Analysis. Springer, Berlin, Heidelberg. p. 91-102.

Zeiler, Matthew D., et al. (2011). Adaptive deconvolutional networks for mid and high level feature learning. In ICCV. p. 6.

Chatfield, Ken, et al. (2014). Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531.

Guerra, Reynolds León, et al. (2018). FqSD: Full-Quaternion Saliency Detection in Images. In Iberoamerican Congress on Pattern Recognition. Springer, Cham. p. 462-469.

Koestinger, Martin, et al. (2012). Large scale metric learning from equivalence constraints. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE. p. 2288-2295.

Feng, Guanhua, et al. (2019). Hessian Regularized Distance Metric Learning for People Re-Identification. Neural Processing Letters, p. 1-14.

Fang, Wen, et al. (2018). Perceptual hash-based feature description for person re-identification. Neurocomputing, vol. 272, p. 520-531.

Zhang, Chen; Liu, Qiaoling. (2018). Region constraint person re-identification via partial least square on Riemannian manifold. IEEE Access, vol. 6, p. 17060-17066.

Zhang, Chengyuan, et al. (2019). Crossing generative adversarial networks for cross-view person re-identification. Neurocomputing.

Mao, Chaojie, et al. (2018). Multi-channel pyramid person matching network for person re-identification. In Thirty-Second AAAI Conference on Artificial Intelligence.

Received: February 17, 2020; Accepted: March 31, 2020

^*Corresponding author. rleon@cenatav.co.cu

The authors authorize the distribution and use of this article.

Reynolds León Guerra: Contributed to the development and implementation of the general idea of the article, embodied in the proposed algorithm.

Edel B. García Reyes: Contributed to the supervision and improvement of the algorithm proposed in this article.