2021 Vol. 47, No. 3

Image captioning based on dependency syntax
BI Jianqi, LIU Maofu, HU Huijun, DAI Jianhua
2021, 47(3): 431-440. doi: 10.13700/j.bh.1001-5965.2020.0443
Abstract:

Current image captioning models can automatically apply part-of-speech sequences and syntactic trees to make the generated text grammatical. However, such models generally produce simple sentences, and little groundbreaking work on language models has promoted the interpretability of deep learning models. In this work, dependency syntax is integrated into a deep learning model to supervise image captioning, making the model more interpretable. An image structure attention mechanism, which recognizes the relationships between image regions based on dependency syntax, is applied to compute visual relations and obtain relation features. The fusion of image region relation features and image region features, together with the word embedding, is fed into a Long Short-Term Memory (LSTM) network to generate image captions. At test time, content keywords of the test and training images are matched through the content overlap of the two image sets, so that the dependency syntax template corresponding to a test image can be extracted indirectly, and diverse descriptions can be generated according to this template. Experimental results verify that the proposed model improves the diversity and syntactic complexity of the generated captions, and indicate that dependency syntax can enhance the interpretability of deep learning models.
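The attention step described above (score each image region against a decoder query, normalize, and take the weighted sum as the visual context to fuse with the word embedding) can be sketched briefly. This is a minimal numpy illustration with names of our own choosing, not the authors' implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def region_attention(region_feats, query):
    # Score every image region against the decoder query, normalize the
    # scores into an attention distribution, and return the weighted sum
    # of region features as the visual context vector.
    scores = region_feats @ query        # (num_regions,)
    weights = softmax(scores)
    context = weights @ region_feats     # (feat_dim,)
    return context, weights

rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))        # 4 image regions, 8-dim features
query = rng.normal(size=8)               # decoder hidden state as the query
context, weights = region_attention(regions, query)
# `context` would then be concatenated with the word embedding for the LSTM.
```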

Video summary generation based on multi-feature image and visual saliency
JIN Haiyan, CAO Tian, XIAO Cong, XIAO Zhaolin
2021, 47(3): 441-450. doi: 10.13700/j.bh.1001-5965.2020.0479
Abstract:

How to extract video content efficiently, that is, video summarization, is a research hotspot in computer vision. A video summary cannot be obtained effectively and completely by simply detecting image color, texture and other low-level features. Based on the visual attention pyramid model, this paper proposes an improved center-surround video summarization method with a variable ratio and double contrast calculation. First, the video image sequence is divided into pixel blocks by a superpixel method to speed up image computation. Then, the contrast feature differences under different color backgrounds are detected and fused. Finally, combined with optical flow motion information, the static and dynamic saliency results are merged to extract the key frames of the video. When extracting key frames, a perceptual hash function is used to judge similarity and complete the video summary generation. Simulation experiments are carried out on the SegTrack v2, ViSal and OVP datasets. The experimental results show that the proposed method effectively extracts the regions of interest and finally obtains a video summary expressed as a sequence of key frame images.
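The perceptual-hash similarity check used for key-frame deduplication can be illustrated with a simple average hash (aHash). This is a generic sketch, not the paper's exact hash function; `average_hash` and the noise level are our own choices:

```python
import numpy as np

def average_hash(gray, size=8):
    # Block-average the frame down to size x size, then threshold each
    # block at the global mean: a coarse 64-bit content fingerprint.
    h, w = gray.shape
    bh, bw = h // size, w // size
    small = gray[:bh * size, :bw * size].reshape(size, bh, size, bw).mean(axis=(1, 3))
    return (small > small.mean()).ravel()

def hamming(h1, h2):
    # Number of differing hash bits; small distance => near-duplicate frames.
    return int(np.count_nonzero(h1 != h2))

rng = np.random.default_rng(1)
frame = rng.random((64, 48))
noisy = frame + rng.normal(scale=0.005, size=frame.shape)  # near-duplicate frame
```

A candidate key frame would be dropped when its Hamming distance to an already selected frame falls below a threshold.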

Heterogeneous remote sensing image change detection based on hybrid network
ZHOU Yuan, LI Xiangrui, YANG Jing
2021, 47(3): 451-460. doi: 10.13700/j.bh.1001-5965.2020.0455
Abstract:

In order to more quickly and accurately perform change detection on heterogeneous remote sensing images, this paper presents a heterogeneous remote sensing image change detection algorithm based on a hybrid network. The algorithm uses a pseudo-siamese network to extract change features between heterogeneous image blocks in the spatial dimension, and an early fusion network to extract change features between them in the spectral dimension. The features extracted by the two networks are fused, and the fused features are input to a sigmoid layer for binary classification to determine whether a change has occurred. In addition, a contrastive loss function is added to the pseudo-siamese network, so that in the feature space the spatial features of unchanged image pairs are drawn closer and those of changed image pairs are pushed farther apart, which improves the network's discriminative ability and convergence speed.
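The contrastive loss added to the pseudo-siamese network has the standard form below (a hedged numpy sketch; the margin value and feature shapes are illustrative, not taken from the paper):

```python
import numpy as np

def contrastive_loss(f1, f2, unchanged, margin=1.0):
    # unchanged=1: pull the pair's features together (squared distance);
    # unchanged=0: push them at least `margin` apart (hinge term).
    d = np.linalg.norm(f1 - f2)
    return unchanged * d ** 2 + (1 - unchanged) * max(0.0, margin - d) ** 2

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])   # same features: an unchanged pair
c = np.array([-2.0, 0.0])  # distant features: a changed pair
```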

Causal classification method of transmission lines fitting defect combined with deep features
ZHAO Zhenbing, ZHANG Wei, QI Yincheng, ZHAI Yongjie, ZHAO Wenqing
2021, 47(3): 461-468. doi: 10.13700/j.bh.1001-5965.2020.0456
Abstract:

To address the scarcity of transmission line fitting defect samples and the diversity of defect target shapes, a causal classification method combining a deep network and a logistic regression model is proposed to overcome the low classification accuracy obtained when using deep learning models alone. Firstly, rich and diverse datasets are obtained through sample expansion. Secondly, deep features are extracted with a fine-tuned VGG16 model and processed to construct an input feature set suitable for causal learning. Finally, the causal relationship between fitting defect features and labels is learned through global balancing, and a causal logistic regression model is constructed to classify the fitting defects. Four types of fitting defect images collected by UAV are used in the experiments to prove the effectiveness of the proposed method; after expansion, the number of training and testing samples is about 5 times that of the original dataset. The experimental results show that the proposed method achieves accurate classification of fitting defects: the classification accuracy for shockproof hammer intersection and deformation reaches 0.9299 and 0.9118 respectively, and that for shielding ring corrosion and grading ring damage reaches 0.9567 and 0.9669 respectively.

Multimodal social sentiment analysis based on semantic correlation
HU Huijun, FENG Mengyuan, CAO Mengli, LIU Maofu
2021, 47(3): 469-477. doi: 10.13700/j.bh.1001-5965.2020.0451
Abstract:

Social platforms allow users to express opinions in a variety of information modalities, and fusing multimodal semantic information can more effectively predict the emotional tendencies expressed by users. Therefore, multimodal sentiment analysis has received extensive attention in recent years. However, the visual and textual semantics of a post may be unrelated, which degrades sentiment analysis performance. To solve this problem, this paper proposes the Multimodal Social Sentiment Analysis based on Semantic Correlation (MSSA-SC) method. MSSA-SC first adopts an image-text semantic relevance classification model to identify whether the image and text of a social media post are semantically related. If they are, an image-text semantic alignment multimodal model fuses the image and text features for sentiment analysis; if they are not, sentiment analysis is performed on the text modality alone. The experimental results on real social media datasets show that MSSA-SC effectively reduces the influence of unrelated image and text semantics on multimodal social sentiment analysis, achieving an Accuracy of 75.23% and a Macro-F1 of 70.18%, outperforming the benchmark models.
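The relevance-gated routing at the heart of MSSA-SC reduces to a simple branch: fuse when the relevance classifier fires, otherwise fall back to text only. A minimal sketch with hypothetical `fuse` and `text_only` callables (names and threshold are our assumptions):

```python
def predict_sentiment(text_feat, image_feat, relevance_prob,
                      fuse, text_only, threshold=0.5):
    # Fuse modalities only when image and text are judged semantically
    # related; otherwise ignore the image entirely.
    if relevance_prob >= threshold:
        return fuse(text_feat, image_feat)
    return text_only(text_feat)

# Toy stand-ins for the two sentiment models.
fuse = lambda t, v: ('multimodal', t + v)
text_only = lambda t: ('text-only', t)
```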

Deep multimodal feature fusion for micro-video classification
ZHANG Lijuan, CUI Tianshu, JING Peiguang, SU Yuting
2021, 47(3): 478-485. doi: 10.13700/j.bh.1001-5965.2020.0457
Abstract:

Nowadays, micro-video has become one of the most representative products of the new media era. Its short duration and heavy editing make traditional video classification models no longer suitable for the micro-video classification task. Based on these characteristics, a micro-video classification algorithm using deep multimodal feature fusion is proposed. The proposed algorithm inputs the visual and acoustic modal information into a domain separation network, dividing the entire feature space into a shared-domain part common to all modalities and private-domain parts unique to the acoustic and visual modalities respectively. By optimizing the domain separation network, the differences and similarities among the modal features are preserved to the greatest extent. Experiments on a public micro-video classification dataset prove that the proposed algorithm effectively reduces the redundancy of feature fusion and improves the average classification accuracy to 0.813.
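The shared/private split is usually enforced with an orthogonality ("difference") loss between the two feature matrices. The sketch below follows the common domain-separation-network formulation; it is our illustration, not the authors' code:

```python
import numpy as np

def difference_loss(shared, private):
    # Squared Frobenius norm of the cross-correlation between shared and
    # private feature matrices (rows = samples): zero exactly when the
    # two feature sets are orthogonal across the batch.
    return float(np.linalg.norm(shared.T @ private, 'fro') ** 2)

shared = np.array([[1.0], [1.0]])        # (batch=2, dim=1) shared features
private_orth = np.array([[1.0], [-1.0]]) # private features orthogonal to shared
```

Minimizing this term pushes the private subspace to encode only what the shared subspace does not.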

Fast, low-power and high-precision 3D reconstruction of UAV images based on FPGA
LI Jie, LI Yixuan, WU Tiansheng, WANG Haorong, LIANG Min
2021, 47(3): 486-499. doi: 10.13700/j.bh.1001-5965.2020.0452
Abstract:

The existing 3D reconstruction methods based on Unmanned Aerial Vehicle (UAV) images cannot meet the mobile terminal's demand for low power consumption and high time efficiency. To tackle this issue, we propose a fast, low-power and high-precision 3D reconstruction method for resource-constrained FPGA platforms, which combines an instruction optimization strategy with hardware-software co-design. First, we construct a multi-scale depth map fusion algorithm architecture to enhance the robustness of traditional FPGA phase correlation algorithms in untrustworthy areas, such as low-texture regions and rivers. Second, based on a highly parallel instruction-optimized hardware acceleration strategy, a high-performance hardware-software co-design scheme is proposed to run the multi-scale depth map fusion architecture efficiently on an FPGA platform with limited resources. Finally, we comprehensively compare state-of-the-art CPU and GPU methods with ours. The experimental results show that our method is close to the GPU method in reconstruction time, nearly 20 times faster than the CPU method, while its power consumption is only 2.23% of the GPU method's.

Many-to-many voice conversion with sentence embedding based on VAACGAN
LI Yanping, CAO Pan, SHI Yang, ZHANG Yan
2021, 47(3): 500-508. doi: 10.13700/j.bh.1001-5965.2020.0475
Abstract:

To solve the problems of poor speech quality and unsatisfactory speaker similarity of converted speech in existing non-parallel voice conversion methods, this paper presents a novel voice conversion model based on a Variational Autoencoding Auxiliary Classifier Generative Adversarial Network (VAACGAN) with sentence embedding, which achieves high-quality many-to-many voice conversion for non-parallel corpora. First, the discriminator contains an auxiliary classifier network that predicts both whether a spectral feature is real or fake and the speaker category to which the training data belongs, achieving a more stable training process and faster iterative convergence. Furthermore, a sentence embedding is obtained by training a text encoder and introduced into the model as a semantic content constraint; it enhances the ability of the latent variables to characterize speech content, effectively alleviates the over-regularization of the latent variables, and significantly improves the quality of the converted speech. Experimental results show that, compared with the baseline method, the average MCD of the converted speech decreases by 6.67%, MOS increases by 8.33%, and the ABX score increases by 11.56%, demonstrating that the proposed method significantly outperforms the baseline in both speech quality and speaker similarity and achieves high-quality voice conversion.

An underwater coral reef fish detection approach based on aggregation of spatio-temporal features
CHEN Zhineng, SHI Cuncun, LI Xuanya, JIA Caiyan, HUANG Lei
2021, 47(3): 509-519. doi: 10.13700/j.bh.1001-5965.2020.0444
Abstract:

It is challenging to detect coral reef fish in underwater surveillance videos, due to poor video imaging quality, complex underwater environments, high visual diversity of coral reef fish, etc. Extracting discriminative features to characterize the fish has thus become a crucial issue that dominates detection accuracy. This paper proposes an underwater coral reef fish detection method based on the aggregation of spatio-temporal features, achieved by designing two modules for visual and temporal feature aggregation and by fusing multi-dimensional features. The former designs a top-down partition and a bottom-up merging that effectively aggregate feature maps of different convolutional layers with varying resolutions. The latter devises a temporal feature fusion scheme based on the pixel difference between adjacent frames, which enhances the feature representation of moving objects and their surrounding area by fusing feature maps from adjacent frames. Experiments on a public dataset show that the spatio-temporal aggregation network built on these two modules effectively detects coral reef fish in challenging underwater environments, and higher detection accuracy is obtained compared with existing methods and popular detection models.
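The temporal module's idea, boosting features where adjacent frames differ, can be sketched as follows (shapes and the `alpha` weight are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

def motion_enhance(feat, frame_t, frame_prev, alpha=0.5):
    # The normalized absolute pixel difference between adjacent frames
    # acts as a motion mask that amplifies feature responses around
    # moving objects; static areas are left unchanged.
    diff = np.abs(frame_t - frame_prev)
    mask = diff / (diff.max() + 1e-8)
    return feat * (1.0 + alpha * mask)

frame_prev = np.zeros((4, 4))
frame_t = np.zeros((4, 4))
frame_t[1, 1] = 1.0                 # one moving pixel between the frames
feat = np.ones((4, 4))              # toy feature map, same size for simplicity
out = motion_enhance(feat, frame_t, frame_prev)
```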

Multi-granularity hazard detection method for electrical power system
XU Xiaohua, QIAN Ping, WANG Yida, ZHOU Xinyue, XU Hanlin, XU Libing
2021, 47(3): 520-530. doi: 10.13700/j.bh.1001-5965.2020.0491
Abstract:

As security hazards in the electrical power system can lead to serious economic damage and social impact, potential hazard detection has become an indispensable part of power system operation. With advances in artificial intelligence, deep learning based hazard detection methods for the electrical power system have emerged. Although the existing methods have made promising progress, most of them consider only the global or local features of an image, which cannot thoroughly characterize the image or accurately detect hazards in the power system context, especially against complex outdoor backgrounds. In light of this, we present a multi-granularity hazard detection network, MGNet, for the electrical power system. Specifically, we explore the multi-granularity representation of images with both global and local representation learning networks. Based on that, we conduct hazard detection at different granularity levels and finally fuse the detection results collaboratively to achieve precise hazard detection. Extensive experiments on two real-world hazard datasets (i.e., a tower connection fitting hazard dataset and a transmission line channel mechanical hazard dataset) demonstrate the superior detection performance of the proposed model. In particular, the mean average precision is improved by 2.74% and 2.77% on the two datasets, respectively, compared with the best existing hazard detection benchmark method.

Topological relation detection technology of substation wiring diagram in electric power system
LI Hao, GUAN Ti, WANG Shan, SHI Wei, LIU Zixin, LIU Xiaochuan
2021, 47(3): 531-538. doi: 10.13700/j.bh.1001-5965.2020.0476
Abstract:

The topological relation of electrical components is the core data required by substation wiring diagram automatic generation technology. At present, known techniques still rely heavily on manual acquisition of topological relations. By combining deep learning based object detection with traditional computer image processing, topological relations can be detected automatically. Firstly, to segment the electrical components and connection lines, deep learning based object detection was used to identify the electrical components, and image processing was used to preprocess the raster-format wiring diagrams of power plants. Secondly, a contour tracking algorithm was adopted to detect and mark the connected regions of the connection lines. Finally, the topological relation of the drawing was acquired from the obtained information on electrical components and connection lines. Comparative experiments based on a dataset released by the State Grid Corporation of China indicate the effectiveness of the proposed method.
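The connected-area marking step can be approximated with plain connected-component labeling via flood fill; the paper uses a contour tracking algorithm, so treat this only as a stand-in sketch of the same goal (grouping connection-line pixels into wires):

```python
def label_components(grid):
    # 4-connected component labeling of a binary image via iterative
    # flood fill: each group of touching foreground pixels (a wire
    # segment) receives a distinct positive label.
    h, w = len(grid), len(grid[0])
    labels = [[0] * w for _ in range(h)]
    cur = 0
    for i in range(h):
        for j in range(w):
            if grid[i][j] and not labels[i][j]:
                cur += 1
                stack = [(i, j)]
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and grid[y][x] and not labels[y][x]:
                        labels[y][x] = cur
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return labels, cur

grid = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
labels, n = label_components(grid)   # two separate connection lines
```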

Wiring diagram detection and check based on deep learning and graph matching
LI Hao, WANG Shan, GENG Yujie, WANG Li, SUN Wenchang, MIAO Chunyuan
2021, 47(3): 539-548. doi: 10.13700/j.bh.1001-5965.2020.0478
Abstract:

The drawing and management of the traditional primary wiring diagram of a plant or substation depend mainly on power grid operators, which is time-consuming and labor-intensive and lacks scientific, verifiable reference standards. To address this problem, we propose an algorithm for automatic detection, recognition and verification based on deep neural networks and digital image processing. Specifically, Faster R-CNN is first adopted to detect the electrical components of the wiring diagram with 92% detection accuracy. Meanwhile, an end-to-end text detection and recognition model recognizes the text with 94.2% detection accuracy and 92% character recognition accuracy. We then use digital image processing to identify the connections of the wiring diagram and its topological relation. Finally, an improved VF2 graph matching algorithm is used to check the difference between the electronic and manually maintained diagrams: the topological data are abstracted into an undirected graph, the relative position information of components is obtained through outline numbering, and the matching rate of the two graphs is computed to support verification. Compared with a node-traversal matching method, the verification accuracy is improved by 37.5%. Based on the primary wiring diagrams of several substations provided by a power supply company, this paper annotates the electrical components of the wiring diagrams and contributes a small wiring diagram dataset.
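Computing a matching rate between the electronic and manually maintained diagrams can be illustrated, much more crudely than VF2, as the edge overlap of two undirected graphs (a simplified stand-in for the paper's improved VF2, with node names of our own choosing):

```python
def matching_rate(edges_a, edges_b):
    # Jaccard overlap of undirected edge sets: 1.0 when the two wiring
    # topologies agree exactly, lower as connections diverge.
    norm = lambda es: {frozenset(e) for e in es}
    a, b = norm(edges_a), norm(edges_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

edges_a = [('busbar', 'breaker'), ('breaker', 'transformer')]
edges_b = [('breaker', 'busbar'), ('breaker', 'transformer'),
           ('transformer', 'feeder')]
```

VF2 proper additionally finds a node correspondence; the edge-overlap ratio only scores topologies under a known correspondence.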

Small sample hyperspectral image classification method based on memory association learning
WANG Cong, ZHANG Jinyang, ZHANG Lei, WEI Wei, ZHANG Yanning
2021, 47(3): 549-557. doi: 10.13700/j.bh.1001-5965.2020.0498
Abstract:

Hyperspectral Image (HSI) classification is one of the fundamental applications in the remote sensing domain. Owing to the expensive cost of manually labeling HSIs, only a small number of labeled samples can be obtained in real applications. However, limited samples cannot accurately describe the data distribution and often cause classifiers to overfit during training. To address this problem, we present a small-sample hyperspectral image classification method based on memory association learning. First, considering that unlabeled samples also contain much information about the data distribution, we construct a memory module based on the labeled samples. Then, according to the feature associations between labeled and unlabeled samples, we learn the label distribution of each unlabeled sample with the continuously updated memory module. Finally, we build an unsupervised classifier model and a supervised classifier model and learn the two models jointly. Extensive experimental results on multiple hyperspectral image classification datasets demonstrate that the proposed method effectively improves the accuracy of small-sample HSI classification.
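Learning an unlabeled sample's label distribution from a labeled memory module can be sketched as similarity-weighted voting over memory slots (a minimal numpy illustration; in the paper the memory is updated continuously during training):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pseudo_label(unlabeled_feat, memory_feats, memory_labels, n_classes):
    # Similarity of the unlabeled sample to each memory slot, softmax-
    # normalized, then accumulated per class: a soft label distribution.
    sims = memory_feats @ unlabeled_feat
    w = softmax(sims)
    dist = np.zeros(n_classes)
    for wi, yi in zip(w, memory_labels):
        dist[yi] += wi
    return dist

mem_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # 3 labeled slots
mem_labels = [0, 1, 0]
dist = pseudo_label(np.array([1.0, 0.0]), mem_feats, mem_labels, n_classes=2)
```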

Low-latency video coding techniques
SONG Li, LIU Xiaoyong, WU Guoqing, ZHU Chen, HUANG Yan, XIE Rong, ZHANG Wenjun
2021, 47(3): 558-571. doi: 10.13700/j.bh.1001-5965.2020.0463
Abstract:

With the widespread use of video coding and transmission techniques, demand for video has increased dramatically. Real-time video communication has become a research focus of the video industry; its core goal is to provide a better user experience with lower latency. Low-latency video coding is a key component of real-time video communication applications, as the overall system latency can be effectively reduced by reducing the coding latency. First, this paper analyzes the sources of latency in a video transmission system and, focusing on the general video coding framework, introduces the generation mechanism of coding latency. Then, mainstream domestic and international video coding standards are outlined, and a detailed description of the principles and models of rate-distortion optimization provides a theoretical basis for the design of low-latency video encoders. Additionally, this paper summarizes how to optimize coding latency in terms of reference structures, pipeline design, encoding mode search, rate control, and hardware acceleration, and surveys representative industrial low-latency video coding schemes. Finally, it summarizes the limitations of existing low-latency video coding techniques and presents future research directions.
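Rate-distortion optimization, the theoretical core mentioned above, selects the coding mode minimizing the Lagrangian cost J = D + lambda * R. A toy sketch with made-up distortion/rate numbers (the mode names and values are illustrative only):

```python
def best_mode(modes, lam):
    # Lagrangian rate-distortion cost: pick the mode with minimal
    # J = D + lambda * R. Larger lambda trades distortion for fewer bits.
    return min(modes, key=lambda m: m['D'] + lam * m['R'])

modes = [
    {'name': 'intra', 'D': 10.0, 'R': 2.0},
    {'name': 'inter', 'D': 12.0, 'R': 0.5},
    {'name': 'skip',  'D': 25.0, 'R': 0.1},
]
```

At a low lambda the encoder favors the low-distortion intra mode; as lambda grows, cheaper-rate inter/skip modes win.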

Saliency guided low-light face detection
LI Kefu, ZHONG Huicai, GAO Xingyu, WENG Chaoqun, CHEN Zhenyu, LI Yongzhou, WANG Shizheng
2021, 47(3): 572-584. doi: 10.13700/j.bh.1001-5965.2020.0469
Abstract:

To deal with the difficulty convolutional neural networks have in detecting faces in low-light environments, we propose a method combining image saliency and deep learning for low-light face detection, which integrates the saliency map with the original RGB channels of the image for neural network training. Extensive experiments are conducted on DARK FACE, a low-light face dataset, and the results show that the proposed method achieves better detection accuracy than the existing mainstream face detection algorithms on DARK FACE, confirming its validity.
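Integrating saliency with the RGB channels amounts to stacking the saliency map as an extra input channel; a minimal sketch (the exact channel layout is our assumption, not necessarily the paper's):

```python
import numpy as np

def add_saliency_channel(rgb, saliency):
    # Stack the (H, W) saliency map as a fourth channel alongside
    # R, G, B so the detector is trained on a 4-channel input.
    return np.concatenate([rgb, saliency[..., None]], axis=-1)

rgb = np.zeros((32, 32, 3))   # toy RGB image
sal = np.ones((32, 32))       # toy saliency map
x = add_saliency_channel(rgb, sal)
```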

Classification of satellite cloud images of disaster weather based on adversarial and transfer learning
ZHANG Minjing, BAI Cong, ZHANG Jinglin, ZHENG Jianwei
2021, 47(3): 585-595. doi: 10.13700/j.bh.1001-5965.2020.0459
Abstract:

Weather can be forecast from clouds. However, how to use deep learning to achieve automatic weather forecasting, especially the automatic recognition of disaster weather, remains largely unexplored. Hence, it is necessary to study the basic problem in this field: the classification of satellite cloud images. Satellite cloud images suffer from a serious data imbalance problem; that is, cloud images related to severe weather account for a very small proportion of all cloud image data. Therefore, this paper proposes a framework combining a Generative Adversarial Network (GAN) with a Transfer Learning (TL) based Convolutional Neural Network (CNN) to address the low accuracy of disaster weather classification from satellite cloud images. The framework is divided into a GAN-based data balancing module and a transfer learning based CNN classification module, which address the imbalance problem at the data level and the algorithm level respectively, yielding a relatively balanced dataset and a classification model that extracts balanced features across the different classes. Eventually, the classification of satellite cloud images is achieved and its accuracy for disaster weather is improved. The proposed method has been tested on a self-built large-scale satellite cloud image dataset. The ablation studies and comprehensive experimental results prove that the proposed data balancing and transfer learning methods are effective, and that the framework significantly improves the classification accuracy of various disaster weather categories.

Cross-modal video retrieval algorithm based on multi-semantic clues
DING Luo, LI Yifan, YU Chenglong, LIU Yang, WANG Xuan, QI Shuhan
2021, 47(3): 596-604. doi: 10.13700/j.bh.1001-5965.2020.0470
Abstract:

Most existing cross-modal video retrieval algorithms map heterogeneous data to a common space, so that semantically similar data are close to each other and semantically dissimilar data are far apart; that is, they establish a global similarity relationship between different modalities. However, these methods ignore the rich semantic clues in the data, which limits the quality of the generated features. To solve this problem, we propose a cross-modal retrieval model based on multi-semantic clues. The model captures the semantically important frames within a video through a multi-head self-attention mechanism, attending to the important information of the video data to obtain global features. A bidirectional Gated Recurrent Unit (GRU) captures the interaction features between contexts within the multimodal data. Our method also mines local information in video and text data through joint coding of the slight differences between local data. Together, the global features, context interaction features and local features form the multi-semantic clues that better mine the semantic information in the data and improve retrieval. In addition, an improved triplet distance measurement loss function is proposed, which adopts hard negative mining based on similarity sorting and improves the learning of cross-modal features. Experiments show that the proposed method improves the text-to-video retrieval task by 11.1% over state-of-the-art methods on the MSR-VTT dataset, and by 5.0% on the MSVD dataset.
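The similarity-sorted hard negative mining in the improved triplet loss can be sketched as follows (cosine similarity and the margin value are illustrative choices, not necessarily the paper's):

```python
import numpy as np

def hard_triplet_loss(anchor, positive, negatives, margin=0.2):
    # Mine the hardest negative (the one most similar to the anchor,
    # i.e. top of the similarity sort) and apply the triplet hinge.
    sim = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    hardest = max(negatives, key=lambda n: sim(anchor, n))
    return max(0.0, margin - sim(anchor, positive) + sim(anchor, hardest))

anchor = np.array([1.0, 0.0])
negatives = [np.array([0.0, 1.0]), np.array([0.0, -1.0])]
loss_easy = hard_triplet_loss(anchor, anchor, negatives)            # perfect match
loss_hard = hard_triplet_loss(anchor, np.array([0.0, 1.0]), negatives)
```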

Cross-resolution person re-identification based on attention mechanism
LIAO Huanian, XU Xin
2021, 47(3): 605-612. doi: 10.13700/j.bh.1001-5965.2020.0471
Abstract:

The resolution variation of person images poses great challenges to current person re-identification methods. To address this problem, this paper presents a cross-resolution person re-identification method that handles resolution variation from two aspects. On the one hand, spatial and channel attention mechanisms are utilized to capture person features and obtain local regions; on the other hand, the local information of an image at any resolution is recovered by a kernel-based dynamic upsampling module. Comparative experiments have been conducted against state-of-the-art methods on the Market1501, CUHK03, and CAVIAR person re-identification datasets. The experimental results show that the proposed method performs best.

Multi-scale joint learning for person re-identification
XIE Pengyu, XU Xin
2021, 47(3): 613-622. doi: 10.13700/j.bh.1001-5965.2020.0445
Abstract:

Existing person re-identification approaches mainly focus on learning a person's local features to match a specific pedestrian across different cameras. However, when pedestrian data are incomplete, e.g., owing to body-part motion or occlusion and background interference, part of the identifying information is likely to be lost. This paper presents a multi-scale joint learning method to extract fine-grained person features. The method consists of three subnets: a coarse-grained global feature extraction subnet, a fine-grained global feature extraction subnet, and a fine-grained local feature extraction subnet. The coarse-grained global subnet enhances the diversity of the global feature by fusing semantic information at different levels. The fine-grained global branch unites all local features to learn the correlations among a pedestrian's local components while describing the global features at a fine-grained level. The fine-grained local subnet enhances robustness by traversing local features and mining non-salient pedestrian information. Comparative experiments against state-of-the-art methods on the Market1501, DukeMTMC-ReID, and CUHK03 person re-identification datasets show that the proposed method performs best.

Block-diagonal projective representation for face recognition
LIU Baolong, WANG Yong, LI Danping, WANG Lei
2021, 47(3): 623-631. doi: 10.13700/j.bh.1001-5965.2020.0460
Abstract:

Most feature representation algorithms are susceptible to noise when mining the internal structure of high-dimensional data. Meanwhile, their feature learning and classifier design are separated, which limits classification performance in practice. To address this issue, a new feature representation method, Block-Diagonal Projective Representation (BDPR), is proposed. First, a weighted matrix is imposed on the coding coefficients of the samples of each class; by using such local constraints to enhance the similarity between coefficients and reduce the impact of noise on coefficient learning, BDPR maintains the internal data structure well. Second, to closely correlate the data with their coding coefficients and reduce the difficulty of learning the representation coefficients, a block-diagonal constraint is constructed to learn a discriminative projection. In this way, the sample representation coefficients can be obtained in the low-dimensional projected subspace, which contains more global structural information between samples and has lower computational complexity. Finally, representation learning and classifier learning are integrated into the same framework: by increasing the "label distance" between samples of different classes, BDPR updates the discriminative projection and classifier iteratively, so that the most suitable classifier is found for the current optimal feature representation and the algorithm performs classification automatically. Experiments on multiple benchmark face datasets show that BDPR achieves better recognition performance than traditional collaborative representation based classification and several mainstream subspace learning algorithms.

Social image tag refinement and annotation based on noise Cauchy distribution
LIAN Lianrong, XIANG Xinguang
2021, 47(3): 632-640. doi: 10.13700/j.bh.1001-5965.2020.0454
Abstract:

With the rapid development of social networks, images with social tags have increased explosively. However, these tags are often inaccurate or irrelevant, which hinders the relevant multimedia tasks. Although label noise is chaotic and disordered, it still conforms to a certain probability distribution. Most current methods fit the noise with a Gaussian distribution, but the Gaussian distribution is very sensitive to large noise; we instead use the Cauchy distribution, which is robust to various noises. In this paper, we propose a weakly supervised Non-negative Low-rank deep learning model based on the Cauchy Distribution (CDNL), which builds the noise model with a Cauchy distribution to obtain the ideal labels and uses a deep neural network to reveal the intrinsic connection between an image's visual features and its ideal labels. The proposed method can not only correct wrong labels and add missing labels, but also tag new images. Experiments are conducted on two public social network image datasets; compared with some of the latest related work, the results show the effectiveness of the proposed method.
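The robustness argument for the Cauchy noise model is visible by comparing the (constant-free) negative log-likelihoods: the Gaussian cost grows quadratically with the residual, while the Cauchy cost grows only logarithmically, so a single large outlier dominates far less. A small sketch:

```python
import numpy as np

def gaussian_nll(r, sigma=1.0):
    # Gaussian NLL up to constants: quadratic in the residual.
    return 0.5 * (r / sigma) ** 2

def cauchy_nll(r, gamma=1.0):
    # Cauchy NLL up to constants: only logarithmic in the residual.
    return np.log1p((r / gamma) ** 2)

outlier = 10.0  # a large label-noise residual
# gaussian_nll(outlier) = 50, cauchy_nll(outlier) ~ 4.6: the outlier
# contributes an order of magnitude less under the Cauchy model.
```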

Ethnic identification by combining features of skull morphology with neural network
SUN Huijie, ZHAO Junli, ZHENG Xin, REZIWANGULI Xiamixiding, LI Yi, ZHOU Mingquan
2021, 47(3): 641-649. doi: 10.13700/j.bh.1001-5965.2020.0446
Abstract:

China is a multi-ethnic country, and realizing skull ethnic identification by computer is of great significance to skull identification, promoting the development of forensic anthropology and the study of ethnic development. Firstly, following skull morphology studies, 36 geometric features are extracted from the Uighur and Han skull data, and a Back-Propagation Neural Network (BPNN) on these feature vectors is used for ethnic identification. To optimize the network, the Adam algorithm is adopted to avoid falling into local minima, and regularization terms ensure the stability of the algorithm. Two network structures are compared, with 36, 6, 2 and 36, 12, 2 neurons in the input, hidden and output layers respectively, and different initial learning rates are tested. The results show that when the number of hidden-layer neurons is 12 and the learning rate is 0.0001, the classification accuracy is highest, reaching 97.5% in the test stage. To verify the generality of the method, experiments on 116 foreign skull samples yield a test accuracy of 90.96%. Compared with machine learning methods such as Support Vector Machine (SVM), decision tree, KNN, and Fisher discriminant analysis, the proposed method has stronger learning ability and significantly improved classification accuracy.

Video summarization by learning semantic information
HUA Rui, WU Xinxiao, ZHAO Wentian
2021, 47(3): 650-657. doi: 10.13700/j.bh.1001-5965.2020.0447
Abstract:

Video summarization aims to generate a short and compact summary to represent the original video. However, existing methods focus more on the representativeness and diversity of the summary, and less on semantic information. To fully exploit the semantic information of video content, we propose a novel video summarization model that learns a visual-semantic embedding space, so that the video features contain rich semantic information, and that can simultaneously generate video summaries and text summaries describing the original video. The model is divided into three modules: a frame-level score weighting module that combines convolutional and fully connected layers; a visual-semantic embedding module that embeds the video and text in a common space and draws them close to each other, so that the two kinds of features promote each other; and a video caption generation module that generates a video summary with semantic information by minimizing the distance between the generated description of the video summary and the manually annotated text of the original video. During testing, while obtaining the video summary, we obtain a short text summary as a by-product, which helps people understand the video content more intuitively. Experiments on the SumMe and TVSum datasets show that, by fusing semantic information, the proposed model outperforms existing advanced methods, improving the F-score by 0.5% and 1.6%, respectively.
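The visual-semantic embedding idea can be illustrated with a standard hinge-based ranking loss that draws matched video and text embeddings together in the common space; the margin value and exact loss form are assumptions, as the abstract does not specify them:

```python
import numpy as np

def ranking_loss(v, t, margin=0.2):
    # v, t: L2-normalized video and text embeddings, shape (n, d), where row i
    # of v matches row i of t. Mismatched pairs closer than a match incur cost.
    sim = v @ t.T                        # cosine similarity matrix
    pos = np.diag(sim)                   # similarity of matched pairs
    cost = np.maximum(0.0, margin + sim - pos[:, None])
    np.fill_diagonal(cost, 0.0)          # a pair never competes with itself
    return cost.mean()
```

Minimizing such a loss pulls each video's features toward its annotated text, which is how the embedding module can enrich video features with semantics.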

Multimodal deformable registration based on unsupervised learning
MA Tengyu, LI Zi, LIU Risheng, FAN Xin, LUO Zhongxuan
2021, 47(3): 658-664. doi: 10.13700/j.bh.1001-5965.2020.0449
Abstract:

Multimodal deformable registration solves for a dense spatial transformation that aligns images of two different modalities, and is a key problem in many medical image analysis applications. Traditional multimodal registration methods solve an optimization problem for each pair of images and usually achieve excellent registration performance, but the computational cost is high and the running time is long. Deep learning methods greatly reduce the running time by learning a network that performs the registration. These learning-based methods are very effective for single-modality registration; however, the intensity distributions of images from different modalities are unknown and complex, and most existing methods rely heavily on labeled data. Facing these challenges, this paper proposes a deep multimodal registration framework based on unsupervised learning. Specifically, the framework consists of feature learning based on a matching measure and deformation field learning based on maximum a posteriori probability, and realizes unsupervised training by means of a spatial transformation function and a differentiable mutual information loss function. On 3D registration tasks across MRI T1, MRI T2 and CT images, the proposed method is compared with existing advanced multimodal registration methods, and its registration performance is further demonstrated on recent COVID-19 CT data. Extensive results show that the proposed method is competitive with other methods in registration accuracy, while greatly reducing computation time.
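The mutual information loss at the heart of the framework can be approximated, for illustration, by a joint-histogram MI estimate; the paper's differentiable variant would use a smooth estimator (e.g. Parzen windows), and the bin count here is an arbitrary choice:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    # Joint-histogram estimate of mutual information between two images.
    # This fixed-bin version is a sketch, not the differentiable loss itself.
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal over a
    py = pxy.sum(axis=0, keepdims=True)   # marginal over b
    nz = pxy > 0                          # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
img = rng.random((64, 64))
noise = rng.random((64, 64))
# A perfectly aligned pair shares far more information than an unrelated pair,
# so maximizing MI drives the predicted deformation toward alignment.
aligned, unrelated = mutual_information(img, img), mutual_information(img, noise)
```

Because MI depends only on the statistical dependence between intensities, not on their absolute values, it remains meaningful across modalities such as MRI and CT.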

A few shot segmentation method combining global and local similarity
LIU Yuxuan, MENG Fanman, LI Hongliang, YANG Jiaying, WU Qingbo, XU Linfeng
2021, 47(3): 665-674. doi: 10.13700/j.bh.1001-5965.2020.0450
Abstract:

Few-shot segmentation aims at segmenting objects of novel classes given only a few annotated images; its key is to extract the information shared between support and query images. Existing methods use either global features or local features to obtain this shared information and have validated their effectiveness. However, the similarities of these two kinds of features are considered separately in existing methods, and their mutual effect is ignored. This paper proposes a novel few-shot segmentation model that exploits both global and local similarity to achieve more generalizable few-shot segmentation. Specifically, an attention generator is proposed to build the attention map of the query image based on the relationship between support and query images. The proposed attention generator consists of two cascaded modules: a global guider and a local guider. In the global guider, a novel exponential-function-based global similarity metric models the relationship between query and support images with respect to global features, and outputs foreground-enhanced query image features. In the local guider, a local relationship matrix models the local similarity between support and query image features, and yields a class-agnostic attention map. Comprehensive experiments are performed on the Pascal-5i dataset. The proposed method achieves a mean IoU of 59.9% under the 1-shot setting and 61.9% under the 5-shot setting, outperforming many state-of-the-art methods.
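The two guiders can be sketched as follows; the exponential similarity form and the scale `alpha` are illustrative assumptions based on the abstract's description, not the paper's exact formulas:

```python
import numpy as np

def global_similarity(query, proto, alpha=1.0):
    # query: (n_pixels, d) query-pixel features; proto: (d,) support foreground
    # prototype. Exponential global similarity in (0, 1]; alpha is assumed.
    d = np.linalg.norm(query - proto, axis=-1)
    return np.exp(-alpha * d)

def local_attention(query, support):
    # Local relationship matrix: cosine similarity between every query pixel
    # and every support pixel, reduced by max into a class-agnostic map.
    qn = query / np.linalg.norm(query, axis=-1, keepdims=True)
    sn = support / np.linalg.norm(support, axis=-1, keepdims=True)
    return (qn @ sn.T).max(axis=1)
```

The global term re-weights query features toward the support foreground, after which the per-pixel local matches refine the attention map, which is the cascade the abstract describes.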