2024 Vol. 50, No. 2

An overview of visual SLAM methods
WANG Peng, HAO Weilong, NI Cui, ZHANG Guangyuan, GONG Hui
2024, 50(2): 359-367. doi: 10.13700/j.bh.1001-5965.2022.0376
Abstract:

Simultaneous localization and mapping (SLAM) enables a mobile robot carrying specific sensors to build a model of its environment while moving, without any prior knowledge of that environment, and to calculate its own position and pose from this model. It can greatly improve the autonomous navigation ability of mobile robots and their adaptability to different application environments, and it supports the subsequent implementation of dynamic path planning, real-time obstacle avoidance, and multi-robot collaboration. Visual SLAM uses a camera as the external sensor to collect information about the surroundings, create a map, and estimate the robot’s own position in real time. This study introduces the classical visual SLAM methods and the visual SLAM methods combined with deep learning, and it examines feature detection approaches, back-end optimization, loop closure detection, and the application of visual SLAM in dynamic environments. Finally, it summarizes the open problems of visual SLAM and discusses the current state of research and likely future developments.

Zero-shot object detection based on multi-modal joint semantic perception
DUAN Lijuan, YUAN Ying, WANG Wenjian, LIANG Fangfang
2024, 50(2): 368-375. doi: 10.13700/j.bh.1001-5965.2022.0392
Abstract:

Existing zero-shot object detection methods use semantic embeddings as guiding information: they map the visual features of unseen objects and the category semantic embeddings into the same space, and then classify the objects according to how close the visual features and semantic embeddings are in that space. However, because semantic information is acquired from a single source and the visual information lacks a reliable representation, background information is easily confused with unseen-object information, which makes it difficult to align visual and semantic information indiscriminately. To achieve zero-shot object detection effectively, this paper uses a visual context module to capture the context information of visual features and a semantic optimization module to interactively fuse the text context and visual context information. By increasing the diversity of visual expressions, the model is able to perceive the discriminative semantics of the foreground. Experiments on two splits of the MS-COCO dataset show improvements in both accuracy and recall for zero-shot object detection and generalized zero-shot object detection, which demonstrates the effectiveness of the proposed method.
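
The classification step that the abstract attributes to existing methods (assigning a region to the class whose semantic embedding lies closest in the shared space) can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity is assumed as the closeness measure, and the three-dimensional embeddings and class names are hypothetical; it does not reproduce the visual context or semantic optimization modules of the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_region(visual_feature, class_embeddings):
    """Return the class whose semantic embedding is closest to the feature."""
    return max(class_embeddings,
               key=lambda name: cosine(visual_feature, class_embeddings[name]))


# Hypothetical embeddings, assumed to be already mapped into a shared space.
class_embeddings = {
    "background": [1.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0],      # seen class
    "zebra": [0.0, 0.0, 1.0],    # unseen class, known only by its embedding
}

predicted = classify_region([0.1, 0.2, 0.9], class_embeddings)
print(predicted)
```

A region feature that drifts toward the "background" embedding would be misclassified by this rule, which is the confusion the abstract describes.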

Accurate license plate location based on synchronous vertex and body region detection
XU Guangzhu, LIU Gaofei, KUANG Wan, WAN Qiubo, MA Guoliang, LEI Bangjun
2024, 50(2): 376-387. doi: 10.13700/j.bh.1001-5965.2022.0396
Abstract:

The rectangular bounding boxes widely used in mainstream object detection methods cannot meet the license plate location accuracy requirement in many unconstrained environments, where the license plate often does not appear rectangular in the image. To address this issue, a novel algorithm for accurate unconstrained license plate location is designed that simultaneously detects the four local vertex regions and the body of a license plate and fuses the results. First, four local rectangular sub-regions centered on the four vertices of a license plate are annotated as vertex-region objects according to the size of the plate’s contour rectangle and the vertex coordinates. Then, a multi-class image dataset is built in which the contour-rectangle region covering the whole license plate body forms one class and the four kinds of vertex region form the other four classes. To locate these five object classes efficiently, the output structure of the YOLOv5 network is modified with both accuracy and efficiency in mind, and the network is trained on the newly constructed multi-class dataset. Finally, vertex region grouping and single missing vertex prediction are carried out as post-processing, because an image may contain multiple candidate license plates and a few vertex regions may be falsely detected or missed in some cases. By exploiting the relationship among the vertices, the post-processing can effectively recognize missed and false detections in some complex scenarios and greatly improve the performance of the whole system. The proposed algorithm is evaluated on the Chinese city parking dataset (CCPD) and reaches an average positioning accuracy of 99.25% and an average recall rate of 98.70%. The results confirm that the method not only accurately predicts the coordinates of the four vertices but also runs at 121 frame/s on a moderate GPU hardware platform, which gives it great application potential.
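
The abstract does not state how a single missing vertex is predicted from the other three. One simple possibility, given here purely as an assumption, is to treat the plate outline as a parallelogram, so that the missing vertex equals the sum of its two neighbours minus the opposite vertex. The function name and the vertex ordering (top-left, top-right, bottom-right, bottom-left) are hypothetical.

```python
def predict_missing_vertex(vertices):
    """Estimate the one missing corner of a quadrilateral.

    `vertices` lists four corners in order around the outline, with the
    missing corner given as None. Assumes (as a simplification) that the
    outline is close to a parallelogram.
    """
    if len(vertices) != 4:
        raise ValueError("four vertex slots are required")
    missing = [i for i, v in enumerate(vertices) if v is None]
    if len(missing) != 1:
        raise ValueError("exactly one vertex must be missing")
    i = missing[0]
    prev_v = vertices[(i - 1) % 4]      # neighbour on one side
    next_v = vertices[(i + 1) % 4]      # neighbour on the other side
    opposite = vertices[(i + 2) % 4]    # diagonally opposite corner
    return (prev_v[0] + next_v[0] - opposite[0],
            prev_v[1] + next_v[1] - opposite[1])


# Top-left, top-right, bottom-right (missing), bottom-left, in pixels.
corners = [(10, 20), (110, 30), None, (20, 60)]
print(predict_missing_vertex(corners))
```

Under strong perspective distortion the parallelogram assumption no longer holds, so a real system would need a more careful geometric model.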

Image super-resolution reconstruction network based on expectation maximization self-attention residual
HUANG Shuying, HU Hanyang, YANG Yong, WAN Weiguo, WU Zheng
2024, 50(2): 388-397. doi: 10.13700/j.bh.1001-5965.2022.0401
Abstract:

In recent years, most deep learning-based image super-resolution (SR) reconstruction methods have improved reconstruction quality mainly by increasing the depth of the model, which also increases its computational cost. In addition, many networks have adopted attention mechanisms to strengthen feature extraction, but they still struggle to capture the characteristics of different regions adequately. In response to these problems, this paper proposes a novel SR reconstruction network based on an expectation maximization (EM) self-attention residual. The network constructs a feature-enhanced residual block by improving the basic residual block, so that the features extracted by the residual block are better reused. To increase the spatial correlation of the feature information, an EM self-attention residual block is constructed by introducing the EM self-attention mechanism, which enhances the feature extraction capability of each module in the deep network model. The feature extraction structure of the entire model is then built by cascading EM self-attention residual blocks. Finally, a reconstructed high-resolution image is obtained through an up-sampling image reconstruction module. To verify the effectiveness of the proposed method, comparison experiments are carried out against several mainstream methods. The experimental results show that the proposed method achieves better subjective visual quality and better objective evaluation metrics on five widely used SR test datasets.

Discrete sparrow search algorithm incorporating rough data-deduction for solving hybrid flow-shop scheduling problems
ZHOU Ning, ZHANG Songlin, ZHANG Chen
2024, 50(2): 398-408. doi: 10.13700/j.bh.1001-5965.2022.0424
Abstract:

To address the shortcomings of the sparrow search algorithm (SSA), such as its tendency to fall into local optima and its inability to solve discrete optimization problems, an improved discrete sparrow search algorithm (IDSSA) is proposed. Firstly, the position update formula of the original sparrow search algorithm is abstracted, a new discrete heuristic position update strategy is designed according to the different identities of individuals, and encoding and decoding methods are designed for the hybrid flow-shop scheduling problem (HFSP). Secondly, the rough data-deduction theory is introduced, and its feasibility and rationality are established by mathematical proofs, which provides theoretical support for the algorithm and improves its interpretability. Then, the properties of the upper approximation are used to expand the search space, improve population diversity, and avoid premature convergence. Division and rough data-deduction are combined in three strategies that promote information sharing among populations, balance the exploitation and exploration abilities of the populations, and reduce the probability of the algorithm falling into local optima. Finally, the improved discrete sparrow search algorithm is used to solve the hybrid flow-shop scheduling problem. Simulation experiments are carried out on three small-scale practical examples and ten of Liao’s classic test sets to verify the feasibility of the improved algorithm. Comparison with classical algorithms such as the genetic algorithm and the differential evolution algorithm shows the superiority of the proposed algorithm and the effectiveness of the improvement strategies.
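
The abstract mentions encoding and decoding methods for the HFSP without giving details. A common choice, assumed here for illustration and not confirmed to be the paper's scheme, is to encode a solution as a job permutation and decode it by list scheduling: at the first stage jobs follow the permutation, at later stages they follow their completion order from the previous stage, and each job goes to the earliest-free machine. The function name and data layout are hypothetical.

```python
def hfsp_makespan(permutation, proc_times, machines_per_stage):
    """Decode a job permutation into a schedule and return its makespan.

    proc_times[job][stage] is the processing time of `job` at `stage`;
    machines_per_stage[stage] is the number of identical parallel machines.
    """
    ready = {job: 0 for job in permutation}   # when each job is next available
    order = list(permutation)                 # dispatch order at the current stage
    for stage, n_machines in enumerate(machines_per_stage):
        free_at = [0] * n_machines            # when each machine becomes idle
        for job in order:
            m = min(range(n_machines), key=free_at.__getitem__)
            start = max(free_at[m], ready[job])
            ready[job] = start + proc_times[job][stage]
            free_at[m] = ready[job]
        # Later stages serve jobs in order of completion at this stage.
        order.sort(key=ready.__getitem__)
    return max(ready.values())


# Three jobs, two stages; stage 0 has two machines, stage 1 has one.
proc_times = [[3, 2], [2, 4], [4, 1]]
print(hfsp_makespan([0, 1, 2], proc_times, [2, 1]))
```

A search algorithm such as IDSSA would then use a function like this as its fitness evaluation, comparing candidate permutations by makespan.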

Aspect sentiment triple extraction for grammar-weighted graph text
HAN Hu, MENG Tiantian
2024, 50(2): 409-418. doi: 10.13700/j.bh.1001-5965.2022.0443
Abstract:

Aspect sentiment triple extraction includes three tasks: aspect term extraction, opinion term extraction, and aspect sentiment classification. However, methods that solve this task in a pipeline cannot use the interaction information between elements, and they also cause error propagation and redundant training. To solve these problems, an aspect sentiment triple extraction method based on gated attention and a weighted graph of the text is proposed, which makes full use of the semantic and grammatical relationships between triple elements to enhance element interaction. Firstly, the model uses a bidirectional long short-term memory network to learn the sequence feature representation of sentences. Secondly, a gated attention unit is used to learn linear connections between words. Thirdly, a grammatical distance-weighted graph convolutional network is employed to enhance the interactions between triple elements. Finally, a grid tagging inference strategy is applied to predict triples. Experimental results on four public datasets show that the proposed method effectively enhances the interaction between triple elements and improves the accuracy of triple extraction. The F1 values of the proposed method on the four datasets are 57.94%, 70.54%, 61.95%, and 67.66%, respectively, all higher than those of the baseline model.

Real-time robust visual tracking based on spatial attention mechanism
MA Sugang, ZHANG Zixian, PU Lei, HOU Zhiqiang
2024, 50(2): 419-432. doi: 10.13700/j.bh.1001-5965.2022.0329
Abstract:

A real-time object tracking method with a spatial attention mechanism is proposed to enhance the tracking capability of the fully convolutional Siamese network (SiamFC) tracker in complex scenes and to alleviate target drift during tracking. An improved visual geometry group (VGG) network is used as the backbone to enhance the tracker’s ability to model deep features of the target. The self-attention mechanism is optimized, and a lightweight single convolution attention module (SCAM) is proposed. The spatial attention is decomposed into two parallel one-dimensional feature coding processes to reduce its computational complexity. The initial target template is retained as the first template, and the second template is selected dynamically by analyzing the variation of the connected domain in the tracking response map. The target is located after fusing the two templates. The experimental results show that, compared with SiamFC, the success rate of the proposed algorithm on the OTB100, LaSOT, and UAV123 datasets increases by 0.082, 0.045, and 0.045, respectively, and the tracking accuracy by 0.118, 0.051, and 0.062. On the VOT2018 dataset, the proposed algorithm improves the tracking accuracy, robustness, and expected average overlap by 0.029, 0.276, and 0.134, respectively, compared with SiamFC. The tracking speed approaches 70 frames per second, which satisfies real-time tracking requirements.

Cross-modality nearest neighbor loss for visible-infrared person re-identification
ZHAO Sanyuan, A Qi, GAO Yu
2024, 50(2): 433-441. doi: 10.13700/j.bh.1001-5965.2022.0422
Abstract:

The goal of the visible-infrared person re-identification task is to take an image of a specific person in one modality and find the corresponding images of the same person in an image set captured by other cameras in a different modality. Because the imaging methods differ, there are obvious modal differences between images of different modalities. Therefore, from the perspective of metric learning, the loss function is improved to obtain more discriminative information. The cohesiveness of image features is analyzed theoretically, and a re-identification method based on cohesiveness analysis and a cross-modality nearest neighbor loss function is proposed to strengthen the cohesiveness of samples from different modalities. The similarity measurement problem of cross-modality hard samples is transformed into the similarity measurement of cross-modality nearest neighbor sample pairs and same-modality sample pairs, which makes the optimization of the network’s modal cohesion more efficient and stable. The proposed method is verified experimentally on baseline networks that use global feature representation and partial feature representation. Compared with the baseline method, the proposed method improves the average accuracy of visible-infrared person re-identification by up to 8.44%, which demonstrates its generality across different network architectures. Moreover, reliable visible-infrared person re-identification results are achieved at the cost of lower model complexity and less computation.

A Transformer based deep conditional video compression
LU Guo, ZHONG Tianxiong, GENG Jing
2024, 50(2): 442-448. doi: 10.13700/j.bh.1001-5965.2022.0374
Abstract:

Convolutional neural networks (CNNs) are the foundation of most recent learning-based video compression algorithms, which also use residual coding and motion compensation architectures. It is difficult to attain the best compression performance this way, because typical CNNs can exploit only local correlations and because the prediction residual is sparse. To solve these problems, this paper proposes a Transformer-based deep conditional video compression algorithm that achieves better compression performance. The proposed algorithm uses deformable convolution to obtain the predicted frame feature from the motion information between adjacent frames. The predicted frame feature is used as conditional information to encode the original input frame feature, which avoids encoding sparse residual signals directly. The proposed algorithm further exploits the non-local correlation between features with a Transformer-based autoencoder architecture that implements motion coding and conditional coding, which further improves compression performance. Experiments show that the proposed algorithm surpasses current mainstream learning-based video compression algorithms on both the HEVC and UVG datasets.

Solidly mounted resonator based on optimized Bragg structure
ZHANG Shifeng, XUAN Weipeng, SHI Linhao, DONG Shurong, PU Shiliang
2024, 50(2): 449-455. doi: 10.13700/j.bh.1001-5965.2022.0436
Abstract:

The effective coupling coefficient and the quality factor of bulk acoustic wave resonators determine the overall performance of bulk acoustic wave filters. The effective coupling coefficient depends on the layer stack structure, especially the piezoelectric material, while the quality factor depends strongly on the loss mechanisms, including electrical and acoustic loss. For the solidly mounted resonator (SMR), the acoustic loss is mainly acoustic leakage to the substrate. In this work, to enhance the quality factor of the SMR, the structure of the Bragg layer is optimized to reduce the acoustic energy leakage to the substrate. By optimizing the Bragg stack, the longitudinal and shear waves can be confined in the piezoelectric stack simultaneously, and the quality factor at the anti-resonance frequency is greatly improved. Moreover, to suppress the spurious modes of the resonator, the thickness of the top layer of the Bragg stack is optimized to change the device dispersion characteristic from type II to type I. Experimental results show that the optimized Bragg structure greatly improves the performance of the SMR.

Image captioning model based on divergence-based and spatial consistency constraints
JIANG Wenhui, CHEN Zhiliang, CHENG Yibo, FANG Yuming, ZUO Yifan
2024, 50(2): 456-465. doi: 10.13700/j.bh.1001-5965.2022.0400
Abstract:

The multi-head attention mechanism has been widely adopted in image captioning. It is appealing for its ability to jointly attend to information from different representation subspaces. However, as each head captures distinct properties of the input individually, the diversity between the heads’ representations is not guaranteed. Meanwhile, most existing attention models suffer from “attention defocus”, i.e., they fail to concentrate on the correct image regions when generating the target words. Consequently, the generated sentences are not accurate enough. To address these problems, we propose a novel training objective that serves as an auxiliary regularization function to improve the diversity and accuracy of the multi-head attention mechanism. First, we present a divergence-based regularization that encourages each head to attend to different regions of the target. The partial representations are aggregated to produce distinct representations of the target. Second, we introduce a spatial consistency regularization that models the spatial relationship among the attended regions. By encouraging the attended regions to be focused, it improves image captioning. The proposed method applies the divergence-based regularization and the spatial consistency regularization jointly. We compare the performance of the proposed method with state-of-the-art methods on the challenging MS COCO dataset. The experimental results demonstrate the superior performance of the proposed method.

A person re-identification method for fusing convolutional attention and Transformer architecture
WANG Jing, LI Peitong, ZHAO Rongfeng, ZHANG Yun, MA Zhenling
2024, 50(2): 466-476. doi: 10.13700/j.bh.1001-5965.2022.0456
Abstract:

Person re-identification is one of the important techniques in intelligent security systems. To build a person re-identification model suitable for various complex scenarios, this article proposes a method that fuses convolutional attention and the Transformer architecture (FCAT), building on existing convolutional neural networks and Transformer models, to enhance the Transformer’s attention to local detail information. The method embeds convolutional spatial attention and channel attention to strengthen the attention paid to important regions and important channel features in the image, which indirectly improves the Transformer’s ability to extract local detail features. Comparative and ablation experiments on three publicly available person re-identification datasets demonstrate that the proposed method achieves comparable results on non-occluded datasets and significantly improves performance on occluded datasets. Additionally, the proposed model is more lightweight and improves inference speed without adding computational load or model parameters.

Periodic pattern-enhanced multi-view short-term load prediction
SU Wei, XIAO Xiaolong, SHI Mingming, FANG Xin, SI Xinyao
2024, 50(2): 477-486. doi: 10.13700/j.bh.1001-5965.2022.0399
Abstract:

Short-term load prediction is essential to ensure the proper operation of the power system. Existing approaches have two limitations: they do not mine the dependencies between features, and they ignore the periodic pattern of load changes. To overcome these limitations, we propose periodic Pattern-enhanced MultI-view Short-term power load prediction nEtworks, dubbed EPISODE. The framework includes two core components: a multi-view feature learning component and a periodic pattern-enhanced load prediction component. The former extracts static features and time series features to obtain an enhanced feature representation; the latter performs general time series mining and periodic time series mining to obtain a comprehensive historical feature representation. The two components are combined to produce the short-term load forecast. Extensive experiments on real-world datasets demonstrate the superiority of the proposed method.

Optimization of office process task allocation based on deep reinforcement learning
LIAO Chenyang, YU Jinsong, LE Xiangli
2024, 50(2): 487-498. doi: 10.13700/j.bh.1001-5965.2022.0290
Abstract:

An office platform often has to handle a large number of parallel, heterogeneous process tasks. This not only tests the ability of the task executors but also places demands on the performance of the scheduling system. This paper proposes a multi-agent game model based on Markov game theory, which adopts a reinforcement learning (RL) approach along with quantitative analysis of the degree of cooperation and relaxation. The model realizes an optimal scheduling system with the overall process degree and the maximum completion time as the optimization objectives, and it enhances the overall execution efficiency. Finally, to confirm the efficacy of this approach, it is compared with a meta-heuristic algorithm based on ant colony optimization and a deep reinforcement learning (DRL) algorithm based on D3QN, using processes from a real business system as the experimental data and the same optimization objectives.

Improved YOLOv5s low-light underwater biological target detection algorithm
CHEN Yuliang, DONG Shaojiang, SUN Shizheng, YAN Kaibo
2024, 50(2): 499-507. doi: 10.13700/j.bh.1001-5965.2022.0322
Abstract:

In underwater optical image target detection, the significant attenuation of light in water, the complex image environment, and the movement of the shooting equipment lead to low biological recognition accuracy. To address this issue, a real-time detection method for low-light underwater biological targets based on an improved YOLOv5s, known as YOLOv5s-underwater, is proposed. Firstly, to deal with the attenuation of light underwater, the contrast-limited adaptive histogram equalization (CLAHE) algorithm is introduced to preprocess the input image, which mitigates color distortion and image roughness. Secondly, the spatial pyramid pooling fast (SPPF) module is introduced to address the low discrimination and serious feature loss of underwater objects in the complex low-light underwater image environment. Thirdly, a Swin-Transformer module based on the shifted window is introduced to improve the generalization ability of the model. Finally, the network model structure is modified to improve the detection of small underwater targets. Simulations and experiments show that the proposed method improves detection accuracy by 30.7% compared with YOLOv5s, which supports its efficacy.

Multi group sparrow search algorithm based on K-means clustering
YAN Shaoqiang, LIU Weidong, YANG Ping, WU Fengxuan, YAN Zhe
2024, 50(2): 508-518. doi: 10.13700/j.bh.1001-5965.2022.0328
Abstract:

The single-population search of the sparrow search algorithm (SSA) converges faster than necessary, so it tends to overlook high-quality solutions and fall into local optima. To address this flaw, a multi-group sparrow search algorithm based on K-means clustering (KSSA) is proposed. Firstly, a multi-population mechanism is introduced into SSA to weaken the convergence ability of a single population and reduce the probability of falling into local optima. Secondly, to improve the effectiveness of the early search, the population is divided into sub-populations, the differences between the sub-populations are increased, and the members of each sub-population are made to concentrate on searching within a certain area. Then, a weighted center-of-gravity communication strategy is used to improve the quality of communication between populations, reduce the interference to each sub-population itself, and reduce the risk that all sub-populations fall into local optima because one of them has done so. Finally, dynamic reverse learning is introduced for the vigilant sparrows to enhance their feedback behavior and to remedy the slow convergence and insufficient convergence accuracy caused by the increased number of sub-populations. Simulation experiments on test functions show that KSSA has better optimization performance than SSA and other algorithms.
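
The sub-population division can be illustrated with a plain K-means pass over the sparrows' positions. This sketch assumes that K-means is what partitions the population, as the title suggests; the deterministic initialization (the first k individuals as centers), the fixed iteration count, and the sample positions are simplifications introduced here, not details from the paper.

```python
def kmeans(points, k, iterations=20):
    """Partition `points` (tuples of coordinates) into k groups."""
    centers = [points[i] for i in range(k)]   # deterministic init for the sketch
    groups = []
    for _ in range(iterations):
        # Assignment step: each individual joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            groups[idx].append(p)
        # Update step: move each center to the mean of its group.
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups


# Hypothetical 2-D positions of six sparrows.
population = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
sub_populations = kmeans(population, k=2)
print(sub_populations)
```

Each resulting group would then run its own search, so that individuals in one sub-population concentrate on one region of the search space.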

Multi-agent coverage control based on communication connectivity maintenance constraints
ZHANG Yunlin, MA Zhuangzhuang, SHI Lei, SHAO Jinliang
2024, 50(2): 519-528. doi: 10.13700/j.bh.1001-5965.2022.0340
Abstract:

Coverage control disperses the agents as widely as possible according to the environmental information, in order to achieve better spatial coverage and optimal monitoring of the task area. In this process, the cooperation between agents depends on a connected communication network. Because agents have a finite communication range in complex electromagnetic environments, the dispersal behavior in coverage control may interrupt the communication network and cause the task to fail. Therefore, this study takes the connectivity of the communication network as a constraint and proposes a bounded distributed control law based on gradient descent, which ensures that the coverage cost function decreases while the network connectivity does not fall below a predetermined threshold. A segmented control strategy based on the identification of critical agents is also proposed to lessen the impact of communication link maintenance on the coverage effect. By dynamically allocating the control gains of coverage and communication connectivity maintenance, the control oscillation and redundancy caused by the opposing movement trends of the two are reduced. Finally, to handle deadlocks in which the agents are trapped in a local optimum, this paper proposes a deadlock elimination control, which removes the deadlock in time and improves coverage performance. A coverage simulation experiment on a signal field generated by high-frequency structure simulation (HFSS) software shows the effectiveness of the proposed control laws.

Multi-hop knowledge graph question answering based on deformed graph matching
LI Xiangyue, FANG Quan, HU Jun, QIAN Shengsheng, XU Changsheng
2024, 50(2): 529-534. doi: 10.13700/j.bh.1001-5965.2022.0375
Abstract:

Knowledge Graph Question Answering (KGQA) is a process in which a given natural language question is semantically understood and parsed, and the knowledge graph is then queried and reasoned over to obtain the answer. However, knowledge graphs with missing links pose many challenges for multi-hop question answering. When using knowledge graph embeddings, many methods ignore the path information that is important for evaluating the correlation between paths and multi-relation questions, and text corpora also limit the scalability of text-enhanced models. To overcome the drawbacks of these existing approaches, the Multi-hop Knowledge Graph Question Answering Based on Deformed Graph Matching (DGM-KGQA) method is proposed. This method builds semantic subgraphs from both the question and the topic entities, and then matches them against the local structure of the knowledge graph to determine the correct answer. The experimental results on the benchmark dataset MetaQA verify the effectiveness of DGM-KGQA. On the complete knowledge graph, the accuracy of the retrieved answers is 4.2% higher than that of PullNet and 0.8% higher than that of EmbedKGQA. On the half knowledge graph, the accuracy is 11.1% higher than that of PullNet and 0.5% higher than that of EmbedKGQA. Experiments show that the proposed deformed graph matching model can effectively enhance the relevance of knowledge graphs and the answer accuracy of multi-hop question answering.
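
To show why missing links make multi-hop question answering hard, the sketch below answers a two-hop question by plain path following over a toy graph. It is a baseline illustration only and does not implement deformed graph matching; the adjacency-list layout, the relation names, and the function name are assumptions made for this example.

```python
def multi_hop(kg, topic_entity, relations):
    """Follow a sequence of relations from the topic entity.

    `kg` maps a head entity to a list of (relation, tail) pairs.
    Returns the set of entities reached after the last hop.
    """
    frontier = {topic_entity}
    for rel in relations:
        frontier = {
            tail
            for head in frontier
            for (r, tail) in kg.get(head, [])
            if r == rel
        }
    return frontier


# Toy movie graph in the spirit of MetaQA-style questions.
kg = {
    "Inception": [("directed_by", "Christopher Nolan")],
    "Christopher Nolan": [("directed", "Inception"), ("directed", "Interstellar")],
}
# "Which films were directed by the director of Inception?"
answers = multi_hop(kg, "Inception", ["directed_by", "directed"])
print(answers)

# With the first link missing, exact path following finds nothing.
incomplete_kg = {"Christopher Nolan": kg["Christopher Nolan"]}
print(multi_hop(incomplete_kg, "Inception", ["directed_by", "directed"]))
```

A single missing edge empties the result here, which is the failure mode that embedding-based and graph-matching approaches aim to tolerate.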

A smooth path planning method based on Dijkstra algorithm
GONG Hui, NI Cui, WANG Peng, CHENG Nuo
2024, 50(2): 535-541. doi: 10.13700/j.bh.1001-5965.2022.0377
Abstract:

When a mobile robot moves along a path planned by the Dijkstra algorithm in a complex environment, the planned path has many turning points and some of the turning angles are small, so the robot has to turn frequently or even pause to complete a turn, which seriously affects its working efficiency. In this study, the mobile robot’s actual scene data is combined with a geometric topology method to propose a smooth path planning method based on the Dijkstra algorithm. A continuous map is obtained according to the application scenario, a discrete lattice of points is generated randomly after the continuous map is discretized, and the Euclidean distance between the points is calculated. Each discrete point is connected to multiple nearby points whose connecting segments do not cross an obstacle, which generates the discrete graph. The Dijkstra algorithm is used to search the discrete graph for the optimal path, which serves as the guidance path. As the robot proceeds along the guidance path, the geometric topology is used together with the actual scene information to determine the optimal action and the running path that the robot should follow at each moment. Experimental results show that the proposed method effectively reduces the cumulative turning angle, increases the minimum average turning angle, and improves the smoothness of the planned path, thus shortening the movement time of the mobile robot and improving its working efficiency.
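
The graph construction and guidance-path search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: obstacles are assumed to be axis-aligned rectangles, collisions are checked approximately by sampling along each segment, and the demo uses fixed points for reproducibility where the paper samples them randomly. The geometric-topology smoothing step is not covered, and all function names are hypothetical.

```python
import heapq
import math


def segment_hits_rect(p, q, rect, steps=50):
    """Approximate test: does segment p-q pass through rect (xmin, ymin, xmax, ymax)?"""
    xmin, ymin, xmax, ymax = rect
    for i in range(steps + 1):
        t = i / steps
        x = p[0] + t * (q[0] - p[0])
        y = p[1] + t * (q[1] - p[1])
        if xmin <= x <= xmax and ymin <= y <= ymax:
            return True
    return False


def build_graph(points, obstacles, k=2):
    """Connect each point to its k nearest neighbours via collision-free edges."""
    graph = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        nearest = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        for j in nearest:
            if not any(segment_hits_rect(p, points[j], r) for r in obstacles):
                d = math.dist(p, points[j])
                graph[i].append((j, d))
                graph[j].append((i, d))
    return graph


def dijkstra(graph, start, goal):
    """Return (path as list of node indices, length), or (None, inf) if unreachable."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None, math.inf
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]


# Eight points around a square obstacle in the middle of a 4 x 4 map.
points = [(0, 0), (2, 0), (4, 0), (0, 2), (4, 2), (0, 4), (2, 4), (4, 4)]
obstacles = [(1.5, 1.5, 2.5, 2.5)]
graph = build_graph(points, obstacles, k=2)
path, length = dijkstra(graph, start=0, goal=7)
print(path, length)
```

The resulting guidance path runs around the obstacle with right-angle turns, which is exactly the kind of path the smoothing step in the paper is meant to improve.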

Language-guided target segmentation method based on multi-granularity feature fusion
TAN Quange, WANG Rong, WU Ao
2024, 50(2): 542-550. doi: 10.13700/j.bh.1001-5965.2022.0384
Abstract:

The objective of language-guided target segmentation is to match the targets described in the text with the entities they refer to, thereby achieving an understanding of the relationships between text and entities, as well as the localization of the referred targets. This task has significant application value in scenarios such as information extraction, text classification, and machine translation. This paper proposes a language-guided target segmentation method with multi-granularity feature fusion, based on the RefVOS model, which can accurately locate and segment specific targets. The Swin Transformer and the BERT network are used to extract multi-granularity visual features and text features, respectively, which yields features that express both the whole and its parts well. Under language guidance, the text features are fused with visual features of different granularities to strengthen the expression of the referred target. Finally, to achieve more precise segmentation results, the multi-granularity fused features are refined with a convolutional long short-term memory network, which lets information flow across features of different granularities. The model was trained and tested on the UNC and UNC+ datasets. Experimental results show that, compared with RefVOS, the proposed method improves IoU on the val and testB splits of UNC by 0.92% and 4.1%, respectively, and on the val, testA, and testB splits of UNC+ by 1.83%, 0.63%, and 1.75%, respectively. On the G-Ref and ReferIt datasets, the proposed method reaches IoU values of 40.16% and 64.37%, respectively, which is at the state-of-the-art level. These results demonstrate that the proposed method is effective and advanced.
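
The IoU figures quoted above compare a predicted segmentation mask with the ground-truth mask. A minimal per-image computation is sketched below; whether the paper reports overall IoU or mean IoU across images is not stated in the abstract, so the sketch makes no claim about the aggregation, and the function name and sample masks are hypothetical.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match by convention here.
    return inter / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))
```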

Image-text matching algorithm based on multi-level semantic alignment
LI Yiru, YAO Tao, ZHANG Linliang, SUN Yujuan, FU Haiyan
2024, 50(2): 551-558. doi: 10.13700/j.bh.1001-5965.2022.0385
Abstract:

Region-based image features tend to focus on local regions of the image, and the environmental information is often ignored. How to effectively combine local features and global features has not been fully studied. An image-text matching algorithm based on multi-level semantic alignment is proposed to solve this problem and to strengthen the association between global concepts and local concepts, providing more accurate visual features. To obtain different levels of visual relationships and provide more information for the joint visual features, this paper first extracts the local image features to obtain the fine-grained information in the image. It then extracts the global image features to introduce the environmental information into the network learning. To provide a more precise similarity representation, the image features are then fused, and finally the combined visual features and text features are aligned. Extensive experiments and analysis on two public datasets demonstrate the effectiveness of the proposed algorithm.

Graph pooling method based on multilevel union
DONG Xiaolong, HUANG Jun, QIN Feng, HONG Xudong
2024, 50(2): 559-568. doi: 10.13700/j.bh.1001-5965.2022.0386
Abstract:

Graph pooling methods have been widely used in bioinformatics, chemistry, social networks, recommendation systems and other fields. At present, graph pooling methods do not solve the problems of node selection and of node information loss caused by pooling. A new graph pooling method is proposed, namely the graph pooling method based on multilevel union (MUPool). The proposed method uses a multi-view module to extract distinct features from several convolution modules, obtaining the properties of nodes from various angles. At the same time, a multilevel union module is proposed to concatenate the outputs of different pooling layers, with each layer fusing information from all previous layers. Using the late fusion module, the proposed method builds a classifier on each pooling layer and then fuses the predicted results to obtain the final classification results. The proposed method is tested on multiple datasets, and the accuracy is improved by 1.62% on average. The proposed method can also be combined with existing hierarchical pooling methods, and the accuracy of the combined method is improved by 2.45% on average.

Image-text aspect emotion recognition based on joint aspect attention interaction
ZHAO Yicheng, WANG Suge, LIAO Jian, HE Donghuan
2024, 50(2): 569-578. doi: 10.13700/j.bh.1001-5965.2022.0387
Abstract:

With the rapid development of social media, the sentiment conveyed by users cannot be reliably identified by aspect-category sentiment analysis of the text alone. However, the existing aspect-category sentiment analysis methods for image and text data only consider the interaction between image and text modalities, ignoring the inconsistency and correlation of image and text data. Therefore, this paper proposes a joint aspect attention interaction network (JAAIN) model for aspect-category sentiment identification. The proposed method improves the aspect-specific representation of the image and text modalities through multi-level fusion of aspect, image, and text information, removing the text and images that are unrelated to the given aspect. The text sentiment representation, image sentiment representation and aspect category sentiment representation are concatenated, fused and fully connected to realize aspect-level sentiment discrimination for images and text. The experimental results show that the proposed model can improve the performance of sentiment identification in images and text on the Multi-ZOL dataset.

Multi-modal mask Transformer network for social event classification
CHEN Hong, QIAN Shengsheng, LI Zhangming, FANG Quan, XU Changsheng
2024, 50(2): 579-587. doi: 10.13700/j.bh.1001-5965.2022.0388
Abstract:

Utilizing both the properties of the text and image modalities to the fullest extent possible is essential for multi-modal social event classification. However, most of the existing methods have the following limitations: they simply concatenate the image features and textual features of events, and the existence of irrelevant contextual information between different modalities leads to mutual interference. Therefore, it is not enough to consider only the relationship between the modalities of multimodal data; the irrelevant contextual information between modalities (such as regions or words) must also be considered. To overcome these limitations, this paper proposes a novel social event classification method based on a multimodal mask Transformer network (MMTN) model. Specifically, the authors learn better representations of text and images through an image-text encoding network. To combine multimodal data, the resulting image and word representations are input into a multimodal mask Transformer network. By calculating the similarity between the multimodal information, the relationship between the modalities is modeled, and the irrelevant contexts between the modalities are masked. Extensive experiments on two benchmark datasets demonstrate that the proposed model achieves state-of-the-art performance.

Region-aware real-time portrait super resolution reconstruction network
GONG Kecun, ZHOU Menglin, TANG Dongming
2024, 50(2): 588-595. doi: 10.13700/j.bh.1001-5965.2022.0394
Abstract:

Conventional techniques typically process the entire image uniformly, which leads to low efficiency in the field of portrait super-resolution reconstruction. To reduce the inference latency of the model, this research proposes a real-time super-resolution reconstruction model, RASR. The model first uses a gating unit to process the low-resolution images and identify the edge of the portrait. Then, a partition reconstruction strategy is adopted, and sub-models of different sizes are used to reconstruct the areas that contain the portrait edge and those that do not, respectively. The experimental results show that the RASR model reconstructs high-resolution portrait images more efficiently, reducing the inference latency by 88% in a 4-fold sampling reconstruction scene compared with the existing methods.

Multimodal bidirectional information enhancement network for RGBT tracking
ZHAO Wei, LIU Lei, WANG Kunpeng, TU Zhengzheng, LUO Bin
2024, 50(2): 596-605. doi: 10.13700/j.bh.1001-5965.2022.0395
Abstract:

The goal of RGB-thermal infrared (RGBT) visual object tracking, which has drawn increasing interest in recent years, is to take advantage of the complementary strengths of RGB and thermal infrared image data to accomplish reliable visual tracking. To obtain a robust appearance representation of an object, existing mainstream methods introduce modal weights to fuse the information of the two modalities. However, simply assigning weights to the individual modalities cannot fully explore the complementary benefits of the RGB and thermal infrared modalities. To solve this problem, a novel multimodal bidirectional information enhancement network for RGBT tracking (MBIENet) is proposed. Specifically, a feature aggregation module is designed to aggregate modality-shared and modality-specific features for modeling the appearance information of an object. Furthermore, a novel multimodal bidirectional modulation fusion module is proposed, which can effectively fuse the complementary information of the two modalities and alleviate the impact of redundant and useless features on the tracker. A lightweight channel-spatial attention module is then proposed to adaptively adjust the contributions of the modalities in different situations. Experimental results on the GTOT, RGBT234, and LasHeR datasets show that the accuracy rate and success rate of the proposed method are better than those of the existing mainstream trackers.

Multi-source remote sensing image classification based on Transformer and dynamic 3D-convolution
GAO Feng, MENG Desen, XIE Zhengyuan, QI Lin, DONG Junyu
2024, 50(2): 606-614. doi: 10.13700/j.bh.1001-5965.2022.0397
Abstract:

Benefiting from the complementarity and synergy of multi-source remote sensing data, deep learning-based methods have made significant progress in remote sensing image classification in recent years. Building a powerful multi-source data joint classification model is typically difficult for the following reasons: the feature fusion is hampered by the heterogeneous gap between HSI and LiDAR data, and the representation power, efficiency, and interpretability are constrained by the current static inference paradigm. To solve both problems, we propose a Transformer-based fusion network. Specifically, to bridge the heterogeneous gap between HSI and LiDAR data, we design a feature fusion module based on Transformer to exploit the feature interactions between multi-source data. After that, we create a multi-scale dynamic 3D-convolution module to collect the information from different scales and use it to modulate the 3D-convolution kernel. The method was validated on the Houston and Trento datasets. The overall accuracy of the proposed method reached 94.60% and 98.21% respectively. Compared with mainstream methods such as MGA-MFN, the overall accuracy on the two datasets was improved by at least 0.97% and 0.25% respectively. The experimental results demonstrate that our method can effectively improve the accuracy of multi-source remote sensing image classification.

Cross-modal hashing network based on self-attention similarity transfer
LIANG Huan, WANG Hairong, WANG Dong
2024, 50(2): 615-622. doi: 10.13700/j.bh.1001-5965.2022.0402
Abstract:

To further improve the performance of cross-modal retrieval, a cross-modal hashing network model is proposed based on self-attention similarity transfer. A channel-spatial hybrid self-attention mechanism is designed to strengthen the key information of the image, and the common attention method is used to enhance the interaction of modal information, thus improving the quality of feature learning. To reconstruct the similarity relationship in the hash space, the transfer learning method is used to guide the generation of hash codes by using the real-valued space similarity. Comparative experiments are carried out on three commonly used datasets, MIRFLICKR-25K, IAPRTC-12 and MSCOCO, against strong methods such as deep cross-modal hashing (DCMH), pairwise relationship guided deep hashing (PRDH) and cross-modal hamming hashing (CMHH). The results show that when the length of the hash code reaches 64 bits, the mean average precision (MAP) of the text retrieval task with image queries on the three datasets is 72.3%, and the MAP of the image retrieval task with text queries is 70%. These values are higher than those of the other methods.
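As background for the retrieval metric above, the following sketch ranks database hash codes by Hamming distance to a query code and computes the average precision of that ranking; MAP is the mean of this value over all queries. The abstract does not describe the evaluation protocol (ranking cutoff, definition of relevance), so the full-ranking evaluation and the `relevant` flags used here are assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def average_precision(query_code, db_codes, relevant):
    """AP of the Hamming ranking; relevant[i] is True if database item i matches the query."""
    # Stable sort: ties in distance keep their original database order.
    order = sorted(range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i]))
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_codes, db_codes, relevance_lists):
    """Mean of the per-query average precision values."""
    aps = [average_precision(q, db_codes, r) for q, r in zip(query_codes, relevance_lists)]
    return sum(aps) / len(aps) if aps else 0.0
```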

Multi-input Fourier neural network and its sparrow search optimization
LI Liangliang, ZHANG Zhuhong, ZHANG Yongdan
2024, 50(2): 623-633. doi: 10.13700/j.bh.1001-5965.2022.0404
Abstract:

In engineering applications, the back-propagation (BP) neural network often encounters many limitations due to its slow convergence and high noise sensitivity, and the reported Fourier neural networks cannot extract the features of multi-attribute input data. Therefore, this work proposes a gradient descent-based multi-input Fourier neural network by integrating the multi-layer perceptron with an overlapping Fourier neural network. Then, to address the difficulty of deciding the globally optimal parameter settings, the Cat chaotic map and the mechanisms of population-size adjustment and parameter adaptiveness are designed to improve the sparrow search algorithm’s ability to balance global exploration and local exploitation. An improved sparrow search algorithm is thus developed for optimizing the parameter settings and solving high dimensional function optimization problems. The theoretical analysis shows that the improved algorithm’s computational complexity is decided by its population size and the dimension of the optimization problem. Comparative numerical experiments have validated that the acquired Fourier neural network can effectively extract the features of multi-attribute data with strong generalization ability, and that the improved algorithm has significant advantages in coping with high dimensional function optimization problems.
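One common way a chaotic map is used in swarm optimizers is to seed the initial population. The sketch below assumes the "Cat chaotic map" is the standard two-dimensional Arnold cat map, x' = (x + y) mod 1, y' = (x + 2y) mod 1, and that it is used for initialization; the abstract confirms neither detail, and the function names and seed values are hypothetical.

```python
def cat_map_sequence(n, x0=0.1234, y0=0.5678):
    """Generate n values in [0, 1) from the Arnold cat map (assumed form of the Cat map)."""
    xs = []
    x, y = x0, y0
    for _ in range(n):
        # Both updates use the previous (x, y) pair.
        x, y = (x + y) % 1.0, (x + 2.0 * y) % 1.0
        xs.append(x)
    return xs


def init_population(pop_size, dim, lower, upper, x0=0.1234, y0=0.5678):
    """Scale the chaotic sequence into [lower, upper) to form pop_size candidate vectors."""
    seq = cat_map_sequence(pop_size * dim, x0, y0)
    return [[lower + (upper - lower) * seq[i * dim + j] for j in range(dim)]
            for i in range(pop_size)]
```

For example, `init_population(20, 5, -10.0, 10.0)` would give twenty 5-dimensional candidates inside the search bounds.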

Lossy point cloud geometry compression based on Transformer
LIU Gexin, ZHANG Junteng, DING Dandan
2024, 50(2): 634-642. doi: 10.13700/j.bh.1001-5965.2022.0412
Abstract:

Point clouds are widely used for 3D object representation. However, real-world captured point clouds often contain huge amounts of data, which is unfavorable for transmission and storage. To address the redundancy problem of point cloud data, an end-to-end Transformer-based multiscale point cloud geometry compression method is proposed by introducing the Transformer module based on the attention mechanism. The point cloud is first voxelized. At the encoder, features are extracted using sparse convolution, multi-scale gradual downsampling is performed, and the Transformer module is combined to enhance the point-space feature perception and extraction. At the decoder, the corresponding multi-scale up-sampling is performed for reconstruction, the Transformer module is also used to enhance and recover the useful features, and the point cloud is progressively refined and reconstructed. Compared with two standard point cloud coding methods, the proposed method obtains 80% and 75% BD-Rate gain on average; compared with the deep learning-based point cloud compression method, it obtains 16% BD-Rate gain on average, with a PSNR improvement of about 0.6 at the same bit rate. The experimental results demonstrate the feasibility and effectiveness of Transformer in the field of point cloud compression. In terms of subjective quality, the proposed method also shows a significant improvement, and the reconstructed point cloud is closer to the original point cloud.
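The voxelization step mentioned above can be illustrated with a minimal sketch that snaps each point to an integer grid cell and keeps each occupied cell once. The abstract does not give the voxel size or quantization rule; the floor-based quantization and the function name here are assumptions for illustration only.

```python
import math


def voxelize(points, voxel_size):
    """Map 3D points to the sorted list of unique occupied voxel indices."""
    occupied = {
        tuple(int(math.floor(c / voxel_size)) for c in p)  # floor handles negative coordinates
        for p in points
    }
    return sorted(occupied)
```

Points that fall into the same cell collapse to one voxel, which is where the geometric loss of this step comes from.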

Software defect prediction algorithm for intra-membrane sparrow optimizing ELM
TANG Yu, DAI Qi, YANG Mengyuan, CHEN Lifang
2024, 50(2): 643-654. doi: 10.13700/j.bh.1001-5965.2022.0438
Abstract:

The original sparrow search algorithm easily falls into local extrema in the later stage of iteration and suffers from low optimization accuracy. Combining the improved sparrow search algorithm, which has efficient optimization performance, with membrane computing, which has parallel computing capability, an intra-membrane sparrow optimization algorithm (IMSSA) is proposed. The experimental results on ten CEC2017 test functions show that IMSSA has higher optimization accuracy. In addition, to further verify the performance of IMSSA, the extreme learning machine (ELM) parameters are optimized using IMSSA, and an intra-membrane sparrow optimized ELM algorithm (IMSSA-ELM) is proposed for software defect prediction. The experimental results show that on the 15 public software defect datasets, the prediction performance of the IMSSA-ELM algorithm is significantly better than that of the other four compared algorithms under the two evaluation indicators of G-mean and MCC. The results also show that the IMSSA-ELM algorithm has better prediction accuracy and stability, and that its advantage is statistically significant in the Friedman ranking and Holm’s post-hoc nonparametric tests.
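The two evaluation indicators named above have standard definitions in terms of the binary confusion matrix: G-mean is the geometric mean of recall and specificity, and MCC is the Matthews correlation coefficient. A minimal sketch follows; returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the abstract.

```python
import math


def g_mean(tp, fp, tn, fn):
    """Geometric mean of recall (true positive rate) and specificity (true negative rate)."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(recall * specificity)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Both measures stay informative on imbalanced data, which is typical of defect datasets where defective modules are the minority class.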

Person re-identification method based on attention mechanism and CondConv
JI Guangkai, WANG Rong, PENG Shufan
2024, 50(2): 655-662. doi: 10.13700/j.bh.1001-5965.2022.0454
Abstract:

Person re-identification is an important part of the field of computer vision, but it is easily affected by the actual collection environment of person images, resulting in insufficient expression of person features and further leading to low model accuracy. An improved person re-identification method based on the attention mechanism and CondConv is proposed to fully express pedestrian features. The attention mechanism is introduced into the feature extraction network ResNet50, and the key information in the input image space and channel is weighted, while possible noise is suppressed. CondConv is introduced into the backbone network, and the convolution kernel parameters are dynamically adjusted to improve the capacity and performance of the model while maintaining efficient inference. Mainstream datasets such as Market1501, MSMT17 and DukeMTMC-ReID are used to evaluate the improved method. Rank-1 is increased by 1.1%, 2.4% and 1.3% respectively, and mAP is increased by 0.5%, 2.3% and 1.3% respectively. The results show that the improved method can better express person features and improve recognition accuracy.

Adversarial attack method based on loss smoothing
LI Meihong, JIN Shuang, DU Ye
2024, 50(2): 663-670. doi: 10.13700/j.bh.1001-5965.2022.0478
Abstract:

Deep neural networks (DNNs) are susceptible to attacks from adversarial samples. Most existing momentum-based adversarial attack methods achieve nearly 100% attack success rates under the white-box setting, but only achieve relatively low attack success rates under the black-box setting. An adversarial attack method based on loss smoothing is proposed, which can further improve the adversarial transferability. By integrating the locally averaged gradient term into the iterative process for attacks, the proposed method can suppress the local oscillation of the loss surface, stabilize the update direction and escape from poor local maxima. Empirical results on the standard ImageNet dataset demonstrate that, compared with the existing methods, the proposed method significantly improves the adversarial transferability by 38.07% and 27.77% under the single-model setting, and by 32.50% and 28.63% under the ensemble-model setting.
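The idea of feeding a locally averaged gradient into a momentum-based iterative attack can be sketched on a toy problem as follows. This is an illustration under several assumptions, not the authors' algorithm: neighbours are drawn uniformly from a box around the current point, the sample count and radius are arbitrary, and the update follows the general shape of momentum iterative sign-gradient attacks. `grad_fn` is a hypothetical callable returning the loss gradient at a point.

```python
import random


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def smoothed_momentum_attack(grad_fn, x, eps, steps=10, mu=1.0,
                             n_samples=8, radius=None, seed=0):
    """Maximize a loss within an L-infinity ball of radius eps around x."""
    rng = random.Random(seed)
    radius = eps if radius is None else radius
    alpha = eps / steps                      # per-step size
    adv = list(x)
    momentum = [0.0] * len(x)
    for _ in range(steps):
        # Average the gradient over random neighbours to smooth the loss surface.
        avg = [0.0] * len(x)
        for _ in range(n_samples):
            neighbour = [a + rng.uniform(-radius, radius) for a in adv]
            grad = grad_fn(neighbour)
            avg = [s + g / n_samples for s, g in zip(avg, grad)]
        l1 = sum(abs(v) for v in avg) or 1.0  # L1 normalization, guarding against zero
        momentum = [mu * m + a / l1 for m, a in zip(momentum, avg)]
        adv = [a + alpha * sign(m) for a, m in zip(adv, momentum)]
        # Project back into the eps-ball around the clean input.
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
    return adv
```

Averaging over neighbours trades extra gradient evaluations per step for a steadier update direction.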

Remote sensing image-text retrieval based on layout semantic joint representation
ZHANG Ruoyu, NIE Jie, SONG Ning, ZHENG Chengyu, WEI Zhiqiang
2024, 50(2): 671-683. doi: 10.13700/j.bh.1001-5965.2022.0527
Abstract:

Remote sensing image-text retrieval can retrieve valuable information from remote sensing data. It is of great significance to environmental assessment, urban planning and disaster prediction. However, a key problem is that the spatial layout information of remote sensing images is ignored, which is mainly reflected in two aspects: one is the difficulty of long-distance modeling of remote sensing targets; the other is the submergence of adjacent secondary targets in remote sensing images. To address these problems, this paper proposes a cross-modal remote sensing image-text retrieval model based on layout semantic joint representation, which includes the dominant semantic supervision layout visual feature extraction module (DSSL), the layout visual-global semantic cross guidance module (LV-GSCG) and the multi-view matching module (MVM). The DSSL module realizes the layout modeling of images under the supervision of dominant semantic category features. The LV-GSCG module calculates the similarity between the layout visual features and the global semantic features extracted from text to realize the interaction of different modal features. The MVM module establishes a cross-modal feature-guided multi-view metric matching mechanism to eliminate the semantic gap between the cross-modal data. Experimental validation on four baseline remote sensing image-text datasets shows that the model can achieve state-of-the-art performance in most cross-modal remote sensing image-text retrieval tasks.

Multilevel relation analysis and mining method of image-text
GUO Ruiping, WANG Hairong, WANG Dong
2024, 50(2): 684-694. doi: 10.13700/j.bh.1001-5965.2022.0599
Abstract:

How to efficiently mine the hidden semantic associations between multi-modal data is one of the key tasks of multi-modal knowledge extraction. In order to mine fine-grained relations between image and text, a multilevel relation analysis and mining method of image-text (MRAM) was proposed. BERT-Large (bidirectional encoder representation from transformers-large) extracted text features and constructed text connection graphs, while the Faster-RCNN network extracted image features to learn spatial position relations and semantic relations and then constructed image connection graphs, so as to complete the calculation of single-modal internal semantic relations. The node segmentation method and a graph convolutional network with multi-head attention (GCN-MA) fused the local and global relations of text and image. To improve the efficiency of relation mining, an edge weight pruning strategy based on the attention mechanism strengthened the representation of important branches and reduced the interference of redundant information. The proposed method was tested on the Flickr30K, MSCOCO-1K and MSCOCO-5K datasets, and was compared with 11 methods. The average recall rate on Flickr30K was increased by 0.97% and 0.57%, the average recall rate on MSCOCO-1K was increased by 0.93% and 0.63%, and the average recall rate on MSCOCO-5K was increased by 0.37% and 0.93%. Experimental results verify the effectiveness of the proposed method.