2022 Vol. 48, No. 2

Industry classification technology based on fastText algorithm
WU Zhen, RAN Xiaoyan, MIAO Quan, LIU Chunyan, ZHANG Dong, WEI Na
2022, 48(2): 193-198. doi: 10.13700/j.bh.1001-5965.2020.0402
Abstract:

With the rapid development of China's economy and the continuous improvement of its capacity for technological innovation, the efficient organization and classification of information is the basis for personalized industry management and tracking analysis. Based on the characteristics of industry information and its patterns of development, this paper proposes a Chinese industry classification model based on fastText. First, a keyword database for industry classification is constructed; then word segmentation and weight calculation are carried out using a feature lexicon; finally, a classifier model is built to classify industries automatically. In the experiment, 80 000 test documents covering business scope, enterprise information and public opinion information were selected. The results show that the classification accuracy of the proposed model is higher than that of Bayes, decision tree, KNN and other classification algorithms, indicating that the model works well in application.
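
As a rough illustration of the data preparation such a pipeline might need, the sketch below formats already-segmented documents into the line format that fastText's supervised mode reads, where each line starts with a label carrying the `__label__` prefix. The label names and token lists are hypothetical, and the keyword weighting described in the abstract is not reproduced here.

```python
def to_fasttext_line(label, tokens):
    """Format one segmented document as a fastText supervised-training line."""
    # fastText's supervised mode expects each line to start with the label,
    # marked by the "__label__" prefix, followed by space-separated tokens.
    clean = [t.strip() for t in tokens if t.strip()]
    return "__label__" + label + " " + " ".join(clean)


# Hypothetical, already word-segmented business-scope snippets.
samples = [
    ("software", ["software", "development", "cloud", "services"]),
    ("catering", ["restaurant", "food", " ", "delivery"]),
]
lines = [to_fasttext_line(label, tokens) for label, tokens in samples]
for line in lines:
    print(line)
```

Lines in this form could then be written to a file and passed to a fastText training run; the training call itself is left out because the paper's settings are not given in the abstract.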

Adaptive short text keyword generation model
WANG Yongjian, SUN Yaru, YANG Ying
2022, 48(2): 199-208. doi: 10.13700/j.bh.1001-5965.2020.0601
Abstract:

Keyword extraction has a great impact on text processing, and the accuracy and fluency of the extracted keywords are key to the task. To address problems such as inaccurate word segmentation, mismatch between keywords and text topics, and multi-language mixing in keyword extraction from short text, we propose an adaptive short text keyword generation model based on a graph convolutional neural network (ADGCN). First, the model uses a graph neural network as the encoding framework for text feature extraction, to cope with the irregular structure of short text and the complex relations between words. Then, based on the position features and context features of words, a self-attention mechanism is used to capture rich context-dependent information. Finally, a linear decoding scheme is used to generate interpretable keywords. We collect and publish a tag dataset, TH, from a social media platform, containing texts and topic tags. We evaluate and analyze the relevance, informativeness and coherence of the model's results from the perspective of user needs. The model not only generates keywords that match the topic of the short text, but also effectively alleviates the impact of data disturbance on the model. Experiments also show that the model performs well on the public dataset KP20k and has good portability.

Classification of network public opinion propagation pattern based on variational reasoning
TANG Hongmei, TANG Wenzhong, LI Ruichen, WANG Yanyang, WANG Lihong
2022, 48(2): 209-216. doi: 10.13700/j.bh.1001-5965.2020.0538
Abstract:

With the rapid development of online social media, analysis of how public opinion information spreads has become a research hotspot. To address the low classification accuracy of multi-path generation on small-sample data in the task of classifying network public opinion propagation patterns, this paper proposes a definition of the knowledge graph structure for the field of public opinion dissemination, builds a public opinion dissemination knowledge graph and a task dataset for public opinion dissemination analysis based on Weibo data, uses the GraphDIVA model to classify public opinion propagation patterns, and conducts a 25-sample test experiment of propagation pattern classification on the self-built dataset. The results show that, after 20 rounds of training, the classification accuracy of the model increases from 76% to 89.4%. The GraphDIVA model is thus effective in reducing the number of training rounds and improving classification accuracy.

Fuzzing testing sample set optimization scheme based on heuristic genetic algorithm
WANG Zhihua, WANG Haofan, CHENG Manman
2022, 48(2): 217-224. doi: 10.13700/j.bh.1001-5965.2020.0422
Abstract:

As the most effective method of vulnerability mining at present, fuzz testing not only handles complex programs better than other vulnerability mining techniques, but also scales well. In fuzz testing with large amounts of data, the input sample set suffers from low quality, high redundancy and weak availability. We therefore study the input sample set of fuzz testing and propose a heuristic genetic algorithm. With the help of a 0-1 matrix, the execution paths of the samples are selected and compressed by the heuristic genetic algorithm, yielding the smallest sample set that still takes sample quality into account and thereby improving the efficiency of fuzz testing. The experimental results show that, without loss, fuzz testing with the simplified sample set takes 22% less time than with the original sample set, and the compression rate is about 40% higher than that of the traditional scheme.
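
To make the 0-1 matrix idea concrete, here is a minimal sketch of sample set reduction in which row i, column j is 1 when sample i exercises execution path j. It uses a plain greedy set-cover heuristic rather than the heuristic genetic algorithm of the paper, so it only illustrates the problem being solved; the matrix values are made up.

```python
def minimize_samples(coverage):
    """Greedily pick rows of a 0-1 matrix until every reachable column is covered.

    coverage[i][j] == 1 means sample i exercises execution path j.
    Returns the indices of the selected samples.
    """
    n_paths = len(coverage[0]) if coverage else 0
    # Only paths that at least one sample reaches can ever be covered.
    uncovered = {j for j in range(n_paths) if any(row[j] for row in coverage)}
    selected = []
    while uncovered:
        # Pick the sample that covers the most still-uncovered paths.
        best = max(range(len(coverage)),
                   key=lambda i: sum(1 for j in uncovered if coverage[i][j]))
        selected.append(best)
        uncovered -= {j for j in uncovered if coverage[best][j]}
    return selected


# Hypothetical coverage matrix: 5 samples, 4 execution paths.
matrix = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]
kept = minimize_samples(matrix)
print(kept)
```

A genetic algorithm would instead evolve candidate row subsets under a fitness function that rewards full path coverage with few samples; the greedy version is shown because it is short and deterministic.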

Low redundancy feature selection method for Android malware detection
HAO Jingwei, PAN Limin, LI Rui, YANG Peng, LUO Senlin
2022, 48(2): 225-232. doi: 10.13700/j.bh.1001-5965.2020.0567
Abstract:

A low redundancy feature selection method for Android malware detection is proposed to solve the problem of feature redundancy caused by excessive attention to features with the same frequency distribution between classes. First, the method selects features with a biased frequency distribution using the Mann-Whitney test, and then quantifies the degree of bias and the frequency of feature appearance with the appearance ratio interval algorithm, rejecting features with low bias and low usage frequency across the software as a whole. Finally, the particle swarm optimization algorithm is combined with the model's detection performance to obtain the optimal feature subset. Experiments were conducted on the public datasets DREBIN and AMD. The experimental results show that 294-dimensional features were selected on the AMD dataset, improving the detection accuracy of the six classifiers by 1%-5%, and that 295-dimensional features were selected on the DREBIN dataset, fewer than those selected by the 4 comparison methods, improving the detection accuracy of the six classifiers by 1.7%-5%. These results illustrate that the proposed method can reduce feature redundancy in Android malware detection and improve malware detection accuracy.
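
The first step above relies on the Mann-Whitney test. The sketch below computes only the U statistic for one feature and rescales it to a value between 0 and 1, which gives a simple measure of how strongly the feature's frequency differs between classes. The counts are hypothetical, and the p-value computation, the appearance ratio interval algorithm and the particle swarm search are not shown.

```python
def mann_whitney_u(xs, ys):
    """Return the U statistic of xs relative to ys (ties count as 0.5)."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def frequency_bias(xs, ys):
    """Rescale U to [0, 1]; 0.5 means no bias between the two classes."""
    return mann_whitney_u(xs, ys) / (len(xs) * len(ys))


# Hypothetical per-app call counts of one API feature.
malware_counts = [5, 7, 8, 9]
benign_counts = [1, 2, 2, 6]
bias = frequency_bias(malware_counts, benign_counts)
print(bias)
```

A value near 0.5 suggests the feature is distributed similarly in both classes and is a candidate for removal, while values near 0 or 1 indicate a biased feature worth keeping.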

Traffic classification algorithm of Internet of things devices based on random forest
LI Ruiguang, DUAN Pengyu, SHEN Meng, ZHU Liehuang
2022, 48(2): 233-239. doi: 10.13700/j.bh.1001-5965.2020.0383
Abstract:

The traffic classification of Internet of things (IoT) devices is very important to the management of cyberspace assets. Classification technology based on statistical identification is a hot spot in current academic research. Previous algorithms built their feature vectors mainly from flow information and made less use of packet information. In this paper, we improve the random-forest-based traffic classification algorithm for IoT devices by building the feature vectors from both the flow information and the flow's packet information. The experimental results show that, compared with previous algorithms, the classification accuracy of the proposed algorithm increases from 56% to 82%, the recall rate improves from 47% to 67%, the F1 score increases from 0.43 to 0.74, and the confusion matrix correlation is also significantly improved. The proposed algorithm therefore classifies IoT device traffic better than previous ones.
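
A minimal sketch of what combining flow information with packet information might look like is given below. The specific features are assumptions chosen for illustration, not the feature set used in the paper, and packets are assumed to be sorted by timestamp.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Build a feature vector from (timestamp_seconds, size_bytes) packets."""
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    duration = max(times) - min(times)
    # Flow-level statistics: totals over the whole flow.
    flow_part = [len(packets), sum(sizes), duration]
    # Packet-level statistics: size distribution and inter-arrival gaps.
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
    packet_part = [min(sizes), max(sizes), mean(sizes), pstdev(sizes), mean(gaps)]
    return flow_part + packet_part


# Hypothetical flow of four packets.
flow = [(0.0, 60), (0.5, 1500), (1.0, 60), (2.0, 1500)]
vector = flow_features(flow)
print(vector)
```

Vectors of this kind, one per flow, could then be passed to any random forest implementation for training and prediction.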

Large-scale IoT malware analysis and classification method
HE Qinglin, WANG Lihong, LUO Bing, YANG Libin
2022, 48(2): 240-248. doi: 10.13700/j.bh.1001-5965.2020.0401
Abstract:

Recently, Internet of things (IoT) malware has emerged in large numbers and attacks IoT devices in cyberspace. However, because much of this malware derives from open-source code, the family characteristics of IoT malware are not obvious, and a more fine-grained malware classification method is needed to discover advanced threat malware and track attack organizations. To address this problem, we carried out a large-scale analysis of 157 911 IoT malware samples found from May 2019 to May 2020, and labeled a dataset of 12 278 samples in 9 categories. We then propose an IoT malware classification method whose main idea is to extract complex structural features, including the function call graph (FCG) and text, by static reverse analysis. Features are learned using graph representation learning and text representation learning, and experiments on the labeled dataset show an average recall rate of 88.1%. Our method has been put into practice and works well.

Data masking model for heterogeneous big data environment
TONG Lingling, LI Pengxiao, DUAN Dongsheng, REN Boya, LI Yangxi
2022, 48(2): 249-257. doi: 10.13700/j.bh.1001-5965.2020.0403
Abstract:

Due to the variety of data types and desensitization demands in different scenarios, traditional data masking methods cannot meet user privacy protection requirements in big data environments. How to achieve accurately targeted and efficient desensitization of heterogeneous big data while preserving data security, trust and availability has become the key issue in this area. In this paper, we propose a data masking model for heterogeneous big data applications, such as texts, images, voices and databases, and present four key modules in our model. First, sensitive data are automatically identified and classified in different application scenarios by desensitization data preprocessing. Second, with a data pre-masking method, data masking is evaluated in five dimensions, including data availability, data relevance, degree of privacy protection, and time and space complexity, to construct a customized desensitization strategy. Finally, after task scheduling, the data masking tasks are allocated and executed, and recovery of masked data is also partially supported. Two typical data masking applications are verified and analyzed with the proposed heterogeneous big data masking model, indicating that effective desensitization can be achieved in different application scenarios.
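
For readers unfamiliar with data masking, the sketch below shows one very simple rule-based masking step for text data. The two patterns (an 11-digit phone number and an email address) and the masking style are assumptions for illustration only; they are not taken from the model in the paper, which also covers images, voice and databases.

```python
import re

# Hypothetical patterns: an 11-digit phone number and a simple email address.
PHONE = re.compile(r"\b(\d{3})\d{4}(\d{4})\b")
EMAIL = re.compile(r"\b([A-Za-z0-9])[A-Za-z0-9._-]*@([A-Za-z0-9.-]+)")


def mask_text(text):
    """Hide the middle of phone numbers and the local part of emails."""
    text = PHONE.sub(r"\1****\2", text)   # keep first 3 and last 4 digits
    text = EMAIL.sub(r"\1***@\2", text)   # keep first character and domain
    return text


masked = mask_text("Contact 13812345678 or alice.w@example.com")
print(masked)
```

Masking of this kind trades availability for privacy: the more characters are hidden, the less useful the record is for downstream analysis, which is the balance the evaluation dimensions in the abstract are meant to capture.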

Malicious code detection based on heterogeneous information network
LIU Yashu, HOU Yueran, YAN Hanbing
2022, 48(2): 258-265. doi: 10.13700/j.bh.1001-5965.2020.0539
Abstract:

Malicious code poses serious threats to network and information security. How to detect malware rapidly and how to eliminate or reduce the harm it causes are important research topics. This paper presents a method that obtains dynamic features of malware using dynamic information and a heterogeneous information network (HIN), and implements malicious code detection and classification. Four meta graph schemes over FILE, API and DLL are proposed, and the network schema of the malicious code HIN is described. An improved random walk strategy is used to obtain the context information of the object nodes in the meta graph schemes, which serves as the input of a continuous bag of words (CBOW) model to obtain network embeddings as word vectors. The principal angle method is improved by voting to obtain the classification result of multiple meta graph schemes with feature fusion. The proposed method greatly improves the accuracy of malware classification based on the features of each meta graph when limited information is available.
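
The sketch below illustrates the general idea of a type-guided random walk on a heterogeneous network of FILE, API and DLL nodes. It is a plain meta-path walk, not the improved strategy or the meta graph schemes of the paper, and the tiny graph and node names are invented for the example.

```python
import random


def metapath_walk(graph, node_type, start, pattern, length, rng):
    """Walk from start, stepping only to neighbors of the next type in pattern."""
    walk = [start]
    step = pattern.index(node_type[start])
    while len(walk) < length:
        step = (step + 1) % len(pattern)   # pattern repeats cyclically
        wanted = pattern[step]
        options = [n for n in graph[walk[-1]] if node_type[n] == wanted]
        if not options:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(options))
    return walk


# Hypothetical network: two files, one API and one DLL.
graph = {
    "f1": ["CreateFile", "kernel32.dll"],
    "f2": ["CreateFile"],
    "CreateFile": ["f1", "f2"],
    "kernel32.dll": ["f1"],
}
node_type = {"f1": "FILE", "f2": "FILE",
             "CreateFile": "API", "kernel32.dll": "DLL"}
walk = metapath_walk(graph, node_type, "f1", ["FILE", "API"], 5,
                     random.Random(0))
print(walk)
```

Each walk plays the role of a sentence whose words are node identifiers, so a collection of walks can be fed to a CBOW-style embedding model.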

Optimal choice of order preserving encryption scheme in encrypted document ranking
ZHANG Jiuling, HUANG Daochao, SHEN Shijun
2022, 48(2): 266-272. doi: 10.13700/j.bh.1001-5965.2020.0414
Abstract:

Encrypting documents before uploading them to an untrusted remote server is one of the ultimate solutions for protecting user privacy. As different encryption schemes give different ranking results, finding the order preserving encryption scheme that gives the best ranking result is the key issue to be solved. A criterion based on discrimination information is proposed for selecting the order preserving encryption scheme, and is then employed to compare the ranking results of two different order preserving encryption schemes. When the distribution of the ciphertext is closer to that of the plaintext, the ranking result is closer to that in the plaintext scenario. The proposed criterion is not only instructive in theory, but also helpful in practice for choosing the order preserving encryption scheme under the same security conditions, which in turn leads to the optimal retrieval result.
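
As background, the defining property of order preserving encryption is that comparing ciphertexts gives the same result as comparing plaintexts. The toy mapping below demonstrates only that property for a single integer score per document; it is not secure, it is not one of the schemes compared in the paper, and the scores are made up.

```python
import random


def build_ope_table(domain_size, rng, max_gap=1000):
    """Map each plaintext 0..domain_size-1 to a strictly increasing ciphertext."""
    table, current = [], 0
    for _ in range(domain_size):
        current += rng.randint(1, max_gap)  # positive gap keeps order strict
        table.append(current)
    return table


def rank(values):
    """Return document indices sorted by descending score."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


table = build_ope_table(101, random.Random(42))
scores = [17, 93, 40, 5, 66]            # hypothetical integer relevance scores
encrypted = [table[s] for s in scores]
print(rank(scores), rank(encrypted))
```

Although the order survives, the gaps between ciphertexts no longer resemble the gaps between plaintexts, which is one way the ciphertext distribution can drift from the plaintext distribution that the abstract's criterion is concerned with.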

Application of space time regional economy visualization based on telecom big data analysis
LI Na, LIU Wenmin, MENG Fanrui, LIU Yan
2022, 48(2): 273-281. doi: 10.13700/j.bh.1001-5965.2020.0388
Abstract:

Currently, the number of mobile phone users in China has reached 1.59 billion. With such a large population base, the characteristics of telecom big data reflect, to a certain extent, the characteristics of crowd activities, which in turn reflect the development status of specific regions. The space time regional economy visualization application uses data mining technology to process and extract information from massive telecom big data, improving data quality; it then screens the data under different rules and extracts features with data modeling techniques. The data are combined with multi-source information, such as electronic map data and traffic data, to analyze user behavior characteristics from multiple perspectives. The application uses the data to visualize and study the space time regional economic situation and to analyze the life attributes of residents. In addition, the difference-in-differences (DID) model is used to evaluate regional economic policies. Based on the results of feature analysis, the application can provide a decision-making basis for positioning regional economic development and guiding the layout of urban business districts, improve the efficiency of urban system operation, and expand the reach of regional economic benefits.
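
The basic difference-in-differences estimate compares the change in a treated region with the change in a control region over the same period. The sketch below shows the arithmetic with hypothetical activity indices; the regression form with covariates that such studies usually use is not shown.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: treated change minus control change."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Hypothetical average activity index before and after a policy takes effect.
effect = did_estimate(treated_pre=100.0, treated_post=130.0,
                      control_pre=90.0, control_post=105.0)
print(effect)
```

Here the treated region grew by 30 and the control region by 15, so 15 is attributed to the policy under the assumption that both regions would otherwise have followed parallel trends.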

Malicious code clone detection technology based on deep learning
SHEN Yuan, YAN Hanbing, XIA Chunhe, HAN Zhihui
2022, 48(2): 282-290. doi: 10.13700/j.bh.1001-5965.2020.0400
Abstract:

Malicious code clone detection has become an effective way to analyze malicious code homology and advanced persistent threat (APT) attacks. In this paper, we collect samples of different APT organizations from public threat intelligence, and propose a deep learning based malicious code clone detection framework that measures the similarity between functions in newly discovered malicious code and the malicious code in known APT organizational resources, in order to analyze malware efficiently and quickly identify the source of APT attacks. We perform static analysis of malicious code through disassembly, use its key function call graph and disassembled code as the features of the malicious code, and then classify the malicious code in the APT organization library with a neural network model. Extensive evaluation and comparison with our previous model (MCrab) show that the improved model outperforms it, effectively detecting and classifying malicious code clones with a higher detection rate.

Randomness of traffic data in TLS cipher suite
GUO Shuai, CHENG Guang
2022, 48(2): 291-300. doi: 10.13700/j.bh.1001-5965.2020.0390
Abstract:

Cipher suite is the cornerstone of transport layer security (TLS) to realize secure communication, which includes asymmetric cipher algorithm, symmetric cipher algorithm and message digest algorithm, among which symmetric cipher algorithm is used for data encryption in actual communication. Through the collection and analysis of real traffic, this paper obtains the distribution of different TLS cipher suites in the existing network. Then, an analysis method based on image ciphertext reconstruction, NIST randomness test suite and convolutional neural network (CNN) is designed to analyze the ciphertext randomness of mainstream symmetric cipher algorithms (AES, ChaCha20) and other common symmetric cipher algorithms (DES, 3DES, RC2, RC4). The experimental results show that the ciphertexts of all the symmetric cipher algorithms participating in the comparison have poor randomness in the electronic codebook (ECB) mode and cannot pass most tests. AES and ChaCha20, two mainstream TLS symmetric cipher algorithms, have good randomness in ciphertext except ECB mode, and have resistance to cipher algorithm recognition based on CNN or random forest. Relevant research can provide reference for the deep analysis of TLS cipher suite selection and encrypted traffic.
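
As a small illustration of the kind of check the NIST randomness test suite performs, the sketch below implements the frequency (monobit) test from NIST SP 800-22, which asks whether ones and zeros are roughly balanced. It is one test out of the suite and says nothing about the image reconstruction or CNN parts of the method; the input byte patterns are made up.

```python
import math


def monobit_p_value(bits):
    """NIST SP 800-22 frequency (monobit) test; returns the p-value."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)   # +1 for each 1 bit, -1 for each 0
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def bytes_to_bits(data):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


balanced = bytes_to_bits(bytes([0b10101010] * 64))
skewed = bytes_to_bits(bytes([0b11111110] * 64))
print(monobit_p_value(balanced), monobit_p_value(skewed))
```

A sequence is usually treated as failing when the p-value is below 0.01. Note that the strictly alternating pattern passes this test despite being obviously non-random, which is why a full suite of complementary tests is applied to ciphertext.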

Weibo tendency analysis based on sentimental object recognition and sentimental rules
WANG Zechen, WANG Shupeng, SUN Liyuan, ZHANG Lei, WANG Yong, HAO Bingchuan
2022, 48(2): 301-310. doi: 10.13700/j.bh.1001-5965.2020.0404
Abstract:

Weibo contains a large amount of information reflecting users' likes and dislikes, which is important for judging popular trends, precision marketing, public opinion monitoring, etc. However, existing methods tend to focus on classifying Weibo sentiment. To address Weibo tendency analysis and stance detection, we employ a semi-supervised learning method based on collaborative training and active learning, train entity recognition models, and combine deep learning with sentiment rules. Moreover, sentiment rules based on principal component analysis are constructed to extract the main components of sentences and normalize colloquial text into a specified format. We then use the positive and negative aspects of directional entities, the positive and negative meanings of sentiment words, and the sentence components in which sentiment words occur to judge the tendency of blog posts, and conduct deeper analysis of stance classification. Finally, self-comparison experiments and comparison experiments against other methods on datasets of different scales show that the accuracy of the model continues to improve as the number of entity-labeled blog posts increases, and that the accuracy of this method is significantly higher than that of the comparison methods, exceeding existing research methods by 2.79% and 10.00%.

User electricity consumption behavior mode analysis based on energy decomposition
LU Ruirui, YU Haiyang, YANG Zhen, LAI Yingxu, YANG Shisong, ZHOU Ming
2022, 48(2): 311-323. doi: 10.13700/j.bh.1001-5965.2020.0557
Abstract:

With the popularization of smart grids and the development of big data technology, more and more attention has been paid to analyzing users' electricity consumption behavior from electricity consumption data. Existing energy decomposition methods cannot meet the high requirements for resolution and decomposition accuracy in practical applications, and cluster analysis is too coarse to fully show the electricity consumption characteristics of each type of electrical appliance. In view of this, this paper proposes a method for analyzing users' electricity consumption behavior based on energy decomposition. Building on the discriminative sparse coding model, we first address the problems that the L0 regular term is difficult to solve and that the sparse constraint of the L1 regular term is not ideal: we propose using the sparse constraint of the L1/2 regular term to perform energy decomposition, and add the homogeneity between users as a regular term to the basic model to improve its performance. Second, based on the results of energy decomposition, we use the electricity consumption characteristics of a user's individual appliance types, instead of the total electricity consumption characteristics, to refine the analysis of the user's electricity consumption behavior, and improve the traditional K-means clustering algorithm for experimental verification. The experimental results show that the energy decomposition method based on the sparse constraint of the L1/2 regular term and the homogeneity constraint effectively improves the accuracy of energy decomposition compared with the traditional discriminative sparse coding method. The results of cluster analysis of users' electricity consumption behavior based on energy decomposition are also significantly improved.
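
To show why an L1/2 regular term can enforce sparsity more strongly than L1, the sketch below evaluates the three penalties on two coefficient vectors with the same L1 value. The vectors are hypothetical, and the full discriminative sparse coding objective is not reproduced.

```python
def l0_penalty(x):
    """Number of non-zero entries (hard to optimize directly)."""
    return sum(1 for v in x if v != 0)


def l1_penalty(x):
    """Sum of absolute values."""
    return sum(abs(v) for v in x)


def l_half_penalty(x):
    """Sum of square roots of absolute values (the L1/2 regular term)."""
    return sum(abs(v) ** 0.5 for v in x)


dense = [1.0, 1.0, 1.0, 1.0]    # activation spread over four basis vectors
sparse = [4.0, 0.0, 0.0, 0.0]   # same total activation on one basis vector
for name, vec in (("dense", dense), ("sparse", sparse)):
    print(name, l0_penalty(vec), l1_penalty(vec), l_half_penalty(vec))
```

The L1 penalty cannot tell the two vectors apart, while the L1/2 penalty is lower for the sparse one, which moves it closer to the behavior of L0 while remaining a continuous function.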

A dynamic network threat evaluation method for smart grid embedded devices
LYU Zhuo, GUO Zhimin, CHEN Cen, MO Jiansong, CHANG Chaowen
2022, 48(2): 324-330. doi: 10.13700/j.bh.1001-5965.2020.0398
Abstract:

Due to limited computing and storage resources, smart grid embedded devices cannot deal with network attacks effectively, and their security assessment methods are weak. To solve these problems, a dynamic network attack behavior evaluation method for smart grid embedded devices is proposed. The method uses a security control module to analyze the communication data stream of the actual embedded device, and evaluates the impact of attack behavior in an embedded system simulator using dynamic trust measurement of components. The final security evaluation result for the network attacks is obtained from a whole-process dynamic comprehensive measurement of the platform configuration attribute, the platform operation attribute and the user authentication attribute. The method is tested in the actual environments of a power distribution automation system and a power utilization information collection system. The results show that, for common attacks against embedded devices, the accuracy of the proposed detection method exceeds 90%. The method provides good security assessment accuracy while also effectively improving the security of the devices themselves.

Adversarial sample generation technology of malicious code based on LIME
HUANG Tianbo, LI Chengyang, LIU Yongzhi, LI Denghui, WEN Weiping
2022, 48(2): 331-338. doi: 10.13700/j.bh.1001-5965.2020.0397
Abstract:

Based on research into and analysis of machine learning technology for detecting malicious code, a black-box adversarial example generation method based on local interpretable model-agnostic explanations (LIME) is proposed to generate adversarial samples for any black-box malicious code classifier and bypass detection by machine learning models. The method uses a simple model to simulate the target classifier's local behavior, obtains the feature weights, and generates disturbances through the disturbance algorithm. According to the generated disturbances, the method modifies the original malicious code to generate adversarial samples. We test the method using Microsoft's common malicious sample data from 2015 and benign sample data collected from more than 50 suppliers, as follows. Eighteen target classifiers based on different algorithms or features were implemented with reference to common malicious code classifiers. The classifiers' true positive rates were reduced to approximately zero when we attacked them using the method. Two advanced black-box sample generation methods, MalGAN and ZOO, were reproduced for comparison with this method. The experimental results show that the proposed method can effectively generate adversarial samples, and that it has several strengths, including broad applicability, flexible control of disturbances, and soundness.

A method for filtering the attack pairs of adversarial examples based on attack distance
LIU Hongyi, FANG Yutong, WEN Weiping
2022, 48(2): 339-347. doi: 10.13700/j.bh.1001-5965.2020.0529
Abstract:

During the generation of black-box adversarial examples, an attack pair is usually specified, consisting of a source example and a target example. The goal is for the generated adversarial example to differ only slightly in norm from the source example while being assigned by the classifier to the class of the target example. To solve the instability of adversarial attacks caused by the differing attack difficulty of attack pairs, and taking the field of image recognition as an example, this paper presents an attack distance measurement method based on the length of the decision boundary, which provides a measure of the attack difficulty of attack pairs. The paper then designs a filtering method based on the attack distance of the attack pairs, which filters out pairs that are difficult to attack before the attack starts, so that attack performance can be improved without modifying the attack algorithm. Experiments show that, compared with the attack pairs before filtering, the filtered attack pairs improve the overall attack performance by 42.07%, improve the attack efficiency by 24.99%, and stabilize the variance by 76.23%. It is recommended that all methods that generate adversarial examples using attack pairs filter the attack pairs before the attack, to stabilize and improve attack performance.

Malware family classification method based on abstract assembly instructions
LI Yu, LUO Senlin, HAO Jingwei, PAN Limin
2022, 48(2): 348-355. doi: 10.13700/j.bh.1001-5965.2020.0568
Abstract:

The emergence of malware variants poses a great threat to network security. In malware family classification methods based on assembly instructions, the semantics of operands are closely tied to the operating environment and difficult to extract, which leads to a lack of instruction semantics and makes it difficult to classify malware variants correctly. A malware family classification method based on abstract assembly instructions is proposed. Each instruction is reconstructed by abstracting the operand type, so that the semantics of the operands are freed from the constraints of the operating environment. A word attention mechanism and a bidirectional gated recurrent unit (Bi-GRU) are used to construct an instruction embedding network and capture the behavioral semantics of instructions. Combined with bidirectional recurrent neural networks (Bi-RNN), the common instruction sequence of a malware family is learned to reduce the interference of variation techniques on the instruction sequence. The original instructions and the family's common instruction sequence are integrated to construct feature images, and malware family classification is realized through a convolutional neural network. The experimental results on the public dataset show that the proposed method can effectively extract operand information, resist the interference of irrelevant instructions in malware variants, and classify malware variants into families.
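
The operand abstraction step can be pictured as replacing each concrete operand with its type. The sketch below does this for a handful of x86-style instructions; the type names (REG, MEM, IMM, SYM), the register list and the parsing rules are assumptions for illustration and may differ from the abstraction used in the paper.

```python
import re

# A small, hypothetical subset of 32-bit x86 register names.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}
IMMEDIATE = re.compile(r"(0x[0-9a-f]+|[0-9a-f]+h|\d+)")


def abstract_operand(operand):
    """Replace a concrete operand with a coarse type token."""
    op = operand.strip().lower()
    if op in REGISTERS:
        return "REG"
    if "[" in op:
        return "MEM"   # any bracketed expression is a memory reference
    if IMMEDIATE.fullmatch(op):
        return "IMM"
    return "SYM"       # labels, function names, anything else


def abstract_instruction(line):
    """Keep the mnemonic and abstract every operand."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [abstract_operand(o) for o in rest.split(",") if o.strip()]
    return " ".join([mnemonic.lower()] + operands)


code = ["mov eax, [ebp+8]", "push 0x10", "call sub_401000", "add esp, 0Ch", "ret"]
abstracted = [abstract_instruction(line) for line in code]
print(abstracted)
```

After abstraction, two variants that differ only in concrete addresses or constants produce the same token sequence, which is what makes the sequence usable as input to an embedding network.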

Automatic poster synthesis system based on keywords
GUAN Shuaipeng, YU Haiyang, YANG Zhen, ZHOU Ming, LAI Yingxu
2022, 48(2): 356-368. doi: 10.13700/j.bh.1001-5965.2020.0552
Abstract:

The spread of intelligent technology places new demands on image editing. As a way of conveying information in the form of images, posters play an important role in daily life and work management. However, producing a poster requires multi-element image synthesis, and an interactive, one-click image synthesis system is lacking. Drawing on currently popular image processing technology, an automatic poster synthesis system is designed and implemented. We propose a keyword-based image retrieval scheme and construct a dual filtering scheme based on text and content, providing users with accurate and fast image retrieval. By collecting statistics on the composition rules of a large number of carefully designed poster pictures and introducing composition rules from aesthetic common sense, we propose a portrait layout recommendation scheme based on two-way rules, which assists users in portrait layout design under the combined effect of both sets of rules. The experimental results show that the proposed scheme runs stably and efficiently, that users can synthesize images through simple interactive operations, and that the final synthesized images look realistic.