-
摘要:
隧道穿越问题是主题爬虫发展过程中无法回避的一个问题,为解决隧道穿越问题,提出一种基于博伊德环的自决策主题爬虫 (FCIDOL) 算法。该算法以博伊德环为基本框架,按照“观察-评估-决策-行动”形成闭环,根据爬虫已完成的工作——记忆,对观察到的当前状态进行评估,产生激进或保守策略的决策,引导爬虫执行寻找新的主题相关网页团,或专注于短期收益的行动,记忆的作用在于为评估网络提供训练材料,实现对网络的在线训练满足爬虫的冷启动。实验表明:所提算法相较于多种主题爬虫算法在不同主题环境下收获率提升了7.8%以上,重复链接次数减少了15.6%以上。
Abstract:Tunnel crossing problem is unavoidable in the development of the topic crawler. To solve this problem, a self-decision topic crawler algorithm based on Boyd loop (FCIDOL) was proposed. The algorithm took the Boyd loop as the basic framework and formed a closed loop according to the principle of “observation-assessment-decision-action”. According to the work completed by the crawler, which refers to memory, the algorithm evaluated the current state observed to generate decisions of radical or conservative strategies, guiding the crawler to search for new theme-relevant web pages or to focus on the actions of short-term benefits. The role of memory was to provide training materials for the assessment network, thus realizing the online training of the network to meet the cold start of the crawler. The experiment shows that compared with various topic crawler algorithms in different topic environments, FCIDOL achieves an improvement of over 7.8% in harvest rate, and the number of duplicate links is reduced by more than 15.6%.
-
Key words:
- topic crawler /
- tunnel crossing /
- self-decision /
- online learning /
- cold start
-
表 1 评估网络的输入与输出
Table 1. Input and output of assessment network
信息 类别 值类型 主题相关度 输入 浮点型, (0,1) 网页权威值 输入 浮点型, (0,1) 相关性变化 输入 布尔型, 真/假 父网页平均主题相关度 输入 浮点型,(0,1) 与主题相关父网页距离 输入 整 型, ≥1 重复链接存在性 输入 布尔型, 真/假 子网页主题相关预测值 输入 浮点型, (0,1) 子网页网页权威值 输入 浮点型, (0,1) 激进评估得分 输出 浮点型, (0,1) 保守评估得分 输出 浮点型, (0,1) 表 2 不同隐藏节点数的BPNN训练误差
Table 2. Training errors of BPNN with different numbers of hidden nodes
隐藏节点数 训练误差FMSE k−1 k−2 k−3 k−4 R C R C R C R C 4 0.0137 0.0155 0.0152 0.0146 0.0144 0.0148 0.0152 0.0185 5 0.0323 0.0774 0.0925 0.0448 0.0674 0.0855 0.0898 0.0479 6 0.1001 0.1299 0.1696 0.1618 0.1063 0.1207 0.2183 0.1553 7 0.0279 0.0980 0.0109 0.0532 0.0079 0.0846 0.0401 0.0269 8 0.0002 0.0009 0.0065 0.0009 0.0008 0.0069 0.0023 0.0012 9 0.0161 0.5245 0.0389 0.2393 0.0404 0.4237 0.0255 0.2361 10 0.0230 0.0670 0.0278 0.0256 0.0344 0.0365 0.0109 0.0438 11 0.3730 0.2327 0.0713 0.5661 0.0879 0.0537 0.0261 0.1565 12 0.0839 0.4194 0.2262 0.1135 0.2022 0.1723 0.7254 0.0333 13 0.0826 0.2577 0.2488 0.1733 0.1442 0.1556 0.1389 0.9764 14 0.0966 0.3991 0.1656 0.1491 0.1941 0.3161 0.5263 0.2881 -
[1] BERGMARK D, LAGOZE C, SBITYAKOV A. Focused crawls, tunneling, and digital libraries[C]// Lecture Notes in Computer Science. Berlin: Springer, 2002: 91-106. [2] ABITEBOUL S, PREDA M, COBENA G. Adaptive on-line page importance computation[C]//Proceedings of the Twelfth International Conference on World Wide Web-WWW '03. New York: ACM, 2003: 280-290. [3] PAGE L, BRIN S, MOTWANI R, et al. The PageRank citation ranking: Bringing order to the web[C]. Stanford Digital Libravies Working Paper, [s. l.]: [s. n.], 1998. [4] WANG C, GUAN Z Y, CHEN C, et al. On-line topical importance estimation: an effective focused crawling algorithm combining link and content analysis[J]. Journal of Zhejiang University: Science A, 2009, 10(8): 1114-1124. doi: 10.1631/jzus.A0820481 [5] 朱庆生, 徐宁, 周瑜. 一种基于链接和内容分析的自适应主题爬虫算法[J]. 计算机与现代化, 2015(9): 77-80. doi: 10.3969/j.issn.1006-2475.2015.09.016ZHU Q S, XU N, ZHOU Y. An adaptive focused crawling algorithm based on link and content analysis[J]. Computer and Modernization, 2015(9): 77-80(in Chinese). doi: 10.3969/j.issn.1006-2475.2015.09.016 [6] KANG X P, MIAO D Q. A study on information granularity in formal concept analysis based on concept-bases[J]. Knowledge-Based Systems, 2016, 105: 147-159. doi: 10.1016/j.knosys.2016.05.005 [7] JING W P, WANG Y J, WEIWEI D. Research on adaptive genetic algorithm in application of focused crawler search strategy[J]. Computer Science, 2016, 43(8): 254-257. [8] LIU W J, DU Y J. A novel focused crawler based on cell-like membrane computing optimization algorithm[J]. Neurocomputing, 2014, 123: 266-280. doi: 10.1016/j.neucom.2013.06.039 [9] ZHENG S. Genetic and ant algorithms based focused crawler design[C]//Proceedings pf the Second International Conference on Innovations in Bio-inspired Computing and Applications. Piscataway: IEEE Press, 2011: 374-378. [10] GUAN W G, LUO Y C. Design and implementation of focused crawler based on concept context graph[J]. Computer Engineering and Design, 2016, 37 (10): 2679-2684. [11] FEI C J, LIU B S. Focused crawler based on LDA extended topic terms[J]. Computer Applications and Software, 2018, 35 (4) : 49-54. [12] LIU J F, DONG Y, LIU Z X, et al. Applying ontology learning and multi-objective ant colony optimization method for focused crawling to meteorological disasters domain knowledge[J]. Expert Systems with Applications, 2022, 198: 116741. doi: 10.1016/j.eswa.2022.116741 [13] ENCK R E. The OODA loop[J]. Home Health Care Management & Practice, 2012, 24(3): 123-124. [14] RANI M, DHAR A K, VYAS O P. Semi-automatic terminology ontology learning based on topic modeling[J]. Engineering Applications of Artificial Intelligence, 2017, 63: 108-125. doi: 10.1016/j.engappai.2017.05.006 [15] CHURCH K W. Word2Vec[J]. Natural Language Engineering, 2017, 23(1): 155-162. doi: 10.1017/S1351324916000334 [16] AIZAWA A. An information-theoretic perspective of tf–idf measures[J]. Information Processing & Management, 2003, 39(1): 45-65. [17] LI L, ZHANG G Y, LI Z W. Research on focused crawling technology based on SVM[J]. Computer Science, 2015, 42(2) : 118-122. [18] CHIBA Z, ABGHOUR N, MOUSSAID K, et al. A novel architecture combined with optimal parameters for back propagation neural networks applied to anomaly network intrusion detection[J]. Computers & Security, 2018, 75: 36-58. [19] BILSKI J, KOWALCZYK B, MARCHLEWSKA A, et al. Local levenberg-marquardt algorithm for learning feedforwad neural networks[J]. Journal of Artificial Intelligence and Soft Computing Research, 2020, 10(4): 299-316. doi: 10.2478/jaiscr-2020-0020 [20] DE JESÚS RUBIO J. Stability analysis of the modified levenberg–marquardt algorithm for the artificial neural network training[J]. IEEE Transactions on Neural Networks and Learning Systems, 2020, 32(8): 3510-3524. [21] LIU J F, GU Y P, LIU W J. Focused crawler method combining ontology and improved Tabu search for meteorological disaster[J]. Journal of Computer Applications, 2020, 40(8): 2255-2261. -