
A self-decision topic crawler algorithm with online training

XIONG Guanye (熊观野), YANG Bailong (杨百龙)

Citation: XIONG G Y, YANG B L. A self-decision topic crawler algorithm with online training[J]. Journal of Beijing University of Aeronautics and Astronautics, 2025, 51(2): 602-615 (in Chinese). doi: 10.13700/j.bh.1001-5965.2023.0002

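    Corresponding author:

    E-mail: xa_403@163.com

  • CLC number: TP319.1

  • Abstract:

    Tunneling is an unavoidable problem in the development of topic crawlers. To address it, a self-decision topic crawler algorithm based on the Boyd loop (FCIDOL) is proposed. The algorithm takes the Boyd loop as its basic framework and forms a closed loop of observe, assess, decide, and act. Drawing on the work the crawler has already completed (its memory), it assesses the currently observed state and decides on an aggressive or a conservative strategy, which guides the crawler either to search for new clusters of topic-relevant web pages or to focus on short-term gains. The memory also supplies training material for the assessment network, so the network can be trained online and the crawler can cold-start. Experiments show that, compared with several topic crawler algorithms, the proposed algorithm improves the harvest ratio by more than 7.8% and reduces the number of duplicate links by more than 15.6% across different topic environments.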
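The observe, assess, decide, act loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Observation`, `Memory`, `assess`, `decide`, and `ooda_step` are hypothetical, and the hand-written scoring rule in `assess` merely stands in for the trained assessment network.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """Hypothetical summary of the crawler's current state."""
    topic_relevance: float     # relevance of the current page, in (0, 1)
    has_duplicate_links: bool  # whether already-seen links were found


@dataclass
class Memory:
    """Record of completed work, kept as training material."""
    fragments: List[Tuple[Observation, str]] = field(default_factory=list)


def assess(obs: Observation) -> Tuple[float, float]:
    """Return (aggressive_score, conservative_score).

    Assumption: a fixed rule replaces the trained assessment network.
    """
    aggressive = 1.0 - obs.topic_relevance
    if obs.has_duplicate_links:
        # Duplicate links suggest the local page cluster is exhausted.
        aggressive = min(1.0, aggressive + 0.2)
    conservative = obs.topic_relevance
    return aggressive, conservative


def decide(scores: Tuple[float, float]) -> str:
    """Pick the strategy with the higher assessment score."""
    aggressive, conservative = scores
    return "aggressive" if aggressive > conservative else "conservative"


def ooda_step(
    obs: Observation,
    memory: Memory,
    assess_fn: Callable[[Observation], Tuple[float, float]] = assess,
) -> str:
    """One pass of the loop: assess the observation, decide, and remember."""
    strategy = decide(assess_fn(obs))
    # The stored fragment is what online training would later consume.
    memory.fragments.append((obs, strategy))
    return strategy


if __name__ == "__main__":
    memory = Memory()
    print(ooda_step(Observation(0.9, False), memory))
    print(ooda_step(Observation(0.2, True), memory))
```

In this sketch an "aggressive" decision would send the crawler looking for new clusters of topic-relevant pages, while a "conservative" one keeps it on links with short-term payoff; the action step itself is left out.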

     

  • Figure 1.  Example of using restart strategy
    Figure 2.  Assessment level and corresponding strategy
    Figure 3.  Assessment level and corresponding strategy of continuous work scenario
    Figure 4.  OODA loop of self-decision topic crawler
    Figure 5.  Memory fragments of historical state
    Figure 6.  Online training process of assessment network
    Figure 7.  Performance of MSE errors in score assessment
    Figure 8.  Performance of R² errors in score assessment
    Figure 9.  Experiment results of topic crawler harvest ratio
    Figure 10.  Experiment results of duplicate link discovery frequency by topic crawlers
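The figure captions refer to three evaluation quantities: MSE, R², and harvest ratio. The sketch below uses their common textbook definitions; the exact formulas used in the paper are not shown on this page, so treat these as assumptions. The function names are hypothetical.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between target and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def harvest_ratio(relevant_pages: int, downloaded_pages: int) -> float:
    """Share of downloaded pages that are topic-relevant."""
    if downloaded_pages == 0:
        return 0.0
    return relevant_pages / downloaded_pages


if __name__ == "__main__":
    print(mse([0.0, 1.0], [1.0, 1.0]))
    print(r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
    print(harvest_ratio(30, 120))
```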

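Table 1.  Input and output of assessment network

| Information | Category | Value type |
| --- | --- | --- |
| Topic relevance | Input | Float, (0,1) |
| Web page authority value | Input | Float, (0,1) |
| Relevance change | Input | Boolean, true/false |
| Average topic relevance of parent pages | Input | Float, (0,1) |
| Distance to topic-relevant parent page | Input | Integer, ≥1 |
| Duplicate link existence | Input | Boolean, true/false |
| Predicted topic relevance of child pages | Input | Float, (0,1) |
| Authority value of child pages | Input | Float, (0,1) |
| Aggressive assessment score | Output | Float, (0,1) |
| Conservative assessment score | Output | Float, (0,1) |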
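The eight inputs and two outputs in Table 1 map naturally onto a small feed-forward network. The sketch below is illustrative only: the class and field names, the `1 / distance` scaling of the integer input, the hidden-layer size, and the random untrained weights are all assumptions, and no training step is shown.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AssessmentInput:
    """The eight inputs listed in Table 1 (field names are hypothetical)."""
    topic_relevance: float
    page_authority: float
    relevance_changed: bool
    parent_avg_relevance: float
    distance_to_relevant_parent: int  # integer >= 1
    has_duplicate_links: bool
    child_predicted_relevance: float
    child_authority: float

    def to_vector(self) -> List[float]:
        """Encode booleans as 0/1 and squash the distance into (0, 1]."""
        return [
            self.topic_relevance,
            self.page_authority,
            1.0 if self.relevance_changed else 0.0,
            self.parent_avg_relevance,
            1.0 / self.distance_to_relevant_parent,  # assumed scaling
            1.0 if self.has_duplicate_links else 0.0,
            self.child_predicted_relevance,
            self.child_authority,
        ]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyAssessmentNet:
    """One hidden layer, sigmoid activations, untrained random weights."""

    def __init__(self, n_in: int = 8, n_hidden: int = 8, n_out: int = 2,
                 seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x: List[float]) -> Tuple[float, float]:
        """Return (aggressive_score, conservative_score), each in (0, 1)."""
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
               for row, b in zip(self.w2, self.b2)]
        return out[0], out[1]


if __name__ == "__main__":
    sample = AssessmentInput(0.7, 0.4, True, 0.6, 2, False, 0.5, 0.3)
    print(TinyAssessmentNet().forward(sample.to_vector()))
```

The sigmoid output layer keeps both scores inside (0, 1), matching the value ranges given for the two outputs.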

Table 2.  Training errors of BPNN with different numbers of hidden nodes (training error FMSE)

| Hidden nodes | k−1 R | k−1 C | k−2 R | k−2 C | k−3 R | k−3 C | k−4 R | k−4 C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 0.0137 | 0.0155 | 0.0152 | 0.0146 | 0.0144 | 0.0148 | 0.0152 | 0.0185 |
| 5 | 0.0323 | 0.0774 | 0.0925 | 0.0448 | 0.0674 | 0.0855 | 0.0898 | 0.0479 |
| 6 | 0.1001 | 0.1299 | 0.1696 | 0.1618 | 0.1063 | 0.1207 | 0.2183 | 0.1553 |
| 7 | 0.0279 | 0.0980 | 0.0109 | 0.0532 | 0.0079 | 0.0846 | 0.0401 | 0.0269 |
| 8 | 0.0002 | 0.0009 | 0.0065 | 0.0009 | 0.0008 | 0.0069 | 0.0023 | 0.0012 |
| 9 | 0.0161 | 0.5245 | 0.0389 | 0.2393 | 0.0404 | 0.4237 | 0.0255 | 0.2361 |
| 10 | 0.0230 | 0.0670 | 0.0278 | 0.0256 | 0.0344 | 0.0365 | 0.0109 | 0.0438 |
| 11 | 0.3730 | 0.2327 | 0.0713 | 0.5661 | 0.0879 | 0.0537 | 0.0261 | 0.1565 |
| 12 | 0.0839 | 0.4194 | 0.2262 | 0.1135 | 0.2022 | 0.1723 | 0.7254 | 0.0333 |
| 13 | 0.0826 | 0.2577 | 0.2488 | 0.1733 | 0.1442 | 0.1556 | 0.1389 | 0.9764 |
| 14 | 0.0966 | 0.3991 | 0.1656 | 0.1491 | 0.1941 | 0.3161 | 0.5263 | 0.2881 |
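One way to read Table 2 is to pick the hidden-node count with the lowest error across all eight columns. The sketch below does that with the table's own values. Averaging the columns is an assumed selection criterion, since this page does not state how the authors chose, and the names `TRAINING_ERRORS` and `best_hidden_nodes` are hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Rows of Table 2: hidden-node count -> errors for
# (k-1 R, k-1 C, k-2 R, k-2 C, k-3 R, k-3 C, k-4 R, k-4 C).
TRAINING_ERRORS: Dict[int, List[float]] = {
    4: [0.0137, 0.0155, 0.0152, 0.0146, 0.0144, 0.0148, 0.0152, 0.0185],
    5: [0.0323, 0.0774, 0.0925, 0.0448, 0.0674, 0.0855, 0.0898, 0.0479],
    6: [0.1001, 0.1299, 0.1696, 0.1618, 0.1063, 0.1207, 0.2183, 0.1553],
    7: [0.0279, 0.0980, 0.0109, 0.0532, 0.0079, 0.0846, 0.0401, 0.0269],
    8: [0.0002, 0.0009, 0.0065, 0.0009, 0.0008, 0.0069, 0.0023, 0.0012],
    9: [0.0161, 0.5245, 0.0389, 0.2393, 0.0404, 0.4237, 0.0255, 0.2361],
    10: [0.0230, 0.0670, 0.0278, 0.0256, 0.0344, 0.0365, 0.0109, 0.0438],
    11: [0.3730, 0.2327, 0.0713, 0.5661, 0.0879, 0.0537, 0.0261, 0.1565],
    12: [0.0839, 0.4194, 0.2262, 0.1135, 0.2022, 0.1723, 0.7254, 0.0333],
    13: [0.0826, 0.2577, 0.2488, 0.1733, 0.1442, 0.1556, 0.1389, 0.9764],
    14: [0.0966, 0.3991, 0.1656, 0.1491, 0.1941, 0.3161, 0.5263, 0.2881],
}


def best_hidden_nodes(errors: Dict[int, List[float]]) -> int:
    """Return the hidden-node count with the lowest mean training error."""
    return min(errors, key=lambda n: mean(errors[n]))


if __name__ == "__main__":
    print(best_hidden_nodes(TRAINING_ERRORS))
```

With the values in Table 2 this selects 8 hidden nodes. The choice is not sensitive to the averaging assumption: the largest error in the 8-node row (0.0069) is smaller than the smallest error in every other row.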
Publication history
  • Received:  2023-01-04
  • Accepted:  2023-03-03
  • Published online:  2023-03-24
  • Issue published:  2025-02-28
