北京航空航天大学学报 ›› 2011, Vol. 37 ›› Issue (5): 564-568.

• 论文 • 上一篇    下一篇

基于前缀树的数据流容错概要结构构造

由育阳1, 张健沛1, 杨志宏2, 由勇3   

  1. 1. 哈尔滨工程大学计算机学院, 哈尔滨 150001;
    2. 中国医学科学院药用植物研究所, 北京 100193;
    3. 空军航空医学研究所, 北京 100142
  • 收稿日期:2010-11-02 出版日期:2011-05-30 发布日期:2011-05-30
  • 作者简介:由育阳(1977-),男,黑龙江哈尔滨人,博士生,arthurwy@163.com.
  • 基金资助:

    国家自然科学基金资助项目(61073041)

Construction of fault-tolerant synopsis over data stream based on prefix-tree

You Yuyang1, Zhang Jianpei1, Yang Zhihong2, You Yong3   

  1. 1. College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China;
    2. Institute of Medicinal Plant Development CAMS and PUMC, Beijing 100193, China;
    3. Institute of Aviation Medicine, Beijing 100036, China
  • Received:2010-11-02 Online:2011-05-30 Published:2011-05-30

摘要: 应用于数据流环境的数据挖掘算法应首要考虑算法的时空复杂性,而要实现消耗巨大计算资源的容错模式挖掘则更要专注于算法的效率.容错模式挖掘是为了从被噪声干扰的真实世界数据中获取允许一定程度错配的、更加泛化的有用知识.提出一种新的单遍历、高压缩的容错前缀树形概要结构DSFT-tree(Data Stream Fault-Tolerant Frequent Pattern Tree),用来捕捉最近到达的数据流中的数据元素,并且能够高效移除过期数据,实现最大限度地降低计算资源消耗.利用滑动窗指针和位向量表达法实现容错树形概要结构的高效重构,并进一步基于滑动窗口技术实现了数据流环境下的容错频繁项挖掘.实验采用IBM数据发生器产生事务数据,在合理时间内最终挖掘频繁项的数量为FP-stream算法的1.5倍.

Abstract: Complexity of data mining algorithm over data stream is the most important and it should be more focused on algorithm efficiency because of the great consumption of algorithm resources. Fault-tolerant frequent pattern mining is more generalized and suitable for extracting interesting knowledge from real-world data stream polluted by noise. An algorithm, called data stream fault-tolerant frequent pattern tree(DSFT-tree ), was proposed. It could achieve a frequency-descending and highly compact prefix-tree structure with a single-pass to capture fault-tolerant frequent itemsets in recent sliding window. To completely and efficiently perform the tree restructuring operation, an efficient mechanism based on sliding window pointer and bit-vector representation were utilized to restructure the tree. The efficient reconstruction mechanism greatly reduced the consumption of calculation resources and achieved fault-tolerant frequent itemsets mining. Experimental transaction database was generated by IBM synthetic data generator. The number of frequent itemsets extracted by DSFT-tree is 1.5 fold greater than that extracted by FP-stream. Experimental results show that developed algorithm is an efficient fault-tolerant synopsis.

中图分类号: 


版权所有 © 《北京航空航天大学学报》编辑部
通讯地址:北京市海淀区学院路37号 北京航空航天大学学报编辑部 邮编:100191 E-mail:jbuaa@buaa.edu.cn
本系统由北京玛格泰克科技发展有限公司设计开发