Construction of fault-tolerant synopsis over data stream based on prefix-tree
-
摘要: 应用于数据流环境的数据挖掘算法应首要考虑算法的时空复杂性,而要实现消耗巨大计算资源的容错模式挖掘则更要专注于算法的效率.容错模式挖掘是为了从被噪声干扰的真实世界数据中获取允许一定程度错配的、更加泛化的有用知识.提出一种新的单遍历、高压缩的容错前缀树形概要结构DSFT-tree(Data Stream Fault-Tolerant Frequent Pattern Tree),用来捕捉最近到达的数据流中的数据元素,并且能够高效移除过期数据,实现最大限度地降低计算资源消耗.利用滑动窗指针和位向量表达法实现容错树形概要结构的高效重构,并进一步基于滑动窗口技术实现了数据流环境下的容错频繁项挖掘.实验采用IBM数据发生器产生事务数据,在合理时间内最终挖掘频繁项的数量为FP-stream算法的1.5倍.Abstract: Complexity of data mining algorithm over data stream is the most important and it should be more focused on algorithm efficiency because of the great consumption of algorithm resources. Fault-tolerant frequent pattern mining is more generalized and suitable for extracting interesting knowledge from real-world data stream polluted by noise. An algorithm, called data stream fault-tolerant frequent pattern tree(DSFT-tree ), was proposed. It could achieve a frequency-descending and highly compact prefix-tree structure with a single-pass to capture fault-tolerant frequent itemsets in recent sliding window. To completely and efficiently perform the tree restructuring operation, an efficient mechanism based on sliding window pointer and bit-vector representation were utilized to restructure the tree. The efficient reconstruction mechanism greatly reduced the consumption of calculation resources and achieved fault-tolerant frequent itemsets mining. Experimental transaction database was generated by IBM synthetic data generator. The number of frequent itemsets extracted by DSFT-tree is 1.5 fold greater than that extracted by FP-stream. Experimental results show that developed algorithm is an efficient fault-tolerant synopsis.
-
Key words:
- data stream /
- synopsis /
- fault-tolerant frequent pattern /
- prefix-tree
-
[1] Beringer J,Hullermeier E.Online clustering of parallel data streams[J].Data & Knowledge Engineering,2006,58(2):180- 204 [2] Li H F,Lee S Y.Mining frequent itemsets over data streams using efficient window sliding techniques [J].Expert Systems with Applications,2009,36(2):1466-1477 [3] Chang J H,Lee W S.Online data stream mining of recent frequent itemsets by sliding window method[J].Journal of Information Science,2005,31(2):76-90 [4] Yu J X,Chong Z H,Lu H J,et al.A false negative approach to mining frequent itemsets from high speed transactional data streams[J].Information Sciences,2006,176 (14):1986-2015 [5] Yang C,Fayyad U,Bradley P S.Efficient discovery of error-tolerant frequent itemsets in high dimensions //Proc of 2001 ACM Int Conf on Knowledge Discovery in Databases.San Francisco,CA:Association for Computing Machinery,2001:194-203 [6] Bashir S, Halim Z,Baig A R.Mining fault tolerant frequent patterns using pattern growth approach //AICCSA 08-6th IEEE/ACS International Conference on Computer Systems and Applications.Doha,Qatar:Inst of Elec and Elec Eng Computer Society,2008:172-179 [7] Zhang S C,Zhang J L,Zhang C Q.EDUA:an efficient algorithm for dynamic database mining[J].Information Sciences,2007,177(13):2756-2767 [8] Chi Y,Wang H,Yu P S,et al.Catch the moment:maintaining closed frequent itemsets over a data stream sliding window[J].Knowledge and Information Systems,2006,10(3):265-294
点击查看大图
计量
- 文章访问数: 2567
- HTML全文浏览量: 42
- PDF下载量: 867
- 被引次数: 0