-
摘要:
过程挖掘算法是从管理信息系统产生的事件日志中提取信息、发现知识并实现工作流建模的工具,也是目前工作流最主要的建模工具。然而现有的过程挖掘算法存在准确度较低、运行时间长和拟合度过高等问题,影响最终工作流模型的准确率。提出了一种基于统计
α 算法的过程挖掘算法,在保证算法较高的准确率和合适的拟合度的同时,降低算法运行时间,保证了算法的效率。首先,提出了重名活动识别算法,作为过程挖掘的预处理活动,提高了算法的准确性;其次,提出了统计α 算法作为过程挖掘核心算法,有效消除了事件日志中噪声的影响;最后,提出了新的非自由选择结构识别算法,进一步提高了算法的鲁棒性和准确率。通过仿真实验和真实案例验证了该算法在准确率和运行时间上的优越性。Abstract:Workflow technology is widely used in business process management. However, there are still many problems during the execution of business process because of the imperfect workflow model. Process mining is the most useful tool of workflow modeling, which can obtain objective and valuable information from event logs and build process model. Nevertheless, the existing process mining algorithms still have some problems, such as low accuracy, long operation time and overfitting, which will decreace the accuracy of the workflow model. This paper proposed a new process mining algorithm based on statistical
α -algorithm, which can not only ensure the accuracy and suitable fitness, but also decrease the operation time. First, cognominal activity identification rules were proposed to be the pre-treated process of process mining, which could improve the accuracy of algorithm. Second, statisticalα -algorithm was proposed as the core algorithm of process mining to eliminate the influence of noise in event logs. Moreover, a new algorithm was proposed to identify non-free-choice constructs, which improved the robustness and accuracy of the algorithm. The accuracy and efficiency of the algorithm are verified by simulation and real case. -
表 1 直接依赖关系分类与定义
Table 1. Classification and definition of direct relations
直接依赖关系分类 符号 定义 顺序关系 a > Wb a, b∈σ={…, a, b, …} 因果关系 a→Wb a > Wb∧b≯Wa 选择关系 a#Wb a≯Wb∧b≯Wa 并行关系 a‖Wb a > Wb∧b > Wa 表 2 某临床路径事件日志部分内容
Table 2. Event log of a clinical pathway (part)
活动序号 活动名称 25 医患交流 26 手术室准备 27 药物准备 28 手术工具准备 29 最后确认 手术阶段 41 术后整理 42 送病人回病房 43 送水送药 44 病人休息 45 医患交流 46 病情讨论 表 3 重名活动识别算法验证实验数据安排
Table 3. Experimental data arrangement for cognominal activity identification algorithm verification
数据编号 流程轨迹长度 事件日志规模 噪声类型 噪声规模/% 优先度等级 C1 22 200 ⑤ 5 Level1 C2 22 200 ⑤ 10 Level1 C3 22 200 ⑤ 20 Level1 C4 42 200 ⑤ 5 Level1 C5 42 200 ⑤ 10 Level1 C6 42 200 ⑤ 20 Level1 表 4 重名活动识别算法对比实验结果
Table 4. Comparison of experimental results of cognominal activity identification algorithm
数据编号 重名活动数量 本文算法 文献[6]算法 识别率/% 准确率/% 运行时间/ms 识别率/% 准确率/% 运行时间/ms C1 2对 100 100 10.12 100 100 15.12 C2 3对 100 100 10.14 100 100 15.36 C3 5对 100 100 10.80 100 100 15.66 C4 3对 100 100 21.33 100 100 30.54 C5 5对 100 100 21.35 100 100 30.87 C6 9对 100 100 21.55 100 100 31.03 表 5 统计α算法验证实验数据安排
Table 5. Experimental data arrangement for statistical α-algorithm verification
数据编号 流程轨迹长度 事件日志规模 噪声类型 噪声规模/% 优先度等级 S1~S5 42 200~1 000 ①~④ 5 Level1 S6~S10 42 200~1 000 ①~④ 10 Level1 S11~S15 42 200~1 000 ①~④ 20 Level1 表 6 统计α算法对比实验结果
Table 6. Comparison of experimental results of statistical α-algorithm
数据编号 本文算法(α=0.05) 经典α算法[9] HAC-PSO[16] 泛化性/% 准确率/% 运行时间/ms 泛化性/% 准确率/% 运行时间/ms 泛化性/% 准确率/% 运行时间/ms S1 99.10 97.98 3 350 105.00 92.98 2 289 99.65 98.56 12 651 S2 99.16 98.88 5 830 104.90 93.86 4 820 99.58 98.96 33 624 S3 99.52 98.82 7316 104.88 93.82 9 213 99.58 99.19 75 623 S4 99.36 99.25 9 833 104.95 94.22 15 632 99.69 99.69 136 362 S5 99.88 99.65 11 369 104.36 94.65 30 205 99.78 99.96 206 534 S6 101.02 97.23 4 069 111.02 86.88 2 441 99.16 98.03 12 543 S7 100.56 97.79 6 151 110.56 87.25 4 922 99.23 98.15 34 123 S8 100.23 98.56 8 536 110.23 88.36 9 365 99.26 98.45 76 245 S9 100.20 98.33 10 623 110.20 89.12 15 664 99.32 98.32 132 013 S10 99.93 99.21 12 988 109.93 89.35 30 157 99.49 98.98 214 756 S11 100.98 97.56 4 419 114.98 82.36 2 674 98.52 97.23 13 025 S12 100.56 97.88 6 496 114.56 83.65 5 155 98.62 97.53 34 216 S13 100.23 98.65 9 441 114.43 82.98 9 998 99.15 98.68 77 487 S14 100.59 98.98 14 635 115.09 84.98 17 897 99.48 99.32 140 259 S15 100.20 99.52 19 062 114.35 83.66 32 390 99.60 99.56 219 874 表 7 非自由选择结构识别算法验证实验数据安排
Table 7. Experimental data arrangement for non-free-choice construct identification algorithm verification
数据编号 流程轨迹长度 事件日志规模 优先度等级 F1 50 600 Level4 F2 60 600 Level4 F3 70 600 Level4 F4 80 600 Level4 F5 90 600 Level4 表 8 非自由选择结构识别算法对比实验结果
Table 8. Comparison of experimental results of non-free-choice construct identification algorithm
数据编号 非自由选择结构数量 本文算法 α++算法[12] 识别数目 准确率/% 间接依赖数目 运行时间/ms 识别数目 准确率/% 间接依赖数目 运行时间/ms F1 5 5 100 14 1 025 5 100 20 1 562 F2 6 6 100 18 1 365 6 100 32 1 845 F3 7 7 100 20 1 852 7 100 36 2 214 F4 8 7 100 22 1 955 8 100 42 2 456 F5 9 9 100 28 2 123 9 100 50 2 814 表 9 急性阑尾炎病种事件日志信息
Table 9. Description for event logs of acute appendicitis
病种名称 流程轨迹长度 事件日志规模 重名活动数量 噪声类型 噪声规模/% 非自由选择结构数量 急性阑尾炎 53 200 10对 ①~④ 6 2 表 10 真实数据实验结果
Table 10. Experimental results for real data
-
[1] 闻立杰. 基于工作流网的过程挖掘算法研究[D]. 北京: 清华大学, 2007. http://cdmd.cnki.com.cn/Article/CDMD-10003-2008088148.htmWEN L J. Studies on algorithms for process mining based on WF-net[D]. Beijing: Tsinghua University, 2007(in Chinese). http://cdmd.cnki.com.cn/Article/CDMD-10003-2008088148.htm [2] 曾庆田.过程挖掘的研究现状与问题综述[J].系统仿真学报, 2007, 19(增刊1):275-280. http://www.cqvip.com/QK/96569X/2007A01/25333075.htmlZENG Q T.A survey of research issues and approaches on process mining[J].Journal of System Simulation, 2007, 19(s1):275-280(in Chinese). http://www.cqvip.com/QK/96569X/2007A01/25333075.html [3] COOK J E, WOLF A L. Automating process discovery through event-data analysis[C]//Proceedings of the 17th International Conference on Software Engineering. New York: ACM, 1995: 73-82. [4] AGRAWAL R, GUNOPULOS D, LEYMANN F. Mining process models from workflow logs[C]//International Conference on Extending Database Technology. Berlin: Springer, 1998: 467-483. [5] PINTER S S, GOLANI M.Discovering workflow models from activities' lifespans[J].Computers in Industry, 2004, 53(3):283-296. doi: 10.1016/j.compind.2003.10.004 [6] HERBST J, KARAGIANNIS D.Workflow mining with InWoLvE[J].Computers in Industry, 2004, 53(3):245-264. doi: 10.1016/j.compind.2003.10.002 [7] SCHIMM G.Mining exact models of concurrent workflows[J].Computers in Industry, 2004, 53(3):265-281. doi: 10.1016/j.compind.2003.10.003 [8] GRECO G, GUZZO A, PONTIERI L, et al.Discovering expre-ssive process models by clustering log traces[J].IEEE Transactions on Knowledge and Data Engineering, 2006, 18(8):1010-1027. doi: 10.1109/TKDE.2006.123 [9] VAN DER AALST W M P, WEIJTERS T, MARUSTER L.Workflow mining:Discovering process models from event logs[J].IEEE Transactions on Knowledge and Data Engineering, 2004, 16(9):1128-1142. doi: 10.1109/TKDE.2004.47 [10] DEMEDEIROS A K A, DONGEN B F, VAN DER AALST W M P, et al. Process mining for ubiquitous mobile systems: An overview and a concrete algorithm[C]//International Workshop on Ubiquitous Mobile Information and Collaboration Systems. Berlin: Springer, 2004: 151-165. [11] WEN L, WANG J, VAN DER AALST W M P, et al.A novel approach for process mining based on event types[J].Journal of Intelligent Information Systems, 2009, 32(2):163-190. doi: 10.1007/s10844-007-0052-1 [12] WEN L, VAN DER AALST W M P, WANG J, et al.Mining process models with non-free-choice constructs[J].Data Mi-ning and Knowledge Discovery, 2007, 15(2):145-180. doi: 10.1007/s10618-007-0065-y [13] WEN L, WANG J, SUN J. Mining invisible tasks from event logs[M]//LI Q, FENG L, PEI J, et al. Advances in data and web management. Berlin: Springer, 2007: 358-365. [14] LI J, LIU D, YANG B. Process mining: Extending α-algorithm to mine duplicate tasks in process logs[M]//SUI Q, YANG D, WANG T. Advances in web and network technologies, and information management. Berlin: Springer, 2007: 396-407. [15] DEMEDEIROS A K A, WEIJTERS A J M M, VAN DER AALST W M P. Using genetic algorithms to mine process models: Representation, operators and results[R]. Eindhoven: Eindhoven University of Technology, 2005. [16] 李莉, 李洪奇, 谢绍龙.基于变异粒子群算法的过程挖掘[J].计算机集成制造系统, 2012, 18(3):634-638. http://www.cims-journal.cn/CN/abstract/abstract3342.shtmlLI L, LI H Q, XIE S L.Process mining based on mutation-particle swarm optimization[J].Computer Integrated Manufacturing System, 2012, 18(3):634-638(in Chinese). http://www.cims-journal.cn/CN/abstract/abstract3342.shtml [17] 宋炜, 刘强.基于模拟退火算法的过程挖掘研究[J].电子学报, 2009, 37(增刊1):135-139. http://www.cnki.com.cn/Article/CJFDTOTAL-DZXU2009S1026.htmSONG W, LIU Q.Business process mining based on simulated annealing[J].Acta Electronica Sinica, 2009, 37(s1):135-139(in Chinese). http://www.cnki.com.cn/Article/CJFDTOTAL-DZXU2009S1026.htm [18] 黄黎, 谭文安, 许小媛.一种基于时序行为的流过程协同重构算法[J].计算机工程与科学, 2017, 39(5):897-903. http://www.xueshu.com/jsjgcykx/201705/29184488.htmlHUANG L, TAN W A, XU X Y.Cooperative streaming process reengineering based on sequential behaviors[J].Computer Engineering & Science, 2017, 39(5):897-903(in Chinese). http://www.xueshu.com/jsjgcykx/201705/29184488.html [19] DEMEDEIROS A K A, DONGEN B F, VAN DER AALST W M P, et al. Process mining: Extending the α-algorithm to mine short loops[R]. Eindhoven: Eindhoven University of Technology, 2004. [20] VAN DER AALST W M P, DONGEN B F, HERBST J, et al.Workflow mining:A survey of issues and approaches[J].Data & Knowledge Engineering, 2003, 47(2):237-267. https://wenku.baidu.com/view/5012e6d86f1aff00bed51e06.html