Abstract: In a distributed, complex network environment, accurately and comprehensively collecting the behavior data of large numbers of users as they browse a website, together with the website's process data, and storing them efficiently are prerequisites for user behavior analysis. To handle the diversity of data types and the differences in how they are stored, to improve data retrieval efficiency, and to support user behavior analysis tailored to the needs of individual enterprises, this paper designs a white-box user trace collection and storage system. When users access a Web server, interaction/transaction data and user operations are generated, and browsing the website involves files of many types, such as pictures, videos, and product descriptions. These interfaces and data are called user browsing traces, and the operation sequence records the actual order of the user's actions. Analyzing the user data together with the operation sequences can accurately reflect user characteristics. The collection model is built on an interface window tree and provides a unified data access interface; data are stored in different locations according to their type, so that user traces are collected completely. An application passes parameters that specify the storage location in which the database file is created, and through the access interface user data can be stored and retrieved by type and on demand. The system addresses the capture, storage, and retrieval of Internet-oriented user interaction traces, with good accuracy and completeness.
Key words: user behavior; user trace collection; interface window tree; unique storage; unstructured data
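The unified access interface described in the abstract, which stores data in different locations depending on its type and creates the database file at a location supplied by the application, can be illustrated with a minimal sketch. Everything named here (`TraceStore`, `put`, `get`, the `records` table, the `files` directory) is hypothetical and chosen only for illustration; the paper's actual interface is not given in this section.

```python
import sqlite3
from pathlib import Path


class TraceStore:
    """Hypothetical unified access interface (illustrative only).

    Text values are kept in a SQLite database file; binary payloads
    (pictures, video, ...) are written to a directory, and only their
    path is recorded in the database.
    """

    def __init__(self, root):
        # The caller passes the storage location, as the abstract describes.
        self.root = Path(root)
        self.files_dir = self.root / "files"
        self.files_dir.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(str(self.root / "traces.db"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS records "
            "(key TEXT PRIMARY KEY, kind TEXT, value TEXT)"
        )

    def put(self, key, data):
        """Store data under key, routing by its Python type."""
        if isinstance(data, bytes):
            target = self.files_dir / key
            target.write_bytes(data)
            kind, value = "file", str(target)
        else:
            kind, value = "text", str(data)
        self.db.execute(
            "INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
            (key, kind, value),
        )
        self.db.commit()

    def get(self, key):
        """Return the stored value: bytes for files, str for text."""
        row = self.db.execute(
            "SELECT kind, value FROM records WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        kind, value = row
        return Path(value).read_bytes() if kind == "file" else value

    def close(self):
        self.db.close()
```

The sketch assumes that keys are plain file names without path separators. A caller sees a single `put`/`get` pair regardless of where the data physically ends up, which is the point of a unified interface.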
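Table 1. Advantages and disadvantages of four collection methods

| Collection method | Advantages | Disadvantages |
|---|---|---|
| Web server logs | Basic fields are easy to obtain; data are semi-structured and convenient to process; search-engine crawler records can be analyzed; file download content is reflected | Cached content cannot be captured; business behavior records cannot be captured; only IP addresses are identified, so user identification is inaccurate; cross-domain access is hard to detect; data acquisition is incomplete |
| JavaScript page tags | Collected content is customizable; front-end user behavior data are recorded; Cookie caching can be handled; collected behavior data are accurate | Users' JavaScript settings affect data collection; the script load on the website increases; file download records cannot be captured; data acquisition is incomplete |
| Packet sniffer | Data are obtained with high real-time performance; cross-domain access is easy to detect | High cost; cache and proxy records cannot be captured; user privacy is not protected; data acquisition is incomplete |
| Proxy server logs | Supports SSL encoding; supports Cookie management; supports JavaScript | Data acquisition is incomplete when dynamically created links do not point to the proxy; collection is slow and inefficient |
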
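Table 3. Key fields of Operation

| Field | Type | Description |
|---|---|---|
| operationId | varchar | Operation event ID |
| interfaceNodeID | varchar | Interface to which the operation belongs |
| menuID | varchar | Menu operation ID |
| apiID | varchar | API ID |
| userOperation | varchar | User operation |
| eventType | varchar | Event type |
| relativity | varchar | Front-end/back-end relativity |
| inparam | varchar | Input parameter |
| inparamType | varchar | Input parameter type |
| outparam | varchar | Output parameter |
| outparamType | varchar | Output parameter type |
| timeStamp | timestamp | Time at which the event occurred |
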
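Table 2. Key fields of userAccesslog

| Field | Type | Description |
|---|---|---|
| userID | varchar | User ID |
| OperationId | varchar | User operation event ID |
| timeStamp | timestamp | Time at which the event occurred |
| IP | varchar | Current IP address |
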
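The field lists of Tables 2 and 3 can be turned into a small schema sketch. The example below uses Python's standard `sqlite3` module and is an illustration under stated assumptions, not the authors' implementation: the choice of SQLite, the primary key on `operationId`, the reference from `userAccesslog.OperationId` to `Operation.operationId`, and the helper names `create_trace_db` and `operations_for_user` are all assumptions. The column names are taken from the tables, with `varchar` mapped to `TEXT`.

```python
import sqlite3


def create_trace_db(path=":memory:"):
    """Create the two tables at the storage location given by the caller."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS Operation (
            operationId     TEXT PRIMARY KEY,  -- assumed key
            interfaceNodeID TEXT,
            menuID          TEXT,
            apiID           TEXT,
            userOperation   TEXT,
            eventType       TEXT,
            relativity      TEXT,
            inparam         TEXT,
            inparamType     TEXT,
            outparam        TEXT,
            outparamType    TEXT,
            timeStamp       TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS userAccesslog (
            userID      TEXT,
            OperationId TEXT REFERENCES Operation(operationId),  -- assumed link
            timeStamp   TIMESTAMP,
            IP          TEXT
        );
        """
    )
    return conn


def operations_for_user(conn, user_id):
    """Return one user's operations in time order (the operation sequence)."""
    return conn.execute(
        "SELECT o.operationId, o.userOperation, o.apiID "
        "FROM userAccesslog AS l "
        "JOIN Operation AS o ON l.OperationId = o.operationId "
        "WHERE l.userID = ? ORDER BY l.timeStamp",
        (user_id,),
    ).fetchall()
```

Joining the access log to the operation table on the operation ID is one plausible way to recover a user's operation sequence from these two tables.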
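Table 4. User behavior record data items and format specification

| Data item | Format | Example | Description |
|---|---|---|---|
| Time | time | 2018-03-20 17:13:13 | Time at which the log record was generated |
| Log ID | logID | [95C4C3AE62D41E3213C3007F3] | Unique ID assigned to each log record, enclosed in [] and usable for later searches |
| Operation behavior information: Input | requestURL | "requestURL":"http://10.2.8.166:11680/portal/login" | URL requested by the visitor |
| Operation behavior information: Input | IP | "IP":"10.2.8.171" | Real IP address of the visitor |
| Operation behavior information: Input | varchar | alice,pwd123456 | Input values: user name and password |
| Operation behavior information: Output | JSON, enclosed in {}, with attributes separated by "," | {"userID":idUser0001,"status":200,"msg":"注册成功"} | Returned user ID, registration-success status, and message ("registration successful") |
| Interface information: window | Interface window | LoginNode,SacrificesNode | SacrificesNode is the child node of LoginNode |
| Interface information: widgets | Widget sequence | Username text box(idUser0001,username,Input,alice,True,True) Password text box(id0000001,password,Input,pwd123456,True,True) Register button(id0000002,registerBtn,getApi,null,True,False) User avatar box(headImg) | Widget ID; widget name; widget type; current value; visibility; editability |
| menuID | varchar | login | Login/registration menu |
| apiID | varchar | loginApi | API corresponding to the operation |
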
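The widget sequence in Table 4 follows a fixed six-field layout (widget ID, name, type, current value, visibility, editability), so it can be parsed mechanically. The sketch below is a hedged illustration: the `Widget` class, the `parse_widgets` function, and the decision to skip entries that do not carry all six fields (such as the avatar entry with the single field `headImg`) are assumptions, not part of the source. It also assumes that field values contain no commas or parentheses.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Widget:
    label: str            # text before the parentheses
    widget_id: str
    name: str
    widget_type: str
    value: Optional[str]  # None when the log shows "null"
    visible: bool
    editable: bool


# One entry looks like: label(field1,field2,...)
_ENTRY = re.compile(r"([^()]+)\(([^()]*)\)")


def parse_widgets(sequence: str) -> List[Widget]:
    """Parse a Table 4 style widget sequence into Widget records."""
    widgets = []
    for label, body in _ENTRY.findall(sequence):
        fields = [f.strip() for f in body.split(",")]
        if len(fields) != 6:
            continue  # assumption: ignore entries without all six fields
        wid, name, wtype, value, visible, editable = fields
        widgets.append(
            Widget(
                label=label.strip(),
                widget_id=wid,
                name=name,
                widget_type=wtype,
                value=None if value == "null" else value,
                visible=visible == "True",
                editable=editable == "True",
            )
        )
    return widgets
```

For example, `parse_widgets("Register button(id0000002,registerBtn,getApi,null,True,False)")` yields one `Widget` whose `value` is `None` and whose `editable` flag is `False`.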