Research on Ensemble Learning Methods for Data Stream Classification (Graduation Thesis)

2021-11-06 08:11

Abstract

In the era of big data, data is generated continuously and in effectively unbounded volume. Data is no longer merely a record; it is a carrier of many kinds of information, and its range of applications keeps growing. Data streams in particular are a new form of data: they arrive quickly and in real time, are massive in volume, and are hard to store. Traditional algorithms are designed for static data and do not carry over well to classifying dynamic data streams, so algorithms suited to data stream classification have become both a research hotspot and an open difficulty.

Online ensemble algorithms differ from batch ensemble algorithms. They retain the strong generalization performance of batch ensembles but do not need the repeated scanning and sampling that batch ensembles require. An online ensemble algorithm trains its model in a single linear scan, which makes it well suited to data stream classification.

On this basis, this thesis studies how to use online ensemble algorithms to process data streams. Because training completes in a single linear scan, the model can be trained as the stream arrives, which saves the hardware space otherwise needed to store the data. The main work is as follows. First, taking bagging and boosting as the basis of the online ensemble algorithms, the thesis summarizes the research background and significance of ensemble algorithms for data stream classification, analyzes in detail the characteristics of data streams and the difficulties of processing them, and relates these to ensemble learning methods. Next, it constructs online algorithms based on bagging and boosting and analyzes the batch and online versions of both algorithms theoretically, comparing the strengths and weaknesses of the different versions and algorithms. Finally, it implements the online bagging and online boosting algorithms and designs comparative experiments around them: comparing online and batch algorithms, comparing the effect of different data sets on the online algorithms, and comparing performance before and after data balancing. A system is also designed to test the stability of the algorithms.

Key words: data stream; ensemble classification; online ensemble algorithm; bagging; boosting

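Chapter 1 Introduction

1.1 Research Background and Significance

1.2 Research Status at Home and Abroad

1.3 Research Content and Organization

Chapter 2 Principles and Technical Foundations of Ensemble Algorithms

2.1 Introduction

2.2 Ensemble Learning Algorithms

2.2.1 Basic Concepts of Ensemble Algorithms

2.2.2 The bagging Algorithm

2.2.3 The boosting Algorithm

2.3 Online Ensemble Learning Algorithms

2.3.1 The online bagging Algorithm

2.3.2 The online boosting Algorithm

2.4 Summary

Chapter 3 Design and Implementation of the Online Ensemble Algorithms

3.1 Introduction

3.2 Design and Implementation of the online bagging Algorithm

3.2.1 Algorithm Framework

3.2.2 Algorithm Implementation

3.3 Design and Implementation of the online boosting Algorithm

3.3.1 Algorithm Framework

3.3.2 Algorithm Implementation

3.4 Summary

Chapter 4 Design and Implementation of the Data Stream Classification System

4.1 Introduction

4.2 Overall System Design

4.2.1 System Design Goals and Functions

4.2.2 Technical Approach and System Framework Diagram

4.2.3 User Interface Design

4.3 Implementation of Functional Modules

4.3.1 Data Import Module

4.3.2 Data Preprocessing Module

4.3.3 Model Training/Testing Module

4.4 Experimental Analysis

4.4.1 Experimental Objectives and Data Resources

4.4.2 Experimental Environment and Procedure

4.4.3 Experimental Results and Analysis

4.5 System Testing

4.5.1 Test Environment

4.5.2 Test Cases

4.5.3 Test Results

4.6 Summary

Chapter 5 Conclusion

5.1 Summary of the Thesis

5.2 Future Work

References

Acknowledgments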

Chapter 1 Introduction

1.1 Research Background and Significance

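In the era of big data, massive amounts of data are generated at every moment, and data volumes have grown from the TB and PB scale to the EB and ZB scale. With a change of this magnitude, earlier processing approaches are no longer adequate. The way data is produced and exchanged has changed as well. Networks used to be small and transmission slow, so in most cases data was exchanged through static storage media such as hard disks, optical discs, or USB flash drives. Today networks are larger and transmission is faster, the form of data has changed, and data exchange has become dynamic exchange over communication networks. It can take place anytime and anywhere, and it is more efficient.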

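These changes in the form and scale of data exchange bring new challenges to data mining. With statically stored data, traditional methods can scan repeatedly and sample many times for analysis. That works for static data held on disk or in a database, but it may not keep up with data that arrives over a communication network quickly, in real time, and in massive, effectively unbounded volume. One might store the dynamic data first and then process it like static data. That treats the symptom rather than the cause: storing the data consumes a large amount of storage hardware and adds to the total processing time, and it does not fundamentally solve the problems caused by the change in data form and volume.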

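Solving the problem requires first identifying it. Large data volume can be handled by upgrading hardware to raise throughput and processing speed; the real challenge for algorithms comes from the change in the form of data. Data that arrives quickly and in real time over a communication network is termed a data stream. A data stream^[1] is a new form of data distinct from static data. It is produced continuously, like flowing water, and originates from applications in many fields, such as online banking transactions, geological exploration data, and online shopping transactions. Data streams are treated as a new form of data mainly because of their stream characteristics: fast, real-time arrival and very large volume. In a data stream, the underlying concept distribution can also change over time. Traditional data mining algorithms suit static data well, where multiple passes and repeated scanning and sampling are easy to perform, but they are poorly suited to data streams. Because data streams are large and hard to store, a processing algorithm must train its model in real time with a single scan and must detect concept drift promptly, which improves the reliability and effectiveness of data mining^[2].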
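The single-scan requirement can be illustrated with online bagging, one of the two algorithms this thesis studies. The following is only a minimal sketch, not the thesis's implementation: it assumes the Poisson(1) replication scheme commonly used for online bagging, and `MajorityClassLearner`, `OnlineBagging`, and `prequential_accuracy` are hypothetical names introduced here for illustration. Each example is used for prediction first and for training second, then discarded, so nothing has to be stored. Concept drift detection is not covered by this sketch.

```python
import math
import random
from collections import Counter


class MajorityClassLearner:
    """Hypothetical incremental base learner (illustration only):
    predicts the most frequent label it has seen so far."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        self.counts[y] += 1

    def predict_one(self, x):
        if not self.counts:
            return None  # no training data seen yet
        return self.counts.most_common(1)[0][0]


def poisson1(rng):
    """Draw k ~ Poisson(1) using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


class OnlineBagging:
    """Sketch of online bagging, assuming Poisson(1) replication."""

    def __init__(self, n_models=10, seed=0):
        self.rng = random.Random(seed)
        self.models = [MajorityClassLearner() for _ in range(n_models)]

    def learn_one(self, x, y):
        # Each base model trains on the example k ~ Poisson(1) times,
        # which approximates bootstrap resampling without storing data.
        for model in self.models:
            for _ in range(poisson1(self.rng)):
                model.learn_one(x, y)

    def predict_one(self, x):
        # Majority vote over base models that have seen data.
        votes = Counter(m.predict_one(x) for m in self.models)
        votes.pop(None, None)
        if not votes:
            return None
        return votes.most_common(1)[0][0]


def prequential_accuracy(stream, ensemble):
    """Single pass over the stream: test on each example, then train on it."""
    correct = total = 0
    for x, y in stream:
        pred = ensemble.predict_one(x)
        if pred is not None:
            total += 1
            correct += int(pred == y)
        ensemble.learn_one(x, y)
    return correct / total if total else 0.0
```

The stream can be any iterable of `(features, label)` pairs, including a generator that reads from a network source, because the loop never revisits an example. A real base learner (for instance an incremental decision tree) would replace the majority-class placeholder.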
