Simulation Study of Data Cleaning Methods (Graduation Thesis)

2021-10-24 15:50:06

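Abstract (Chinese)

With the rapid development of information technology, data volumes are growing explosively, and data mining has emerged in response. Data mining is the technique of extracting knowledge from data, so data quality is critical. However, human oversights, network errors, and similar causes leave most data sets with problems of one kind or another, including abnormal values, duplicate records, and missing values, and such "dirty data" lowers the credibility of the mined information. Preprocessing the data before mining is therefore essential, and data cleaning is the key technique in data preprocessing. This thesis studies data cleaning in data mining, with a focus on cleaning outliers. It simulates outlier detection based on statistical methods and outlier detection based on clustering algorithms, compares the performance of each method, and presents the corresponding analysis, providing a useful reference for numerical data processing methods. Traditional outlier detection techniques are of some use, but they perform poorly when their assumptions are not met. Clustering algorithms are a valuable product of modern data mining; compared with traditional statistics-based cleaning methods they offer considerable improvement in both applicability and effectiveness, so studying how to use them, along with their strengths and weaknesses, has become a top priority in modern data cleaning work.

Keywords: data cleaning; outliers; clustering algorithms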

Abstract (English)

With the rapid development of information technology, data is growing explosively, and data mining has emerged in response. Data mining is a technology for acquiring knowledge from data, so the quality of the data is particularly important. However, manual omissions, network errors, and other causes give rise to a variety of data problems, including abnormal values, duplicate records, and missing values, and such "dirty data" lowers the credibility of the mined information. It is therefore particularly important to preprocess data before mining, and data cleaning is the key technology of data preprocessing. This thesis studies data cleaning in data mining, focusing on the cleaning of outliers. Outlier detection based on statistical methods and outlier detection based on clustering algorithms are simulated, the performance of each method is compared, and the corresponding analysis results are given, providing a useful reference for numerical data processing methods. Although traditional outlier detection techniques play a certain role, they perform poorly when their conditions of use are not met. Clustering algorithms are a valuable achievement of modern data mining and are a considerable improvement over traditional statistics-based data cleaning methods in both applicable conditions and effectiveness. Therefore, studying the use of clustering algorithms and their advantages and disadvantages has become a top priority of modern data cleaning work.

Keywords: data cleaning, outliers, clustering algorithms

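Abstract (Chinese)

Abstract (English)

Chapter 1 Introduction

1.1 Research Background and Significance

1.2 Current State of Research

1.3 Main Tasks

Chapter 2 Data Cleaning

2.1 Definition of Data Cleaning

2.2 Data Cleaning Methods

2.2.1 Methods Based on Statistical Analysis

2.2.2 Methods Based on Data Features

2.3 Handling Outliers

2.3.1 Methods Based on Statistical Analysis

2.3.2 Methods Based on Data Features

2.4 Chapter Summary

Chapter 3 Data Cleaning Methods Based on Statistical Analysis

3.1 Definition of Outliers

3.2 Outlier Rejection Criteria

3.3 Outlier Handling Methods

3.4 Simulation

3.5 Chapter Summary

Chapter 4 Clustering-Based Outlier Cleaning Methods

4.1 Clustering Algorithms

4.1.1 Definition of Clustering Algorithms

4.1.2 Classification of Clustering Algorithms

4.2 Two Common Clustering Algorithms

4.2.1 K-means Clustering Algorithm

4.2.2 K-nearest Neighbors (KNN) Algorithm

4.3 Simulation

Chapter 5 Summary and Outlook

5.1 Summary

5.2 Future Work

References

Acknowledgments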

Chapter 1 Introduction

1.1 Research Background and Significance

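In recent years, with technological innovation and its application, and especially the rapid growth of the Internet industry, vast amounts of data have spread to every corner of the world, and the era of big data is arriving. Big data is more than sheer volume; it also holds considerable value. Not all data can be used directly, however: it must first be filtered and corrected before value can be extracted from it effectively. Data should therefore be cleaned before it is processed and used, which improves data quality, avoids unnecessary trouble, and saves time, labor, and machine costs.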

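Across the data mining process, data preprocessing mainly comprises data cleaning, data integration, data transformation, and data reduction, and accounts for 60% of the work in the whole process. Data cleaning is the first step of preprocessing; it determines the quality of all the data and directly affects the accuracy of the mining results, so its importance is self-evident. Data cleaning mainly involves removing irrelevant and duplicate records from the raw data set, smoothing noisy data, filtering out data unrelated to the mining topic, and handling missing values and outliers.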
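Two of the cleaning operations listed above, removing duplicate records and handling missing values, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `clean_records`, the tuple-based record layout, the use of `None` for a missing value, and the choice of the column median as the fill value are all hypothetical and are not taken from the thesis.

```python
from statistics import median


def clean_records(records):
    """Drop exact duplicate records, then fill missing values (None)
    with the median of the observed values in the same column.

    Assumes every record is a tuple of equal length and that each
    column has at least one observed (non-None) value.
    """
    # Remove exact duplicates while keeping first-seen order.
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    if not unique:
        return []

    # Compute one fill value per column from the observed entries only.
    n_cols = len(unique[0])
    fill = []
    for c in range(n_cols):
        observed = [rec[c] for rec in unique if rec[c] is not None]
        fill.append(median(observed))

    # Replace each missing entry with its column's fill value.
    return [
        tuple(f if v is None else v for v, f in zip(rec, fill))
        for rec in unique
    ]


# Hypothetical sample data: one duplicate record and one missing value.
raw = [(1.0, 20.0), (1.0, 20.0), (2.0, None), (3.0, 40.0)]
cleaned = clean_records(raw)
print(cleaned)
```

The median is used here only because it is less sensitive to extreme values than the mean; other fill strategies (mean, interpolation, or dropping the record) may suit a given data set better.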

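This thesis focuses on outlier handling in data cleaning. An outlier is an individual value in a sample that deviates markedly from the remaining observations. Outliers are also called anomalous values, and their analysis is known as outlier analysis. Outlier analysis checks whether the data contain entry errors or implausible values. Ignoring outliers is dangerous: including them in calculation and analysis without removing them adversely affects the results. Paying attention to outliers and analyzing their causes therefore often becomes an opportunity to discover problems and improve decisions^[1].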
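To make the definition concrete, the sketch below flags values that deviate markedly from the rest of a sample using the three-sigma rule, one common statistical criterion. The rejection criteria actually used in the thesis are not shown in this excerpt, so the rule, the function name `three_sigma_outliers`, the threshold `k`, and the sample data are all assumptions made for illustration.

```python
from statistics import mean, stdev


def three_sigma_outliers(values, k=3.0):
    """Return (index, value) pairs whose distance from the sample mean
    exceeds k sample standard deviations.

    Assumes at least two values and roughly normally distributed data.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) > k * sigma]


# Hypothetical measurements: 19 readings near 10 and one gross error.
readings = [
    9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9,
    10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 25.0,
]
flagged = three_sigma_outliers(readings)
print(flagged)
```

Note that the outlier itself inflates both the mean and the standard deviation, so this rule needs a reasonably large sample: with 10 or fewer points no value can lie more than three sample standard deviations from the mean, and nothing would ever be flagged.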
