A New Document Clustering Method

2023-01-29 13:04:37

Total length of the thesis: 32,869 characters

Abstract

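Cluster analysis, also called group analysis, originated in taxonomy and is an unsupervised learning technique. Clustering can group a data set without any prior knowledge of the relationships within it, and it is usually treated as a data preprocessing step. Document clustering serves several purposes. It can act as a preliminary processing stage for very large document collections. It can organize the results returned by an online search into categories, so that users find the data they need more easily and more quickly. It can reveal users' current needs more accurately and support functions such as information filtering and proactive recommendation. It can also improve the final results of text classification and enable fast retrieval services in libraries.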

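Document clustering currently rests on the assumption that documents of the same type are more similar to one another, while documents of different types are less similar to one another. After several decades of development there are many document clustering methods, which fall mainly into partition-based, hierarchical, density-based, grid-based, and model-based clustering.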
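The similarity assumption above can be made concrete with a small sketch. The abstract does not say which similarity measure the thesis uses; cosine similarity over term counts is assumed here only because it is a common choice for documents, and the function name and sample word lists are illustrative.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, already tokenized documents: two on sports, one on finance.
sports_1 = Counter("football match goal team".split())
sports_2 = Counter("team goal football league".split())
finance = Counter("stock market bank interest".split())

same_topic = cosine_similarity(sports_1, sports_2)
cross_topic = cosine_similarity(sports_1, finance)
```

Under the clustering assumption, `same_topic` should exceed `cross_topic`, which is what lets a clustering algorithm place the two sports documents in one group.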

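However, inherent shortcomings of these algorithms limit their use in text clustering. This thesis applies a novel clustering algorithm, DDC, to document clustering. Compared with other clustering methods, it is not affected by the number of clusters and it suits data sets with arbitrary spatial distributions.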

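The core of the DDC algorithm is to identify cluster centers from the density of the data objects and the distances between them, and then to cluster around those centers. Since it was proposed, it has been widely applied to band selection for hyperspectral images, community detection in hypernetworks, age estimation from facial images, and other tasks, but it has rarely been applied to text clustering. In this work we apply DDC to document clustering and compare its performance with that of other algorithms.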
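The density-and-distance idea described above can be sketched as follows. This is a minimal illustration and not the thesis's implementation: it assumes DDC follows the familiar density-peak scheme, in which a point's density is the number of neighbours within a cutoff radius and its distance is the gap to the nearest point of higher density. The function name, the parameters `cutoff`, `min_density`, and `min_distance`, and the sample points are all hypothetical.

```python
import math


def density_distance_cluster(points, cutoff, min_density, min_distance):
    """Cluster points around centers that are both dense and far from denser points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: number of other points strictly closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]

    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    delta = [0.0] * n   # distance to the nearest point earlier in the order
    parent = [-1] * n   # index of that point
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # densest point: use its largest distance
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j

    # Centers pass both thresholds; the densest point is always a center.
    centers = [i for i in order
               if i == order[0]
               or (rho[i] >= min_density and delta[i] >= min_distance)]

    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Each remaining point inherits the label of its denser parent,
    # which has already been labeled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Two hypothetical, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels, centers = density_distance_cluster(points, cutoff=1.0,
                                           min_density=3, min_distance=5.0)
```

In this sketch the number of clusters is not supplied in advance; it follows from how many points pass the two thresholds, which in practice would be chosen by inspecting a plot of density against distance.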

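Before clustering, the collected document data set must be vectorized: a computer cannot work directly with the information in raw text, so each document has to be converted into a machine-readable representation. After clustering, the final result must be evaluated to judge its quality, which is an equally important step.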
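The two steps named above, vectorization and evaluation, can be sketched in a few lines. The table of contents lists TF-IDF and the silhouette coefficient, so those are used here, but the exact weighting formula and the Euclidean distance are assumptions, and the documents are hypothetical and already tokenized (Chinese text would first need word segmentation and stop-word filtering).

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) using tf = count / length, idf = log(N / df)."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    vocab = sorted(df)
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append([
            (counts[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocab
        ])
    return vocab, vectors


def silhouette_score(vectors, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    n = len(vectors)
    scores = []
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if j != i:
                d = math.dist(vectors[i], vectors[j])
                by_cluster.setdefault(labels[j], []).append(d)
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster exists
            continue
        a = sum(own) / len(own)                               # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


docs = [["football", "goal", "team"], ["football", "team", "league"],
        ["stock", "bank", "market"], ["stock", "market", "interest"]]
vocab, vectors = tfidf_vectors(docs)
by_topic = silhouette_score(vectors, [0, 0, 1, 1])  # grouping that follows topics
mixed = silhouette_score(vectors, [0, 1, 0, 1])     # grouping that mixes topics
```

A grouping that follows the topics should score higher than one that mixes them, which is how such a coefficient helps compare the output of different clustering algorithms.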

Keywords: DDC; clustering; document clustering; text vectorization; cluster quality assessment

A New Document Clustering Method

Abstract

Cluster analysis, also called group analysis, originated in taxonomy and is an unsupervised learning technique. Clustering can group a data set without prior knowledge of the relationships within it, and it is usually regarded as a data preprocessing step. Document clustering serves several purposes. It can act as a preliminary processing stage for very large document collections. It can organize the results returned by an online search into categories, so that users get the data they need more easily and more quickly. It can better discover users' current needs and support information filtering and proactive recommendation. It can also improve the final results of text categorization and enable fast retrieval services in libraries.

Document clustering currently rests on the assumption that documents of the same type are more similar to one another than documents of different types. After decades of development, the available algorithms fall mainly into partition-based, hierarchical, density-based, grid-based, and model-based clustering.

However, inherent shortcomings of these algorithms limit their use in text clustering. In this thesis, a novel clustering algorithm, DDC, is applied to document clustering. Compared with other clustering methods, it is not affected by the number of clusters and it suits data sets with arbitrary spatial distributions.

The core of the DDC algorithm is to determine cluster centers from the density of the data objects and the distances between them, and then to cluster around those centers. Since it was proposed, it has been widely used in band selection for hyperspectral images, community detection in hypernetworks, age estimation from facial images, and other tasks, but it is seldom used in text clustering. In this work, we apply DDC to document clustering and compare its performance with that of other algorithms.

Before clustering, we need to vectorize the collected document data set. Because a computer cannot work directly with the information in raw text, each document must be converted into machine-readable data. After clustering, we need to evaluate the final results and analyze their quality, which is an equally important step.

Keywords: DDC; clustering; document clustering; text vectorization; cluster quality assessment

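Chapter 1 Introduction

1.1 Research Background and Significance

1.2 Current State of Algorithm Research

1.2.1 K-means Algorithm

1.2.2 DBSCAN Algorithm

1.2.3 BIRCH Algorithm

1.2.4 DDC Algorithm

1.3 Research Content and Organization of the Thesis

Chapter 2 Theory and Foundations of Clustering Algorithms

2.1 Fundamentals of Clustering

2.1.1 Similarity Computation in Clustering

2.1.2 Clustering Metrics

2.2 Fundamentals of the K-means Algorithm

2.3 Fundamentals of the DBSCAN Algorithm

2.4 Fundamentals of the BIRCH Algorithm

2.5 Fundamentals of the DDC Algorithm

2.6 Chapter Summary

Chapter 3 Fundamentals of Document Processing

3.1 Document Processing

3.1.1 Word Segmentation

3.1.2 Stop-Word Filtering

3.2 Text Representation Models

3.2.1 Boolean Model

3.2.2 Vector Space Model

3.3 TF-IDF Text Processing Example

3.4 Chapter Summary

Chapter 4 Implementation and Comparison of Document Clustering Algorithms

4.1 Document Preprocessing

4.1.1 Fundamentals of Document Vectorization

4.1.2 Text Vectorization Process

4.2 K-means Implementation

4.3 DBSCAN Implementation

4.4 DDC Implementation

4.4.2 Code for Computing Data Point Density

4.4.4 Code for Computing Similarity

4.4.5 Ryan Seghers's Method for Computing Density

4.5 DBI Index Evaluation

4.6 Silhouette Coefficient Evaluation

4.7 Comparison and Analysis of Experimental Results

4.7.1 Comparison of Experimental Results

4.7.2 Analysis of Experimental Results

4.8 Chapter Summary

Chapter 5 Summary and Outlook

5.1 Summary of the Work

5.1.1 Main Research Content

5.1.2 Shortcomings and Limitations

5.2 Outlook for Future Work

Acknowledgments

References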

Chapter 1 Introduction

1.1 Research Background and Significance

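Since computers emerged in the third industrial revolution, our way of life has changed considerably. Large-scale assembly-line production that relies on large workforces, heavy industry, and large machinery is no longer the mainstream. The tertiary sector, that is, the service industry, plays an ever larger and increasingly dominant role in daily life, which means that intangible, information-based industries will become a key resource. Banknotes have given way to mobile payment, and shopping in malls to online shopping on Taobao and JD. The mobile phone, an important carrier of the Internet era, was at first used only to bring people closer through calls and text messages; now it is more like a window onto a rich world. According to the 43rd statistical report on Internet users released by the China Internet Network Information Center (CNNIC), the number of Internet users in China had reached 828 million as of 2018 and was still rising. Online shoppers numbered 610 million, and users of online payment services such as Alipay and WeChat numbered 600 million.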

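As the volume of information grows, we need to filter it further and extract what we want from massive amounts of data. Most of the data generated on the network, however, is disorganized and follows no apparent pattern, which makes clustering especially important. A suitable clustering algorithm can uncover hidden internal relationships in large, seemingly irregular data sets and group the data by category, which makes later selection and use of the data easier and saves a great deal of time.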
