Text Clustering Based on FCM (Graduation Thesis)

2022-04-17 22:16:03

Total length of the thesis: 18,102 characters

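Abstract

Text clustering partitions a given collection of documents into groups so that documents within a group are related in content while documents in different groups are not. Such a partition can then support information retrieval and pattern recognition. As the volume of information on the Internet keeps growing, research on text clustering is of great value to information retrieval.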

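This thesis explains the principles of text clustering and how a clustering system is built, and describes the vector space model used in this technique and how it is constructed. It first reviews fuzzy mathematics and clustering theory and then applies clustering to text on that basis. The FCM algorithm is currently among the more widely used methods: it has a solid theoretical foundation and has been applied extensively to text problems. It also has shortcomings, however; for example, it is sensitive to the initial cluster centers and may therefore fail to reach the optimal solution. To address this problem, the thesis proposes a new FCM algorithm based on an improved way of selecting the initial cluster centers.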

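The improved algorithm is referred to here as NFCM. A system model is then designed, and both the new algorithm and the original one are run on it for experimental analysis, partly to check whether the system model is feasible and partly to demonstrate the advantage of the new algorithm over the traditional one in text clustering.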

Keywords: text clustering; initial cluster centers; vector space model; fuzzy C-means algorithm

Text clustering based on Fuzzy C-means

ABSTRACT

Text clustering divides a large collection of texts into groups so that the texts within each group are related, while texts in different groups are unrelated. The resulting groups can be used for information retrieval and pattern recognition. With more and more information available on the Internet, research on text clustering plays an important role in information retrieval.

This paper explains the principles of text clustering, the vector space model used in this technique, and how that model is built. We first review the theory of fuzzy mathematics and fuzzy clustering, and then apply clustering to text on the basis of this theory. The FCM algorithm has a sound theoretical basis and is commonly used for text problems, but it also has shortcomings. For example, the algorithm is sensitive to the initial cluster centers, so it may not reach the optimal solution. To address this, the paper proposes a new FCM algorithm with an improved selection of the initial cluster centers.

With this improvement we obtain a new algorithm, which we refer to as NFCM. We then design a system model and run both the new algorithm and the original one on it for experimental analysis. On the one hand, this shows whether the system model is feasible; on the other hand, it tests whether the new algorithm is superior to the traditional one for text clustering.

Keywords: Text clustering; Vector space model; Initial clustering center; FCM

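Abstract

ABSTRACT

Chapter 1 Introduction

1.1 Background and Significance

1.2 Text Clustering and Text Classification

1.3 Research Progress in Text Clustering

1.4 Difficulties and Characteristics of Text Clustering

1.5 Main Work of This Thesis

Chapter 2 Numerical Models of Text

2.1 Vector Space Model

2.2 Selection of Feature Terms

2.3 Text Feature Representation

2.4 Extraction of Feature Terms

2.5 Chapter Summary

Chapter 3 Fuzzy Theory and the FCM Clustering Algorithm

3.1 Development of Fuzzy Theory

3.2 Overview of Text Clustering Algorithms

3.3 Development and Variants of FCM

3.3.1 Brief Description of the FCM Algorithm

3.3.2 Shortcomings of the FCM Algorithm

3.4 The Proposed NFCM Algorithm

3.5 Chapter Summary

Chapter 4 Implementation and Analysis of the Text Clustering System

4.1 Architecture

4.2 Functional Modules

4.3 Experimental Preparation

4.4 Experimental Procedure

4.5 Chapter Summary

Chapter 5 Summary and Outlook

5.1 Summary

5.2 Outlook

References

Acknowledgements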

Chapter 1 Introduction

1.1 Background and Significance

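In the 21st century, information on the Internet has proliferated, and as a result its structure has become disordered. Making sense of the content and structure of this information is genuinely hard: although many search engines are available, a large share of the texts they return are irrelevant, and finding the relevant texts by manual effort alone is not feasible. The methods in everyday use perform only shallow processing; much of the data they return is useless, while useful information remains unmined. Text mining^[1] is an extremely important branch of data mining. Because text is expressed in natural language, and computers still cannot fully resolve the ambiguity of human language, further research is needed before text mining can understand natural language more deeply. Text clustering is one of the topics within text mining and is currently a very active one.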

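Many earlier clustering methods are hard partitions, that is, each object is assigned unambiguously to exactly one cluster. In reality, however, things cannot always be divided so sharply; they are related to one another and can be said to be "both this and that"^[3]. To handle this kind of clustering problem better, fuzzy methods were introduced, giving rise to fuzzy cluster analysis. By definition, fuzzy clustering runs without supervision and decides whether objects belong to the same class solely according to how similar they are.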
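The contrast between a hard partition and a fuzzy one can be illustrated with a small sketch. It is not taken from the thesis: the function names `hard_assignment` and `fuzzy_memberships`, the Euclidean distance, and the fuzzifier value m = 2 are all assumptions chosen for illustration. The membership expression is the standard fuzzy C-means update, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i is the distance from the point to center i.

```python
import math


def hard_assignment(point, centers):
    """Hard partition: return the index of the single nearest center."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy partition: return one membership degree per cluster.

    Standard FCM membership update with fuzzifier m > 1 (assumed m = 2
    by default). The degrees are non-negative and sum to 1.
    """
    distances = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs only to them;
    # the membership is shared equally if several centers coincide.
    zeros = distances.count(0.0)
    if zeros:
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# Hypothetical two-cluster example in a 2-D feature space.
centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)
print(hard_assignment(point, centers))    # index of the nearest center
print(fuzzy_memberships(point, centers))  # one degree per cluster
```

For this point the hard partition reports only cluster 0, while the fuzzy memberships come out at roughly 0.9 and 0.1, which keeps the information that the point is also somewhat related to the second cluster.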
