Research on Domain-Specific Natural Language Understanding and Generation Techniques (Graduation Thesis)

2021-11-06 11:11

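Abstract

With the arrival of the big-data era, the volume of text has far exceeded what people can process manually. People want to absorb more useful information in less time, and how to extract the core idea of a text quickly and accurately has become an active research topic. As natural language understanding and generation techniques have advanced, automatic text summarization has matured and found many applications. This thesis targets Chinese news and studies both extractive and abstractive summarization. The main work is as follows:

(1) Extractive summarization: The traditional TextRank algorithm does not fully exploit sentence-level features, so sentence position and sentence length features are incorporated into TextRank. Because summaries produced by TextRank also contain some semantic repetition, redundancy is removed when the summary is generated.

(2) Abstractive summarization: An abstractive summarization model that combines the Multi-Head Attention mechanism with semantic relevance is proposed. A baseline model is built on seq2seq with Attention and Beam Search. On top of it, the Multi-Head Attention mechanism is introduced so that the model can learn text features from multiple perspectives; a Mask mechanism is used to inject prior knowledge so that the decoder focuses more accurately on key positions; and semantic relevance is introduced so that the model favors summaries that are highly similar to the source text.

(3) Comparative experiments: Experiments with the extractive algorithm on the long-text dataset NLPCC demonstrate that the optimizations to TextRank are effective. The proposed abstractive model is evaluated on the short-text dataset LCSTS and compared both against several other abstractive models and against the extractive model. The results show that the proposed model improves substantially over the baseline, reaching Rouge-1, Rouge-2 and Rouge-L scores of 32.5, 21.4 and 31.1 respectively, and that its output summaries are readable and reliable.

The contributions of this thesis are as follows: for extractive summarization, TextRank is improved by introducing text features; for abstractive summarization, a model combining the Multi-Head Attention mechanism with semantic relevance is proposed, and its effectiveness is verified through comparative experiments.

Keywords: automatic summarization; deep learning; TextRank; multi-head attention mechanism; semantic relevance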

Abstract

With the advent of the big-data era, the amount of text information has far exceeded the limit of manual processing. People hope to learn more useful information in a short time, and how to obtain the core ideas of a text quickly and accurately has become a research hotspot. With the development of natural language understanding and generation technology, automatic text summarization has become increasingly mature and has many application scenarios. This paper, which is oriented to the field of Chinese news, conducts research from two aspects: extractive summarization and abstractive summarization. The main work is as follows:

(1) Research on extractive summarization: To address the problem that the traditional TextRank algorithm cannot fully exploit the features of sentences in a text, this paper integrates sentence position and sentence length features into the TextRank algorithm. Because summaries generated by the TextRank algorithm contain semantic duplication, this paper also removes redundancy when generating the summary.

(2) Research on abstractive summarization: An abstractive summarization model combining the Multi-Head Attention mechanism and semantic relevance is proposed. First, we build a baseline model based on seq2seq with Attention and Beam Search. Then we introduce the Multi-Head Attention mechanism to enable the model to learn text features from multiple perspectives. At the same time, we use the Mask mechanism to introduce prior knowledge, which makes the model focus more accurately on key positions when decoding. Finally, we introduce semantic relevance to make the model more inclined to output summaries with high similarity to the source text.

(3) Verification by comparative experiments: We apply the extractive summarization algorithm of this paper to the long-text dataset NLPCC, and the results show that the optimization of the TextRank algorithm is effective. We also apply the proposed abstractive summarization model to the short-text dataset LCSTS and compare the results with several abstractive summarization models and with the extractive summarization model. The experimental results show that the proposed model, whose Rouge-1, Rouge-2 and Rouge-L scores reach 32.5, 21.4 and 31.1 respectively, improves greatly on the baseline model. The summaries output by this model are readable and reliable.

The contributions of this paper are as follows. For extractive summarization, the TextRank algorithm is improved by introducing text features. For abstractive summarization, a model combining the Multi-Head Attention mechanism and semantic relevance is proposed and verified to be effective by comparative experiments.

Keywords: automatic summarization; deep learning; TextRank; multi-head attention; semantic relevance

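Chapter 1 Introduction

1.1 Research Background and Significance

1.2 Research Status at Home and Abroad

1.3 Research Content of This Thesis

1.4 Organization of This Thesis

Chapter 2 Extractive Automatic Summarization Based on TextRank

2.1 The PageRank Algorithm

2.2 Extractive Automatic Summarization Based on the TextRank Algorithm

2.3 Improved Extractive Summarization Algorithm

2.3.1 Idea Behind the Improvement

2.3.2 Description of the Improved Algorithm

2.4 Chapter Summary

Chapter 3 Abstractive Automatic Summarization Based on Deep Learning

3.1 Background Knowledge

3.1.1 Recurrent Neural Networks

3.1.2 Long Short-Term Memory Networks

3.1.3 The seq2seq Architecture

3.1.4 The Attention Mechanism

3.2 Baseline Abstractive Summarization Model Based on the seq2seq Architecture

3.2.1 Bidirectional LSTM

3.2.2 Beam Search

3.3 Improved Abstractive Summarization Model

3.3.1 The Multi-Head Attention Mechanism

3.3.2 Introducing Prior Knowledge

3.3.3 Introducing Semantic Relevance

3.3.4 Abstractive Summarization Model Combining the Multi-Head Attention Mechanism and Semantic Relevance

3.4 Chapter Summary

Chapter 4 Experimental Results and Analysis

4.1 Datasets and Preprocessing

4.2 Evaluation Method

4.3 Parameter Settings of the Abstractive Summarization Model

4.4 Comparison and Analysis of Experimental Results

4.4.1 Comparison of Results for the Improved Extractive Summarization

4.4.2 Comparison of Results for Extractive and Abstractive Summarization

4.5 Chapter Summary

Chapter 5 Conclusion and Outlook

5.1 Summary of Work

5.2 Future Research

References

Acknowledgements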

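Introduction

Research Background and Significance

As 5G technology matures, network transmission speeds are set to reach a new high. Meanwhile, the amount of information online is growing exponentially; information is easier to obtain, but people now suffer from information overload. In recent years social media such as Weibo and WeChat have risen rapidly, and people increasingly browse information online, including news, popular science and commentary. Readers tend to predict an article's content from its title, yet clickbait is increasingly common: titles are written to attract attention and do not match the content, which wastes readers' time. At the same time, the quickening pace of life leaves people unable to sort through the information they encounter each day in a timely and effective way, so obtaining more useful information in less time has become a pressing need. Automatic text summarization emerged in response.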

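A summary lets people grasp the core idea of a text quickly, but as the volume of text keeps growing, writing summaries by hand consumes a great deal of labor and resources. Automatic text summarization uses algorithms to summarize a text and output its core content, which greatly reduces the cost of manual summarization, so research on this technology has considerable practical significance.

By input, automatic text summarization can be divided into multi-document and single-document summarization. Multi-document summarization takes several documents as input and produces a single document as the summary; single-document summarization takes one document as input and produces a shorter text as the summary. Hovy^[1] holds that "a summary is a text produced from one or more documents that contains a significant portion of the information in the original documents and is no longer than half of the original text". By technique, automatic text summarization falls mainly into two approaches, extractive and abstractive. Extractive summarization selects several important sentences from the original text and combines them into the summary. Abstractive summarization understands the semantics of the source text and uses natural language processing algorithms to generate sentences that do not appear in the source text but carry its core information. Abstractive summarization is therefore more complex than extractive summarization, but its understand-then-summarize approach is closer to how people write summaries.

Compared with popular fields such as machine translation, image processing and sentiment analysis, research on automatic text summarization started relatively late in China. In addition, datasets for Chinese text summarization have long been scarce, which has caused research on Chinese abstractive summarization to lag behind.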

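Research Status at Home and Abroad

In 1958, Luhn^[2] first proposed the concept of automatic summarization. He scored sentences using word-frequency statistics, compared the scores, and selected the several highest-scoring sentences to form the summary. Although the resulting summaries were highly redundant and of low quality, this work opened the study of automatic text summarization. For a long time afterwards, extractive summarization remained the mainstream of automatic summarization, and the graph-based TextRank algorithm^[3] is one of the classic extractive algorithms. Its main idea is to build a TextRank graph of the text with sentences as vertices and inter-sentence similarity as edges, and then to compute the node scores iteratively. A sentence's importance score is its value when the iteration converges, and the highest-scoring sentences are selected to form the summary. Its advantages are that it is simple to implement, unsupervised, only weakly language-dependent, and produces fairly high-quality summaries.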
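The TextRank procedure described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation. Several details are assumptions made only for illustration: the function names `similarity`, `textrank` and `summarize`, the word-overlap similarity normalized by log sentence lengths, the damping factor `d = 0.85`, the convergence tolerance, and whitespace tokenization (Chinese text would first need a word segmenter).

```python
import math


def similarity(a, b):
    """Word-overlap similarity between two token lists (assumed measure)."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) = 0 would make the denominator vanish
    overlap = len(set(a) & set(b))
    return overlap / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, d=0.85, tol=1e-6, max_iter=100):
    """Return one importance score per sentence."""
    tokens = [s.lower().split() for s in sentences]  # assumed tokenizer
    n = len(tokens)
    if n == 0:
        return []
    # Edge weights: similarity between every pair of distinct sentences.
    w = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]  # total edge weight leaving each vertex
    scores = [1.0] * n
    for _ in range(max_iter):
        new = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to w[j][i].
            rank = sum(w[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new.append((1 - d) + d * rank)
        converged = max(abs(x - y) for x, y in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores


def summarize(sentences, k=2):
    """Pick the k highest-scoring sentences, kept in document order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(sentences, k)` on a list of sentence strings returns the `k` sentences with the highest converged scores. A sentence that shares no words with any other sentence has no edges and keeps only the base score `1 - d`, so it is ranked last.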
