Research on Biomedical Named Entity Recognition Based on Deep Learning (Graduation Thesis)

2021-11-05 07:11

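Abstract (translated from the Chinese)

Biomedical literature is growing rapidly alongside science and technology, which makes biomedical text mining increasingly important. Tools that recognize entities in biomedical text are needed so that readers can cope with the growing volume of literature and experts can extract the entity information they need more quickly. Biomedical named entity recognition (Bio-NER) is one of the most basic tasks in biomedical text mining. Most existing approaches rely on dictionary- and rule-based annotation, and some use CRF models, but applying these general natural language processing methods directly to Bio-NER does not give satisfactory results. The goal of this thesis is therefore to build a neural network that performs well on Bio-NER. The work consists of the following steps:

1. Find suitable corpora: ones that bring out the strengths of the model on Bio-NER and that are regarded as authoritative in the biomedical community.
2. Select suitable word vectors and preprocess them.
3. Preprocess the corpora, including tokenization and stop-word removal.
4. Build the BiLSTM-CRF neural network model.
5. Train the model on the input word vectors.
6. Compute the model's scores on each dataset and analyze the results.

I built and trained the BiLSTM-CRF model and evaluated its predictions on five widely recognized biomedical corpora, where it achieved good scores. The model alleviates the vanishing- and exploding-gradient problems of plain RNNs, imposes conditional constraints on the predicted labels, and does not require manual feature extraction. The model is also easy to understand and highly readable, and it needs no hand-crafted feature engineering because the neural network extracts features automatically. As an approach it is clearly better than dictionary- or rule-based annotation methods such as those in NLTK, and numerically it achieves a higher F1 score than LSTM-only and CNN models. The model is therefore efficient to build and accurate in its output, which general neural networks and ordinary annotation methods are not.

Keywords: biomedicine; named entity; long short-term memory network; conditional random field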

Abstract

Science and technology are developing rapidly, and the biomedical literature is growing just as fast, so biomedical text mining is particularly important. We need tools that identify entities in biomedical text in order to keep up with the growing volume of literature and to let experts extract the entity information they want more quickly. Biomedical named entity recognition is one of the most basic tasks of biomedical text mining. At present, most research methods are based on dictionary and rule annotation, and some use a CRF model. However, directly applying these natural language processing methods to biomedical named entity recognition does not give ideal results. The purpose of this research is therefore to build a better neural network for biomedical named entity recognition. The contents are as follows: 1. find the right corpora, namely those that highlight the advantages of the model in biomedical named entity recognition and those that are authoritative in the biomedical field; 2. select suitable word vectors and process them; 3. process the corpora, including segmentation, removal of stop words and so on; 4. build the BiLSTM-CRF neural network model; 5. train the model with the input word vectors; 6. obtain the score of the model on each dataset and analyze the results. Finally, I completed the construction and training of the BiLSTM-CRF model. I carried out prediction and evaluation on five recognized biomedical corpus datasets and obtained good scores. The model alleviates the vanishing-gradient and exploding-gradient problems of the general RNN model, and it can also conditionally constrain the annotation without manual feature extraction. One characteristic of the model is that it is easy to understand, highly readable, and needs no manual feature engineering, because the neural network extracts features automatically. As a model it is significantly better than dictionary-based or rule-based annotation methods such as those in NLTK, and numerically it achieves a higher F1 score than the LSTM-only model and the CNN model. Such a model is efficient to build and accurate in its output, which general neural networks and ordinary annotation methods are not.

Keywords: biomedicine; named entity; long short-term memory network; conditional random field

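Chapter 1 Introduction

1.1 Research Background and Significance

1.2 Related Work in China and Abroad

1.3 Research Content of This Thesis

1.4 Organization of This Thesis

Chapter 2 Recommended Methods for Biomedical Named Entity Recognition

2.1 Problem Statement

2.2 BIOES Tagging Scheme

2.3 Recommended Bio-NER Models

2.3.1 Long Short-Term Memory (LSTM) Model

2.3.2 BiLSTM-CRF (Bidirectional LSTM-CRF) Model

2.3.3 Single-Task Model

2.3.4 Multi-Task Models

Chapter 3 Design of the Research Model

3.1 Design Framework Diagram of the Bidirectional LSTM-CRF Model

3.2 Detailed Design Steps of the Model

3.2.1 Converting Words to Word Vectors with the Skip-gram Model

3.2.2 Feeding Word Vectors into the Bidirectional LSTM Layer

3.2.3 Transmission Through the CRF Layer

Chapter 4 Experiments

4.1 Sources of Experimental Data

4.1.1 Experimental Datasets

4.1.2 Experimental Word Embeddings

4.2 Experimental Procedure

4.3 Evaluation Metrics

4.4 Comparing the Scores of the BiLSTM-CRF Model on Different Datasets

4.5 Case Study

Chapter 5 Conclusion and Outlook

5.1 Conclusion

5.2 Outlook

References

Acknowledgments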

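Chapter 1 Introduction

1.1 Research Background and Significance

As society develops materially, people expect more from biomedicine and need access to more knowledge from the frontier of biomedical research. Experts, for example, need to extract the named entities relevant to their field, such as viruses and proteins, from large bodies of literature. As a result, acquiring biomedical knowledge solely by reading papers in full is becoming increasingly impractical: the literature is vast and named entities are hard to spot at a glance, so knowledge is absorbed more and more slowly. This is why biomedical text mining, which applies text mining techniques to the biomedical domain, has attracted attention. Bio-NER is one of the most basic tasks in biomedical text mining and the first step of many downstream applications, such as relation extraction and knowledge bases. Biomedical named entities are an emerging class of entities whose names are complex and highly varied, which makes them hard to recognize. Basic general-purpose natural language processing tools therefore cannot be applied directly to the biomedical domain, because they fail to recognize these entities. The main goal of Bio-NER is to extract and label, within the vast biomedical literature, the named entities that matter to the field, such as protein, DNA, RNA, and cell-Type, so that text mining can proceed more efficiently. For example, in the sentence "We found a new protein: GRP78, GRP78 is a stress-induced protein, which can be folded and synthesized, and can be used for quality control in endoplasmic reticulum (an important part of human cells)." we need to quickly identify GRP78 as a named entity and tag it S-protein. Such tags help experts read quickly and form an overall picture of new information. Bio-NER therefore deserves deeper study, so that readers can more quickly capture the genuinely useful information in biomedical literature.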
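To make the S-protein tag in the example above concrete, the following is a minimal sketch of BIOES tagging (B = begin, I = inside, O = outside, E = end, S = single-token entity). It is an illustration under stated assumptions, not the thesis's own preprocessing code: the function name `bioes_tags`, the `(start, end, label)` span format, the tokenization, and the `cell_type` label spelling are all hypothetical.

```python
from typing import List, Tuple


def bioes_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Assign BIOES tags to tokens.

    Each span is (start, end, label) with `end` exclusive. The span format
    is an assumption made for this sketch, not taken from the thesis.
    """
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, label in spans:
        if end - start == 1:
            # A single-token entity gets the S- prefix.
            tags[start] = f"S-{label}"
        else:
            # A multi-token entity gets B- first, I- in between, E- last.
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


# Opening of the example sentence; GRP78 is a one-token protein entity.
tokens = ["We", "found", "a", "new", "protein", ":", "GRP78"]
tags = bioes_tags(tokens, [(6, 7, "protein")])

# A hypothetical multi-token entity, to show the B/I/E prefixes.
multi = bioes_tags(["activated", "T", "cells"], [(0, 3, "cell_type")])
```

Here `tags` marks GRP78 as `S-protein` and every other token as `O`, while `multi` shows how an entity spanning three tokens is split into `B-`, `I-`, and `E-` tags.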

1.2 Related Work in China and Abroad
