Application and Research of Web Crawler Technology for Extracting Web Page Information (Graduation Thesis)

2021-03-13 22:49:19

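Abstract (Chinese)

Over the past few decades, driven by users and engineers around the world, the Internet has developed rapidly, and the World Wide Web has become the main carrier of online information. Many applications need to extract that information, for example search engines, news and information gathering, and public-opinion monitoring. Locating the information a user wants within the Internet's enormous body of content is therefore becoming a direction for future research in search technology. This thesis studies one tool for acquiring web page information: the web crawler.

The main framework of a web crawler comprises page acquisition, page storage, and index generation. A crawler works as follows: starting from a given seed URL, it establishes a connection between client and server, fetches the specified page, and then repeatedly extracts new URLs from the saved pages and inserts them into a queue. The crawler designed in this thesis provides page crawling, URL management, index generation, and page parsing, together with a user interface. Page acquisition uses the C# WebClient class to download pages. URL management builds a Todo table that stores the links already visited, so that the crawler can determine whether the link it is processing has been visited before; this function also uses regular expressions to extract URLs. Page parsing extracts a page's body text, title, keywords, and URLs by parsing its HTML tags. Index generation calls the Lucene.Net class library (a full-text search toolkit) and uses its IndexWriter class to build the index. The whole crawler is implemented in C#, written in VS2010, and uses a SQL Server 2008 database to store data, with insert, delete, and query operations on the database. The system is tested by supplying a URL and then checking how each function of the crawler behaves.

Keywords: web crawler; URL; page analysis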
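
The crawl loop described in the abstract above can be sketched as follows. This is a minimal illustration in Python, not the thesis's C# implementation: the names `crawl`, `extract_links`, `fetch`, and `max_pages` are hypothetical, the link pattern is a simplified assumption, and the separate `seen` set used here to record queued URLs may differ from how the thesis's Todo table is organized.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Simplified pattern for href attributes (an assumption for illustration);
# it stops at a quote or '#', so fragments are dropped.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)


def extract_links(base_url, html):
    """Return absolute http(s) URLs found in href attributes of html."""
    links = []
    for raw in HREF_RE.findall(html):
        absolute = urljoin(base_url, raw.strip())
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_url.

    fetch is any callable that maps a URL to its HTML text and raises an
    exception on failure; it is injected so the loop can be tried offline.
    """
    todo = deque([seed_url])   # URLs waiting to be fetched
    seen = {seed_url}          # every URL ever queued, to avoid repeats
    pages = {}                 # URL -> downloaded HTML

    while todo and len(pages) < max_pages:
        url = todo.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue           # skip pages that cannot be downloaded
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                todo.append(link)
    return pages
```

Passing the downloader in as `fetch` keeps the queue logic separate from network access; in the thesis that role is played by the WebClient download step.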

Abstract

In recent decades, driven by users and technical staff around the world, the Internet has developed at high speed, and the World Wide Web has become the carrier of network information. Many applications need to extract information from the web, such as search engines, news gathering, and public opinion monitoring. Locating a user's information in the vast Internet repository will therefore be a direction of future research in search technology. This thesis mainly studies a tool for acquiring web page information: the web crawler.

The main framework of a web crawler consists of web page acquisition, web page saving, and indexing. A web crawler starts from a given seed URL and establishes a connection between the client and the server; after obtaining the specified page, the crawler keeps extracting new URLs from the saved pages and inserting them into a queue. The crawler designed in this thesis provides web crawling, URL management, index generation, and web page analysis, along with a human-computer interaction interface. The page acquisition function uses the WebClient class in C# to download pages. The URL management function creates a Todo table that stores visited URLs so that the crawler can determine whether the current link has already been visited; it also uses regular expressions to extract URLs. The page analysis function parses HTML tags to extract the body text, title, keywords, and URLs of a page. The index generation function builds an index with the Lucene.Net class library (a full-text search development package), using the library's IndexWriter class. The whole crawler is written in C# in the VS2010 environment and uses a SQL Server 2008 database to store data, supporting insert, delete, and query operations on the database. To test the system, the user supplies a URL and then checks that each function of the crawler works normally.

Keywords: web crawler; URL; page analysis
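
The page analysis step, which pulls the title, keywords, body text, and links out of the HTML tags, might look like the following sketch. It uses Python's standard `html.parser` module rather than the thesis's C# code; `PageParser` and `parse_page` are hypothetical names, and reading the keywords from a `<meta name="keywords">` tag is an assumption about where they come from.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect the title, meta keywords, links and visible text of a page."""

    SKIPPED = {"script", "style"}  # elements whose text is not page content

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.links = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html):
    """Parse one HTML document and return its extracted fields as a dict."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.keywords,
        "links": parser.links,
        "text": " ".join(parser.text_parts),
    }
```

A tag-based parser of this kind tolerates attribute order and quoting differences that a pure regular-expression approach would have to handle case by case.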

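Abstract (Chinese)

Abstract (English)

Contents

Chapter 1 Introduction

1.1 Research Background and Significance

1.2 Current State of Web Crawler Development in China and Abroad

1.3 Research Content of This Thesis

1.3.1 Content Studied in This Thesis

1.3.2 Organization of This Thesis

Chapter 2 Related Theory and Key Technologies

2.1 How a Web Crawler Works

2.2 The HTTP Protocol

2.3 Regular Expressions

2.4 The C# Programming Language

2.4.1 Overview of C#

2.4.2 Features of the C# Language

2.4.3 Theoretical Background of the C# Language

2.5 Chapter Summary

Chapter 3 Analysis and Design of the Web Crawler System

3.1 Requirements Analysis of the Web Crawler System

3.2 Functional Design of the Web Crawler

3.3 Design of the Main Crawler Functions

3.3.1 Page Crawling Design

3.3.2 URL Management

3.3.3 Crawl Strategy Design

3.3.4 Page Parsing Design

3.4 Chapter Summary

Chapter 4 Implementation of the Web Crawler System

4.1 Development Tools

4.2 Implementation of the Crawler Components

4.2.1 Page Crawling Implementation

4.2.2 URL Management Implementation

4.2.3 Crawl Strategy Implementation

4.2.4 Page Parsing Implementation

4.2.5 Crawler Interface Design and Implementation

4.2.6 Index Generation Design

4.3 Thread Management Implementation

4.3.1 Advantages of Multithreading

4.3.2 Disadvantages of Multithreading

4.4 Chapter Summary

Chapter 5 Testing the Web Crawler System

5.1 Test Environment

5.2 Test Procedure

5.2.1 Single-Crawler Test

5.2.2 Multithreaded Operation Test

5.2.3 Page Design Test

5.3 Test Results

5.4 Chapter Summary

Chapter 6 Summary and Outlook

6.1 Summary

6.2 Outlook

References

Appendix

Acknowledgements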

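Chapter 1 Introduction

1.1 Research Background and Significance

Advances in technology have accelerated the growth of the Internet, and the amount of information online is increasing exponentially. Especially in today's era of big data, a continuous stream of information flows onto the Internet every day; for example, users post all kinds of messages on social media daily. Because the Internet's store of information is so vast and web pages are updated automatically every day, users cannot quickly locate and correctly retrieve the information they want from such a mass of data. In this situation, people urgently want a tool that helps them manage and organize this huge, disorderly body of information, builds an index over it, and lets them locate what they need quickly. After extensive work by researchers, search engines entered our lives.

Search engines are now an indispensable part of daily life, and studies indicate that they have greatly changed how people remember: people tend to remember how to find a piece of information through a search engine rather than the information itself, and where they obtained it rather than what it said. Search engines have thus imperceptibly influenced our lives. However, as the network keeps developing, technology advances, and web pages are constantly updated, their content and formats have become more diverse. Current search engine technology increasingly fails to meet users' needs and shows many shortcomings, which has drawn the attention of practitioners: the efficiency of search engines must keep improving.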
