Research and Implementation of a Distributed Web Crawler Based on the Hadoop Platform (Graduation Thesis)

2021-08-24 22:58:44

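Abstract

Since the start of the 21st century, information has grown explosively. Faced with such massive data resources, which exist in different forms and in different locations, how to acquire, store, and manage them, how to quickly find the information one needs, and how to ensure that the information is accurate and valid have become problems in urgent need of solutions. Against this background, this project studies and implements a distributed web crawler based on the Hadoop platform. The goal is to use the open-source Hadoop platform to build a distributed web crawler system that replaces a centralized web crawler and thereby improves the system's capacity to crawl and store information resources from the web. The main work carried out for this project is as follows: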

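(1) Studying the project background, moving from the shortcomings of centralized web crawlers to the exploration of distributed web crawlers, and studying the MapReduce computing model and HDFS file storage of the Hadoop platform. Hadoop is a distributed platform with good scalability, fault tolerance, and robustness. Its MapReduce computing model processes data as key-value pairs, which makes distributed processing easy to implement, while the HDFS distributed file system provides distributed storage of data. Working together, the two support the implementation of a distributed web crawler well.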

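(2) Designing the distributed web crawler system. The system is divided into five main modules for distributed implementation: the distributed crawl module, the distributed URL extraction module, the distributed URL filter module, the web page de-duplication module, and the distributed storage module. The distributed crawl module fetches, in parallel, the web pages that correspond to the URLs in the queue of URLs waiting to be crawled. The distributed URL extraction module extracts all outgoing links contained in the crawled pages. The distributed URL filter module removes duplicate URLs from those extracted by the previous module. The web page de-duplication module compares the page data stored in the system and removes duplicate pages to avoid wasting storage. The distributed storage module stores the data produced by the other modules; to keep each module's data separate, it is designed with four repositories: the URL queue repository, the outgoing-link URL repository, the crawled web page repository, and the final resource repository.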

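(3) Building the Hadoop platform and implementing the distributed web crawler system. The CentOS operating system is installed on the computer and the Java environment is configured, using JDK version jdk-6u45-linux-x64. Once this base environment is ready, the Hadoop platform is set up, each crawler module is implemented in distributed form on the platform, and finally the modules are integrated into a complete system.

Keywords: Hadoop; web crawler; distributed; Map/Reduce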
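The URL extraction and URL filtering modules described in item (2) above fit the key-value style of processing named in item (1). The following is a minimal single-process sketch of that idea in Python, not the thesis's Hadoop implementation: the map step emits (outgoing URL, source page) pairs, a grouping step stands in for the framework's shuffle, and the reduce step emits each distinct URL once unless it has already been crawled. All function names (`extract_map`, `shuffle`, `filter_reduce`, `next_url_queue`), the sample pages, and the choice to drop URL fragments are assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_map(page_url, html):
    """Map step: emit (outgoing_url, source_page_url) for each link in a page."""
    parser = LinkParser()
    parser.feed(html)
    for href in parser.links:
        # Resolve relative links and drop the #fragment (an assumed normalization).
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        yield absolute, page_url


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def filter_reduce(url, sources, already_crawled):
    """Reduce step: each URL is seen once per key; skip it if already crawled."""
    if url not in already_crawled:
        yield url


def next_url_queue(pages, already_crawled):
    """Run extraction and filtering over a dict of {page_url: html}."""
    pairs = [kv for url, html in pages.items() for kv in extract_map(url, html)]
    queue = []
    for url, sources in sorted(shuffle(pairs).items()):
        queue.extend(filter_reduce(url, sources, already_crawled))
    return queue


# Hypothetical sample data: two crawled pages that link to each other and to /b.
pages = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
}
print(next_url_queue(pages, already_crawled=set(pages)))
# prints ['http://example.com/b']
```

Grouping by URL is what removes duplicates here: `/b` is linked from both pages but reaches the reduce step as a single key, which is why a key-value model suits this kind of filtering.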

Abstract

Since the start of the 21st century, information has entered an era of explosive growth. Faced with such huge and diverse data sources, how to obtain, store, and manage resources, how to quickly find the information one needs, and how to ensure the accuracy and validity of information have become urgent problems for all sectors. Against this background, this project studies and implements a distributed web crawler based on the Hadoop platform. It aims to use the open-source Hadoop platform to build a distributed web crawler system that replaces the centralized web crawler, in order to improve the crawling and storage capacity of the system. The main work is as follows:

(1) Studying the background of the topic. The shortcomings of the centralized web crawler motivate the exploration of the distributed web crawler. The work also studies the Map/Reduce computing model and the HDFS file system of the Hadoop platform. Hadoop is a distributed platform with good scalability, fault tolerance, and robustness; its Map/Reduce computing model processes data as key-value pairs, which makes distributed processing easy to implement; and the HDFS distributed file system provides distributed storage of data. Together, the Map/Reduce computing model and HDFS support the implementation of a distributed web crawler well.

(2) Designing the web crawler system. The system is built from five main modules: the distributed crawl module, the distributed URL extraction module, the distributed URL filter module, the web page de-duplication module, and the distributed storage module. The distributed crawl module crawls, in parallel, the pages that correspond to the URLs in the URL queue. The distributed URL extraction module extracts all outgoing links contained in a web page. The distributed URL filter module removes duplicate URLs from those extracted. The web page de-duplication module compares the web page data stored on HDFS and removes duplicate pages in order to avoid wasting storage resources. The distributed storage module stores the data generated by the other modules; to store each module's data separately, the system defines four repositories: the URL queue repository, the link repository, the crawled web page repository, and the final resource repository.

(3) Building the Hadoop platform and implementing the distributed web crawler system. The CentOS operating system is first installed on the computer and the Java environment is configured, using JDK version jdk-6u45-linux-x64. The Hadoop platform is then built on this base environment, each module of the crawler system is implemented on it, and finally all modules are integrated into a complete system.

Key Words: Hadoop; Web crawler; Distributed; Map/Reduce
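The module in item (2) that removes duplicate pages can also be pictured in key-value terms. The sketch below is one simple way to do it, assuming exact-duplicate detection by content hash; the abstract does not say which comparison method the thesis uses, so the SHA-256 fingerprint, the whitespace normalization, the rule of keeping the smallest URL, and the names `fingerprint` and `dedup_pages` are all hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict


def fingerprint(html):
    """Hash the page text after collapsing whitespace (an assumed normalization)."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_pages(pages):
    """Return {url: html} keeping one page per distinct fingerprint.

    Map: emit (fingerprint, url) for every stored page.
    Reduce: for each fingerprint, keep only the smallest URL in its group.
    """
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[fingerprint(html)].append(url)
    kept = sorted(min(urls) for urls in groups.values())
    return {url: pages[url] for url in kept}


# Hypothetical sample data: the first two pages differ only in whitespace.
crawled = {
    "http://example.com/a": "<p>hello   world</p>",
    "http://example.com/a?ref=1": "<p>hello world</p>\n",
    "http://example.com/b": "<p>other</p>",
}
print(list(dedup_pages(crawled)))
# prints ['http://example.com/a', 'http://example.com/b']
```

A hash of this kind only catches pages that are identical after normalization. Detecting near-duplicate pages would need a similarity measure instead, which this sketch does not attempt.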

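Contents

Chapter 1 Introduction

1.1 Research Purpose and Significance

1.2 Research Status in China and Abroad

1.2.1 Research Status of Distributed Web Crawlers in China and Abroad

1.2.2 Research Status of the Hadoop Platform in China and Abroad

1.3 Main Research Content and Objectives

Chapter 2 Study of Related Technologies

2.1 Background on the Hadoop Platform

2.1.1 The HDFS Distributed File System

2.1.2 The Map/Reduce Computing Model

2.2 Basic Principles of Web Crawlers

2.2.1 Working Principle of a Web Crawler System

2.2.2 Basic Structure of a Web Crawler System

Chapter 3 System Requirements Analysis

3.1 Overall System Requirements Analysis

3.2 Functional Module Requirements Analysis

3.2.1 Requirements Analysis of the Distributed Crawl Module

3.2.2 Requirements Analysis of the Distributed URL Extraction Module

3.2.3 Requirements Analysis of the Distributed URL Filter Module

3.2.4 Requirements Analysis of the Web Page De-duplication Module

3.2.5 Requirements Analysis of the Distributed Storage Module

Chapter 4 System Design and Implementation

4.1 Overall System Design

4.2 Detailed Design and Implementation of Each Module

4.2.1 Distributed Crawl Module

4.2.2 Distributed URL Extraction Module

4.2.3 Distributed URL Filter Module

4.2.4 Web Page De-duplication Module

4.2.5 Distributed Storage Module

Chapter 5 Summary and Outlook

References

Acknowledgements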

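Chapter 1 Introduction

This chapter describes in detail the background, significance, and purpose of the research, and gives an initial outline of the tasks required to implement the system.

1.1 Research Purpose and Significance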

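Since the start of the 21st century, the Internet has developed at high speed. The 37th Statistical Report on Internet Development in China states^[1]: "As of December 2015, China had 688 million Internet users and an Internet penetration rate of 50.3%, meaning that more than half of China's residents are online. In 2015, 39.51 million new users were added, a growth rate of 6.1%, up 1.1 percentage points from 2014, so growth in the number of users has accelerated." This many users publish and retrieve a vast and varied body of information online every day, so the information available on the network grows exponentially. Popular social sites such as Weibo, SNS platforms, and Bilibili generate large volumes of data every day and update it constantly. Shopping sites such as Taobao, JD.com, Amazon, and Dangdang record tens of thousands of browsing and order records every day. According to statistics, the number of web pages now exceeds 200 billion, covering resources in every field, language, and form. Faced with such massive data resources, which exist in different forms and in different locations, how to acquire, store, and manage them, how to quickly find the information one needs, and how to ensure that the information is accurate and valid have become problems in urgent need of solutions.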

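Many approaches have been tried to solve these problems. The first was to increase the storage and computing capacity of a single computer, and hardware and software performance has been improved continuously for that purpose. That improvement, however, falls far short of the explosive growth of information, and relying on a single computer to solve the problem has become impossible. Google was the first to propose a change. Using Map/Reduce^[2], a computing model for massive data that it developed in-house, Google built its computing cluster, which joins the computing resources of many individual computers into a cluster that is logically a single machine. In this way, many inexpensive computers can be combined into a cluster with very high computing power. Google also developed the distributed file system GFS^[3] to store massive data. With these two technologies, Google solved the problem of processing massive data and greatly improved the performance of its search engine, and on this basis it proposed the development of a cloud computing platform. The design ideas are presented in full in its article "Web Search For A Planet: The Google Cluster Architecture"^[4]. However, because Google's technology is not publicly available and its system is not sufficiently extensible, it is hard for others to study and use. Meanwhile, many other companies have implemented their own distributed web crawler systems, each with its own strengths and weaknesses.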
