Research and Application of Web Crawler Based on Scrapy Framework Distributed Algorithm

2023-02-17 09:26:05

Total thesis length: 30,270 characters

Abstract

Since the start of the information age, people have no longer relied solely on physical media such as paper to record information and have increasingly kept it in computer storage. In the rapid development that followed, daily life has become inseparable from technologies such as social networking and cloud computing, so data inevitably keeps accumulating in medicine, commerce, the Internet, and other fields. This explosive growth of data has pushed us into the era of big data. "Big data", however, is only a term that describes a large volume of data, whether structured or unstructured. The volume in itself is not what matters: the important data must be extracted through analysis, and the results of that analysis then support better decisions and strategic business moves. Applications of big data include identifying the causes of failures and problems in near real time, recommending product advertisements that match customers' needs, and even games.

For data acquisition, however, relying on search engines alone has limitations: search engines often cannot meet the need for particular data in particular fields. Focused web crawlers emerged to solve this problem. Because they fetch only the content that a specific need calls for, they greatly reduce the cost in resources and time and can also keep the data up to date. In addition, few applications so far cover the complete process from data acquisition through analysis to presentation.

Starting from the basic theory of data acquisition and analysis, this thesis first explains web crawling and data analysis techniques in accessible terms, and then presents the detailed operation of crawling and data mining step by step. Its focus is how to use a web crawler to collect data and how to analyze the data collected. It briefly compares different programming languages for data collection and data analysis, and then studies and discusses in detail the principles and concrete implementation steps of data collection and analysis in Python using a distributed algorithm and a clustering algorithm. For data collection, multithreading combined with distribution is used to increase crawling speed; for data analysis, clustering and other statistical methods are used to process large amounts of data. Finally, the data is visualized on the web side to provide good interactivity.

Keywords: Computer; Big Data; Web Crawler; Data Analysis; Data Mining; Data Visualization

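Contents

Abstract (Chinese)

Abstract (English)

Chapter 1 Introduction

1.1 Background and Significance of the Research

1.2 Research Status at Home and Abroad

1.3 Research Content

Chapter 2 Fundamental Theory

2.1 Web Page and Network Connection Basics

2.1.1 W3C Standards

2.1.2 HTTP Standard

2.2 Web Crawlers

2.2.1 Principles of Web Crawlers

2.2.2 Overview of Distributed Crawlers

2.2.3 The Scrapy Crawler Framework

2.2.4 Scrapy Architecture Analysis

2.2.5 Scrapy Development Workflow

2.2.6 HTML Parsing Methods

2.2.7 Data Storage

2.2.8 Web Crawling Strategies

2.3 Data Analysis

2.3.1 Data Analysis Theory

2.3.2 Models and Statistical Algorithms

2.3.3 Introduction to the NumPy and pandas Libraries

2.4 Web Development

2.4.1 Introduction to Python Flask

2.4.2 Web Development Workflow

Chapter 3 Implementation of a Web Crawler Based on the Scrapy Framework Distributed Algorithm

3.1 Web-Side Protocol Analysis

3.1.1 Determining the Target Pages to Crawl

3.1.2 Determining the Project Workflow

3.2 Writing the Spider

3.2.1 Simulating Manual Steps with Test Code

3.2.2 Defining the Item

3.2.3 Creating the Crawler Module

3.2.4 Creating the Data Parsing Module

3.2.5 Writing the Item Pipeline Module to Store Data

3.2.6 Distributed Optimization and Deduplication Optimization of the Crawler

3.3 Countering Anti-Crawler Mechanisms

3.3.1 Spoofing the User-Agent

3.3.2 Setting a Network Access Delay

3.3.3 Building IP Proxies

3.4 Data Presentation

Chapter 4 Data Analysis with the pandas Module

4.1 Data Analysis with the pandas Module

4.1.1 Defining the Problem

4.1.2 Data Collection and Data Loading

4.1.3 Data Analysis

4.1.4 Preliminary Data Visualization

4.1.5 Data Presentation

Chapter 5 Web-Based Data Visualization

5.1 Building the Web Side with the Flask Framework

5.2 Building Charts with the JavaScript ECharts Plugin

5.3 Deployment

Chapter 6 Result Analysis and Conclusions

6.1 Presentation of Results

6.2 Analysis Results

Chapter 7 Summary and Outlook

7.1 Summary

7.2 Future Work

Acknowledgments

References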

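Chapter 1 Introduction

1.1 Background and Significance of the Research

With the explosive growth of information and the emergence of cloud computing, the era of big data has arrived. Governments, schools, banks, hospitals, Internet companies, and even brick-and-mortar industries increasingly rely on big data for prediction and decision-making.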
