Information Collection Technology of Dynamic Web Pages: Graduation Thesis

2022-05-26 21:42:04

Total thesis length: 23,063 characters

Abstract

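Traditional offline trading has inherent limitations: the latest developments in intellectual property transactions cannot be reported promptly or accurately. Developing a collaborative intellectual property trading platform is therefore particularly important. How to collect relevant information from the many information resources on the web, filter and screen it, and then analyze and store it in a database has become one of the key technologies in building such a platform.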

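This thesis analyzes the current state and development trends of dynamic web page information collection in China and abroad, and discusses several problems with current web crawler technology. Based on the origin and background of the project, it analyzes the requirements of the Intellectual Property Trading Collaborative Operation Center of Gulou District, Nanjing. 佰腾网 (Baiteng) is chosen as the test target for web page collection, and the key fields returned by its patent search are chosen as the data to capture, in order to assess the feasibility and applicability of the methods studied here. After analyzing the workflow of a web crawler and several commonly used crawler technologies, Heritrix is selected to collect and store information from dynamic web pages. After weighing the advantages of the technologies involved in developing the collection module, Ajax, JavaScript libraries such as jQuery and Knockout, regular expressions, and CSS are chosen to support its development. Following object-oriented analysis and design, the collection module is planned and developed in layers using the MVC design pattern. Program flowcharts describe the execution flow of the collection module in detail.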

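For the implementation, the back-end program of the collection module and the front-end user interface are built on the SSM (Spring, Spring MVC, MyBatis) framework. MySQL serves as the database that provides the module's back-end data source. Operations performed at the front end connect to and access the database securely through MyBatis.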

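Using these technologies together, this thesis implements the basic functions of dynamic web page information collection and largely meets the needs of the Intellectual Property Trading Collaborative Operation Center of Gulou District, Nanjing. The results may serve as a reference for other systems that need automatic information fusion.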

Keywords: intellectual property; collaborative trading; dynamic web pages; web crawler; information collection; SSM

Information Collection Technology of Dynamic Web Pages

ABSTRACT

Traditional offline transactions have inherent limitations that lead to untimely and inaccurate feedback on the latest developments in intellectual property transaction information. It is therefore important to develop a collaborative trading platform for intellectual property. How to collect related information from numerous network information resources, filter it, and store it in a database has become one of the key technologies in developing such a platform.

This paper analyzes the current situation and development trends of information collection technology for dynamic web pages and discusses existing problems with web crawlers. Based on the source and background of the project, a requirements analysis is carried out for the intellectual property collaborative transaction operation center of Gulou District, Nanjing. Baiteng is chosen as the test target for web information collection, and the key information returned by its patent search is the information captured, so as to assess the feasibility and applicability of the method studied in this paper. On the basis of an analysis of the web crawler workflow and several commonly used crawler techniques, Heritrix is chosen to collect and store information from dynamic web pages. After analyzing the advantages of the technologies used in developing the collection module, Ajax, JavaScript libraries such as jQuery and Knockout, regular expressions, and CSS are selected to assist its development. Following object-oriented analysis and design, the collection module is planned and developed in layers using the MVC design pattern. Program flowcharts are used to describe the execution flow of the module in detail.

In the implementation, the back-end program and the front-end user interface of the collection module are built on the SSM (Spring, Spring MVC, MyBatis) framework. MySQL is used as the database that provides the back-end data source of the collection module. Operations carried out at the front end connect to and access the database securely through MyBatis.

By combining the above technologies, this paper implements the basic functions of dynamic web page information collection and basically meets the requirements of the intellectual property collaborative transaction operation center of Gulou District, Nanjing. The research results can offer some reference for other systems that require automatic information fusion.

Keywords: Intellectual Property; Collaborative Transaction; Dynamic Web Page; Web Crawler; Information Collection; SSM

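Abstract

ABSTRACT

Chapter 1 Introduction

1.1 Source of the Topic

1.2 Research Background

1.3 Research Approach and Methods

1.3.1 Research Approach Adopted in This Thesis

Chapter 2 Literature Review

2.1 Overview of Dynamic Web Page Information Collection

2.1.1 Basic Principles of Information Collection

2.1.2 Basic Structure of Information Collection

2.1.3 Main Difficulties

2.2 Current Research on Collection Technology in China and Abroad

2.2.1 Information Collection Across the Entire Web

2.2.2 Incremental Web Information Collection

2.2.3 Topic-Focused Web Information Collection

2.2.4 User-Personalized Information Collection

2.3 Development Trends

2.4 Review of the Research

Chapter 3 Implementation Technologies for Dynamic Web Page Information Collection

3.1 Technical Standards Followed in Information Collection

3.1.1 The HTTP Protocol

3.1.2 The HTML Standard

3.1.3 The URL Standard

3.2 Web Crawler Technologies

3.2.1 Web Crawlers

3.2.2 Introduction to Heritrix

3.2.3 Introduction to HtmlUnit

3.2.4 Comparison and Selection

3.3 Introduction to System Development Technologies

3.3.1 AJAX

3.3.2 CSS

3.3.3 jQuery

3.3.4 Regular Expressions

3.3.5 The SSM Framework

Chapter 4 Analysis and Implementation of the Collection Module

4.1 Analysis of the Information Collection Module

4.1.1 Three-Tier Architecture Analysis

4.1.2 Runtime Platform

4.2 Overall Structural Design of the Collection Module

4.3 Database Design

4.3.1 Database Structure

4.3.2 Creating Tables with MySQL

4.4 Overall Analysis and Implementation of the Collection Module

4.4.1 Implementing Search Page Collection

4.4.2 Implementing Detail Page Collection

4.4.3 Overall Analysis of the Collection Module

Conclusion

References

Appendix

Appendix 1: Connecting to the Database with JDBC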
