Design and Implementation of an Internet-Based Topic Content Extraction and Storage System

2023-02-22 10:00:35

Total length of the thesis: 17,883 characters

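Abstract (Chinese)

The Internet hosts an enormous number of web pages, and each page usually has a topic, yet people cannot quickly and effectively find the pages related to a topic they care about. The Internet-based topic content extraction and storage system helps users quickly extract the pages on the topics they follow and stores the extracted content so that it can be viewed conveniently, saving users the time otherwise spent searching for target pages.

The system consists of five main parts: (1) web page download, (2) page parsing, (3) URL crawling, (4) result processing, and (5) extracted-content storage. The first part downloads web pages for subsequent processing. The second part parses the page content, extracts the information users need, and discovers new links. The third part manages the URLs waiting to be crawled and handles deduplication. The fourth part processes the extraction results and stores them in a database or a file. The fifth part stores the extracted content in the database for later viewing. The system also provides a graphical interface for ease of operation.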
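The third part, which manages pending URLs and removes duplicates, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code of the system itself: the class name `UrlFrontier` and its methods are hypothetical, two URLs are assumed to be duplicates only when their strings are exactly equal, and the methods are synchronized on the assumption that several download threads share one instance.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical helper, not taken from the thesis: holds the URLs waiting to be
// downloaded and remembers every URL it has ever accepted.
class UrlFrontier {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Queues a URL unless it is empty or was accepted before.
    // Returns true only when the URL was actually queued.
    synchronized boolean offer(String url) {
        if (url == null || url.isEmpty()) {
            return false;
        }
        if (!seen.add(url)) {
            return false; // exact-string duplicate (assumed deduplication rule)
        }
        pending.add(url);
        return true;
    }

    // Returns the next URL in first-in, first-out order, or null when none is waiting.
    synchronized String poll() {
        return pending.poll();
    }

    // Number of URLs still waiting to be downloaded.
    synchronized int pendingCount() {
        return pending.size();
    }
}
```

In such a design the page parsing part would call `offer` for every link it discovers, and the download part would call `poll` to obtain its next task. A URL that has already been polled stays in the `seen` set, so it is not crawled a second time.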

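The system runs on the Windows 7 operating system and was developed in Eclipse with a SQL Server 2008 database.

Keywords: page parsing; deduplication; persistence; database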

Abstract

There are a large number of web pages on the Internet, and each page usually has a topic, but people cannot quickly and effectively obtain the pages related to the topic they are interested in. The Internet-based topic content extraction and storage system helps people rapidly extract the pages on the topics they follow and stores the extracted content so that it is convenient to view. This saves the time people would otherwise spend looking for target pages.

The system is divided into five main parts: (1) web page download, (2) page parsing, (3) URL crawling, (4) result processing, and (5) extracted-content storage. The first part downloads web pages for later processing. The second part parses the page content, extracts the information people need, and discovers new links. The third part manages the URLs to be crawled and carries out deduplication tasks. The fourth part processes the extraction results and stores them in a database or a file. The fifth part stores the extracted content in the database for later viewing. A graphical interface was also developed to make the system easy to operate.

The system is based on the Windows 7 operating system and was developed in Eclipse with a SQL Server 2008 database.

Keywords: page parsing; deduplication; persistence; database

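Contents

Abstract (Chinese)

Abstract

Chapter 1 Introduction

1.1 Background and Significance of the Topic

1.2 Current Status and Trends at Home and Abroad

1.3 Application Areas of the Topic

1.4 Organization of the Thesis

Chapter 2 Introduction to Development Tools and Technologies

2.1 Introduction to Development Tools

2.1.1 Introduction to Microsoft SQL Server 2008

2.1.2 Introduction to Eclipse

2.1.3 Introduction to Swing Programming

2.2 Introduction to Development Technologies

2.2.1 Introduction to the Java Programming Language

2.2.2 Introduction to Java Threads

2.2.3 Introduction to SQL Database Systems

2.2.4 Introduction to Database Access with JDBC

Chapter 3 System Requirements Analysis

3.1 System Analysis

3.1.1 Feasibility Analysis of the System Design

3.1.2 Requirements Analysis of the System Design

3.1.3 Analysis of the System Design Approach

3.2 System Database Analysis

3.2.1 Analysis of the Database Design Structure

Chapter 4 Overall System Design

4.1 Overall Architecture Design

4.2 Overall Functional Design

4.3 Overall Workflow Design

Chapter 5 Detailed System Design and Implementation

5.1 Design of the Web Page Download Part

5.2 Design of the Page Parsing Part

5.2.1 Extraction of Page Elements

5.2.2 Discovery of Follow-up Links

5.3 URL Crawling Part

5.3.1 URL Crawl Management

5.3.2 Deduplication of Crawled URLs

5.4 Design of the Extracted Content Processing and Storage Part

5.4.1 Processing of Extracted Content

5.4.2 Storage Design for Extracted Content

5.4.3 Database Design

5.5 Interface Design Part

5.5.1 Front-End Display Design

5.5.2 Back-End Start and Stop Design

Chapter 6 System Testing

6.1 Database Connection Test

6.2 Web Page Content Extraction Test

6.3 Extracted Content Storage Test

6.4 User Interface Operation Test

Chapter 7 Conclusion

Acknowledgments

References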

Chapter 1 Introduction

1.1 Background and Significance of the Topic
