Design and Implementation of a Web Page Table Extraction Management System

2023-03-09 08:54:38

Total thesis length: 19,257 characters

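Summary

With the rapid development of Internet technology and applications, large amounts of data from production and daily life have accumulated online. Tables, as a form of structured data, appear widely in web pages, yet extracting their structure and content by hand is time-consuming and laborious. To address this problem, this project designs and implements software that recognizes and extracts tables from web pages and saves them in a database for later querying and processing.

The main functions of the web page table extraction management system are to detect tables in a web page and to store and retrieve their content. The main tools used are Eclipse and SQL Server Management Studio. The system runs on Windows 7, is written in Eclipse, and its user interface is built with Java Swing. Web page content is handled with Jsoup: with Jsoup.jar installed, the system parses the page source and analyzes it to determine whether the page contains tables and whether those tables are regular. Finally, the system uses SQL statements to write the content of regular tables into a SQL Server database as corresponding relational tables, completing the extraction and storage of the data.

Keywords: web page tables, data extraction, Eclipse, SQL Server database, Jsoup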
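
The step of deciding whether a page holds a regular table can be illustrated with a minimal sketch. It is not the thesis's own code, and the definition of "regular" used here is an assumption: a header row plus at least one data row, every row with the same number of cells, and no merged cells. The class `RegularTableCheck` and its `Cell` type are hypothetical names. The sketch also assumes the rows have already been pulled out of the page, for example with Jsoup's `Jsoup.parse(html).select("table")` and the `rowspan` and `colspan` attributes of each cell.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: decides whether an already-extracted table is "regular". */
final class RegularTableCheck {

    /** One table cell: its text plus the HTML rowspan/colspan values (1 when absent). */
    static final class Cell {
        final String text;
        final int rowspan;
        final int colspan;

        Cell(String text, int rowspan, int colspan) {
            this.text = text;
            this.rowspan = rowspan;
            this.colspan = colspan;
        }

        /** Convenience factory for an unmerged cell. */
        static Cell of(String text) {
            return new Cell(text, 1, 1);
        }
    }

    /**
     * Assumed definition: a table is regular when it has a header row plus at least
     * one data row, every row has the same number of cells, and no cell is merged.
     */
    static boolean isRegular(List<List<Cell>> rows) {
        if (rows == null || rows.size() < 2 || rows.get(0).isEmpty()) {
            return false;
        }
        int width = rows.get(0).size();
        for (List<Cell> row : rows) {
            if (row.size() != width) {
                return false; // ragged row: column count differs from the header
            }
            for (Cell cell : row) {
                if (cell.rowspan != 1 || cell.colspan != 1) {
                    return false; // merged cell breaks the row/column grid
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<Cell>> table = Arrays.asList(
            Arrays.asList(Cell.of("Name"), Cell.of("Score")),
            Arrays.asList(Cell.of("Li"), Cell.of("90")));
        System.out.println(isRegular(table));
    }
}
```

Keeping the check separate from the HTML parsing means the rule for what counts as regular can be changed or tested without loading a real page.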

Abstract

With the rapid development of Internet technology and applications, a large amount of data from production and daily life has gathered on the Internet. Tables, as a kind of structured data, exist in large numbers in web pages, and extracting their structure and content from a page takes considerable time and effort. In view of this problem, this project designs and implements web page table recognition and extraction software that recognizes the tables in a page, extracts them, and stores them in a database to facilitate later querying and processing of the table contents.

The main function of the web page table extraction management system is to detect web page tables and to store and retrieve their content. The main tools used are Eclipse and SQL Server Management Studio. The system runs on the Windows 7 platform, the program is written in Eclipse, and the user interface is designed with Java Swing. Page content is handled with Jsoup: once Jsoup.jar is installed, the system can parse the page source and then analyze it to determine whether the page contains tables and whether they are regular tables. Finally, the system writes the contents of the regular tables to a SQL Server database using SQL, forming the corresponding relational tables and completing the extraction and storage of the data.

Keywords: web page tables, data extraction, Eclipse, SQL Server database, Jsoup
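
The final step described above, writing a regular table into SQL Server as a relational table, might look roughly like the following sketch. It is an illustration under stated assumptions and not the thesis's implementation: the class `TableStore` and its method names are hypothetical, every column is assumed to be `NVARCHAR(255)`, the first row of the web table is assumed to supply the column names, and the caller is assumed to provide an already open JDBC `Connection`.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps a regular table (header row + data rows) to a SQL Server table. */
final class TableStore {

    /** Quotes an identifier in SQL Server bracket style, doubling any closing bracket. */
    static String quote(String identifier) {
        return "[" + identifier.replace("]", "]]") + "]";
    }

    /** Builds CREATE TABLE with one NVARCHAR(255) column per header cell (assumed column type). */
    static String createTableSql(String table, List<String> header) {
        List<String> columns = new ArrayList<>();
        for (String name : header) {
            columns.add(quote(name) + " NVARCHAR(255)");
        }
        return "CREATE TABLE " + quote(table) + " (" + String.join(", ", columns) + ")";
    }

    /** Builds a parameterized INSERT with one placeholder per column. */
    static String insertSql(String table, int columnCount) {
        List<String> marks = new ArrayList<>();
        for (int i = 0; i < columnCount; i++) {
            marks.add("?");
        }
        return "INSERT INTO " + quote(table) + " VALUES (" + String.join(", ", marks) + ")";
    }

    /** Creates the table, then inserts every data row as one batch over an open connection. */
    static void store(Connection conn, String table, List<String> header, List<List<String>> rows)
            throws SQLException {
        try (Statement create = conn.createStatement()) {
            create.executeUpdate(createTableSql(table, header));
        }
        try (PreparedStatement insert = conn.prepareStatement(insertSql(table, header.size()))) {
            for (List<String> row : rows) {
                for (int i = 0; i < row.size(); i++) {
                    insert.setString(i + 1, row.get(i)); // JDBC parameters are 1-based
                }
                insert.addBatch();
            }
            insert.executeBatch();
        }
    }
}
```

Cell values are passed as statement parameters because text scraped from arbitrary pages may contain quotes. Table and column names cannot be parameterized in JDBC, which is why they are bracket-quoted instead.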

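Summary

Chapter 1 Introduction

1.1 Project Background

1.2 Research Purpose and Significance

1.3 System Overview

1.4 System Development and Runtime Environment

1.5 Thesis Structure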

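Chapter 2 Development Environment and Software Overview

2.1 Development Environment

2.2 Introduction to Java Software

2.2.1 Introduction to the Java Language

2.2.2 Introduction to Eclipse

2.3 SQL Server Database

2.3.1 SQL (Structured Query Language)

2.3.2 JDBC Overview

2.3.3 Database Connection Settings

2.4 Introduction to HTML

2.5 The Jsoup 1.10.2 Parser

2.6 Chapter Summary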

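Chapter 3 System Analysis

3.1 System Requirements Analysis

3.2 Functional Analysis

3.3 Feasibility Analysis

3.3.1 Technical Feasibility

3.3.2 Economic Feasibility

3.3.3 Operational Feasibility

3.4 Chapter Summary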

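Chapter 4 Detailed System Design

4.1 Overall Functional Structure of the System

4.1.1 Functional Module Diagram

4.1.2 Functional Model Overview

4.2 Building the Use Case Model

4.3 Use Case Descriptions

4.4 System Function Flowchart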

Chapter 5 System Implementation

5.1 Interface Design

5.2 Connecting to the Database

5.3 Detection Module

5.4 Storage Module

Chapter 6 System Testing

6.1 Detection Module Testing

6.2 Storage and Retrieval Module Testing

6.3 Database Testing

6.4 Chapter Summary

Acknowledgments

References

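Chapter 1 Introduction

1.1 Project Background

With the rapid development of Internet technology and applications in the twenty-first century, large amounts of data from enterprise production and people's daily lives have accumulated online, and this volume multiplies every year. Today the Internet has a major influence on our work and daily life. As we use the network we also keep updating data, so the amount of data gathered on the Internet keeps growing.

As information on the Internet expands rapidly, the Web has become the world's largest and most diverse repository of information, and applications of web page information extraction have flourished alongside it. More and more people are paying attention to how to extract information from web pages, and popular industries now place high demands on web extraction technology: people want extraction techniques that are more effective, more accurate, and more convenient.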
