Graduation Thesis: A Big-Data-Based Distributed System for Collecting Shipping Traffic Data from the Internet

2021-04-05 12:04

Abstract (translated from the Chinese)

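To meet the application requirements of shipping big data systems, this paper presents a big-data-based distributed system for collecting shipping traffic data from the Internet and explains the basic principles and methods of shipping traffic data collection. Based on the characteristics of shipping big data, it designs and implements a shipping big data storage system, evaluates and implements a distributed, multi-threaded scheme for collecting shipping traffic data and maritime information, and performs preliminary cleaning and fusion of the collected data.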

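The design first implements a distributed file system on a Hadoop server cluster and then builds a distributed database on top of it with HBase; together these form the big data storage system. Using Python 3.6 and related packages, the paper analyzes the architecture and operating mechanisms of the target websites, including mysips.com (宝船网), the National Maritime Safety Administration, the China Classification Society, and the Changjiang Maritime Safety Administration. On that basis it designs and implements a dedicated multi-threaded collector for shipping data and a dedicated distributed collector for maritime information, and tests the performance of their structure and functions in actual use. The data gathered by these collectors are then preliminarily cleaned and fused, and several data-processing issues that arise in the system are discussed.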

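At the time of writing, the data collection system described in this paper has been running online for one month and serves as a subsystem of the advisor's shipping big data risk visualization platform.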

Keywords: data collection; shipping data; distributed system

Abstract

A shipping big data system, that is, a system that analyzes big data in shipping, involves the acquisition, storage, processing, and visualization of big data, and data acquisition plays a fundamental and leading role among these. To meet the application requirements of such systems, this paper introduces a big-data-based distributed system for collecting shipping traffic data from the Internet and explains the basic principles and methods of big data, data acquisition, data storage, and data cleaning. It then describes the design and implementation of the big data storage system, evaluates and implements a distributed, multi-threaded scheme for collecting shipping data and maritime information, and carries out preliminary cleaning and fusion of the acquired data.

The design first implements a distributed file system using Hadoop on a server cluster. On top of this, HBase is used to implement a distributed database, which completes the big data storage system. Using Python 3.6 and related software packages, this paper analyzes the architecture and operating mechanisms of target websites such as mysips.com, the National Maritime Safety Administration, the China Classification Society, and the Changjiang Maritime Safety Administration. On that basis it designs and implements a dedicated multi-threaded shipping data collector and a dedicated distributed maritime information collector, and tests the performance of their structure and functions in actual use. The data acquired by these collectors are preliminarily cleaned and merged, and some data-processing problems in the system are discussed.

At the time of writing, the data acquisition system described in this paper has been online for one month and serves as a subsystem of my advisor's shipping big data risk visualization platform.

Keywords: data acquisition; shipping data; distributed system

Introduction

Research Background and Significance

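The era of big data has arrived. Wu Jun^[1] points out that the fundamental force shaping human society over the past fifty-plus years was Moore's law, whereas what will truly change society in the coming decades is big data. Big data is already changing the way we live.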

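The first problem in studying big data is collecting it. To help users obtain information resources from the web, or to supply data for scientific research, researchers need to build a web information collection system, which uses web crawlers to fetch data from the Internet.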

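The data collection system described in this paper gathers shipping traffic data with a distributed web crawler. In the early days of the Internet the number of web pages was small, so researchers generally ran a crawler on a single machine to collect the data they needed^[2], and the same was true for shipping traffic information. With the rapid growth of the web, however, the number of pages is now orders of magnitude larger than it was^[3]. Faced with that many pages, relying on a single-machine crawler to obtain enough data is unrealistic: even with a high-performance, high-bandwidth server, the crawler's collection rate falls far behind the rate at which new pages appear^[4]. It is therefore necessary to adopt a distributed web crawler that can scale out to multiple machines, and this paper uses one to design and implement a fast, efficient, secure, and stable system for acquiring web information.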
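The contrast between a single sequential crawler and a pool of concurrent workers can be illustrated with a minimal Python sketch. It is not the collector built in this work: the names `crawl`, `fetch`, and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real HTTP request so that the example runs offline. Worker threads pull URLs from a shared queue until it is empty.

```python
import queue
import threading


def crawl(seed_urls, fetch, num_workers=4):
    """Fetch every URL in seed_urls with a pool of worker threads.

    `fetch` is a caller-supplied function (hypothetical here) that takes a
    URL and returns the page content; a real collector would wrap an HTTP
    request in it.
    """
    tasks = queue.Queue()
    for url in seed_urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, so this thread exits
            try:
                page = fetch(url)
            except Exception as exc:
                # Record the failure instead of letting one bad page
                # terminate the worker.
                page = exc
            with lock:
                results[url] = page

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    # Stand-in for a network request (assumption for illustration only).
    return "content of " + url


if __name__ == "__main__":
    pages = crawl(["http://example.com/a", "http://example.com/b"], fake_fetch)
    for url in sorted(pages):
        print(url, "->", pages[url])
```

Under the same assumptions, a distributed variant would replace the in-process `queue.Queue` with a task queue shared across machines, so that workers on different hosts draw from one pool of pending URLs.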

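Much research on web data collection is under way. Future work is expected to further improve the efficiency of the algorithms as well as the accuracy and timeliness of search engines, and work on different crawling algorithms can be extended to improve the speed and accuracy of web collection^[5-6].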
