An improved task assignment scheme for Hadoop running in the clouds: foreign-language source text and translation

2022-12-18 15:39:11

Dai and Bassiouni Journal of Cloud Computing: Advances, Systems and Applications 2013, 2:23 http://www.journalofcloudcomputing.com/content/2/1/23

RESEARCH Open Access

An improved task assignment scheme for Hadoop running in the clouds

Wei Dai^* and Mostafa Bassiouni

Abstract

Nowadays, data-intensive problems are so prevalent that numerous organizations in various industries have to face them in their business operation. It is often crucial for enterprises to have the capability of analyzing large volumes of data in an effective and timely manner. MapReduce and its open-source implementation Hadoop dramatically simplified the development of parallel data-intensive computing applications for ordinary users, and the combination of Hadoop and cloud computing made large-scale parallel data-intensive computing much more accessible to all potential users than ever before. Although Hadoop has become the most popular data management framework for parallel data-intensive computing in the clouds, the Hadoop scheduler is not a perfect match for the cloud environments. In this paper, we discuss the issues with the Hadoop task assignment scheme, and present an improved scheme for heterogeneous computing environments, such as the public clouds. The proposed scheme is based on an optimal minimum makespan algorithm. It projects and compares the completion times of all task slots' next data block, and explicitly strives to shorten the completion time of the map phase of MapReduce jobs. We conducted extensive simulation to evaluate the performance of the proposed scheme compared with the Hadoop scheme in two types of heterogeneous computing environments that are typical on the public cloud platforms. The simulation results showed that the proposed scheme could remarkably reduce the map phase completion time, and it could reduce the amount of remote processing employed to a more significant extent, which makes the data processing less vulnerable to both network congestion and disk contention.

Keywords: Cloud computing; Hadoop; MapReduce; Task assignment; Data-intensive computing; Parallel and distributed computing

Introduction

We have entered the era of Big Data. It was estimated that the total volume of digital data produced worldwide in 2011 was already around 1.8 zettabytes (one zettabyte equals one billion terabytes), compared to 0.18 zettabytes in 2006 [1]. Data is being generated at an explosive rate. Back in 2009, Facebook already hosted 2.5 petabytes of user data, growing at about 15 terabytes per day, and the trading system in the New York Stock Exchange generates around one terabyte of data every day. For many organizations, petabyte datasets have already become the norm, and the capability of data-intensive computing is a necessity instead of a luxury. Data-intensive computing lies at the core of a wide range of applications used across various industries, such as web indexing, data mining, scientific simulations, bioinformatics research, text/image processing, and business intelligence. In addition to large volume, Big Data also features high complexity, which makes the processing of data sets even more challenging. As a result, it is difficult to work with Big Data using most relational database management systems. The solution is parallel and distributed processing on a large number of machines.

* Correspondence: wdai@knights.ucf.edu

School of Electrical Engineering & Computer Science, University of Central Florida, 4000 Central Florida Blvd., Orlando, Florida 32816, USA

MapReduce [2] is a parallel and distributed programming model and also an associated implementation for processing huge volumes of data on a large cluster of commodity machines. Since it was proposed by Google in 2004, MapReduce has become the most popular technology that makes data-intensive computing possible for ordinary users, especially those that don't have any prior experience with parallel and distributed data processing. While Google owns its proprietary implementation of MapReduce, an open source implementation named Hadoop [3] has gained great popularity in the rest of the world. Hadoop is now being used at many organizations in various industries, including Amazon, Adobe, Facebook, IBM, Powerset/Microsoft, Twitter, and Yahoo! [4]. Many well-known IT companies have been either offering commercial Hadoop-related products or providing support for Hadoop, including Cloudera, IBM, Yahoo!, Google, Oracle, and Dell [5].

© 2013 Dai and Bassiouni; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Access to computer clusters of sufficient size is necessary for the parallel processing of large volumes of data. However, not every organization with data-intensive computing needs can afford or has the interest to purchase and maintain such computer clusters. The innovative concept of utility computing proposed a perfect solution to this problem, which eliminates both upfront hardware investment and periodical maintenance costs for cloud users. The combination of Hadoop and cloud computing has become an attractive and promising solution to parallel processing of terabyte and even petabyte datasets. A well-known feat of r



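An improved task assignment scheme for Hadoop running in the clouds

Wei Dai^* and Mostafa Bassiouni

Abstract

Nowadays, data-intensive problems are so prevalent that numerous organizations in various industries have to face them in their business operation. It is often crucial for enterprises to be able to analyze large volumes of data effectively and in a timely manner. MapReduce and its open-source implementation Hadoop dramatically simplified the development of parallel data-intensive computing applications for ordinary users, and the combination of Hadoop and cloud computing made large-scale parallel data-intensive computing more accessible to all potential users than ever before. Although Hadoop has become the most popular data management framework for parallel data-intensive computing in the clouds, the Hadoop scheduler is not a perfect match for cloud environments. In this paper, we discuss the issues with the Hadoop task assignment scheme and present an improved scheme for heterogeneous computing environments, such as the public clouds. The proposed scheme is based on an optimal minimum makespan algorithm. It projects and compares the completion times of the next data block of every task slot, and explicitly strives to shorten the completion time of the map phase of MapReduce jobs. We conducted extensive simulation to evaluate the performance of the proposed scheme compared with the Hadoop scheme in two types of heterogeneous computing environments that are typical on public cloud platforms. The simulation results showed that the proposed scheme could remarkably reduce the map phase completion time, and that it could reduce the amount of remote processing employed to an even greater extent, which makes the data processing less vulnerable to both network congestion and disk contention.

Keywords: Cloud computing; Hadoop; MapReduce; Task assignment; Data-intensive computing; Parallel and distributed computing

Introduction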

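We have entered the era of Big Data. It was estimated that the total volume of digital data produced worldwide in 2011 was already around 1.8 zettabytes (one zettabyte equals one billion terabytes), compared to 0.18 zettabytes in 2006 [1]. Data is being generated at an explosive rate. Back in 2009, Facebook already hosted 2.5 petabytes of user data, growing at about 15 terabytes per day, and the trading system in the New York Stock Exchange generates around one terabyte of data every day. For many organizations, petabyte datasets have already become the norm, and the capability of data-intensive computing is a necessity rather than a luxury. Data-intensive computing lies at the core of a wide range of applications used across various industries, such as web indexing, data mining, scientific simulations, bioinformatics research, text/image processing, and business intelligence. In addition to large volume, Big Data also features high complexity, which makes the processing of data sets even more challenging. As a result, it is difficult to work with Big Data using most relational database management systems. The solution is parallel and distributed processing on a large number of machines.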

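MapReduce [2] is a parallel and distributed programming model and also an associated implementation for processing huge volumes of data on a large cluster of commodity machines. Since it was proposed by Google in 2004, MapReduce has become the most popular technology that makes data-intensive computing possible for ordinary users, especially those without any prior experience in parallel and distributed data processing. While Google owns its proprietary implementation of MapReduce, an open-source implementation named Hadoop [3] has gained great popularity in the rest of the world. Hadoop is now being used at many organizations in various industries, including Amazon, Adobe, Facebook, IBM, Powerset/Microsoft, Twitter, and Yahoo! [4]. Many well-known IT companies have been either offering commercial Hadoop-related products or providing support for Hadoop, including Cloudera, IBM, Yahoo!, Google, Oracle, and Dell [5].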

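Access to computer clusters of sufficient size is necessary for the parallel processing of large volumes of data. However, not every organization with data-intensive computing needs can afford, or is interested in, purchasing and maintaining such clusters. The innovative concept of utility computing offers a solution to this problem, eliminating both upfront hardware investment and periodic maintenance costs for cloud users. The combination of Hadoop and cloud computing has become an attractive and promising solution for parallel processing of terabyte and even petabyte datasets. A well-known feat of running Hadoop in the cloud for data-intensive computing was accomplished by the New York Times, which used 100 virtual machines (VMs) on Amazon Elastic Compute Cloud (EC2) [6] to convert 4 terabytes of scanned archives from the paper into 11 million articles in PDF format in less than 24 hours [7].

Although Hadoop has become the most popular data management framework for processing large volumes of data in the clouds, the Hadoop scheduler has issues that can seriously degrade the performance of Hadoop running in the clouds. In this paper, we discuss the issues with the Hadoop task assignment scheme and present an improved scheme, which is based on an optimal algorithm for minimum makespan scheduling and explicitly strives to shorten the duration of the map phase of MapReduce jobs. We conducted extensive simulation to evaluate the performance of the proposed scheme. The simulation results indicated that the proposed scheme could significantly improve the performance of Hadoop in terms of both the completion time of the map phase and the amount of remote processing employed.

The rest of the paper is organized as follows. Background: MapReduce and Hadoop introduces the related background on MapReduce and Hadoop. Issues with Hadoop task assignment scheme discusses the issues with the Hadoop task assignment scheme in the context of cloud environments. Related mathematical models introduces the mathematical models on which our new scheme is based. ECT task assignment scheme presents the new scheme in detail. Evaluation shows the simulation setup and results. Related work is covered in Related work, and we conclude in Conclusion.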

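Background: MapReduce and Hadoop

In the programming model of MapReduce [2], the input of the computation is a set of key/value pairs, and the output is also a set of key/value pairs, usually in a different domain from the input. Users define a map function, which converts one input key/value pair into an arbitrary number of intermediate key/value pairs, and a reduce function, which merges all intermediate values of the same intermediate key into a smaller set of values, typically one value for each intermediate key. An example application of the programming model is counting the number of occurrences of each word in a large collection of documents. The input <key/value> pair to the map function is <name of a document in the collection / contents of that document>. The map function emits an intermediate key/value pair of <word/1> for each word in the document. The reduce function then sums all counts emitted for a particular word to obtain the total number of occurrences of that word.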
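The word-count example can be sketched in a few lines of Python. This is a single-process illustration of the programming model only, not Hadoop code; the names map_fn, reduce_fn, and run_word_count are hypothetical, and splitting on whitespace and lowercasing words are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit a <word, 1> pair for every word in one document."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)


def run_word_count(documents):
    """Run map, group values by intermediate key, then reduce.

    `documents` maps a document name to its contents. Everything runs in
    one process here; a real framework would distribute these steps.
    """
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, one in map_fn(name, contents):
            grouped[word].append(one)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())
```

Calling run_word_count with a dictionary of document names and contents returns a dictionary from each word to its total count.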

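Hadoop [3] is currently the most mature, accessible, and popular implementation of the MapReduce programming model. A Hadoop cluster adopts the master-slave architecture, where the master node is called the JobTracker and the multiple slave nodes are called TaskTrackers. Hadoop is usually supported by the Hadoop Distributed File System (HDFS), an open-source implementation of the Google File System (GFS). HDFS also adopts the master-slave architecture, where the NameNode (master) maintains the file namespace and directs client applications to the DataNodes (slaves) that actually store the data blocks. HDFS stores separate replicas (three by default) of each data block for both fault tolerance and performance improvement. In a large Hadoop cluster, each slave node serves as both a TaskTracker and a DataNode, and there are usually two dedicated master nodes serving as the JobTracker and the NameNode, respectively. In a small cluster, there may be only one dedicated master node that serves as both the JobTracker and the NameNode.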

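When launching a MapReduce job, Hadoop first splits the input file into fixed-size data blocks (64 MB by default) that are then stored in HDFS. The MapReduce job is divided into a certain number of map and reduce tasks, which can run on slave nodes in parallel. Each map task processes one data block of the input file and outputs the intermediate key/value pairs generated by the user-defined map function. The output of a map task is first written to a memory buffer, and then written to a spill file on the local disk when the data in the buffer reaches a certain threshold. All spill files generated by one map task are eventually merged into a single partitioned and sorted intermediate file on the local disk of the map task. Each partition in this intermediate file is processed by a different reduce task, and is copied by that reduce task as soon as the partition becomes available. Running in parallel, reduce tasks then apply the user-defined reduce function to the intermediate key/value pairs associated with each intermediate key, and generate the final output of the MapReduce job.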
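As a rough illustration of the arithmetic behind this paragraph, the sketch below computes how many map tasks a job needs when each map task handles one 64 MB block, and assigns an intermediate key to a partition. The function names are hypothetical, and the byte-sum-modulo partitioning rule is an assumption chosen only because it is deterministic and easy to check by hand; the text does not specify how keys are mapped to partitions.

```python
import math

BLOCK_SIZE_MB = 64  # default block size stated in the text


def num_map_tasks(input_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """One map task per data block; the last block may be only partly full."""
    return math.ceil(input_size_mb / block_size_mb)


def partition_for(key, num_reduce_tasks):
    """Choose the partition (and thus the reduce task) for an intermediate key.

    Assumption for illustration: sum of the key's UTF-8 bytes, modulo the
    number of reduce tasks. Any deterministic rule that sends equal keys to
    the same partition would serve the same purpose.
    """
    return sum(key.encode("utf-8")) % num_reduce_tasks
```

For example, a 1000 MB input file yields ceil(1000 / 64) = 16 map tasks, the last of which processes a partly filled block.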


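In a Hadoop cluster, the JobTracker is the job submission node, where a client application submits the MapReduce job to be executed. The JobTracker organizes the whole execution process of the MapReduce job and coordinates the running of all map and reduce tasks. TaskTrackers are the worker nodes that actually execute all the map and reduce tasks. Each TaskTracker has a configurable number of task slots for task assignment (by default, two slots for map tasks and two for reduce tasks), so that the resources of a TaskTracker node can be fully utilized. The JobTracker is responsible for both job scheduling, i.e., how to schedule concurrent jobs from multiple users, and task assignment, i.e., how to assign tasks to all TaskTrackers. In this paper, we only address the problem of map task assignment. The map task assignment scheme of Hadoop adopts a heartbeat protocol. Every few seconds, each TaskTracker sends a heartbeat message to the JobTracker to inform the latter that it is functioning properly and whether it has an empty task slot. If a TaskTracker has an empty slot, the acknowledgment message from the JobTracker contains information on the assignment of a new input data block.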

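To reduce the overhead of data transfer across the network, the JobTracker attempts to enforce data locality when it performs task assignment. When a TaskTracker is available for task assignment, the JobTracker first tries to find an unprocessed data block located on the local disk of the TaskTracker. If it cannot find a local data block, the JobTracker then tries to find a data block located on a node in the same rack as the TaskTracker. If it still cannot find a rack-local block, the JobTracker finally picks an unprocessed block that is as close to the TaskTracker as possible, based on the topology information of the cluster.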
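The locality preference described above (node-local first, then rack-local, then anything else) can be sketched as follows. This is a simplified illustration rather than the JobTracker's actual code: the function name pick_block and the replicas and rack_of data structures are hypothetical, and the final fallback simply takes the first remaining block instead of measuring topological distance.

```python
def pick_block(tracker, unprocessed, replicas, rack_of):
    """Return the unprocessed block closest to `tracker`, or None.

    unprocessed: ordered list of block ids still waiting for a map task
    replicas:    dict block id -> set of node names holding a replica
    rack_of:     dict node name -> rack name
    """
    rack_local = None
    remote = None
    for block in unprocessed:
        holders = replicas[block]
        if tracker in holders:
            return block  # node-local: best case, stop searching
        same_rack = any(rack_of[node] == rack_of[tracker] for node in holders)
        if same_rack and rack_local is None:
            rack_local = block  # remember the first rack-local candidate
        elif not same_rack and remote is None:
            remote = block  # remember the first off-rack candidate
    # Prefer a rack-local block; fall back to a remote one (simplification:
    # no distance ranking among remote blocks).
    return rack_local if rack_local is not None else remote
```

Keeping the two fallback candidates while scanning means the list is traversed only once, even when no node-local block exists.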

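While a map task has only one phase, a reduce task consists of three: the copy, sort, and reduce phases. In the copy phase, the reduce task copies the intermediate data generated by map tasks. Each reduce task is usually responsible for processing the intermediate data associated with many intermediate keys. Therefore, in the sort phase, the reduce task needs to sort all the copied intermediate data by intermediate key. In the reduce phase, the reduce task applies the user-defined reduce function to the intermediate data associated with each intermediate key, and stores the output in a final output file. The output files are kept in HDFS, and each reduce task generates exactly one output file.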
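The three phases of a reduce task can be illustrated with a small in-memory sketch. The function name run_reduce_task and the representation of map outputs as dictionaries from partition number to a list of pairs are assumptions made for this example; in Hadoop the copy phase moves data over the network and the sort phase works on files.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(map_outputs, partition, reduce_fn):
    """Simulate the copy, sort, and reduce phases for one reduce task.

    map_outputs: list with one dict per map task, each mapping a partition
                 number to a list of (key, value) pairs
    partition:   the partition this reduce task is responsible for
    reduce_fn:   user-defined function taking (key, list_of_values)
    """
    # Copy phase: gather this task's partition from every map output.
    copied = [pair for output in map_outputs
              for pair in output.get(partition, [])]
    # Sort phase: order all copied pairs by intermediate key.
    copied.sort(key=itemgetter(0))
    # Reduce phase: apply the user function once per intermediate key.
    return [(key, reduce_fn(key, [value for _, value in group]))
            for key, group in groupby(copied, key=itemgetter(0))]
```

Sorting before grouping matters here because itertools.groupby only merges adjacent items with equal keys.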

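Hadoop fault tolerance mechanism

When Hadoop runs at a large scale, failures are almost inevitable. Hadoop is therefore designed as a fault-tolerant framework that can handle various failures with minimal impact on the quality of service. There are three different failure modes: task failure, TaskTracker failure, and JobTracker failure. When a TaskTracker detects a task failure, it marks the task attempt as failed, frees the task slot on which the task was running, and notifies the JobTracker of the failure in its heartbeat message. The JobTracker then tries to reschedule the execution of that task on a different TaskTracker. If any task fails a configurable number of times (four by default), the whole job fails, which usually means that the user code is buggy. A TaskTracker failure occurs when the JobTracker has not received any heartbeat message from a certain TaskTracker for a configurable period of time (10 minutes by default). TaskTracker failure is a more serious failure mode than task failure, because the intermediate output of all map tasks that previously ran and finished on the failed TaskTracker becomes inaccessible. In this case, the JobTracker reruns all those completed map tasks and reschedules any tasks in progress on other TaskTrackers. JobTracker failure is the most serious failure mode, but it is unlikely to occur because the chance that one particular machine fails is low. In the case of a JobTracker failure, Hadoop provides a configuration option that attempts to recover all jobs that were running at the time of the failure.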
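The two thresholds mentioned above (four failed attempts per task, ten minutes without a heartbeat) lend themselves to a short sketch of the bookkeeping a master node might perform. All function names and data structures here are hypothetical and serve only to make the rules concrete; they are not Hadoop APIs.

```python
MAX_TASK_ATTEMPTS = 4          # default failure limit per task, per the text
TRACKER_TIMEOUT_SECONDS = 600  # 10 minutes without a heartbeat, per the text


def job_has_failed(failed_attempts):
    """failed_attempts maps a task id to its number of failed attempts."""
    return any(count >= MAX_TASK_ATTEMPTS for count in failed_attempts.values())


def find_lost_trackers(last_heartbeat, now):
    """Return trackers whose most recent heartbeat is older than the timeout.

    last_heartbeat maps a tracker name to a timestamp in seconds.
    """
    return sorted(tracker for tracker, seen in last_heartbeat.items()
                  if now - seen > TRACKER_TIMEOUT_SECONDS)


def tasks_to_reschedule(lost_tracker, completed_maps, running_tasks):
    """List the tasks that must run again elsewhere after a tracker is lost.

    Completed map tasks are included because their intermediate output was
    stored on the lost tracker's local disk and is no longer reachable.
    """
    return (completed_maps.get(lost_tracker, [])
            + running_tasks.get(lost_tracker, []))
```

Including completed map tasks in the reschedule list is what makes a TaskTracker failure costlier than a single task failure.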

A more detailed discussion of various aspects of MapReduce and Hadoop can be found in [8] and [9].

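Issues with Hadoop task assignment scheme

Both MapReduce and Hadoop were originally designed for computer clusters rather than computer clouds. Clusters are mostly homogeneous computing environments, where homogeneous nodes run under similar load conditions, and tasks of the same type tend to start at roughly the same time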
