大数据： Hadoop reduce阶段

Mapreduce中由于sort的存在，MapTask和ReduceTask直接是工作流的架构。而不是数据流的架构。在MapTask尚未结束，其输出结果尚未排序及合并前，ReduceTask是又有数据输入的，因此即使ReduceTask已经创建也只能睡眠等待MapTask完成。从而可以从MapTask节点获取数据。一个MapTask最终的数据输出是一个合并的spill文件，可以通过Web地址访问。所以reduceTask一般在MapTask快要完成的时候才启动。启动早了浪费container资源。

ReduceTask是个线程，这个线程运行在YarnChild的Java虚拟机上，我们从ReduceTask.run开始看Reduce阶段。获取更多大数据视频资料请加QQ群：947967114

public void run(JobConf job, final TaskUmbilicalProtocol umbilical)

throws IOException, InterruptedException, ClassNotFoundException {

job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());

if (isMapOrReduce()) {

/添加reduce过程需要经过的几个阶段。以便通知TaskTracker目前运行的情况/

copyPhase = getProgress().addPhase("copy");

sortPhase = getProgress().addPhase("sort");

reducePhase = getProgress().addPhase("reduce");

}

// start thread that will handle communication with parent

TaskReporter reporter = startReporter(umbilical);

// 设置并启动reporter进程以便和TaskTracker进行交流

boolean useNewApi = job.getUseNewReducer();

//在job client中初始化job时，默认就是用新的API，详见Job.setUseNewAPI()方法

initialize(job, getJobID(), reporter, useNewApi);

/用来初始化任务，主要是进行一些和任务输出相关的设置，比如创建commiter，设置工作目录等/

// check if it is a cleanupJobTask

/以下4个if语句均是根据任务类型的不同进行相应的操作，这些方法均是Task类的方法，所以与任务是MapTask还是ReduceTask无关/

if (jobCleanup) {

runJobCleanupTask(umbilical, reporter);

return;//只是为了JobCleanup，做完就停

}

if () {

runJobSetupTask(umbilical, reporter);

return;

//主要是创建工作目录的FileSystem对象

}

if (taskCleanup) {

runTaskCleanupTask(umbilical, reporter);

return;

//设置任务目前所处的阶段为结束阶段，并且删除工作目录

}

下面才是真正要成为reducer

// Initialize the codec

codec = initCodec();

RawKeyValueIterator rIter = null;

ShuffleConsumerPlugin shuffleConsumerPlugin = null;

Class combinerClass = conf.getCombinerClass();

CombineOutputCollector combineCollector =

(null != combinerClass) ?

new CombineOutputCollector(reduceCombineOutputCounter, reporter, conf) : null;

//如果需要就创建combineCollector

Classextends ShuffleConsumerPlugin> clazz =

job.getClass(MRConfig.SHUFFLE_CONSUMER_PLUGIN, Shuffle.class, ShuffleConsumerPlugin.class);

//配置文件找mapreduce.job.reduce.shuffle.consumer.plugin.class默认是shuffle.class

shuffleConsumerPlugin = ReflectionUtils.newInstance(clazz, job);

//创建shuffle类对象

LOG.info("Using ShuffleConsumerPlugin: " + shuffleConsumerPlugin);

ShuffleConsumerPlugin.Context shuffleContext =

new ShuffleConsumerPlugin.Context(getTaskID(), job, FileSystem.getLocal(job), umbilical,

super.lDirAlloc, reporter, codec,

combinerClass, combineCollector,

spilledRecordsCounter, reduceCombineInputCounter,

shuffledMapsCounter,

reduceShuffleBytes, failedShuffleCounter,

mergedMapOutputsCounter,

taskStatus, copyPhase, sortPhase, this,

mapOutputFile, localMapFiles);

//创建context对象，ShuffleConsumerPlugin.Context

shuffleConsumerPlugin.init(shuffleContext);

//这里调用的起始是shuffle的init函数，重点摘要如下。

this.localMapFiles = context.getLocalMapFiles();

scheduler = new ShuffleSchedulerImpl(jobConf, taskStatus, reduceId,

this, copyPhase, context.getShuffledMapsCounter(),

context.getReduceShuffleBytes(), context.getFailedShuffleCounter());

//创建shuffle所需的调度器

merger = createMergeManager(context);

//创建shuffle内部的merge，createMergeManager里面源码：

return new MergeManagerImpl(reduceId, jobConf, context.getLocalFS(),

context.getLocalDirAllocator(), reporter, context.getCodec(),

context.getCombinerClass(), context.getCombineCollector(),

context.getSpilledRecordsCounter(),

context.getReduceCombineInputCounter(),

context.getMergedMapOutputsCounter(), this, context.getMergePhase(),

context.getMapOutputFile());

//创建MergeMnagerImpl对象和Merge线程

rIter = shuffleConsumerPlugin.run();

//从各个Mapper复制其输出文件，并加以合并排序，等待直到完成为止

// free up the data structures

mapOutputFilesOnDisk.clear();

sortPhase.complete();

//排序阶段完成

setPhase(TaskStatus.Phase.REDUCE);

//进入reduce阶段

statusUpdate(umbilical);

Class keyClass = job.getMapOutputKeyClass();

Class valueClass = job.getMapOutputValueClass();

RawComparator comparator = job.getOutputValueGroupingComparator();

//3.Reduce 1.Reduce任务的最后一个阶段。它会准备好Map的 keyClass（"mapred.output.key.class""mapred.mapoutput.key.class"）,valueClass("mapred.mapoutput.value.class"或"mapred.output.value.class")和 Comparator （“mapred.output.value.groupfn.class”或“mapred.output.key.comparator.class”）

if (useNewApi) {

//2.根据参数useNewAPI判断执行runNewReduce还是runOldReduce。分析润runNewReduce

runNewReducer(job, umbilical, reporter, rIter, comparator,

keyClass, valueClass);

//0.像报告进程书写一些信息，1.获得一个TaskAttemptContext对象。通过这个对象创建reduce、output及用于跟踪的统计output的RecordWrit、最后创建用于收集reduce结果的Context，2.reducer.run(reducerContext)开始执行reduce

} else {//老API

runOldReducer(job, umbilical, reporter, rIter, comparator,

keyClass, valueClass);

}

shuffleConsumerPlugin.close();

done(umbilical, reporter);

}

(1)reduce分为三个阶段(copy就是远程拷贝Map的输出数据、sort就是对所有的数据做排序、reduce做聚集就是我们自己写的reducer)，为这三个阶段分别设置Progress，用来和TaskTracker通信报道状态。

(2)上面代码的15-40行和MapReduce的MapTask任务的运行源码级分析中对应部分基本相同，可参考之；

(3)codec = initCodec()这句是检查map的输出是否是压缩的，压缩的则返回压缩codec实例，否则返回null，这里讨论不压缩的；

(4)我们讨论完全分布式的hadoop，即isLocal==false，然后构造一个ReduceCopier对象reduceCopier，并调用reduceCopier.fetchOutputs()方法拷贝各个Mapper的输出，到本地；

(5)然后copy阶段完成，设置接下来的阶段是sort阶段，更新状态信息；

(6)根据isLocal来选择KV迭代器，完全分布式的会使用reduceCopier.createKVIterator(job, rfs, reporter)作为KV迭代器；

(7)sort阶段完成，设置接下来的阶段是reduce阶段，更新状态信息；

(8)然后获取一些配置信息，并根据是否使用新API选择不同的处理方式，这里是新的API，调用runNewReducer(job, umbilical, reporter, rIter, comparator, keyClass, valueClass)会执行reducer；

(9)done(umbilical, reporter)这个方法用于做结束任务的一些清理工作：更新计数器updateCounters()；如果任务需要提交，设置Taks状态为COMMIT_PENDING，并利用TaskUmbilicalProtocol，汇报Task完成，等待提交，然后调用commit提交任务；设置任务结束标志位；结束Reporter通信线程；发送最后一次统计报告(通过sendLastUpdate方法)；利用TaskUmbilicalProtocol报告结束状态（通过sendDone方法)。

有些人将Reduce Task分为了5个阶段：一、shuffle阶段：也称为Copy阶段，就是从各个MapTask上远程拷贝一片数据，如果大小超过一定阈值就写到磁盘，否则放入内存；二、Merge阶段：在远程拷贝数据的同时，Reduce Task启动了两个后台线程对内存和磁盘上的文件进行合并，防止内存使用过多和磁盘文件过多；三、sort阶段：用户编写的reduce方法的输入数据是按key进行聚集的，需要对copy过来的数据排序，这里用的是归并排序，因为Map Task的结果是有序的；四、Reduce阶段：将每组数据依次交给用户编写的Reduce方法处理；五、write阶段：就是将结果写入HDFS。

上面的5个阶段分的比较细了，代码里分为3个阶段copy、sort、reduce，我们在eclipse运行MR程序时，控制台看到的reduce阶段的百分比就分为3个阶段各占33.3%。

这里的shuffleConsumerPlugin是实现了ShuffleConsumerPlugin的某个类对象。具体可以通过配置文件mapreduce.job.reduce.shuffle.consumer.plugin.class选项设置，默认情况下是使用shuffle。我们在代码中分析过完成shuffleConsumerPlugin.run，通常是shuffle.run，因为有了这个过程Mapper的合成的spill文件才能通过HTTP协议传输到Reducer端。有了数据才能进行runNewReducer或者runOldReducer。可以说shuffle对象就是MapTask的搬运工。而且shuffle的搬运方式不是一遍搬运一遍Reducer处理，而是要把MapTask所有的数据都搬运过来，并且进行合并排序之后才开始提供给对应的Reducer。

一般而言，MapTask和ReduceTask是多对多的关系，假如有M个Mapper有N个Reducer。我们知道N个Reducer对应着N个partition，所以每个Mapper都会被划分成N个Partition，每个Reducer承担着一个Partition部分的操作。这样每一个Reducer从每个不同的Mapper内拿来属于自己的那部分数据，这样每个Reducer就有M份不同Mapper的数据，把M份数据合并在一起就是一个最终完整的Partition，有必要还会进行排序，这时候才成为了Reducer的具体输入数据。这个数据搬运和重组的过程被叫做shuffle过程。shuffle这个过程开销颇大，会占用较大的网络流量，因为涉及到大量数据的传输，shuffle过程也会有延迟，因为M个Mapper的计算有快有慢，但是shuffle要所有的Mapper完成才能开始，Reduce又必须等shuffle完成才能开始，当然这种延迟不是shuffle造成的，如果Reducer不需要全部Partition数据到位并排序，就不用与最慢的Mapper同步，这是排序付出的代价。

所以shuffle在MapReduce框架中起着非常重要的作用。我们先看shuffle的摘要：获取更多大数据视频资料请加QQ群：947967114

public class Shuffle implements ShuffleConsumerPlugin, ExceptionReporter

private ShuffleConsumerPlugin.Context context;

private TaskAttemptID reduceId;

private JobConf jobConf;

private TaskUmbilicalProtocol umbilical;

private ShuffleSchedulerImpl scheduler;

private MergeManager merger;

private Task reduceTask; //Used for status updates

private Map localMapFiles;

public void init(ShuffleConsumerPlugin.Context context)

public RawKeyValueIterator run() throws IOException, InterruptedException

在ReduceTask.run中看到调用了shuffle.init，在run理创建了ShuffleSchedulerImpl和MergeManagerImpl对象。后面会讲解就是是做什么用的。

之后就是对shuffle.run的调用，shuffle虽然有一个run但是并非是一个线程，只是用了这个名字而已。

我们看：ReduceTask.run->Shuffle.run

public RawKeyValueIterator run() throws IOException, InterruptedException {

int eventsPerReducer = Math.max(MIN_EVENTS_TO_FETCH,

MAX_RPC_OUTSTANDING_EVENTS / jobConf.getNumReduceTasks());

int maxEventsToFetch = Math.min(MAX_EVENTS_TO_FETCH, eventsPerReducer);

// Start the map-completion events fetcher thread

final EventFetcher eventFetcher =

new EventFetcher(reduceId, umbilical, scheduler, this,

maxEventsToFetch);

eventFetcher.start();

//通过查看EventFetcher我们看到他继承了Thread，所以他是一个线程

// Start the map-output fetcher threads

boolean isLocal = localMapFiles != null;

final int numFetchers = isLocal ? 1 :

jobConf.getInt(MRJobConfig.SHUFFLE_PARALLEL_COPIES, 5);

Fetcher[] fetchers = new Fetcher[numFetchers];

//创建了一个线程池

if (isLocal) {

//如果Mapper和Reducer在同一台机器上，就在本地fetche

fetchers[0] = new LocalFetcher(jobConf, reduceId, scheduler,

merger, reporter, metrics, this, reduceTask.getShuffleSecret(),

localMapFiles);

//LocalFetcher是对Fetcher的扩展，也是线程。

fetchers[0].start();//本地Fecher只有一个

} else {

//Mapper集合Reducer不在同一个机器上，需要跨多个节点Fecher

for (int i=0; i < numFetchers; ++i) {

//启动所有的Fecher

fetchers[i] = new Fetcher(jobConf, reduceId, scheduler, merger,

reporter, metrics, this,

reduceTask.getShuffleSecret());

//创建Fecher线程

fetchers[i].start();

//跨节点的Fecher需要好多个，都需要开启

}

// Wait for shuffle to complete successfully

while (!scheduler.waitUntilDone(PROGRESS_FREQUENCY)) {

reporter.progress();

//等待所有的Fecher都完成，如果有超时情况就报告进度

synchronized (this) {

if (throwable != null) {

throw new ShuffleError("error in shuffle in " + throwingThreadName,

throwable);

}

// Stop the event-fetcher thread

eventFetcher.shutDown();

//关闭eventFetcher，代表shuffle操作完成，所有的MapTask的数据都拷贝过来了

// Stop the map-output fetcher threads

for (Fetcher fetcher : fetchers) {

fetcher.shutDown();//关闭所有的fetcher。

}

// stop the scheduler

scheduler.close();

//也不需要shuffle的调度，所以关闭

copyPhase.complete(); // copy is already complete

//文件复制阶段结束

以下就是Reduce阶段的MergeSort了

taskStatus.setPhase(TaskStatus.Phase.SORT);

//完成排序

reduceTask.statusUpdate(umbilical);

//通过umbilical向MRAppMaster汇报，更新状态

// Finish the on-going merges…

RawKeyValueIterator kvIter = null;

try {

kvIter = merger.close();

//合并和排序，完成后返回一个队列kvIter 。

} catch (Throwable e) {

throw new ShuffleError("Error while doing final merge " , e);

}

// Sanity check

synchronized (this) {

if (throwable != null) {

throw new ShuffleError("error in shuffle in " + throwingThreadName,

throwable);

}

return kvIter;

}

数据从MapTask转移到ReduceTask就两种方式，一MapTask送，二ReduceTask取，hadoop采用的是第二种方式，就是文件的复制。在Shuffle进入run之前，RduceTask.run调用过他的init函数shuffleConsumerPlugin.init(shuffleContext)，在init里创建了scheduler和用于合并排序的merge，进入run后又创建了EventFetcher线程和若干个Fetcher线程。Fetcher的作用就是拿取，向MapTask节点提取数据。但是我们要清楚EventFetcher虽然也是Fetcher，但是提取的是event，不是数据本身。我们可以认为它只是对Fetcher过程的一个事件的控制。

Fetcher线程的数量也不一定，Uber模式下，MapTask和ReduceTask在同一个节点上，并且只有一个MapTask，所以只有一个Fetcher就能够完成，而且这个Fetcher是localFetcher。如果不是Uber模式可能会有很多MapTask并且一般和ReduceTask不在同一个节点。这时Fetcher的数量可以进行配置，默认有5个。数组fetchers就相当于Fetcher的线程池。

创建了EventFetcher和Fetcher线程池后，进入了while循环，但是while循环什么都不做，一直等待，所以实际的操作都是在线程完成的，也就是通过EventFetcher和若干的Fetcher完成。EventFetcher起到了非常关键的枢纽的作用。

我们查看EventFetcher的源代码摘要，我们提取关键的东西：

class EventFetcher extends Thread {

private final TaskAttemptID reduce;

private final TaskUmbilicalProtocol umbilical;

private final ShuffleScheduler scheduler;

private final int maxEventsToFetch;

public void run() {

int failures = 0;

LOG.info(reduce + " Thread started: " + getName());

try {

while (!stopped && !Thread.currentThread().isInterrupted()) {//线程没有被打断

try {

int numNewMaps = getMapCompletionEvents();

//获取Map的完成的事件，接着我们看getMapCompletionEvents源代码：

protected int getMapCompletionEvents()

throws IOException, InterruptedException {

int numNewMaps = 0;

TaskCompletionEvent events[] = null;

do {

MapTaskCompletionEventsUpdate update =

umbilical.getMapCompletionEvents(

(org.apache.hadoop.mapred.JobID)reduce.getJobID(),

fromEventIdx,

maxEventsToFetch,

(org.apache.hadoop.mapred.TaskAttemptID)reduce);

//汇报umbilical从MRAppMaster获取Map完成的时间的报告

events = update.getMapTaskCompletionEvents();

//获取有关具体的MapTask结束运行的情况

LOG.debug("Got " + events.length + " map completion events from " +

fromEventIdx);

assert !update.shouldReset() : "Unexpected legacy state";

//做了一个断言获取更多大数据视频资料请加QQ群：947967114

// Update the last seen event ID

fromEventIdx += events.length;

// Process the TaskCompletionEvents:

// 1. Save the SUCCEEDED maps in knownOutputs to fetch the outputs.

// 2. Save the OBSOLETE/FAILED/KILLED maps in obsoleteOutputs to stop

// fetching from those maps.

// 3. Remove TIPFAILED maps from neededOutputs since we don’t need their

// outputs at all.

for (TaskCompletionEvent event : events) {

//对于获取的每个事件的报告

scheduler.resolve(event);

//这里使用了ShuffleSchedullerImpl.resolve函数，源代码如下：

public void resolve(TaskCompletionEvent event) {

switch (event.getTaskStatus()) {

case SUCCEEDED://如果成功

URI u = getBaseURI(reduceId, event.getTaskTrackerHttp());//获取其URI

addKnownMapOutput(u.getHost() + ":" + u.getPort(),

u.toString(),

event.getTaskAttemptId());

//记录这个MapTask的节点主机记录下来，供Fetcher使用，getBaseURI的源代码：

static URI getBaseURI(TaskAttemptID reduceId, String url) {

StringBuffer baseUrl = new StringBuffer(url);

if (!url.endsWith("/")) {

baseUrl.append("/");

}

baseUrl.append("mapOutput?job=");

baseUrl.append(reduceId.getJobID());

baseUrl.append("&reduce=");

baseUrl.append(reduceId.getTaskID().getId());

baseUrl.append("&map=");

URI u = URI.create(baseUrl.toString());

return u;

获取各种信息，然后添加都URI对象中。

}

回到源代码

maxMapRuntime = Math.max(maxMapRuntime, event.getTaskRunTime());

//最大的尝试时间

break;

case FAILED:

case KILLED:

case OBSOLETE://如果MapTask运行失败

obsoleteMapOutput(event.getTaskAttemptId());//获取TaskId

LOG.info("Ignoring obsolete output of " + event.getTaskStatus() +

" map-task: ‘" + event.getTaskAttemptId() + "’");//写日志

break;

case TIPFAILED://如果失败

tipFailed(event.getTaskAttemptId().getTaskID());

LOG.info("Ignoring output of failed map TIP: ‘" +

event.getTaskAttemptId() + "’");//写日志

break;

}

回到源代码

if (TaskCompletionEvent.Status.SUCCEEDED == event.getTaskStatus()) {//如果事件成功

++numNewMaps;//增加map数量

}

} while (events.length == maxEventsToFetch);

return numNewMaps;

}

回到源代码

failures = 0;

if (numNewMaps > 0) {

LOG.info(reduce + ": " + "Got " + numNewMaps + " new map-outputs");

}

LOG.debug("GetMapEventsThread about to sleep for " + SLEEP_TIME);

if (!Thread.currentThread().isInterrupted()) {

Thread.sleep(SLEEP_TIME);

}

} catch (InterruptedException e) {

LOG.info("EventFetcher is interrupted.. Returning");

return;

} catch (IOException ie) {

LOG.info("Exception in getting events", ie);

// check to see whether to abort

if (++failures >= MAX_RETRIES) {

throw new IOException("too many failures downloading events", ie);//失败数量大于重试的数量

}

// sleep for a bit

if (!Thread.currentThread().isInterrupted()) {

Thread.sleep(RETRY_PERIOD);

}

} catch (InterruptedException e) {

return;

} catch (Throwable t) {

exceptionReporter.reportException(t);

return;

}

MapTask和ReduceTask没有直接的关系，MapTask不知道ReduceTask在哪些节点上，它只是把进度的时间报告给MRAppMaster。ReduceTask通过“脐带”执行getMapCompletionEvents操作想MRAppMaster获取MapTask结束运行的时间报告。有个别的MapTask可能会失败，但是绝大多数都会成功，只要成功的就通过Fetcher去索取输出数据，这个信息就是通过shcheduler完成的也就是ShuffleSchedulerImpl对象，ShuffleSchedulerImpl对象并不多，只是个普通的对象。

fetchers就像线程池，里面有若干线程（默认有5个），这些线程等待EventFetcher的通知，一旦有MapTask完成就前往提取数据。

获取更多大数据视频资料请加QQ群：947967114
大数据： Hadoop reduce阶段

我们看Fetcher线程类的run方法：

public void run() {

try {

while (!stopped && !Thread.currentThread().isInterrupted()) {

MapHost host = null;

try {

// If merge is on, block

merger.waitForResource();

// Get a host to shuffle from

host = scheduler.getHost();

//从scheduler获取一个已经成功完成的MapTask的节点。

metrics.threadBusy();

//线程变成繁忙状态

// Shuffle

copyFromHost(host);

//开始复制这个节点的数据

} finally {

if (host != null) {//maphost还有运行中的

scheduler.freeHost(host);

//状态设置成空闲状态，等待其完成。

metrics.threadFree();

}

} catch (InterruptedException ie) {

return;

} catch (Throwable t) {

exceptionReporter.reportException(t);

}

这里的重点是copyFromHost获取数据的函数。

protected void copyFromHost(MapHost host) throws IOException {

// reset retryStartTime for a new host

//这是在ReduceTask的节点上运行的

retryStartTime = 0;

// Get completed maps on ‘host’

List<TaskAttemptID> maps = scheduler.getMapsForHost(host);

//获取目标节点上的MapTask集合。

// Sanity check to catch hosts with only ‘OBSOLETE’ maps,

// especially at the tail of large jobs

if (maps.size() == 0) {

return;//没有完成的直接返回

}

if(LOG.isDebugEnabled()) {

LOG.debug("Fetcher " + id + " going to fetch from " + host + " for: "

maps);

}

// List of maps to be fetched yet

Set remaining = new HashSet(maps);

//已经完成、等待shuffle的MapTask集合。

// Construct the url and connect

DataInputStream input = null;

URL url = getMapOutputURL(host, maps);

//生成MapTask所在节点的URL，下面要看getMapOutputURL源码：

private URL getMapOutputURL(MapHost host, Collection maps

) throws MalformedURLException {

// Get the base url

StringBuffer url = new StringBuffer(host.getBaseUrl());

boolean first = true;

for (TaskAttemptID mapId : maps) {

if (!first) {

url.append(",");

}

url.append(mapId);//在URL后面加上mapid

first = false;

}

LOG.debug("MapOutput URL for " + host + " -> " + url.toString());

//写日志

return new URL(url.toString());

//返回URL

}

回到主代码：

try {

setupConnectionsWithRetry(host, remaining, url);

//和对方主机建立HTTP连接，setupConnectionsWithRetry使用了openConnectionWithRetry函数打开链接。

openConnectionWithRetry(host, remaining, url);

这段源代码有使用了openConnection(url);方式，继续查看。

如下是链接的主要过程：

protected synchronized void openConnection(URL url)

throws IOException {

HttpURLConnection conn = (HttpURLConnection) url.openConnection();

//使用的是HTTPURL进行连接

if (sslShuffle) {//如果是有信任证书的

HttpsURLConnection httpsConn = (HttpsURLConnection) conn;

//强转conn类型

try {

httpsConn.setSSLSocketFactory(sslFactory.createSSLSocketFactory());//添加一个证书socket的工厂

} catch (GeneralSecurityException ex) {

throw new IOException(ex);

}

httpsConn.setHostnameVerifier(sslFactory.getHostnameVerifier());

}

connection = conn;

}

在setupConnectionsWithRetry中继续写到：

setupShuffleConnection(encHash);

//建立了Shuffle链接

connect(connection, connectionTimeout);

// verify that the thread wasn’t stopped during calls to connect

if (stopped) {

return;

}

verifyConnection(url, msgToEncode, encHash);

}

//至此连接通过。

if (stopped) {

abortConnect(host, remaining);

//这里边是关闭连接，可以点进去看一下，满足列表和等待的两个条件

return;

}

} catch (IOException ie) {

boolean connectExcpt = ie instanceof ConnectException;

ioErrs.increment(1);

LOG.warn("Failed to connect to " + host + " with " + remaining.size() +

" map outputs", ie);

回到主代码

input = new DataInputStream(connection.getInputStream());

//实例一个输入流对象。

try {

// Loop through available map-outputs and fetch them

// On any error, faildTasks is not null and we exit

// after putting back the remaining maps to the

// yet_to_be_fetched list and marking the failed tasks.

TaskAttemptID[] failedTasks = null;

while (!remaining.isEmpty() && failedTasks == null) {

//如果需要fetcher的列表不空，并且失败的task数量没有

try {

failedTasks = copyMapOutput(host, input, remaining, fetchRetryEnabled);

//复制数据出来copyMapOutput的源代码如下：

try {

ShuffleHeader header = new ShuffleHeader();

header.readFields(input);

mapId = TaskAttemptID.forName(header.mapId);

//获取mapID

compressedLength = header.compressedLength;

decompressedLength = header.uncompressedLength;

forReduce = header.forReduce;

} catch (IllegalArgumentException e) {

badIdErrs.increment(1);

LOG.warn("Invalid map id ", e);

//Don’t know which one was bad, so consider all of them as bad

return remaining.toArray(new TaskAttemptID[remaining.size()]);

}

InputStream is = input;

is = CryptoUtils.wrapIfNecessary(jobConf, is, compressedLength);

compressedLength -= CryptoUtils.cryptoPadding(jobConf);

decompressedLength -= CryptoUtils.cryptoPadding(jobConf);

//如果需要解压或解密

// Do some basic sanity verification

if (!verifySanity(compressedLength, decompressedLength, forReduce,

remaining, mapId)) {

return new TaskAttemptID[] {mapId};

}

if(LOG.isDebugEnabled()) {

LOG.debug("header: " + mapId + ", len: " + compressedLength +

", decomp len: " + decompressedLength);

}

try {

mapOutput = merger.reserve(mapId, decompressedLength, id);

//为merge预留一个MapOutput：是内存还是磁盘上。

} catch (IOException ioe) {

// kill this reduce attempt

ioErrs.increment(1);

scheduler.reportLocalError(ioe);

//报告错误

return EMPTY_ATTEMPT_ID_ARRAY;

}

// Check if we can shuffle now …

if (mapOutput == null) {

LOG.info("fetcher#" + id + " – MergeManager returned status WAIT …");

//Not an error but wait to process data.

return EMPTY_ATTEMPT_ID_ARRAY;

}

// The codec for lz0,lz4,snappy,bz2,etc. throw java.lang.InternalError

// on decompression failures. Catching and re-throwing as IOException

// to allow fetch failure logic to be processed

try {

// Go!

LOG.info("fetcher#" + id + " about to shuffle output of map "

mapOutput.getMapId() + " decomp: " + decompressedLength
" len: " + compressedLength + " to " + mapOutput.getDescription());

mapOutput.shuffle(host, is, compressedLength, decompressedLength,

metrics, reporter);

//跨节点把Mapper的文件内容拷贝到reduce的内存或者磁盘上。

} catch (java.lang.InternalError e) {

LOG.warn("Failed to shuffle for fetcher#"+id, e);

throw new IOException(e);

}

// Inform the shuffle scheduler

long endTime = Time.monotonicNow();

// Reset retryStartTime as map task make progress if retried before.

retryStartTime = 0;

scheduler.copySucceeded(mapId, host, compressedLength,

startTime, endTime, mapOutput);

//告诉调度器完成了一个节点的Map输出的文件拷贝。

remaining.remove(mapId);

//这个MapTask的输出已经shuffle完毕

metrics.successFetch();

return null;后面的异常失败信息我们不管。

这里的mapOutput是用来容纳MapTask输出文件的存储空间，根据输出文件的内容大小和内存的情况，可以是内存的Output也可以是DiskOutput。如果是内存需要预约，因为不止一个Fetcher。我们以InMemoryMapOutput为例。

代码结构;

Fetcher.run–>copyFromHost–>copyMapOutput–>merger.reserve(MergeManagerImpl.reserve)–>InmemoryMapOutput.shuffle

public void shuffle(MapHost host, InputStream input,

long compressedLength, long decompressedLength,

ShuffleClientMetrics metrics,

Reporter reporter) throws IOException {

//跨节点从Mapper拷贝spill文件

IFileInputStream checksumIn =

new IFileInputStream(input, compressedLength, conf);

//校验和的输入流

input = checksumIn;

// Are map-outputs compressed?

if (codec != null) {

//如果涉及到了压缩

decompressor.reset();

//重启解压器

input = codec.createInputStream(input, decompressor);

//加了解压器的输入流

}

try {

IOUtils.readFully(input, memory, 0, memory.length);

//从Mapper方把特定的Partition数据读入Reducer的内存缓冲区。

metrics.inputBytes(memory.length);

reporter.progress();//汇报进度

LOG.info("Read " + memory.length + " bytes from map-output for " +

getMapId());

/**

We’ve gotten the amount of data we were expecting. Verify the
decompressor has nothing more to offer. This action also forces the
decompressor to read any trailing bytes that weren’t critical
for decompression, which is necessary to keep the stream
in sync.

if (input.read() >= 0 ) {

throw new IOException("Unexpected extra bytes from input stream for " +

getMapId());

}

} catch (IOException ioe) {

// Close the streams

IOUtils.cleanup(LOG, input);

// Re-throw

throw ioe;

} finally {

CodecPool.returnDecompressor(decompressor);

//释放解压器

}

从对方把spill文件中属于本partition数据复制过来，回到copyFromHost中，通过scheduler.copySuccessed告知scheduler，并把这个MapTask的ID从remaining集合中删除，进入下一个循环，复制下一个MapTask数据。直到把所有的属于本Partition的数据都复制过来。

以上是Reducer端Fetcher的过程，它向Mapper端发送HTTP GET请求，下载文件。在MapTask就有一个与之对应的Server，这个网络协议的源代码不做深究，课下有兴趣自己研究。获取更多大数据视频资料请加QQ群：947967114

原创文章，作者：carmelaweatherly，如若转载，请注明出处：https://blog.ytso.com/190727.html

大数据 ： Hadoop reduce阶段

相关推荐

发表回复

大数据： Hadoop reduce阶段