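Hadoop's MapReduce is a classic framework for distributed parallel computation, but I had always been a little fuzzy on what each phase actually does. I spent some time reading the explanations on Stack Overflow and am writing them down here.
Stack Overflow link: https://stackoverflow.com/questions/22141631/what-is-the-purpose-of-shuffling-and-sorting-phase-in-the-reducer-in-map-reduce
The example below makes it clear at a glance.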

The figure above shows the classic word count example:
Map phase: map "takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs)". Here it turns the tokenized text into pairs of the form (word, num).
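To make the map step concrete, here is a minimal sketch in plain Python. It only simulates the idea and does not use the Hadoop API; the function name `map_words` and the sample line are made up for illustration.

```python
def map_words(line):
    """Emit a (word, 1) pair for every whitespace-separated token in the line."""
    return [(word, 1) for word in line.split()]


# Hypothetical input line, one record handed to a mapper.
pairs = map_words("Deer Bear River")
print(pairs)
```

Each word gets the count 1 at this point; nothing is added up until the later phases.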
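Combine phase: the combiner "shrinks the output of each Mapper. It would save the time spending for moving the data from one node to another". In other words, it merges the results of each map task locally before they are sent over the network, as the figure shows, so it works as an internal pre-aggregation step.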
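A rough sketch of what a combiner does for word count, again as a plain Python simulation rather than Hadoop code. The name `combine` and the sample pairs are assumptions for illustration.

```python
from collections import Counter


def combine(pairs):
    """Sum the counts per word within the output of a single mapper."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    # Return the pairs in key order so the result is easy to read.
    return sorted(totals.items())


# Hypothetical output of one mapper: three pairs shrink to two.
local_pairs = [("Car", 1), ("Car", 1), ("River", 1)]
print(combine(local_pairs))
```

The saving comes from the repeated key: two ("Car", 1) pairs leave the node as a single ("Car", 2) pair.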
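Shuffle & sort phase: according to the Stack Overflow answer, it "makes it easy for the run-time to schedule (spawn/start) new reducers, where while going through the sorted item list, whenever the current key is different from the previous, it can spawn a new reducer". In Hadoop terms, a change of key in the sorted input starts a new call to reduce() for the next key group, not a new reduce task. Based on the number of reduce tasks, the shuffle assigns every pair from the map (or combine) output to a partition. By default this is done by hashing the key (HashPartitioner), and the records inside each partition are then sorted by key. That is why the figure shows the keys in string order, with all pairs that share a key landing in the same partition, ready for the reduce step. Note that it is the combiner, not the shuffle, that is optional: every job with at least one reduce task goes through shuffle and sort, and only a map-only job (zero reduce tasks) skips it.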
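The partition-then-sort behaviour can be sketched as follows. This is a simulation under stated assumptions: `partition_for` and `shuffle_and_sort` are hypothetical names, and `zlib.crc32` stands in for Java's `key.hashCode()`, which Hadoop's default HashPartitioner uses as `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`.

```python
import zlib
from itertools import groupby


def partition_for(key, num_reducers):
    """Pick a partition by hashing the key (crc32 is a stand-in for hashCode)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs, num_reducers):
    """Route pairs to partitions, sort each partition by key, group the values."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))

    grouped = []
    for part in partitions:
        part.sort(key=lambda kv: kv[0])  # stable sort by key only
        grouped.append([
            (key, [value for _, value in group])
            for key, group in groupby(part, key=lambda kv: kv[0])
        ])
    return grouped


# Hypothetical combined output from several mappers, sent to two reducers.
pairs = [("River", 1), ("Car", 2), ("Bear", 1), ("Car", 1), ("Bear", 1)]
for index, partition in enumerate(shuffle_and_sort(pairs, 2)):
    print(index, partition)
```

Because the partition depends only on the key, every pair with the same key reaches the same reducer, and the sort puts equal keys next to each other so they can be grouped in one pass.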
Reduce phase: processes the shuffled results, for word count by summing the values of each key, and writes the final output.
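For word count the reduce step is just a sum per key. A minimal sketch, with the hypothetical name `reduce_counts` and made-up grouped input (not taken from the figure):

```python
def reduce_counts(word, counts):
    """Add up all counts that the shuffle grouped under one word."""
    return (word, sum(counts))


# Hypothetical input of one reducer: keys already sorted, values grouped per key.
grouped = [("Bear", [1, 1]), ("Car", [2, 1]), ("River", [1])]
results = [reduce_counts(word, counts) for word, counts in grouped]
print(results)
```

The reducer is called once per key, which is exactly what the sorted, grouped input from the shuffle makes possible.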