015 Distributed Cache in Hadoop: Most Comprehensive Guide

1. Distributed Cache in Hadoop: Objective

This blog explains what the distributed cache in Hadoop is, how it works, and how it is implemented in the Hadoop framework. It also covers the advantages and limitations of the Apache Hadoop distributed cache.

2. Introduction to Hadoop

Apache Hadoop is an open-source software framework for distributed storage and processing of large data sets. Hadoop follows a master-slave architecture in which the master is the NameNode and the slaves are DataNodes. The NameNode stores metadata, i.e. the number of blocks, their locations, and their replicas. The DataNodes store the actual data in HDFS and perform read and write operations as requested by clients.
In Hadoop, data chunks are processed in parallel across the DataNodes by a program written by the user. If we want every DataNode to have access to certain files, we put those files in the distributed cache.

3. What is Distributed Cache in Hadoop?

Distributed Cache is a facility provided by the Hadoop MapReduce framework. It caches files that applications need. It can cache read-only text files, archives, jar files, etc. Once we have cached a file for our job, Hadoop makes it available on each DataNode where map/reduce tasks are running.
Thus, our map and reduce tasks can access the file on every DataNode they run on.
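
A common use of a cached file is as a small lookup table that every task loads once and then consults for each record. The sketch below shows that pattern in plain Java, with no Hadoop dependency. The class name, the tab-separated file layout, and the sample records are all assumptions made for illustration; in a real job the loading step would typically run once per task against the localized copy of the cache file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of using a cached lookup file; not part of the Hadoop API. */
final class CachedLookupSketch {

    // Parses "code<TAB>name" lines into an in-memory lookup table.
    static Map<String, String> loadLookup(BufferedReader reader) {
        Map<String, String> lookup = new HashMap<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lookup;
    }

    // Enriches one input record of the form "id,code" with the name from the lookup table.
    static String enrich(String record, Map<String, String> lookup) {
        String[] fields = record.split(",", 2);
        if (fields.length < 2) {
            return record + ",UNKNOWN";
        }
        return fields[0] + "," + lookup.getOrDefault(fields[1], "UNKNOWN");
    }

    public static void main(String[] args) {
        // Stand-in for the contents of the cached file.
        String cachedFile = "IN\tIndia\nUS\tUnited States\n";
        Map<String, String> lookup =
                loadLookup(new BufferedReader(new StringReader(cachedFile)));
        System.out.println(enrich("42,IN", lookup)); // prints 42,India
        System.out.println(enrich("43,FR", lookup)); // prints 43,UNKNOWN
    }
}
```

Because the table is read-only and small, each task can keep its own copy in memory without any coordination between nodes.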

3.1. Working and Implementation of Distributed Cache in Hadoop

First of all, an application that needs to use the distributed cache to distribute a file:

  • Should make sure that the file is available.

  • Should also make sure that the file can be accessed via a URL. The URL can be either hdfs:// or http://.
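
The URL requirement above can be checked before a job is submitted. The following is a minimal sketch in plain Java that accepts only the two schemes named in the text. The class name, the method name, and the host "namenode:8020" are hypothetical, and the check is not something Hadoop itself exposes in this form.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical pre-flight check; mirrors only the two schemes named in the text. */
final class CacheUriCheck {

    // Returns true when the URL uses the hdfs or http scheme.
    static boolean hasSupportedScheme(String url) {
        String scheme = URI.create(url).getScheme();
        if (scheme == null) {
            return false; // a bare path such as /user/... has no scheme
        }
        String normalized = scheme.toLowerCase(Locale.ROOT);
        return normalized.equals("hdfs") || normalized.equals("http");
    }

    public static void main(String[] args) {
        // "namenode:8020" is an assumed host and port, used only for illustration.
        System.out.println(hasSupportedScheme("hdfs://namenode:8020/user/dataflair/lib/jar_file.jar")); // true
        System.out.println(hasSupportedScheme("/user/dataflair/lib/jar_file.jar")); // false
    }
}
```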

Now, if the file is present at such a URL, the user registers it with the distributed cache as a cache file. The MapReduce job then copies the cache file to all the nodes before any tasks start on those nodes.
The process is as follows:

  • Copy the requisite file to HDFS:

$ hdfs dfs -put jar_file.jar /user/dataflair/lib/jar_file.jar

  • Set up the application’s JobConf:

DistributedCache.addFileToClassPath(new Path("/user/dataflair/lib/jar_file.jar"), conf);

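  • Add this call in the driver class. In Hadoop 2 and later, the DistributedCache class is deprecated in favor of the equivalent methods on Job, such as Job.addFileToClassPath(Path).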
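
To make the copy-before-tasks behaviour concrete, here is a small simulation in plain Java. It does not use Hadoop: temporary directories stand in for the worker nodes, and a temporary file stands in for the file stored in HDFS. All names are hypothetical, and the real localization is performed by the framework, not by user code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of localization; real Hadoop does this inside the framework. */
final class LocalizationSketch {

    // Copies the shared file into every node directory and returns the local copies.
    static List<Path> localize(Path sharedFile, List<Path> nodeDirs) {
        List<Path> copies = new ArrayList<>();
        try {
            for (Path nodeDir : nodeDirs) {
                Path local = nodeDir.resolve(sharedFile.getFileName());
                Files.copy(sharedFile, local);
                copies.add(local);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return copies;
    }

    // Builds a throwaway "cluster": one shared file plus nodeCount node directories.
    static List<Path> demo(int nodeCount) {
        try {
            Path shared = Files.createTempFile("cache-file", ".txt");
            Files.write(shared, "lookup data".getBytes(StandardCharsets.UTF_8));
            List<Path> nodeDirs = new ArrayList<>();
            for (int i = 0; i < nodeCount; i++) {
                nodeDirs.add(Files.createTempDirectory("node" + i + "-"));
            }
            return localize(shared, nodeDirs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Only after localization finishes would tasks on each node start.
        for (Path copy : demo(3)) {
            System.out.println("task can read " + copy);
        }
    }
}
```

Each task then reads its node-local copy, so the shared file is transferred once per node and not once per task.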

3.2. Size of Distributed Cache in Hadoop

The cache size property in mapred-site.xml (local.cache.size in Hadoop 1.x) controls the size of the distributed cache. By default, the size of the Hadoop distributed cache is 10 GB.

4. Benefits of Distributed Cache in Hadoop

Below are some advantages of the MapReduce distributed cache:

4.1. Store Complex Data

It distributes simple, read-only text files as well as complex types such as jars and archives. The archives are then un-archived at the slave node.

4.2. Data Consistency

Hadoop Distributed Cache tracks the modification timestamps of cache files, and it requires that the files not change while a job is executing. Using a hashing algorithm, the cache engine can always determine on which node a particular key-value pair resides. Since there is always a single state of the cache cluster, it is never inconsistent.
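
The timestamp tracking described above can be pictured as recording a file's modification time when the job starts and comparing against it later. The sketch below does this in plain Java on a temporary file. The class and method names are hypothetical, and this is an illustration of the idea, not Hadoop's internal implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

/** Hypothetical illustration of timestamp tracking; not Hadoop's internal code. */
final class CacheTimestampCheck {
    private final Path file;
    private final FileTime recorded;

    // Records the modification time once, as if at job submission.
    CacheTimestampCheck(Path file) {
        this.file = file;
        this.recorded = lastModified(file);
    }

    // True while the file still carries the recorded modification time.
    boolean unchanged() {
        return recorded.equals(lastModified(file));
    }

    private static FileTime lastModified(Path path) {
        try {
            return Files.getLastModifiedTime(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {unchanged before the edit, unchanged after the edit}.
    static boolean[] demo() {
        try {
            Path file = Files.createTempFile("cache-file", ".txt");
            CacheTimestampCheck check = new CacheTimestampCheck(file);
            boolean before = check.unchanged();
            // Simulate an edit during the job by moving the timestamp forward one minute.
            Files.setLastModifiedTime(file,
                    FileTime.fromMillis(check.recorded.toMillis() + 60_000));
            boolean after = check.unchanged();
            Files.delete(file);
            return new boolean[] {before, after};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("unchanged before edit: " + result[0]); // true
        System.out.println("unchanged after edit: " + result[1]);  // false
    }
}
```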

4.3. No Single Point of Failure

A distributed cache runs as an independent process across many nodes. Thus, failure of a single node does not result in a complete failure of the cache.

5. Overhead of Distributed Cache

A MapReduce distributed cache has overhead that will make it slower than an in-process cache:

5.1. Object serialization

A distributed cache must serialize objects. But the serialization mechanism has two major problems:

  • Very slow: Serialization uses reflection to inspect type information at runtime. Reflection is very slow compared to pre-compiled code.

  • Very bulky: Serialization stores the complete class name, cluster, and assembly details. It also stores references to other instances held in member variables. All this makes the serialized form very bulky.
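
The size overhead can be seen with Java's built-in serialization, used here as one example of a generic serialization mechanism. The sketch below serializes the same two-field object twice: once with ObjectOutputStream, which also writes a class descriptor, and once by writing only the raw field values. The Reading class and its fields are hypothetical, and no exact size is claimed for the built-in form since it depends on the class name and JVM details.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical comparison of generic serialization with a hand-written encoding. */
final class SerializationSizeSketch {

    static final class Reading implements Serializable {
        private static final long serialVersionUID = 1L;
        final int sensorId;
        final long timestamp;

        Reading(int sensorId, long timestamp) {
            this.sensorId = sensorId;
            this.timestamp = timestamp;
        }
    }

    // Built-in serialization: writes a stream header and class descriptor plus the fields.
    static int javaSerializedSize(Reading reading) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(reading);
            out.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hand-written encoding: only the field values, 4 bytes for the int and 8 for the long.
    static int handWrittenSize(Reading reading) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(reading.sensorId);
            out.writeLong(reading.timestamp);
            out.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reading reading = new Reading(7, 1_700_000_000_000L);
        System.out.println("built-in serialization bytes: " + javaSerializedSize(reading));
        System.out.println("hand-written bytes: " + handWrittenSize(reading)); // 12
    }
}
```

The hand-written form is smaller because the reader is assumed to know the field layout in advance, which is the same trade-off Hadoop's own Writable types make.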

6. Distributed Cache in Hadoop – Conclusion

In conclusion, the distributed cache is a mechanism supported by the Hadoop MapReduce framework. Using it, we can broadcast small or moderately sized read-only files to all the worker nodes. The distributed cache files are deleted from the worker nodes once the job has run successfully.

Reference

https://data-flair.training/blogs/hadoop-distributed-cache
