Tez Production Deployment and Performance Testing:

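Background:

When a job consists of several MR tasks, it necessarily goes through several complete Map-Shuffle-Reduce passes, and the intermediate data is written to HDFS each time, which wastes IO. (HDFS can be thought of as the shared storage between the tasks.) Tez addresses this problem at relatively low cost.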

Tez uses a DAG (directed acyclic graph) to organize MR tasks.
Core idea: split Map and Reduce tasks further. A Map task is split into Input-Processor-Sort-Merge-Output, and a Reduce task into Input-Shuffle-Sort-Merge-Processor-Output. Tez flexibly recombines these smaller tasks into one large DAG job.
Tez differs from Oozie: Oozie can only manage and organize MR tasks as whole units, so in essence it still runs multiple MR tasks and cannot remove the redundant disk IO between tasks described above. Tez is just a client, so it is easy to deploy.
Hive currently uses Tez (Hive is a tool that translates users' SQL requests into MR tasks and ultimately queries HDFS).
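To make the difference concrete, here is a toy model (a sketch, not Tez code) that describes a plan as a DAG and counts how many intermediate results would pass through HDFS if the same plan ran as a chain of separate MR jobs. The plan and its vertex names are made up for illustration, and `graphlib` assumes Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each vertex maps to the set of vertices it reads from.
PLAN = {
    "Map 1": set(),
    "Map 2": set(),
    "Reducer 3": {"Map 1", "Map 2"},  # e.g. a join
    "Reducer 4": {"Reducer 3"},       # e.g. an aggregation on the join result
    "Reducer 5": {"Reducer 4"},       # e.g. a final sort
}


def execution_order(plan):
    """Return one valid order in which the vertices can run."""
    return list(TopologicalSorter(plan).static_order())


def intermediate_hdfs_writes_as_mr_chain(plan):
    """Toy count: in a chain of MR jobs, every reducer whose output feeds
    another vertex ends a job and hands its result over through HDFS."""
    consumed = set().union(*plan.values())
    return sum(1 for v in plan if v.startswith("Reducer") and v in consumed)


if __name__ == "__main__":
    print(execution_order(PLAN))
    print(intermediate_hdfs_writes_as_mr_chain(PLAN))
```

In this made-up plan the MR chain hands two intermediate results through HDFS (after `Reducer 3` and `Reducer 4`), whereas a single DAG passes them along its edges.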

Traditional MR:

image.png

Tez:

image.png

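Tez techniques:

  • Application Master Pool: initializes a pool of AMs. Tez first submits the job to the AMPoolServer service. AMPoolServer requests several AMs when it starts, and jobs submitted through Tez use the pooled resources first
  • Container Pool: the AM requests several Containers in advance when it starts
  • Container reuse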

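How Tez is implemented:

Tez exposes 6 programmable components:

  • Input: an abstraction of the input data source; it parses the input data format and emits key/value pairs one at a time
  • Output: an abstraction of the output data sink; it writes the key/value pairs produced by the user program to the file system
  • Partitioner: partitions the data, similar to the Partitioner in MR
  • Processor: an abstraction of the computation; it reads data from an Input, processes it, and emits the result through an Output
  • Task: an abstraction of a task; each Task consists of an Input, an Output, and a Processor
  • Master: manages the dependencies between Tasks and runs them in dependency order

Besides these 6 components, Tez also provides two operators, Sort and Shuffle. For convenience, it also ships multiple implementations of Input, Output, Task, and Sort.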

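The Tez execution engine helps address some shortcomings of the existing MR framework, such as iterative and interactive computation. Besides Hive, Pig also uses Tez in its optimizations.
Tez is also built on YARN, so it coexists with existing MR without conflict. In practice, we only need to configure the Tez environment variables in hadoop-env.sh and set the job execution framework (mapreduce.framework.name) to yarn-tez in mapred-site.xml; jobs running on YARN then use the Tez execution mode, so connecting an existing system to Tez is straightforward. If we only want Hive to use Tez without changing the whole system, we can make the change in Hive alone, which is also simple. Hive can then switch freely between MR and Tez with no effect on existing Hadoop MR jobs. Tez is therefore loosely coupled and easy to adopt.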




1 CDH cluster test environment

Component   Version
CDH         5.11.0
HADOOP      2.6.0
HIVE        1.1.0

Deploying Tez on the CDH cluster

Version: tez-0.8.5

Deployment

Upload the Tez tarball and libs

    # Upload the Tez tarball to HDFS (create the target directory first if it does not exist)
    $ hdfs dfs -mkdir -p /apps/tez
    $ hdfs dfs -put ./tez-0.8.5.tar.gz /apps/tez/
    # Copy the jars from the compiled Tez build (apache-tez-0.8.5-src-2.6.0-cdh5.11.0/tez-dist/target/tez-0.8.5) into the CDH Hive lib directory (on every HiveServer2 and Hive Metastore Server host)
    $ cd apache-tez-0.8.5-src-2.6.0-cdh5.11.0/tez-dist/target/tez-0.8.5
    $ cp ./*.jar  /opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hive/lib
    $ cp ./lib/*.jar  /opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hive/lib
    ###################################################################
    # Sync to all CDH Hive nodes (gateway, metastore, and hiveserver hosts)
    ###################################################################

Configure the Tez configuration file

image.png

Add the Tez configuration:

<property>
   <name>tez.lib.uris</name>
   <value>${fs.defaultFS}/apps/tez/tez-0.8.5.tar.gz</value>
</property>
<!-- Commented out for now because the issue between Tez UI and the Timeline server is unresolved -->
<!-- <property>
     <name>tez.tez-ui.history-url.base</name>
     <value>http://tez-ui-serverIP:port/tez-ui/</value>
  </property>
  <property>
     <name>yarn.timeline-service.hostname</name>
     <value>${timelineServerIP}</value>
   </property> -->
<property>
   <name>tez.runtime.io.sort.mb</name>
   <value>1600</value>
   <description>40%*hive.tez.container.size</description>
</property>
<property>
   <name>hive.auto.convert.join.noconditionaltask.size</name>
   <value>1300</value>
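   <description>When multiple map joins are converted into one, this is the maximum total file size of all the small tables. It only limits the size of the input table files and does not reflect the actual hash table size during the map join. Suggested value: 1/3 * hive.tez.container.size. Note: Hive reads this value in bytes, so a 1300 MB limit would be written as 1363148800.</description>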
</property>
<property>
   <name>tez.runtime.unordered.output.buffer.size-mb</name>
   <value>400</value>
   <description>Size of the buffer to use if not writing directly to disk. Suggested value: 10% * hive.tez.container.size</description>
</property>
<property>
   <name>hive.tez.container.size</name>
   <value>4096</value>
   <description>Set hive.tez.container.size to be the same as or a small multiple(1 or 2 times that) of YARN container size yarn.scheduler.minimum-allocation-mb but NEVER more than yarn.scheduler.maximum-allocation-mb</description>
</property>
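The descriptions above express three settings as fractions of hive.tez.container.size. The following is a minimal sketch of that arithmetic; the function names are hypothetical, and the YARN check is only one reading of the guidance quoted in the hive.tez.container.size description. Because Hive reads hive.auto.convert.join.noconditionaltask.size in bytes, a small MB-to-bytes helper is included.

```python
def suggested_sizes_mb(container_mb: int) -> dict:
    """Derive the three dependent settings (in MB) from hive.tez.container.size."""
    return {
        "tez.runtime.io.sort.mb": container_mb * 40 // 100,            # 40%
        "hive.auto.convert.join.noconditionaltask.size": container_mb // 3,  # 1/3
        "tez.runtime.unordered.output.buffer.size-mb": container_mb * 10 // 100,  # 10%
    }


def mb_to_bytes(mb: int) -> int:
    """Convert megabytes to bytes for properties that Hive reads in bytes."""
    return mb * 1024 * 1024


def container_size_is_valid(container_mb: int, yarn_min_mb: int, yarn_max_mb: int) -> bool:
    """Assumed check: a multiple of yarn.scheduler.minimum-allocation-mb
    that does not exceed yarn.scheduler.maximum-allocation-mb."""
    return container_mb % yarn_min_mb == 0 and container_mb <= yarn_max_mb


if __name__ == "__main__":
    for name, value in suggested_sizes_mb(4096).items():
        print(name, value)
```

For a 4096 MB container this gives 1638, 1365, and 409 MB; the configuration above uses the rounded-down values 1600, 1300, and 400.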

###################################################################
Sync the configuration to all HiveServer2, Hive Metastore Server, and gateway hosts.

###################################################################

Hive server-side configuration

image.png

Add the Tez configuration at the location shown in the screenshot:

<property>
      <name>tez.lib.uris</name>
      <value>${fs.defaultFS}/apps/tez/tez-0.8.5.tar.gz</value>
   </property>
   <property>
      <name>tez.runtime.io.sort.mb</name>
      <value>1600</value>
      <description>40%*hive.tez.container.size</description>
   </property>
   <property>
      <name>hive.auto.convert.join.noconditionaltask.size</name>
      <value>1300</value>
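      <description>When multiple map joins are converted into one, this is the maximum total file size of all the small tables. It only limits the size of the input table files and does not reflect the actual hash table size during the map join. Suggested value: 1/3 * hive.tez.container.size. Note: Hive reads this value in bytes, so a 1300 MB limit would be written as 1363148800.</description>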
   </property>
   <property>
      <name>tez.runtime.unordered.output.buffer.size-mb</name>
      <value>400</value>
      <description>Size of the buffer to use if not writing directly to disk. Suggested value: 10% * hive.tez.container.size</description>
   </property>
   <property>
      <name>hive.tez.container.size</name>
      <value>4096</value>
      <description>Set hive.tez.container.size to be the same as or a small multiple(1 or 2 times that) of YARN container size yarn.scheduler.minimum-allocation-mb but NEVER more than yarn.scheduler.maximum-allocation-mb</description>
   </property>

Tez configuration notes

"hive.tez.container.size" and "hive.tez.java.opts" are the parameters that alter Tez memory settings in Hive. If "hive.tez.container.size" is set to "-1" (default value), it picks the value of "mapreduce.map.memory.mb". If "hive.tez.java.opts" is not specified, it relies on the "mapreduce.map.java.opts" setting. Thus, if Tez specific memory settings are left as default values, memory sizes are picked from mapreduce mapper memory settings "mapreduce.map.memory.mb".

Important: Please note that the setting for "hive.tez.java.opts" must be smaller than the size specified for "hive.tez.container.size", or "mapreduce.{map|reduce}.memory.mb" if "hive.tez.container.size" is not specified. Don't forget to review both of them when setting either one to ensure "hive.tez.java.opts" is smaller than "hive.tez.container.size" or "mapreduce.{map|reduce}.java.opts" is smaller than "mapreduce.{map|reduce}.memory.mb".

See Configuring Heapsize for Mappers and Reducers in Hadoop 2 for more information about the "mapreduce.map.memory.mb" and "mapreduce.map.java.opts" properties.
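The rule that the heap in "hive.tez.java.opts" must stay below "hive.tez.container.size" can be checked mechanically. The sketch below is an illustration with hypothetical helper names; it only understands `-Xmx` values with a k, m, or g suffix and treats anything else as missing.

```python
import re

# Multipliers that convert an -Xmx suffix to megabytes.
_UNIT_TO_MB = {"k": 1 / 1024, "m": 1, "g": 1024}


def xmx_mb(java_opts: str):
    """Return the -Xmx value in MB, or None when no suffixed -Xmx flag is present."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_opts)
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_TO_MB[match.group(2).lower()]


def heap_fits_container(java_opts: str, container_mb: int) -> bool:
    """True only when an -Xmx value is present and strictly below the container size."""
    heap = xmx_mb(java_opts)
    return heap is not None and heap < container_mb


if __name__ == "__main__":
    print(heap_fits_container("-server -Xmx3276m", 4096))
    print(heap_fits_container("-Xmx4g", 4096))
```

The 3276 MB figure is roughly 80% of a 4096 MB container, a common rule of thumb rather than a value taken from this deployment.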

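### YARN Timeline Server configuration (Timeline Server not enabled in this deployment)

Tez uses the Timeline Server to store application data. Because the issue between tez-ui and the YARN Timeline Server is still unresolved, tez-ui has not been added for now. The YARN Timeline Server configuration is listed here for reference.

Add to the yarn-site configuration of the CDH cluster: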

<property>
  <description>The hostname of the Timeline service web application.</description>
  <name>yarn.timeline-service.hostname</name>
  <value>10.10.15.107</value>
</property>
<property>
  <description>Address for the Timeline server to start the RPC server.</description>
  <name>yarn.timeline-service.address</name>
  <value>${yarn.timeline-service.hostname}:10200</value>
</property>
 
<property>
  <description>The http address of the Timeline service web application.</description>
  <name>yarn.timeline-service.webapp.address</name>
  <value>${yarn.timeline-service.hostname}:8188</value>
</property>
 
<property>
  <description>The https address of the Timeline service web application.</description>
  <name>yarn.timeline-service.webapp.https.address</name>
  <value>${yarn.timeline-service.hostname}:8190</value>
</property>
 
<property>
  <description>Handler thread count to serve the client RPC requests.</description>
  <name>yarn.timeline-service.handler-thread-count</name>
  <value>10</value>
</property>
 
<property>
  <description>Enables cross-origin support (CORS) for web services where
  cross-origin web response headers are needed. For example, javascript making
  a web services request to the timeline server.</description>
  <name>yarn.timeline-service.http-cross-origin.enabled</name>
  <value>true</value>
</property>
 
<property>
  <description>Comma separated list of origins that are allowed for web
  services needing cross-origin (CORS) support. Wildcards (*) and patterns
  allowed</description>
  <name>yarn.timeline-service.http-cross-origin.allowed-origins</name>
  <value>*</value>
</property>
 
<property>
  <description>Comma separated list of methods that are allowed for web
  services needing cross-origin (CORS) support.</description>
  <name>yarn.timeline-service.http-cross-origin.allowed-methods</name>
  <value>GET,POST,HEAD</value>
</property>
 
<property>
  <description>Comma separated list of headers that are allowed for web
  services needing cross-origin (CORS) support.</description>
  <name>yarn.timeline-service.http-cross-origin.allowed-headers</name>
  <value>X-Requested-With,Content-Type,Accept,Origin</value>
</property>
 
<property>
  <description>The number of seconds a pre-flighted request can be cached
  for web services needing cross-origin (CORS) support.</description>
  <name>yarn.timeline-service.http-cross-origin.max-age</name>
  <value>1800</value>
</property>
<property>
  <description>Indicate to clients whether Timeline service is enabled or not.
  If enabled, the TimelineClient library used by end-users will post entities
  and events to the Timeline server.</description>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
 
<property>
  <description>Store class name for timeline store.</description>
  <name>yarn.timeline-service.store-class</name>
  <value>org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore</value>
</property>
 
<property>
  <description>Enable age off of timeline store data.</description>
  <name>yarn.timeline-service.ttl-enable</name>
  <value>true</value>
</property>
 
<property>
  <description>Time to live for timeline store data in milliseconds.</description>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value>
</property>
 
<property>
  <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
  <value>true</value>
</property>

Remember to restart the Hive service after changing the configuration.

Problems encountered with Tez

Q1: java.lang.ArithmeticException: / by zero

Vertex failed, vertexName=Map 1, vertexId=vertex_1557110571873_0003_1_00, diagnostics=[Vertex vertex_1557110571873_0003_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tez initializer failed, vertex=vertex_1557110571873_0003_1_00 [Map 1], java.lang.ArithmeticException: / by zero
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:123)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Solution:

SET hive.tez.container.size = ; (already added to the configuration above; choose the value according to the cluster settings)

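Q2: map stuck at 0%

Solution

I looked up the description of the hive.map.aggr.hash.percentmemory property in Hive: it is the fraction of JVM memory that the hash table used for map-side aggregation may occupy. In other words, with the property set to 0.25 (the default is 0.5), once the in-memory map reaches 25% of the memory configured for the Map JVM, the data is flushed to the reducers to free the in-memory map. Cause of the error: during map-side aggregation the hash table's memory fraction defaults to 0.5, which exceeded the memory actually available and caused an out-of-memory condition.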
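As a rough illustration of why lowering the fraction helps, the flush threshold can be approximated as the heap size multiplied by hive.map.aggr.hash.percentmemory. This is a simplified sketch with a hypothetical function name and heap size, not Hive's actual accounting.

```python
def map_aggr_hash_limit_mb(heap_mb: float, percent_memory: float) -> float:
    """Approximate memory (MB) the map-side aggregation hash table may use
    before its contents are flushed to the reducers."""
    if not 0 < percent_memory <= 1:
        raise ValueError("percent_memory must be in (0, 1]")
    return heap_mb * percent_memory


if __name__ == "__main__":
    heap_mb = 4096  # hypothetical heap size
    print(map_aggr_hash_limit_mb(heap_mb, 0.5))   # default fraction
    print(map_aggr_hash_limit_mb(heap_mb, 0.25))  # lowered fraction
```

Halving the fraction halves the amount of heap the hash table can claim, which leaves more room for everything else running in the same task.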

Q3: java.lang.NoClassDefFoundError: com/esotericsoftware/kryo/Serializer

ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1558679295627_0001_1_00, diagnostics=[Vertex vertex_1558679295627_0001_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tez initializer failed, vertex=vertex_1558679295627_0001_1_00 [Map 1], java.lang.NoClassDefFoundError: com/esotericsoftware/kryo/Serializer
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:107)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.kryo.Serializer
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 12 more

Solution

# Switch to the parcel's jars directory and copy the missing jar
$ cd /opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/jars
$ cp ./kryo-2.22.jar ../lib/hive/auxlib/
###################################################################
# Sync this change to all HiveServer2, Hive Metastore Server, and gateway hosts
###################################################################
# Queries run from the Hive command line do not hit this error; for queries run from Hue, Hive needs to be restarted.

Q4: Caused by: java.lang.OutOfMemoryError: Java heap space

Caused by: java.lang.OutOfMemoryError: Java heap space
 at org.apache.hadoop.hive.serde2.WriteBuffers.nextBufferToWrite(WriteBuffers.java:206)
 at org.apache.hadoop.hive.serde2.WriteBuffers.write(WriteBuffers.java:182)
 at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer$LazyBinaryKvWriter.writeValue(MapJoinBytesTableContainer.java:248)
 at org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap.writeFirstValueRecord(BytesBytesMultiHashMap.java:574)
 at org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap.put(BytesBytesMultiHashMap.java:229)
 at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer.putRow(MapJoinBytesTableContainer.java:288)

Solution
https://mapr.com/support/s/article/How-to-change-Tez-container-heapsize?language=en_US

Change the configuration:
<property>
  <name>tez.am.resource.memory.mb</name>
  <value>2048</value>
</property>

Q5: Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.tez.dag.api.TezException): App master already running a DAG

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.tez.dag.api.TezException): App master already running a DAG
    at org.apache.tez.dag.app.DAGAppMaster.submitDAGToAppMaster(DAGAppMaster.java:1379)
    at org.apache.tez.dag.api.client.DAGClientHandler.submitDAG(DAGClientHandler.java:140)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.submitDAG(DAGClientAMProtocolBlockingPBServerImpl.java:175)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7636)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
image.png
Add the following to the Hue configuration:
[beeswax]
    max_number_of_sessions=10

Q6: java.lang.AbstractMethodError: org.codehaus.jackson.map.AnnotationIntrospector.findSerializer(Lorg/codehaus/jackson/map/introspect/Annotated;)Ljava/lang/Object;

java.lang.AbstractMethodError: org.codehaus.jackson.map.AnnotationIntrospector.findSerializer(Lorg/codehaus/jackson/map/introspect/Annotated;)Ljava/lang/Object;
    at org.codehaus.jackson.map.ser.BasicSerializerFactory.findSerializerFromAnnotation(BasicSerializerFactory.java:362)
    at org.codehaus.jackson.map.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:252)
    at org.codehaus.jackson.map.ser.StdSerializerProvider._createUntypedSerializer(StdSerializerProvider.java:782)
    at org.codehaus.jackson.map.ser.StdSerializerProvider._createAndCacheUntypedSerializer(StdSerializerProvider.java:735)
    at org.codehaus.jackson.map.ser.StdSerializerProvider.findValueSerializer(StdSerializerProvider.java:344)
    at org.codehaus.jackson.map.ser.StdSerializerProvider.findTypedValueSerializer(StdSerializerProvider.java:420)
    at org.codehaus.jackson.map.ser.StdSerializerProvider._serializeValue(StdSerializerProvider.java:601)
    at org.codehaus.jackson.map.ser.StdSerializerProvider.serializeValue(StdSerializerProvider.java:256)
    at org.codehaus.jackson.map.ObjectMapper.writeValue(ObjectMapper.java:1604)
    at org.codehaus.jackson.jaxrs.JacksonJsonProvider.writeTo(JacksonJsonProvider.java:527)
    at com.sun.jersey.api.client.RequestWriter.writeRequestEntity(RequestWriter.java:300)
    at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:204)
    at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:226)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:162)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:237)
    at com.sun.jersey.api.client.Client.handle(Client.java:648)
    at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
    at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
    at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:472)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:321)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:301)
    at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(ATSHistoryLoggingService.java:358)
    at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.serviceStop(ATSHistoryLoggingService.java:233)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
    at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
    at org.apache.tez.dag.history.HistoryEventHandler.serviceStop(HistoryEventHandler.java:111)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:65)
    at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1949)
    at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2140)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHook.run(DAGAppMaster.java:2438)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

Solution:

Change the Jackson JARs used by Tez to version 1.9.13.
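Before swapping JARs it helps to see which Jackson 1.x (org.codehaus) versions are actually present. The sketch below is a hypothetical helper that scans JAR file names; the artifact list and the idea of scanning a lib directory are assumptions, so point `scan` at whichever directory holds the Tez JARs in your deployment.

```python
import re
from pathlib import Path

# Jackson 1.x (org.codehaus) artifacts, matched by file name only.
_CODEHAUS_JAR = re.compile(
    r"^(jackson-(?:core-asl|mapper-asl|jaxrs|xc))-(\d+(?:\.\d+)*)\.jar$"
)


def mismatched_jackson_jars(jar_names, expected="1.9.13"):
    """Return the Jackson 1.x JAR names whose version differs from `expected`."""
    result = []
    for name in jar_names:
        match = _CODEHAUS_JAR.match(name)
        if match and match.group(2) != expected:
            result.append(name)
    return sorted(result)


def scan(lib_dir):
    """List the JAR file names directly under lib_dir."""
    return [p.name for p in Path(lib_dir).glob("*.jar")]


if __name__ == "__main__":
    names = ["jackson-core-asl-1.8.8.jar", "jackson-mapper-asl-1.9.13.jar"]
    print(mismatched_jackson_jars(names))
```

Any name the helper reports is a candidate for replacement with the 1.9.13 build of the same artifact.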

Performance test of Tez vs. MR in Hive

Create the test tables in Hive

# Table UserVisits
$ create table UserVisits (sourceIP VARCHAR(116), destURL VARCHAR(100), visitDate DATE, adRevenue FLOAT, userAgent VARCHAR(256), countryCode CHAR(3), languageCode CHAR(6), searchWord VARCHAR(32), duration INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
# Table Rankings
$ create table Rankings (pageURL varchar(300), pageRank INT ,avgDuration INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

Load the data (one test run with a 1 GB dataset, one with a 2 GB dataset)

load data local inpath 'path' into table Rankings; ....

Tests

Scan

$ SELECT pageURL, pageRank FROM rankings WHERE pageRank > 1000000;

Aggregation

$ SELECT SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 4);

Join

$ SELECT sourceIP, totalRevenue FROM ( SELECT sourceIP, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL GROUP BY UV.sourceIP ) test ORDER BY totalRevenue DESC LIMIT 1;

External Script Query

$ CREATE TABLE url_counts_total AS SELECT SUM(pageRank) AS totalCount, pageURL FROM Rankings GROUP BY pageURL;
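One way to run each query under both engines is to drive Beeline from a small script and time it. This is a sketch under assumptions: the JDBC URL is a placeholder, the helper names are hypothetical, and the measured time includes Beeline's own startup and connection overhead, so it is only a coarse comparison.

```python
import subprocess
import time


def beeline_command(jdbc_url, engine, query):
    """Build a Beeline invocation that runs `query` with the given execution engine."""
    return [
        "beeline",
        "-u", jdbc_url,
        "--hiveconf", f"hive.execution.engine={engine}",
        "-e", query,
    ]


def time_query(jdbc_url, engine, query, run=subprocess.run):
    """Run the query once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    run(beeline_command(jdbc_url, engine, query), check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    url = "jdbc:hive2://hiveserver2-host:10000"  # placeholder URL
    query = "SELECT pageURL, pageRank FROM rankings WHERE pageRank > 1000000;"
    for engine in ("mr", "tez"):
        print(engine, time_query(url, engine, query))
```

Passing `run` as a parameter keeps the timing logic testable without a live cluster.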

Test results

image.png

image.png