這篇文章詳細(xì)的介紹了spark廣播變量,值得一看
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-broadcast.html
在此只摘錄其中的Example
Let’s start with an introductory example to check out how to use broadcast variables and build your initial understanding.
You’re going to use a static mapping of interesting projects with their websites, i.e. Map[String, String] that the tasks, i.e. closures (anonymous functions) in transformations, use.
scala> val pws = Map("Apache Spark" -> "http://spark.apache.org/", "Scala" -> "http://www.scala-lang.org/")
pws: scala.collection.immutable.Map[String,String] = Map(Apache Spark -> http://spark.apache.org/, Scala -> http://www.scala-lang.org/)
scala> val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pws).collect
...
websites: Array[String] = Array(http://spark.apache.org/, http://www.scala-lang.org/)
It works, but is very ineffective as the pws map is sent over the wire to executors while it could have been there already. If there were more tasks that need the pws map, you could improve their performance by minimizing the number of bytes that are going to be sent over the network for task execution.
Enter broadcast variables.
val pwsB = sc.broadcast(pws)
val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pwsB.value).collect
// websites: Array[String] = Array(http://spark.apache.org/, http://www.scala-lang.org/)
Semantically, the two computations - with and without the broadcast value - are exactly the same, but the broadcast-based one wins performance-wise when there are more executors spawned to execute many tasks that use pws map.
總結(jié)
通過(guò)這篇文章可以知道,如果在driver中定義一個(gè)普通的變量,也是可以在不同的task中傳遞的,只不過(guò)是通過(guò)拷貝一個(gè)副本的方式傳遞。為了提高性能通過(guò)定義廣播變量,在每個(gè)機(jī)器上只生成一個(gè)只讀變量,共享給這個(gè)機(jī)器上所有的task。