Deploy Apache Spark on macOS Sierra

Everything is hard at the beginning… especially the environment setup.

OS: macOS Sierra (v 10.12.4)
Spark release: 2.1.1 (May 02 2017)
package type: Pre-built for Apache Hadoop 2.7 and later
Java version: 1.8.0_131

tar -xvzf spark-2.1.1-bin-hadoop2.7.tgz
mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark
cd /usr/local/spark/

Since the downloaded package is pre-built, it is ready to use right after extracting it as above. Taking Python as an example:

  • Interactive shell (a quick check you can type there follows this list):
./bin/pyspark --master local[2]
  • The example programs bundled with the official Spark package also come in Python versions:
./bin/spark-submit examples/src/main/python/pi.py 10
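Once the interactive shell is up, it pre-creates a SparkContext for you as sc, so a quick sanity check at the >>> prompt can be as simple as the following (just an illustration; any small RDD will do):

>>> sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()
50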

Next, let's look at how to develop Spark programs with a Python IDE. The IDE here is PyCharm; I downloaded the Community edition, and after downloading it you can simply run the installer.

Then open PyCharm; we need to add two libraries to the project (as shown in the figure below):

  1. Add Spark's Python directory to the IDE as a library,
  2. py4j-x.y.z-src.zip (it sits under Spark's python/lib directory)
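If you would rather not depend on the IDE configuration, you can also point Python at those two paths from inside the script itself. A minimal sketch, assuming Spark was moved to /usr/local/spark as above (replace x.y.z with the py4j version you actually find under python/lib):

import os
import sys

# Same effect as exporting SPARK_HOME in the shell.
os.environ.setdefault("SPARK_HOME", "/usr/local/spark")

# Make the pyspark package and its bundled py4j importable,
# which is what adding the two libraries in PyCharm accomplishes.
sys.path.append("/usr/local/spark/python")
sys.path.append("/usr/local/spark/python/lib/py4j-x.y.z-src.zip")

from pyspark import SparkContext  # should now import without errors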

Before starting to write code, remember to set Spark's environment variables (e.g. SPARK_HOME), or else set them inside the program as sketched above. After that, we can start coding! We will use the following snippet as a test:

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "/usr/local/spark/README.md" # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

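# Each count() is an action: it triggers a Spark job over the cached RDD.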
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

sc.stop()
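
To run it, submit the script the same way as the pi.py example earlier, for instance from /usr/local/spark (assuming SimpleApp.py was saved there):

./bin/spark-submit SimpleApp.py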

The execution result is shown in the figure below:

