Exporting a PySpark Model to PMML (Predictive Model Markup Language)

徐子函
Sep 4, 2018 · 3 min read

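Recently my company needed a real-time similarity-matching model. Spark is designed for batch workloads, so we had to add an intermediate service, and that service is based on PMML. So how do we export a PySpark model to PMML? The steps are explained below.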

First, install the PySpark2PMML package:

pip install --upgrade git+https://github.com/jpmml/pyspark2pmml.git

Next, pick the JPMML-SparkML release that matches the Spark version you have installed:

Spark version    JPMML-SparkML version (download)
2.0.X            1.1.20
2.1.X            1.2.12
2.2.X            1.3.8
2.3.X            1.4.5
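
The table above can also be written as a small lookup, which is handy when a setup script has to pick the jar automatically. This is only a sketch: the function name `jpmml_sparkml_version` is hypothetical, and the mapping covers only the four Spark lines listed above.

```python
# Mapping copied from the table above: Spark major.minor -> JPMML-SparkML version.
JPMML_SPARKML_VERSIONS = {
    "2.0": "1.1.20",
    "2.1": "1.2.12",
    "2.2": "1.3.8",
    "2.3": "1.4.5",
}


def jpmml_sparkml_version(spark_version):
    """Return the JPMML-SparkML version listed for a Spark version string.

    Hypothetical helper: only the Spark lines from the table are known.
    """
    # Keep only "major.minor", e.g. "2.3.1" -> "2.3".
    major_minor = ".".join(spark_version.split(".")[:2])
    try:
        return JPMML_SPARKML_VERSIONS[major_minor]
    except KeyError:
        raise ValueError("no JPMML-SparkML version listed for Spark " + spark_version)
```

Inside a PySpark session, `jpmml_sparkml_version(spark.version)` would give the version to download.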

Download the JPMML-SparkML jar to a directory of your choice, for example: ~/hadoop/jpmml

Once the download is complete, start PySpark with the following command:

pyspark --jars ~/hadoop/jpmml/jpmml-sparkml-executable-${version}.jar

Then run the following code, in order, to train the model:

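from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula

# `spark` is the SparkSession that the pyspark shell creates for you.
df = spark.read.csv("/data/Iris.csv", header=True, inferSchema=True)

# RFormula builds the label and feature columns: predict Species from all other columns.
formula = RFormula(formula="Species ~ .")
classifier = DecisionTreeClassifier()

# Chain the two stages and fit them on the data.
pipeline = Pipeline(stages=[formula, classifier])
pipelineModel = pipeline.fit(df)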

Once training has finished, run the following code to export the model in PMML format:

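import os

from pyspark2pmml import PMMLBuilder

# `sc` is the SparkContext provided by the pyspark shell.
pmmlBuilder = PMMLBuilder(sc, df, pipelineModel) \
    .putOption(classifier, "compact", True)

# The path is handed to the JVM, which does not expand "~", so expand it here.
pmmlBuilder.buildFile(os.path.expanduser("~/pmmlModels/DecisionTreeIris.pmml"))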
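
One thing to watch with the output path: `buildFile` passes it to the JVM, where `~` has no special meaning, and as far as I can tell the target directory is not created for you either. A minimal sketch of preparing the path on the Python side first, where `prepare_output_path` is a hypothetical helper and not part of pyspark2pmml:

```python
import os


def prepare_output_path(path):
    """Expand ~ and make sure the parent directory exists.

    Hypothetical helper, not part of pyspark2pmml.
    """
    # Turn "~/pmmlModels/x.pmml" into an absolute path under the home directory.
    full_path = os.path.abspath(os.path.expanduser(path))
    # Create the parent directory if it is missing; do nothing if it already exists.
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    return full_path
```

With it, the export call would become `pmmlBuilder.buildFile(prepare_output_path("~/pmmlModels/DecisionTreeIris.pmml"))`.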

When it finishes, the exported model can be found in ~/pmmlModels.
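
A quick way to confirm that the export worked is to parse the file and look at its root element and top-level children. The sketch below uses only the Python standard library. `summarize_pmml` is a hypothetical helper, and it strips XML namespaces rather than assuming a particular PMML schema version.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_pmml(path):
    """Return (PMML version attribute, list of top-level element names)."""
    root = ET.parse(path).getroot()
    if local_name(root.tag) != "PMML":
        raise ValueError("not a PMML document: root element is " + local_name(root.tag))
    children = [local_name(child.tag) for child in root]
    return root.get("version"), children
```

For the decision tree exported above, the list of top-level elements would be expected to include `TreeModel`, alongside entries such as `Header` and `DataDictionary`.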

You can of course also use spark-submit, with the following command:

spark-submit --jars /your_path/jpmml-sparkml-executable-${version}.jar your_app.py
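
Note that `${version}` in the command is a placeholder: typed literally, a shell treats it as a variable and substitutes an empty string if it is unset. One way to avoid that is to assemble the command in Python with the version filled in explicitly. The helper name `spark_submit_command` is hypothetical, and the jar file name simply follows the pattern used in the command above.

```python
import os


def spark_submit_command(jar_dir, jpmml_version, app_path):
    """Build the argument list for spark-submit with the JPMML-SparkML jar attached.

    Hypothetical helper; the jar name pattern is taken from the command above.
    """
    jar_name = "jpmml-sparkml-executable-{}.jar".format(jpmml_version)
    # Expand "~" here, since no shell will do it when the list is run directly.
    jar_path = os.path.join(os.path.expanduser(jar_dir), jar_name)
    return ["spark-submit", "--jars", jar_path, app_path]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. Keep in mind that a script launched this way does not get the `spark` and `sc` variables that the pyspark shell predefines, so `your_app.py` has to create its own session, for example with `SparkSession.builder.getOrCreate()`.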


    https://github.com/aspspspsp https://www.linkedin.com/in/%E5%AD%90%E5%87%BD-%E5%BE%90-290a46107/
