I’ve seen a few people struggling with integrating Spark and Ignite in the mailing lists. As with so much of this stuff, it’s actually quite straightforward if the stars align. But getting those stars to align can take some trial and error. Here’s my experience so, hopefully, you can skip some of it.
First, the caveats. My use case here is mainly just playing around with data. I’m not setting up a cluster of Spark nodes (though the same principles apply). One of the strengths of Spark is the ability to use a DataFrame from Python, making it easy to run lots of ad hoc queries. But Ignite has a faster SQL engine and can operate over a whole cluster’s worth of data. I find the combination quite effective!
Step one: let’s start PySpark, as they call the Python version of Spark, with the ability to connect to an Ignite cluster. I’m sure there are more minimal ways of doing it, but this is what I use:
bin/pyspark --jars $IGNITE_HOME/libs/*.jar,$IGNITE_HOME/libs/optional/ignite-spark/*.jar,$IGNITE_HOME/libs/ignite-spring/*.jar,$IGNITE_HOME/libs/ignite-indexing/*.jar
I’m using Spark 2.3 here, which is one version behind the current one. I didn’t dig too deeply, but I was getting weird errors with 2.4 that just went away when I dropped back a version. On the Ignite side I’m using 2.7 (the current version). And, yes, there are lots of wildcards in that command line: I’m aiming to minimise typing rather than memory use. When going into production you might want to filter the list of JARs down to just the ones you need.
If you’re using a Spark cluster, the spark-submit command looks very similar.
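Something along these lines, say (the master URL and script name are placeholders for your own values):

```shell
# Sketch only -- substitute your own master URL and application script
spark-submit \
  --master spark://your-master:7077 \
  --jars $IGNITE_HOME/libs/*.jar,$IGNITE_HOME/libs/optional/ignite-spark/*.jar,$IGNITE_HOME/libs/ignite-spring/*.jar,$IGNITE_HOME/libs/ignite-indexing/*.jar \
  your_app.py
```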
So we can play around, let’s load in a CSV file (substitute your own file name; the header and schema options here are just the ones I happened to use):

input = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("trades.csv")
Let’s write that CSV file to Ignite (the table name and primary-key column here are just examples — use your own):

import os

configFile = os.environ['IGNITE_HOME'] + "/config/default-config.xml"

input.write.format("ignite") \
    .option("config", configFile) \
    .option("table", "trades") \
    .option("primaryKeyFields", "isix") \
    .save()
I’m running my Ignite node and Spark client on the same machine but that’s not a requirement. As long as Spark can “see” the cluster — and you tell it how using the ‘config’ parameter — it should be okay. You might want to run the count() command on the resulting DataFrame to make sure you got the primary key right.
And then let’s pull it back in again, using the same config file and table name:

ignite = spark.read.format("ignite") \
    .option("config", configFile) \
    .option("table", "trades") \
    .load()
Now we have two DataFrames, one in Spark’s memory, the other in Ignite, and both pretty much look the same from a programming point of view. For example, we can do things like:
input.select(input["isix"], input["entryPrcAsk"] - input["entryPrcBid"]).groupby("isix").mean().show()

ignite.select(ignite["isix"], ignite["entryPrcAsk"] - ignite["entryPrcBid"]).groupby("isix").mean().show()
And they both return similar values.
(Why similar and not identical? Well, they’re different SQL engines and the floating point numbers are stored slightly differently, meaning that calculations have slightly different errors. If I’d used integers or strings I would not have expected the difference.)
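You can see the same effect in plain Python: floating-point addition isn’t associative, so the order in which an engine happens to sum values changes the rounding.

```python
# Floating-point addition is not associative, so summation order matters
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False
print(left, right)     # 0.6000000000000001 0.6
```

Two SQL engines evaluating the “same” aggregate can therefore legitimately disagree in the last few decimal places.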
You’ll generally find that the more complex the SQL expression and the more data you have, the faster the Ignite version will go relative to the pure-Spark one.
Even without the performance boost, I love the fact that I can do so much more with Ignite without having to compile any code.
Once you get the basics right, you can start tuning it to your particular use case. For that you should check out the official documentation here. Good luck!