Thanks so much for following up on this!
Hmm, I wonder if we should have a concerted effort to chart performance on
various pieces of hardware...
On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ulanov@hp.com> wrote:
> Hi Everyone, I've updated the benchmark as Xiangrui suggested. I added a
> comment that BIDMat 0.9.7 uses Float matrices on the GPU (although I see
> support for Double in the current source code) and ran the test with BIDMat
> and CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Best regards, Alexander
>
> -----Original Message-----
> From: Sam Halliday [mailto:sam.halliday@gmail.com]
> Sent: Tuesday, March 03, 2015 1:54 PM
> To: Xiangrui Meng; Joseph Bradley
> Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> BTW, is anybody on this list going to the London Meetup in a few weeks?
>
>
> https://skillsmatter.com/meetups/6987apachesparklivingthepostmapreduceworld#community
>
> Would be nice to meet other people working on the guts of Spark! :)
>
>
> Xiangrui Meng <mengxr@gmail.com> writes:
>
> > Hey Alexander,
> >
> > I don't quite understand the part where netlib-cublas is about 20x
> > slower than netlib-openblas. What is the overhead of using a GPU BLAS
> > with netlib-java?
> >
> > CC'ed Sam, the author of netlib-java.
> >
> > Best,
> > Xiangrui
> >
> > On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <joseph@databricks.com>
> > wrote:
> >> Better documentation for linking would be very helpful! Here's a JIRA:
> >> https://issues.apache.org/jira/browse/SPARK-6019
> >>
> >>
> >> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
> >> <evan.sparks@gmail.com>
> >> wrote:
> >>
> >>> Thanks for compiling all the data and running these benchmarks,
> >>> Alex. The big takeaways here can be seen with this chart:
> >>>
> >>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
> >>>
> >>> 1) A properly configured GPU matrix multiply implementation (e.g.
> >>> BIDMat+GPU) can provide substantial (but less than an order of
> >>> magnitude) benefit over a well-tuned CPU implementation (e.g.
> >>> BIDMat+MKL or netlib-java+openblas-compiled).
> >>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref)
> >>> can be 1-2 orders of magnitude worse than a well-tuned CPU
> >>> implementation, particularly for larger matrices. This is not to pick
> >>> on netlib - this basically agrees with the author's own benchmarks (
> >>> https://github.com/fommil/netlib-java)
> >>>
> >>> I think that most of our users are in a situation where using GPUs
> >>> may not be practical - although we could consider having a good GPU
> >>> backend available as an option. However, *ALL* users of MLlib could
> >>> benefit (potentially tremendously) from using a well-tuned CPU-based
> >>> BLAS implementation. Perhaps we should consider updating the mllib
> >>> guide with a more complete section for enabling high performance
> >>> binaries on OSX and Linux? Or better, figure out a way for the
> >>> system to fetch these automatically.
> >>>
> >>> - Evan
> >>>
> >>>
> >>>
> >>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
> >>> alexander.ulanov@hp.com> wrote:
> >>>
> >>>> Just to summarize this thread, I was finally able to make all
> >>>> performance comparisons that we discussed. It turns out that:
> >>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled
> >>>> > netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
> >>>>
> >>>> Below is the link to the spreadsheet with full results.
> >>>>
> >>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> >>>>
> >>>> One thing still needs exploration: does BIDMat-cublas copy data
> >>>> to/from the machine's RAM?
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ulanov, Alexander
> >>>> Sent: Tuesday, February 10, 2015 2:12 PM
> >>>> To: Evan R. Sparks
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Thanks, Evan! It seems that the ticket was marked as a duplicate,
> >>>> though the original one discusses a slightly different topic. I was
> >>>> able to link netlib with MKL from the BIDMat binaries. Indeed, MKL is
> >>>> statically linked inside a 60MB library.
> >>>>
> >>>> A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas(native system) | Breeze+Netlib-f2jblas
> >>>> ------------------------+-------------+-------------------------------+---------------------------------------+----------------------
> >>>> 100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                            | 0,002556
> >>>> 1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                            | 1,638475459
> >>>> 10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                           | 1569,233228
> >>>>
> >>>> It turns out that precompiled MKL is faster than precompiled
> >>>> OpenBLAS on my machine. I’ll probably add two more columns with
> >>>> locally compiled OpenBLAS and CUDA.
> >>>>
> >>>> Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
> >>>> Sent: Monday, February 09, 2015 6:06 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Great - perhaps we can move this discussion off-list and onto a
> >>>> JIRA ticket? (Here's one:
> >>>> https://issues.apache.org/jira/browse/SPARK-5705)
> >>>>
> >>>> It seems like this is going to be somewhat exploratory for a while
> >>>> (and there's probably only a handful of us who really care about
> >>>> fast linear
> >>>> algebra!)
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
> >>>> alexander.ulanov@hp.com> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for the explanation and the useful link. I am going to build
> >>>> OpenBLAS, link it with Netlib-java, and run the benchmark again.
> >>>>
> >>>> Do I understand correctly that the BIDMat binaries contain statically
> >>>> linked Intel MKL BLAS? That might be the reason why I am able to run
> >>>> BIDMat without having MKL BLAS installed on my server. If it is true, I
> >>>> wonder if it is OK, because Intel sells this library. Nevertheless,
> >>>> it seems that in my case precompiled MKL BLAS performs better than
> >>>> precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed
> >>>> to be on par in terms of JNI overhead.
> >>>>
> >>>> Still, it might be interesting to link Netlib-java with Intel MKL,
> >>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
> >>>> Halliday (Netlib-java) would be interested in comparing their
> >>>> libraries.
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
> >>>> Sent: Friday, February 06, 2015 5:58 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I would build OpenBLAS yourself, since good BLAS performance comes
> >>>> from getting cache sizes, etc. set up correctly for your particular
> >>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
> >>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
> >>>> quickly and yields performance competitive with MKL.
> >>>>
> >>>> To make sure the right library is getting used, you have to make
> >>>> sure it's first on the search path - export
> >>>> LD_LIBRARY_PATH=/path/to/blas/library.so will do the trick here.
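
A note on the path setup described above, as a minimal sketch rather than the exact procedure used in the scripts mentioned in this thread: LD_LIBRARY_PATH is a colon-separated list of directories, so the entry to put first is the directory that contains the BLAS shared library, and it has to be in place before the JVM process starts. The helper name and the example directories are hypothetical.

```python
import os


def env_with_blas_first(blas_dir, base_env=None):
    """Return a copy of the environment with blas_dir first on LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Drop empty entries and any earlier occurrence of blas_dir, then prepend it.
    rest = [p for p in existing.split(":") if p and p != blas_dir]
    env["LD_LIBRARY_PATH"] = ":".join([blas_dir] + rest)
    return env


if __name__ == "__main__":
    # Hypothetical directory holding a locally built BLAS shared library.
    env = env_with_blas_first("/opt/openblas/lib", {"LD_LIBRARY_PATH": "/usr/lib64"})
    print(env["LD_LIBRARY_PATH"])
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the benchmark JVM, so the setting applies to that process only.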
> >>>>
> >>>> For some examples of getting netlib-java set up on an ec2 node and
> >>>> some example benchmarking code we ran a while back, see:
> >>>> https://github.com/shivaram/matrix-bench
> >>>>
> >>>> In particular - build-openblas-ec2.sh shows you how to build the
> >>>> library and set up symlinks correctly, and scala/run-netlib.sh
> >>>> shows you how to get the path set up and get that library picked up
> >>>> by netlib-java.
> >>>>
> >>>> In this way - you could probably get cuBLAS set up to be used by
> >>>> netlib-java as well.
> >>>>
> >>>> - Evan
> >>>>
> >>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
> >>>> alexander.ulanov@hp.com> wrote:
> >>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
> >>>> load the right blas? For netlib, I see there are a few JVM
> >>>> flags, such as
> >>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
> >>>> so I can force it to use the Java implementation. I am not sure I
> >>>> understand how to force the use of a specific blas (as opposed to a
> >>>> specific wrapper for blas).
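
To make the wrapper-versus-library distinction in this question concrete, here is a small sketch that builds the JVM flag for each netlib-java BLAS wrapper. The three class names are the ones netlib-java documents; the short keys and the helper name are hypothetical. As far as I understand netlib-java, this flag only selects the wrapper class, and which native library the system wrapper ends up bound to is decided by the OS loader, not by the flag.

```python
NETLIB_PACKAGE = "com.github.fommil.netlib"

# Short keys are hypothetical; the class names come from netlib-java.
BLAS_WRAPPERS = {
    "f2j": "F2jBLAS",              # pure-Java implementation
    "ref": "NativeRefBLAS",        # bundled reference native build
    "system": "NativeSystemBLAS",  # native BLAS resolved by the OS loader
}


def netlib_blas_flag(wrapper):
    """Return the -D option that pins netlib-java to one BLAS wrapper."""
    cls = BLAS_WRAPPERS[wrapper]
    return "-D{0}.BLAS={0}.{1}".format(NETLIB_PACKAGE, cls)


if __name__ == "__main__":
    for key in BLAS_WRAPPERS:
        print(key, netlib_blas_flag(key))
```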
> >>>>
> >>>> Btw. I have installed openblas (yum install openblas), so I suppose
> >>>> that netlib is using it.
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
> >>>> Sent: Friday, February 06, 2015 5:19 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Getting breeze to pick up the right blas library is critical for
> >>>> performance. I recommend using OpenBLAS (or MKL, if you already have
> >>>> it).
> >>>> It might make sense to force BIDMat to use the same underlying BLAS
> >>>> library as well.
> >>>>
> >>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
> >>>> alexander.ulanov@hp.com> wrote:
> >>>> Hi Evan, Joseph
> >>>>
> >>>> I ran a few matrix multiplication tests and BIDMat seems to be ~10x
> >>>> faster than netlib-java+breeze (sorry for the weird table formatting):
> >>>>
> >>>> A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas
> >>>> ------------------------+-------------+-----------------------------------------------+---------------------------
> >>>> 100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556
> >>>> 1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459
> >>>> 10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228
> >>>>
> >>>> Configuration: Intel(R) Xeon(R) CPU E3-1240 3.3 GHz, 6GB RAM, Fedora
> >>>> 19 Linux, Scala 2.11.
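
For comparing these timings across matrix sizes, it can help to convert them to an effective throughput. The sketch below assumes the table values are wall-clock seconds (the thread does not state the unit) and uses the usual estimate of about 2*n^3 floating-point operations for an n x n by n x n multiply; the function name is hypothetical.

```python
def gemm_gflops(n, seconds):
    """Effective GFLOP/s for an n x n by n x n multiply (about 2*n^3 flops)."""
    return 2.0 * n ** 3 / seconds / 1e9


# 10000x10000 row of the table, decimal commas converted to points.
# Assumption: the values are seconds.
timings = {
    "BIDMat MKL": 23.78046632,
    "Breeze+Netlib-java native system": 445.0935211,
    "Breeze+Netlib-java f2jblas": 1569.233228,
}

for name, seconds in timings.items():
    print("{}: {:.1f} GFLOP/s".format(name, gemm_gflops(10000, seconds)))
```

Dividing two of these figures gives the speedup between configurations, independent of the assumed time unit.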
> >>>>
> >>>> Later I will run tests with CUDA. I need to install a new CUDA
> >>>> version for this purpose.
> >>>>
> >>>> Do you have any ideas why breeze-netlib with native blas is so much
> >>>> slower than BIDMat MKL?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Joseph Bradley [mailto:joseph@databricks.com]
> >>>> Sent: Thursday, February 05, 2015 5:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: Evan R. Sparks; dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> Hi Alexander,
> >>>>
> >>>> Using GPUs with Spark would be very exciting. Small comment:
> >>>> Concerning your question earlier about keeping data stored on the
> >>>> GPU rather than having to move it between main memory and GPU
> >>>> memory on each iteration, I would guess this would be critical to
> >>>> getting good performance. If you could do multiple local
> >>>> iterations before aggregating results, then the cost of data
> >>>> movement to the GPU could be amortized (and I believe that is done
> >>>> in practice). Having Spark be aware of the GPU and using it as
> >>>> another part of memory sounds like a much bigger undertaking.
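
The amortization argument above can be written as a simple cost model: if one host-to-GPU round trip is shared by several local iterations, its cost per iteration shrinks proportionally. This is only an illustrative sketch; the function name and the timing numbers are hypothetical, not measurements from this thread.

```python
def time_per_iteration(compute_s, transfer_s, local_iters):
    """Average wall time per iteration when one host<->GPU round trip
    is shared by local_iters iterations run on the device."""
    if local_iters < 1:
        raise ValueError("local_iters must be >= 1")
    return compute_s + transfer_s / local_iters


if __name__ == "__main__":
    # Hypothetical costs: 10 ms of GPU compute, 50 ms to move data both ways.
    for k in (1, 10, 100):
        print(k, time_per_iteration(0.010, 0.050, k))
```

With these made-up numbers the transfer dominates when every iteration moves data, and becomes a small fraction of the total once many iterations run between transfers.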
> >>>>
> >>>> Joseph
> >>>>
> >>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
> >>>> alexander.ulanov@hp.com> wrote:
> >>>> Thank you for the explanation! I’ve watched the BIDMach presentation by
> >>>> John Canny and I am really inspired by his talk and comparisons with
> >>>> Spark MLlib.
> >>>>
> >>>> I am very interested to find out what will be better within Spark:
> >>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
> >>>> fair way to benchmark them? Currently I do benchmarks on artificial
> >>>> neural networks in batch mode. While it is not a “pure” test of
> >>>> linear algebra, it involves some other things that are essential to
> >>>> machine learning.
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
> >>>> Sent: Thursday, February 05, 2015 1:29 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> >>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
> >>>> data layout and fewer levels of indirection - it's definitely a
> >>>> worthwhile experiment to run. The main speedups I've seen from
> >>>> using it come from highly optimized GPU code for linear algebra. I
> >>>> know that in the past Canny has gone as far as to write custom GPU
> >>>> kernels for performance-critical regions of code.[1]
> >>>>
> >>>> BIDMach is highly optimized for single node performance or
> >>>> performance on small clusters.[2] Once data doesn't fit easily in
> >>>> GPU memory (or can be batched in that way) the performance tends to
> >>>> fall off. Canny argues for hardware/software codesign and as such
> >>>> prefers machine configurations that are quite different than what
> >>>> we find in most commodity cluster nodes - e.g. 10 disk channels and
> >>>> 4 GPUs.
> >>>>
> >>>> In contrast, MLlib was designed for horizontal scalability on
> >>>> commodity clusters and works best on very big datasets - on the order
> >>>> of terabytes.
> >>>>
> >>>> For the most part, these projects developed concurrently to address
> >>>> slightly different use cases. That said, there may be bits of
> >>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
> >>>> careful about maintaining cross-language compatibility for our Java
> >>>> and Python users, though.
> >>>>
> >>>> - Evan
> >>>>
> >>>> [1] - http://arxiv.org/abs/1409.5402
> >>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
> >>>>
> >>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
> >>>> alexander.ulanov@hp.com> wrote:
> >>>> Hi Evan,
> >>>>
> >>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
> >>>> you know what makes it faster than netlib-java?
> >>>>
> >>>> The same group has the BIDMach library that implements machine
> >>>> learning. For some examples they use the Caffe convolutional neural
> >>>> network library, owned by another group in Berkeley. Could you
> >>>> elaborate on how these all might be connected with Spark MLlib? If
> >>>> you take BIDMat for linear algebra, why don’t you take BIDMach for
> >>>> optimization and learning?
> >>>>
> >>>> Best regards, Alexander
> >>>>
> >>>> From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
> >>>> Sent: Thursday, February 05, 2015 12:09 PM
> >>>> To: Ulanov, Alexander
> >>>> Cc: dev@spark.apache.org
> >>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>
> >>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
> >>>> blas in many cases.
> >>>>
> >>>> You might consider taking a look at the codepaths that BIDMat (
> >>>> https://github.com/BIDData/BIDMat) takes and comparing them to
> >>>> netlib-java/breeze. John Canny et al. have done a bunch of work
> >>>> optimizing to make this work really fast from Scala. I've run it on
> >>>> my laptop and compared to MKL and in certain cases it's 10x faster at
> >>>> matrix multiply.
> >>>> There are a lot of layers of indirection here and you really want
> >>>> to avoid data copying as much as possible.
> >>>>
> >>>> We could also consider swapping out BIDMat for Breeze, but that
> >>>> would be a big project and if we can figure out how to get
> >>>> breeze+cublas to comparable performance that would be a big win.
> >>>>
> >>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
> >>>> alexander.ulanov@hp.com> wrote:
> >>>> Dear Spark developers,
> >>>>
> >>>> I am exploring how to make linear algebra operations faster within
> >>>> Spark. One way of doing this is to use the Scala Breeze library that
> >>>> is bundled with Spark. For matrix operations, it employs Netlib-java,
> >>>> which has a Java wrapper for BLAS (basic linear algebra subprograms)
> >>>> and LAPACK native binaries if they are available on the worker
> >>>> node. It also has its own optimized Java implementation of BLAS. It
> >>>> is worth mentioning that native binaries provide better performance
> >>>> only for BLAS level 3, i.e. matrix-matrix operations or general
> >>>> matrix multiplication (GEMM).
> >>>> This is confirmed by the GEMM test on the Netlib-java page
> >>>> https://github.com/fommil/netlib-java. I also confirmed it with my
> >>>> experiments with training of an artificial neural network
> >>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
> >>>> However, I would like to boost performance more.
> >>>>
> >>>> GPU is supposed to work fast with linear algebra and there is an
> >>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
> >>>> server with an Nvidia GPU and I was able to do the following. I linked
> >>>> cublas (instead of cpu-based blas) with the Netlib-java wrapper and put
> >>>> it into Spark, so Breeze/Netlib is using it. Then I did some
> >>>> performance measurements with regard to artificial neural network
> >>>> batch learning in Spark MLlib that involves matrix-matrix
> >>>> multiplications. It turns out that for matrices of size less than
> >>>> ~1000x780 GPU cublas has the same speed as CPU blas. Cublas becomes
> >>>> slower for bigger matrices. It is worth mentioning that it was not a
> >>>> test for ONLY multiplication since there are other operations involved.
> >>>> One of the reasons for the slowdown might be the overhead of copying
> >>>> the matrices from computer memory to graphics card memory and back.
> >>>>
> >>>> So, a few questions:
> >>>> 1) Do these results with CUDA make sense?
> >>>> 2) If the problem is with copy overhead, are there any libraries
> >>>> that allow forcing intermediate results to stay in graphics card
> >>>> memory, thus removing the overhead?
> >>>> 3) Any other options to speed up linear algebra in Spark?
> >>>>
> >>>> Thank you, Alexander
> >>>>
> >>>
>
> 
> Best regards,
> Sam
>
