Hands on Guide to Intel® oneAPI HPC Toolkit

Tamal Acharya
10 min read · Apr 10, 2022


Continuing from my previous two blogs (Intel® Distribution of OpenVINO™ toolkit — Optimised Deep Learning | by Tamal Acharya | Mar, 2022 | Medium) and (Hands On Guide to Intel® AI Analytics Toolkit | by Tamal Acharya | Apr, 2022 | Medium), I will introduce you to the Intel® oneAPI HPC Toolkit, which delivers fast C++, Fortran, OpenMP*, and MPI applications that scale. The HPC Toolkit (HPC Kit) delivers what developers need to build, analyze, optimize, and scale. It is a comprehensive suite of development tools that makes it fast and easy to build modern code that gets maximum performance out of the newest Intel® processors, and it enables high-performance computing on clusters or individual nodes, with flexible options including optimal performance on a CPU or GPU. DPC++ is based on industry standards and open specifications to encourage ecosystem collaboration and innovation.

Included in this toolkit are:

  1. Intel® oneAPI DPC++/C++ Compiler
  2. Intel® Fortran Compiler
  3. Intel® MPI Library
  4. Intel® Inspector
  5. Intel® Trace Analyzer and Collector
  6. Intel® Cluster Checker

(Source: Get Started with the Intel® oneAPI HPC Toolkit for Windows*)

Let’s look at one of these modules, the Intel® oneAPI DPC++/C++ Compiler.

Learning Objectives

· Understand the Data Parallel C++ (DPC++) language and programming model

· Build a sample DPC++ application through hands-on lab exercises

(Source: Intel AI Academy oneAPI HPC Toolkit Course)

Before diving into DPC++, let’s understand SYCL.

SYCL

SYCL (pronounced ‘sickle’) represents an industry standardization effort that includes support for data-parallel programming for C++. It is summarized as “C++ Single-source Heterogeneous Programming for OpenCL.” The SYCL standard, like OpenCL*, is managed by the Khronos Group*.

SYCL is a cross-platform abstraction layer that builds on OpenCL. It enables code for heterogeneous processors to be written in a “single source” style using C++. This is not only useful to the programmers, but it also gives a compiler the ability to analyze and optimize across the entire program regardless of the device on which the code is to be run.

Unlike OpenCL, SYCL includes templates and lambda functions to enable higher-level application software to be cleanly coded with optimized acceleration of kernel code. Developers program at a higher level than OpenCL but always have access to lower-level code through seamless integration with OpenCL, as well as C/C++ libraries.

What is Data Parallel C++

oneAPI programs are written in Data Parallel C++ (DPC++). DPC++ takes advantage of modern C++ productivity benefits and familiar constructs, and incorporates the SYCL* standard for data parallelism and heterogeneous programming. DPC++ is a single-source language in which host code and heterogeneous accelerator kernels can be mixed in the same source file. A DPC++ program is invoked on the host computer and offloads computation to an accelerator. Programmers use familiar C++ and library constructs with added functionality, such as a queue for work targeting, a buffer for data management, and parallel_for for parallelism, to direct which parts of the computation and data should be offloaded.

DPC++ extends SYCL 1.2.1

DPC++ programs enhance productivity: simple things are simple to express, with reduced verbosity and programmer burden. They also enhance performance by giving programmers control over program execution and by enabling hardware-specific features. DPC++ is a fast-moving open collaboration feeding into the SYCL* standard, and an open source implementation, with the goal of upstreaming the LLVM and DPC++ extensions to become core SYCL* or Khronos* extensions.

HPC Single Node Workflow with oneAPI

Accelerated code can be written in either a kernel (DPC++) or directive based style. Developers can use the Intel® DPC++ Compatibility tool to perform a one-time migration from CUDA to Data Parallel C++. Existing Fortran applications can use a directive style based on OpenMP. Existing C++ applications can choose either the Kernel style or the directive based style option and existing OpenCL applications can remain in the OpenCL language or migrate to Data Parallel C++.

How to Compile & Run DPC++ program

The three main steps of compiling and running a DPC++ program are:

  1. Initialize environment variables
  2. Compile the DPC++ source code
  3. Run the application

Compiling and Running on Intel® DevCloud:

For this training, we have written a script (q) to aid developers in developing projects on DevCloud. This script submits the run.sh script to a GPU node on DevCloud for execution, waits for the job to complete, and prints the output/errors. We will use this command to run on DevCloud: ./q run.sh

Compiling and Running on a Local System:

If you have installed the Intel® oneAPI Base Toolkit on your local system, you can use the commands below to compile and run a DPC++ program:

source /opt/intel/inteloneapi/setvars.sh
dpcpp simple.cpp -o simple
./simple

Note: the run.sh script is a combination of the three steps listed above.

Hands on 1: Simple Vector Increment To Vector Add

DPC++ programs are standard C++. The program is invoked on the host computer, and offloads computation to the accelerator. You will use DPC++’s queue, buffer, device, and kernel abstractions to direct which parts of the computation and data should be offloaded.

In the program below you will use a data parallel algorithm with DPC++ to leverage the computational power of heterogeneous computers. The DPC++ platform model includes a host computer and a device. The host offloads computation to the device, which could be a GPU, an FPGA, or a multi-core CPU.

In a DPC++ program, we define a kernel, which is applied to every point in an index space. For simple programs like this one, the index space maps directly to the elements of the array. The kernel is encapsulated in a C++ lambda function.

%%writefile lab/simple-vector-incr.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

// N is set to 2 for demonstration purposes only; only the first two
// elements are initialized below, so a larger N would simply add zeros.
static const size_t N = 2;

// ############################################################
// work
void work(queue &q) {
  std::cout << "Device : "
            << q.get_device().get_info<info::device::name>()
            << "\n";
  // ### Step 1 - Inspect
  // The code presents one input vector (vector1) for which a SYCL buffer
  // is allocated. The accessor associated with vector1_buffer, set to
  // read/write, gives the kernel access to the buffer's contents.
  int vector1[N] = {10, 10};
  auto R = range(N);

  std::cout << "Input : " << vector1[0] << ", " << vector1[1] << "\n";

  // ### Step 2 - Add another input vector - vector2
  // Uncomment the following line to add input vector2
  //int vector2[N] = {20, 20};

  // ### Step 3 - Print out for vector2
  // Uncomment the following line
  //std::cout << "Input : " << vector2[0] << ", " << vector2[1] << "\n";

  buffer vector1_buffer(vector1, R);

  // ### Step 4 - Add another SYCL buffer - vector2_buffer
  // Uncomment the following line
  //buffer vector2_buffer(vector2, R);

  q.submit([&](handler &h) {
    accessor vector1_accessor(vector1_buffer, h);

    // ### Step 5 - Add an accessor for vector2_buffer
    // Uncomment the following line to add an accessor for vector2
    //accessor vector2_accessor(vector2_buffer, h, read_only);

    h.parallel_for<class test>(range<1>(N), [=](id<1> index) {
      // ### Step 6 - Replace the increment with an accumulation from
      // vector2_accessor
      // Comment the following line
      vector1_accessor[index] += 1;
      // Uncomment the following line
      //vector1_accessor[index] += vector2_accessor[index];
    });
  });
  q.wait();
  host_accessor h_a(vector1_buffer, read_only);
  std::cout << "Output : " << vector1[0] << ", " << vector1[1] << "\n";
}

// ############################################################
// entry point for the program
int main() {
  try {
    queue q;
    work(q);
  } catch (exception const &e) {
    std::cerr << "Exception: " << e.what() << "\n";
    std::terminate();
  } catch (...) {
    std::cerr << "Unknown exception" << "\n";
    std::terminate();
  }
}
#Build and Run
! chmod 755 q; chmod 755 run_simple-vector-incr.sh; if [ -x "$(command -v qsub)" ]; then ./q run_simple-vector-incr.sh; else ./run_simple-vector-incr.sh; fi
#Below is the output:
Job has been submitted to Intel(R) DevCloud and will execute soon.
If you do not see result in 60 seconds, please restart the Jupyter kernel:
Kernel -> 'Restart Kernel and Clear All Outputs...' and then try again

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1877630.v-qsvr-1          ...ub-singleuser u148727         00:00:13 R jupyterhub
1877663.v-qsvr-1          ...ector-incr.sh u148727                0 Q batch

Waiting for Output ... Done

########################################################################
#      Date: Wed 06 Apr 2022 01:28:37 AM PDT
#    Job ID: 1877663.v-qsvr-1.aidevcloud
#      User: u148727
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################

## u148727 is compiling DPCPP_Essentials Module1 -- oneAPI Intro sample - 2 of 2 simple-vector-incr.cpp
Device : Intel(R) UHD Graphics P630 [0x3e96]
Input : 10, 10
Output : 11, 11

########################################################################
# End of output for job 1877663.v-qsvr-1.aidevcloud
# Date: Wed 06 Apr 2022 01:28:59 AM PDT
########################################################################

Job Completed in 36 seconds.

Hands on 2: Complex Number Multiplication

In this hands-on, we compute the multiplication of two complex numbers, learn how to use a custom device selector to target a GPU or CPU from a specific vendor, and then pass a vector of custom Complex class objects to a parallel function.

%%writefile lab/complex_mult.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <iomanip>
#include <vector>
// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities/<version>/include/dpc_common.hpp
#include "dpc_common.hpp"
#include "Complex.hpp"

using namespace sycl;
using namespace std;

// Number of complex numbers passed to the DPC++ code
static const int num_elements = 10000;

class CustomDeviceSelector : public device_selector {
 public:
  CustomDeviceSelector(std::string vendorName) : vendorName_(vendorName) {}
  int operator()(const device &dev) const override {
    int device_rating = 0;
    // We query for a GPU device from the given vendor and give it the
    // highest rating, 3. The second preference (rating 2) is any GPU
    // device, and the third preference (rating 1) is a CPU device.
    //**************Step 1: Uncomment the following lines that set the rating for the devices********
    /*if (dev.is_gpu() && (dev.get_info<info::device::name>().find(vendorName_) !=
                         std::string::npos))
      device_rating = 3;
    else if (dev.is_gpu())
      device_rating = 2;
    else if (dev.is_cpu())
      device_rating = 1;*/
    return device_rating;
  }

 private:
  std::string vendorName_;
};

// in_vect1 and in_vect2 are the vectors with num_elements complex numbers
// and are inputs to the parallel function
void DpcppParallel(queue &q, std::vector<Complex2> &in_vect1,
                   std::vector<Complex2> &in_vect2,
                   std::vector<Complex2> &out_vect) {
  auto R = range(in_vect1.size());
  if (in_vect2.size() != in_vect1.size() ||
      out_vect.size() != in_vect1.size()) {
    std::cout << "ERROR: Vector sizes do not match" << "\n";
    return;
  }
  // Set up input buffers
  buffer bufin_vect1(in_vect1);
  buffer bufin_vect2(in_vect2);
  // Set up the output buffer
  buffer bufout_vect(out_vect);
  std::cout << "Target Device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  // Submit a command group function object to the queue
  q.submit([&](auto &h) {
    // Accessors set to read mode
    accessor V1(bufin_vect1, h, read_only);
    accessor V2(bufin_vect2, h, read_only);
    // Accessor set to write mode
    //**************Step 2: Uncomment the line below to set the write accessor********************
    //accessor V3(bufout_vect, h, write_only);
    h.parallel_for(R, [=](auto i) {
      //**************Step 3: Uncomment the line below to call the complex_mul
      //function that computes the multiplication of the complex numbers********************
      //V3[i] = V1[i].complex_mul(V2[i]);
    });
  });
  q.wait_and_throw();
}

void DpcppScalar(std::vector<Complex2> &in_vect1,
                 std::vector<Complex2> &in_vect2,
                 std::vector<Complex2> &out_vect) {
  if ((in_vect2.size() != in_vect1.size()) ||
      (out_vect.size() != in_vect1.size())) {
    std::cout << "ERROR: Vector sizes do not match" << "\n";
    return;
  }
  for (size_t i = 0; i < in_vect1.size(); i++) {
    out_vect[i] = in_vect1[i].complex_mul(in_vect2[i]);
  }
}

// Compare the results of the two output vectors from parallel and scalar.
// They should be equal.
int Compare(std::vector<Complex2> &v1, std::vector<Complex2> &v2) {
  int ret_code = 1;
  if (v1.size() != v2.size()) {
    ret_code = -1;
  }
  for (size_t i = 0; i < v1.size(); i++) {
    if (v1[i] != v2[i]) {
      ret_code = -1;
      break;
    }
  }
  return ret_code;
}

int main() {
  // Declare your input and output vectors of the Complex2 class
  vector<Complex2> input_vect1;
  vector<Complex2> input_vect2;
  vector<Complex2> out_vect_parallel;
  vector<Complex2> out_vect_scalar;

  // Initialize your input and output vectors. Inputs are initialized as
  // below; outputs are initialized with 0.
  for (int i = 0; i < num_elements; i++) {
    input_vect1.push_back(Complex2(i + 2, i + 4));
    input_vect2.push_back(Complex2(i + 4, i + 6));
    out_vect_parallel.push_back(Complex2(0, 0));
    out_vect_scalar.push_back(Complex2(0, 0));
  }

  try {
    // Pass in the name of the vendor whose device you want to query
    std::string vendor_name = "Intel";
    // std::string vendor_name = "AMD";
    // std::string vendor_name = "Nvidia";
    // The queue constructor is passed an exception handler
    CustomDeviceSelector selector(vendor_name);
    queue q(selector, dpc_common::exception_handler);
    // Call DpcppParallel with the required inputs and outputs
    DpcppParallel(q, input_vect1, input_vect2, out_vect_parallel);
  } catch (...) {
    // Some other exception detected
    std::cout << "Failure" << "\n";
    std::terminate();
  }

  std::cout
      << "****************************************Multiplying Complex numbers "
         "in Parallel********************************************************"
      << "\n";
  // Print the outputs of the parallel function
  int indices[]{0, 1, 2, 3, 4, (num_elements - 1)};
  constexpr size_t indices_size = sizeof(indices) / sizeof(int);
  for (size_t i = 0; i < indices_size; i++) {
    int j = indices[i];
    if (i == indices_size - 1) std::cout << "...\n";
    std::cout << "[" << j << "] " << input_vect1[j] << " * " << input_vect2[j]
              << " = " << out_vect_parallel[j] << "\n";
  }
  // Call the DpcppScalar function with the required inputs and outputs
  DpcppScalar(input_vect1, input_vect2, out_vect_scalar);
  // Compare the outputs from the parallel and the scalar functions. They
  // should be equal.
  int ret_code = Compare(out_vect_parallel, out_vect_scalar);
  if (ret_code == 1) {
    std::cout << "Complex multiplication successfully run on the device"
              << "\n";
  } else {
    std::cout
        << "*********************************************Verification Failed. "
           "Results are not matched**************************"
        << "\n";
  }
  return 0;
}
#Build and Run
! chmod 755 q; chmod 755 run_complex_mult.sh; if [ -x "$(command -v qsub)" ]; then ./q run_complex_mult.sh; else ./run_complex_mult.sh; fi
#Below is the output:
Job has been submitted to Intel(R) DevCloud and will execute soon.
If you do not see result in 60 seconds, please restart the Jupyter kernel:
Kernel -> 'Restart Kernel and Clear All Outputs...' and then try again

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1877630.v-qsvr-1          ...ub-singleuser u148727         00:00:20 R jupyterhub
1877675.v-qsvr-1          ...mplex_mult.sh u148727                0 Q batch

Waiting for Output ... Done

########################################################################
#      Date: Wed 06 Apr 2022 01:43:04 AM PDT
#    Job ID: 1877675.v-qsvr-1.aidevcloud
#      User: u148727
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################

## u148727 is compiling DPCPP_Essentials Module2 -- DPCPP Program Structure sample - 6 of 6 complex_mult.cpp
rm -rf bin/complex_mult
dpcpp lab/complex_mult.cpp -g -o bin/complex_mult -Isrc/ -lOpenCL -lsycl -g
bin/complex_mult
Target Device: Intel(R) UHD Graphics P630 [0x3e96]
****************************************Multiplying Complex numbers in Parallel********************************************************
[0] (2 : 4i) * (4 : 6i) = (0 : 0i)
[1] (3 : 5i) * (5 : 7i) = (0 : 0i)
[2] (4 : 6i) * (6 : 8i) = (0 : 0i)
[3] (5 : 7i) * (7 : 9i) = (0 : 0i)
[4] (6 : 8i) * (8 : 10i) = (0 : 0i)
...
[9999] (10001 : 10003i) * (10003 : 10005i) = (0 : 0i)
*********************************************Verification Failed. Results are not matched**************************

########################################################################
# End of output for job 1877675.v-qsvr-1.aidevcloud
# Date: Wed 06 Apr 2022 01:43:24 AM PDT
########################################################################

Job Completed in 54 seconds.

Note that every parallel result prints as (0 : 0i) and verification fails because Steps 1-3 in the source above are still commented out; once you uncomment them, the parallel results match the scalar reference and the success message is printed.


Additional Resources:

Intel® oneAPI HPC Toolkit Resources

  1. Intel® oneAPI HPC Toolkit
  2. Get the Intel® oneAPI HPC Toolkit

Intel® oneAPI HPC Toolkit Training Modules

  1. https://tinyurl.com/2p96tufv
  2. https://tinyurl.com/y4d99wtc

Article References

  1. AI & HPC Everywhere

https://tinyurl.com/ztas4m7w
