Seamless Integration of Vertex AI and Dataproc with Custom Notebooks in a React Webapp

An Integrated Environment for Data Upload, Spark Job Execution, and Error Diagnostics

Ankit Royal
Techsalo Infotech
5 min read · Jul 12, 2024


Working effectively with data is essential in data science and machine learning. With that in mind, Techsalo Infotech has created a React application that combines Google’s Vertex AI Assistant and Dataproc. The goal of this initiative is to offer data scientists an easy, engaging, and efficient environment in which they can do everything from uploading files and running Spark jobs to diagnosing problems in their code. In this article I highlight some practical aspects of our application.

Key Features of the Application

  1. Vertex AI Assistant Integration: This AI assistant is fine-tuned specifically for Spark code, enabling it to assist users in generating, debugging, and understanding Spark jobs.
  2. Jupyter Notebook-like Code Editor: A familiar and intuitive code editor environment that allows users to write, edit, and run code.
  3. Google Cloud Storage (GCS) Data Upload: Users can upload their datasets directly to GCS, making them readily accessible for processing.
  4. Dataproc Cluster Job Execution: The application can run Spark jobs on a Dataproc cluster, with results displayed in the frontend.
  5. Error Diagnosis and Assistance: The diagnose button sends error information to the AI assistant, which helps in understanding and resolving code issues.

For this project, we have utilized the following technologies:

  • Google Cloud Platform (GCP): Essential for leveraging powerful cloud services.
  • Vertex AI and Dataproc APIs: Enabled on GCP to provide advanced AI capabilities and manage Spark jobs efficiently.
  • React Development Environment: Used React to create a dynamic and responsive user interface.

Step-by-Step Guide to the Application

Application interface

The interface of our application is intuitive and user-friendly, making it easy to navigate through its various features. The figure illustrates the basic architecture of our application.

Uploading Data

Start by uploading your dataset to GCS via the application interface:

  1. Access the Upload Section: Navigate to the data upload section from the main dashboard.
  2. Select File: Click the upload button to choose the dataset file from local storage.
  3. Upload to GCS: Confirm the upload, and the file will be securely transferred to a Google Cloud Storage (GCS) bucket (a minimal server-side sketch of this step follows below).
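Under the hood, the backend can use the google-cloud-storage client to perform the transfer. Here is a minimal sketch of that step; the bucket name, paths, and helper name are placeholders rather than the application’s actual code.

```python
from google.cloud import storage

def upload_to_gcs(bucket_name: str, local_path: str, destination_blob: str) -> str:
    """Upload a local file to a GCS bucket and return its gs:// URI."""
    client = storage.Client()  # uses application default credentials
    blob = client.bucket(bucket_name).blob(destination_blob)
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{destination_blob}"

# Hypothetical usage from the upload endpoint:
# uri = upload_to_gcs("my-datasets-bucket", "/tmp/sales.csv", "raw/sales.csv")
```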

Browsing Data

Select your dataset from the GCS bucket via the application interface (a minimal listing sketch follows these steps):

  1. View Uploaded Data: Once your data is uploaded, you can browse the available datasets directly from the application interface.
  2. Select Dataset: Choose the specific dataset you want to work with for your Spark job.
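Behind the browse view, the backend can simply list the objects in the bucket and return their names to the React frontend. A minimal sketch, again using the google-cloud-storage client with a placeholder bucket name:

```python
from google.cloud import storage

def list_datasets(bucket_name: str, prefix: str = "") -> list[str]:
    """Return object names in the bucket so the UI can show them for selection."""
    client = storage.Client()
    return [blob.name for blob in client.list_blobs(bucket_name, prefix=prefix)]

# Hypothetical usage:
# datasets = list_datasets("my-datasets-bucket", prefix="raw/")
```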

Writing Code

Use the code editor to write your Spark job (an example job is sketched after these steps):

  1. Code Editor: Navigate to the code editor section, designed to mimic the Jupyter Notebook environment.
  2. Write Code: Enter your Spark job code in the editor, leveraging syntax highlighting for efficiency.
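For illustration, here is the kind of PySpark job a user might write in the editor. The GCS path is a placeholder and the job itself is just a word count, not a prescribed template:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()

# Read a dataset previously uploaded to GCS (placeholder path).
lines = spark.read.text("gs://my-datasets-bucket/raw/notes.txt")

counts = (
    lines.rdd
    .flatMap(lambda row: row.value.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

for word, count in counts.take(20):
    print(word, count)

spark.stop()
```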

Generating Code

Leverage the AI assistant to generate Spark code snippets and transfer them to the editor for execution (a minimal generation sketch follows these steps):

  1. Interact with AI Assistant: Ask the Vertex AI assistant to generate specific Spark code snippets based on your requirements.
  2. Insert Code: Transfer the generated code directly into the editor with a single click, allowing you to build complex jobs faster.
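One way to call the assistant from the backend is through the Vertex AI Python SDK. The sketch below uses a generic Gemini model name as a stand-in for the Spark-tuned assistant; the project ID, region, and model name are assumptions:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project-id", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.5-pro")  # stand-in for the tuned Spark assistant

response = model.generate_content(
    "Write a PySpark job that reads gs://my-datasets-bucket/raw/sales.csv "
    "and computes total revenue per region."
)
print(response.text)  # the generated snippet the UI can insert into the editor
```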

Running Jobs

Submit your job to the Dataproc cluster directly from the application and view the results in real time (a submission sketch follows these steps):

  1. Configure Job Settings: Specify the necessary job parameters, such as cluster name and configurations.
  2. Run Job: Submit the job to the Dataproc cluster using the “Run” button.
  3. View Results: Monitor the job’s progress in real time and view the output directly in the application once the job is complete.
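Job submission can be handled with the google-cloud-dataproc client library. A minimal sketch, assuming a PySpark job whose main file has already been staged to GCS; the project, region, cluster, and file URI are placeholders:

```python
from google.cloud import dataproc_v1

def submit_pyspark_job(project_id: str, region: str, cluster_name: str, main_py_uri: str) -> str:
    """Submit a PySpark job to a Dataproc cluster, wait for it, and return the driver output URI."""
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": main_py_uri},
    }
    operation = client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    finished_job = operation.result()  # blocks until the job completes or fails
    return finished_job.driver_output_resource_uri

# Hypothetical usage:
# uri = submit_pyspark_job("my-project-id", "us-central1", "my-cluster",
#                          "gs://my-datasets-bucket/jobs/word_count.py")
```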

Results from the Dataproc Cluster

  1. Fetch Results: Once the job completes, the results are automatically fetched from the Dataproc cluster (a fetch sketch follows these steps).
  2. Display Output: The output is displayed in the application interface, allowing you to analyze the results immediately.
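Dataproc writes the driver output to GCS, so the backend can read it back from the URI returned after submission. A minimal sketch, assuming the chunked driver-output layout Dataproc normally uses:

```python
from google.cloud import storage

def fetch_driver_output(driver_output_uri: str) -> str:
    """Concatenate the driver output chunks Dataproc wrote under the given gs:// prefix."""
    bucket_name, prefix = driver_output_uri.removeprefix("gs://").split("/", 1)
    client = storage.Client()
    return "".join(
        blob.download_as_text() for blob in client.list_blobs(bucket_name, prefix=prefix)
    )
```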

Diagnosing Errors

If your job fails, use the diagnose button to understand and fix the error with the help of the AI assistant (a minimal diagnosis sketch follows these steps):

  1. Identify Error: When a job fails, an error log is generated.
  2. Diagnose Button: Click the diagnose button to send the error log to the AI assistant.
  3. Receive Assistance: The AI assistant analyzes the error and provides a detailed explanation and potential fixes.
  4. Implement Fixes: Apply the suggested fixes and re-run the job to ensure successful execution.
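The diagnose flow can reuse the same Vertex AI call, this time sending the failed code and its error log together. A sketch, again with a generic model name standing in for the tuned assistant and placeholder project settings:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

def diagnose_error(job_code: str, error_log: str) -> str:
    """Ask the assistant to explain a failed Spark job and suggest fixes."""
    vertexai.init(project="my-project-id", location="us-central1")  # placeholders
    model = GenerativeModel("gemini-1.5-pro")  # stand-in for the tuned Spark assistant
    prompt = (
        "This Spark job failed on Dataproc.\n\n"
        f"Code:\n{job_code}\n\n"
        f"Error log:\n{error_log}\n\n"
        "Explain the likely cause and suggest a concrete fix."
    )
    return model.generate_content(prompt).text
```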

Conclusion

Our React application, seamlessly integrated with Vertex AI and Dataproc, provides a powerful and user-friendly environment for data scientists. It simplifies data handling, code execution, and error diagnosis, making it an invaluable tool for anyone working with large datasets and Spark jobs. Whether you’re an experienced data scientist or just starting, this application can significantly boost your productivity and efficiency.

NOTE: We at Techsalo Infotech are a team of engineers solving complex data engineering and machine learning problems. Please reach out to us at sales@techsalo.com for any query on how to build these systems at scale and in the cloud.
