Share Datasets on the cloud using Python

DocArray is a data structure that not only lets you pre-process data of any kind but also allows sharing of processed data on the cloud for efficient collaboration within data teams!

Shubham Saboo
Jina AI
4 min read · Jun 15, 2022

Remote collaboration within data teams made easy with DocArray!

Background

One day, an employee at Jina AI was working in a remote GPU environment, Google Colab. While creating and manipulating a DocumentArray, they hit an error they could not resolve on their own. With outside help, they traced the error to missing environment dependencies on the remote machine. They then moved development to their local system, which was already configured to work with Jina.

This is a common and frustrating situation for almost anyone working in a remote environment. First, it is hard to tell what environment is actually running in the backend. Second, all kinds of issues and bugs may need attention, and giving others access to the entire file is not a particularly safe way to get help.

Overview

We all know about Jina’s DocumentArray, which stores Document objects. It behaves much like a Python list: you can construct, delete, insert, sort, and traverse a DocumentArray object. This blog walks you through a special feature of DocumentArray: push and pull.

DocumentArray can import and export data in several formats, not just one. For example, to work with JSON you might use .to_json() to export the data, and anyone who needs to read that JSON file can use .load_json() to import it.

Similarly, Jina allows data to be imported from and exported to remote cloud storage. The method for exporting data to the cloud is push(), and the method for importing it is pull(). Together they let users share a DocumentArray object across machines anywhere in the world.

DocumentArray Push/Pull Example

Let’s look at a simple example of push and pull in action. Jack wants to create a DocumentArray, do some pre-processing on his side, and share it with his colleague Janice, who lives in another part of the world. So Jack creates a DocumentArray object da, applies the pre-processing logic, and uses the push() method to store da on a remote cloud machine under a unique key ID.

He uses the following code to do that 👉

Push method for exporting DocumentArray
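
The original screenshot of Jack’s code is not reproduced here, so the following is only a minimal sketch of what it might look like. It assumes a 0.x release of docarray in which DocumentArray exposes push(). The helper name push_dataset, the lowercasing step, and the key 'jack-demo-2022' are illustrative placeholders, not part of the library. The key is passed positionally because the keyword name may differ between releases.

```python
def push_dataset(da, key):
    """Pre-process every Document in `da`, then upload the array under `key`."""
    for doc in da:
        # Placeholder pre-processing (an assumption): lowercase any text content.
        if doc.text:
            doc.text = doc.text.lower()
    # Upload to the cloud; anyone who knows `key` can pull the array later.
    da.push(key)
    return key


# Usage on Jack's machine (needs `pip install docarray` and network access):
#   from docarray import Document, DocumentArray
#   da = DocumentArray([Document(text='Hello'), Document(text='World')])
#   push_dataset(da, 'jack-demo-2022')
```

Because the key is the only thing needed to fetch the data, it is worth choosing one that is hard to guess.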

Now Janice wants to use the DocumentArray object that Jack created. To do that, she needs the unique key ID associated with that object. Once she has the key, she can use the pull() method to fetch the DocumentArray object from the cloud into her local system.

She uses the following code to do so 👉

Pull method for importing DocumentArray
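
The original screenshot of Janice’s code is not reproduced here either, so this is a minimal sketch under the same assumption: a 0.x release of docarray in which DocumentArray exposes pull() as a class method. The helper name pull_dataset and the key 'jack-demo-2022' are illustrative placeholders. The class is passed in as an argument so the sketch does not depend on a particular installed version.

```python
def pull_dataset(array_cls, key):
    """Fetch the array stored under `key` and return it."""
    # pull() is called on the class and builds a new array from the cloud copy.
    da = array_cls.pull(key)
    print(f'pulled {len(da)} documents stored under key {key!r}')
    return da


# Usage on Janice's machine (needs `pip install docarray` and network access):
#   from docarray import DocumentArray
#   da = pull_dataset(DocumentArray, 'jack-demo-2022')
```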

DocumentArray Push Flow

When a user pushes the data from their local system to the server, the following process takes place:

The data is sent to the Jina API along with the user token. The Jina API server then verifies the following:

  • The S3 address of the request
  • The expiration of the request
  • The token and its validity
  • The size of the data and the creation time of the request
  • Metadata such as the Jina version

If verification succeeds, the server responds with a success message. Otherwise, it responds with a failure message.

Note: DocumentArray storage is temporary and is deleted automatically seven days after the token is created. Also, pushing again with the same token overwrites the existing data.
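
To make the push flow concrete, here is a toy in-memory model of it. This is not Jina’s server code: the class and field names, the five-minute request lifetime, and the 10 MB size limit are all assumptions made for illustration. Only the seven-day retention and the overwrite-on-same-token behavior come from the description above, and the model assumes that overwriting restarts the seven-day clock.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)        # from the note above
REQUEST_TTL = timedelta(minutes=5)   # assumed value
MAX_BYTES = 10 * 1024 * 1024         # assumed value


@dataclass
class PushRequest:
    token: str
    payload: bytes
    created_at: datetime


class ToyStore:
    """In-memory stand-in for the cloud storage behind push()."""

    def __init__(self):
        self._items = {}  # token -> (payload, stored_at)

    def push(self, req, now):
        # Verification steps, loosely mirroring the list above.
        if not req.token:
            return 'failure: missing token'
        if now - req.created_at > REQUEST_TTL:
            return 'failure: request expired'
        if len(req.payload) > MAX_BYTES:
            return 'failure: payload too large'
        # Pushing with an existing token replaces the stored data.
        self._items[req.token] = (req.payload, now)
        return 'success'

    def get(self, token, now):
        item = self._items.get(token)
        if item is None:
            return None
        payload, stored_at = item
        if now - stored_at >= RETENTION:
            # Data older than the retention window is dropped.
            del self._items[token]
            return None
        return payload
```

Passing the current time in explicitly keeps the model deterministic, which makes the expiry and overwrite rules easy to try out by hand.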

DocumentArray Pull Flow

When a user tries to access a DocumentArray from the cloud, the following process takes place:

  • A request is made, and the token is sent to the Jina API.
  • The server verifies the request and stores the download time and the metadata. Upon successful verification, it sends back a response containing a URL that can be used to download the data.
  • Once a GET request is made on that URL, the requested data is sent from the S3 server to the user’s machine.
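
The three steps above can be sketched as a toy in-memory model. This is not Jina’s server code: the class name, the method names, and the storage.example.invalid URL are assumptions made for illustration only.

```python
import secrets
from datetime import datetime


class ToyPullServer:
    """In-memory stand-in for the API server and the storage behind pull()."""

    def __init__(self, stored):
        self._stored = dict(stored)  # token -> payload already in storage
        self._urls = {}              # download URL -> token
        self.download_log = []       # (token, requested_at) pairs

    def request_download(self, token, now):
        # Steps 1 and 2: check the token, record the download time,
        # and hand back a URL for the actual transfer.
        if token not in self._stored:
            return None
        self.download_log.append((token, now))
        url = f'https://storage.example.invalid/{secrets.token_hex(8)}'
        self._urls[url] = token
        return url

    def get(self, url):
        # Step 3: a GET on the URL returns the stored data.
        token = self._urls.get(url)
        return None if token is None else self._stored[token]


server = ToyPullServer({'jack-demo-2022': b'serialized DocumentArray'})
url = server.request_download('jack-demo-2022', datetime(2022, 6, 15))
data = server.get(url)
```

Splitting the exchange into a small API call and a separate download keeps the large transfer between the storage service and the user’s machine.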

Summary

What you saw above is how collaboration looks with Jina’s search framework. It lets you work with teammates without worrying about how to share a piece of code or logic safely and securely. With DocumentArray’s new push and pull methods, you can work efficiently with data stored on the cloud by transferring it to your local system.

Learning Resources

Venture into the exciting world of Neural Search with Jina’s Learning Bootcamp. Get certified and be a part of Jina’s Hall of Fame! 🏆

Stay tuned for more exciting updates on the upcoming products and features from Jina AI! 👋
