Share Datasets on the cloud using Python
DocArray is a data structure that not only lets you pre-process data of any kind but also allows sharing of processed data on the cloud for efficient collaboration within data teams!
Background
One day, an employee at Jina AI was working on a remote GPU machine, in this case Google Colab. When they tried to create a DocumentArray and manipulate it, they hit an error they could not solve on their own. With outside help, they found that the error was caused by missing environment dependencies on the remote machine. They then decided to move development to their local system, which was already configured to work with Jina.
The above paragraph describes a very common and challenging situation faced by almost everybody who works in a remote environment. Firstly, it is difficult to determine what environment is running in the backend. Secondly, different kinds of issues and bugs may need outside help, and giving others access to the entire file is not a safe way to get it.
Overview
We all know about Jina’s DocumentArray, which is used to store Document objects. It is similar to Python’s list implementation: you can construct, delete, insert, sort, and traverse a DocumentArray object. This blog will walk you through a very special feature of DocumentArray: push and pull.
DocumentArray is capable of importing and exporting data. This is not limited to a single format but spans various formats. For example, if you want to work with a JSON file, you might use .to_json() for exporting data into JSON. And if somebody needs to read that JSON file, they can just use .load_json() for importing the data.
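To make the export/import idea concrete, here is a minimal round trip through a JSON file using only Python's standard library. It is a sketch of the concept, not DocArray's implementation: plain dictionaries stand in for Document objects, and the helper names `export_json` and `import_json` are made up for this illustration.

```python
import json
import os
import tempfile

# Plain dictionaries stand in for Document objects (an assumption of this sketch).
docs = [
    {"id": "doc-1", "text": "hello"},
    {"id": "doc-2", "text": "world"},
]


def export_json(documents, path):
    """Write the documents to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(documents, f)


def import_json(path):
    """Read the documents back from a JSON file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Export to a temporary file, then import it again as a colleague would.
path = os.path.join(tempfile.mkdtemp(), "docs.json")
export_json(docs, path)
restored = import_json(path)
```

The restored list is equal to the original but is a separate object, which is exactly what you want when handing data to someone else.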
Similarly, Jina allows the import and export of data to and from remote cloud storage. The method for exporting data to the cloud is push(), and the method for importing data is pull(). This lets users share a DocumentArray object across machines from anywhere in the world.
DocumentArray Push/Pull Example
Let’s look at a simple example of push and pull in action. Jack wants to create a DocumentArray, do some pre-processing on his side, and share it with his colleague Janice, who lives in another part of the world. So Jack creates a DocumentArray object da, applies the pre-processing logic, and pushes da with a unique key ID, using the push() method, to store it on a remote cloud machine.
On his side, this comes down to calling push() on da with the key.
Now, Janice wants to use the same DocumentArray object that her colleague Jack created. To do that, she needs to know the unique key ID associated with that object. Once she has the key, she can use the pull() method to fetch the DocumentArray object from the cloud into her local system.
On her side, this comes down to calling pull() with the same key.
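As a rough, library-free sketch of this exchange, the Python below models the cloud as an in-memory dictionary. `FakeCloud`, its `push`/`pull` methods, and the key `"jack-demo-key"` are hypothetical stand-ins chosen for illustration, not DocArray's API; with DocArray itself, Jack would call push() on his DocumentArray and Janice would call pull() with the same key, as described above.

```python
import copy


class FakeCloud:
    """Hypothetical in-memory stand-in for the remote storage."""

    def __init__(self):
        self._store = {}

    def push(self, key, documents):
        # Store a copy so later local changes do not leak into the "cloud".
        self._store[key] = copy.deepcopy(documents)

    def pull(self, key):
        # Raises KeyError if nothing was pushed under this key.
        return copy.deepcopy(self._store[key])


cloud = FakeCloud()

# Jack: build the data, pre-process it, and push it under a unique key.
jack_da = [{"text": "  Hello World "}, {"text": "PUSH and pull"}]
for doc in jack_da:
    doc["text"] = doc["text"].strip().lower()
cloud.push("jack-demo-key", jack_da)

# Janice: knows only the key, and pulls her own copy.
janice_da = cloud.pull("jack-demo-key")
print(janice_da)
```

The point of the sketch is that the key is the only thing the two colleagues need to exchange; the data itself travels through the shared store.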
DocumentArray Push Flow
When a user pushes the data from their local system to the server, the following process takes place:
The data, along with the user token, is sent to the Jina API. The Jina API server then verifies the following:
- S3 address of the request
- The expiration of the request made
- The token and its validity
- The size of the data and the creation time of the request
- Metadata such as the Jina version
If verification succeeds, the server responds with a success message. Otherwise, it responds with a failure message.
Note: DocumentArray storage is temporary; the data is deleted automatically seven days after the token is created. Also, pushing with the same token overwrites the existing data.
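The checks and the note above can be sketched as a small server-side function. This is only an illustration under stated assumptions, not the Jina API: the names `handle_push`, `token_created`, and `MAX_BYTES`, the size limit, the metadata key, and the exact failure messages are all hypothetical, and the S3 address check is left out.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)   # lifetime described in the note above
MAX_BYTES = 1_000_000           # hypothetical size limit

# Hypothetical registry: token -> time the token was created.
token_created = {"team-token": datetime(2022, 1, 1)}
storage = {}  # token -> payload bytes


def handle_push(token, payload, metadata, now):
    """Run the checks and store the payload; return a status message."""
    created = token_created.get(token)
    if created is None:
        return "failure: unknown token"
    if now >= created + RETENTION:
        storage.pop(token, None)  # expired data is dropped
        return "failure: token expired"
    if len(payload) > MAX_BYTES:
        return "failure: payload too large"
    if "jina_version" not in metadata:
        return "failure: missing metadata"
    storage[token] = payload  # the same token overwrites earlier data
    return "success"


meta = {"jina_version": "x.y.z"}  # placeholder version string
day2 = datetime(2022, 1, 2)

first = handle_push("team-token", b"old data", meta, day2)
second = handle_push("team-token", b"new data", meta, day2)
stored_after_second = storage["team-token"]

# Eight days after the token was created, the push is rejected.
late = handle_push("team-token", b"too late", meta, datetime(2022, 1, 9))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the expiry behaviour easy to exercise in a test.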
DocumentArray Pull Flow
When a user tries to access a DocumentArray from the cloud, the following process takes place:
- A request is made, and the token is sent to the Jina API
- The server verifies and stores the download time and the metadata. Upon successful verification, a response is sent back in the form of a URL. This URL can be used for downloading the data.
- Once a GET request is made on that URL, the requested data is sent from the S3 server to the user’s machine.
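The three steps above can be sketched as two small functions: one that plays the API server and hands out a download URL, and one that plays the storage server answering a GET. Everything here is a hypothetical stand-in for illustration: the URL under `storage.example`, the dictionaries, and the function names are not part of Jina or S3.

```python
from datetime import datetime

# Hypothetical stand-ins: an object store keyed by URL, and a token index.
object_store = {"https://storage.example/da/abc123": b"serialized DocumentArray"}
token_to_url = {"team-token": "https://storage.example/da/abc123"}
download_log = []  # (token, time, metadata) records kept by the server


def request_download(token, metadata, now):
    """Steps 1 and 2: verify the token, log the request, return a URL."""
    url = token_to_url.get(token)
    if url is None:
        raise KeyError(f"unknown token: {token}")
    download_log.append((token, now, metadata))
    return url


def http_get(url):
    """Step 3: a GET on the URL returns the stored bytes."""
    return object_store[url]


url = request_download(
    "team-token", {"jina_version": "x.y.z"}, datetime(2022, 1, 2)
)
data = http_get(url)
```

Splitting the flow this way mirrors the description: the API server only verifies and records, while the bytes themselves come straight from storage.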
Summary
What you saw above is the story of how collaboration looks with Jina’s search framework. It lets you work in a collaborative environment without worrying about how to share a piece of code or logic safely and securely. With DocumentArray’s new push and pull methods, you can efficiently work with data stored on the cloud by transferring it to the local system.
Learning Resources
Venture into the exciting world of Neural Search with Jina’s Learning Bootcamp. Get certified and be a part of Jina’s Hall of Fame! 🏆
Stay tuned for more exciting updates on the upcoming products and features from Jina AI! 👋