Share Datasets on the cloud using Python
DocArray is a data structure that not only lets you pre-process data of any kind but also allows sharing of processed data on the cloud for efficient collaboration within data teams!
One day, an employee at Jina AI was working in a remote GPU environment, Google Colab. While creating a DocumentArray and manipulating it, they hit an error they could not resolve on their own. With outside help, they found that the error was caused by missing environment dependencies on the remote machine. They then decided to move development to their local system, which was already configured to work with Jina.
This situation is common and challenging for anyone who works in a remote environment. First, it is difficult to determine what environment is running in the backend. Second, different kinds of issues and bugs may need to be addressed, and giving others access to the entire file is not a safe way to address them.
You may already know Jina's DocumentArray, the container used to store Document objects. It behaves much like a Python list: you can construct, delete, insert, sort, and traverse a DocumentArray object. This blog will walk you through a very special feature of DocumentArray.
DocumentArray is capable of importing and exporting data in various formats, not just one. For example, to work with a JSON file, you can export data with .to_json(), and anyone who needs to read that JSON file can import it with .load_json().
Similarly, Jina allows the import and export of data to and from remote cloud storage. The method for exporting data to the cloud is push(), and the method for importing it is pull(). Together they let users share a DocumentArray object across machines from anywhere in the world.
DocumentArray Push/Pull Example
Let’s look at a simple example of push and pull in action. Jack wants to create a DocumentArray, do some pre-processing on his side, and share it with his colleague Janice, who lives in another part of the world. So Jack creates a DocumentArray named da, applies the pre-processing logic, and pushes the da object with a unique key ID to store it on a remote cloud machine using the push() method.
He uses the following code to do that 👉
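A minimal sketch of what Jack's side could look like, assuming the docarray 0.x package is installed and the machine can reach the cloud (depending on the version, authentication may also be needed). The preprocess helper, the sample texts, and the key are hypothetical; only Document, DocumentArray, and push() come from the library.

```python
def preprocess(texts):
    """Hypothetical pre-processing: strip whitespace, lowercase, drop empties."""
    cleaned = (t.strip().lower() for t in texts)
    return [t for t in cleaned if t]


def push_dataset(texts, key):
    """Build a DocumentArray from pre-processed texts and push it under `key`."""
    # Imported lazily so preprocess() can be used without DocArray installed.
    # Assumption: the docarray 0.x API, where DocumentArray and push() exist.
    from docarray import Document, DocumentArray

    da = DocumentArray([Document(text=t) for t in preprocess(texts)])
    da.push(key)  # uploads the array under the shared key; needs network access
    return da
```

Jack would call something like push_dataset(["  Hello ", "World"], "jack-demo-key") and then send the key to Janice.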
Now, Janice wants to use the same DocumentArray object that Jack created. To do that, she needs to know the unique key ID associated with it. Once she has the key, she can use the pull() method to fetch the DocumentArray object from the cloud into her local system.
She uses the following code to do so 👉
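A minimal sketch of Janice's side, under the same assumptions (docarray 0.x installed, network access, and a key that Jack has already pushed). The function names and the key are hypothetical; pull() is called on the DocumentArray class with the key as its first argument.

```python
def pull_dataset(key):
    """Fetch the DocumentArray stored under `key` from the cloud."""
    # Assumption: the docarray 0.x API, where DocumentArray.pull is a classmethod.
    from docarray import DocumentArray

    return DocumentArray.pull(key)  # downloads and rebuilds the array locally


def texts_of(docs):
    """Collect the text of every Document (any iterable of objects with .text)."""
    return [d.text for d in docs]
```

With the key in hand, Janice could run texts_of(pull_dataset("jack-demo-key")) to inspect what Jack shared.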
DocumentArray Push Flow
When a user pushes the data from their local system to the server, the following process takes place:
The data, along with the user token, is sent to the Jina API. The Jina API server then verifies the following:
- S3 address of the request
- The expiration of the request made
- The token and its validity
- The size of the data and time of the creation of the request
- Metadata, such as the Jina version
If verification succeeds, the server responds with a success message; otherwise, it responds with a failure message.
DocumentArray storage is temporary: the data is deleted automatically seven days after the token is created. Also, pushing with the same token overwrites the existing data.
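To make the storage rules concrete, here is a toy in-memory model of the behavior described above: a push answers with success or failure, data expires after seven days, and a second push with the same token replaces the first. This is not the Jina API. ToyStore and its methods are hypothetical names, and of all the server-side checks only a trivial token check is modeled.

```python
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # retention period from the text, in seconds


class ToyStore:
    """In-memory stand-in for the remote storage; not the real service."""

    def __init__(self):
        self.blobs = {}  # token -> (payload, created_at)

    def push(self, token, payload, now=None):
        now = time.time() if now is None else now
        if not token:
            return "failure"  # an invalid (here: empty) token is rejected
        self.blobs[token] = (payload, now)  # same token overwrites old data
        return "success"

    def get(self, token, now=None):
        now = time.time() if now is None else now
        entry = self.blobs.get(token)
        if entry is None:
            return None
        payload, created_at = entry
        if now - created_at > SEVEN_DAYS:
            del self.blobs[token]  # expired data is dropped
            return None
        return payload
```

Passing now explicitly keeps the sketch easy to try without waiting a week: push at time 0, then read at a time past SEVEN_DAYS to see the data disappear.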
DocumentArray Pull Flow
When a user tries to access a DocumentArray from the cloud, the following process takes place:
- A request is made, and the token is sent to the Jina API
- The server verifies and stores the download time and the metadata. Upon successful verification, a response is sent back in the form of a URL. This URL can be used for downloading the data.
- Once a GET request is made on that URL, the requested data is sent from the S3 server to the user's machine.
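The three steps above can be sketched as a toy in-memory model: the token is exchanged for a download URL, the server records the request, and a GET on that URL returns the data. Everything here is hypothetical, including the ToyPullServer class, its method names, and the made-up storage.example URL; the real service and its S3 links are not modeled.

```python
import time
import uuid


class ToyPullServer:
    """Hypothetical stand-in for the API plus storage; not the real service."""

    def __init__(self):
        self._data = {}         # token -> payload
        self._urls = {}         # download URL -> payload
        self.download_log = []  # (token, timestamp) for each granted request

    def store(self, token, payload):
        """Pretend a push already happened for this token."""
        self._data[token] = payload

    def request_download(self, token):
        """Steps 1 and 2: verify the token, record the request, return a URL."""
        if token not in self._data:
            return None  # verification failed, no URL is issued
        self.download_log.append((token, time.time()))
        url = f"https://storage.example/{uuid.uuid4().hex}"  # made-up URL
        self._urls[url] = self._data[token]
        return url

    def get(self, url):
        """Step 3: a GET on the issued URL returns the stored payload."""
        return self._urls.get(url)
```

Separating the token check from the download means the bulk data never has to pass through the API server itself; the client only needs the URL it was handed.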
What you saw above is how collaboration looks with Jina’s search framework. It lets you work collaboratively without worrying about how to share a piece of code or logic safely and securely. With DocumentArray’s new push and pull methods, you can efficiently work with data stored on the cloud by transferring it to your local system.