Introducing Python and R APIs for collaborating on datasets

Riwaj Sapkota
Apr 6, 2020

As more and more public and private institutions make their datasets publicly available, the need to share and collaborate on datasets has increased manyfold.

The sources of data and their veracity are one important aspect. Another is how people can collaborate on privately and publicly available datasets. Collaborating on datasets means being able to do much more than just upload them. Even with a single data source, new data parameters may be added every hour, so collaborators need to download the same version of the dataset, keep track of data revisions, and share them without a lot of hassle.

This blog post describes how you can use the dstack APIs as a convenient way to address these problems. How to create a dstack account and install the dstack package is covered in another post.

Share your dataset with others

Are you working on a project where you have to share your dataset with someone else on the team? Do you want to keep track of the revisions of the dataset that you have shared? For this scenario, dstack offers APIs to publish the dataset so that it can be shared conveniently and every revision can be tracked.

In general, you will:

  1. Publish the dataset using the dstack APIs available in the dstack package.
  2. Share the dataset using the web application.

Let us start by publishing datasets. Suppose first that you want to share a static dataset that you created yourself, such as a pandas dataframe.

The published dataframe can be accessed via the URL:

https://dstack.ai/<user>/static_dataset_example

A static dataset published from a pandas dataframe.

How about sharing interactive datasets? As an example, let us assume that you got hold of a dataset (player_data.csv) and you want to share this dataset with others.

With the dstack methods, you can commit multiple datasets to the same frame, each with the parameters you want to use for interactivity.

In this way, you determine how users can interact with your dataset. The dstack methods used to publish the dataset are documented at docs.dstack.ai.
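To make the idea of one frame holding several parameterized datasets more concrete, here is a small toy model in plain Python. It is only a sketch and does not use the dstack package: the `ToyFrame` class, its `commit` and `select` methods, the `College` parameter, and the player rows are all invented for illustration, so please refer to docs.dstack.ai for the actual methods.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
ParamKey = Tuple[Tuple[str, str], ...]


class ToyFrame:
    """Hypothetical stand-in for a frame: one dataset per parameter set."""

    def __init__(self, stack: str) -> None:
        self.stack = stack
        self._views: Dict[ParamKey, List[Row]] = {}

    def commit(self, rows: List[Row], params: Dict[str, str]) -> None:
        # Sort the parameters so lookups do not depend on their order.
        self._views[tuple(sorted(params.items()))] = rows

    def select(self, params: Dict[str, str]) -> List[Row]:
        # Return the dataset committed for exactly this parameter set.
        return self._views[tuple(sorted(params.items()))]


# Invented rows standing in for the contents of player_data.csv.
players = [
    {"name": "Player 1", "college": "College A"},
    {"name": "Player 2", "college": "College B"},
    {"name": "Player 3", "college": "College A"},
]

# Commit one subset per college; the parameter is what a viewer would pick.
frame = ToyFrame("player_data")
for college in sorted({p["college"] for p in players}):
    subset = [p for p in players if p["college"] == college]
    frame.commit(subset, {"College": college})

chosen = frame.select({"College": "College A"})
print([p["name"] for p in chosen])
```

Each parameter set maps to exactly one committed dataset, which is what lets a viewer switch between them by picking a value.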

Here is what the resulting interactive dataset looks like in the dstack web application.

Notice the share button, which allows you to share your dataset with others.

Once the dataset is published, you can use the share button or the URL itself to share it easily with your peers and clients.

Use datasets from others

Imagine that you would like to use a dataset that someone else hosts at dstack and has made available to you. In this case, you have two options to obtain the dataset.

  1. Download the dataset from the dstack web application as a CSV file.
  2. Fetch the dataset as a pandas dataframe using the dstack API.

Log in to the dstack web application to download the dataset or fetch it via the dstack API.
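For the first option, the downloaded CSV file can be read with nothing more than Python's standard library. The following is a minimal sketch: the column names and rows are made up for illustration, and an in-memory string stands in for the downloaded file so that the example runs on its own.

```python
import csv
import io

# Stand-in for a CSV file downloaded from the web application.
# The columns and rows are invented for this example.
downloaded = io.StringIO(
    "name,college,points\n"
    "Player 1,College A,12\n"
    "Player 2,College B,7\n"
)

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(downloaded))

# CSV values arrive as strings, so numeric columns are converted explicitly.
total_points = sum(int(row["points"]) for row in rows)
print(len(rows), total_points)
```

With a real download, the `io.StringIO` object would be replaced by the opened file, for example `open(path, newline="")` where `path` points to wherever you saved the CSV.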

As a Python or R user, you might find it more convenient to simply use the dstack API: the pull API fetches a dataset hosted at dstack directly into your code.

Track revisions of datasets

Now here is the cool part! Each revision of the dataset, with all its parameters, can be fetched using a button in the dstack web application. In this way, you get a historical record of how your dataset evolved over the course of your work.

In essence, this is how it works:

  • Push every new version of the dataset to the same stack using the dstack APIs.
  • Log in to the dstack web application and go to the URL of your dataset.
  • Underneath the name of the stack, e.g., College_player_data, you will see a widget for the revision history.
  • Click on it to find all the versions of the dataset, including the latest one, which is labeled Head.

Each revision in the history is denoted by a timestamp.
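The revision idea can be pictured with a small toy model in plain Python. This is a sketch of the concept only, not of how dstack implements it: the `ToyStack` class, its methods, and the sample rows are hypothetical. Every push to the same stack appends a timestamped revision, and the latest one plays the role of Head.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple

Rows = List[dict]


class ToyStack:
    """Hypothetical stand-in for a stack that keeps every pushed revision."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._revisions: List[Tuple[datetime, Rows]] = []

    def push(self, rows: Rows, when: Optional[datetime] = None) -> datetime:
        # Each push adds a new revision instead of overwriting the last one.
        stamp = when if when is not None else datetime.now(timezone.utc)
        self._revisions.append((stamp, list(rows)))
        return stamp

    def head(self) -> Rows:
        # "Head" is simply the most recently pushed revision.
        return self._revisions[-1][1]

    def history(self) -> List[datetime]:
        # Timestamps of all revisions, oldest first.
        return [stamp for stamp, _ in self._revisions]

    def at(self, stamp: datetime) -> Rows:
        # Look up an older revision by its timestamp.
        for revision_stamp, rows in self._revisions:
            if revision_stamp == stamp:
                return rows
        raise KeyError(stamp)


stack = ToyStack("College_player_data")
first = stack.push([{"name": "Player 1"}],
                   when=datetime(2020, 4, 1, tzinfo=timezone.utc))
second = stack.push([{"name": "Player 1"}, {"name": "Player 2"}],
                    when=datetime(2020, 4, 2, tzinfo=timezone.utc))

print(len(stack.history()), len(stack.head()), len(stack.at(first)))
```

The timestamps are passed in explicitly here only to keep the example reproducible; without the `when` argument, the toy model stamps each push with the current time.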

Collaborate on the dataset

Imagine you would like to share your thoughts on a dataset as a comment to its contributor. The comment feature lets you take part in a discussion (think of Instagram or Facebook). To add a comment, you must log in to your dstack.ai account and have been given permission to collaborate on the dataset by whoever published it in dstack.

Permissions to share and comment in dstack are managed in the following way.

  1. Every user can use the settings menu to choose whether new datasets are private or public by default.
  2. Once a dataset is published, the publisher can choose whom to share the dataset with. Alternatively, the dataset can be made public even if the default setting is private.

You do not need a dstack account to view a publicly shared stack, but you will need one to comment.
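These rules can be summarized in a few lines of plain Python. The sketch below is a toy model built only from the description above, not dstack's actual access control: `ToyDataset`, `publish`, `can_view`, and `can_comment` are hypothetical names, and treating every logged-in user with view access as allowed to comment is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ToyDataset:
    """Hypothetical published dataset with its visibility settings."""
    owner: str
    public: bool
    shared_with: Set[str] = field(default_factory=set)


def publish(owner: str, default_public: bool,
            public: Optional[bool] = None) -> ToyDataset:
    # A per-dataset choice overrides the default from the account settings.
    return ToyDataset(owner, default_public if public is None else public)


def can_view(dataset: ToyDataset, user: Optional[str]) -> bool:
    # user is None for a visitor without a dstack account.
    if dataset.public:
        return True
    return user is not None and (user == dataset.owner
                                 or user in dataset.shared_with)


def can_comment(dataset: ToyDataset, user: Optional[str]) -> bool:
    # Commenting needs an account in addition to view access (assumption).
    return user is not None and can_view(dataset, user)


# Private by default, then shared with one named user.
private = publish("alice", default_public=False)
private.shared_with.add("bob")

# Made public for this dataset even though the account default is private.
public = publish("alice", default_public=False, public=True)

print(can_view(private, "bob"), can_view(public, None), can_comment(public, None))
```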

All of these services are also available from a mobile phone, which is a convenient way to review updated datasets.

Try now!

dstack.ai can be used from your notebooks, scripts, jobs, or even applications, and we offer all of the services described above for free.

Please feel free to use your own datasets as well as public datasets on Covid-19 at dstack to explore features tailored to datasets as well as data visualizations.

Do you think we are missing some use cases or features? Please sign up and get back to us with feedback and suggestions. We are curious to hear your thoughts.

Thank you very much.

Feedback survey: https://forms.gle/6MSLAGaHJFvvpB5Q6

Sign up: https://dstack.ai/auth/signup

Learn more: https://dstack.ai

Documentation: docs.dstack.ai

Email for feedback: team@dstack.ai

Disclaimer and limitations:

  1. The feature is in beta, and early adopters are welcome to try it.
  2. APIs for R will be available soon.
  3. We do not share the data with third parties except when obliged by law.
  4. We do not verify the veracity or the source of the data and rely on the user for this. Please report any infringement you find to us.
  5. If we notice that the sharing of data harms any party or violates any law, we will be obliged to delete the data immediately.
  6. You can restrict access to the data, e.g., limit it to selected users, by changing the stack settings or the default in your account settings.
