How to Upload/Download Files to/from Notebook in my Local machine

Charles Gomes
3 min read · Aug 14, 2017


Problem:

I saved my Pandas or Spark dataframe to a file in a notebook. Where did it go? How do I read the file I just saved?

Pandas and most other libraries have APIs to read from or write to the local file system.

Where is the local file system for a notebook?

Let me explain with the help of the Data Science Experience architecture.

Every notebook is tightly coupled with a Spark service on Bluemix. You can also couple it with Amazon EMR. Either way, a notebook must have a platform on which to run its instance.

Since a notebook runs on an associated service, it uses that service’s local disk; in the Apache Spark service’s case, that is GPFS. If you run !pwd in a Python notebook, you will see the GPFS tenant directory for your Spark service.

/gpfs/fs01/user/sc1f-6086fed227dde8-2c631c8ff999/notebook/work

Now you know that sc1f-6086fed227dde8-2c631c8ff999 is the tenant ID for your Spark service.
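You can also confirm this from inside a notebook by asking Python itself for the working directory (a minimal sketch; the exact path will differ for your tenant):

```python
import os

# In a notebook this prints the GPFS working directory,
# e.g. /gpfs/fs01/user/<tenant-id>/notebook/work
print(os.getcwd())
```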

Let’s say you have:

import base64  
import pandas as pd
df = pd.DataFrame(data = [[1,2],[3,4]], columns=['Col 1', 'Col 2'])

Save the file to local GPFS:

#write dataframe to file
df.to_csv("test1.csv")
#read it back
pd.read_csv("test1.csv").head()

Now, you might want to download the above dataframe to your local machine.

Solutions

Option 1

Download the file through the notebook, but only if the file is in CSV format.

The following function was provided by Polong Lin:

import base64
from IPython.display import HTML

def create_download_link(df, title="Download CSV file", filename="data.csv"):
    csv = df.to_csv()
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)

create_download_link(df)

Running this function renders a link in the notebook output cell that downloads the file.
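Under the hood, the helper simply base64-encodes the CSV text into a data URI; a minimal round-trip of that step, using the same small table as above, looks like:

```python
import base64

# CSV text as pandas' to_csv() produces it for the example dataframe
csv_text = ",Col 1,Col 2\n0,1,2\n1,3,4\n"

# Encode for the data URI, then decode as the browser does on download
payload = base64.b64encode(csv_text.encode()).decode()
restored = base64.b64decode(payload).decode()
assert restored == csv_text
```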

Option 2

You can download or upload any type of file. Since the files live on the Spark service’s local disk, the service exposes a REST API for downloading and uploading them.

Use the REST API described in Using Spark Interactive API to upload or download the file from your Spark service.

Note that the tenant REST API exposes only one directory under your tenant directory, mapped to https://spark.bluemix.net/tenant/data/:

!ls -l /gpfs/fs01/user/sc1f-6086fed227dde8-2c631c8ff999/data/
total 3552
-rw-r----- 1 sc1f-6086fed227dde8-2c631c8ff999 users 1207391 May 17 19:36 96ae523867d444f6284ceed939588a57.CSV
drwx------ 2 sc1f-6086fed227dde8-2c631c8ff999 users 4096 May 3 15:55 b5219fcc83e93b4eae8d36990ead96682bea61af
drwx------ 2 sc1f-6086fed227dde8-2c631c8ff999 users 4096 May 3 15:11 libs
drwx------ 3 sc1f-6086fed227dde8-2c631c8ff999 users 4096 May 3 15:11 spark131batch
drwx------ 3 sc1f-6086fed227dde8-2c631c8ff999 users 4096 May 3 15:11 spark141batch
drwx------ 2 sc1f-6086fed227dde8-2c631c8ff999 users 4096 May 3 15:11 workdir

You can either copy the file we saved earlier to the data directory

/gpfs/fs01/user/sc1f-6086fed227dde8-2c631c8ff999/data/

by running

!cp test1.csv /gpfs/fs01/user/sc1f-6086fed227dde8-2c631c8ff999/data/

or you can save the file directly to that directory:

df.to_csv("/gpfs/fs01/user/sc1f-6086fed227dde8-2c631c8ff999/data/test1.csv")

For downloading, you can use the curl command at your local machine's command prompt or terminal.

For Download

curl -O -X GET -u tenant_id:tenant_secret -H 'X-Spark-service-instance-id:instance_id' https://spark.bluemix.net/tenant/data/<filename.ext>

Values for the variables can be found on Bluemix.net: open your Spark service and look under its service credentials.

Here is the command that I ran on my local machine:

charless-MacBook-Pro-3:~ charles$ curl -O -X GET -u sc1f-6086fed227dde8-2c631c8ff999:22b2bc31-f19e-4fc8-8b60-47eebd950786 -H 'X-Spark-service-instance-id:17e8347c-571d-4128-9c1f-6086fed227dd' https://spark.bluemix.net/tenant/data/test1.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    25    0    25    0     0     97      0 --:--:-- --:--:-- --:--:--    99
charless-MacBook-Pro-3:~ charles$ cat test1.csv
,Col 1,Col 2
0,1,2
1,3,4

Now you can upload files to the Spark service.

For example:

curl -X PUT -k -u tenant_id:tenant_secret -H 'X-Spark-service-instance-id:instance_id' --data-binary "@nameofthefiletoupload_withextension" https://spark.bluemix.net/tenant/data/nameofthefiletoupload_withextension
charless-MacBook-Pro-3:~ charles$ cat > mylocaldataset.txt
this is my file from local machine.
^Z
[1]+ Stopped cat > mylocaldataset.txt

charless-MacBook-Pro-3:~ charles$ cat mylocaldataset.txt
this is my file from local machine.
charless-MacBook-Pro-3:~ charles$ curl -X PUT -k -u sc1f-6086fed227dde8-2c631c8ff999:22b2bc31-f19e-4fc8-8b60-47eebd950786 -H 'X-Spark-service-instance-id:17e8347c-571d-4128-9c1f-6086fed227dd' --data-binary "@mylocaldataset.txt" https://spark.bluemix.net/tenant/data/mylocaldataset.txt
{"url":"https://spark.bluemix.net/tenant/data/mylocaldataset.txt"}
charless-MacBook-Pro-3:~ charles$

Note the @ sign: it is important, because it tells curl to read the request body from the named file.

If the URL is returned, the file was uploaded successfully.
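If you would rather stay in Python than shell out to curl, the same two requests can be issued with the standard library. This is only a sketch, assuming the endpoint behaves exactly as the curl examples above show; `data_url`, `download_file`, and `upload_file` are hypothetical helper names, not part of any official client:

```python
import base64
import os
import urllib.request

BASE_URL = "https://spark.bluemix.net/tenant/data"

def data_url(filename):
    # The tenant data endpoint maps one-to-one onto file names.
    return f"{BASE_URL}/{filename}"

def _request(method, url, tenant_id, tenant_secret, instance_id, data=None):
    # Reproduce what curl sends: HTTP basic auth plus the instance-id header.
    req = urllib.request.Request(url, data=data, method=method)
    token = base64.b64encode(f"{tenant_id}:{tenant_secret}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("X-Spark-service-instance-id", instance_id)
    return urllib.request.urlopen(req)

def download_file(filename, tenant_id, tenant_secret, instance_id):
    # Equivalent of: curl -O -X GET -u ... -H ... <url>
    with _request("GET", data_url(filename), tenant_id, tenant_secret, instance_id) as resp:
        with open(filename, "wb") as f:
            f.write(resp.read())

def upload_file(path, tenant_id, tenant_secret, instance_id):
    # Equivalent of: curl -X PUT -u ... -H ... --data-binary "@path" <url>
    with open(path, "rb") as f:
        body = f.read()
    name = os.path.basename(path)
    with _request("PUT", data_url(name), tenant_id, tenant_secret, instance_id, data=body) as resp:
        return resp.read().decode()
```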

Originally published at datascience.ibm.com on August 14, 2017.
