How to Upload/Download Files to/from Notebook in my Local machine
Problem:
I saved my Pandas or Spark dataframe to a file in a notebook. Where did it go? How do I read the file I just saved?
Pandas and most other libraries have APIs to read from or write to the local file system.
But where is the local file system for a notebook?
Let me explain with the help of the Data Science Experience architecture.
Every notebook is tightly coupled with a Spark service on Bluemix. You can also couple it with Amazon EMR, but either way a notebook must have a platform on which to run its notebook instance.
Since a notebook runs on an associated service, it uses that service's local disk; in the Apache Spark service's case, that is GPFS. If you run !pwd
in a Python notebook, you will see the GPFS tenant directory for your Spark service.
/gpfs/fs01/user/sc1f-6086fed227dde8-2c631c8ff999/notebook/work
Now you know that sc1f-6086fed227dde8-2c631c8ff999
is the tenant ID for your Spark service.
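If you want the tenant ID programmatically, it can be parsed out of the working directory path. A minimal sketch, assuming the path layout shown above (the helper name is mine; in a notebook, pass os.getcwd()):

```python
def tenant_id_from_cwd(cwd):
    """Extract the tenant ID from a GPFS notebook working directory.

    Assumes the layout /gpfs/fs01/user/<tenant_id>/notebook/work
    shown by !pwd above.
    """
    parts = cwd.strip("/").split("/")
    # parts == ['gpfs', 'fs01', 'user', '<tenant_id>', 'notebook', 'work']
    return parts[3]

print(tenant_id_from_cwd(
    "/gpfs/fs01/user/sc1f-6086fed227dde8-2c631c8ff999/notebook/work"))
# → sc1f-6086fed227dde8-2c631c8ff999
```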
Let’s say you have:
import base64
import pandas as pd
df = pd.DataFrame(data = [[1,2],[3,4]], columns=['Col 1', 'Col 2'])
Save the file to local GPFS:
#write dataframe to file
df.to_csv("test1.csv")
#read it back
pd.read_csv("test1.csv").head()
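To confirm where a relative path like "test1.csv" actually lands, you can print its absolute path:

```python
import os

# A relative path such as "test1.csv" resolves against the notebook's
# current working directory, so this shows exactly where the file landed.
print(os.path.abspath("test1.csv"))
# In a notebook this resolves under the GPFS tenant directory, e.g.
# /gpfs/fs01/user/<tenant_id>/notebook/work/test1.csv
```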
Now, you might want to download the above dataframe to your local machine.
Solutions
Option 1
Download the file through the notebook. This works only if the file is in CSV format.
The following function was provided by Polong Lin:
from IPython.display import HTML
def create_download_link( df, title = "Download CSV file", filename = "data.csv"):
csv = df.to_csv()
b64 = base64.b64encode(csv.encode())
payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
html = html.format(payload=payload,title=title,filename=filename)
return HTML(html)
create_download_link(df)
Running this function will give you a link to download the file into the notebook output cell.
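The same data-URI trick can be extended to files that are not CSVs by reading any file as bytes and serving it with a generic MIME type. A sketch; the function name and MIME type are my own choices, not part of the original recipe:

```python
import base64
import os

def file_download_link(path, title="Download file"):
    """Build a data-URI download link for any file, not just CSV.

    Returns an HTML string; wrap it in IPython.display.HTML to
    render a clickable link in a notebook output cell.
    """
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode()
    name = os.path.basename(path)
    return ('<a download="{name}" href="data:application/octet-stream;'
            'base64,{payload}" target="_blank">{title}</a>'
            ).format(name=name, payload=payload, title=title)
```

Keep in mind that the whole file is embedded in the page as base64, so this is only practical for small files.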
Option 2
You can download or upload any type of file. Since the files are on the Spark service's local disk,
the Spark service exposes an API that lets you download or upload them.
Use the REST API described in Using Spark Interactive API to upload or download the file from your Spark service.
The REST tenant API exposes only one directory under your tenant directory, served at https://spark.bluemix.net/tenant/data/. For example:
!ls -l /gpfs/fs01/user/sc1f-6086fed227dde8-2c631c8ff999/data/
total 3552
-rw-r----- 1 sc1f-6086fed227dde8-2c631c8ff999 users 1207391 May 17 19:36 96ae523867d444f6284ceed939588a57.CSV
drwx------ 2 sc1f-6086fed227dde8-2c631c8ff999 users 4096 May 3 15:55 b5219fcc83e93b4eae8d36990ead96682bea61af
drwx------ 2 sc1f-6086fed227dde8-2c631c8ff999 users 4096 May 3 15:11 libs
drwx------ 3 sc1f-6086fed227dde8-2c631c8ff999 users 4096 May 3 15:11 spark131batch
drwx------ 3 sc1f-6086fed227dde8-2c631c8ff999 users 4096 May 3 15:11 spark141batch
drwx------ 2 sc1f-6086fed227dde8-2c631c8ff999 users 4096 May 3 15:11 workdir
You can either copy the file we saved earlier into the data directory
/gpfs/fs01/user/sc1f-6086fed227dde8-2c631c8ff999/data/
by running
!cp test1.csv /gpfs/fs01/user/sc1f-6086fed227dde8-2c631c8ff999/data/
or you can save the file directly to that directory:
df.to_csv("/gpfs/fs01/user/sc1f-6086fed227dde8-2c631c8ff999/data/test1.csv")
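If you save to the data directory often, you can derive the path instead of hard-coding the tenant ID. A sketch, assuming the GPFS layout shown earlier (the helper name is hypothetical):

```python
import os

def data_path(filename, workdir=None):
    """Map a file name to the tenant's data/ directory, deriving the
    tenant directory from the notebook working directory.

    Assumes the GPFS layout /gpfs/fs01/user/<tenant_id>/notebook/work.
    """
    workdir = workdir or os.getcwd()
    tenant_dir = workdir.rsplit("/notebook/work", 1)[0]
    return os.path.join(tenant_dir, "data", filename)

# e.g. df.to_csv(data_path("test1.csv")) saves straight into the
# directory that the REST API exposes.
```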
For downloading, you can use the curl
command at your local machine's command prompt or terminal.
For Download
curl -O -X GET -u tenant_id:tenant_secret -H 'X-Spark-service-instance-id:instance_id' https://spark.bluemix.net/tenant/data/<filename.ext>
Values for these variables can be found on Bluemix.net: open your Spark service and look for the service credentials.
Here is the command that I ran on my local machine:
charless-MacBook-Pro-3:~ charles$ curl -O -X GET -u sc1f-6086fed227dde8-2c631c8ff999:22b2bc31-f19e-4fc8-8b60-47eebd950786 -H 'X-Spark-service-instance-id:17e8347c-571d-4128-9c1f-6086fed227dd' https://spark.bluemix.net/tenant/data/test1.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    25    0    25    0     0     97      0 --:--:-- --:--:-- --:--:--    99
charless-MacBook-Pro-3:~ charles$ cat test1.csv
,Col 1,Col 2
0,1,2
1,3,4
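If you would rather script the download in Python than shell out to curl, the same authenticated GET can be built with the standard library. A sketch under stated assumptions: the helper names are mine, and you must substitute your own tenant credentials:

```python
import base64
import urllib.request

SPARK_DATA_URL = "https://spark.bluemix.net/tenant/data/"

def build_download_request(filename, tenant_id, tenant_secret, instance_id):
    """Build the authenticated GET request for a file in the tenant
    data directory (same credentials and header as the curl command)."""
    req = urllib.request.Request(SPARK_DATA_URL + filename)
    token = base64.b64encode(
        "{}:{}".format(tenant_id, tenant_secret).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    req.add_header("X-Spark-service-instance-id", instance_id)
    return req

def download_from_tenant(filename, tenant_id, tenant_secret, instance_id):
    """Fetch the file and write it next to the script, like curl -O."""
    req = build_download_request(filename, tenant_id, tenant_secret, instance_id)
    with urllib.request.urlopen(req) as resp, open(filename, "wb") as out:
        out.write(resp.read())
```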
Now you can upload files to the Spark service. For example:
curl -X PUT -k -u tenant_id:tenant_secret -H 'X-Spark-service-instance-id:instance_id' --data-binary "@nameofthefiletoupload_withextension" https://spark.bluemix.net/tenant/data/nameofthefiletoupload_withextension
charless-MacBook-Pro-3:~ charles$ cat > mylocaldataset.txt
this is my file from local machine.
^Z
[1]+ Stopped cat > mylocaldataset.txt
charless-MacBook-Pro-3:~ charles$ cat mylocaldataset.txt
this is my file from local machine.
charless-MacBook-Pro-3:~ charles$ curl -X PUT -k -u sc1f-6086fed227dde8-2c631c8ff999:22b2bc31-f19e-4fc8-8b60-47eebd950786 -H 'X-Spark-service-instance-id:17e8347c-571d-4128-9c1f-6086fed227dd' --data-binary "@mylocaldataset.txt" https://spark.bluemix.net/tenant/data/mylocaldataset.txt
{"url":"https://spark.bluemix.net/tenant/data/mylocaldataset.txt"}
charless-MacBook-Pro-3:~ charles$
Note the @ sign: it tells curl to read the request body from the named file.
If the URL is returned, the file was uploaded successfully.
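The PUT call can likewise be scripted in Python with the standard library; this sketch mirrors the curl upload above (the helper name and placeholder credentials are mine):

```python
import base64
import urllib.request

def build_upload_request(path, tenant_id, tenant_secret, instance_id):
    """Build the authenticated PUT request that mirrors the curl
    upload above; the file contents become the request body, just
    like --data-binary "@file"."""
    filename = path.rsplit("/", 1)[-1]
    with open(path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        "https://spark.bluemix.net/tenant/data/" + filename,
        data=body, method="PUT")
    token = base64.b64encode(
        "{}:{}".format(tenant_id, tenant_secret).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    req.add_header("X-Spark-service-instance-id", instance_id)
    return req

# urllib.request.urlopen(build_upload_request(...)) performs the upload;
# on success the JSON response echoes the file's URL.
```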
Originally published at datascience.ibm.com on August 14, 2017.