In DSX, we use projects to organize resources such as data, notebooks, models, and connections. To interact with these assets easily, we now have project-lib alongside the object storage APIs. Project-lib is a programmatic interface to the data stored in your project's object storage, and it lets you access all of your project assets, including files, connections, and metadata.
In this blog, we will explore the project-lib library in a Python notebook.
Set-Up Your Project
Project-lib is pre-installed on DSX and can be set up in a notebook in a few simple steps:
1. Click the more option (three dots) in the panel at the top right and select Insert project token.
2. If this is the first time you are using a token in this project, you will see the message below. Click Project Settings, create a token by clicking New token in the Access tokens section, and then repeat step 1.
Once you insert the project token, the code snippet below is added to the first cell of your notebook. It contains the Spark context, the project id, and the token, in that order.
If you are using a notebook with environments, there is no Spark context in your notebook. In that case, replace sc with None in the snippet, as shown here:
#The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(None, '**************', '**************')
pc = project.project_context
Get Project Details
You can fetch project details such as the project name, description, and bucket name programmatically using project-lib methods.
The get_assets method returns a list of all assets with their names and ids; you can filter this list by asset type, for example into data assets or connections.
You can also list the files in your project with the get_files function.
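A sketch of those calls (the 'data_asset' type string and the exact method names follow the project-lib documentation; verify them against the version installed in your project):

```python
# Project metadata
print(project.get_name())
print(project.get_description())

# All assets, filtered down to data assets by their type field
assets = project.get_assets()
data_assets = [a for a in assets if a['type'] == 'data_asset']
print(data_assets)

# Files stored in the project's object storage bucket
print(project.get_files())
```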
Upload Data to Object Storage
You can use the save_data method to upload data to object storage. It takes four parameters:
1. file_name: name of the file. Your data is saved under this name in object storage.
2. data: the data to upload. It should be a file-like object, such as a byte buffer or a string buffer.
3. set_project_asset: add the asset to the project after a successful upload (optional, default True). If set to False, the file is only added to the COS bucket, not to the project.
4. overwrite: overwrite the object if it already exists (optional, default False).
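Since the data parameter expects a file-like object, the standard io buffers are the usual way to wrap in-memory data. A small standalone illustration of the two buffer types mentioned above:

```python
import io

# A string buffer behaves like a text file opened for reading
sbuf = io.StringIO("sepal_length,species\n5.1,setosa\n")
print(sbuf.read())

# A byte buffer behaves like a binary file (here: the PNG magic bytes)
bbuf = io.BytesIO(b"\x89PNG\r\n\x1a\n")
print(len(bbuf.getvalue()))
```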
Suppose you are working with a pandas data frame and you want to save that data frame as a csv file in object storage. You can use the code below to save this data:
# Save dataframe as csv file to storage
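The original snippet was cut off here; a minimal sketch of that save, assuming project-lib is already set up (the dataframe contents and file name are placeholders):

```python
import pandas as pd

df = pd.DataFrame({'sepal_length': [5.1, 4.9], 'species': ['setosa', 'setosa']})

# to_csv with no path returns the CSV body as a string,
# which save_data accepts alongside file-like objects
project.save_data(file_name='iris.csv', data=df.to_csv(index=False), overwrite=True)
```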
Get Data from Object Storage
To get data from object storage, use the get_file method and provide the name of the file:
# Getting csv file and loading it as a pandas dataframe
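A sketch of that load (the file name is a placeholder; get_file returns a byte buffer per the project-lib documentation, so it is rewound before handing it to pandas):

```python
import pandas as pd

buf = project.get_file('iris.csv')
buf.seek(0)  # rewind the buffer before reading
df = pd.read_csv(buf)
df.head()
```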
Upload Pickle File to Object Storage
While working with machine learning models, we can save a trained model as a pickle object in object storage.
# Save pickle
import pickle

project.save_data(data=pickle.dumps(clf), file_name='RF.pkl', overwrite=True)

# Download and load model
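The download-and-load step was truncated in the original; a minimal sketch, assuming the model was saved as RF.pkl as above:

```python
import pickle

# get_file returns a byte buffer, which pickle can read directly
clf = pickle.load(project.get_file('RF.pkl'))
```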
Bring Python Scripts into your DSX Environment
Suppose you have a folder of Python scripts on your local machine and you want to bring them into DSX.
First, zip the folder with all of its files. Upload that zip file to the project using the find and add data option in the top panel.
Then use the function below to get all of those files into the notebook:
import zipfile

def extract_files(file_name):
    """Download a zip file from object storage and extract it locally."""
    try:
        with open(file_name, "wb") as fobj:
            fobj.write(project.get_file(file_name).read())
        zipfile.ZipFile(file_name).extractall()
        print('Files downloaded successfully')
    except Exception as e:
        print(e)
You can use magic commands to load and run these Python scripts, or you can import a function from a script. In the snippet below, we import the function prim_numbers from the cal_prime.py script.
# Load script
%load Python\ Scripts/cal_prime.py
# Run script
%run Python\ Scripts/cal_prime.py
# Import function from python script
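A sketch of that import, assuming the zip extracted to a folder named Python Scripts in the working directory (the folder name is an assumption):

```python
import sys

# Make the extracted folder importable
sys.path.insert(0, 'Python Scripts')

from cal_prime import prim_numbers
```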
Create Zip File from Files in Object Storage
Suppose you have images stored in your object storage and you want to download them, zip them together, and upload that zip file back to object storage.
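A sketch of that round trip, built entirely in memory (the image object names are placeholders):

```python
import io
import zipfile

image_names = ['img1.png', 'img2.png']  # placeholder object names

# Build the zip in memory from the downloaded byte buffers
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    for name in image_names:
        zf.writestr(name, project.get_file(name).read())

# Upload the finished zip back to object storage
buf.seek(0)
project.save_data(file_name='images.zip', data=buf, overwrite=True)
```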
Work with Connections
You can use the get_connections method to list all of the available connections in the project. To get a single connection, use the get_connection method and pass either the connection id or the connection name; it returns the credentials required for that connection.
Here is an example of how to read data from a connection. I have a DB2 connection in the project and I want to read one table from this database into a pandas data frame.
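A sketch of that read using the ibm_db driver; the connection name, the credential keys, and the table name are all assumptions about the shape of the credentials returned by get_connection, so check them against your own connection:

```python
import ibm_db
import ibm_db_dbi
import pandas as pd

# Credentials returned by project-lib (name is a placeholder)
creds = project.get_connection(name='DB2 Connection')

# Build a DSN from the credential keys (keys are assumptions)
dsn = (
    "DRIVER={{IBM DB2 ODBC DRIVER}};"
    "DATABASE={database};HOSTNAME={host};PORT={port};"
    "PROTOCOL=TCPIP;UID={username};PWD={password};"
).format(**creds)

# Wrap the low-level handle in a DB-API connection for pandas
conn = ibm_db_dbi.Connection(ibm_db.connect(dsn, "", ""))
df = pd.read_sql("SELECT * FROM MYSCHEMA.MYTABLE", conn)
```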
Get Data Using Spark
If you are using a notebook running on the Spark service and you want to get data into a Spark data frame, get_file_url will return a URL for a file in object storage that Spark can read directly.
from pyspark.sql import SparkSession

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(sc, '******************', '******************')
pc = project.project_context

# Get file url using file name
url = project.get_file_url('iris.csv')

# Get data into spark dataframe
spark = SparkSession.builder.getOrCreate()
df = spark.read.format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(url)