The Use of Python in Data Analyst Work

Katarina Nimas Kusumawati
Blibli.com Tech Blog
6 min read · Jun 16, 2023

What is Python?

Python is a high-level programming language that is used for a wide range of applications such as web development, data analysis, artificial intelligence, automation, and more.

It is a popular language because it is easy to learn and read, which means that even people without a technical background can quickly pick it up. Python is also known for its flexibility and versatility, making it a great choice for beginners and experts alike.

In essence, Python is a tool that allows you to write code to solve problems and automate tasks in a way that is easy to understand and execute. Whether you’re looking to build a simple program or tackle a complex data analysis project, Python is a great language to get started with.

Python X Data Analyst

Data analysts often rely on various tools and techniques to accomplish their daily tasks. While Python may not be the primary choice for many data analysts, there are instances where utilizing Python can significantly enhance the effectiveness and efficiency of their work. In this article, we will explore some key implementations of Python that can assist data analysts in their work.

Example Python implementations in data analytics

Get data from BigQuery

Data analysts often encounter situations where they need to save queried data in different formats for sharing with users. BigQuery provides convenient options to store query results as CSV, JSON, or BigQuery tables, or to copy them to the clipboard. However, BigQuery imposes size limits on how much data can be retrieved at once, and excessively large result sets can also run into format mismatches.

Storing Queried Data in Various Formats

BigQuery offers flexibility in saving queried data by providing options to store it in CSV, JSON, BigQuery tables, or copying it to the clipboard. This feature enables data analysts to choose the most suitable format for sharing the data with users. Whether it’s a spreadsheet-friendly CSV format or a structured BigQuery table, analysts can accommodate different user needs.

Limitations in Data Retrieval from BigQuery

Despite its advantages, BigQuery has limitations when it comes to retrieving large datasets. There are size limitations on the amount of data that can be extracted from BigQuery at once. This constraint can pose challenges for data analysts working with extensive datasets or when attempting to extract substantial portions of data for further analysis or reporting purposes.

Format Challenges with Large Data

Another challenge that data analysts face with BigQuery is the potential format mismatch when dealing with excessively large datasets. As the size of the data increases, certain formats may not be suitable or compatible. This situation may require analysts to explore alternative approaches or employ additional tools to handle and extract data efficiently.
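One practical workaround when a single extract exceeds the download limit is to split the query into smaller chunks, for example one per date of a partitioned table, and run them one at a time. A minimal sketch, assuming a hypothetical `dataset.orders` table with a `created_at` column:

```python
def build_chunk_queries(table, date_column, dates):
    """Build one query per date so each result stays under the extraction limit."""
    return [
        f"SELECT * FROM `{table}` WHERE DATE({date_column}) = '{d}'"
        for d in dates
    ]

queries = build_chunk_queries('dataset.orders', 'created_at',
                              ['2023-06-01', '2023-06-02'])

# Each query can then be run with pandas_gbq.read_gbq and the frames concatenated:
# import pandas as pd
# import pandas_gbq as pd_gbq
# frames = [pd_gbq.read_gbq(q, project_id='project_id_bq', dialect='standard')
#           for q in queries]
# df = pd.concat(frames, ignore_index=True)
```

Splitting on the partition column keeps each chunk cheap to scan as well, since BigQuery prunes the partitions that are not referenced.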

Code Example

# Import the library
import pandas_gbq as pd_gbq

# Select the project id
PROJECT_ID = 'project_id_bq'

# Connect to BigQuery; on first use you will be directed to Google to authorize credentials
BQ_CONFIGURATION = {
    'query': {
        'useQueryCache': True
    }
}

def create_df_from_sql(sql):
    df = pd_gbq.read_gbq(sql, project_id=PROJECT_ID, dialect='standard', configuration=BQ_CONFIGURATION)
    return df

# Input the query
sql = """
SELECT 1
FROM `dataset.order`
"""

df = create_df_from_sql(sql)
df.head()

# CSV output
df.to_csv('Output.csv', index=False)

# Excel output
df.to_excel('Output.xlsx', sheet_name='Sheet1', index=False)

Copy, cut, and delete table through BigQuery API

When transferring tables between datasets in BigQuery, several approaches are available. Two common methods are the CREATE OR REPLACE TABLE query and the table COPY operation. CREATE OR REPLACE TABLE queries can be resource-intensive, whereas copying a table is faster and cheaper. However, a copy only creates the new table; the original remains. To optimize storage costs and keep datasets tidy, the original table should then be deleted, either through the BigQuery UI or with a DROP TABLE query. An even more convenient approach is to copy and delete tables through the BigQuery API: all you need is an Excel file (.xlsx) listing the tables to process, which Python can then work through.

Leveraging the BigQuery API for Copying and Deleting Tables

Utilizing the BigQuery API offers an even more efficient approach to copying and deleting tables. By leveraging Python and having an Excel file containing the list of tables to be deleted, you can execute the necessary operations programmatically, streamlining the process and saving time.

Code Example

from google.cloud import bigquery
import pydata_google_auth
import pandas as pd

SCOPES = [
    'https://www.googleapis.com/auth/cloud-platform'
]

credentials = pydata_google_auth.get_user_credentials(
    SCOPES,
    # Set auth_local_webserver to True to have a slightly more convenient
    # authorization flow. Note, this doesn't work if you're running from a
    # notebook on a remote server, such as over SSH or with Google Colab.
    auth_local_webserver=True,
)

# READ FROM FILE
file_path = "File.xlsx"        # TO-DO: path to the xlsx file on local disk
sheet_name = "Sheet1"          # TO-DO: sheet name in the xlsx file
dataset_backup = "dataset_id"  # TO-DO: dataset used to back up the original tables

df = pd.read_excel(file_path, sheet_name=sheet_name, header=0)
# Optional: only process rows whose table_type is 'table' (views cannot be deleted this way)
filtered_tables = df[df['table_type'].str.lower() == 'table']
copy_tables = list(filtered_tables['project_id'] + '.' + filtered_tables['dataset_id'] + '.' + filtered_tables['table_id'])

run_project_id = 'project_id'  # TO-DO: project in which the jobs will run
client = bigquery.Client(credentials=credentials, project=run_project_id)

EXECUTE_PROCESS = True  # TO-DO: set to True to execute the copy-and-delete process

if EXECUTE_PROCESS:
    for table in copy_tables:
        try:
            project_id, dataset_id, table_id = table.split('.')
            destination_table = f'{project_id}.{dataset_backup}.{table_id}'
            copy_job = client.copy_table(table, destination_table)
            copy_job.result()
            print(f"Finished copying {table} to {destination_table}")

            client.delete_table(table)
            print(f"Deleted table {table}")
        except Exception as e:
            print(e)

# Notes:
# 1. This can only delete tables (not views)
# 2. The backup will be copied to project_id.dataset_backup.table_id
# 3. Modify all 'TO-DO' variables based on your requirements
# 4. To execute the process, run the code
# 5. If two tables share a name (project_a.dataset_a.ABC & project_a.dataset_b.ABC), the second copy will fail
# 6. A table is deleted only after its copy succeeds

Send personalized emails through Outlook

Data analysts often need to send emails to multiple users, with each email personalized according to the information contained in an Excel file (.xlsx). To accomplish this, Python can be utilized for automating the process of sending personalized emails.

Personalized Email Generation

Python offers libraries and modules that enable data analysts to read data from an Excel file and generate personalized email content. By extracting relevant information from the Excel file, such as recipient names, email addresses, and custom fields, analysts can dynamically create tailored email content for each user.

Code Example

# Import libraries
import win32com.client
import pandas as pd

# Load the Excel file
df = pd.read_excel("file.xlsx")

for r in df.itertuples():
    # Static values per recipient
    owner = r.Owner
    name = r.name
    city = r.city
    email = r.email
    # Aggregate columns if needed for dynamic values
    order_data = [r.order_item_0, r.order_item_1]
    category_data = [r.category_0, r.category_1]
    msg1 = ""

    msg = f"""Dear {name},

This is information about your order from our store in {city}: """
    for i in range(len(order_data)):
        if order_data[i] == "-" or category_data[i] == "-":  # '-' marks an empty slot, so it is skipped
            continue
        msg1 += f"""{i+1}. Category: {category_data[i]} Item: {order_data[i]} \n"""
    msg2 = """

Best regards,

Data Store"""
    msg_email = msg + msg1 + msg2

    print(msg_email)

    # Adjust as needed before actually sending the email
    outlook = win32com.client.Dispatch("Outlook.Application")
    mail = outlook.CreateItem(0)
    mail.To = email
    mail.CC = 'ervin@datastore.com'
    mail.Subject = 'Checkout Information'
    mail.Body = msg_email
    mail.Send()

Output example 1:

Checkout Information
To: regina@datastore.com
CC: ervin@datastore.com

Dear Regina,
This is information about your order from our store in Lampung:
1. Category: beauty Item: foundation
2. Category: beauty Item: comb

Best regards,

Data Store

Output example 2:

Checkout Information
To: helena@datastore.com
CC: ervin@datastore.com

Dear Helena,
This is information about your order from our store in Bandung:
1. Category: women fashion Item: skirt
2. Category: women fashion Item: jacket

Best regards,

Data Store

Output example 3:

Checkout Information
To: austen@datastore.com
CC: ervin@datastore.com

Dear Austen,
This is information about your order from our store in Jakarta:
1. Category: accessories Item: necklace

Best regards,

Data Store

Conclusion

Although Python is not commonly utilized by data analysts, its implementation can significantly assist their work by reducing manual tasks and enhancing overall efficiency and effectiveness. By leveraging Python’s capabilities, data analysts can automate processes, streamline data analysis workflows, and gain valuable insights from their data. Embracing Python as a tool empowers data analysts to unlock new possibilities, optimize their work, and ultimately deliver more impactful results.



Sometimes I struggle with data, sometimes I just wanna be a Pikachu