Writing Efficient Codes and Automation for a Data Scientist

Jeff Lin
Published in Analytics Vidhya
Jan 28, 2020 · 4 min read

Principles of Programming and Python Libraries to know

A data scientist writes code, builds models, and summarizes the findings, and sometimes shares the code or model leading up to the results. Data scientists come from various backgrounds, and many don't have formal computer science or programming training. The code they share should be written with some easily overlooked programming skills in mind.

DRY (Don’t Repeat Yourself): writing modular/reusable code

DRY is a simple programming principle: don’t write the same code or configuration in multiple places.

For example, when plotting a scatter plot and two box plots, setting the labels, titles, and other visualization parameters for each individual plot results in a lot of repeated code, as seen below.

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x='x', y='y', hue='hue', data=data, palette='cubehelix')
ax = plt.gca()
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.xticks(rotation=45)
plt.title('Scatter Plot Title')
plt.legend(title='Legend Title')
ax.spines['left'].set_color('k')
ax.spines['bottom'].set_color('k')

sns.boxplot(x='x', y='y', hue='hue', data=data, palette='cubehelix')
ax = plt.gca()
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.xticks(rotation=45)
plt.title('1st Box Plot Title')
plt.legend(title='Legend Title')
ax.spines['left'].set_color('k')
ax.spines['bottom'].set_color('k')

sns.boxplot(x='x', y='y', hue='hue', data=data, palette='cubehelix')
ax = plt.gca()
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.xticks(rotation=45)
plt.title('2nd Box Plot Title')
plt.legend(title='Legend Title')
ax.spines['left'].set_color('k')
ax.spines['bottom'].set_color('k')

When code is written to be reusable:

def ax_params(xlabel, ylabel, plt_title=None, ax=None, legend_title=None, c='k'):
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    plt.title(plt_title)
    if legend_title:
        plt.legend(title=legend_title)
    if ax is None:
        ax = plt.gca()
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['left'].set_color(c)
    ax.spines['bottom'].set_color(c)

sns.scatterplot(x='x', y='y', hue='hue', data=data)
ax_params('xlabel', 'ylabel', 'Scatter Plot Title')
sns.boxplot(x='x', y='y', hue='hue', data=data)
ax_params('xlabel', 'ylabel', '1st Box Plot Title')
sns.boxplot(x='x', y='y', hue='hue', data=data)
ax_params('xlabel', 'ylabel', '2nd Box Plot Title')

Write Codes for Dynamic Use

As the previous example shows, the function also supports other use cases, such as giving the axes spines a different color by passing in the parameter c, or setting the visualization parameters for subplot axes by passing in ax.
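For instance, passing ax lets one helper style every panel of a subplot grid. A minimal sketch, using a trimmed-down version of the helper so the snippet runs on its own (the data here is made up for illustration):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt

def ax_params(xlabel, ylabel, plt_title=None, ax=None, c='k'):
    # Trimmed-down version of the helper above, using Axes methods directly
    if ax is None:
        ax = plt.gca()
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(plt_title)
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['left'].set_color(c)
    ax.spines['bottom'].set_color(c)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot([1, 2, 3], [1, 4, 9])
ax_params('x', 'y', 'Left Panel', ax=axes[0])
axes[1].plot([1, 2, 3], [9, 4, 1])
ax_params('x', 'y', 'Right Panel', ax=axes[1], c='gray')
```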

Making small changes to a custom function also fits well with agile programming principles and version control. If I add a new input parameter, savefig, that controls whether to save the figure, the three plotting calls above behave exactly the same until savefig=True is passed in.

def ax_params(xlabel, ylabel, plt_title=None, ax=None, legend_title=None, c='k', savefig=False):
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    plt.title(plt_title)
    if legend_title:
        plt.legend(title=legend_title)
    if ax is None:
        ax = plt.gca()
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['left'].set_color(c)
    ax.spines['bottom'].set_color(c)
    if savefig:
        plt.gcf().savefig(f'{plt_title}.png', bbox_inches='tight')

Python’s “time” module

For testing and comparing how fast code runs, Python has a convenient “time” module.

The following example uses time.time() to test how long a block of code takes to execute (adapted from Better Programming):

import time

start_time = time.time()
a, b = 5, 10
c = a + b
end_time = time.time()
time_taken = (end_time - start_time) * (10**6)
print("Time taken in micro_seconds:", time_taken)
# Time taken in micro_seconds: 39.577484130859375

# Testing random_array of 1000 numbers from random vs numpy.random
import random
import numpy as np

def random_array(N):
    num_array = [random.randint(-N, N) for i in range(N)]
    return num_array

start = time.time()
random_array(1000)
time_taken_random = (time.time() - start) * (10**3)

start = time.time()
np_random_array = np.random.randint(1000, size=1000)
time_taken_np_random = (time.time() - start) * (10**3)

print("Time taken for random in milli_seconds:", time_taken_random)
print("Time taken for np_random in milli_seconds:", time_taken_np_random)
# Time taken for random in milli_seconds: 1.6129016876220703
# Time taken for np_random in milli_seconds: 0.1647472381591797
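For more reliable measurements than a single time.time() pair, the standard library’s timeit module runs a statement many times and uses the highest-resolution clock available. A quick sketch repeating the same comparison (the exact timings will vary by machine, so no expected output is shown):

```python
import random
import timeit

def random_array(N):
    return [random.randint(-N, N) for i in range(N)]

# Run each statement 100 times and report the total elapsed seconds
t_random = timeit.timeit(lambda: random_array(1000), number=100)
t_numpy = timeit.timeit(
    'np.random.randint(1000, size=1000)',
    setup='import numpy as np',
    number=100,
)
print(f'random:       {t_random:.4f}s for 100 runs')
print(f'numpy.random: {t_numpy:.4f}s for 100 runs')
```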

time.sleep() in the time module is especially useful when you are getting data from an API that limits the number of calls you can make, not only per day but also per minute or per second. For example:

  • data.gov: “Limits are placed on the number of API requests you may make using your API key. […] If you made 500 requests at 10:15 AM and 500 requests at 10:25 AM, your API key would become temporarily blocked. This temporary block of your API key would cease at 11:15 AM, at which point you could make 500 requests. At 11:25 AM, you could then make another 500 requests.”
  • FoodData Central: “FoodData Central currently limits the number of API requests to a default rate of 3,600 requests per hour per IP address, as this is adequate for most applications. Exceeding this limit will cause the API key to be temporarily blocked for 1 hour”.
  • Foursquare: “The default hourly limit is 500 requests per hour per set of endpoints per authenticated user.”

This is especially useful when there’s a per-second rate limit, as with the Shopify API: “To avoid being throttled, you can build your app to average two requests per second.”
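A simple way to stay under a per-second limit is to sleep between calls. A minimal sketch, assuming a limit of two requests per second; fetch() here is a hypothetical stand-in for a real API call, and the URLs are made up:

```python
import time

def fetch(url):
    # Stand-in for a real API call, e.g. requests.get(url).json()
    return {'url': url}

def fetch_throttled(urls, calls_per_second=2):
    """Fetch each URL, pausing so we average `calls_per_second` requests."""
    delay = 1.0 / calls_per_second
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)  # stay under the rate limit
    return results

pages = fetch_throttled([f'https://api.example.com/page/{i}' for i in range(3)])
```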

When gathering data with Selenium, especially from interactive webpages, time.sleep() is also very useful, as can be seen below. The following code is the portion that fills a form with dates and automates the clicks.

Screen recording of pausing execution in Selenium.
for date in ['2/8/2016-2/15/2016', '2/15/2016-2/28/2016']:
    date_input = browser.find_element_by_name('FmFm2_Ctrl3')
    date_input.clear()
    time.sleep(2)
    date_input.send_keys(Keys.DELETE)
    date_input.send_keys(date)
    time.sleep(5)
    no_results = browser.find_element_by_id('m_ucSearchButtons')
    no_results = int(re.search(r'\d+', no_results.text).group())
    date_input.send_keys(Keys.RETURN)
    time.sleep(2)
    results_pagination(browser, no_results)
    time.sleep(1)
    browser.find_element_by_id('m_lbReviseSearch').click()
    time.sleep(5)

Other Python libraries worth looking into are Pool from multiprocessing for parallel processing, and Python’s generator yield, which is especially useful for splitting a large dataset to train in batches.
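A quick sketch of both; the square function and the batches generator are my own illustrative names, not from any library:

```python
from multiprocessing import Pool

def square(x):
    # Top-level function so it can be pickled and sent to worker processes
    return x * x

def batches(data, batch_size):
    """Yield successive batches of `batch_size` items from `data`."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

if __name__ == '__main__':
    # Parallel map across 2 worker processes
    with Pool(2) as pool:
        squares = pool.map(square, range(10))
    print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

    # Lazily iterate over a "large" dataset in batches of 4
    batch_sizes = [len(b) for b in batches(list(range(10)), 4)]
    print(batch_sizes)  # [4, 4, 2]
```

The generator never materializes all the batches at once, which is what makes it handy for feeding a model more data than fits in memory.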
