Create a “Go-To” File

Anjali Sunil Khushalani
2 min read · Oct 16, 2019

If you read my bio, you would realize I have very recently transitioned to learning programming.

This is literally where I started
#Relatable

So, to make my life easier, I created a “Go-To” file for Python, R, and SQL. The logic was simple: less googling and easier work. But don’t stop googling your doubts, because when you search for one thing you tend to learn several others!

I highly recommend creating a similar file for yourself with the code you typically use or just want in one place. This script is aimed at new Python users and is by far my favorite script.

Some of the Code covered:

  1. Convert a JSON File to Pandas
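import pandas as pd

# Convert a JSON Lines file to a pandas DataFrame
data = pd.read_json("data.json", lines=True)
data.head(10)
df_gd = pd.DataFrame(data)
# Format the data as a list of dict objects
data_dict = [{'data': x} for x in df_gd['data']]
# Create a DataFrame with json_normalize
# (pd.json_normalize in pandas 1.0+; formerly pd.io.json.json_normalize)
data_df = pd.json_normalize(data_dict)
# Convert objects to a numeric datatype if possible
# (pd.to_numeric replaces convert_objects, which was removed in pandas 1.0)
for col in data_df.columns:
    try:
        data_df[col] = pd.to_numeric(data_df[col])
    except (ValueError, TypeError):
        pass  # leave non-numeric columns unchanged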
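
To try the steps above end to end without a file on disk, here is a minimal sketch that reads a small in-memory JSON Lines string. The sample records and the field names (`spd`, `brake`, `driver`) are made up for illustration, and `pd.json_normalize` assumes pandas 1.0 or later.

```python
import io

import pandas as pd

# Hypothetical sample: one JSON object per line, each with a nested "data" field
raw = (
    '{"data": {"spd": "101.5", "brake": "0", "driver": {"name": "A"}}}\n'
    '{"data": {"spd": "98.2", "brake": "1", "driver": {"name": "B"}}}\n'
)

# lines=True tells pandas that every line is a separate JSON record
data = pd.read_json(io.StringIO(raw), lines=True)

# Flatten the nested dicts; nested keys become dotted column names
data_df = pd.json_normalize(data["data"].tolist())

# Convert the string columns that hold numbers
data_df["spd"] = pd.to_numeric(data_df["spd"])
data_df["brake"] = pd.to_numeric(data_df["brake"])

print(data_df.columns.tolist())
print(data_df.dtypes)
```

Printing the dtypes is a quick way to confirm that the numeric columns are no longer stored as generic objects.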

2. Modified Z Score Calculation: I have used this to determine outliers. An outlier is an observation that deviates markedly from the other observations. There are multiple ways to determine an outlier, but the most commonly used technique is a simple box plot. In my “Go-To” file, I have included all the typical techniques:

  • Simple box plot
# Considering df is my pandas DataFrame
boxplot = df.boxplot(column=['spd', 'brake'])
  • Modified Z Score

The modified z score is a standardized score that measures outlier strength. It approximates, in standard deviation units, how far a value lies from the median.

The modified z score is considered more robust than the z score because the median and the median absolute deviation are less affected by extreme values than the mean and standard deviation.

import numpy as np

def modified_z_score(ys):
    median_y = np.median(ys)
    # Median absolute deviation (MAD) around the median
    median_absolute_deviation_y = np.median([np.abs(y - median_y) for y in ys])
    modified_z_scores = [0.6745 * (y - median_y) / median_absolute_deviation_y
                         for y in ys]
    return np.abs(modified_z_scores)
  • Z Score
def outliers_z_score(data):
    mean_y = np.mean(data)
    stdev_y = np.std(data)
    z_scores = [(y - mean_y) / stdev_y for y in data]
    return np.abs(z_scores)
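
As a usage sketch, the scores can be turned into outlier flags by comparing them with a cutoff. The vectorized function below computes the same modified z score as above with NumPy arrays. The cutoff of 3.5 is a commonly cited rule of thumb (Iglewicz and Hoaglin), not something fixed by the method, and the sample speeds are invented.

```python
import numpy as np


def modified_z_scores(values):
    """Return absolute modified z scores: 0.6745 * |x - median| / MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # At least half the values equal the median, so the score is undefined
        raise ValueError("MAD is zero; modified z score is undefined")
    return np.abs(0.6745 * (values - median) / mad)


speeds = [10.0, 11.0, 12.0, 11.0, 10.0, 12.0, 11.0, 50.0]  # invented data
scores = modified_z_scores(speeds)

# Keep the values whose score exceeds the assumed cutoff of 3.5
outliers = [x for x, s in zip(speeds, scores) if s > 3.5]
print(outliers)
```

The explicit check for a zero MAD avoids a silent division by zero when most values are identical.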

3. Access a database from Python

import pandas as pd
import pyodbc

# Production DB (placeholder values)
server = 'server'
database = 'name'
username = 'username'
password = 'password'
driver = 'ODBC Driver 17 for SQL Server'


def connect_sql_server(server, database, username, password, driver):
    # Build the ODBC connection string from its parts
    conn_string = (
        'DRIVER={' + driver + '};SERVER=' + server + ';DATABASE=' + database
        + ';Uid=' + username + ';pwd=' + password + ';'
    )
    conn = pyodbc.connect(conn_string, autocommit=True)
    return conn


# Manage connections for pandas and pyodbc
cnxn = connect_sql_server(server, database, username, password, driver)
# SQL
sql1 = 'Select * from tablename'
data = pd.read_sql(sql1, cnxn)
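
The same read-into-pandas pattern can be tried without a SQL Server instance. The sketch below swaps in Python's built-in `sqlite3` module and an in-memory database purely for illustration; the table `laps` and its columns are hypothetical. It also passes the filter value through `params` instead of building the SQL string by hand.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the production server (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE laps (driver TEXT, lap_tm REAL)")
conn.executemany(
    "INSERT INTO laps (driver, lap_tm) VALUES (?, ?)",
    [("A", 91.2), ("B", 89.7), ("A", 90.4)],
)
conn.commit()

# Parameterized query: the driver value is passed separately from the SQL text
sql = "SELECT driver, lap_tm FROM laps WHERE driver = ? ORDER BY lap_tm"
data = pd.read_sql(sql, conn, params=("A",))
conn.close()

print(data)
```

Keeping values out of the SQL text is a good habit to carry over to the pyodbc connection, which uses the same `?` placeholder style.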

4. Pair Plots

This is my favorite kind of plot, as it's easy to work with and simple to understand.

import matplotlib.pyplot as plt
import seaborn as sns

# Pairplot (newdf2 is my pandas DataFrame)
sns.set()
plotdata = newdf2[['spd', 'brake', 'throttle', 'lap_tm']]
sns.pairplot(plotdata, height=3.0)
plt.show()

The complete script is here: GitHub.

Follow me on Medium to be part of my Data Science Learning journey!

Anjali Sunil Khushalani

I am a Data and Analytics consultant at Ernst & Young. In my previous life, I was a dentist, and I am thrilled to share my journey on Medium.