Data Scientist with a hint of…

Software developer.

Dr. Nel
Data Sciency Things
3 min read · Aug 1, 2016

I started my job as a Data Scientist at Soteria (Security Consulting & Data Analytics) in Charleston, S.C. a few months ago, and I have to say I love it. My job (as is the case for most data scientists) is automating analytical processes. After I was briefed on what the project entailed and consulted with analysts about the features they believed to be relevant, I dove into writing the code that would extract those features. (I should mention that I program mainly in Python.)

When I first had my code running, I had just one file with everything in it. After running it a couple of times and realizing how computationally expensive it was, I decided to go back and start rebuilding it with better coding practices. While I felt really comfortable with my machine learning and data visualization toolbox, I have to admit that I hadn’t done a whole lot of object-oriented programming since my days of C++ back in undergrad.

So here are a few things I (re)learned about software development:

(Note: These are all things that you probably knew already.)

Divide things into modules and classes

Grouping functions with similar inputs into the same class, and initializing those shared values once in __init__, helped reduce the computational time significantly. Splitting the code into modules also proved helpful for debugging, since it allowed me to pinpoint the line where an error had occurred… which brings me to the next thing I learned.
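As a minimal sketch of that idea (the class, method names, and input text are hypothetical, not the actual project code), shared values can be computed once in __init__ and reused by every feature function:

```python
import re


class FeatureExtractor:
    """Groups feature functions that all work on the same input text."""

    def __init__(self, text):
        # Shared values are computed once here instead of inside every function.
        self.text = text
        self.tokens = re.findall(r"[a-z0-9']+", text.lower())

    def token_count(self):
        # Reuses the tokens prepared in __init__.
        return len(self.tokens)

    def unique_ratio(self):
        # Fraction of tokens that are distinct; 0.0 for empty input.
        if not self.tokens:
            return 0.0
        return len(set(self.tokens)) / len(self.tokens)


extractor = FeatureExtractor("The quick brown fox jumps over the lazy dog")
features = {
    "token_count": extractor.token_count(),
    "unique_ratio": extractor.unique_ratio(),
}
```

Here the tokenization runs a single time no matter how many feature methods are called afterwards.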

Gotta catch them all (errors)

I learned Python on my own, so I am very patient when it comes to debugging. The problem is that when your code has so many moving parts, it is kind of hard to read errors from the command line (or a Jupyter notebook). To make it easier, I added a try and except so that whenever a step fails, the details of the failure are written to a log file.
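One way this could look, using the standard logging module (the log path, the safe_run helper, and the parse_ratio function are all assumptions made up for illustration):

```python
import logging
import os
import tempfile

# Hypothetical log location; a real project would pick its own path.
LOG_PATH = os.path.join(tempfile.gettempdir(), "pipeline_errors.log")

logger = logging.getLogger("pipeline")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(LOG_PATH, mode="w")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)


def safe_run(step_name, func, *args, **kwargs):
    """Run one step; on failure, log the traceback and return None."""
    try:
        return func(*args, **kwargs)
    except Exception:
        # logger.exception records the message plus the full traceback.
        logger.exception("Step %r failed with args=%r", step_name, args)
        return None


def parse_ratio(value):
    # Hypothetical feature function that fails on bad input.
    return 1 / float(value)


ok = safe_run("parse_ratio", parse_ratio, "4")
failed = safe_run("parse_ratio", parse_ratio, "0")  # division by zero is logged
```

Returning None on failure keeps the rest of the run going; whether that is the right choice depends on how much the later steps rely on the missing value.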

Keep track of runtime

By creating a log to keep track of the amount of data processed and the time it took to complete each individual task, I will soon be able to model future performance and scalability. (A very data sciency thing to do.)
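A rough sketch of such a runtime log, assuming each task is a function applied to a sized collection (the timed_task helper and the column names are hypothetical):

```python
import csv
import io
import time


def timed_task(name, func, data, log_rows):
    """Run func(data) and append (task, item count, seconds) to log_rows."""
    start = time.perf_counter()
    result = func(data)
    elapsed = time.perf_counter() - start
    log_rows.append({"task": name, "n_items": len(data), "seconds": elapsed})
    return result


runtime_log = []
squares = timed_task("square", lambda xs: [x * x for x in xs], list(range(1000)), runtime_log)
total = timed_task("sum", sum, squares, runtime_log)

# Write the log as CSV so it can be loaded later to model runtime vs. data size.
buffer = io.StringIO()  # stands in for a real log file
writer = csv.DictWriter(buffer, fieldnames=["task", "n_items", "seconds"])
writer.writeheader()
writer.writerows(runtime_log)
csv_text = buffer.getvalue()
```

With enough rows collected across different data sizes, a simple regression of seconds on n_items per task would give a first estimate of how the pipeline scales.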

Prevent memory leakage

While monitoring my software’s performance with Glances and htop, I noticed that one of my modules in particular was causing memory issues. After sitting down with a few members of our team and reviewing my code, I went back and redesigned certain functions to make fewer variable assignments. I also made sure that every computation would be carried out only once within the class, and that if the value was needed again later, it would be stored as an attribute on self.
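The "compute once, keep it on self" pattern might be sketched like this (the SeriesStats class is hypothetical, and the counter exists only to show that the mean is computed a single time):

```python
class SeriesStats:
    """Computes the mean once and reuses it across methods."""

    def __init__(self, values):
        self.values = list(values)
        self._mean = None           # cache slot, filled on first use
        self.mean_computations = 0  # counter kept only for demonstration

    @property
    def mean(self):
        if self._mean is None:
            self.mean_computations += 1
            self._mean = sum(self.values) / len(self.values)
        return self._mean

    def variance(self):
        # Population variance, reusing the cached mean.
        m = self.mean
        return sum((v - m) ** 2 for v in self.values) / len(self.values)

    def centered(self):
        # Values shifted by the cached mean; no recomputation happens here.
        m = self.mean
        return [v - m for v in self.values]


stats = SeriesStats([2, 4, 4, 4, 5, 5, 7, 9])
var = stats.variance()
centered = stats.centered()
```

This mainly avoids repeated work and duplicate temporary objects; it is not a cure for every memory problem, and caching large values on self trades memory for speed.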

Take care of your dependencies

After migrating computers twice, because I needed more power, and installing all of my packages again, I realized I had to come up with a better system in case I needed to move my code again. What better way to do that than creating a Python package and declaring all of its requirements? Now I can just ‘pip install’ it and BAM! We are ready to go.
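A minimal setup.py along these lines could declare the requirements. The package name, version, and dependency list below are placeholders, since the post does not say which packages the project actually uses:

```python
# setup.py (sketch; every name and version here is a placeholder)
from setuptools import find_packages, setup

setup(
    name="feature_pipeline",   # hypothetical package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # Assumed dependencies, listed only as examples.
        "numpy",
        "pandas",
    ],
)
```

With a file like this at the project root, running `pip install .` on a new machine installs the package together with everything listed in install_requires.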

I’m still learning more and more about best practices, and I’m extremely excited about it. I have to thank the phenomenal group of developers I have by my side at Soteria for their advice.

Comment below and tell me how you like to structure your code and which best practices you tend to follow.

Cheers,

Dr. Nel

Like this post? Press the heart shape button just below to recommend it and/or share via social media. Feel free to highlight any part of it you like and/or leave me comments on it.

Dr. Nel
Data Sciency Things

AI Scientist / Quantum Topologist bouncing around DC, NYC and Miami