
Data Engineer in the Bay Area

A few weeks ago, I had a bad day at work. I was dealing with a bug that appeared completely intractable, and a deadline was looming. The longer I stared at my screen, the more opaque the bug seemed, and the more stressed I felt as the clock ticked onward. I stayed up past midnight, changing code, running and re-running tests, probably re-implementing the same faulty logic multiple times before finally giving up and closing my laptop, defeated. I felt incompetent and convinced that this was a sign that I wasn’t cut out for the field; I couldn’t even fix…


I have been working as a data engineer for the past three years, and one thing that I have noticed is that there is a distinct lack of readily available resources for preparing for data engineering interviews. This is probably due in part to the fact that data engineering as a field is not particularly well defined; the role varies from company to company and domain to domain, sometimes tracking more closely with software engineering and sometimes more with data science or analytics.

Well, never fear! The following is a brief list of the resources that I think can be most helpful…


Clean Code in Python by Mariano Anaya

Okay, so the “once monthly” thing didn’t work out. Is there a succinct way to say “once every 6 (8?) months (or so)”?

Coding, generally, is science more than art; but the distinction between the two blurs when we start talking about the fundamentals of coding style. The author of Clean Code in Python, Mariano Anaya, prefaces his book with a caveat: “The author does not claim to be any sort of authority on matter of clean code, because such a title cannot [possibly] exist.”¹ …


Another day at the workstation!

This article contains a short tutorial for setting up a recurring Python script with Apache Oozie. Oozie is a workflow scheduler for Apache Hadoop that allows you to define and schedule two types of jobs: standalone “workflow” jobs, which will run when explicitly triggered, and “coordinator” jobs, which are set up to run at defined time intervals — think cron job, but highly available. Note that while Oozie doesn’t explicitly support Python, it DOES support shell actions, which gives us a handy workaround for running Python scripts.

Please note that this tutorial assumes you have Oozie up and running on…


Tech Book Talk (the new #tbt) is a monthly (or at least, that’s the plan…) series where I will review a technical book or text I’ve finished in the last month. Suggestions welcome!

Instantly recognizable by its wild boar cover!

The book I selected for July 2019 is Designing Data-Intensive Applications by Martin Kleppmann. This is a tome in the O’Reilly family of technical texts, and therefore I had high hopes. Spoiler alert: It didn’t disappoint.

Who should read it: Professional software engineers and computer programmers, especially those starting out in the field (yours truly) or those transitioning from single-node to distributed systems.

Why it’s great: The…


6:05 AM: Alarm jolts you abruptly away from that REM thing that’s so important. Maybe you should buy the sleep tracking ring your roommate keeps raving about

6:15 AM: Finish your morning scroll, taking note of how much more productive, tan and talented the people you follow on Instagram seem than you

6:53 AM: Realize in a panic that you have 7 minutes in which to dress, eat, pack and be out the door in time to catch the last BART that will arrive at your destination in time to catch the last company shuttle

7:15 AM: Board said…


Image from Zeppelin Assets

Apache Zeppelin is a relatively new open-source tool for data visualization. Like Jupyter notebooks, it allows you to work on scripts in a modular, collaborative fashion while supporting in-browser data visualization; unlike Jupyter, it has support for multiple languages across cells, allowing you to switch freely between Python, Scala, R, Markdown and many others, all within one notebook. It is also uniquely suited for working with Apache Spark, allowing you to bring in large dataframes for analysis.

However, Zeppelin has a few of its own issues, and one of the most common that I’ve encountered is…


A while ago, I started getting interested in the Ethereum platform and its native language, Solidity. After a bit of trial and error, I was able to set up and test a simple smart contract on my Windows 10 machine. I share my process (and compile some of the resources I found most helpful) here.

from ethereum.org

Ethereum allows developers to write and deploy custom smart contracts in Solidity — distributed applications that, when propelled by Ether “gas”, can produce a result which is propagated on the blockchain. …
