A Data Analyst’s Crash Course on Software Development

Jamesmf
May 3, 2018


“Data People” come from all sorts of backgrounds: engineering, natural or social sciences, finance, marketing, and many more. One of the most common ways to end up working with data for a living is simply transitioning from another role in a data-heavy field — and what field isn’t data-heavy these days? The result is that people working in data analysis — and for the purposes of this post, I will use ‘analysis’ as a term to cover things like data science, data analysis, data application development, and reporting — have very different comfort levels with the various technologies involved. Notably, many data analysts I’ve met aren’t very familiar with software development processes, despite the fact that the concepts have broad applicability to data work.

I personally found many things that a software developer would shrug at to be quite intimidating. But after a lot of reading, talking with developers, thinking, and eventually prototyping, I have learned a lot about the general software development process. Hopefully this post can serve as a data analyst's crash course on development and a catalyst for applying the best that the software development process has to offer to your data work.

While reporting, analysis, and building data products differ in key ways from software development, they share important commonalities, and insights from each can help with the others. Software development workflows differ from industry to industry, but at their root they usually comprise writing source code, building the source into deployable artifacts, testing the built application, then deploying it (possibly for direct front-end customer use, possibly as a downloadable package or binary). Each of these steps has its own tools and best practices, and each has its parallel in data analysis.

The insight that makes this comparison worthwhile is that the motivations in software development shops are the same as in data analysis: to deliver a product that meets specifications and to do so on-time and under-budget. Any organization whose employees program daily will quickly find itself facing large and complex codebases, so processes must be put in place that make it safe to change code without breaking anything. To avoid stagnation, time-intensive tasks need to be automated to keep employees productive and innovative. Failing to do so will leave programmers mired in maintaining poorly structured legacy code.

Every process or tool outlined here has to be evaluated for its expected value, as they have varying costs and uncertain benefits. Small projects might not need much formality, as they might remain so simple that there is barely any risk of a mistake making it to production. Similarly, there may be projects where the cost of a production mistake is small. But for higher-priority and more complex projects, the concepts outlined in this post help to lower both the probability of production errors and their cost. Furthermore, as a development process matures, shared best practices and frameworks hopefully emerge, lowering the cost of implementation and compliance. The ultimate goal is to make it as easy and fast as possible to build products that are accurate and reliable.

The formula below is an oversimplified way to think about this tradeoff:
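
Roughly, in expected-value terms (one plausible way to write it; the exact symbols matter less than the shape of the tradeoff):

$$\text{expected net value} \approx V_{\text{product}} - C_{\text{implementation}} - P(\text{mistake}) \times C_{\text{mistake}}$$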

The value of the product and the cost of a mistake are likely fixed once requirements are set, but the implementation cost and probability of a mistake are inversely related; you can add extra safeguards to ensure you make fewer mistakes, but it will raise implementation costs (particularly up-front developer time). At a certain level of complexity, the probability of a mistake gets high enough that it’s worth extra implementation cost to drive down the number of mistakes (which can cost you developer time, customer time, and often real dollars).

A few caveats for what follows: much of my experience is working in python, and I have worked at a few organizations that used primarily SAS. My use of R has been limited to academic work. I use git and haven’t used CI tools other than Jenkins. This has likely influenced a lot of my thinking, but hopefully these concepts apply to other development/analysis settings as well. The generalizations I make regarding how software development works are intentionally broad, so if you already know far more about any piece than I cover here, more power to you! You might not need this post, and I hope I don’t offend your developer sensibilities.

Writing Source Code

This step is familiar to analysts, as a lot of programming goes into it. To begin, a developer or team will think through the structure of the application thoroughly before implementing anything. Often a developer specializes in one domain while other teams handle other areas (front-end web development vs. back-end is an example of very broad specialization), so they may have a narrow scope of the application to consider, and they only need to worry about how their code will interface with the other modules. Once the team(s) involved have decided on a design for the features that need to be built for the next release, they will write tests to check whether the code works. To start, it obviously won't — they haven't written any new code yet. Writing your tests first is called test driven development and can be useful for a few reasons. The most critical is that in order to write your tests you have to think through your implementation carefully — which will usually help you catch any inconsistencies in your design. It also forces you to think about exactly what data you expect at each point in your process and requires you to break your code up into logical, testable units (modularization). The developer then implements the design until it passes all of the tests they've written (more on running the tests later). When several developers are working together, they usually use a source control manager to collaborate.
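
To make the test-first idea concrete, here is a minimal sketch in python using pytest. The `summarize_visits` helper and its behavior are invented for illustration, and the tests and implementation are combined in one snippet only so it is self-contained; in a real project they would live in separate files.

```python
# A toy illustration of the test-first loop (pytest style).
# The helper and its behavior are hypothetical.
from collections import Counter


def summarize_visits(visits):
    """Count visits per member_id (written second, to satisfy the tests)."""
    return dict(Counter(v["member_id"] for v in visits))


def test_summarize_visits_counts_per_member():
    visits = [
        {"member_id": 1, "date": "2018-01-05"},
        {"member_id": 1, "date": "2018-02-10"},
        {"member_id": 2, "date": "2018-01-20"},
    ]
    assert summarize_visits(visits) == {1: 2, 2: 1}


def test_summarize_visits_handles_empty_input():
    # thinking through edge cases up front is half the value of writing tests first
    assert summarize_visits([]) == {}
```

Running `pytest` before the helper exists fails, as it should; the feature is done when the suite passes.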

What’s Different in Analysis?

This step is similar for analysis and software development. But one notable difference is that often most of the 'logic' an analyst needs to know and use lives in the database, in the form of relations, object and schema definitions, etc. This is in contrast to developing in an object-oriented language, where there's a good chance you're interacting with application logic written in the same language or framework. The answer to this is a data-access layer to decouple your data retrieval logic from your business logic. A data-access layer is roughly a package or module in the language you're developing in that translates database logic (rows/columns/relations) into objects in that language, or at least standardizes requests and returns. This lets you take advantage of higher levels of abstraction, which are critical to designing maintainable, scalable codebases. In 'multilingual' organizations, a data access layer might exist for each language (R, python, javascript), but each should provide a very similar API to encourage developers to think in terms of the same abstractions.
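
As a rough sketch of what such a layer might look like in python (the table, column, and function names here are all invented; the point is the shape, not the specifics):

```python
# data_access.py -- a minimal data-access layer sketch (all names are hypothetical)
from dataclasses import dataclass


@dataclass
class Referral:
    member_id: int
    referral_date: str
    specialty: str


def get_referrals(conn, start_date):
    """Translate rows from a referrals table into Referral objects.

    `conn` is a DB-API connection (the ? placeholder assumes sqlite3-style
    paramstyle). Business logic elsewhere works with Referral objects and
    never needs to know the table layout or the SQL dialect.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT member_id, referral_date, specialty"
        " FROM referrals WHERE referral_date >= ?",
        (start_date,),
    )
    return [Referral(*row) for row in cur.fetchall()]
```

An R or javascript version of the layer would expose an analogous `get_referrals`, so that everyone reasons about 'referrals' rather than about joins.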

What’s the same?

Developers and analysts both should spend more time thinking than implementing. This ideally leads to elegant design and modular code. Both need to understand their data model and should be 100% explicit about what “done” and “working” look like for the next release. The developer/analyst should always think “does this logic already exist,” “will others need this logic too,” and “does my logic share commonalities with other code in our organization?” Based on the answers to those, you might want to create a package or module that you can share/reuse. If you find a package exists that does something you’ve already done, you likely will want to refactor your code to use that package. That way, when the logic gets updated, you can inherit the changes (when you’re ready to). If your code does something the other package doesn’t support then before saying “I won’t use that” try to think “can I work with the maintainer to support my desired functionality?”

What Tools Can Help?

Developers use IDEs that are aware of what language they're using and help out accordingly: syntax highlighting, linting your code, etc. Many analysts use an IDE that is so tightly integrated into their workflow that they aren't comfortable executing their code without it — which can lead to hesitance to fully automate (think SAS…). The IDE may also help create the files necessary for compiling a project and managing dependencies.

Source control (git is very common) helps developers keep track of changes to the codebase. That allows them to know exactly which change introduced a bug, what code generated what results, etc. (Aside: most "source files" are the files that the language of choice will be interpreting or compiling. A rule of thumb is that anything you store in a source control manager like git should be something you could open and edit in a simple text editor.) Ideally as much as possible is stored in source (configuration files, test data or scripts to create test data, etc.), with the exception that you should never put your secrets in source control (no hardcoding passwords)!

Assuming you use git, it is helpful to define a workflow with branching and code review. Using branches to isolate changes and group together issues into releases can immediately improve reliability.

Building

A build is different depending on the type of project, but the goal is to go from source code (as of a certain commit) to something executable that will not change from that point forward. For compiled languages, this will probably at minimum produce a .exe or .bin. For interpreted languages a build is less well defined, but the key idea is that the result of a build process should be able to be run many times and produce the same outputs. Ideally it will also work the same on any machine. Often this requires a build automation or continuous integration tool.

Continuous integration is the practice of building your code — and testing the build — every time you check a new commit into your source control. This ensures you know whether your new commit breaks anything, and it also helps you pinpoint the last time a feature was working. Note: you don't always have to build every commit, so long as you have the ability to automatically run builds from source code (either at a set time of day or based on some other trigger).

What’s Different in Analysis?

Since much analytical work is done in languages that are interpreted rather than compiled, the build step is often overlooked. This is a potential source of inconsistency, as it usually means results depend on something outside your source: your globally installed python, R, or SAS. That means that when someone else upgrades a package within the environment, a process that was working yesterday might fail today. Defining a build process for your language of choice is important, and a build file should specify everything necessary to make your code run the same way regardless of any (reasonable) changes to the environment in which it is executed. Note that the definition of reasonable varies based on your hardware and level of automation.

Coming up with this definition can be challenging, as most tools are aimed at software development pipelines. But these tools tend to be intentionally flexible, so it’s possible to architect a satisfactory build pipeline using well-supported tools.

What’s the same?

Both require total certainty about what inputs create what outputs, and in both cases that means leaving nothing that requires human input in the build process.

What Tools Can Help?

A well-defined framework like those provided by CI tools is essential for a robust build process. CI/Build Automation tools can take care of credentials/secrets at build time, they can execute your tests, and most importantly they will keep you aware of how every build differs from its predecessors by archiving results.

Package managers are incredibly important. Whether it’s python, R, or another language, you will want to be sure that at build time your code knows where to find all of its dependencies (and the proper version of each). This is critical to having code run the same way twice. It can be hard to know whether to use git submodules, versioned packages in your language of choice, or some other method, but no matter how you implement it, the build process should be what turns relative references (like “this code depends on the release branch of repository X”) into absolute references (like “this code depends on commit Y of repository X”).
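
One small illustration of the relative-to-absolute idea, sketched in python (a real project would lean on pip, conda, or a proper lock file; the package names below are placeholders and are assumed to be installed):

```python
# freeze_versions.py -- record the exact dependency versions a build ran against
# (a sketch only; pip/conda lock files do this job properly)
from importlib.metadata import version  # python 3.8+

# the "relative" declaration: names only, no versions (placeholder packages)
DECLARED = ["requests", "pandas"]

if __name__ == "__main__":
    # the "absolute" record: pin whatever versions are present at build time
    pinned = [f"{name}=={version(name)}" for name in DECLARED]
    with open("build-versions.txt", "w") as handle:
        handle.write("\n".join(pinned) + "\n")
    print("\n".join(pinned))
```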

Testing

Testing is a cornerstone of modern development. Whether web or application development, there are unit tests, integration tests and user acceptance tests (and maybe more) to consider. This sounds like a lot of work, but the idea is that most modern software is so complex and has so many moving parts that it costs more not to have good test coverage. Essential to turnaround time is the ability to test a change and know it does what you think it does without having to put it in front of a user to look for bugs (though you might also regularly want user testing).

In addition, tests force you to think through your code thoroughly. An inelegant design pattern will likely reveal its flaws when you sit down to write tests for it. You will see inconsistencies and duplication of logic. You will think of edge cases and consider upstream data integrity issues. As a result you’ll know exactly how you expect your program to behave (‘expect’ is the operative word there). You won’t think of everything, but when new bugs surface you’ll add them to the test suite. Then when your application is mature, you’ll be testing for many pitfalls and you will know if you ever reintroduce those bugs.

What’s Different in Analysis?

Our test cases almost all involve databases (or other means of receiving/streaming data), whereas many developers are testing objects explicitly defined and instantiated in the program they're testing. This can be intimidating for an analyst, as testing lots of database logic can quickly get complex. But using a package with good testing abstractions can make it much easier. At a bare minimum, tests that validate the output of your process can protect you from sending out invalid results or populating tables with bad data. In addition, we can use small amounts of carefully crafted synthetic data for testing our logic, then use data in a development environment for testing performance. Here is one simple way to conceptualize testing (this is far from the only way to organize tests!), with a small sketch after the list:

  1. Post-build unit and integration tests: Load fake data into an actual (non-prod) database/datasource and execute your unit tests pointing to that source. Your test data should check key edge cases and make sure your datasets look the way you expect at each point.
  2. Pre-run validation tests: These tests ensure that the assumptions baked into your logic hold true in your input datasource. They shouldn't be too intensive, but they keep you from producing garbage if a change upstream allows 'illegal' data into your input. A mature data access layer might make these redundant.
  3. Post-run validation tests: After you run your program, you will likely benefit from ensuring a few key assumptions hold: you don’t have null data, no data got rejected during an insert, etc. These can help you address any issues you couldn’t have foreseen in your pre-run validation.
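
A minimal sketch of the first two flavors, using an in-memory sqlite database as the stand-in (non-prod) datasource. The schema, data, and rules here are invented for illustration.

```python
# test_referral_pipeline.py -- a sketch of post-build and pre-run checks
# (pytest; the schema, data, and rules are hypothetical)
import sqlite3

import pytest


@pytest.fixture
def fake_db():
    # load a small amount of hand-crafted data into a real but throwaway datasource
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE referrals (member_id INT, referral_date TEXT, specialty TEXT)"
    )
    conn.executemany(
        "INSERT INTO referrals VALUES (?, ?, ?)",
        [
            (1, "2018-01-05", "cardiology"),
            (1, "2018-02-10", "cardiology"),
            (2, "2018-01-20", None),  # deliberate edge case: missing specialty
        ],
    )
    return conn


def test_member_referral_counts(fake_db):
    # 1. post-build unit test: does the logic do what we think it does?
    (count,) = fake_db.execute(
        "SELECT COUNT(*) FROM referrals WHERE member_id = 1"
    ).fetchone()
    assert count == 2


def test_no_future_referral_dates(fake_db):
    # 2. pre-run validation: an assumption the downstream logic bakes in
    (bad,) = fake_db.execute(
        "SELECT COUNT(*) FROM referrals WHERE referral_date > date('now')"
    ).fetchone()
    assert bad == 0
```

The post-run flavor looks much the same, but points at the tables your program just wrote instead of the fake input.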

What’s the same?

Developers have to test processes with database inputs as well. These can be mocked or faked, or you can use a real database connection to a reliable, non-production database. Using a real database with fake data is a viable option, but many languages offer good ways to mock a database connection. Mocking may take a little while to learn and get used to, but it provides high reliability: you don't need to concern yourself with spinning up or connecting to a real database or with inserting the fake data. However, using a real database lets you test actual queries to ensure they really work the way you think they should.
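
As a sketch of the mocking route, using python's standard library `unittest.mock` (it assumes a `get_referrals(conn, start_date)` helper shaped like the hypothetical data-access sketch earlier):

```python
# test_with_mock.py -- mocking the connection instead of standing up a database
# (a sketch; assumes the hypothetical data_access.get_referrals from earlier)
from unittest.mock import MagicMock

from data_access import Referral, get_referrals


def test_get_referrals_maps_rows_to_objects():
    conn = MagicMock()
    # the mock plays the database's part: we script what the cursor returns
    conn.cursor.return_value.fetchall.return_value = [
        (1, "2018-01-05", "cardiology"),
    ]

    result = get_referrals(conn, "2018-01-01")

    assert result == [Referral(1, "2018-01-05", "cardiology")]
    # and we can check the query was issued exactly once
    conn.cursor.return_value.execute.assert_called_once()
```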

What Tools Can Help?

Many languages have canonical testing frameworks, and common output formats like junit xml can help them work together. Abstraction is key here, so it can be difficult to think through testing in the context of a non-object-oriented language. As shops mature they likely recycle their tests and abstractions, which can lead to internal testing frameworks and default test suites.

Deployment

Deploying an application is probably simplest to define for web applications. A deployment occurs when a given build is made available by hosting the web service somewhere. Builds are likely at least deployed to a production and a development environment, as well as often a user acceptance testing (UAT) environment. Each of these serves a different purpose.

  • Production is the product for the customer.
  • Development is where changes are made and automatically tested.
  • UAT/staging is the environment in which a test-passing build gets put in front of internal testers.

What’s Different in Analysis?

In data analysis and reporting, there may not be a need for UAT exactly, but there are still elements that might not be easy to implement as automated tests. For instance we might want to see how a report looks on a phone or tablet, or we might want to ensure our customer really wants the features that have just been implemented.

Similarly, the role of the development build can be murkier unless we have a dev database that matches production. Having a development database mirroring prod (or perhaps deviating from prod only in the latest data, pulled from the development version of your application/workflow) may sound like a lot of work, particularly to those who work with PII or PHI. But it pays for itself quickly: no one runs test queries in production until they’re validated in development, you catch data issues before they interrupt your production jobs, and best of all migrating a data product from dev to prod is straightforward.

What’s the same?

In both fields there’s a lot of value to fully validating any changes you make somewhere where you don’t run production code. First of all, it allows you to safely continue to deliver your products while you are developing new features or fixing bugs. It also allows you to compare the output from your dev and prod processes to determine whether your changes have impacted the end product in the way you had hoped.

What Tools Can Help?

This is entirely organization-dependent. For some shops where reporting dominates, deployment may just mean putting something on a scheduler and posting the output to a specific sharepoint/tableau/shared drive, in which case having two 'environments' is as simple as having a second location for output and intermediate datasets. It could also be far more complex, with fully parallel processes producing and consuming data in development and production.
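
A sketch of the simple two-location version, assuming everything is routed through one setting (the paths, schema names, and environment variable are invented):

```python
# config.py -- route output to a dev or prod location based on one setting
# (a sketch; the paths, schema names, and environment variable are placeholders)
import os

ENVIRONMENTS = {
    "dev": {"output_dir": r"\\shared\reports\dev", "schema": "analytics_dev"},
    "prod": {"output_dir": r"\\shared\reports\prod", "schema": "analytics"},
}

# default to dev so nothing lands in production by accident
ENV = os.environ.get("REPORT_ENV", "dev")
settings = ENVIRONMENTS[ENV]
```

The production job sets `REPORT_ENV=prod` on the scheduler; the code itself is identical in both environments.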

While it might not be possible for every organization to use containers, they certainly can make deployment a good deal simpler. In general, any tool that helps you stand up a new, fully-configured environment with each build will help you ensure each deployment proceeds exactly as expected.

Putting It All Together

Hopefully some parts of this resonated with you or helped something click into place, but if you’re still thinking, “okay, but how does this help me?” then read on.

First of all, this post should help you remember that your default assumption should probably be that someone has already solved some version of the problem you’re currently facing. So instead of thinking, “I can’t write tests for my SQL queries” you’ll think “I wonder how other people ensure the quality of their data products” and do some open-minded research.

Ideally you’ll also come away feeling like none (or at least fewer) of the concepts here are scary or irrelevant to your work. Most of these things are an online tutorial, a prototype, and a proof of concept away from becoming a part of your workflows.

Finally, if nothing else, hopefully this post will help you be able to talk to developers and read about development tools! Data science is only as valuable as what ends up in production, so if you want to be part of a mature machine learning shop, you'll have to know how to have a conversation about testing, deploying, environments, job scheduling, etc. You might work somewhere that has perfect abstractions for its data science workflows, allowing you to ignore all of this and focus mostly on models, but that is a fairly rare thing at the moment.

To illustrate some of the concepts above, let’s walk through an example.

Dana is a data analyst at a healthcare organization and she receives a request to analyze a referral dataset to determine the timeliness of care their members receive. Dana is familiar with the member data, but the referral data has external components and is ingested by a part of IT she has no dealings with. She first checks the documentation for the internally managed package that serves as their data access layer. She finds a function that will help her pull the referrals and, using the documentation and her knowledge of the member data, extracts some data for her analysis.

Dana presents her work and gets the feedback that it would benefit from more granularity, and that it needs to become a report that runs monthly. To increase the granularity, Dana needs to pull a few more fields from the referral dataset. She reads the documentation of the data access layer and realizes she knows what would need to change in it to enable her use case. She creates an issue in their issue tracker and a branch in the repository for the data access package.

Working out of that branch, she adds a test that checks whether the function she is changing returns the finer resolution data when called with a certain parameter specified. At first the test fails (as it should) but after she updates the logic of the function accordingly, the test passes. Once she confirms the change didn’t break anything else in the test suite, she can confidently submit a pull request to get her changes merged into the data access package.

Now that her code is going to produce something with important business implications, Dana decides to create a few test cases that cover the subtler ways her logic could break, as well as a few tests that confirm the data in-database meets her program’s assumptions. The first tests are there so she can safely update her code in the future, and she’ll run them when she makes any updates to her codebase or upgrades dependencies. The second set of tests will run before her report does, and will stop the program and alert her before any code runs with broken assumptions. The final set of tests she needs are common to almost all her team’s reports, so she imports them from an internal package and uses them to validate her output. These run after her program finishes, but before any customers see the results, and they validate her output (are there duplicates or nulls, is the report empty, etc). If any of these tests fail, she will be alerted and no output will reach her customers.

With all of this done, she writes a build script that checks out her repository, creates a virtual python environment, installs dependencies, and runs her tests. If all of the tests pass, Dana can perform the build somewhere the company's scheduler can run it monthly (she could also automate that step if her workflow permits). The scheduler runs her built program every month, and any time she makes changes she follows the same process, each time building her source into the same location as a deployment step.
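
A sketch of what Dana's build script might look like using only the python standard library (the repository URL and directory names are placeholders, and in many shops a CI tool would drive these same steps):

```python
# build.py -- check out the source, create a clean environment, install, test
# (a sketch; the URL and directory names are placeholders)
import subprocess
import sys
from pathlib import Path

REPO = "https://example.org/analytics/timeliness-report.git"  # placeholder
WORKDIR = Path("build") / "timeliness-report"
VENV_PY = WORKDIR / ".venv" / "bin" / "python"  # .venv\Scripts\python.exe on Windows


def run(*cmd):
    print(">", " ".join(str(part) for part in cmd))
    subprocess.run([str(part) for part in cmd], check=True)


if __name__ == "__main__":
    if not WORKDIR.exists():
        run("git", "clone", REPO, WORKDIR)
    run(sys.executable, "-m", "venv", WORKDIR / ".venv")
    run(VENV_PY, "-m", "pip", "install", "-r", WORKDIR / "requirements.txt")
    run(VENV_PY, "-m", "pytest", WORKDIR / "tests")
    print("build ok -- safe to hand to the scheduler")
```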

Conclusion

That’s it, thanks for staying with me. I could happily write several times this amount on this subject, but hopefully this provided a good balance of breadth and depth.

Glossary and Useful Links

Benefits of an IDE:

Depending on the language you’re using, it could give you a lot! This SE answer talks about general features, but look up what IDEs offer for the analytical language you use; you might be missing features you didn’t know existed!

Test Driven Development:

There's a ton you could say about TDD, and there are an awful lot of types of tests that people often have conflicting definitions for. This is a decent overview of the types of testing, but proponents of TDD tend to like it for one of two features: it forces you to design cleanly and it lets you know when incremental changes break your logic. Detractors argue that it's costly and that automated tests can only catch so much. As with many things, it's important to strike a balance between time and value when considering how much test coverage you want.

Branching:

A quick search can turn up a few examples of how to set up a repository or project to use git branches, so I won’t link to any specific resources. I do advocate making branches a part of your day-to-day workflow as (at least for me) it greatly helps with discipline. It’s a good way to encourage code review, it helps you tie commits together logically, and it pairs nicely with CI tools.

Build Automation/Continuous Integration Tools:

Jenkins and Bamboo are a couple of CI tools that come to mind, though there are many others. In general these tools provide a framework: if you write a few commands and structure them in a certain file in your repository, the CI tool will know how to perform a build and run your tests. It will then archive all outputs (artifacts) and test results. These tools can be configured to be incredibly flexible, which is great for data analysis. For tools more focused on builds (rather than aimed primarily at CI), look at Make, Bazel, Ant, Maven, etc.

Modularization and refactoring:

Modular code is hard to define precisely, particularly with the terms 'package' and 'module' meaning different things in different languages. That said, modularization can be thought of at a high level as the process of breaking complex code into smaller logical pieces. In general this increases the ability to reuse code and makes it easier to rely on dependencies without caring about underlying implementation logic. It increases readability in most cases (as the coder doesn't need to care about implementation as long as they understand inputs and outputs), though if implemented poorly it can serve to obfuscate the code, particularly if modules or functions have unexpected side-effects.

When you have existing code that you want to reuse elsewhere for simplicity and maintainability, you might ‘refactor.’ That is the act of choosing a different design (often one that is either more modular or uses different abstractions) and re-coding the same logic using that design pattern.

Package Managers:

Depending on the language you're working in, packages/modules will work very differently. It's important to realize that there will come a time when versions matter, even if you use a proprietary language. APIs will change and code will need to be updated. You'll want to use new features. Your IT department will move to a new cloud provider. It's going to happen. So given that, you need a way to handle using different versions and configurations in different projects; you need to make those dependencies and assumptions explicit. Proprietary languages like SAS make this 'simple' by taking all your flexibility away — you install SAS X.Y and it comes with version X.Y of every package it offers. For Python and R, things have been improving rapidly, and tools like conda help enormously.

Data Access Layer:

Data access layers are critical for data analytics/data product shops. A simple definition is that a data access layer is a package/module/shared-codebase that defines the essential ways you might retrieve data from your datasources. This is crucial to being able to integrate new datasources, change backends, or unify data models across systems. Instead of each program having the logic for querying data baked in alongside the project-specific business logic, the business logic is decoupled; all data is retrieved using the data access layer. This makes it easy to inherit updates to datasource-specific logic and ensures consistency across applications in the business.

Abstractions and Interfaces:

Abstraction is a key element of designing complex systems. It is easy to think of in terms of object-oriented programming languages: custom classes allow you to define objects that have the attributes and methods you want without needing to worry about the implementation. The same is true of interfaces. To design large systems, you need subsystems that depend on only a few key pieces of other subsystems, not every detail of them. To achieve that, you define interfaces between subsystems and make explicit precisely what each one expects of the other.
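
In python, one way to make such a contract explicit is an abstract base class (a sketch; the class and method names are invented):

```python
# interfaces.py -- making the contract between subsystems explicit
# (a sketch; the names are hypothetical)
from abc import ABC, abstractmethod


class ReferralSource(ABC):
    """Everything the reporting subsystem is allowed to expect from a datasource."""

    @abstractmethod
    def referrals_since(self, start_date):
        """Return an iterable of (member_id, referral_date, specialty) records."""


class WarehouseReferralSource(ReferralSource):
    """One concrete backend; swapping warehouses means writing one new subclass."""

    def __init__(self, conn):
        self._conn = conn

    def referrals_since(self, start_date):
        cur = self._conn.cursor()
        cur.execute(
            "SELECT member_id, referral_date, specialty"
            " FROM referrals WHERE referral_date >= ?",
            (start_date,),
        )
        return cur.fetchall()
```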

Data Science, Data Analysis, Data Products:

Data science, while sexy, is so large an umbrella that it is often not useful. As a result, I tend to think of data science somewhat literally: the application of (parts of) the scientific method to data. Most critically, data science should be repeatable. It should ask and answer specific questions. Assumptions should be made explicit. If you treat your data processes with that degree of rigor, I’m happy to call what you do “data science.”

I prefer to split things out a little differently in my mind. If you are writing code that calculates objective measures by applying a known specification to a datasource, you’re building a “data product.” A good example is a business unit producing results to be shared with a regulatory body; this could be financial, it could be the results of emissions tests, anything with a formal specification.

If you are seeking to answer an abstract question by translating it into a series of data transformations that will produce numbers that need interpretation, you’re doing “data analysis.” This might include using observational data to argue that a program is/isn’t effective, for example.

If you’re translating events and transactions into visualizations and summaries, you’re “reporting.” You might be producing a dashboard that shows sales broken down by product or giving realtime feedback on the health of a system.

I suspect part of the overloading of the term “data science” to encompass everything to do with machine learning or statistics comes from the level of rigor required. If you don’t follow a protocol and if you aren’t incredibly careful in those endeavors, you’ll end up with bogus inferences or dangerously miscalibrated models. Which leads me to the conclusion that data science is required for good statistics or ML, but ML/statistics aren’t strictly required for data science.

Interpreted vs. Compiled:

This isn't the clearest distinction, particularly with languages that compile to bytecode. But a simple rule of thumb can help you think through the meaningful differences: by default, do you compile your code to a binary before you can run it, or can you feed your source directly to an interpreter? Compiled languages are more common in software development than in analysis, particularly as things like Jupyter notebooks become more common in data exploration. The idea of a 'build' is very concrete when you have a compiled language. But with an interpreted language, you have to have a policy around specifying the interpreter (a specific `python.exe` on Windows, for example) that will run your code, and you have to make sure the interpreter doesn't get changed (no `pip install` commands modifying it, for instance).
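
One lightweight piece of such a policy, sketched in python (the expected version would come from your build record, not be hardcoded like this):

```python
# check_interpreter.py -- fail fast if a job starts under the wrong interpreter
# (a sketch; the expected version here is a placeholder)
import sys

EXPECTED = (3, 11)  # placeholder: whatever version the build was tested against

if sys.version_info[:2] != EXPECTED:
    sys.exit(
        f"Expected python {EXPECTED}, got {sys.version_info[:2]} at {sys.executable}"
    )
```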

The Case for Good Data in Your Development Environment:

Imagine two scenarios: one in which you have a dev datasource that matches your prod datasource closely, and another in which you have a prod environment so powerful it can handle sharing dev and prod analyses alongside each other. Now let’s say you want to test new changes you’re developing. In the two-environment scenario, you can freely alter/overwrite your project-specific tables, run the new code, and check the outputs. If all is as expected, you can promote the code to production and the same thing will happen there.

In the single environment version you have headaches: you need separate table names/schemas for your dev and prod versions. Once you run the development code and confirm the outputs, you aren’t just pointing to a new datasource, you’re also changing every table name (or worse yet, perhaps only your project-specific table names). You’re doing a lot of unnecessary bookkeeping, all while requiring your production database to run code you haven’t yet fully vetted.

Containers:

I won’t go into how containers work (besides saying they aren’t virtual machines), but they can make your products robust to different OSes and configurations. If your code runs perfectly in a container, deployment should be almost as simple as downloading your container image, starting the container, and running your code. Take a look at Docker.

Originally published at docs.google.com.
