The Python trick that will change your analyst team for good

Jason Braund
Bumble Product


In data science, we all know that ‘80% of a data scientist’s time is spent on data preparation’. I’m sure, like me, you’ve heard people say this more than enough times. And I’m sure, like me, you’d be unlikely to dispute its validity. In my opinion, the same idea applies to the general workflow of an analyst or data scientist, not just to a specific modelling task. That is to say, a large proportion of an analyst’s working time is taken up by the tasks that lay the groundwork for what is actually expected of them: analysing. Whether it’s data preparation, visualisation or formatting outputs, the time required for ‘behind the scenes’ work can quickly grow. Such ‘time sink’ tasks multiply rapidly when you have a team of analysts, making them a potentially lucrative area for the marginal gains that optimisation offers.

I’ve worked within several analytics teams to date, and one persistent frustration has been the lack of efficiency in tasks that have to be performed over and over. I should make it clear that I’m not necessarily talking about automation here: repetitive tasks done time and again, often without human intervention. I’m talking about providing our analysts with the specific tools they need to accelerate the process of turning data into actionable recommendations, and, in doing so, generalising those tools so that they can be used across the many different use cases we might encounter.

Here at MagicLab, we take a highly proactive approach to making the life of our analysts easier, with a view to maximising insight delivery and minimising the demands we place on stakeholder time. Over a long period, we have developed our very own internal Python library that meets the specific needs of our everyday tasks. It covers a broad range of themes, from statistical methods specifically tuned to our data structures, to standardised notebook formatting and branding that delivers a consistent stakeholder experience. The best bit is that all of these tools are developed and owned by the end-user, our team, and so can be adapted to suit our changing requirements and methodologies. Our main stakeholders are our own world-class product managers, and it’s key that our messages to them are clear and consistent. This enables them to absorb the information we present quickly and efficiently. They are making important product decisions in a fast-moving environment and culture, and speed keeps them ahead of the game.

Examples

Let’s look at one of our simpler examples to demonstrate exactly what I mean here.

As an analyst team, naturally, we spend much of our time in Python working with pandas dataframes. When producing our reports, we often wish to display our aggregated data or statistical results both to supplement our commentary and provide context for our arguments.

Of course, we could output our dataframes without any kind of internal package. We could add formats to some specific columns, remove the index column, or rename columns to make them more reader-friendly. But even with these simple tasks, the time required is far more than it needs to be, and the results may not justify the effort.

We’ve got this covered, though. With just one function, our analysts can do all of these tasks. Columns are formatted based on the column name (e.g. if the name contains the % symbol, or a word representation of it, then a percentage format is applied), with easy overrides available within the function; the index is removed by default; the column headers are automatically formatted to be reader-friendly; and we can apply standardised styling across any number of columns, or even a bespoke styling function.

A simple example of our dataframe formatting function.
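To give a flavour, here is a minimal sketch of what such a function might look like. The name format_df, the default formats and the column-name heuristics are all illustrative rather than our actual implementation, and it assumes pandas 1.4 or later for Styler.hide:

```python
import pandas as pd

def format_df(df, overrides=None):
    """Apply default presentation rules to a dataframe for display."""
    overrides = overrides or {}
    formats = {}
    for col in df.columns:
        if col in overrides:
            formats[col] = overrides[col]          # explicit override wins
        elif "%" in col or "percent" in col.lower():
            formats[col] = "{:.1%}"                # fractions shown as percentages
        elif pd.api.types.is_float_dtype(df[col]):
            formats[col] = "{:,.2f}"               # generic float columns
    # Reader-friendly headers: "signup_rate_%" -> "Signup Rate %"
    pretty = {col: col.replace("_", " ").strip().title() for col in df.columns}
    return (
        df.rename(columns=pretty)
          .style.format({pretty[c]: f for c, f in formats.items()})
          .hide(axis="index")                      # index removed by default
    )
```

An analyst would then call something like format_df(summary, overrides={"revenue": "${:,.0f}"}) and get back a consistently styled table in one line.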

Giving our analysts the ability to do this with just one function may seem simple, but the time savings it offers are well worth any effort it takes to develop and maintain the function. Even if each member of our ten-strong analyst team used this function just five times a day, and each use saved 30 seconds, then over the course of a year the time savings would equate to over 100 hours. Considering that this applies to each function in the library, it’s not hard to justify the work required.
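A quick back-of-envelope check, assuming roughly 250 working days in a year:

```python
# Rough arithmetic behind the claim above; 250 working days is an assumption.
analysts, uses_per_day, seconds_saved_per_use = 10, 5, 30
hours_per_year = analysts * uses_per_day * seconds_saved_per_use * 250 / 3600
print(f"{hours_per_year:.0f} hours saved per year")  # -> 104 hours
```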

Another example is a function for adding tags to a report. This makes life easier for product managers looking for relevant analysis.
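Again as a sketch only (the function name and styling are invented), a tagging helper could be as simple as rendering a row of badges at the top of a notebook:

```python
from IPython.display import HTML, display

def add_tags(*tags):
    """Display report tags as styled badges in a notebook cell."""
    badge = (
        '<span style="background:#f2f2f2;border-radius:12px;'
        'padding:2px 10px;margin-right:6px;font-size:12px;">{}</span>'
    )
    display(HTML("".join(badge.format(tag) for tag in tags)))

add_tags("retention", "iOS", "A/B test")
```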

Not all of our internal functions are quite as simple as dataframe manipulation. Certain more complex methodologies that we use or even develop within the team are also perfect candidates for this optimisation process. We have a number of different functions that span well-known areas including Google’s Causal Impact, NLP, Geo analysis and statistical testing, to name but a few.

Application

Hopefully, by now I’ve convinced you of the value of this kind of optimisation. You may now be wondering how you can go about implementing something similar in your team. The advice below should get you on the right track.

How can I do this in my team?

1. Compile a list of common operations across your team.
Consider all possibilities, even things done by just one or two people; if they are executed often enough, optimising them can still be beneficial. They could also become something that helps other members of the team, something they weren’t doing before but would benefit from doing. Common examples might be running stats tests or manipulating dataframes, as in the sketch below.
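To illustrate, a shared stats-test wrapper might look something like this. The function name and the plain-English verdict are my own invention; it assumes conversion-style data and that statsmodels is installed:

```python
# Hypothetical wrapper around a standard two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

def compare_conversion(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Two-proportion z-test with a plain-English verdict for reports."""
    stat, p_value = proportions_ztest(
        count=[successes_a, successes_b], nobs=[n_a, n_b]
    )
    verdict = "significant" if p_value < alpha else "not significant"
    return {"z": round(stat, 3), "p": round(p_value, 4), "verdict": verdict}

print(compare_conversion(420, 10_000, 465, 10_000))
```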

2. Group operations into sensible subpackages.
This will make life much easier later on, when your package becomes larger and has multiple contributors. Staying organised at the early stages can pay dividends in a year’s time. You might have a geo analysis sub-package containing a number of functions, a sub-package for general dataframe formatting options and then maybe one for NLP functions, along the lines of the layout below.
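As an illustration only (the package and folder names are made up), the structure might look like this:

```
analyst_tools/
├── __init__.py
├── formatting/   # dataframe display and styling helpers
├── geo/          # geo analysis functions
├── nlp/          # NLP utilities
└── stats/        # statistical testing wrappers
```

Analysts then import from whichever sub-package they need, e.g. from analyst_tools.stats import compare_conversion.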

3. Optimise the functions you write.
It’s likely that the processes that are used time and again have been somewhat optimised already and so can slot into your package without too much trouble. But be sure to try and find ways to make the code as efficient and general as possible. Peer review is always a good idea at this point.

4. Package them all up and release to your team.
The standard structure of a Python package is pretty simple and something that a little googling will quickly reveal. This link will certainly help you: https://packaging.python.org/tutorials/packaging-projects/. This isn’t intended to be a guide on how best to do this bit, so I won’t go into any further detail here.

5. Get constant feedback and continually improve.
The importance of this can’t be overstated. Without feedback and improvements, the package will be unlikely to survive its infancy. Your team may need some convincing at first, and getting their feedback on how to improve the package will help with their engagement. We are constantly developing and editing our package, and all of the ideas and improvements come from the team.

6. Deprecate and keep up to date.
This is quite similar to the previous step but you should make sure you keep an eye on what’s still useful and what’s out of date. Keep that package nice and trim.

Importance of documentation

1. Make sure the team knows how to install and use the package (this will help with buy-in in the early stages).

2. Make sure all of the functions follow standard styling such as PEP 8. This will help with consistency across developers and allow end-users to understand functions easily by accessing the docstrings (see the sketch after this list).

3. Keep track of development ideas. If you use JIRA or something similar then create a board to help with this.

4. Consider producing HTML pages from the docstrings to smooth things for the end-user. This can be done really easily using a package such as Sphinx.
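As an illustration, a consistently styled docstring (this one loosely follows the Google convention, and the function itself is made up) gives end-users instant help() output and renders cleanly in Sphinx-generated HTML:

```python
import pandas as pd

def winsorise(series: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip a numeric series to the given quantiles.

    Useful for taming outliers before plotting or aggregating.

    Args:
        series: Values to clip.
        lower: Lower quantile bound; defaults to the 1st percentile.
        upper: Upper quantile bound; defaults to the 99th percentile.

    Returns:
        The clipped series, with the same index as the input.
    """
    return series.clip(series.quantile(lower), series.quantile(upper))
```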

Don’t slip up

The last thing you want is for all your hard work to go to waste and to see a project aimed at improving efficiency become a time sink itself. Here are a few things you should try to avoid.

1. Failing to get buy-in from your team.

2. Over-optimising (much like overfitting a model): if a use case is too specific, it won’t generalise well and will be deemed pretty useless.

3. Adding functions for the sake of it. I’ve rejected plenty of requests to add things to our package. Sometimes it’s just not worth the time input needed.

4. Failing to keep everyone up to date. Talk about it, announce new releases, don’t let it lose momentum.

5. Compatibility: it’s important to ensure that your team members all use the same Python version so that functionality is compatible for all users. Declaring the supported versions in your package metadata helps enforce this, as in the sketch below.
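For instance, a minimal setup.py along these lines (the package name and version pins are illustrative) lets pip refuse installation on an unsupported interpreter:

```python
# Minimal sketch; adapt names and version pins to your own team's setup.
from setuptools import find_packages, setup

setup(
    name="analyst_tools",              # hypothetical package name
    version="1.4.0",
    packages=find_packages(),
    python_requires=">=3.9,<3.12",     # the interpreter range your team supports
    install_requires=["pandas>=1.4"],  # pin key dependencies too
)
```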

Conclusion

As analysts, we have many ways of optimising our work and processes. Naturally, we want to spend our time doing the things we find most interesting, and optimising allows us to do so more often. The process I’ve described is one way of doing it and is not necessarily appropriate for all teams. It may require a high initial investment and perhaps involve a steep learning curve. But think ahead a year or two. It might well become something you come to rely on every day, as we do here at MagicLab; it could become something that saves your analysts precious time, letting them focus on more important aspects of their role.

In our data analyst team, we’re actively encouraged to spend time on side projects. We’re always looking to hone our skills and expand our knowledge, and so add value to the wider goals of the business. When I joined the company, my experience in Python development was non-existent. But over the past two years, I’ve been given the opportunity and support to invest time both in teaching myself and in attending relevant courses. I’ve now reached a level that allows me to advance the whole team in this area.

If contributing to projects such as this is of interest to you, we are hiring so feel free to get in touch and find out more here about joining the MagicLab family!
