
How to Build a Successful Data Science Capability

Jonathan Robinson
Dec 7, 2020

The data science experience within large organisations can be variable due to a perception that all you need is a group of data scientists to successfully execute data science.

A complete data science team requires a broad range of skillsets. The functions required depend on factors such as the industry, the culture of the company, the maturity of its analytics, and the accessibility and availability of data for analytics. The importance of analytics to senior leaders will also shape the capability of the data science function: is the aim to do the bare minimum to comply with a regulatory environment, or to help get ahead of the competition? Is data science a fundamental piece of the corporate strategy?

Establishing the Team

It takes the right mix of skills, a clear strategy to guide use-case selection, and executive education to create and maintain a data science capability that can deliver significant bottom-line value.

Establishing an effective data science function requires some or all of the following elements:

  • Data and machine learning scientists.
  • Machine learning engineers.
  • Software engineers.
  • Infrastructure engineers.
  • Hardware engineers.
  • Enterprise and solution architects.
  • Communications.
  • Customer adoption and change management.
  • Security advisors and security operations.
  • Development operations (DevOps).
  • Project/iteration managers.
  • Business analysts.

One of the hardest roles to fill is the leadership of all this. The team must be led by someone with business acumen and technical credibility; someone who can establish strong relationships with the wider business, articulate and formulate complex business problems and lead a complex and varied group of professionals.

Data Science is more than R and Python

Don’t overlook the science part of data science. There is much more to being a data scientist than just learning R and Python. If that’s all you do then you’re just someone who knows R and Python. Knowing how to code in these tools doesn’t make you a data scientist.

Consider the hype around Hadoop. There seemed to be an expectation that if you installed Hadoop in your organisation, your business would instantly become data-driven by executing the command:

hadoop -generate-insights -input=all_my_data.csv \
-output=predictions.out

Sadly, no such command exists, and many organisations were disappointed when they didn’t see the expected returns from their shiny new Hadoop cluster. The root of the disappointment was a focus on the technology rather than the outcomes.

When designing experiments, think about experimental protocols, control sets, measurement and performance metrics. How do you know your model is not just a random number generator? What performance metric did you choose, and why? Follow the scientific method, where the aim is to disprove that your model works. It’s easy to show examples of your model working, but only one example that doesn’t work is needed to invalidate the model. These are harder activities than writing some Python, which should be considered more of an implementation detail.
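As a concrete illustration of the random-number-generator question, one simple check is to compare your model’s score against a null distribution obtained by retraining on shuffled labels. A minimal sketch in Python, using synthetic placeholder data rather than anything from a real project:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real business dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Null distribution: retrain on shuffled labels. A model with real
# signal should comfortably beat every permutation.
rng = np.random.default_rng(0)
null_aucs = []
for _ in range(20):
    shuffled = rng.permutation(y_train)
    null = RandomForestClassifier(random_state=0).fit(X_train, shuffled)
    null_aucs.append(roc_auc_score(y_test, null.predict_proba(X_test)[:, 1]))

print(f"model AUC: {auc:.3f}")
print(f"null AUCs: {min(null_aucs):.3f} to {max(null_aucs):.3f}")

If the model’s AUC sits inside the range produced by the label-shuffled models, it is doing no better than chance, whatever its headline accuracy.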

The Data Strategy

Often the reasons for establishing a data science function within a large organisation will be a mix of the motivations described earlier, depending on the area of the business. Consider anti-money laundering (AML). Financial organisations need to comply with government regulations on detecting and reporting money laundering, but there is no money to be made by having the best AML detection (unlike fraud, which costs businesses real money). Customers are unlikely to choose a bank based on how well it detects money laundering. Thus there are valid reasons for not wanting to advance analytics in every area of a business, but to selectively apply data science in the areas that directly impact revenue, such as predicting whether a customer intends to leave when a timely intervention would likely retain them.

Identifying useful data science projects is often more difficult than anticipated, particularly if the organisation does not have an analytical culture and analytics is something done off to the side, or begrudgingly to satisfy regulators.

One approach is to work on wicked problems. A wicked problem is one where working out what the problem is forms part of the problem. These are very complex problems that may involve several parts of the business. Brainstorm with the business: we have a pile of data that we could be doing something more with. Combine the resources of business subject matter experts with experienced data scientists; experiment and iterate; explore the possibilities. For example, an insurance company may wish to know if there is any way to identify root causes of home damage claims, or whether there are any detectable trends and, if so, what a trend can be attributed to (a problem that probably does keep senior management awake at night). This may require natural language techniques on assessors’ reports, or even deep learning to identify damage from photos. One possible outcome may be that a change to a building code has increased the damage caused by water leaks or rain. Premiums could be adjusted to reflect the increased risk for affected properties, or the regulatory authority could be approached with a view to altering the building code. They are more likely to listen to solid data than to anecdotes.

Another approach is to showcase the possibilities to the business. This generates ideas and you will identify enthusiasts and champions.

Organisational Structure

There are several approaches to where data science teams can sit within an organisation. If the organisation has a chief data/analytics officer (CDO), then data science would be one of the functions of the CDO’s office. The CDO’s office typically sits within the technology area under the CIO/CTO. This data science team would provide a shared service to the rest of the business.

For larger organisations, a more federated structure may be a better fit. A common implementation is a centralised data science team in a centre-of-excellence (CoE) or shared-service setting, with separate specialist data science teams distributed across the business as required by divisional organisational units. This is termed the hub and spoke model: the CoE (hub) provides experts to analytical teams in other departments (spokes) when they need help building new models, initiating new projects, understanding specialist infrastructure, and so on. Ideally this creates collaboration, which may be coordinated by the hub.

There is no ‘correct’ way to structure data science teams within large organisations, but if analytics is important to your business, it should be done in a directed way. If words like AI, Machine Learning and Data Science appear in your high-level strategy, your strategy should contain a plan detailing the approach. Don’t leave it to individual organisational units to independently interpret this for themselves; you’ll end up with duplication, variable quality and questionable return on investment.

Science and Engineering

Scientific enquiry and engineering design are complementary and equally important.

The image below illustrates the difference between the scientific method and engineering design (source: The difference between science and engineering). Scientific enquiry begins with a study of a physical system and produces a model; engineering design begins with a model and produces a physical system. Both disciplines are required and are complementary. Coordinating your science and engineering teams to work in harmony is often a challenge. It can lead to friction, turf wars, jostling for control, even rogue activity. For example, a data scientist may do engineering work if their requirements are not being met, bypassing normal engineering practices including quality control.

[Figure: scientific enquiry turns a physical system into a model; engineering design turns a model into a physical system]

The scientific method is focused on creating new insights, whereas engineering produces a known output, usually in the form of a physical system such as a software product or infrastructure. The outputs of scientific enquiry are unknown ahead of time; if they were known, it would be an engineering activity.

Large organisations are much more familiar with engineering than with science. The meaning of science in this context is discovering something new by interrogating and exploring data for new insights. If the ‘scientific’ task is to build a credit risk model based on industry standard data fields, then it is an engineering activity. Just because a machine learning algorithm may be used does not automatically turn the activity into data science. Many activities labelled as data science are actually engineering projects. For this reason you do not need an army of ‘data scientists’. Data engineers will be able to do many of these tasks.

As scientific enquiry involves uncertainty, it does not fit well into an Agile workflow. Timeframes will almost certainly be missed if scientific enquiry is shoehorned into a management framework designed for engineering. Project managers and senior leaders need to understand this, and the challenge for them is to allow scientific enquiry to take place without over-focussing on unrealistic ‘deadlines’. Consider the massive research effort going into nuclear fusion. Despite 60 years of effort we are still unable to give any sort of timeframe for when all technical issues will be resolved. Although on a different scale to nuclear fusion, data science is science.

Timeframes will almost certainly be missed if scientific enquiry is shoehorned into a management framework designed for engineering.


The Business Needs Educating

The business should understand why changes in processes are necessary when predictive analytics is introduced. This is where effective change management and communications are paramount.

A data science unit cannot work successfully in isolation from the wider business. Non-technical executives need expert guidance in identifying the right projects and understanding what constitutes success.

Don’t wait to be given tasks by the business; they likely do not know what is possible with data.

One of the pain points for an insurance company is detecting and responding when an insurance claim goes off track. When the data science team is tasked with building a model that predicts which claims will need an intervention, the business may request: “Give us the top 20 claims each week”. This sounds like a reasonable request if the average across a year is 20 claims per week. However, statistics does not work like that; the results are inconveniently lumpy. Some weeks we may have 5 claims, other weeks 35. The business needs to change its KPIs to adapt and properly measure the effectiveness of a predictive model.

The business should understand why changes in processes are necessary when predictive analytics is introduced. This is where effective change management and communications are paramount. We cannot simply drop a predictive model into the business and leave all other activities untouched. Another consideration is that when a model is first activated, it will immediately identify the ‘low hanging fruit’. Thus, some time must pass before the model settles to a baseline against which a more accurate measurement of its effectiveness can be made.
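To make the lumpiness above concrete, you can simulate it. If flagged claims arrive at an average rate of 20 per week, a simple Poisson model (an illustrative assumption; real claim counts are often more dispersed still) shows how variable the weekly totals are:

import numpy as np

# Simulate one year of weekly counts, assuming flagged claims arrive
# independently at an average rate of 20 per week (Poisson assumption
# made purely for illustration).
rng = np.random.default_rng(42)
weekly_counts = rng.poisson(lam=20, size=52)

print("weekly counts:", weekly_counts)
print(f"mean: {weekly_counts.mean():.1f}, "
      f"min: {weekly_counts.min()}, max: {weekly_counts.max()}")
# The mean lands near 20, but individual weeks routinely fall in the
# low teens or high twenties: a fixed 'top 20 a week' KPI cannot
# describe this process.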

If the introduction of data science is part of a wider transformation program, think about what intermediate steps will be required before reaching the target state. Well thought-out sequencing and prioritising are critical if you are transitioning complex models from old technology to newer, perhaps cloud-based, technology. It is also an opportunity to rework any existing models rather than simply lifting and shifting them.

Choosing the Correct Metric and Measuring Value

Work out the size of the prize and present that to the business rather than a confusing array of numbers and graphs.

It is possible to build a model that is more than 99% accurate but completely useless. Consider a case where we are predicting a rare event such as a fraudulent transaction. According to the Australian Payments Network, in the year ending 30 June 2018, 0.0421% of credit card transactions were fraudulent (this is fraud that was caught; the actual number will be higher). That means 99.9579% of transactions were not fraud. If we were to build a model that predicts that there are no fraudulent transactions at all, we could claim that our model is 99.9579% accurate. Clearly accuracy is the wrong metric here. Other metrics we could use are balanced error rate, log-loss, area-under-the-curve or its related variation the Gini coefficient, precision/recall, etc. The problem with anything other than accuracy is that it is often hard to explain to a non-technical audience; some metrics are difficult for a human to interpret. Instead, think about quantifying the value. Work out the size of the prize and present that to the business rather than a confusing array of numbers and graphs.
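A minimal sketch of the accuracy trap, simulating transactions at the fraud rate quoted above (the data and counts here are generated for illustration only):

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Simulate a million transactions at a fraud rate of 0.0421%.
rng = np.random.default_rng(0)
y_true = (rng.random(1_000_000) < 0.000421).astype(int)  # 1 = fraud

# The useless model: predict 'not fraud' for every transaction.
y_pred = np.zeros_like(y_true)

print(f"accuracy:  {accuracy_score(y_true, y_pred):.4%}")  # ~99.96%
print(f"recall:    {recall_score(y_true, y_pred):.1%}")    # 0%: catches no fraud
print(f"precision: {precision_score(y_true, y_pred, zero_division=0):.1%}")

The headline accuracy is impressive, yet the recall of 0% reveals that the model catches no fraud at all, which is the number the business actually cares about.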

Pick Projects That Return the Best Value

Identify and understand the top 10 problems that keep senior management awake at night.

Make sure that use cases align to the corporate strategy. If the corporate strategy is baffling, confused or known only to a few, identify and understand the top 10 problems that keep senior management awake at night. Don’t be afraid to bring new ideas to the business to demonstrate ways that data can be used that they may be unaware of.

If establishing a new data science function, be mindful of the need to demonstrate value and benefits as soon as possible: actual value in a production setting, not theoretical value if only we had the right infrastructure. Pick use cases that can be delivered with existing technology.

Avoid Innovation Theatre

Data science is closely associated with innovation so avoid innovation theatre:

  • Idea contests/hackathons/datathons resulting in an endless stream of chat bots. Hackathons can be useful if run effectively. Aside from idea generation, they can also identify talent in your organisation that may not be being used optimally. You should not rely solely on these for ideas, and if you hold one, have a plan to actually execute the good ideas; otherwise it’s just a feel-good distraction.
  • ‘Innovation labs’: a group of self-styled ‘techies’, resplendent in jeans and t-shirts, situated in an isolated outpost disconnected from the rest of the business. A ‘labs’ function can be useful for incubating ideas that may have no logical place to be developed, but care must be taken that this team works with the business, not independently of it, and is not form over function. If you set up a ‘labs’ team, consider secondment, where ideas (perhaps generated in hackathons) can be developed by the person or team that conceived them; don’t restrict membership to an elite few who have long since run out of ideas.
  • Outsourcing innovation: engaging staid consultancies to tell us how to do innovation and data science. This can be an expensive form of theatre and damaging to employee engagement scores. This isn’t to advise against using consultancies, but be certain due diligence has been done and that the consultancy can deliver what it promises. Be wary of vendors who try to sell you the same solution they have sold to your competition (where is the competitive advantage?).



End to End is Key

There is much more to deploying and executing a machine learning algorithm in a production environment than just the machine learning algorithm.

Don’t get too caught up on the perfect machine learning algorithm. The image below (from the Google paper Hidden Technical Debt in Machine Learning Systems) illustrates the components required for a successful predictive model in production. All components are required in some form for even the smallest project.

[Figure from Hidden Technical Debt in Machine Learning Systems: the machine learning code is a small box surrounded by the much larger supporting components of a production system]

Too often the focus is on the small black box to the exclusion of the other components. There is much more to deploying and executing a machine learning algorithm in a production environment than just the machine learning algorithm. This is where engineers, DevOps, security operations, business analysts and others are required. All this must be coordinated by project and iteration managers.

Note that the image above is a little dated. These days many of the components are available as off-the-shelf packages and/or provided by cloud providers.

The last mile is often the hardest when deploying a predictive model into production.

[Figure: a lab environment for experimentation and a factory environment for production, connected by a deploy arrow]

You will need infrastructure, procedures and governance that support rapid experimentation and prototyping. You will also need a way of promoting models from the experimentation/discovery environment to the production environment. Think of it like a laboratory and a factory. The deploy arrow from the lab to the factory is often excruciatingly difficult to achieve in large organisations; blockers tend to be caused by extreme risk aversion resulting in multiple layers of bureaucracy. The last mile is often the hardest when deploying a predictive model into production.

The Lab/Factory model is an idealised way to view infrastructure. In reality each project will have its own nuances, so the picture above is a logical view rather than a physical one, but it helps to view each project in these terms.

In your production environment, consider the following:

  • The physical method of bundling a model and its required files and securely transferring it to the production system (a minimal packaging sketch follows this list).
  • Auditing and approvals. Who has authorised and approved that the model is fit for production and how is the approval recorded? Has the model been assessed against the organisation’s ethical criteria? Are there any regulatory considerations?
  • Version control. Related to the two points above: maintain a copy of all previous versions of a model, together with an approval trail. Also useful if you need to roll back a version due to unforeseen consequences of replacing an older model with a newer one.
  • Objective measurement. How is model performance being measured and tracked? How is the value to the organisation being measured? How do you compare models in a champion/challenger setting? (See the comparison sketch after this list.)
  • Access to the production system should be limited to engineering. No users should be able to directly modify anything in production without following the steps above.
  • How is data lineage tracked? Are you certain that each column of numbers is what you think it is? Are you confident you can explain to a regulator that the features being used in a model satisfy any regulatory requirements, for example around gender or race discrimination?
  • If the model is part of a decision framework that is regulated, are you storing scoring features and model outputs so that historical decisions can be investigated if required?
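
As an illustration of the first three points, the bundle can be as simple as the model artefact plus a manifest recording a checksum, a version and an approver. A minimal sketch, assuming a pickled model file; the file names and the approver label are hypothetical, and a real deployment would more likely use a model registry or CI/CD tooling:

import hashlib
import json
import tarfile
import time
from pathlib import Path

def bundle_model(model_path: str, version: str, approver: str) -> Path:
    """Package a model file with an audit manifest for promotion."""
    model_file = Path(model_path)
    manifest = {
        "model": model_file.name,
        "version": version,
        # Checksum lets production verify the artefact is untampered.
        "sha256": hashlib.sha256(model_file.read_bytes()).hexdigest(),
        # Minimal approval trail; a real process would record far more.
        "approved_by": approver,
        "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    manifest_path = model_file.with_name("manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))

    bundle_path = model_file.with_name(f"{model_file.stem}-{version}.tar.gz")
    with tarfile.open(bundle_path, "w:gz") as tar:
        tar.add(model_file, arcname=model_file.name)
        tar.add(manifest_path, arcname="manifest.json")
    return bundle_path

# Hypothetical usage:
# bundle_model("churn_model.pkl", version="1.4.0", approver="model-risk-team")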
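
For the objective-measurement point, a champion/challenger comparison can be as simple as bootstrapping the gap in a chosen metric on the same recent production data. A sketch under stated assumptions (the function and variable names are hypothetical, inputs are numpy arrays, and AUC stands in for whatever metric you chose):

import numpy as np
from sklearn.metrics import roc_auc_score

def compare_models(y_true, champion_scores, challenger_scores,
                   n_boot=1000, seed=0):
    """Bootstrap the AUC gap to see whether the challenger really wins."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC needs both classes present in the sample
        diffs.append(roc_auc_score(y_true[idx], challenger_scores[idx])
                     - roc_auc_score(y_true[idx], champion_scores[idx]))
    diffs = np.asarray(diffs)
    # A positive mean gap with an interval excluding zero favours
    # the challenger.
    return diffs.mean(), np.percentile(diffs, [2.5, 97.5])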

Conclusion

With all the buzz and interest around AI, machine learning and data science, we can assume that the why (why an organisation that deals with data should be doing these things) is well understood. The how is a much more difficult proposition. I have presented some (possibly controversial) observations and suggested approaches to creating a successful data science capability, drawn from experience, particularly in the financial industry. I’d be interested in your thoughts and experiences. Please comment or message me!

Originally published at https://www.linkedin.com.


Written by Jonathan Robinson
Data scientist and engineer. PhD in Machine Learning (2004). Head of Data Science at a fintech startup. https://www.linkedin.com/in/datavader
