Data Science bringing down the divide between IT and business

Ville Voutilainen
SPxFiva Data Science
9 min read · Sep 26, 2023

Introduction¹

The Bank of Finland and the Finnish Financial Supervisory Authority (FIN-FSA) are organizations with long traditions, and their business and IT units have largely lived separate lives in the past. Nowadays, however, it is not uncommon for business units to have tech-savvy people who work with large datasets and programming languages to build advanced statistical models. Such people have various titles within the organization (economist, supervisor, researcher, developer, etc.), but they could collectively be referred to as data scientists.

In this article, I describe the typical work of a central bank/supervisory data scientist. The aim is to highlight the workflow of a typical data science project and to outline its similarities to and differences from more common duties in our organization: traditional business analysis, empirical research work, and software development. The comparisons are highly stylized, yet they give a rough sense of why the role of a data scientist has evolved alongside these other duties.

Data science workflow

In this article, I define the term “data science” broadly: it is anything where the typical workflow looks something like the one in the following figure:

Figure: Stylized data science project workflow. Adapted from the Six Divisions presented in “50 Years of Data Science” by David Donoho and the “prototyping loop” described in Ville Tuulos’ book “Effective Data Science Infrastructure”.

As with any other kind of project, it all starts with an idea or a question. This can be formulated by the data scientists themselves, or it may come as a given from their manager or from subject matter experts. The task could be a business case, a need for analysis on some topic, or an enhancement to an existing process. When the task requires data, this is where the data scientist starts to excel (pun intended).

The first essential step is to find out whether appropriate data for the task exists. One of the most important factors is that data scientists typically need to utilize real-life (production-grade) data. One needs real data to answer real business questions, and only in special cases can data scientists operate with, for example, synthetic data.²

Once appropriate data sources are uncovered, the prototyping loop³ is initiated. The loop can be defined to include five parts⁴:

  1. Project-specific ETL⁵: extracting data from databases or data warehouses, doing initial transformations, and loading the data for analysis (e.g., into another warehouse or a computer’s RAM). This project-specific ETL comes on top of the ETL processes a data engineer has already run into the data warehouse.
  2. Data representation and wrangling: implementing transformations, feature engineering, or data restructuring to present the data in a more appropriate form.
  3. Computing with data: interacting with data using dedicated statistical software or programming languages, leveraging various computing clusters where needed.
  4. Data exploration and visualization: investigating data, for example, by visual inspections, summary statistics, or simply viewing individual observations. Refining initial hypotheses based on findings.
  5. Data modeling and analysis: can include various flavors of analysis, for example descriptive, predictive, or causal analysis. This is what ultimately brings answers to the business questions.

The prototyping loop is at the core of a data scientist’s work, and it is rarely a well-defined, linear process. For example, computing limits may require changes to the project-specific ETL process, visual explorations can uncover facts that warrant changes to the modeling, model results may indicate that new features are needed, and so on. In short, it can get messy. This, however, is one of the defining characteristics of data science: it is both an art and a science, driven by patterns discovered in real data.
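
To make this concrete, below is a minimal, hypothetical sketch of one pass through the loop in Python. Everything in it — the connection string, the "loans" table, and the column names — is an illustrative placeholder, not an actual Bank of Finland system or dataset.

```python
# A stylized pass through the prototyping loop. All names (connection string,
# the "loans" table, column names) are illustrative placeholders only.
import numpy as np
import pandas as pd
import sqlalchemy as sa
import statsmodels.formula.api as smf

# 1. Project-specific ETL: extract a relevant slice from the data warehouse.
engine = sa.create_engine("postgresql://user:password@warehouse/analytics")
loans = pd.read_sql(
    "SELECT loan_id, region, loan_amount, maturity_years, interest_rate "
    "FROM loans WHERE origination_date >= '2022-01-01'",
    engine,
)

# 2. Data representation and wrangling: derive features in a more useful form.
loans["log_amount"] = np.log(loans["loan_amount"])

# 3./4. Computing, exploration, and visualization: summary statistics and a
#       quick plot to sanity-check the data and refine the hypothesis.
print(loans.describe())
loans.groupby("region")["interest_rate"].mean().plot(kind="bar")

# 5. Data modeling and analysis: here, a simple descriptive regression.
model = smf.ols("interest_rate ~ log_amount + maturity_years + C(region)",
                data=loans).fit()
print(model.summary())
```

In practice, each numbered step can be far more involved, and the loop is traversed many times as findings feed back into earlier steps.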

The prototyping loop can require extensive subject matter understanding in addition to technical skills. Further, any of the five parts of the loop may be very complex processes, requiring a long time to be completed. Thus, it is essential that the infrastructure around the prototyping loop is made as smooth as possible. This means that appropriate tooling is available and that data pipelines — between data warehouses and the data scientist’s development environment — work seamlessly.

After the data scientist is happy with the prototype, they want to publish their results to the world. It is important to note that, depending on the business case at hand, the outcome of the prototyping loop can take many forms. Here are some typical outputs, with example cases from the Bank of Finland:

  • Deploying an application: an application that, for example, utilizes a developed statistical/ML model and/or presents its outputs. An example from our organization: the Bank of Finland Nowcasting model.
  • Research piece: a prepared policy analysis or academic research article. A common factor for these pieces is that a great deal of data work lies behind them. For example, the piece might be a policy analysis on housing companies utilizing granular loan data, or an econometric research paper (see Bank of Finland Discussion Papers).
  • New dataset: the result might be a new dataset created from existing datasets, or new features developed for an existing dataset. New data can be written to an internal database table for the team to use or published through a public API (see the sketch below). An example from our organization: Bank of Finland and FIN-FSA Open Data Portal.
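
As a small, hedged illustration of this last output type, the sketch below aggregates a hypothetical granular loan table into a monthly summary and writes it back to an internal schema; all table, schema, and column names are made up for the example.

```python
# Hypothetical example of publishing a derived dataset back to an internal
# database table. Table, schema, and column names are placeholders only.
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@warehouse/analytics")

# Read granular, loan-level data and aggregate it into a shareable summary.
loans = pd.read_sql(
    "SELECT region, origination_date, loan_amount FROM loans",
    engine,
    parse_dates=["origination_date"],
)
summary = (
    loans
    .assign(month=lambda d: d["origination_date"].dt.to_period("M").dt.to_timestamp())
    .groupby(["region", "month"])
    .agg(total_lending=("loan_amount", "sum"),
         loan_count=("loan_amount", "count"))
    .reset_index()
)

# Write the new dataset where the rest of the team (or a public API) can read it.
summary.to_sql("monthly_lending_by_region", engine,
               schema="derived", if_exists="replace", index=False)
```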

In the industry, the work of a data scientist is sometimes understood to consist solely of building models and deploying them into production environments. This, however, is an incomplete story. The outcome can be a piece of software, but it might just as well be a research paper or an analysis article, without the need to “deploy” anything. In these two cases, the actual work in the result-sharing phase can be completely different. If the goal is to deploy models into production, the data scientist needs a working knowledge of DevOps/MLOps. If the goal is to write a business report or a research article, the data scientist needs to be able to write a compelling story and/or articulate the outcomes according to the requirements of the scientific method. The crucial point is that, whatever the end product, the work looks the same in the prototyping loop phase.

Relation of data science to business analysis

Data science and business analysis⁶ share the same goal: creating value by providing answers to business questions. This requires knowledge of the actual subject matter. Without business knowledge, it is impossible to formulate ideas or judge their relevance, and “sciencing about data” loses its meaning. Also, both business analysis and data science deal with real-world data, and the outputs might be similar (e.g., an analysis article).

The main differences arise in the technical know-how:

  • Data scientists work with bigger and more granular datasets than business analysts or subject matter experts. Thus, it becomes important to understand the work of data engineers (data infrastructure, ETL processes) to be able to extract relevant pieces of data for the projects.
  • Knowledge of advanced (statistical) software — for example, programming languages such as R or Python — becomes essential for handling and analyzing complex datasets.
  • Data scientists tend to use sophisticated analysis techniques (regression analysis, predictive models, etc.) instead of aggregated statistics and key performance indicators (see the short sketch after this list).
  • Lastly, if the end product is a statistical model as a piece of software, data scientists often participate in the technical details of productization and monitoring of the model.
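
To illustrate the third point above with a minimal, hedged example (the data and column names are invented): a business analyst might report an aggregated key figure, while a data scientist fits a model on the underlying granular observations.

```python
# Invented loan-level data, used only to contrast the two styles of analysis.
import pandas as pd
import statsmodels.formula.api as smf

loans = pd.DataFrame({
    "region":        ["North", "North", "South", "South", "East", "East"],
    "interest_rate": [3.1, 2.9, 3.5, 3.7, 3.0, 3.2],
    "loan_to_value": [0.60, 0.55, 0.85, 0.90, 0.70, 0.65],
})

# Business-analysis style: an aggregated KPI per region.
print(loans.groupby("region")["interest_rate"].mean())

# Data-science style: a regression on the granular data, estimating how the
# rate varies with loan-to-value while controlling for region.
model = smf.ols("interest_rate ~ loan_to_value + C(region)", data=loans).fit()
print(model.params)
```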

Relation of data science to research

Besides business analysis, there are many people with an academic background working at the Bank of Finland and the FIN-FSA. At the Bank of Finland, there are even two dedicated economic research units. Economic research thus plays an important part in our core business.

Empirical economic research⁷ shares quite a few elements with the data science prototyping loop:

  • Project-specific ETL⁸ and data wrangling parts typically look the same in both cases.
  • Data visualization and exploration are essential in both formulating questions and hypotheses.
  • Both researchers and data scientists are interested in computational capacity, that is, having enough computing power to perform the needed calculations.
  • Both researchers and data scientists love to build models, either for predictions or to find causal explanations.

The main differences are:

  • Economic researchers are rarely interested in shipping software or building applications. Instead, researchers focus on producing academic research articles or policy papers.
  • The data science stack can be more open-source-oriented than that in research. Whereas R or Python are typical choices for data scientists, economic researchers more often rely on proprietary software like Matlab or Stata.
  • Researchers are less often interested in computational efficiency, as it is less essential that their computations run within a specified time frame compared to models running in production.
  • Lastly, as researchers are at the forefront of the academic literature, they typically have a stronger methodological background compared to data scientists, who in turn often excel in efficient data wrangling. Combining expertise in different skillsets can be very beneficial for individual projects.

Relation of data science to software development

Data science requires a certain degree of technical know-how, and some core elements of data science are therefore the same as in traditional software development. These include concerns about computational efficiency, best practices for writing and version-controlling code, and established principles of database management.

There are, however, several key differences.

  • First, and perhaps most importantly, the nature of the problem being solved is different. Data science is uncertain and can involve huge degrees of complexity (on top of plain software complexity), because data itself is typically messy. The outcome of a data science project may not be known beforehand. Rather, it is the job of the data scientist to find out what the answer to the proposed question is, or whether it can be answered with the data at hand to begin with.⁹
  • Data science relies on constant interaction with production systems. The most important input for data science, the data, must typically be real production data. Due to its dependence on a data-driven approach, only on rare occasions can data science be practiced with “test” data.¹⁰
  • Reliance on production data also requires more data cleaning and wrangling compared to software development.¹¹

Conclusion

Within the central bank and the supervisory authority, the advent of large datasets and increased computing power — combined with the need for more detailed analysis — has created a need for the skillset of data scientists alongside more traditional business and IT roles. The divide between IT and business has nearly disappeared, with people who combine tech savviness with a business-oriented mindset operating smoothly across organizational lines. The typical data science workflow described in this article encompasses elements from many different central banking and supervisory roles.

[1]: The views in this article are those of the author and do not represent those of the Bank of Finland or the FIN-FSA.

[2]: A common problem with proprietary real data is that it cannot be distributed due to confidentiality concerns. In such cases, representative synthetic data may be enough to train, say, a machine learning model.

[3]: A good description of the data science prototyping loop can be found, e.g., in Ville Tuulos’ book Effective Data Science Infrastructure.

[4]: List adapted from the Six Divisions presented in the article “50 Years of Data Science” by David Donoho.

[5]: Extract, transform and load.

[6]: In this article, “business analysis” refers broadly to analysis done by subject-matter experts. It also encompasses the role of a business analyst. In central banks, one typical role is “economist”. In this text, the work of an economist is to be understood as a combination of business analysis and economic research.

[7]: I distinguish empirical economic research (econometrics, working with data) from theoretical research (economic theory, theoretical models).

[8]: In addition to extracting and transforming data from a data warehouse, sometimes researchers and data scientists might participate in the data collection itself.

[9]: In this sense, data science coincides with research.

[10]: Say, training predictive models on synthetically created data that preserves the relevant moments of the real data.

[11]: The interaction with production systems does not limit itself to data. Imagine a scenario where a data scientist has built a predictive model that constantly gives out new predictions in a production environment. If the data-generating process out in the wild changes, the once-trained model can become obsolete, or worse yet, start making false predictions. Data-driven models may require constant monitoring — that is, feedback on their success. Monitoring is also common in traditional software development, but again, the dimension of data brings added complexity.
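
As a hedged sketch of what such monitoring could look like (the feature, data, and threshold are invented for illustration), one simple approach is to compare the distribution of a model input in recent production data against the data the model was trained on:

```python
# Illustrative drift check: compare a model input's distribution in recent
# data against its distribution at training time. All values are invented.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # seen at training time
recent_feature = rng.normal(loc=0.4, scale=1.0, size=1000)    # the "wild" has shifted

statistic, p_value = ks_2samp(training_feature, recent_feature)
if p_value < 0.01:
    print(f"Possible data drift (KS statistic {statistic:.3f}): "
          "consider reviewing or retraining the model.")
```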
