Data Lineage — An Operational Perspective

What can a BI team learn from a data science team?

Vimarsh Karbhari
Acing AI
May 5, 2020


Operational metrics like CAC, LTV, growth rate, cash flow, and profit are the heart of any operations team; they are the cornerstones of its work. Having a historical view of this data is essential: it lets the team chart the business's trajectory against where it stands today.

Data lineage is one of those problems an organization doesn't know it has until it starts to scale. When a team is only working on one or two projects, it's easy to keep track of things with spreadsheets, word of mouth, and Slack. But as the team grows and takes on more projects, that kind of ad-hoc system breaks down fast. Data science teams, because they deal with data all the time, usually have processes and tools to solve this issue.


Data lineage — the biggest tool in a business operations toolbox

Mature business operations teams live and die by their metrics. Each operations team within a company usually tracks this data across spreadsheets, Box folders, and cloud applications, and uses tools like Tableau to display it. There are many instances where data lineage is lost or muddled, resulting in unforeseen consequences. Common scenarios include:

  • Moving off a previous application to a newer application: The marketing team decides to move from Marketo to HubSpot. Marketo and HubSpot have different perspectives on the same data, and this can lead to issues. To solve this problem, it is important to engage with the deployment team for the newer application, prepare a plan, and run the metrics before and after in different test environments to compare values (a simple version of this check is sketched after this list). Each application has a different way of looking at data points, so it is important to reconcile them and ensure a smooth transition from a metrics perspective before any process is built on the newer application.
  • Loss of team members: A critical member of the team departs, taking a great deal of tribal knowledge about metrics with them. It is important to have documentation of the different metrics and their definitions stored somewhere durable. Lost data can often be regenerated by other team members; however, when the metadata context of the data disappears, it becomes a bigger problem. We discuss a solution to the metadata issue in the next section, on what business intelligence can leverage from data science.
  • Process entropy: Processes inevitably transform data, and over time the data becomes muddled. Process design is therefore critical. For each process that fetches or updates values, there should be documentation and communication about what it touches. Whenever a process changes, a before-and-after look at the metrics and values should confirm that nothing broke. These processes should also be kept under version control so that changes in the numbers can be tied to specific versions of a process.
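To make the before-and-after check concrete, here is a minimal sketch in Python. The metric names, the sample values, and the 1% tolerance are all hypothetical; in practice the two inputs would be metric exports from the old and new application, or from the previous and current version of a process.

```python
# Minimal sketch: compare metric values captured before and after a change
# (application migration or process update) and flag anything that drifts
# beyond a tolerance. Metric names and the 1% tolerance are illustrative.

TOLERANCE = 0.01  # flag relative differences larger than 1%

def compare_metrics(before: dict, after: dict, tolerance: float = TOLERANCE) -> list:
    """Return a list of (metric, before, after) tuples that drifted."""
    drifted = []
    for metric, old_value in before.items():
        new_value = after.get(metric)
        if new_value is None:
            drifted.append((metric, old_value, None))  # metric missing after the change
            continue
        denom = abs(old_value) or 1.0
        if abs(new_value - old_value) / denom > tolerance:
            drifted.append((metric, old_value, new_value))
    return drifted

# Example: metrics exported from the old and new systems (hypothetical values).
old_system_metrics = {"mql_count": 1200, "cac": 310.0, "lead_to_mql_rate": 0.18}
new_system_metrics = {"mql_count": 1185, "cac": 342.0, "lead_to_mql_rate": 0.18}

for metric, old, new in compare_metrics(old_system_metrics, new_system_metrics):
    print(f"Check lineage for '{metric}': {old} -> {new}")
```

The useful part is not the arithmetic but the habit: every migration or process change produces a drift report that someone has to sign off on.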

What can be leveraged from data science to improve business intelligence?

Knowing whether data came from a database, an IoT device, a warehouse, or a web scraper tells data scientists a lot about the quality of that data. It also helps them trace errors back to their source. If a database owner changed a field or a metadata value, it can affect the outcome of training. Data scientists find the root cause of those changes by working across business divisions to either roll back to an earlier state or incorporate the changes into their analyses.

Additionally, understanding the history of the data gives the team valuable insight into its analytics pipelines, simplifies training, and makes reproducing experiments much easier by letting the team track everything back to the root of a problem. It also makes it easier to onboard new team members when they join. The data science team uses tools, processes, and techniques to maintain versions of the data as well as its lineage. Avoiding metadata drift is just as important a problem for business intelligence to address.
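As an illustration of the kind of data versioning the data science team relies on, here is a minimal sketch in Python that fingerprints a dataset file and appends a lineage record (source, parent version, timestamp) to a small log. The file paths, field names, and log format are assumptions made for illustration; purpose-built data version control tools do far more.

```python
# Minimal sketch of data version tracking: fingerprint a dataset file and
# append a lineage record (source, parent version, timestamp) to a small log.
# File paths, the log format, and the field names are illustrative assumptions.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

LINEAGE_LOG = Path("lineage_log.jsonl")

def fingerprint(path: Path) -> str:
    """Return a SHA-256 hash of the file contents, used as its version id."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_lineage(path: Path, source: str, parent_version: Optional[str] = None) -> str:
    """Hash the dataset and append a lineage entry; return the new version id."""
    version = fingerprint(path)
    entry = {
        "dataset": str(path),
        "version": version,
        "parent_version": parent_version,
        "source": source,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with LINEAGE_LOG.open("a") as log:
        log.write(json.dumps(entry) + "\n")
    return version

# Example: record a raw export, then the cleaned dataset derived from it.
# raw_version = record_lineage(Path("exports/leads_raw.csv"), source="crm_export")
# record_lineage(Path("exports/leads_clean.csv"), source="cleaning_job",
#                parent_version=raw_version)
```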

It is important that the data analytics and BI team gets embedded into the different business operations teams from a functional perspective. It is just as important to learn best practices from the data science team from a technical perspective. Once an organization scales, each team develops its own perspective on the data and on what that data says about the world and the business. At this point the metadata is as important as the data itself. Master data management (MDM) is an important concept in the same vein for modern data-driven business operations. However, each team may start to hold different definitions of the same data point. Before metric and metadata definitions start to drift within an organization, the teams should align on an MDM strategy.

One solution to the metadata-loss problem, and a starting point for an MDM strategy, is a centralized repository of metadata. Uber has done this with Databook. It is also important to expose this metadata well, through a tool like Airpal, which Airbnb built for its internal teams.
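The shape of such a centralized repository can be shown with a very small sketch: a registry that stores one agreed definition, owner, and source system per metric name, so every team resolves a term like "CAC" the same way. This is a toy illustration of the idea only, not how Databook or Airpal work internally; all field names are assumptions.

```python
# Toy sketch of a centralized metric/metadata registry: one agreed definition,
# owner, and source system per metric name. Field names are illustrative only.
from dataclasses import dataclass

@dataclass
class MetricDefinition:
    name: str
    definition: str
    owner_team: str
    source_system: str

class MetadataRegistry:
    def __init__(self):
        self._metrics = {}

    def register(self, metric: MetricDefinition) -> None:
        # Refuse silent redefinition: conflicting definitions are how drift starts.
        existing = self._metrics.get(metric.name)
        if existing and existing.definition != metric.definition:
            raise ValueError(f"Conflicting definition for '{metric.name}': "
                             f"already owned by {existing.owner_team}")
        self._metrics[metric.name] = metric

    def lookup(self, name: str) -> MetricDefinition:
        return self._metrics[name]

registry = MetadataRegistry()
registry.register(MetricDefinition(
    name="CAC",
    definition="Total sales and marketing spend / new customers acquired in the period",
    owner_team="Finance Ops",
    source_system="warehouse.finance.cac_monthly",
))
print(registry.lookup("CAC").definition)
```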

Importance of a data analytics/BI team

It is imperative to start thinking about a data analytics/BI team early in the life cycle of an organization. The team helps aggregate data from different sources such as product usage, conversion metrics, LTV, visitors, leads, and customer-related data. In my experience, teams start by building a warehouse and getting all the data into a centralized location like a data lake. That is followed by building scalable pipelines to ensure data gets to the data lake. The team also builds the human capital to help sales, marketing, product, customer success, business development, and partner networks get the data they need. Once the data lake and pipelines are built, the team sets up schedules and processes to deliver reports and analytics to its stakeholders on a regular basis. As the team matures with the organization, it starts looking into standards, MDM, and easy-access mechanisms like metadata repositories. It can also work with the data science teams to start adopting data version control.

Tools

There are many tools in the BI team's repertoire. Data visualization can be achieved with Tableau, Qlik, Looker, Domo, and many others. For data pipelines, teams use Apache Airflow. Streaming can be enabled using Kinesis or Kafka, followed by data aggregation with Apache Spark or Apache Storm. For storage, each cloud provider has its own solution. More mature and sophisticated teams may also use Snowflake or Databricks.
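Since Apache Airflow is the pipeline tool named above, here is a minimal sketch of what a daily BI pipeline might look like as an Airflow DAG. Airflow 2.x imports are assumed, and the DAG id, task names, and placeholder functions are hypothetical.

```python
# Minimal sketch of a daily BI pipeline as an Apache Airflow DAG (2.x assumed).
# The DAG id, task names, and the placeholder extract/aggregate steps are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_crm_metrics(**context):
    # Placeholder: pull CAC/LTV inputs from the CRM export into the data lake.
    pass

def aggregate_metrics(**context):
    # Placeholder: roll daily metrics up into the reporting tables.
    pass

with DAG(
    dag_id="daily_bi_metrics",
    start_date=datetime(2020, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_crm_metrics",
                             python_callable=extract_crm_metrics)
    aggregate = PythonOperator(task_id="aggregate_metrics",
                               python_callable=aggregate_metrics)
    extract >> aggregate  # aggregation runs only after the extract succeeds
```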

Recommendations and conclusion

Maintaining situational awareness, awareness of maturity, and an understanding of where data comes from should be central to the collection and use of data within your business. These lessons should be incorporated into all the tools, processes, and advanced analytics that leverage that data, and some of them can be borrowed directly from the data science teams. Applying this approach continuously creates a strong positive reinforcement loop: it drives continual improvement of the data analytics and BI platforms within the organization and helps teams provide better analyses of the data coming out of those platforms.

Subscribe to our Acing Data Science newsletter for more such content.

Thanks for reading! 😊 If you enjoyed it, test how many times you can hit 👏 in 5 seconds. It's great cardio for your fingers AND will help other people see the story.
