<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by MathCo on Medium]]></title>
        <description><![CDATA[Stories by MathCo on Medium]]></description>
        <link>https://medium.com/@MathCo?source=rss-60fc730c3d13------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*tLqy4W_suo5nvbixn-drkQ.jpeg</url>
            <title>Stories by MathCo on Medium</title>
            <link>https://medium.com/@MathCo?source=rss-60fc730c3d13------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 11 May 2026 16:53:42 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@MathCo/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[10 Steps to Unleashing the Power of AI: Transforming Business Processes for the Future]]></title>
            <link>https://medium.com/@MathCo/10-steps-to-unleashing-the-power-of-ai-transforming-business-processes-for-the-future-4f5608f9c3ee?source=rss-60fc730c3d13------2</link>
            <guid isPermaLink="false">https://medium.com/p/4f5608f9c3ee</guid>
            <category><![CDATA[ai-integration]]></category>
            <category><![CDATA[ai-transformation]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[ai-in-business]]></category>
            <dc:creator><![CDATA[MathCo]]></dc:creator>
            <pubDate>Wed, 16 Aug 2023 08:18:26 GMT</pubDate>
            <atom:updated>2023-08-16T08:18:26.734Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="Unleashing the power of AI and integrating it into existing manual processes." src="https://cdn-images-1.medium.com/max/1024/1*nSjbwVIMj4ufAeaQoj0f7Q.png" /></figure><p>Artificial Intelligence (AI) has emerged as a transformative force that enables organizations to optimize their operations, enhance decision-making, and unlock new opportunities. Embedding AI in business processes requires a systematic approach to ensure successful integration. In this article, we will explore the interconnected steps to integrate AI into a currently manual business process, from process flow mapping to automated decision-making, and understand how AI can revolutionize business decisions.</p><p>1. Process Flow Mapping:</p><p>The journey of AI adoption begins with understanding the existing business processes. Process flow mapping involves meticulously documenting the current processes, identifying bottlenecks, and pinpointing areas that can benefit from AI intervention. A thorough understanding of the workflow is crucial for successful AI implementation.</p><p>2. Process Digitization:</p><p>Once the processes are mapped, the next step is to digitize them. Converting manual tasks into digital formats lays the groundwork for automation and data collection. Digitization streamlines operations, making them more accessible and suitable for AI integration.</p><p>3. Data Capture for Training:</p><p>With digitization in place, capturing relevant data from various sources, such as sensors, applications, and databases, becomes essential for training AI models effectively. High-quality data ensures that AI models are accurate, reliable, and able to produce meaningful insights.</p><p>4. Process Automation:</p><p>Leveraging the digitized processes, the organization can now automate repetitive and rule-based tasks. Robotic Process Automation (RPA) plays a pivotal role in automating routine operations, increasing efficiency, and reducing human errors. This sets the stage for more advanced AI applications.</p><p>5. Machine Learning Models:</p><p>Machine Learning (ML) is a cornerstone of AI implementation. Organizations can develop ML models tailored to their specific needs. From supervised learning to unsupervised learning, ML algorithms can be applied to classify data, predict outcomes, and make informed decisions.</p><p>6. ML Models and External Pre-trained Models:</p><p>Combining internal ML models with external pre-trained models can enhance AI capabilities. Leveraging pre-trained models developed by industry leaders allows organizations to save time and resources while benefiting from cutting-edge technologies.</p><p>7. Application of ML Models and Externally Trained Models in the Decision Flow:</p><p>Integrating ML models and externally trained models into the decision flow optimizes the decision-making process. Real-time data analysis, predictive modeling, and pattern recognition enable organizations to make well-informed decisions faster and with higher accuracy.</p><p>8. Automated Decision-Making and Implementation:</p><p>As AI matures, organizations can move towards automated decision-making. AI systems can analyze vast amounts of data, identify patterns, and suggest optimal solutions. Automated decision-making not only accelerates operations but also minimizes human biases in critical choices.</p><p>9. 
Prompts to Get Recommendations:</p><p>Interactive AI systems can prompt users for recommendations, leveraging user preferences and historical data to offer personalized suggestions. This enhances customer experiences, increases engagement, and fosters loyalty.</p><p>10. Business Decisions:</p><p>Finally, AI-driven insights play a pivotal role in shaping business decisions. Data-backed intelligence allows organizations to identify emerging trends, explore new markets, optimize resource allocation, and develop competitive strategies.</p><p>The process of making AI real in an organization is a transformative journey that requires careful planning, technical expertise, and a commitment to embracing innovation. By following the interconnected steps outlined above, businesses can seamlessly integrate AI into their processes and unlock a world of opportunities, empowering them to stay ahead in the ever-evolving digital landscape. Embracing AI-driven decision-making will be the key to future success, and organizations that leverage AI effectively will undoubtedly be the leaders of tomorrow.</p><p><em>Author:</em></p><p><em>Piyush Mundhra, Partner and Head of Customer Success, TheMathCompany</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4f5608f9c3ee" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Real-Time Streaming Analytics for Faster Decision-Making]]></title>
            <link>https://medium.com/@MathCo/real-time-streaming-analytics-for-faster-decision-making-9feba00db709?source=rss-60fc730c3d13------2</link>
            <guid isPermaLink="false">https://medium.com/p/9feba00db709</guid>
            <category><![CDATA[sentiment-analysis]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[streaming-analytics]]></category>
            <category><![CDATA[real-time-analytics]]></category>
            <dc:creator><![CDATA[MathCo]]></dc:creator>
            <pubDate>Mon, 17 Jul 2023 07:22:30 GMT</pubDate>
            <atom:updated>2023-07-17T10:36:14.453Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T7_J__vdHQIknuGJz8HAOA.png" /></figure><p>The concept of data as a strategic asset has been gaining momentum in recent years. Its importance stems from the invaluable insights it provides, enabling organizations and individuals to make informed choices and drive progress. Data is essential for informed decision-making, problem-solving, innovation, personalization, performance optimization, risk management, scientific advancements, and effective governance. By harnessing the power of data, organizations and individuals can navigate the complexities of the modern world.</p><p>Most organizations are continuously striving to make rapid data-driven decisions — both strategic and operational — across multiple critical business units based on trustworthy data. Bad decisions can not only have negative implications internally but can also result in losing an important customer permanently. Therefore, organizations are increasingly adopting emerging technologies, like artificial intelligence (AI), machine learning (ML), the Internet of Things (IoT), and cloud computing, both to revolutionize operations and to keep up with competitors.</p><p>With the ongoing data explosion, owing to IoT sensors, social media, web and mobile apps, etc., there is a pressing need for real-time, data-driven decision-making that leverages streaming data analytics. Using streaming analytics, we can analyze multiple streams of data produced by a multitude of components (e.g., IoT devices, social media interactions, financial transactions, customer click-streams, etc.) to generate real-time insights and facilitate faster business decision-making.</p><p>This article showcases how we designed a platform-agnostic, real-time streaming analytics engine and how it processes a live data feed from Twitter to perform real-time sentiment analysis of users.</p><h3>Application Design</h3><p>Real-time systems are required to collect data from various sources and process them as they arrive, within a specified time interval, typically on the order of milli-, micro-, or even nanoseconds, and generate a response that delivers value.</p><p>Here are some of the key characteristics of real-time applications:</p><p>· <strong>Low Latency</strong> (extremely short processing durations)</p><p>· <strong>High Availability</strong> (fault-tolerant systems)</p><p>· <strong>Horizontal Scalability</strong> (dynamic addition of compute or storage servers based on need)</p><figure><img alt="Flowchart depicting the overall streaming analytics process." src="https://cdn-images-1.medium.com/max/1024/1*UICFUjJa7z6VwONuieKm9A.png" /></figure><p>The above flowchart represents the overall streaming analytics process. Let’s look at each component in detail.</p><h4>Data Ingestion</h4><p>The Twitter Developer API, along with the “<em>tweepy</em> [1]” Python library, is used to read live Twitter feeds and ingest them into Apache Kafka [2] queues.</p><p>Tweepy is a popular Python library that provides a convenient way to access and interact with the Twitter API. It simplifies the process of connecting to the Twitter API, retrieving tweets, posting tweets, and performing various other Twitter-related tasks, making it easier for developers to incorporate Twitter functionality into their Python applications.</p>
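<p>For illustration, here is a minimal sketch of this ingestion path, pairing tweepy’s streaming client with a Kafka producer. The topic name, filter rule, and bearer token below are placeholders rather than the actual production configuration, and error handling is omitted:</p><pre>
import json

import tweepy
from kafka import KafkaProducer  # kafka-python client

# Producer that serializes each tweet as JSON onto the raw "incoming" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

class TweetIngestor(tweepy.StreamingClient):
    def on_tweet(self, tweet):
        # Push every matching tweet into Kafka for downstream processing.
        producer.send("tweets-incoming", {"id": tweet.id, "text": tweet.text})

stream = TweetIngestor("YOUR_TWITTER_BEARER_TOKEN")
stream.add_rules(tweepy.StreamRule("data analytics lang:en"))  # illustrative rule
stream.filter()  # blocks and streams matching tweets
</pre>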
<h4>Data Transformation</h4><p>For sentiment analysis, the “<em>TextBlob</em> [3]” Python library, which provides a simple API for common NLP tasks such as part-of-speech tagging, sentiment analysis, and text classification, is used. TextBlob’s intuitive API and wide range of NLP functionalities make it a popular choice for quick prototyping, educational purposes, and lightweight NLP tasks. Its integration with NLTK and its pre-trained models make it a convenient library for various text-processing needs.</p><p>Apache Beam [4], a unified model for both batch and streaming data-parallel processing pipelines, is used to perform data transformations on the raw Twitter data, which is then pushed to the Kafka “outgoing” topic.</p>
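<p>A simplified version of this transformation step might look like the sketch below, where TextBlob assigns a polarity score inside a Beam pipeline. For readability, an in-memory source stands in for the Kafka “incoming” and “outgoing” topics; in the actual pipeline, Beam’s Kafka connectors would bracket the transform:</p><pre>
import apache_beam as beam
from textblob import TextBlob

def score_sentiment(text: str) -> dict:
    # TextBlob polarity ranges from -1 (negative) to +1 (positive).
    return {"text": text, "polarity": TextBlob(text).sentiment.polarity}

with beam.Pipeline() as pipeline:
    (pipeline
     | "Read tweets" >> beam.Create(["Love the new release!", "Terrible outage today."])
     | "Score sentiment" >> beam.Map(score_sentiment)
     | "Emit" >> beam.Map(print))  # stand-in for writing to the outgoing topic
</pre>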
href="https://superset.apache.org/">https://superset.apache.org/</a></p><p><em>Originally published at </em><a href="https://themathcompany.com/blog/real-time-streaming-analytics"><em>https://themathcompany.com</em></a><em>.</em></p><p><em>Author:</em></p><p>· <em>T</em><a href="https://www.linkedin.com/in/das-tapas/"><em>apas Das</em></a><em> (Manager — Engineering), TheMathCompany</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9feba00db709" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Optimizing Data Protection: Leveraging BigQuery for Disaster Recovery]]></title>
            <link>https://medium.com/@MathCo/optimizing-data-protection-leveraging-bigquery-for-disaster-recovery-6d587fc3d7ea?source=rss-60fc730c3d13------2</link>
            <guid isPermaLink="false">https://medium.com/p/6d587fc3d7ea</guid>
            <category><![CDATA[data-protection]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[bigquery]]></category>
            <category><![CDATA[predictive-analytics]]></category>
            <category><![CDATA[real-time-analytics]]></category>
            <dc:creator><![CDATA[MathCo]]></dc:creator>
            <pubDate>Mon, 22 May 2023 04:16:58 GMT</pubDate>
            <atom:updated>2023-05-22T04:16:58.136Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*b6C0-CGMnf2Bf0mEpXsuAw.jpeg" /></figure><h3>Understanding the Need for Disaster Recovery</h3><p>According to the Uptime Institute’s Annual Outage Analysis 2021 report, 40% of company outages or service interruptions cost between $100,000 and $1 million, with approximately 17% costing more than $1 million [1]. Therefore, for operators of mission-critical systems, recovering data from sudden outages and data breaches through effective recovery measures and backup plans remains a priority.</p><p>Disaster Recovery (DR) is an implementation strategy involving a set of tools, policies, and procedures that enable the recovery or restoration of critical technology infrastructure and systems in the aftermath of natural or human-made disasters. It helps organizations regain access to, and the functionality of, their IT infrastructure following events that disrupt business continuity.</p><p>However, a process combining DR and business analytics is complex, tedious, and expensive, and can take years to implement, depending on its complexity. The DR process requires a comprehensive analysis of geographically separated secondary data centers, recovery techniques, and the available DR solutions. It also involves building a backup plan to ensure business continuity in the case of disasters or failures. To navigate these issues surrounding DR, emerging big data technologies are being considered as potential problem solvers. For instance, when different virtual machine (VM) instances are combined, they form a single cluster with better availability and auto-scaling capabilities to meet DR requirements. Services such as BigQuery can meet disaster recovery constraints without requiring any additional effort, whereas resources such as cloud storage-based applications require extra implementation effort.</p><p>A highly resilient and efficient cloud data warehouse alternative is Google BigQuery, a serverless and scalable data warehouse with an inbuilt query engine that executes SQL queries over terabytes of data. BigQuery’s high throughput and fast execution times make it an excellent candidate for scanning terabytes of data in mere seconds. Many organizations have now adopted it to achieve enhanced performance without creating or rebuilding indexes, shifting their focus to data-driven decision-making applications and the seamless collection and storage of data beyond silos. For instance, teams are now using BigQuery to perform interactive ad-hoc queries on read-only datasets. It enables data engineers and analysts to directly build and operationalize business models on PlanetScale, Azure Cosmos DB, or static backend databases with structured or semi-structured data using simple SQL in a fraction of the time. All this is heralding a new era of accelerated innovation, making improved agility and scalability possible.</p><h3>Creating Preemptive Data Backups through Google BigQuery</h3><p>BigQuery builds on Dremel and Google’s own distributed systems to process and store large-scale datasets in column format. BigQuery also incorporates parallel execution across multiple VM instances using a tree architecture and scans data tables by executing SQL queries, thereby providing insightful outcomes in milliseconds. It can even process tables with millions of rows without compromising execution time.</p>
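<p>As a quick illustration of this serverless model, a few lines of Python are enough to run an interactive SQL query against a public dataset. This is a generic sketch assuming the google-cloud-bigquery client library and configured GCP credentials, not a description of any specific deployment:</p><pre>
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

# Ad-hoc aggregation over a public dataset; no indexes or clusters to manage.
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(sql).result():  # result() waits for the job to finish
    print(row.name, row.total)
</pre>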
<p>Compared to conventional, on-premises tools that take far longer, BigQuery returns responses within moments. Businesses are leveraging this increased efficiency to make faster decisions, develop visualized reports using aggregated data, and obtain accurate results. Recently, a major social networking site democratized its data analysis process using BigQuery, utilizing a few widely used data tables across different teams, including Finance, Marketing, and Engineering [2]. This site identified Google BigQuery as the most effective tool, alongside Data Studio, in democratizing data analysis and visualization. In addition, querying in BigQuery was observed to be easy and performant.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/775/1*0VQv0njSUeJFMXHW--pbAg.jpeg" /></figure><p>The features of Google BigQuery for disaster recovery are illustrated in the figure above. These features make Google BigQuery a better option for disaster recovery than similar applications in the market.</p><p>In general, disasters can occur at a zonal or a regional level. In the former scenario, data will not be lost, since BigQuery, by default, maintains copies of the data across different zones within a region. In the latter case, however, data can potentially be lost because, by default, no replicated copy exists in a different region. To handle such challenges efficiently, copies of the dataset must be maintained in a different region, and the BigQuery Data Transfer Service can be used to schedule the copy task.</p><h3>Business Use Cases of Google BigQuery in Disaster Recovery</h3><p>Disaster recovery is an integral part of business continuity planning (BCP), which is further defined using two concepts: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO) [3].</p><ul><li>The RTO is the part of the service level agreement (SLA) that specifies the maximum acceptable time for which the application can be offline.</li><li>The RPO specifies the maximum acceptable period for which data might be lost due to a sudden system failure.</li></ul><p>Generally, the smaller the values of RTO and RPO (which define how fast the application can recover from a failure or disruption), the higher the execution cost.</p><h3>A Use Case for DR: Real-time Analytics</h3><p>In real-time analytics processes, data streams are continuously ingested from endpoint logs into BigQuery. For systems lacking BigQuery’s disaster recovery capability, ensuring seamless data processing and protection for an entire region would require continuously replicating data and provisioning slots in a different region. Nonetheless, assuming that the system is resilient to data adversity thanks to Pub/Sub and Dataflow in the ingestion path, such a high level of storage redundancy results in a clear cost disadvantage. In a DR-enabled system, however, users only have to configure BigQuery data at a location of their choice, say zone A. From there, the data is exported to cloud storage under the archive storage class in zone B. In the event of a machine-level disruption, BigQuery continues to execute with only a few milliseconds of delay, and currently running queries continue to be processed to support teams’ real-time analytics needs. In the case of a regional catastrophe resulting in data loss in one of the zones, DR allows users to create an exhaustive backup from storage in the other zone.</p>
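<p>As noted earlier, the BigQuery Data Transfer Service can schedule the cross-region copy task. The snippet below sketches what that configuration can look like with the Data Transfer Service Python client and the cross-region copy data source; the project, dataset, and schedule values are placeholders, and the pattern follows Google’s published samples:</p><pre>
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Nightly copy of a production dataset into a backup dataset in another region.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="analytics_backup",   # dataset in the DR region
    display_name="Nightly cross-region DR copy",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": "my-project",
        "source_dataset_id": "analytics_prod",
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print("Created transfer config:", transfer_config.name)
</pre>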
<p>Additionally, BigQuery allows users to further strengthen their data recovery strategy by creating cross-region replicas of their datasets.</p><h3>Mitigating Data Losses for the Future</h3><p>BigQuery is an excellent tool for storing and exploring granular data, offering advantages such as cost transparency, seamless and effective integration with other components, and better reliability and scalability compared to its competitors. Unlike conventional techniques, BigQuery does not require any additional configuration. It is widely adopted in analytics projects to handle large-scale data by running simple queries. Considering the challenges businesses face today, with growing data volumes and the shortcomings of traditional data warehousing, Google BigQuery can process huge volumes of data in the background to meet requirements. It also presents itself as an effective choice when database resources are limited.</p><p>With the increasing centralization of data in IT systems and the growing significance of cloud services, a significant number of database management systems are being outsourced to cloud service providers. Outsourcing an entire database to an external provider requires a scalable processing system to access and process all the information in the database. For businesses looking to achieve this on a single platform, Google BigQuery offers a comprehensive, secure, and multifunctional solution that processes large-scale outsourced data at significantly higher speeds than other cloud service platforms. Looking ahead, it seems clear that many data analytics and ICT organizations will shift their focus from traditional DBMSs to Google BigQuery without having to worry about the significant costs associated with data processing and the maintenance of complex infrastructure.</p><h3><strong>Bibliography</strong></h3><p>[1] <a href="https://www.techtarget.com/searchdisasterrecovery/definition/disaster-recovery#:~:text=The%20Uptime%20Institute&#39;s%20Annual%20Outage,cost%20more%20than%20%241%20million">https://www.techtarget.com/searchdisasterrecovery/definition/disaster-recovery</a></p><p>[2] <a href="https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/democratizing-data-analysis-with-google-bigquery">https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/democratizing-data-analysis-with-google-bigquery</a></p><p>[3] <a href="https://cloud.google.com/architecture/dr-scenarios-planning-guide">https://cloud.google.com/architecture/dr-scenarios-planning-guide</a></p><p><em>Originally published at </em><a href="https://themathcompany.com/blog/optimizing-data-protection-leveraging-bigquery-for-disaster-recovery"><em>https://themathcompany.com</em></a><em>.</em></p><p><em>Author:</em></p><ul><li><a href="https://www.linkedin.com/in/rohanpr/"><em>Rohan PR</em></a><em> (Manager — Delivery), TheMathCompany</em></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6d587fc3d7ea" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Empowering Co.dx’s Minerva Chatbot with ChatGPT]]></title>
            <link>https://medium.com/@MathCo/empowering-co-dxs-minerva-chatbot-with-chatgpt-18e3f5742faf?source=rss-60fc730c3d13------2</link>
            <guid isPermaLink="false">https://medium.com/p/18e3f5742faf</guid>
            <category><![CDATA[chatbots]]></category>
            <category><![CDATA[naturallanguageprocessing]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[chatbots-for-business]]></category>
            <category><![CDATA[chatgpt]]></category>
            <dc:creator><![CDATA[MathCo]]></dc:creator>
            <pubDate>Fri, 19 May 2023 12:05:43 GMT</pubDate>
            <atom:updated>2023-05-22T03:42:35.206Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BbsqiMZ0HdLALEacEK1tlw.jpeg" /></figure><p><a href="https://themathcompany.com/meet-codx">Co.dx, our proprietary AI/ML master engine</a>, lies at the heart of most of our solutions, and Minerva, the platform’s integrated chatbot, has been an important part of the platform for a few years now. At MathCo, we believe that a big part of humanizing solutions and products lies in how you interact with them. Minerva was built with the goal of humanizing user touchpoints so that it feels less like a chatbot and more like a virtual data scientist, helping you save time and effort as you work towards your and your organization’s goals. It has done that and more for a while now, but with generative AI taking the spotlight, what is the future for Minerva?</p><p>To see what the future holds for Minerva, we first have to understand what makes it unique. Co.dx launched Minerva with natural language processing (NLP) capabilities that help it understand human speech and context better. Users can ask questions in natural language and get appropriate responses in the form of charts, data summaries, etc. Beyond these, Minerva has some key features that make it more than a simple ask-and-answer interface:</p><ul><li><strong>Built with NLP:</strong> Ask questions about your data in natural language. Behind the scenes, Minerva uses natural language processing to translate user questions into SQL queries.</li><li><strong>Visualization on demand: </strong>A curated list of data visualizations that help you answer your question. As a user, you have the flexibility to choose your visualizations.</li><li><strong>Data Stories:</strong> Add your on-demand analysis and queries into Data Stories, Co.dx’s reporting module. Schedule and automate reports created using Data Stories.</li><li><strong>Flexible:</strong> Ability to customize and contextualize metrics and user questions based on the problem statement and business use case.</li></ul><h3>How ChatGPT empowers Minerva</h3><p>Minerva has played an important part within the Co.dx ecosystem over multiple years, thanks to regular updates that make sure it never feels outdated. One update that adds a new dimension to the tool, however, is the integration of OpenAI’s renowned ChatGPT. But how exactly does ChatGPT enhance an already competent chatbot? Read on to learn how:</p><h4>1. Translating queries to SQL:</h4><p>Because ChatGPT’s advanced large language model (LLM) has been trained extensively on data from across the internet, Minerva can now translate natural language into SQL queries even more capably, cutting down on the time and effort required and allowing more complex queries to be processed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*V6nQcB5OJaxXes6C.png" /><figcaption><em>Disclaimer: Images are representative and are not based on real-world data</em></figcaption></figure>
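<p>Conceptually, this kind of natural-language-to-SQL translation can be sketched in a few lines. The snippet below is a generic illustration, not Minerva’s actual integration: the schema, model choice, and helper name are hypothetical, and it uses the OpenAI Python API as of early 2023:</p><pre>
import openai  # assumes the openai Python package with an API key configured

SCHEMA = "sales(brand TEXT, country TEXT, year INT, revenue NUMERIC)"  # illustrative

def to_sql(question: str) -> str:
    """Ask the LLM to translate a business question into a SQL query."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Translate the user's question into a single SQL query "
                        "against this schema, returning only SQL: " + SCHEMA},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# e.g., to_sql("sales across countries in 2020 for brand X")
</pre>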
<h4>2. Data Summarization:</h4><p>Automated data summaries help you quickly draw insights from your data. Supplement auto-generated graphs and visuals with relevant statistical and problem-specific summaries and insights.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*k9YW7eARBvTD8VZI.png" /><figcaption><em>Disclaimer: Images are representative and are not based on real-world data</em></figcaption></figure><h4>3. Visualization and Contextualization:</h4><p>Through the integration, Minerva is developing the ability to generate insights and summarizations based on generated charts and visuals. For example, previously, if you had asked for “sales across countries in 2020” for a particular brand, a color-coded map would have been visualized for you. Now, additional insights, such as the value or percentage of sales in the highest- and lowest-selling regions, may also be presented alongside for better context. Not only would this provide more context and information, but it would also complement Minerva’s VDE to make it more effective.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WpFDoep9w503fs42.png" /><figcaption>Disclaimer: Images are representative and are not based on real-world data</figcaption></figure><h4>4. Query unstructured knowledge base:</h4><p>Minerva now has the ability to query the unstructured knowledge that exists in an organization in the form of PDFs, documents, presentations, etc. This unstructured data can be used to enhance the context and knowledge of a base LLM.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QWSA0lNfIpmXMOHC.png" /><figcaption><em>Disclaimer: Images are representative and are not based on real-world data</em></figcaption></figure>
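<p>One common way to ground an LLM in such internal documents is retrieval augmentation: embed document chunks, retrieve the most relevant one for a question, and pass it to the model as context. The toy sketch below illustrates the idea only; the documents, model names, and dot-product retrieval are simplifications, not a description of Minerva’s internals:</p><pre>
import numpy as np
import openai

def embed(texts):
    # text-embedding-ada-002 returns unit-length vectors, so a dot
    # product is equivalent to cosine similarity.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

# Chunks extracted from PDFs, decks, etc. (toy examples)
docs = ["Q3 sales grew 12% in EMEA, led by brand X.",
        "The 2020 launch plan covered five countries."]
doc_vecs = embed(docs)

def answer(question: str) -> str:
    query_vec = embed([question])[0]
    context = docs[int(np.argmax(doc_vecs @ query_vec))]  # nearest chunk
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using this context: " + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
</pre>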
<p>For users and clients, this would not change the way they engage with the tool but only enhance existing features. Minerva has been helping CXOs and product teams alike in enabling quick turnarounds and automating work to a large extent. Its integration with ChatGPT takes it to the next level, leading the way in a world learning to live and work with generative AI.</p><p><em>Originally published at </em><a href="https://themathcompany.com/blog/empowering-codxs-minerva-chatbot-with-chatgpt"><em>https://themathcompany.com</em></a><em>.</em></p><p><em>Authors:</em></p><ol><li><a href="https://www.linkedin.com/in/sourav-banerjee-0735aa4/"><em>Sourav Banerjee</em></a><em> (Partner — Innovation), TheMathCompany</em></li><li><a href="https://www.linkedin.com/in/srishti-nagu-569b5210b/"><em>Srishti Nagu</em></a><em> (Associate — Innovation), TheMathCompany</em></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=18e3f5742faf" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Forecasting at Scale: MathCo’s skCATS Model Ranks Among Top 10 in the M6 Competition]]></title>
            <link>https://medium.com/@MathCo/forecasting-at-scale-mathcos-skcats-model-ranks-among-top-10-in-the-m6-competition-61919f893651?source=rss-60fc730c3d13------2</link>
            <guid isPermaLink="false">https://medium.com/p/61919f893651</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[forecasting-models]]></category>
            <category><![CDATA[financial-forecasting]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[MathCo]]></dc:creator>
            <pubDate>Fri, 19 May 2023 11:58:55 GMT</pubDate>
            <atom:updated>2023-05-19T11:58:55.621Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-gY_awUgGZa-bc1Ak2_7rg.png" /></figure><p><strong><em>TheMathCompany, a global provider of advanced analytics business solutions, finished in the Top 10 of the 6th edition of the highly successful time-series forecasting M competition. MathCo’s Innovation team was able to outperform the baseline and predict the future performance of selected stocks and ETFs (Exchange Traded Funds) using its original internal model named skCATS (Complete Automated Time-Series).</em></strong></p><p>From the invention of astronomy to modern-day weather predictions, the science of forecasting has been an object of human pursuit since time immemorial. Today, organizations everywhere leverage analytics tools to analyze historical data and forecast trends, aiding in business planning and critical decision-making.</p><p>Yet, time-series forecasting, a technique that makes scientific predictions based on historical time-stamped data, remains a complex analytical task for most enterprises even today. Transient anomalies, the adaptivity of forecasting methods, and the scalability of the data and pipeline infrastructure have been some of the persistent challenges of scaling time-series forecasting in business contexts.</p><p>Financial markets are one of the foremost areas in which time-series forecasting is applied today. This is where the competition under discussion comes into play. The 6th iteration of the M competition, known as the M6 competition, focused on evaluating the accuracy and value of time-series forecasts in examining the validity of the Efficient Markets Hypothesis (EMH). This year, TheMathCompany’s time-series forecasting skCATS model secured the 8th position overall on the global M6 leaderboard.</p><p>Check out the story behind the development of the skCATS model and why it worked, along with a few tips for next year’s participants from our Innovation team.</p><h3>The M6 Competition</h3><p>In 1982, Spyros Makridakis, one of the world’s leading experts on forecasting, began a series of competitions to monitor forecasts in real life and rate their accuracy. The vision was to develop breakthroughs and solutions in response to real-world challenges. The latest edition of this competition, held from March 2022 to February 2023, was the first to tackle and question the EMH, which posits that all asset prices reflect all known information and that, therefore, the market is perfectly efficient.</p><p>The M6 competition sought to find empirical evidence on how investors can enhance the accuracy of their forecasts and utilize their findings to build resilient and lucrative portfolios. However, given the vastness of the field, the myriad questions forecasting tries to answer, and the countless forecasting approaches available, attaining benchmark accuracy is no small feat. The M6 competition pits different methods against each other to determine which performs best in different real-world scenarios.</p><p>The duathlon challenged the participants in two ways:</p><p>1. Provide probabilities for the performance quintile of 100 selected stocks and ETFs for the upcoming month.<br> 2. Provide an investment portfolio each month based on their performance expectation of the same stocks and ETFs.</p><h3>skCATS &amp; Financial Forecasting</h3><p>The skCATS model was developed to forecast time series at scale, delivering fast, highly accurate, and cost-effective results. 
The model uses a two-model strategy: the first model focuses on generating ranked forecasts of the chosen 100 ETFs and stocks based on past performance, and the second focuses on determining whether the results could beat the market baseline. Predictions were produced by combining the two sets of model results. The investment decision model used an ensemble approach, combining portfolio optimization theory and business fundamentals. The development processes used to satisfy the dual challenges are explained below.</p><h3>Development of the skCATS Model</h3><h4>a. Forecasting Challenge</h4><p>The Innovation team began testing and refining its methods prior to the release of the competition’s stock and ETF list. To do so, a random sample of 100 assets from different sectors of the S&amp;P 500 was selected. The team then generated input variables using fundamental analysis (operational variables like profit/loss ratio, liquidity, etc.) as well as the macro drivers (inflation, volatility index, unemployment, etc.) of each industry and applied traditional time-series models to generate the results.</p><p>However, it soon became clear during the exercise that this approach would not suffice, as performance failed to consistently beat the baseline. The problem was further complicated by the need to generate forecasts and confidence intervals for the input as well as the output variables in order to forecast the future. With some minor adjustments, the Innovation team was able to transform this task into a classification problem to obtain a simple relative ranking of the 100 assets.</p><p>The team hypothesized that the historical relative ranking of assets would not differ significantly from their ranking in the following month. This assumption was tested by creating rolling relative rank probabilities for window lengths ranging from 40 to 56 months. After calculating the normalized count of each asset falling into a specific rank, the team combined all 17 rolling probabilities, resulting in 85 rows of features for the model.</p>
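<p>To make the feature construction concrete, here is a toy sketch of one way such rolling rank probabilities could be computed with pandas. It assumes a DataFrame of monthly returns with one column per asset; the quintile bucketing and window handling are illustrative simplifications, not the team’s exact code. Note that 17 window lengths (40 through 56) times 5 quintile probabilities yields the 85 features mentioned above:</p><pre>
import pandas as pd

def rolling_rank_probs(returns: pd.DataFrame, window: int) -> pd.DataFrame:
    """Share of the trailing `window` months each asset spent in each
    cross-sectional performance quintile (a normalized count)."""
    # Quintile (0-4) of every asset in every month, by percentile rank.
    quintiles = returns.rank(axis=1, pct=True).mul(5).sub(1e-9).astype(int).clip(0, 4)
    recent = quintiles.tail(window)
    # Normalized frequency of each quintile over the window, per asset.
    return pd.DataFrame({q: (recent == q).mean() for q in range(5)})

# Combining all 17 window lengths gives 17 x 5 = 85 features per asset:
# features = pd.concat(
#     [rolling_rank_probs(returns, w) for w in range(40, 57)], axis=1)
</pre>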
<p>These sector-wise models significantly outperformed the baseline during the back tests. MathCo’s team then defined 15 unique sectors within the M6 universe and tuned 15 skCATS classification models with the objective of accurately classifying the performance of stocks and ETFs for the following month. The team finalized the models after 10 months of rigorous back testing.<br> Additionally, since the skCATS model training period did not account for inflation-related volatilities, the team trained a second skCATS model that assessed the inflation volatility each sector faced and re-adjusted the results of the first model accordingly throughout the competition.</p><h4>b. Investment Challenge</h4><p>To inform the investment decisions, the team developed multiple independent methods for assigning weights to the assets. Some methods relied purely on historical data, while others were based on fundamental ratios. The team created a weighted ensemble of all methods to inform its decisions for the following month and fine-tuned the weighting system over a 10-month back-testing period to maximize the Sharpe ratio. As with the forecasting process, the team adjusted investment weights with a second model that assessed potential sector volatility based on the sentiments expressed in the Federal Open Market Committee (FOMC) minutes.</p><h3>How skCATS was Modified to Tackle Real-World Problems</h3><p>Reliable forecasts are needed for multiple applications like capacity planning, merchandising, and web traffic forecasting. However, scaling forecasts to large volumes of data presents unique challenges, such as the need to parallelize model execution for faster run times, choose the right modelling approach based on the time-series pattern, and make trade-offs between explainability and accuracy. MathCo’s proprietary AI/ML master engine, Co.dx, provides a no-code interface to scale forecasts reliably. The skCATS forecasting approaches have now been integrated with this AI-powered analytics platform. Co.dx can then be used for demand forecasting at scale, as it automatically provisions the right infrastructure and selects the right model for each time series. TheMathCompany thus blends state-of-the-art deep learning processes, statistical forecasting methods, and specialized techniques for intermittent forecasting to arrive at optimal forecasts for thousands of time series. Using the updated Co.dx engine, the company’s internal benchmarks have shown up to 60% time savings in reliable forecast generation for multiple time series.</p><h3>Conclusions and Recommendations</h3><p>Any analytics solutions provider should be focused on productionizing complex data systems for clients. To achieve this, the company’s products must work well in live settings, which are characterized by incomplete datasets with varying frequencies, continuous random shocks, multiple sources of noise, and complex feedback loops. Live-testing products prior to launch is, therefore, critical to ensuring process optimization, excellent software quality, and operational efficiency.</p><p>As a successful global consulting firm that caters to Fortune 500 firms, TheMathCompany continuously pushes technological boundaries to help clients address concerns and overcome challenges. The live nature of the M6 contest was, therefore, an ideal scenario in which to benchmark the performance of the skCATS model. Since the overall forecasting results are similar to those observed during the back tests, skCATS can evidently produce consistent results under live scenarios.</p><p>However, it is important to remain circumspect and not over-extend the conclusions. While the overall forecast results stayed below the baseline during both the back tests and the actual competition, a month-to-month analysis shows that the MathCo approach beat the baseline only 6–8 times a year. As the M6 winners managed to keep their predictions below the baseline for all 12 months, there was certainly scope to improve on the skCATS models.</p><p>At the same time, although the EMH stipulates that no one can consistently outperform the market, MathCo’s Innovation team saw clear opportunities to outperform the market over the duration of the competition. This was primarily due to a mismatch between the Federal Reserve’s inflation expectations and interest rate path and the market’s outlook on the same. 
The team found that the markets continuously underestimated, and then readjusted to, the Federal Reserve’s assumptions and associated actions in this regard.</p><p>In every iteration, the M competition witnesses transformative developments in the arena of financial forecasting. TheMathCompany’s Innovation team is proud to have taken part in, and been placed in the Top 10 of, the prestigious M6 competition. As a firm dedicated to optimization and innovation, MathCo continues to use the competition as a benchmark to improve its products for the benefit of its clients.</p><p><em>Originally published at </em><a href="https://themathcompany.com/blog/skCATS-forecasting-Model-Ranks-Among-Top-10-in-the-M6-Competition"><em>https://themathcompany.com</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=61919f893651" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Consumerizing Data at Scale with Data Mesh]]></title>
            <link>https://medium.com/@MathCo/consumerizing-data-at-scale-with-data-mesh-41947a0b371e?source=rss-60fc730c3d13------2</link>
            <guid isPermaLink="false">https://medium.com/p/41947a0b371e</guid>
            <category><![CDATA[data-architecture]]></category>
            <category><![CDATA[database-management]]></category>
            <category><![CDATA[data-mesh]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[data-lake]]></category>
            <dc:creator><![CDATA[MathCo]]></dc:creator>
            <pubDate>Fri, 19 May 2023 09:18:26 GMT</pubDate>
            <atom:updated>2023-05-22T03:51:56.556Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NYgtgFL5602yGLCVP00_Aw.png" /></figure><h3>Data Architecture: Why is it crucial to know?</h3><p>In the age of self-service business intelligence [1], becoming a data-driven organization remains the chief strategic goal of every company. However, few companies give their data architecture the level of democratization and scalability it deserves.</p><p>Both the analytics and technology industries are now in a state of transition. In fact, we rarely use the phrase “Big Data” anymore; instead, we talk about “digital transformation” or “data-driven organizations”. The industry, largely, has realized that data is not the new oil because, unlike oil, the same data can be repurposed for several initiatives.</p><p>Much in the same way that software engineering teams transitioned from monolithic applications to microservice architectures, data mesh is, in many ways, emerging as the data platform version of microservices.</p><p>This new kind of data architecture will empower faster innovation cycles and lower the costs of operations, with evidence from early adopters validating potential large-scale benefits [2][3][4].</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*at4c3EtwjoOwhMqWA1HqKA.png" /></figure><h3>Traditional Data Lake Architecture: An E-commerce Case Study</h3><p>A typical data lake architecture for an e-commerce business mainly comprises the following domains:</p><ul><li>Customer domain</li><li>Order domain</li><li>Invoice domain</li><li>Inventory domain</li></ul><p>For each domain, a team of data engineers loads all the data, via ETL tools or streaming solutions, onto a central platform (the data lake). Although each team may possess expertise about its specific domain, a knowledge gap among different teams and their data sets may persist.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dgbfkHJUomAEsyJ-Rv9-Aw.png" /></figure><p>What are the potential challenges with this architecture?</p><ul><li><strong>The data platform is monolithic</strong>, and although it serves multiple teams, from a development perspective it is maintained by only a single team.</li><li>It creates a <strong>central bottleneck</strong> on the data engineering front, and teams need broad experience in software development, data engineering, business analytics, and the data sources used in order to maintain the data platform.</li><li>The domain knowledge that the individual source teams possess is likely to be lost on its way to the central hub.</li><li>Since the data platform team performs the ETL that finally publishes the data, end users need to work with the data platform team on their data requirements.</li><li>Ownership is not clear, as data flows from sources to consumers and is transformed by different teams in the process.</li></ul><h3>How does Data Mesh approach this problem?</h3><p>In this e-commerce scenario, rather than centralizing data, data mesh emphasizes four key principles:</p><ul><li>Domain-oriented decentralized data ownership and architecture</li><li>Data-as-a-product</li><li>Self-serve data infrastructure as a platform</li><li>Federated computational governance</li></ul><p>Data mesh moves data analysis and management closer to the specific domain teams, such as customer insight, sales, or data science teams, who best understand the data. In this pragmatic and automated approach, each team owns and is responsible for the data in its domain.</p>
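<p>To make the “data-as-a-product” principle tangible, here is a minimal, hypothetical sketch of the kind of product descriptor a domain team might publish so that its data is discoverable and clearly owned. All names and fields are illustrative, not a prescribed standard:</p><pre>
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Descriptor a domain team publishes for one of its data products."""
    domain: str          # owning domain, e.g. "orders"
    name: str            # product name, e.g. "orders_daily_snapshot"
    owner: str           # accountable domain team, not a central platform team
    schema_uri: str      # where consumers can find the published schema
    freshness_sla: str   # e.g. "updated hourly"
    tags: list = field(default_factory=list)

# Each domain registers its own products; consumers browse the registry
# instead of filing tickets with a central data engineering team.
registry = [
    DataProduct("orders", "orders_daily_snapshot", "orders-team",
                "s3://mesh/orders/schema.json", "updated hourly", ["e-commerce"]),
]
</pre>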
<p>Here is what the same e-commerce business would look like with a data mesh architecture:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8Bf-a_GnmCZXNLn5wVoUgw.png" /></figure><p>What has changed in the transition to data mesh?</p><ul><li>Individual teams, such as the analytics &amp; marketing teams, can now directly access data from the source domains. They do not need to consult the platform team for implementation or metadata integration. This self-service approach helps keep potential bottlenecks at bay.</li><li>Owing to decentralization, each domain has its own resources and implementation, avoiding overlap.</li><li>In this new structure, data platform teams do not require a deep understanding of the domains; being skilled in software development and data engineering will suffice.</li><li>Ownership becomes clear, since domain teams are responsible for providing reliable data from the source to consumers, with platform teams supporting integration.</li></ul><h3>To Mesh or Not to Mesh: Calibrating Data Mesh against your Business Scale</h3><p>Below is a simple questionnaire to determine whether it will be worthwhile for your organization to invest in a data mesh.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u3Bb45-Av36YVWyXQwbgnw.png" /></figure><ul><li><strong>1–15:</strong> Given the size and dimensionality of the data ecosystem, a data mesh may not be the right approach for your organization at present.</li><li><strong>15–30:</strong> The organization is maturing rapidly and may even be at the crossroads of leaning into data. The score strongly suggests following data mesh best practices so that migration at a later stage becomes easier.</li><li><strong>30 or above:</strong> Data organization is an innovation driver for the company, and a data mesh will support any ongoing or future initiatives to democratize data and provide self-service analytics across the enterprise.</li></ul><h3>New Directions in Data Architecture: Data Mesh</h3><p>On the surface, the idea of a data mesh is not very different from that of several software-as-a-service (SaaS) applications, as it involves obtaining customers, offering data as products, and selling and shipping them: for instance, data for demand forecasting or customer segmentation, or a BI dashboard.</p><p>However, implementing a data mesh requires two key considerations:</p><p><strong>1.</strong> It takes time and a change in approach to create a platform such as this. It is like deciding on an MVP architecture: “should it be monolithic or microservices?”</p><p><strong>2.</strong> How can one logically group and organize domains? 
This requires an enterprise view and likely a cultural shift for your organization as well, entailing federated data ownership among data domains, with owners accountable for providing their data as products.</p><p>Although it is easier, faster, and cheaper to deliver a monolithic application, especially for the very first release, businesses that can identify their key requirements as described above can unlock new potential and create value with a data mesh.</p><h3>Bibliography:</h3><p>[1] <a href="https://searchbusinessanalytics.techtarget.com/definition/self-service-business-intelligence-BI">https://searchbusinessanalytics.techtarget.com/definition/self-service-business-intelligence-BI</a></p><p>[2] <a href="https://medium.com/intuit-engineering/intuits-data-mesh-strategy-778e3edaa017">https://medium.com/intuit-engineering/intuits-data-mesh-strategy-778e3edaa017</a></p><p>[3] <a href="http://www.assetservicingtimes.com/assetservicesnews/dataservicesarticle.php?article_id=10675">www.assetservicingtimes.com/assetservicesnews/dataservicesarticle.php?article_id=10675</a></p><p>[4] <a href="https://www.youtube.com/watch?v=TO_IiN06jJ4">https://www.youtube.com/watch?v=TO_IiN06jJ4</a></p><p><em>Originally published at </em><a href="https://themathcompany.com/blog/consumerizing-data-at-scale-with-data-mesh"><em>https://themathcompany.com</em></a><em>.</em></p><p><em>Author:</em></p><ul><li><a href="https://www.linkedin.com/in/das-tapas/"><em>Tapas Das</em></a><em> (Manager — Engineering), TheMathCompany</em></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=41947a0b371e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Causal Inference and Statistical Tests for Business Analytics]]></title>
            <link>https://medium.com/@MathCo/causal-inference-and-statistical-tests-for-business-analytics-ea0247b7e15c?source=rss-60fc730c3d13------2</link>
            <guid isPermaLink="false">https://medium.com/p/ea0247b7e15c</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[statistical-test]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[causal-inference]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[MathCo]]></dc:creator>
            <pubDate>Fri, 19 May 2023 09:05:02 GMT</pubDate>
            <atom:updated>2023-05-22T03:55:36.310Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ncg4RRJHpihAAkh5atIfbg.jpeg" /></figure><h3>What is Causal Inference?</h3><p>Suppose a fashion retail company runs a promotional discount offer over a weekend and sees a considerable increase in sales that weekend. Should it celebrate the success of the promotional campaign that led to all those increased sales? Or should it pause to learn whether the uptick in sales was truly because of the offer or due to the festive season that weekend?</p><p>Let’s take another example. There are two restaurants, A’s Pizzas and B Pizzeria, which sell similar kinds of pizza. B Pizzeria sells 100 pizzas per day, while A’s Pizzas, located in a busy part of town with a lot of passers-by, sells 500 pizzas per day. One day, A’s Pizzas decides to drop its prices by 5%. After this, we notice that A starts selling 1000 pizzas per day, and B starts selling 200 pizzas per day. Did the price drop by A lead to the drastic difference in sales, of 800 pizzas per day, between the two stores?</p><p>The answer to both questions is not simple. To answer these kinds of business questions, we would need to conduct randomized controlled trials to establish a cause-and-effect relationship. In the first example, this could have taken the form of the fashion retailer rolling out the offer in some stores while keeping prices unchanged in other comparable stores on the same weekend to measure the effect of promotions. However, conducting such elaborate experiments can be expensive and is often not feasible. Therefore, in cases such as these, we often analyze after-the-fact observational data to find out whether a particular strategy (a promotional offer or a price drop in our examples) led to an increase in sales.</p><p>As data scientists, we’re interested in how to train a machine to understand such cause-and-effect relationships and help establish what is called “causal inference”. In the subsequent sections of this article, we explain the core concepts behind causal inference and its applications in the business setting.</p><h3>Causal Modeling in Machine Learning</h3><p>While machine learning techniques do a great job at predicting an outcome, they do not (yet) answer the why behind the predictions. Explainable AI techniques like SHAP and LIME have become popular choices for bringing transparency into these predictions. However, they are post-hoc explanations of the ML model; i.e., their purpose is to explain the correlations detected by the ML model (often by creating another model for the explanation). Causal models, on the other hand, are purpose-built to discover causality and answer causal questions.</p><h3>Logic behind Causal Inference: Counterfactual Analysis</h3><p>Causal modeling deals with the problem of estimating counterfactuals. Let’s consider a situation where an e-commerce company wants to run an email campaign to boost purchases of its products and then find out how much of the purchasing was caused by the email campaign. In causal modeling jargon, the promotional email is the “treatment”, and purchase is the “outcome”. 
<h3>Understanding Causal Inference Concepts</h3><p>Before getting to the process, let’s first understand a few basic concepts with the help of examples.</p><h4>Directed Acyclic Graph (DAG)</h4><p>A directed acyclic graph (DAG) provides a visual representation of the causal relationships among a set of variables. Causal models usually take the DAG as a starting point and estimate causal effects in the graph.</p><p>The DAG should be carefully constructed, with multiple viewpoints from subject matter experts, as it will determine the rest of the causal modeling process. It is important to note that a DAG may not always represent the true nature of the causal relationships; rather, it is a framework for explicitly declaring our hypotheses about those relationships. A single problem, if highly complex, can admit multiple DAGs. While there are algorithms that can help discover the causal structure, each result represents just one possibility for the true DAG. Determining the causal structure for real-world problems can be challenging, which is why incorporating domain knowledge into the causal modeling process is crucial.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8yCb_vJyd_A63ObaZMQPig.jpeg" /><figcaption>Fig. 1: Illustrative DAG</figcaption></figure><h4>Confounders</h4><p>A key challenge in estimating the causal effect of any treatment is the effect of confounders on both the treatment and the outcome. A variable is called a confounder if it predicts both the treatment and the outcome. In the earlier email campaign example, what if the e-commerce company sent emails mostly to high-income customers, and high-income customers also purchase more? In this scenario, income is the confounder that makes it challenging to estimate the true effect of the treatment (email) on the outcome (purchase), as explained in Figure 2.</p><p>Loosely put, causal inference frameworks estimate the causal effect by removing all connections to the treatment, keeping everything else the same. As explained in Figure 3, a causal model can remove the confounding effect of income and then estimate the causal effect of email on purchase.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gZ6Chnp-spSw7rbzfxlo4A.jpeg" /><figcaption>Fig. 2: Income is the confounder that makes it challenging to estimate the causal effect of email on purchases<br> Fig. 3: The causal model estimates the impact of treatment on outcome by removing the confounding effect</figcaption></figure>
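<p>The DAG for this confounded email example is small enough to write down explicitly. Below is a minimal sketch using the open-source networkx library; the node names are simply the variables from our example.</p><pre>import networkx as nx

# Hypothetical DAG for the email-campaign example (Fig. 2):
# income confounds both the treatment (email) and the outcome (purchase).
dag = nx.DiGraph([
    ("income", "email"),     # income influences who receives the email
    ("income", "purchase"),  # income also influences how much is purchased
    ("email", "purchase"),   # the causal effect we want to estimate
])
assert nx.is_directed_acyclic_graph(dag)  # a DAG must contain no cycles</pre>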
<h4>Matching</h4><p>The example in Table 2 shows two sets of customers and their purchase values: a) customers who were sent promo emails, and b) customers who were not.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ufGN3bJHdGMcPwSa0fFwRQ.jpeg" /><figcaption>Table 2: Customer purchase data</figcaption></figure><p>The mean purchase values of the two groups of customers are $400 and $233, respectively. At first glance, it looks as if sending promo emails caused $167 in incremental purchases. However, notice that the income of customers who received promo emails is higher than that of customers who did not. The effect of promo emails on purchase value is confounded by income in this example.</p><p>To estimate the true causal effect, we need to match customers who received emails with customers at the same income level who did not receive emails. Table 3 is constructed to estimate the true causal effect of promo emails by conditioning on income. The conclusion is that promo emails indeed lead to higher purchases, but the incremental value of those induced purchases is much smaller, at $67.<br> While we have presented a simplistic example here, real-world problems involve multiple confounders, and hence more sophisticated matching algorithms are used.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qIIFSzCHmR76pytq1PcXcA.jpeg" /><figcaption>Table 3: Matching to estimate causal effect</figcaption></figure><h3>Causal Inference Methods: Highly Significant for Data Scientists and Businesses</h3><p>The adoption of causal inference methods is increasing wherever the objective is to answer why something is happening. They are an additional tool in the data scientist’s kit for use cases where causality is more important than prediction. We have seen successful applications of causal inference in answering questions such as “What causes customers to become more or less loyal?” and “What factors cause higher adoption rates for a technology product?”</p><p>Tech giants [1][2] are leading the research in causal inference to extend the capabilities of the technology. Developments in causal machine learning, where the aim is to build causality into the machine learning process, are extremely promising. There are several useful open-source libraries for causal modeling that make it relatively simple to get started with these approaches.</p>
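<p>As an illustration, here is a minimal sketch using DoWhy, one such open-source library from Microsoft’s causal inference group [1]. The data frame and column names are invented for the email example above; treat this as a sketch of the workflow rather than a production implementation.</p><pre>import pandas as pd
from dowhy import CausalModel  # pip install dowhy

# Invented observational data for the email example:
# income confounds both the treatment (email) and the outcome (purchase).
df = pd.DataFrame({
    "email":    [1, 1, 1, 0, 0, 0, 1, 0],
    "income":   [90, 85, 40, 80, 45, 35, 70, 60],       # illustrative, in $K
    "purchase": [500, 450, 250, 420, 210, 70, 380, 300],
})

model = CausalModel(
    data=df,
    treatment="email",
    outcome="purchase",
    common_causes=["income"],  # declare the confounder from the DAG
)
estimand = model.identify_effect()
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)  # estimated causal effect of email on purchase</pre><p>DoWhy also provides matching-based estimators (for instance, method_name="backdoor.propensity_score_matching") that mirror the conditioning-on-income logic of Table 3, along with refutation tests to probe how sensitive an estimate is to the assumed DAG.</p>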
<h3>Glossary:</h3><ol><li><strong>SHAP</strong>: SHapley Additive exPlanations</li><li><strong>LIME</strong>: Local Interpretable Model-agnostic Explanations</li></ol><h3>Bibliography:</h3><p>[1] <a href="https://www.microsoft.com/en-us/research/group/causal-inference/">https://www.microsoft.com/en-us/research/group/causal-inference/</a></p><p>[2] <a href="https://www.uber.com/en-IN/blog/causal-inference-at-uber/">https://www.uber.com/en-IN/blog/causal-inference-at-uber/</a></p><p><em>Originally published at </em><a href="https://themathcompany.com/blog/causal-inference-and-statistical-tests-for-business-analytics"><em>https://themathcompany.com</em></a><em>.</em></p><p><em>Authors:</em></p><ol><li><a href="https://www.linkedin.com/in/sourav-banerjee-0735aa4/"><em>Sourav Banerjee</em></a><em> (Partner — Innovation), TheMathCompany</em></li><li><a href="https://www.linkedin.com/in/malayajachutani/"><em>Malayaja Chutani</em></a><em> (Manager — Innovation), TheMathCompany</em></li></ol>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Explaining Chart Summarization Using a Custom Model]]></title>
            <link>https://medium.com/@MathCo/explaining-chart-summarization-using-a-custom-model-d12e1a8fddc2?source=rss-60fc730c3d13------2</link>
            <guid isPermaLink="false">https://medium.com/p/d12e1a8fddc2</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[nlg]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[MathCo]]></dc:creator>
            <pubDate>Mon, 15 May 2023 10:17:45 GMT</pubDate>
            <atom:updated>2023-05-22T04:02:28.446Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YuWlTX4Hpk1MihS7gyeHYg.jpeg" /></figure><p>Visualizing information in the form of charts, like bar charts, line charts, pie charts, or histograms, in order to obtain meaningful insights from data is what we data scientists do on the daily. But what if we attempt the reverse, namely extracting information in the form of summaries from available charts? In this article, we, data scientists from TheMathCompany’s Innovation team, discuss techniques that can be used to extract information from charts, and the challenges therein.</p><h3>What is Chart Summarization?</h3><p>Inferring insights from scientific charts can be challenging for clients from non-data science backgrounds. This is where chart summarization comes into play: the technique helps keep track of relevant information that might otherwise be overlooked. It can also help in preparing better reports with strong references and illustrative descriptions. Often, captions do not contain complete information about the features of a chart, which is where a brief description proves useful. Automatic chart summarization also helps visually impaired people, who can gain insights using screen readers.</p><p>Technically, this task of information extraction from charts can be defined as producing a descriptive summary from non-linguistic structured data (in the form of tables), which falls under the umbrella of natural language generation (NLG). Natural language processing (NLP) and NLG are crucial aspects of the analytics industry, with a gamut of applications. The field of NLP has seen great progress since the development of transformers, a neural network architecture often used in translation.</p><p>One such transformer-based model is GPT-3 by OpenAI. The Guardian [1] got it to generate an entire article by asking it to write “a short op-ed in clear and concise language in about 500 words, with focus on why humans have nothing to fear from AI.” The model can even compose poetry. The attention mechanism it uses is what makes all of these wonders possible.</p><h3>The Attention Mechanism Explained</h3><p>Consider these sentences:</p><p>The animal did not cross the street because <strong>it</strong> was too tired.</p><p>The animal did not cross the street because <strong>it</strong> was too wide.</p><p>In the first sentence, <strong>it</strong> refers to the animal, but in the second, <strong>it</strong> refers to the street. Natural language is full of these complexities, which we, as humans, handle surprisingly well; our brains calculate and distribute appropriate attention to context. The same is not so simple for a machine. This is why transformers are useful: they understand what to focus on in a sentence and how much attention to pay to each part.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oeFxjtA1UuQWf0w1RWg13A.jpeg" /></figure><p>This attention mechanism is used for various tasks like writing or summarizing articles, machine translation, document retrieval, and more.</p><p>For our purpose, we used the same translation-based approach, with the T5 model, to translate a chart into comprehensible text, where one language consists of numbers (i.e., chart information) and the other is simple English. The Text-to-Text Transfer Transformer, known as the T5 model, is well suited to such text-to-text tasks: for any given text input, it produces a text-based output.</p>
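<p>To give a feel for the setup, here is a minimal inference sketch using the T5 implementation from the Hugging Face transformers library [4]. The checkpoint name and the flattened chart string below are placeholders, not our production configuration.</p><pre>from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder checkpoint; fine-tuned weights would be loaded the same way.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# A chart linearized into plain text (invented example).
chart_record = ("summarize: 'title': 'Monthly pizza sales' "
                "'data': 'Month Jan Sales 120 Month Feb Sales 140 Month Mar Sales 160'")

inputs = tokenizer(chart_record, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))</pre>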
<p>Here’s how we trained one such model to read numbers (in charts) and generate text summaries.</p><p>We adopted the approach of converting this image-to-sequence problem into a sequence-to-sequence problem. An image-to-sequence task involves interpreting information from an image and giving text as an output, e.g., generating a caption for an image. A sequence-to-sequence task implies a text-to-text conversion, e.g., translating French to English, as below. Here, an encoder–decoder-based transformer was used for translation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ncSKdLp2uS3NPmXpiUjEjQ.jpeg" /><figcaption>Fig. 2: A sequence-to-sequence task using an encoder–decoder-based transformer</figcaption></figure><p>Our approach for this task was to work with the chart data and not with the chart images. So, we converted the chart data into a sequence of data points.</p><p>We scraped data from an online statistical data portal, Statista [2], for different categories like aviation, pharmaceuticals, manufacturing, banking, and finance. The scraped data included the table, chart images, chart title, and the available summary or description. An enhanced transformer approach was then followed, in which a data variable substitution method was applied to minimize hallucination, an issue where NLG models predict incorrect words (or “tokens”), degrading the factual accuracy of the generated summary. To combat this problem, the approach [3] we followed considered seven categories of data variables, viz., subjects, dates, axis labels, titles, table cells, trend, and scale. If a generated token matched a predefined data variable, a look-up operation was performed to convert the generated variable into the corresponding chart data.</p><p>In another attempt, we used the T5 model from the Hugging Face library [4] (an open-source library built for researchers, machine learning (ML) enthusiasts, and data scientists to build, train, and deploy various ML models) to predict chart summaries.</p><p>Before performing further analysis, we divided the entire data set into the following:</p><ul><li>A train set, i.e., the sample of data used to fit the model;</li><li>A validation set, i.e., the sample of data used to get an unbiased evaluation of a model fit on the training data set;</li><li>A test set, i.e., the sample of data used for the unbiased evaluation of the final model.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uQpuv_z0j5YOxMT6CdLdJw.jpeg" /></figure><p>The numbers of data samples in the train, test, and validation sets were 5703, 1222, and 1222, respectively. The T5 model was trained by crafting the data in a particular manner: the data, title, and summary for each chart were concatenated into a single record, as sketched below.</p>
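<p>A rough sketch of that concatenation step, with a hypothetical helper whose output format mirrors the sample train record that follows (illustrative, not our exact pipeline code):</p><pre>def linearize_chart(title, rows):
    """Flatten a chart's table into one training string.

    `rows` is a list of dicts, one per table row; the output format
    mirrors the sample train record shown below.
    """
    cells = " ".join(f"{col} {val}" for row in rows for col, val in row.items())
    return f"'title': '{title}' 'data': '{cells}'"

record = linearize_chart(
    "Global spending on motorsports sponsorships 2011 to 2017",
    [{"Year": 2017, "Spending_in_billion_U.S._dollars": 5.75},
     {"Year": 2016, "Spending_in_billion_U.S._dollars": 5.58}],
)
print(record)</pre>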
<p>Below is a sample piece of data from the train set.</p><p><strong>Sample train data: {‘data’: “‘title’: ‘Global spending on motorsports sponsorships 2011 to 2017’ ‘data’: ‘Year 2017 Spending_in_billion_U.S._dollars 5.75 Year 2016 Spending_in_billion_U.S._dollars 5.58 Year 2015 Spending_in_billion_U.S._dollars 5.43 Year 2014 Spending_in_billion_U.S._dollars 5.26 Year 2013 Spending_in_billion_U.S._dollars 5.12 Year 2012 Spending_in_billion_U.S._dollars 4.97 Year 2011 Spending_in_billion_U.S._dollars 4.83 ‘ “}</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UDT9cI3ojX1fxvDsBSoRnw.jpeg" /></figure><p><strong>Original summary:</strong></p><p>The amount spent globally on motorsports sponsorship from 2011 to 2017 is displayed in the above table. The total amount spent on racing sponsorships worldwide in 2013 was USD 5.12 billion.</p><p><strong>Model output:</strong></p><p>“This statistic shows the global spending on motorsports sponsorships from 2011 to 2017. In 2017, 5.75 billion U.S.D. were spent on sponsorships.”</p><p>We can see from the above train example that our model predicted the summary precisely, interpreting the chart data with good numerical accuracy.</p><p><strong>Evaluation Metrics:</strong></p><p>To evaluate the results, we used ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics, which are specifically designed for summarization tasks. ROUGE counts the number of overlapping n-grams between the model-generated text and the reference text, i.e., the original summary.</p><p>The average ROUGE scores for the train, validation, and test sets are as follows:</p><p>Train set = 0.42</p><p>Validation set = 0.43</p><p>Test set = 0.44</p><p>One of the results from the sample test data can be seen below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Yx3TbYyBsNa3YRK0unTlaA.jpeg" /></figure><p><strong>Original summary:</strong></p><p>The nations and territories with the lowest anticipated fertility rates in the period between 2050 and 2055 are included in this data. Singapore is predicted to have the lowest fertility rate in this period, with an average of 1.38 children born per woman.</p><p><strong>Model output:</strong></p><p>“This statistic shows the number of children, with the lowest global fertility rates from 2050 to 2055 being in Singapore. The average fertility rate was 1.38 children per woman.”</p><p>Our model captured the context and numbers fairly well, but there is scope for further improvement when it comes to generating factually accurate summaries. We are already experimenting with fine-tuning our model on a larger and more diversified dataset for better and more stable results. A mature model can then be deployed on our projects with clients to assist executives with our data findings.</p><h3>References</h3><p>1. <a href="https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3">https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3</a></p><p>2. <a href="https://www.statista.com/">https://www.statista.com/</a></p><p>3. <a href="https://doi.org/10.48550/arXiv.2010.09142">https://doi.org/10.48550/arXiv.2010.09142</a></p><p>4. <a href="https://huggingface.co/docs/transformers/model_doc/t5">https://huggingface.co/docs/transformers/model_doc/t5</a></p><h3>Glossary</h3><p>1. NLG (Natural Language Generation): NLG is a branch of artificial intelligence that converts structured input data into human-readable text.</p>
<p>2. NLP (Natural Language Processing): NLP is a branch of artificial intelligence that enables computers to understand text and speech much as humans do.</p><p>3. Transformer: A transformer is a neural network architecture built around a novel encoder–decoder framework; it emerged to solve sequence-to-sequence tasks, such as translating from one language to another, efficiently.</p><p>4. Scrape: Data scraping is the process of extracting data from a website or database into a local file saved on a user’s PC.</p><h3>Further Reading on Transformers</h3><p>1. <a href="https://jalammar.github.io/illustrated-transformer/">https://jalammar.github.io/illustrated-transformer/</a></p><p>2. <a href="https://www.eidosmedia.com/blog/technology/machine-learning-size-isn-t-everything">https://www.eidosmedia.com/blog/technology/machine-learning-size-isn-t-everything</a></p><p>3. <a href="https://doi.org/10.48550/arXiv.1706.03762">https://doi.org/10.48550/arXiv.1706.03762</a></p><p><em>Originally published at </em><a href="https://themathcompany.com/blog/chart-summarization-explanation-and-demonstration"><em>https://themathcompany.com</em></a><em>.</em></p><p><em>Authors:</em></p><ol><li><a href="https://www.linkedin.com/in/shrishml/"><em>Shrish Mishra</em></a><em> (Senior Associate — Innovation), TheMathCompany</em></li><li><a href="https://www.linkedin.com/in/srishti-nagu-569b5210b/"><em>Srishti Nagu</em></a><em> (Associate — Innovation), TheMathCompany</em></li></ol>]]></content:encoded>
        </item>
    </channel>
</rss>