Are You Still Ignoring Text Data?

Satish Varikuti
Walmart Global Tech Blog
5 min read · Nov 3, 2020
Photo credit: Pixabay

Text analytics is the practice of deriving value from unstructured text, such as the messages customers share about an organization. It applies statistical and machine learning techniques to turn product reviews, feedback, and support interactions into meaningful data for fact-based decision making. These messages are posted by customers on social media platforms like Twitter, Reddit, Facebook, and others. All of these channels carry plenty of text data that can be engineered to surface the major issues and challenges Walmart customers face, either in our stores or on our dot com.

Overview

Walmart, being the largest retail company, has petabytes of data and one of the largest data lakes in terms of volume, variety, velocity, and complexity. Data lakes are essential for an organization because they consolidate all of its data. Lakes are hosted either on-premises or in the cloud, further increasing the organization's ability to derive the insights it needs, and having one is now common practice. However, the text data lake is generally ignored because of the complexity of deriving analytics from the data: there is no single unique key, which is essential for joining it with traditional organizational data. Ignoring text data blinds organizations to the challenges their customers face on a day-to-day basis. Hence it is important to start thinking about how to effectively stream in text data from social media platforms, call centers, product reviews, store and dotcom reviews, and blogs.

Every organization has different methods of classifying its data. Some common stages for classifying data in a lake are gold data (just raw), cold data (historical, collected over time), hot data (live, streaming data), and clean data (cataloged).
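
As a rough illustration, these stages can be treated as zones in the lake with a consistent storage layout. The sketch below is a hypothetical Python convention whose zone names simply mirror the stages above; it is not an actual Walmart layout.

```python
from enum import Enum

# Illustrative zone names taken from the stages above; the path layout
# is an assumption, not an actual lake convention.
class LakeZone(Enum):
    GOLD = "gold"    # raw, as-received text
    COLD = "cold"    # historical data collected over time
    HOT = "hot"      # live / streaming data
    CLEAN = "clean"  # cataloged, quality-checked data

def zone_path(zone: LakeZone, source: str, dt: str) -> str:
    """Build a partitioned storage path for a feed, e.g. on object storage."""
    return f"/text-lake/{zone.value}/{source}/dt={dt}"

print(zone_path(LakeZone.HOT, "twitter", "2020-11-03"))
# /text-lake/hot/twitter/dt=2020-11-03
```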

This activity helps consolidate structured, unstructured, and semi-structured data. The key goal every company wants to achieve with these data sets is to derive analytics and enhance the experience of its customers, users, employees, and so on. The text data lake plays the role of identifying key events and capturing information from activity on social platforms, mobile logs, IoT devices, and system logs. All of the events captured across these locations help refine the data, which is then used to generate value. These events also empower businesses to increase the probability of success for the decisions they make.

Key design considerations

The following are the key design considerations:

  • Ensure a fast response with a near real-time data stream
  • Ensure the security of the data; never store personal information
  • Ensure data quality, as any discrepancy can produce improper signals and misread the customer
  • Make the feeds multi-tenant
  • Optimize data pipelines to ensure they run without issues
  • Make the feeds easy to query by category and source
  • Capture customer alerts and their category levels
  • Make sure country codes are normalized based on your master data management (MDM)
  • Make sure language codes are normalized (for example, "en" vs. the full name "English") based on your MDM
  • Normalize timestamps to mm/dd/yyyy HH:MI:SS (24-hour) and convert them back to a standard time zone (see the sketch after this list)
  • Standardize the location identification techniques
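
As a minimal sketch of the last few items, the snippet below normalizes country codes, language codes, and timestamps. The lookup tables are hypothetical placeholders for whatever your MDM defines, and UTC is assumed as the standard time zone.

```python
from datetime import datetime, timezone

# Hypothetical lookup tables; in practice these come from your MDM.
COUNTRY_CODES = {"usa": "US", "united states": "US", "us": "US"}
LANGUAGE_CODES = {"en": "English", "eng": "English", "english": "English"}

def normalize_country(raw: str) -> str:
    return COUNTRY_CODES.get(raw.strip().lower(), "UNKNOWN")

def normalize_language(raw: str) -> str:
    return LANGUAGE_CODES.get(raw.strip().lower(), "UNKNOWN")

def normalize_timestamp(raw_iso: str) -> str:
    """Convert an ISO-8601 timestamp to UTC in mm/dd/yyyy HH:MI:SS (24-hour)."""
    dt = datetime.fromisoformat(raw_iso).astimezone(timezone.utc)
    return dt.strftime("%m/%d/%Y %H:%M:%S")

print(normalize_country("USA"), normalize_language("eng"),
      normalize_timestamp("2020-11-03T10:15:30-05:00"))
# US English 11/03/2020 15:15:30
```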

Text Data Lake Architecture

The best cloud data engineering architectures scale from zero to infinity to process batch or real-time data, and can be enabled using Docker containers or any of several cloud offerings. They also deliver the data on time, or in real time, at minimal cost. The architecture below is an example text data lake architecture that enables event collection, data ingestion, data processing, labelling, and reporting.
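
To make the stages concrete, here is a small Python sketch of what the ingestion, processing, and labelling steps might look like over a common event schema. The TextEvent fields and stage functions are illustrative assumptions, not the actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical event record; the field names mirror the attributes
# discussed later in the data model section, not an actual schema.
@dataclass
class TextEvent:
    event_id: str
    source: str        # e.g. "twitter", "product_reviews", "call_center"
    language: str
    country: str
    text: str
    event_time_utc: str

def ingest(raw_events: Iterable[dict]) -> Iterable[TextEvent]:
    """Ingestion stage: map raw payloads into the common event schema."""
    for raw in raw_events:
        yield TextEvent(**raw)

def process(events: Iterable[TextEvent],
            label_fn: Callable[[str], str]) -> Iterable[dict]:
    """Processing + labelling stage: attach a model-assigned label per event."""
    for e in events:
        yield {"event_id": e.event_id, "source": e.source,
               "label": label_fn(e.text)}
```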

Data Feedback Loop

Feedback is the key to refining the machine learning models and getting more accurate results. Sometimes the feedback is manually reviewed, and adjustments are made to the models.
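
A minimal sketch of such a loop, assuming a scikit-learn text classifier: manually reviewed corrections are folded back into the training set and the model is refit. The toy texts and labels below are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data; in practice these would be events from the lake plus
# corrected labels coming back from manual review.
texts  = ["late delivery again", "great product", "app keeps crashing"]
labels = ["delivery", "praise", "app_issue"]

reviewed_texts  = ["package never arrived"]   # flagged during manual review
reviewed_labels = ["delivery"]                # human-corrected label

# Retrain on the original data plus the reviewed corrections.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts + reviewed_texts, labels + reviewed_labels)

print(model.predict(["my order is late"]))
```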

Data Quality Checks

Never compromise on data quality checks and data governance for the data in the lake. Data quality rules enhance data quality and allow organizations to make sound decisions based on the data points. A minimal code sketch of these rules follows the list below.

1. Remove duplicates

2. Exclude records containing unwanted keywords

3. Consider removing content shorter than 10 characters

4. Convert all emoji to text
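
Here is a small Python sketch of these four rules, assuming the third-party emoji package for the emoji-to-text step and a hypothetical keyword exclusion list.

```python
import emoji  # third-party package; `pip install emoji` (assumed dependency)

EXCLUDED_KEYWORDS = {"giveaway", "sponsored"}  # hypothetical exclusion list

def clean_records(records):
    """Apply the four quality rules: dedupe, keyword exclusion,
    minimum length, and emoji-to-text conversion."""
    seen = set()
    for rec in records:
        text = rec["text"].strip()
        if text in seen:                       # 1. remove duplicates
            continue
        seen.add(text)
        if any(kw in text.lower() for kw in EXCLUDED_KEYWORDS):
            continue                           # 2. exclude unwanted keywords
        if len(text) < 10:                     # 3. drop very short content
            continue
        rec["text"] = emoji.demojize(text)     # 4. emoji -> ":thumbs_up:" style text
        yield rec
```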

Data Model Design

Data models are used to capture all of the data attributes, and it is important to identify the key for each table in your data model. A unique identifier, the text, language, source, location, URL, event time, and time of collection are key attributes for any text data. Defining a clear filter mechanism before pulling the data from external sources makes the analysis easy.
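
As an illustration, a record key could be derived from those attributes and a simple filter applied before pulling from external sources. The field names and the SHA-256 choice below are assumptions for the sketch, not a prescribed design.

```python
import hashlib

# Hypothetical composite key built from the attributes called out above.
KEY_FIELDS = ("source", "url", "language", "event_time_utc")

def record_key(rec: dict) -> str:
    """Derive a stable unique identifier from the key attributes."""
    raw = "|".join(str(rec.get(f, "")) for f in KEY_FIELDS)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def source_filter(rec: dict, allowed_sources: set, allowed_languages: set) -> bool:
    """Filter applied before pulling data from external sources, so only
    in-scope channels and languages enter the model."""
    return (rec.get("source") in allowed_sources
            and rec.get("language") in allowed_languages)
```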

Optimize the model after analysis; it can be oriented around suppliers, security, products, stores, or managers. Avoid combining two different channels into one model. Always have a defense mechanism for your model by identifying records such as fake ratings and reviews, competitor reviews, and so on.

Reporting and Alerts Analytics

Reporting is essential to determine the story your data is trying to tell. If your organization is not reporting on top of the data it collects, it certainly needs to pay attention to the power of reporting. You can categorize text analytics data into levels such as Level 1 (Immediate), Level 2 (Elevated), Level 3 (Moderate), and Level 4 (Referred), with Level 1 being data points that need immediate attention and must be resolved within 4 hours.
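
A small sketch of how these levels might be encoded with resolution windows. Only the 4-hour window for Level 1 comes from the text; the other SLA windows are illustrative assumptions.

```python
from datetime import timedelta

# Severity levels from the categorization above; SLAs beyond Level 1's
# 4 hours are placeholders, not actual policy.
ALERT_LEVELS = {
    1: {"name": "Immediate", "resolve_within": timedelta(hours=4)},
    2: {"name": "Elevated",  "resolve_within": timedelta(hours=24)},
    3: {"name": "Moderate",  "resolve_within": timedelta(days=3)},
    4: {"name": "Referred",  "resolve_within": timedelta(days=7)},
}

def sla_for(level: int) -> timedelta:
    """Look up the resolution window for an alert level."""
    return ALERT_LEVELS[level]["resolve_within"]

print(sla_for(1))  # 4:00:00
```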

Conclusion

  • A very important part of the process is the data pipelines, which make sure the data provided is as accurate and scalable as possible. This was achieved by putting a proper data model design in place after reviewing the various source systems
  • Monitoring and alerting frameworks for the pipelines helped ensure that the pipelines run without any issues
  • Data quality checks at each layer helped us identify discrepancies in the data we generate as part of the feeds and take the necessary action
  • Leveraging the cloud for running our data pipelines and optimizing job workflows sped up our final snapshot generation, with the whole process completing within 5 minutes
  • Defining the appropriate models and data structures enabled better use of the data and helped address some of the bandwidth concerns
  • Data security and governance of the data lake are key to the success of such solutions
  • Onboarding new partners was made easier by going with a Docker-based approach for providing feeds
