Part 2 - Building Your AI-Ready Data Stack: Designing Your Data Architecture

A 10-part series about building your first data stack from 0 to 1 and getting it ready for AI implementation.

Gunjan Titiya
8 min read · Jun 14, 2024

Hello Readers,

Welcome to Part 2 of building your AI-ready data stack. Last time we talked about aligning business objectives, setting SMART goals, and planning an iterative roadmap. Now we will start on the technical design of our data architecture.

Choosing the right data tool for the job depends on factors like budget, existing stack, and team skills. Since we are focusing on a 0-to-1 data stack, we will use the AWS stack for this series. If you are just starting out with building your data stack, it is usually best to stay within your current cloud provider and billing setup. So we will assume AWS for our infrastructure and build our data stack for a predictive analytics use case.

Designing a data architecture for predictive analytics is equal parts art and science. It requires a deep understanding of your business goals, the data needed to support them, and the technical components to ingest, store, process, and serve that data at scale.

Today, we’ll walk through the key considerations and best practices for designing a scalable, performant, and cost-effective data architecture on AWS, with a specific focus on enabling predictive campaign analytics. We’ll cover:

  1. Defining your data sources and requirements
  2. Choosing the right storage and processing components
  3. Designing for data security and governance
  4. Planning for real-time and batch processing
  5. Enabling self-service analytics and data democratization

By the end, you’ll have a clear blueprint for building a modern data architecture on AWS that can power data-driven decision making across your marketing organization. Let’s dive in!

Know Your Data — Defining Sources, Requirements, and SLAs

The foundation of any data architecture is a deep understanding of the data itself. What data do you need to achieve your campaign analytics goals? Where does that data come from? How frequently does it update? What are the requirements for freshness, accuracy, and completeness?

Let’s use the example of a predictive analytics platform for optimizing digital ad campaigns. Some key data sources and requirements might include:

Ad platform data (Google Ads, Facebook Ads, etc.)

  • Impressions, clicks, cost, conversions by campaign/ad group/ad
  • Updated hourly or daily
  • Required for core campaign optimization models

Web and app clickstream data

  • Page views, events, user journeys
  • Real-time or near-real-time (< 5 min latency)
  • Required for attribution models and personalization

CRM data (Salesforce, Hubspot, etc.)

  • Leads, opportunities, customer attributes
  • Updated daily
  • Required for ROI calculations and customer lifecycle models

Product catalog data (PIM, ERP, etc.)

  • SKUs, prices, categories, metadata
  • Updated daily or weekly
  • Required for revenue/margin calculations

External market data (weather, events, trends, etc.)

  • Varies by source
  • Frequency depends on data type
  • Enriches campaign planning models

With your data sources identified, dig deeper into the specific requirements for each:

  • What fields and granularity are needed?
  • What are the latency and freshness needs?
  • What are the accuracy and completeness requirements?
  • How much historical data is needed?
  • What are the expected growth rates in volume and velocity?

Document these in a data dictionary, along with the expected consumption patterns (batch vs. real-time, analytics vs. product integration) and SLAs for each dataset. This will serve as a crucial input to your architecture design.

For example:

Data Source: Google Ads

  • Fields: date, campaign, ad group, ad, impressions, clicks, cost, conversions
  • Granularity: hourly
  • Latency: < 1 hour
  • History: 2 years
  • Accuracy: 99%
  • Completeness: 99.9%
  • Volume: 10 GB/day, 5% monthly growth
  • Consumption: batch (daily) for BI, real-time for campaign optimization API
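
If you want to keep this data dictionary versioned alongside your code, one lightweight option is to express each entry as a small data structure. Here is a minimal sketch in Python; the class and field names are illustrative, not a standard:

    from dataclasses import dataclass

    @dataclass
    class DatasetSpec:
        """One data-dictionary entry: schema, SLAs, and consumption patterns."""
        source: str
        fields: list
        granularity: str
        max_latency_minutes: int
        history_years: int
        accuracy_pct: float
        completeness_pct: float
        daily_volume_gb: float
        monthly_growth_pct: float
        consumers: list

    google_ads = DatasetSpec(
        source="google_ads",
        fields=["date", "campaign", "ad_group", "ad", "impressions",
                "clicks", "cost", "conversions"],
        granularity="hourly",
        max_latency_minutes=60,
        history_years=2,
        accuracy_pct=99.0,
        completeness_pct=99.9,
        daily_volume_gb=10.0,
        monthly_growth_pct=5.0,
        consumers=["daily_bi_batch", "campaign_optimization_api"],
    )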

With a clear view of your data landscape, you can start to map those requirements to the optimal storage and processing components on AWS.

Building the Foundation — Choosing Storage & Processing Components

The heart of your data architecture is the set of storage and processing engines that will house and refine your data at scale. AWS offers a rich set of managed services to handle diverse data workloads, from highly structured transactional data to unstructured streaming data and everything in between.

The key is to choose components that:

  1. Meet your performance (latency, throughput) requirements
  2. Handle your scalability needs (volume, growth)
  3. Support your consumption patterns (batch, real-time, SQL, NoSQL)
  4. Optimize for cost based on access frequency and retention needs

Here’s a quick guide to some of the core AWS data services and typical use cases, with a short code sketch after each group:

Storage:

S3 (object storage):

  • Staging raw data
  • Storing processed data for infrequent access
  • Handling unstructured data (images, video, logs)

DynamoDB (NoSQL database):

  • Storing real-time customer & session data
  • Powering high-throughput/low-latency applications

RDS (relational database):

  • Storing structured, transactional data
  • Running analytic queries that require joins/aggregations

Redshift (data warehouse):

  • Storing and querying large volumes of structured data
  • Powering BI & reporting workloads
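
To make the storage layer concrete, here is a minimal boto3 sketch of the most common first step: staging a raw ad-platform extract in S3. The bucket name and key layout are made up for illustration.

    import boto3

    s3 = boto3.client("s3")

    # Stage a raw hourly Google Ads extract in the landing zone.
    # Partitioning the key by date/hour keeps later batch jobs cheap to scan.
    s3.upload_file(
        Filename="google_ads_2024-06-14_10.csv",
        Bucket="my-company-data-lake-raw",  # hypothetical bucket name
        Key="google_ads/date=2024-06-14/hour=10/extract.csv",
    )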

Processing:

Kinesis (streaming):

  • Ingesting and processing real-time streaming data
  • Enabling real-time analytics & ML

EMR (Hadoop/Spark):

  • Processing and transforming large batches of data
  • Running ML workloads using Spark

Glue (ETL-as-a-service):

  • Extracting, transforming, and loading batch data
  • Building data pipelines visually or with code

Lambda (serverless functions):

  • Processing real-time events & triggers
  • Enabling high-volume parallel workloads
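
And to give a flavor of the processing side, here is a hedged sketch of a Lambda function that consumes clickstream events from a Kinesis stream and writes them to DynamoDB. The table and field names are illustrative.

    import base64
    import json

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("clickstream_events")  # hypothetical table name


    def handler(event, context):
        """Triggered by a Kinesis stream; persists each clickstream event."""
        for record in event["Records"]:
            # Kinesis delivers the payload base64-encoded.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            table.put_item(
                Item={
                    "user_id": payload["user_id"],
                    "event_time": payload["timestamp"],
                    "event_type": payload["event_type"],
                    "page": payload.get("page", "unknown"),
                }
            )
        return {"records_processed": len(event["Records"])}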

For our predictive campaign analytics use case, a simplified architecture might combine these services as follows. In this design:

  • Ad platform and clickstream data is ingested in real-time via Kinesis
  • Clickstream events are processed by Lambda and stored in DynamoDB for real-time personalization
  • Ad data is stored in S3, processed in batch by EMR/Spark, with outputs loaded to Redshift for BI and DynamoDB for real-time serving
  • CRM and product catalog data are stored in RDS and S3 respectively, ETL’d via Glue, and loaded to Redshift
  • External data is staged in S3 and loaded directly to Redshift
  • Redshift serves as the central warehouse powering BI, reporting, and batch ML featurization
  • DynamoDB powers real-time applications like the campaign optimization API

This architecture balances the need for real-time processing and serving with cost-optimized batch processing and storage, while enabling multiple consumption patterns.
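
To illustrate the batch path of this design, here is a minimal sketch of loading a processed ad-performance table from S3 into Redshift with the Redshift Data API, once the EMR/Spark job has written its output. The cluster, database, table, bucket, and IAM role names are placeholders.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # COPY the Spark job's Parquet output from S3 into the warehouse table.
    copy_sql = """
        COPY analytics.ad_performance_daily
        FROM 's3://my-company-data-lake-processed/ad_performance/date=2024-06-14/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS PARQUET;
    """

    response = redshift_data.execute_statement(
        ClusterIdentifier="campaign-analytics",  # hypothetical cluster
        Database="analytics",
        DbUser="etl_user",
        Sql=copy_sql,
    )
    print("Statement submitted:", response["Id"])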

Of course, the specific components and design will vary based on your unique requirements — that’s where working with an experienced data architect comes in. They can help you weigh the tradeoffs and optimize for your goals.

With your core storage and processing in place, let’s turn to the critical matters of data security and governance.

Locking the Vault — Designing for Data Security & Governance

In the excitement of building AI and analytics capabilities, it can be tempting to overlook data security and governance. But in today’s environment of heightened privacy awareness and increasing regulation, it’s more critical than ever to bake security and governance into your data architecture from the ground up.

The key is to design for:

  1. Authentication and access control
  2. Data encryption at-rest and in-transit
  3. Network isolation and segmentation
  4. Auditing and logging of all data access
  5. Compliance with relevant regulations (GDPR, HIPAA, etc.)

AWS provides a number of services and features to help you secure your data architecture:

IAM (Identity and Access Management):

  • Fine-grained access control for AWS services and resources
  • Integrate with corporate directories for SSO
  • Set password policies and enable MFA

KMS (Key Management Service):

  • Create and control keys used to encrypt data
  • Integrate with EBS, S3, Redshift, RDS for easy encryption
  • Audit and log all key usage

VPC (Virtual Private Cloud):

  • Create isolated networks for different applications/environments
  • Control inbound and outbound network access with security groups
  • Use VPC endpoints to keep traffic within AWS network

CloudTrail (logging and auditing):

  • Log and monitor all API calls and user activity
  • Detect unusual activity and set alarms
  • Facilitate compliance reporting

In our campaign analytics architecture, key security measures would include the following (a short code sketch follows the list):

  • All data encrypted at rest (S3, DynamoDB, RDS, Redshift) and in transit (SSL/TLS)
  • IAM roles to control access to each service and resource
  • VPC to isolate the analytics environment, with strict security group rules
  • No direct public access to any data stores; access only through APIs or jump hosts
  • All access and activity logged via CloudTrail, with anomaly detection enabled
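
As a taste of what baking security in looks like in practice, here is a small boto3 sketch that enforces two of the bullets above on the raw data bucket: default encryption at rest and no public access. The bucket name is a placeholder.

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-company-data-lake-raw"  # hypothetical bucket name

    # Encrypt every new object at rest by default (SSE with an AWS-managed KMS key).
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
            ]
        },
    )

    # Block every form of public access at the bucket level.
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )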

On the governance side, you’ll want to establish:

  • Clear data classification policies based on sensitivity/criticality
  • Data retention policies for each class of data
  • Processes for secure data sharing with external parties
  • Rulebooks for handling PII and other regulated data types
  • Data catalog to document all datasets, owners, sensitivity levels, and lineage
  • Automated sensitive data discovery and classification
  • Regular security and compliance audits

Implementing strong security and governance does add complexity to your data architecture. But it’s an essential investment to protect your customers’ data and your company’s reputation. Work closely with your CISO and legal teams to ensure your design meets all requirements.

We already touched on real-time and batch processing patterns while designing the architecture above. With our data fortress secured, let’s look at the final piece: putting this data in the hands of the people who need it.

Power to the People — Enabling Self-Service Analytics & Democratization

What good is a shiny new data architecture if nobody can use it? The final key to driving value from your AI and analytics investments is to enable broad, self-service access to data. That means empowering marketers, analysts, and data scientists to find, understand, and use data in their tools of choice — without always needing to go through IT or engineering.

The keys to enabling self-service are:

  1. Creating a unified data catalog and dictionary
  2. Implementing intuitive data discovery and exploration tools
  3. Providing BI and analytics in the tools users know and love
  4. Enabling ‘what-if’ scenario modeling and experimentation
  5. Delivering data science and ML as products and services

On the catalog and discovery side, tools like the AWS Glue Data Catalog or active-metadata platforms like Atlan help data users find and understand the datasets available. They provide a unified view across your S3, Redshift, and RDS stores, with rich metadata, lineage, and sample queries. Invest the time upfront to document your key datasets thoroughly; it will pay dividends in adoption and proper usage.
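
For teams starting with the Glue Data Catalog, registering and browsing datasets can be done with a Glue crawler or a few boto3 calls. A minimal sketch, with illustrative names:

    import boto3

    glue = boto3.client("glue")

    # Create a database (a logical grouping of tables) in the Glue Data Catalog.
    glue.create_database(
        DatabaseInput={
            "Name": "marketing_analytics",
            "Description": "Curated campaign analytics datasets",
        }
    )

    # List what is already registered so analysts can discover it.
    tables = glue.get_tables(DatabaseName="marketing_analytics")
    for table in tables["TableList"]:
        print(table["Name"], table.get("Description", ""))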

For BI and analytics, you’ll likely need to support a mix of:

  • Visual drag and drop reports and dashboards (e.g. Looker, Tableau)
  • Ad-hoc SQL analysis (e.g. Redash, Superset)
  • Cloud notebooks (e.g. Hex)
  • Automated insights and anomaly detection

For data science and ML, aim to deliver capabilities as products and services rather than one-off projects. That means:

  • Providing easy access to clean, feature-ready datasets in S3 or Redshift
  • Hosting Jupyter/Hex notebooks with common libraries and example code
  • Productizing key models via APIs and self-service UIs (e.g. propensity scores, LTV predictions)
  • Enabling AutoML for common use cases (e.g. churn prediction, lead scoring)

Tools like SageMaker, Databricks, and Dataiku can help productize the data science workflow. Here we are focusing only on the data stack, so I will leave the MLOps pipeline out of scope for this series.
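
To show what productizing a key model via an API can look like, here is a hedged sketch of calling a propensity model hosted on a SageMaker endpoint. The endpoint name and feature payload are purely illustrative, and training the model itself is out of scope for this series.

    import json

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    # Features for one lead; the schema depends entirely on how the model was trained.
    features = {"recency_days": 3, "sessions_30d": 12, "ad_clicks_30d": 4}

    response = runtime.invoke_endpoint(
        EndpointName="propensity-to-convert",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps(features),
    )

    score = json.loads(response["Body"].read())
    print("Propensity score:", score)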

Hit reply and let me know if you would like to see MLOps architecture and steps to implement series as well!

Training and Enablement

Finally, don’t neglect the importance of training and enablement. Work with your analytics leaders to identify key personas and their data needs. Develop targeted trainings and ‘user manuals’ for each persona. Host regular office hours and demos to onboard new users and share best practices. Celebrate successes and evangelize wins broadly.

Democratizing data is as much a cultural challenge as a technical one. It requires a mindset shift from data-as-a-service to data-as-a-product. But when done right, it can unleash a step-change in the value you drive from data.

Conclusion

Designing a data architecture for AI and analytics requires deeply understanding the business goals, user needs, and analytics use cases first. Only then can you design a technical architecture that meets those needs in a scalable, performant, and cost-effective way.

If you feel intimidated, worry not! Next time, we will actually implement this architecture in AWS using Terraform. Get ready to get your hands dirty: the next part is a hands-on lab, and you will walk away with your own data stack in AWS, ready to go!

Onwards!

Questions? Feedback? Connect with me on LinkedIn or contact me directly at gunjan@bytesandbayes.com!

This article is proudly brought to you by Bytes & Bayes, the consulting firm dedicated to guiding you towards data excellence. We also offer an AI Literacy for Business Leaders workshop that provides a deeper understanding of how to make your organization ready for AI.

