How I Built an AI System That Automatically Analyzes 10,000+ Customer Support Calls

Siddu — Thu, 21 May 2026 14:21:17 GMT

A deep dive into building a production-grade, AWS-native call analytics platform using NLP, sentiment analysis, and serverless architecture

Published by Godena Siddartha | AI/ML Engineer | LinkedIn | GitHub

The Problem: Manual QA Is Broken at Scale

Picture this: a mid-sized company handles hundreds of customer support calls every day. Their QA team manually listens to a random 5% sample, fills out spreadsheets, and tries to infer whether agents are performing well.

The flaws are obvious:

95% of calls go unreviewed — meaning quality issues stay hidden
Human reviewers are inconsistent — scoring varies person to person
Feedback is always delayed — problems get caught weeks after they happen
It doesn’t scale — doubling call volume means doubling QA headcount

I built AI Call Sentry to solve this. It’s a fully automated, AWS-native pipeline that processes every single call, scores it using NLP and sentiment analysis, and delivers structured insights — without a human ever pressing play.

The Architecture: End-to-End on AWS

Here’s the high-level system design:

Customer Call (Audio) → S3 Upload
         ↓
   S3 Event Trigger
         ↓
   AWS Lambda (Orchestrator)
         ↓
   Amazon Transcribe (Speaker-Aware Transcription)
         ↓
   Custom NLP Pipeline (FastAPI Service)
         ├── Sentiment Classification
         ├── Intent Detection
         ├── Tone Analysis
         └── Resolution Quality Scoring
         ↓
   Call Quality Score (0–100)
         ↓
   Structured Output (JSON) → Storage / Dashboard

The key design principle: zero manual intervention. Once a call recording lands in S3, the entire pipeline runs automatically. No one needs to press a button.

Step 1: Automated Ingestion with S3 + Lambda

The entry point is dead simple — an S3 bucket configured with an event notification that fires a Lambda function on every .mp3 or .wav upload.

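# Lambda handler, triggered on S3 upload
import os
import re
from urllib.parse import unquote_plus

import boto3

# Create the client once so warm invocations can reuse it
transcribe_client = boto3.client('transcribe')

def lambda_handler(event, context):
    # Extract file info from the S3 event (object keys arrive URL-encoded)
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = unquote_plus(record['s3']['object']['key'])

    # Derive the media format from the extension so .wav uploads work too
    stem, ext = os.path.splitext(key)
    media_format = ext.lstrip('.').lower()  # 'mp3' or 'wav'

    # Transcribe job names may only contain letters, digits, '.', '_' and '-'
    job_name = re.sub(r'[^0-9a-zA-Z._-]', '_', stem)

    # Start speaker-aware transcription job
    transcribe_client.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': f's3://{bucket}/{key}'},
        MediaFormat=media_format,
        LanguageCode='en-US',
        Settings={
            'ShowSpeakerLabels': True,
            'MaxSpeakerLabels': 2  # Agent + Customer
        }
    )

    return {'status': 'transcription_started', 'job': job_name}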

This pattern (S3 trigger → Lambda → Transcribe) is the backbone of the entire system. It’s serverless, scales with upload volume up to the account’s Lambda and Transcribe service quotas, and costs essentially nothing when idle.
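For readers wiring up the trigger themselves, here is a minimal sketch of the S3 event notification, assuming one suffix filter per audio extension. The function names, the `Id` values, and the example ARN are hypothetical and not taken from the project repo; `put_bucket_notification_configuration` is the standard boto3 call for this.

```python
def build_notification_config(lambda_arn, suffixes=(".mp3", ".wav")):
    """Build an S3 NotificationConfiguration with one Lambda trigger per suffix."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "Id": f"transcribe-on-upload-{suffix.lstrip('.')}",  # hypothetical Id
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
            for suffix in suffixes
        ]
    }


def apply_notification_config(bucket, lambda_arn):
    """Attach the configuration to a bucket (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without boto3

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn),
    )
```

Two things to keep in mind: this call replaces any notification configuration already on the bucket, and the Lambda function needs a resource-based permission that allows S3 to invoke it.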

Step 2: Speaker-Aware Transcription

Standard transcription gives you a wall of text. That’s not useful for QA.

Amazon Transcribe’s ShowSpeakerLabels setting produces diarized output: the transcript is segmented by speaker, so you can tell who said what. This is critical because:

Agent tone and customer tone need to be analyzed separately
Resolution detection requires knowing what the agent said last
Compliance checks target agent language specifically

A simplified version of the diarized output looks like this:

{
  "speaker_labels": {
    "segments": [
      {"speaker_label": "spk_0", "start_time": "0.0", "end_time": "5.3"},
      {"speaker_label": "spk_1", "start_time": "5.4", "end_time": "12.1"}
    ]
  },
  "transcript": "Hello, thank you for calling... [full text]"
}

We map spk_0 → Agent, spk_1 → Customer based on who speaks first, then separate their dialogue for downstream NLP.
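A minimal sketch of that mapping step, assuming the transcript has already been flattened into a time-ordered list of word-level items carrying `speaker_label`, `start_time`, and `content`. That input shape and the `build_turns` name are assumptions for illustration, not the exact Transcribe schema or the project's code.

```python
def build_turns(words):
    """Group word-level items into speaker turns and assign agent/customer roles.

    `words` is an assumed, simplified input: a time-ordered list of dicts with
    'speaker_label', 'start_time' (seconds, as a string) and 'content'.
    """
    if not words:
        return []

    # Heuristic from the text: whoever speaks first is treated as the agent.
    first_speaker = words[0]["speaker_label"]

    turns = []
    for word in words:
        role = "agent" if word["speaker_label"] == first_speaker else "customer"
        if turns and turns[-1]["speaker"] == role:
            # Same speaker is still talking: extend the current turn.
            turns[-1]["text"] += " " + word["content"]
        else:
            # Speaker changed: start a new turn at this word's timestamp.
            turns.append({
                "speaker": role,
                "start_time": float(word["start_time"]),
                "text": word["content"],
            })
    return turns
```

Each turn comes out with `speaker`, `start_time`, and `text` keys, which is the shape the sentiment step in the next section reads. The "first speaker is the agent" rule is only a heuristic, so calls where the customer speaks first would need a different signal.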

Step 3: The NLP Pipeline (FastAPI Service)

The transcribed text goes into a custom FastAPI service that runs three NLP analyses in parallel: sentiment classification, tone scoring, and resolution detection.
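One way the parallel fan-out could look, sketched with a standard-library thread pool. The `run_analyses` helper and the analyzer names are hypothetical; the actual service may use async endpoints or separate workers instead.

```python
from concurrent.futures import ThreadPoolExecutor


def run_analyses(turns, analyzers):
    """Run every analyzer on the same turns concurrently.

    `analyzers` maps a result name to a callable that takes the list of turns.
    Returns a dict of results keyed by the same names.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(analyzers))) as pool:
        futures = {name: pool.submit(fn, turns) for name, fn in analyzers.items()}
        # .result() blocks until that analyzer finishes and re-raises its errors.
        return {name: future.result() for name, future in futures.items()}
```

A call would look like `run_analyses(turns, {"sentiment": analyze_sentiment, "tone": analyze_tone, "resolution": detect_resolution})`, where the last two callables are placeholders. Threads help most when the analyzers spend their time in I/O or in native code that releases the GIL; for pure-Python CPU-bound work a process pool may be the better fit.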

3a. Sentiment Classification

We run sentiment analysis separately on agent turns and customer turns using a fine-tuned transformer model:

from transformers import pipeline

sentiment_model = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

def analyze_sentiment(text_segments):
    results = []
    for segment in text_segments:
        # Truncate at the model's token limit instead of slicing to 512 characters
        result = sentiment_model(segment['text'], truncation=True)
        results.append({
            'speaker': segment['speaker'],
            'sentiment': result[0]['label'],
            'confidence': result[0]['score'],
            'timestamp': segment['start_time']
        })
    return results

This gives us a sentiment trajectory across the call — did the customer start frustrated and leave satisfied? Did the agent’s tone stay professional throughout?

Achieved accuracy: 88–92% on our test set of labeled call recordings.
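To make the trajectory idea concrete, here is a rough sketch that turns the per-segment results into a single 0–100 number by comparing the first half of a speaker's turns with the second half. The `trajectory_score` name, the half-and-half split, and the linear mapping are all assumptions for illustration, not the scoring used in the project. It expects the `POSITIVE`/`NEGATIVE` labels the SST-2 model emits and a `speaker` value of `"customer"`.

```python
def trajectory_score(results, speaker="customer"):
    """Score how a speaker's sentiment shifted across the call (0-100).

    50 means no change, above 50 means sentiment improved, below 50 means
    it got worse. `results` is the list returned by analyze_sentiment.
    """
    def signed(r):
        # Map label + confidence onto a single value in [-1, 1].
        sign = 1.0 if r["sentiment"] == "POSITIVE" else -1.0
        return sign * r["confidence"]

    ordered = sorted(
        (r for r in results if r["speaker"] == speaker),
        key=lambda r: float(r["timestamp"]),
    )
    values = [signed(r) for r in ordered]
    if not values:
        return 50.0  # no turns for this speaker: treat as neutral

    half = max(1, len(values) // 2)
    start = sum(values[:half]) / half
    end = sum(values[-half:]) / half

    # (end - start) lies in [-2, 2]; map it linearly onto 0..100.
    return round(50 + 25 * (end - start), 1)
```

A customer who starts at NEGATIVE 0.9 and ends at POSITIVE 0.9 scores 95.0 under this mapping, while a customer whose sentiment never changes scores 50.0.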

3b. Tone & Professionalism Scoring

Beyond positive/negative sentiment, we classify agent tone across dimensions like:

Empathy signals (“I understand”, “I apologize”)
Urgency language (positive: “I’ll resolve this right now”)
Negative signals (interruptions, dismissive phrasing)
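A minimal phrase-counting sketch of these tone dimensions. The empathy and urgency phrases come from the examples above; the dismissive phrases and the `count_tone_signals` name are hypothetical placeholders. Interruptions are not covered here, since detecting them would need overlapping timestamps rather than text.

```python
TONE_PHRASES = {
    "empathy": ("i understand", "i apologize"),
    "urgency": ("right now",),
    # Hypothetical examples of dismissive phrasing, for illustration only.
    "dismissive": ("calm down", "as i already said"),
}


def count_tone_signals(agent_texts):
    """Count how often each tone dimension's phrases appear in the agent's turns."""
    text = " ".join(agent_texts).lower()
    return {
        dimension: sum(text.count(phrase) for phrase in phrases)
        for dimension, phrases in TONE_PHRASES.items()
    }
```

Plain substring counts are brittle (they miss paraphrases and ignore negation), so a production version would more plausibly pair a list like this with a classifier.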

3c. Resolution Detection

We use keyword pattern matching + contextual NLP to determine whether the call ended in resolution. Key signals: confirmation language from the agent, absence of escalation keywords, positive customer response in the final 20% of the transcript.
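The three signals can be sketched as a small rule-based check. Everything named here is an assumption for illustration: the phrase lists, the `detect_resolution` name, and the input shape (a list of turns with `speaker`, `text`, and a precomputed `sentiment` label). The final 20% is approximated by turn count rather than by transcript length.

```python
import math

# Hypothetical keyword lists, for illustration only.
CONFIRMATION_PHRASES = ("has been resolved", "is now fixed", "anything else i can help")
ESCALATION_KEYWORDS = ("supervisor", "escalate", "manager")


def detect_resolution(turns, tail_fraction=0.2):
    """Rule-based resolution check over a list of speaker turns."""
    if not turns:
        return {"resolved": False, "confirmed": False,
                "escalated": False, "customer_positive": False}

    # Final 20% of the call, measured in turns (at least one turn).
    tail_len = max(1, math.ceil(len(turns) * tail_fraction))
    tail = turns[-tail_len:]

    agent_tail = " ".join(t["text"].lower() for t in tail if t["speaker"] == "agent")
    full_text = " ".join(t["text"].lower() for t in turns)

    confirmed = any(p in agent_tail for p in CONFIRMATION_PHRASES)
    escalated = any(k in full_text for k in ESCALATION_KEYWORDS)
    customer_positive = any(
        t["speaker"] == "customer" and t.get("sentiment") == "POSITIVE" for t in tail
    )

    return {
        "resolved": confirmed and not escalated and customer_positive,
        "confirmed": confirmed,
        "escalated": escalated,
        "customer_positive": customer_positive,
    }
```

Returning the individual signals alongside `resolved` keeps the decision explainable when a reviewer wants to know why a call was marked unresolved.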

Step 4: The 0–100 Quality Score

All three analysis outputs feed into a weighted scoring formula:

def compute_call_score(sentiment_results, tone_results, resolution_result):
    weights = {
        'agent_sentiment':    0.30,  # Was the agent positive and professional?
        'customer_sentiment': 0.25,  # Did customer sentiment improve across the call?
        'tone_quality':       0.25,  # Empathy, clarity, professionalism
        'resolution':         0.20   # Was the issue resolved?
    }
    
    scores = {
        'agent_sentiment':    score_agent_sentiment(sentiment_results),
        'customer_sentiment': score_sentiment_trajectory(sentiment_results),
        'tone_quality':       score_tone(tone_results),
        'resolution':         100 if resolution_result['resolved'] else 30
    }
    
    final_score = sum(weights[k] * scores[k] for k in weights)
    return round(final_score, 1)

The output: a single, interpretable 0–100 score per call, plus a breakdown of what drove the score up or down. This is what QA managers actually see.
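To show how the breakdown can fall out of the same arithmetic, here is a small sketch that reuses the weights above and returns each component's contribution next to the final score. The `score_breakdown` name and the component scores in the example are made up for illustration.

```python
WEIGHTS = {
    "agent_sentiment": 0.30,
    "customer_sentiment": 0.25,
    "tone_quality": 0.25,
    "resolution": 0.20,
}


def score_breakdown(scores):
    """Return the weighted final score plus each component's contribution.

    `scores` maps every key in WEIGHTS to a 0-100 component score.
    """
    contributions = {name: WEIGHTS[name] * scores[name] for name in WEIGHTS}
    return {
        "final_score": round(sum(contributions.values()), 1),
        "contributions": contributions,
    }


example = score_breakdown({
    "agent_sentiment": 80,
    "customer_sentiment": 60,
    "tone_quality": 70,
    "resolution": 100,  # resolved call
})
# 0.30*80 + 0.25*60 + 0.25*70 + 0.20*100 = 24 + 15 + 17.5 + 20 = 76.5
```

Because the weights sum to 1.0, the contributions always add up to the final score, so a manager can see at a glance which component pulled a call down.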

The Results

After deploying this system to production:

85% reduction in manual auditing effort — QA reviewers now only investigate flagged calls (score < 60)
10x faster turnaround: scores are available shortly after each call ends vs. 2-week manual review cycles
100% call coverage — up from 5% random sampling
88–92% accuracy on sentiment classification validated against human-labeled samples

The most powerful outcome: pattern detection became possible. When you analyze every call, you can find systemic issues — specific product complaints spiking on certain dates, agents consistently struggling with certain query types, regional tone differences. None of that was visible with 5% sampling.
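As a rough illustration of that kind of aggregation, the sketch below groups scored calls by agent and counts how many fall under the flag threshold of 60 mentioned above. The `agent_id` and `score` field names and the `summarize_by_agent` helper are assumptions, not the project's output schema.

```python
from collections import defaultdict
from statistics import mean

FLAG_THRESHOLD = 60  # calls scoring below this go to a human reviewer


def summarize_by_agent(calls):
    """Aggregate call scores per agent: mean score, flagged count, call count."""
    scores_by_agent = defaultdict(list)
    for call in calls:
        scores_by_agent[call["agent_id"]].append(call["score"])

    return {
        agent: {
            "mean_score": round(mean(scores), 1),
            "flagged": sum(1 for s in scores if s < FLAG_THRESHOLD),
            "calls": len(scores),
        }
        for agent, scores in scores_by_agent.items()
    }
```

The same grouping works for any other key stored with the call, such as date, product, or region.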

What I’d Do Differently

1. Replace rule-based resolution detection with a fine-tuned classifier. The keyword approach works but misses nuanced cases. A model fine-tuned on labeled resolved/unresolved transcripts would likely handle those cases better.

2. Add real-time streaming analysis. The current pipeline works on completed call recordings. Processing calls as they happen (for example, with Amazon Kinesis and Amazon Transcribe streaming) would enable live coaching for agents.

3. Build a dashboard. Right now the output is structured JSON. A Grafana or Streamlit dashboard showing score trends, sentiment heatmaps, and flagged call counts would make this immediately useful to non-technical QA managers.

Try It Yourself

The full project is on GitHub: AI-Call-Sentry

The repo includes the Lambda handler, FastAPI NLP service, scoring logic, and setup instructions for the AWS infrastructure.

If you found this useful, follow me on LinkedIn — I write about building production AI systems, GenAI pipelines, and lessons from deploying ML at work.

Next post: How I built a voice-driven multi-agent troubleshooting system using LangChain, Gemini, and RAG — and why multi-agent routing changed everything about response quality.

Tags: #MachineLearning #AWS #NLP #Python #ArtificialIntelligence #SentimentAnalysis #ProductionML #LLM #FastAPI #MLEngineer
