Over the past few years, I've led the architecture and implementation of a contact centre modernization initiative for a major Canadian airline. The results—30% cost reduction and 25% improvement in first-call resolution—came from a series of deliberate technical and organizational choices.
This is the full story: the decisions we made, the trade-offs we accepted, and the lessons that apply far beyond contact centres.
The Starting Point
Like many large enterprises, we were dealing with a legacy contact centre platform that had accumulated technical debt over decades:
- Multiple disconnected systems requiring agents to toggle between 7+ applications to handle a single customer interaction. Average handle time was inflated by the time agents spent searching for information, not solving problems.
- Limited intelligent routing: Calls were distributed based on simple queue-based logic. A customer calling about a complex rebooking during a weather disruption would hit the same queue as someone asking about baggage allowance.
- High infrastructure costs with unpredictable scaling: On-premises PBX systems required capacity planning months in advance. During peak events—holiday travel, weather disruptions, system outages—the platform couldn't scale, resulting in long wait times and abandoned calls.
- Poor visibility into customer interactions: No unified view of a customer's journey across channels (voice, chat, email). An agent had no way of knowing that the customer on the line had already tried to resolve their issue via chat twice.
- Vendor lock-in: The legacy platform required expensive professional services for even minor changes to call routing or IVR flows.
The mandate was clear: modernize the platform while maintaining 24/7 operations for millions of customer interactions annually. No downtime. No degradation. No "please bear with us during our transition" messages.
Architecture: The Big Decisions
Why Amazon Connect
We evaluated several CCaaS (Contact Centre as a Service) platforms. Amazon Connect won for specific architectural reasons:
- True pay-per-use pricing: No per-seat licenses. You pay for minutes of usage, which aligns costs directly with volume. During low-traffic periods, costs drop proportionally. This alone transformed the economics.
- Native AWS integration: Since our broader infrastructure was already on AWS, Connect provided seamless integration with Lambda, Lex, DynamoDB, S3, Kinesis, and EventBridge without building custom middleware.
- Programmable contact flows: Contact flows are defined programmatically, not through vendor-managed configuration. This meant our engineering team could version-control, test, and deploy changes with the same rigor as application code.
- Extensibility: Connect's architecture is designed to be extended. Every interaction generates events that can trigger custom logic, feed analytics pipelines, or invoke external systems.
The trade-off: Connect's out-of-the-box agent desktop is functional but basic. We invested in building a custom agent experience that pulled data from multiple systems into a unified view. This was the right call—the agent experience is where productivity gains compound.
Serverless-First Architecture
We made a deliberate decision: no EC2 instances in the contact centre platform. Everything runs on managed or serverless services:
- AWS Lambda for all business logic: routing decisions, data lookups, integration with backend systems, post-call processing. Cold start latency was a concern early on, but provisioned concurrency and architecture choices (keeping functions focused and dependencies minimal) kept it well within acceptable bounds.
- Amazon DynamoDB for session state and real-time data: agent status, active interaction context, customer preferences. The single-digit millisecond latency is critical when you're making routing decisions in real time.
- Amazon EventBridge for decoupling: every significant event (call started, call transferred, interaction completed, customer identified) publishes to EventBridge. Downstream consumers—analytics, quality monitoring, workforce management—subscribe independently.
- Amazon Kinesis for streaming analytics: real-time dashboards showing queue depths, wait times, agent utilization, and sentiment trends.
- Amazon S3 for long-term storage: call recordings, chat transcripts, and interaction logs stored with appropriate lifecycle policies.
Why serverless? Three reasons: operational simplicity (no servers to patch, scale, or monitor), cost alignment (pay only for what you use), and forced modularity (Lambda's constraints push you toward small, focused functions with clear interfaces).
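To make the routing piece concrete, here is a minimal sketch of the kind of decision logic our Lambda functions implemented. In production this runs inside a Lambda invoked from a Connect contact flow; the queue names, loyalty tiers, and thresholds below are illustrative, not the actual production rules.

```typescript
// Hypothetical routing decision of the kind a Connect contact flow
// delegates to a Lambda function. All names and tiers are illustrative.

interface RoutingContext {
  intent: string;              // e.g. "Rebooking", "BaggageInquiry"
  loyaltyTier: "none" | "silver" | "gold";
  activeDisruption: boolean;   // weather event affecting the itinerary
  priorContactsToday: number;  // repeat contacts escalate priority
}

interface RoutingDecision {
  queue: string;
  priority: number; // lower number = served sooner
}

export function routeContact(ctx: RoutingContext): RoutingDecision {
  // Disrupted rebookings go to a specialist queue, not the general pool
  if (ctx.intent === "Rebooking" && ctx.activeDisruption) {
    return { queue: "disruption-rebooking", priority: 1 };
  }
  // Repeat contacts jump the line regardless of topic
  if (ctx.priorContactsToday >= 2) {
    return { queue: ctx.intent.toLowerCase(), priority: 1 };
  }
  // Loyalty tier nudges priority within the topic queue
  const priority = ctx.loyaltyTier === "gold" ? 2 : 3;
  return { queue: ctx.intent.toLowerCase(), priority };
}
```

Keeping this logic in a small, pure function is what made it testable in CI without a live Connect instance.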
AI as a First-Class Citizen
Rather than bolting on AI capabilities later, we designed the platform with natural language understanding at its core:
Amazon Lex for conversational AI: Lex handles the initial customer interaction across both voice and chat channels. We built intent models for the most common contact types:
- Booking modifications (seat changes, date changes, name corrections)
- Flight status inquiries
- Baggage tracking
- Loyalty program questions
- Rebooking during disruptions
Each intent model was trained on thousands of real customer utterances extracted from historical call transcripts. The key was not just understanding what the customer said, but understanding what they meant. "I need to change my flight" could mean a date change, a route change, or a cancellation—and the follow-up questions needed to disambiguate efficiently.
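The disambiguation step can be sketched as a mapping from vague phrasings to candidate intents plus the follow-up question that separates them. This is a simplified illustration, not the Lex dialog code hook we actually shipped; the intent names and prompt text are hypothetical.

```typescript
// Illustrative disambiguation: a vague "change my flight" maps to several
// candidate intents, and the bot's next question is chosen to separate them.

type Intent = "DateChange" | "RouteChange" | "Cancellation";

interface Disambiguation {
  candidates: Intent[];
  followUp: string;
}

const VAGUE_PHRASES: Record<string, Disambiguation> = {
  "change my flight": {
    candidates: ["DateChange", "RouteChange", "Cancellation"],
    followUp:
      "Would you like to travel on a different date, fly a different route, or cancel the booking?",
  },
};

export function disambiguate(utterance: string): Disambiguation | null {
  const key = Object.keys(VAGUE_PHRASES).find((p) =>
    utterance.toLowerCase().includes(p)
  );
  return key ? VAGUE_PHRASES[key] : null;
}
```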
Custom models for domain-specific tasks: Standard NLP models don't understand airline-specific entities well. A booking reference like "ABC123" looks like gibberish to a general model. We built custom entity extraction for booking references, flight numbers, airport codes, and fare class terminology.
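The production extractors were trained models, but the entity shapes themselves are easy to illustrate with rules, assuming common airline conventions: booking references are 6-character alphanumerics, and flight numbers are a two-character airline code plus one to four digits. The patterns below are a sketch of those shapes, not the deployed models.

```typescript
// Sketch of airline entity shapes. Real extraction used trained models;
// these rules only illustrate what the models learn to recognize.

export function extractBookingReference(text: string): string | null {
  // Require at least one digit so ordinary 6-letter words like "CHANGE"
  // don't match; all-letter PNRs need the trained model to resolve.
  const m = text.toUpperCase().match(/\b(?=[A-Z]*\d)[A-Z0-9]{6}\b/);
  return m ? m[0] : null;
}

export function extractFlightNumber(text: string): string | null {
  // Two-letter airline code, optional space, 1-4 digit flight number
  const m = text.toUpperCase().match(/\b[A-Z]{2}\s?\d{1,4}\b/);
  return m ? m[0].replace(/\s/, "") : null;
}
```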
Graceful fallback paths: This is where many AI implementations fail. When the AI can't resolve an issue, the handoff to a human agent must be seamless. We designed the system so that:
- All context gathered by the AI transfers to the agent's desktop
- The agent sees exactly what the AI attempted and where it got stuck
- The customer never has to repeat information
- Every fallback is logged and analyzed to improve the AI over time
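The handoff above amounts to packaging everything the bot learned into contact attributes the agent desktop can render. A minimal sketch, with hypothetical field names (the actual attribute schema was richer):

```typescript
// Hypothetical handoff payload: bot session state flattened into the
// string-valued contact attribute map the agent desktop reads.

interface BotSession {
  identifiedCustomerId?: string;
  resolvedIntent?: string;
  collectedSlots: Record<string, string>;
  failurePoint?: string; // where the bot got stuck, e.g. "fare-rules-lookup"
  transcriptSummary: string;
}

export function buildHandoffAttributes(s: BotSession): Record<string, string> {
  return {
    customerId: s.identifiedCustomerId ?? "unknown",
    aiIntent: s.resolvedIntent ?? "unresolved",
    aiFailurePoint: s.failurePoint ?? "none",
    aiSummary: s.transcriptSummary,
    // Prefix slots so they can't collide with the fixed attribute names
    ...Object.fromEntries(
      Object.entries(s.collectedSlots).map(([k, v]) => [`slot_${k}`, v])
    ),
  };
}
```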
The containment rate—interactions fully resolved by AI without human intervention—started at around 20% and improved steadily as we refined intent models and expanded coverage.
Infrastructure as Code
Every component of the platform is defined in AWS CDK (TypeScript). This wasn't optional—it was a non-negotiable architectural principle:
- Reproducible environments: Spin up a complete copy of the platform for testing in under 30 minutes
- Version-controlled infrastructure: Every change goes through the same PR review process as application code
- Automated deployment pipelines: Push to main triggers deployment through staging to production with automated testing at each stage
- Drift detection: Any manual change to production infrastructure is detected and flagged
We chose CDK over Terraform for this project because the tight integration with AWS services and the ability to use TypeScript (matching our Lambda runtime) reduced context switching for the team.
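To give a flavour of what this looked like, here is a sketch of one slice of the platform in CDK (assuming aws-cdk-lib v2). Construct names and asset paths are illustrative, not the project's actual stacks.

```typescript
// Illustrative CDK stack: session table plus the routing Lambda a Connect
// contact flow invokes. Names and paths are hypothetical.
import { Stack, StackProps, Duration, RemovalPolicy } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as dynamodb from "aws-cdk-lib/aws-dynamodb";

export class ContactRoutingStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Session state table backing real-time routing decisions
    const sessions = new dynamodb.Table(this, "SessionTable", {
      partitionKey: { name: "contactId", type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: RemovalPolicy.RETAIN,
    });

    // Routing Lambda invoked from the Connect contact flow
    const router = new lambda.Function(this, "RoutingFn", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("dist/routing"),
      timeout: Duration.seconds(8), // Connect caps flow-invoked Lambdas at 8s
      environment: { SESSION_TABLE: sessions.tableName },
    });
    sessions.grantReadWriteData(router);
  }
}
```

Because the whole platform was expressed this way, "spin up a test environment" meant `cdk deploy` against a fresh account, not a weeks-long provisioning exercise.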
Data Architecture: The Force Multiplier
The data architecture turned out to be the most impactful decision of all, even though it wasn't the most exciting.
Every Interaction Becomes Data
We built the platform so that every customer interaction generates a rich data trail:
- Contact trace records: Complete interaction history including routing decisions, queue times, handle times, and outcomes
- Conversation transcripts: Real-time transcription of voice calls and complete chat logs
- Sentiment analysis: Amazon Comprehend processes transcripts to detect customer sentiment throughout the interaction
- Agent actions: Every system action an agent takes during an interaction is logged
- AI decisions: Every routing decision, intent classification, and entity extraction is recorded with confidence scores
This data flows through Kinesis into a data lake on S3, with Athena for ad-hoc querying and QuickSight for dashboards.
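The layout of that S3 data lake matters for Athena: Hive-style partition keys let queries prune to the days and channels they need instead of scanning everything. A sketch of the kind of key scheme involved (the actual prefix layout is illustrative):

```typescript
// Hypothetical Hive-style partition layout for interaction records so
// Athena can prune partitions by date and channel when querying.

export function interactionS3Key(
  occurredAt: Date,
  channel: "voice" | "chat",
  contactId: string
): string {
  const y = occurredAt.getUTCFullYear();
  const m = String(occurredAt.getUTCMonth() + 1).padStart(2, "0");
  const d = String(occurredAt.getUTCDate()).padStart(2, "0");
  return `interactions/year=${y}/month=${m}/day=${d}/channel=${channel}/${contactId}.json`;
}
```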
Real-Time Agent Assist
Using the streaming data, we built an agent assist capability that surfaces relevant information during live interactions:
- Customer context: Previous interactions, open cases, loyalty status, recent bookings, known disruptions affecting their itinerary
- Suggested responses: Based on the current conversation topic and successful resolution patterns from similar interactions
- Knowledge base articles: Automatically surfaced based on the detected intent
- Alerts: Real-time notifications about system issues, weather events, or policy changes relevant to the interaction
This reduced average handle time by enabling agents to spend less time searching and more time solving.
What Actually Mattered
Looking back, the decisions that most impacted success weren't the flashy technology choices. They were:
Cross-Functional Alignment
Getting operations, IT, and customer service to share a vision was harder—and more important—than any technical challenge.
The operations team cared about service levels and agent utilization. The customer service team cared about CSAT and resolution rates. The IT team cared about reliability and maintainability. The finance team cared about cost per interaction.
We needed a shared metric framework that showed how the platform investment served all of these goals simultaneously. Building that alignment required months of conversations, workshops, and incremental trust-building. It was the most important "architecture" work on the project.
Incremental Delivery
We launched with a narrow scope: one contact type (booking modifications) through one channel (voice) for one market segment. This approach:
- Reduced risk: If something went wrong, it affected a contained subset of interactions
- Built confidence: Showing measurable results early generated organizational support for expanding scope
- Enabled learning: Each increment taught us something that improved the next one
- Created advocates: Agents who used the new system became internal champions
Over subsequent releases, we expanded to additional contact types, added chat as a channel, integrated with the loyalty program, and built the AI capabilities layer by layer.
Metrics-Driven Decisions
We focused relentlessly on metrics that directly reflected customer and business outcomes:
| Metric | Before | After | Impact |
|--------|--------|-------|--------|
| First-call resolution | 68% | 85% | +25% |
| Average handle time | 8.2 min | 5.8 min | -29% |
| Customer satisfaction (CSAT) | 3.6/5 | 4.2/5 | +17% |
| Cost per interaction | Baseline | -30% | Significant |
| Agent satisfaction | 3.2/5 | 4.0/5 | +25% |
Every feature decision was evaluated against these metrics. If a proposed capability couldn't demonstrably move one of these numbers, it went to the backlog.
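The impact column reports relative change, which is why first-call resolution moving from 68% to 85% reads as +25% rather than +17 points. The arithmetic is a one-liner:

```typescript
// Relative change between a before and after measurement, rounded to
// whole percent: e.g. 68% -> 85% first-call resolution is a +25% gain.
export function relativeChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}
```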
Agent Enablement
The best technology means nothing if agents can't use it effectively. We invested substantially in:
- Training programs that went beyond "click here, then here" to help agents understand the platform's capabilities and how to leverage them
- A feedback channel where agents could report issues, suggest improvements, and flag AI failures directly from the agent desktop
- Gradual rollout where experienced agents piloted new features before broader deployment
- Recognition programs that celebrated agents who achieved high resolution rates using the new tools
The Technical Debt We Accepted
No project ships without trade-offs. Being honest about technical debt is more useful than pretending it doesn't exist:
- Some legacy integrations are adapter-wrapped rather than properly modernized. We built adapter layers around three legacy systems that should eventually be replaced. The adapters work but add latency and complexity.
- The monitoring stack could be more comprehensive. We have good coverage for the happy path but gaps in edge case monitoring. We're addressing this incrementally.
- Testing coverage is uneven. Core routing logic has excellent coverage. Some of the less-critical integration functions are under-tested.
- Documentation didn't always keep pace with development. Architecture decision records were maintained, but operational runbooks lagged behind feature development.
Acknowledging these honestly enabled us to plan for addressing them rather than discovering them in a crisis.
Lessons for Enterprise Architects
If I were starting this project again, here's what I'd do differently:
- Spend more time on the data model upfront. Getting entity relationships right early saves enormous pain later. We refactored our interaction data model twice, which was costly.
- Invest in observability from the start. Don't add monitoring after the fact. Build it in from day one: structured logging, distributed tracing, custom metrics, and alerting that catches issues before customers do.
- Build for extensibility, not just current requirements. The platform needs to evolve. Every architectural decision should be evaluated against the question: "How hard will this be to change in 18 months?"
- Document decisions, not just implementations. Future teams will understand what the code does. What they won't understand is why it was built that way. Architecture Decision Records (ADRs) are invaluable.
- Invest in the developer experience early. Local development setup, CI/CD pipelines, automated testing, and deployment tooling. Every hour spent on developer experience pays back tenfold in velocity.
Where It Goes From Here
The contact centre platform continues to evolve. Current work includes expanding AI capabilities with large language models for more natural conversations, deeper integration with operational systems for proactive customer communication, and workforce management optimization using historical interaction patterns.
The platform we built enables that evolution rather than constraining it—which might be the most important success metric of all.