F
Faisal
ProjectsSkillsContact
عربي
📁Projects⚡Skills✉️Contact🌐عربي
Back to Projects

Voiceflow: Real-Time AI Conversational Agent

An ultra-low latency, voice-to-voice AI system capable of complex reasoning, state management, and real-time UI synchronization.

2026-01-03
AI / Machine LearningFull-Stack DevelopmentReal-Time SystemsReactNode.jsSocket.IOMongoDBGrok-120BGemini 2.5 FlashElevenLabsCustom TTS Fine-Tuning

Overview

Voiceflow is a cutting-edge voice agent engineered to replace traditional, static IVR systems with an intelligent, conversational AI that listens, thinks, and responds in real time. The system achieves sub-2-second end-to-end latency from the moment a user finishes speaking to the moment they hear the AI's reply — a pipeline that spans speech recognition, LLM reasoning, state mutation, and audio synthesis.

Unlike conventional RAG-based architectures, Voiceflow employs a Context Injection strategy: the entire structured schema (restaurant menus, bank policies, product catalogs) is fed directly into the LLM's context window at inference time. This guarantees 100% factual accuracy and zero hallucination, while delivering lightning-fast response times that outperform retrieval-based approaches.

The voice itself is a custom fine-tuned ElevenLabs model trained to produce a hyper-realistic Saudi Arabic accent, paired with robust noise cancellation that allows the system to operate reliably in high-noise environments like drive-thru lanes.

Problem vs. Solution

The Problem: Legacy IVR Systems

Traditional interactive voice response systems force users through rigid decision trees — "Press 1 for sales, Press 2 for support." They cannot handle natural language, fail in noisy environments, and offer zero personalization. Any change to the menu or policy requires expensive re-recording and re-programming.

The Solution: Voiceflow AI Agent

Voiceflow replaces the entire IVR paradigm with a single, intelligent conversational agent. Users speak naturally in their own dialect. The AI understands intent, performs calculations, updates backend state, and responds with a hyper-realistic human voice — all in under two seconds. Schema changes are instant: update a JSON file, and the agent adapts immediately.

Your browser does not support the video tag.
Demo

Use Case 1: AI Financial Consultant (Banking)

A dynamic, voice-driven banking assistant that guides customers through complex financial products using natural conversation rather than forms or menus.

Interaction Flow

The user asks for financing options. The AI intelligently narrows the conversation to a personal loan, then asks for the user's monthly salary and bank. The user replies "12,000 Riyals." The AI instantly calculates the maximum eligible loan amount (90,000 Riyals based on the bank's 7.5x salary multiplier) and asks for confirmation to proceed with the application.

Technical Highlight

Showcases the LLM's ability to perform mathematical logic in real time, extract specific entities (salary amount, bank name), apply business rules (eligibility multipliers), and trigger backend state changes — pushing a loan approval form directly to the user's screen — entirely through voice interaction.

Your browser does not support the video tag.
Demo

Use Case 2: AI Drive-Thru Cashier (F&B)

A highly resilient ordering system built for noisy environments and complex, multi-intent human speech patterns typical of real-world drive-thru interactions.

Interaction Flow

The user orders a beef burger and a Pepsi. Then, they add two fries. Finally, the user issues a complex, multi-intent command in a single breath: "Actually, remove the beef burger, and give me a chicken burger with extra ketchup." The system processes all three intents simultaneously — remove, add, and modify — without hesitation.

Technical Highlight

The system parses simultaneous "remove", "add", and "modify" intents in a single utterance. Using WebSockets (Socket.IO), the visual cart on the screen updates in real time, line-by-line, perfectly synchronized with the AI's audio response. Each cart mutation is emitted as a discrete WebSocket event, enabling frame-accurate UI synchronization.

System Architecture

End-to-end voice pipeline — STT, LLM, and TTS run as independent microservices, orchestrated dynamically by the server

Frontend / Local Device
Microphone Input
1. User Audio
React AppEmployee & Customer View
10. Stream Audio
Speaker Output
2. Forward Audio/Text ↓
11. Real-time UI Updates ↑
Local Backend
Node.js ServerExpress + Socket.IO
7. Update Order Data
MongoDB AtlasDatabase
External Microservices
PARALLEL
3. Send Audio →← 4. Return Transcript
STT Service
5. Transcript + Order Context →← 6. AI Response + Order Update
API LLM / Logic
8. Text to Synthesize →← 9. Return Audio Stream
API TTS / Voice

Performance Metrics

Benchmarked under real-world conditions

End-to-End Latency

<2s

From user speech end to AI audio playback

96%

Intent Accuracy

100%

Zero hallucination via Context Injection

100%

Multi-Intent Parsing

3+

Simultaneous add/remove/modify in one utterance

94%

UI Sync Latency

<50ms

WebSocket event to DOM update

99%

Key Capabilities

Voice-to-Voice Pipeline97%
Real-Time State Management95%
Noise Resilience90%
Multi-Dialect Arabic Support92%

All metrics measured with Grok-120B as the primary LLM and ElevenLabs Scribe for STT. Gemini 2.5 Flash is used as a fallback for cost-sensitive deployments with comparable accuracy.

Objectives

  • Replace legacy IVR systems with a natural-language voice AI agent.
  • Achieve sub-2-second end-to-end latency across the full voice pipeline.
  • Implement Context Injection for 100% accurate, hallucination-free responses.
  • Build real-time UI synchronization via WebSocket events.
  • Fine-tune a custom TTS voice model with a hyper-realistic Saudi Arabic accent.
  • Design for high-noise environments with robust noise cancellation.

Key Features

Context Injection Architecture

Feeds the entire structured schema (menus, policies, catalogs) directly into the LLM's context window at inference time, eliminating retrieval latency and ensuring 100% factual accuracy with zero hallucination.

Real-Time Cart / State Synchronization

Every state mutation (add item, remove item, update quantity) is emitted as a discrete Socket.IO event, triggering frame-accurate DOM updates on the client. The visual cart mirrors the AI's spoken response in real time.

Multi-Intent Parsing

The LLM processes compound, multi-intent commands in a single utterance — handling simultaneous add, remove, and modify operations without requiring the user to pause or repeat themselves.

Custom Voice Fine-Tuning

A custom ElevenLabs voice model fine-tuned on Saudi Arabic speech data, producing a hyper-realistic accent that builds trust and familiarity with local users.

Noise-Resilient Processing

Advanced noise cancellation and audio preprocessing pipeline enables reliable operation in high-noise environments like drive-thru lanes, outdoor kiosks, and busy banking halls.

Mathematical Reasoning

The LLM performs real-time calculations (loan eligibility, order totals, discount application) and validates results against business rules before responding — no external calculator service required.

Challenges & Solutions

Challenge:

Achieving sub-2-second latency across a 5-stage pipeline (STT → LLM → DB → TTS → Client).

Solution:

Implemented streaming at every stage: chunked audio upload, streaming LLM inference with early token emission, and streaming TTS synthesis that begins playback before the full response is generated.

Challenge:

Handling simultaneous add/remove/modify intents in a single user utterance without data corruption.

Solution:

Designed an atomic transaction model where the LLM emits a structured JSON diff of cart mutations, which the server applies as a single atomic database operation before emitting granular WebSocket events.

Challenge:

Operating reliably in high-noise drive-thru environments with wind, engine noise, and background chatter.

Solution:

Implemented an adaptive noise threshold calibrated from the first 4 seconds of each recording session, allowing the system to dynamically filter ambient noise based on real-time environmental conditions before passing clean audio to STT.

F
Faisal

Software Developer passionate about building exceptional digital experiences with clean code and creative solutions.

Quick Links

HomeProjectsSkillsContact

Connect

Available for new projects

© 2026 Faisal. All rights reserved.

Made with ❤ in Saudi Arabia