Voice-AI-for-Beginners – A curated learning path for developers


A curated, developer-friendly learning path for building real-time voice AI agents, from your first STT call to scaling production telephony.

Voice AI has moved from research demos into shipping product in under three years. The modern stack is converging around a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text → LLM → text-to-speech, and a turn-taking model that decides when the agent should speak. This list is structured to mirror that learning order: start with the foundations, pick a framework, then drill into individual components and production concerns.

Resources are tagged 🟢 Beginner, 🟡 Intermediate, or 🔴 Advanced. The list favors free official docs and vendor-neutral guides, and flags where authors have commercial interests.

Read top-to-bottom if you're brand new. The recommended path:

  1. Foundations → understand the pipeline and latency budget
  2. Frameworks → pick one (LiveKit Agents or Pipecat are the safest open-source bets) and ship a hello-world
  3. Components (STT, TTS, LLM, VAD, turn detection) → swap pieces to learn what each layer does
  4. Transport & telephony → connect to a real phone number
  5. Evaluation, production, ethics → make it safe enough to ship

Contents:

  1. Foundational concepts and learning paths
  2. Frameworks and orchestration platforms
  3. Speech-to-text (STT / ASR)
  4. Text-to-speech (TTS)
  5. LLMs for voice and real-time AI
  6. Voice activity detection and turn-taking
  7. WebRTC fundamentals
  8. Telephony and SIP
  9. Tutorials and hands-on projects
  10. GitHub starter repos and awesome lists
  11. Datasets and benchmarks
  12. Beginner-accessible research papers
  13. Evaluation and testing
  14. Production, deployment, and scaling
  15. Ethics, safety, and regulation
  16. Blogs and newsletters
  17. Podcasts
  18. Communities
  19. Conferences and events
  20. Hackathons and competitions

1. Foundational concepts and learning paths

Start here. These resources establish the mental model of the voice agent pipeline and the latency budget you'll fight for the rest of your career.

  • Voice AI & Voice Agents: An Illustrated Primer - Kwindla Hultman Kramer's free, regularly updated long-form primer. The de facto textbook for the field. 🟢 Beginner
  • Voice Agent Architecture: STT, LLM, and TTS Pipelines Explained (LiveKit) - Visual walkthrough of streaming patterns, turn detection, and where latency accumulates. 🟢 Beginner
  • Everything You Need to Know About Voice AI Agents (Deepgram) - End-to-end primer covering feature extraction, ASR, LLM reasoning, and synthesis. 🟢 Beginner
  • AI Voice Agents (LiveKit Docs) - The canonical "what is a voice agent" reference, covering pipeline vs multimodal and agent state. 🟢 Beginner
  • Core Latency in AI Voice Agents (Twilio) - Visual explanation of end-of-turn detection, silence thresholds, and smart endpointing. 🟢 Beginner
  • Advice on Building Voice AI in June 2025 (Daily.co) - Practical P50/P95 latency-budget guidance from Pipecat's creators. 🟡 Intermediate
  • How Intelligent Turn Detection Solves the Biggest Challenge in Voice Agents (AssemblyAI) - Endpointing is the most underestimated problem; this is the clearest deep-dive. 🟡 Intermediate
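
To make the "latency budget" idea concrete, here is a minimal sketch that adds up per-stage latencies and compares the total against a target. Every number below is an illustrative placeholder, not a measurement, and the stage names and the 800 ms target are assumptions chosen for the example. It also assumes stages run strictly one after another; real streaming pipelines overlap stages, so treat the sum as a pessimistic upper bound.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency_ms: float  # placeholder value, replace with your own measurements


def total_latency_ms(stages):
    """Sum stage latencies, assuming no overlap between stages."""
    return sum(stage.latency_ms for stage in stages)


def over_budget_ms(stages, budget_ms):
    """Positive result: over budget by that many ms. Negative: headroom."""
    return total_latency_ms(stages) - budget_ms


def slowest_stage(stages):
    """Return the stage that contributes the most latency."""
    return max(stages, key=lambda stage: stage.latency_ms)


# Hypothetical pipeline with made-up numbers, for illustration only.
PIPELINE = [
    Stage("transport (mic to server)", 40),
    Stage("endpointing / turn detection", 200),
    Stage("STT final transcript", 150),
    Stage("LLM time to first token", 300),
    Stage("TTS time to first byte", 150),
    Stage("transport (server to speaker)", 40),
]

if __name__ == "__main__":
    print("total ms:", total_latency_ms(PIPELINE))
    print("over budget by ms:", over_budget_ms(PIPELINE, budget_ms=800))
    print("optimize first:", slowest_stage(PIPELINE).name)
```

Swapping a provider then becomes a one-line change to a `Stage`, which makes it easy to see whether the swap actually moves the total.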

2. Frameworks and orchestration platforms

The frameworks below all let you wire STT, an LLM, and TTS together. For open-source production work, LiveKit Agents and Pipecat are the two safest bets; for managed dashboards, Vapi, Retell, and Bland win on time-to-first-call.

  • LiveKit Agents Voice AI Quickstart - Working assistant in under 10 minutes via Python or TypeScript; runs on top of WebRTC. 🟢 Beginner
  • Pipecat Quickstart - Scaffolds a Deepgram + OpenAI + Cartesia pipeline you can talk to in the browser in 5 minutes. 🟢 Beginner
  • Ultravox (fixie-ai/ultravox) - Open-weight multimodal speech LLM (Llama/Gemma/Qwen variants) that skips the separate ASR stage for ~150 ms TTFT. 🔴 Advanced
  • Vapi Quickstart - Dashboard-first; ship an agent on a free US phone number in under 5 minutes. 🟢 Beginner
  • Retell AI Introduction & Quickstart - Phone-agent platform with $10 free credit on signup. 🟢 Beginner
  • Bland AI: Send Your First Phone Call - Minimal API tutorial for placing your first AI phone call. 🟢 Beginner
  • ElevenLabs Conversational AI Quickstart - Build and embed a voice agent widget on any website in 5 minutes. 🟢 Beginner
  • OpenAI Realtime API Guide - Official guide to gpt-realtime over WebRTC, WebSockets, or SIP. 🟡 Intermediate
  • Google Gemini Live API Overview - Low-latency, bidirectional voice + vision agents with barge-in and tool use. 🟡 Intermediate
  • Twilio ConversationRelay - WebSocket bridge that handles STT/TTS so you focus on LLM logic; works with any LLM. 🟡 Intermediate
  • Vapi vs Pipecat vs LiveKit (AssemblyAI) - Architecture-focused comparison of pipeline control and transport choices. 🟡 Intermediate
  • 11 Voice Agent Platforms Compared (Softcery) - Broad market map with use-case recommendations. 🟢 Beginner
  • Best Voice Agent Stack (Hamming AI) - Buy-vs-build framework with concrete cost, latency, and time-to-launch numbers. 🟡 Intermediate

3. Speech-to-text (STT / ASR)

Pick one streaming STT and learn it deeply before shopping around. Deepgram, AssemblyAI, and Whisper derivatives cover most use cases.

  • Deepgram Nova-3 STT benchmarks - Primer on WER, latency, and cost alongside Deepgram's product reference. 🟢 Beginner
  • AssemblyAI Universal-Streaming - Streaming STT walkthrough that doubles as a function-calling tutorial. 🟡 Intermediate
  • OpenAI Whisper / gpt-4o-transcribe API docs - Easiest cloud STT if you already use OpenAI. 🟢 Beginner
  • Soniox multilingual benchmark - Public WER comparison across 60 languages. 🟢 Beginner
  • Cartesia Ink - Streaming STT paired with Sonic TTS for a single-vendor low-latency stack. 🟢 Beginner
  • openai/whisper - The original repo and the de facto starting point for any DIY ASR project. 🟢 Beginner
  • SYSTRAN/faster-whisper - CTranslate2 reimplementation, up to 4× faster with INT8; recommended for self-hosted Whisper. 🟡 Intermediate
  • NVIDIA NeMo (Parakeet / Canary) - Top-of-leaderboard open ASR models with streaming inference recipes. 🔴 Advanced
  • Moonshine - Tiny on-device ASR (~190 MB) optimized for live streaming on edge devices. 🟡 Intermediate
  • Open ASR Leaderboard (HuggingFace) - Community leaderboard across 11 datasets; your reference for open-source picks. 🟢 Beginner
  • Artificial Analysis Speech-to-Text - Independent leaderboard ranking 48+ STT providers by WER, speed, and cost. 🟢 Beginner
  • Streaming vs Batch ASR (Arun Baby) - Engineer-friendly explainer of RNN-T and Conformer streaming architectures. 🟡 Intermediate
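
Most of the benchmarks above report word error rate (WER). As a rough sketch of what that number means, the function below computes WER as word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The normalization here (lowercasing and whitespace splitting) is a simplifying assumption; published leaderboards apply their own, more careful text normalization, so scores from this sketch will not match theirs exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("turn the lights on", "turn the light on"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.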

4. Text-to-speech (TTS)

Latency, not raw quality, is what kills voice agents: prioritize providers offering true streaming with first-byte latency under 200 ms.

  • ElevenLabs Docs - Industry-leading quality, voice cloning, and Conversational AI in one SDK. 🟢 Beginner
  • Cartesia Sonic Quickstart - Sub-100 ms first-byte latency, designed specifically for voice agents. 🟢 Beginner
  • Deepgram Aura - Low-latency streaming TTS that pairs cleanly with Deepgram STT. 🟢 Beginner
  • OpenAI TTS (gpt-4o-mini-tts) - Easiest plug-in TTS for the OpenAI stack. 🟢 Beginner
  • Artificial Analysis TTS leaderboard - ELO, price, and speed comparison covering Rime, PlayHT, Hume, Inworld, and others. 🟢 Beginner
  • Coqui TTS (idiap fork) - Maintained fork of Coqui-TTS / XTTS v2; the most battle-tested OSS TTS toolkit. 🟡 Intermediate
  • Piper (OHF-Voice/piper1-gpl) - Fast local neural TTS optimized for Raspberry Pi; perfect for offline projects. 🟢 Beginner
  • Kokoro 82M - Tiny Apache-licensed model that tops community ELO arenas; runs on CPU. 🟢 Beginner
  • F5-TTS - Diffusion-transformer TTS with high-quality zero-shot voice cloning. 🟡 Intermediate
  • Orpheus-TTS - Llama-3B-based emotive TTS with ~200 ms streaming and emotion tags. 🟡 Intermediate
  • Sesame CSM - Conversational, context-aware multi-speaker TTS using a Llama backbone with the Mimi codec. 🔴 Advanced
  • Streaming TTS for Low-Latency Agents (Picovoice) - Clear taxonomy of single, output-streaming, and dual-streaming TTS. 🟡 Intermediate
  • Ethics of Voice Cloning & Deepfakes (Deepgram) - Vendor-neutral discussion of misuse, regulation, and developer responsibility. 🟢 Beginner

5. LLMs for voice and real-time AI

A voice agent's perceived intelligence is bounded by how fast the LLM streams its first token. A time to first token (TTFT) under 300 ms changes how the conversation feels.

  • Groq - LPU-based inference cloud delivering ~10× faster Llama tokens/sec than commodity GPUs. 🟢 Beginner
  • Cerebras Inference - Wafer-scale chip inference with very high throughput on Llama models. 🟢 Beginner
  • SambaNova Cloud - Reconfigurable Dataflow inference; stable throughput at low latency. 🟢 Beginner
  • OpenAI Realtime API guide - Flagship S2S product with WebRTC/WebSocket transport. 🟡 Intermediate
  • Google Gemini Live - Real-time multimodal voice/video with barge-in and 70-language support. 🟡 Intermediate
  • Moshi (kyutai-labs) - Open-source full-duplex speech-text foundation model with 200 ms latency; the premier OSS S2S model to study. 🔴 Advanced
  • OpenAI Voice Agents Guide - Compares chained vs S2S architectures with prompt and tool best practices. 🟢 Beginner
  • ElevenLabs Voice Agent Prompting Guide - Production-grade prompt structure tuned for voice; vendor-neutral lessons. 🟡 Intermediate
  • Voice AI Prompt Engineering Guide (VoiceInfra) - Explains why voice prompts must be 60–70% shorter than chat prompts, with templates. 🟢 Beginner
  • Function Calling for Voice Agents (LiveKit Docs) - Concise guide to defining tools and RPC inside a voice agent. 🟡 Intermediate
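
When comparing the providers above, it helps to measure time to first token (TTFT) yourself rather than rely on headline numbers. The sketch below times any Python iterator that yields text chunks. `fake_llm_stream` is a hypothetical stand-in that simulates delays with `time.sleep`; in practice you would pass in whatever streaming iterator your provider's SDK returns, which is not shown here.

```python
import time


def measure_stream(stream, clock=time.perf_counter):
    """Consume a stream of text chunks and report TTFT and total time in ms."""
    start = clock()
    first_chunk_at = None
    chunks = []
    for chunk in stream:
        if first_chunk_at is None:
            first_chunk_at = clock()  # the moment the first token arrived
        chunks.append(chunk)
    end = clock()
    if first_chunk_at is None:
        raise ValueError("stream produced no chunks")
    return {
        "ttft_ms": (first_chunk_at - start) * 1000,
        "total_ms": (end - start) * 1000,
        "text": "".join(chunks),
    }


def fake_llm_stream(first_delay_s=0.05, gap_s=0.01):
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(first_delay_s)  # simulated network and prefill delay
    for token in ["Sure, ", "I can ", "help."]:
        yield token
        time.sleep(gap_s)  # simulated gap between tokens


if __name__ == "__main__":
    result = measure_stream(fake_llm_stream())
    print(f"TTFT {result['ttft_ms']:.0f} ms, total {result['total_ms']:.0f} ms")
```

Run it many times and look at the distribution rather than a single sample; one fast call says little about tail latency.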

6. Voice activity detection and turn-taking

Pure VAD is no longer enough: modern agents combine acoustic VAD with a small semantic model that predicts end-of-utterance from words and prosody.

  • Silero VAD - MIT-licensed pre-trained VAD; <1 ms per chunk on CPU. The de facto VAD inside LiveKit and Pipecat. 🟢 Beginner
  • py-webrtcvad - Python bindings for Google's classic WebRTC VAD; lightweight baseline. 🟢 Beginner
  • LiveKit Turn Detector blog post - How a SmolLM-based EOU model complements VAD with semantic context. 🟡 Intermediate
  • LiveKit turn-detector model on HuggingFace - Open-weights multilingual EOU model running ONNX on CPU in under 500 MB. 🟡 Intermediate
  • Pipecat Smart Turn v3 - Whisper-Tiny-based audio semantic VAD with 12 ms CPU inference, BSD-2 licensed. 🟡 Intermediate
  • pipecat-ai/smart-turn - Repo with model code, training scripts, and integration examples. 🟡 Intermediate
  • The Complete Guide to AI Turn-Taking (Tavus) - Reader-friendly overview of why pure VAD fails in real conversations. 🟢 Beginner
  • Tackling Turn Detection in Voice AI (Notch) - Engineer-first walkthrough combining VAD probability, volume, and TTS markers. 🟡 Intermediate
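
One common way to combine the two signals can be sketched as follows: the VAD supplies per-frame speech flags, and a semantic end-of-utterance (EOU) probability decides how much trailing silence to wait for before the agent replies. This is a simplified illustration, not how LiveKit or Pipecat implement it. The thresholds, the 20 ms frame size, and the single fixed `eou_prob` are all assumptions; a real system would recompute the probability from the live transcript at each pause.

```python
def end_of_turn_frame(
    speech_flags,
    eou_prob,
    frame_ms=20,
    short_silence_ms=200,
    long_silence_ms=800,
    eou_threshold=0.5,
):
    """Return the index of the frame where the turn ends, or None.

    speech_flags: per-frame booleans from any VAD (True means speech).
    eou_prob: hypothetical semantic model output, the probability that
        the user's utterance is complete.
    """
    # A confident "utterance complete" allows a short wait; otherwise
    # wait longer so a mid-sentence pause is not treated as end of turn.
    required_ms = short_silence_ms if eou_prob >= eou_threshold else long_silence_ms

    silence_ms = 0
    heard_speech = False
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= required_ms:
                return index
    return None


if __name__ == "__main__":
    # 200 ms of speech followed by 1 s of silence, in 20 ms frames.
    frames = [True] * 10 + [False] * 50
    print(end_of_turn_frame(frames, eou_prob=0.9))  # complete sentence
    print(end_of_turn_frame(frames, eou_prob=0.1))  # likely mid-thought
```

The design choice to notice is that the semantic model never overrides the VAD; it only shortens or lengthens the silence window, which keeps the failure mode bounded by `long_silence_ms`.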

7. WebRTC fundamentals

WebRTC is the default transport for voice agents that don't run over the phone network. Understanding ICE, STUN, TURN, and SFU architecture is non-negotiable for production work.

  • MDN WebRTC API - Authoritative free reference for RTCPeerConnection, getUserMedia, and signaling. 🟢 Beginner
  • MDN: Introduction to WebRTC Protocols - Beginner-friendly explanation of ICE, STUN, TURN, and SDP. 🟢 Beginner
  • WebRTC.org Getting Started - Official Google-maintained intro, splitting WebRTC into media-capture and connectivity. 🟢 Beginner
  • GetStream: WebRTC for the Brave - Free multi-module tutorial covering networking basics through advanced topics. 🟢 Beginner
  • Why WebRTC Beats WebSockets for Voice AI (LiveKit) - 2025 explainer aimed at AI builders, comparing transports in plain English. 🟡 Intermediate
  • Daily Docs: Intro to Video Architecture (P2P vs SFU) - One of the clearest beginner write-ups of P2P vs SFU. 🟢 Beginner
  • Agora: How WebRTC Works - Side-by-side WebRTC vs WebSockets walkthrough with signaling diagrams. 🟢 Beginner

8. Telephony and SIP

The phone network has its own physics. Once you know which SIP trunk provider to point at LiveKit or Pipecat, you can ship.

  • Twilio Programmable Voice - TwiML, Voice API, and PSTN connectivity in one hub; the default starting point. 🟢 Beginner
  • Twilio: Voice AI Assistant with OpenAI Realtime + Python - Step-by-step junior-friendly tutorial wiring Twilio Media Streams to an LLM. 🟢 Beginner
  • Twilio SIP Quickstart - Clearest beginner explainer of SIP basics, SIP Domains, and softphone setup. 🟢 Beginner
  • Telnyx Voice API - Strong Twilio alternative with WebSocket media streaming and AI Assistant tooling. 🟢 Beginner
  • Telnyx: How to Set Up a SIP Trunk - Friendly walkthrough of SIP trunking architecture, codecs, and authentication. 🟢 Beginner
  • Plivo Voice API Documentation - XML call control and audio-streaming integrations for AI agents. 🟢 Beginner
  • SignalWire Voice Docs - Built on FreeSWITCH; SWML, TwiML-compatible API, and an AI Agents SDK. 🟡 Intermediate
  • LiveKit SIP Primer - Best diagram of how a call flows from PSTN → trunk → SIP service → agent. 🟢 Beginner
  • LiveKit SIP Trunk Setup - Practical guide for wiring Twilio/Telnyx/Plivo trunks into LiveKit. 🟡 Intermediate
  • Pipecat Telephony Overview - Differences between WebSocket-based telephony and SIP-based call control. 🟡 Intermediate

9. Tutorials and hands-on projects

Pick one tutorial and finish it before starting another. Voice AI is unforgiving of half-built pipelines.

  • LiveKit Voice AI Quickstart - Official 10-minute walkthrough in Python or Node with starter templates. 🟢 Beginner
  • Build Your First AI Voice Agent in Python (LiveKit) - End-to-end Python tutorial covering streaming, latency, and deployment. 🟢 Beginner
  • Pipecat Quickstart - Build and deploy a Deepgram + OpenAI + Cartesia bot in roughly 10 minutes. 🟢 Beginner
  • How to Build a Real-Time Voice Agent with Pipecat (AssemblyAI) - Production-oriented walkthrough including local testing and Pipecat Cloud deployment. 🟡 Intermediate
  • Deepgram: Build a Voice AI Agent - Step-by-step guide wiring Deepgram STT, GPT, and Aura TTS. 🟢 Beginner
  • Build a Voice Assistant with Twilio ConversationRelay + LiteLLM - Provider-agnostic tutorial supporting OpenAI, Anthropic, or DeepSeek. 🟡 Intermediate
  • freeCodeCamp: Build Advanced AI Agents (LiveKit, Exa, LangChain) - Free 3-part video course covering interactive voice agents end-to-end. 🟢 Beginner
  • freeCodeCamp: Private On-Device Voice Assistant - Hands-on local stack with Whisper, a local LLM, and system TTS. 🟡 Intermediate

10. GitHub starter repos and awesome lists

Clone these instead of writing boilerplate from scratch.

  • livekit/agents - The flagship open-source Python/Node framework for production voice agents. 🟢 → 🔴
  • pipecat-ai/pipecat - Vendor-neutral framework with 40+ STT/LLM/TTS service plugins. 🟢 → 🔴
  • livekit-examples/agent-starter-python - Production-ready starter with Dockerfile, eval suite, turn detector, and core plugins. 🟢 Beginner
  • livekit-examples (org) - Official collection of LiveKit Python/React/Swift/Android starters. 🟢 Beginner
  • pipecat-ai/pipecat-examples - Sample apps for push-to-talk, WebSocket, telephony, and multimodal use cases. 🟢 → 🟡
  • elevenlabs/elevenlabs-examples - Runnable Next.js and Python examples for TTS, STT, and real-time agents. 🟢 Beginner
  • vocodedev/vocode-core - Open-source modular framework for voice-LLM agents on phone, Zoom, or system audio. 🟡 Intermediate (less actively maintained than LiveKit/Pipecat)
  • kwindla/macos-local-voice-agents - Pipecat example hitting sub-800 ms voice-to-voice latency entirely on M-series Macs. 🟡 Intermediate
  • zzw922cn/awesome-speech-recognition-speech-synthesis-papers - Comprehensive curated index of ASR, TTS, voice conversion, and speech-LLM papers. 🟡 Intermediate
  • wildminder/awesome-ai-voice - Up-to-date 2025–2026 list of open-source TTS and voice-cloning models.
  • CorentinJ/Real-Time-Voice-Cloning - Classic 5-second voice cloning project for understanding TTS fundamentals. 🟡 Intermediate

11. Datasets and benchmarks

You'll rarely train from scratch, but knowing which dataset a model was trained on explains its accents, languages, and failure modes.

  • LibriSpeech ASR Corpus - ~1,000 hours of English audiobooks; nearly every ASR paper benchmarks against it. 🟢 Beginner
  • Mozilla Common Voice - Crowdsourced multilingual dataset (100+ languages); the easiest legal way to fine-tune ASR. 🟢 Beginner
  • Common Voice on HuggingFace - One-line load_dataset() access for hands-on experiments. 🟢 Beginner
  • Open ASR Leaderboard - Live comparison of 60+ ASR models on WER and real-time factor. 🟢 Beginner
  • Artificial Analysis Speech - Independent benchmarks of commercial STT and TTS providers. 🟢 Beginner
  • LJSpeech Dataset - ~24 hours of single-speaker English audio; baseline corpus for Tacotron 2 and VITS. 🟢 Beginner
  • VCTK Corpus - ~110 English speakers with diverse accents; widely used for multi-speaker TTS. 🟡 Intermediate
  • VoxCeleb (Oxford VGG) - Million-utterance "in the wild" dataset for speaker identification and verification. 🟡 Intermediate

12. Beginner-accessible research papers

These are the landmark papers behind the models you'll actually use. Read the Whisper and Common Voice papers first; they're unusually approachable.

  • Whisper: Robust Speech Recognition via Large-Scale Weak Supervision (2022) - Behind the most popular open ASR model; unusually clear prose for an ML paper. 🟡 Intermediate
  • HuggingFace Whisper fine-tuning blog (companion) - Hands-on walkthrough that lets you "feel" the Whisper paper in code. 🟢 Beginner
  • VITS: Conditional VAE with Adversarial Learning for End-to-End TTS (2021) - The single-stage TTS model behind many open-source voice cloners. 🟡 Intermediate
  • Tacotron 2: Natural TTS Synthesis (2017) - Landmark seq2seq + WaveNet-vocoder paper that made neural TTS sound natural. 🟡 Intermediate
  • Conformer: Convolution-augmented Transformer for ASR (2020) - The architecture inside NVIDIA Parakeet, Canary, and many leaderboard models. 🟡 Intermediate
  • wav2vec 2.0: Self-Supervised Learning of Speech Representations (2020) - Showed that pretraining on unlabeled audio drastically reduces labeled-data needs. 🟡 Intermediate
  • Common Voice: A Massively-Multilingual Speech Corpus (2020) - Short, accessible paper describing how Common Voice is built and validated. 🟢 Beginner
  • Open ASR Leaderboard preprint (2025) - Reproducible benchmark of 60+ ASR models across 11 datasets; the modern landscape map. 🟡 Intermediate

13. Evaluation and testing

You can't ship what you can't measure. Voice-agent evaluation is fundamentally probabilistic: a single transcript can pass and fail across runs, so simulation and statistics matter more than fixed test cases.

  • Coval Voice AI Testing Platform - Defines the core voice-agent metrics: TTFB, WER, resolution rate, simulated accents, and interruptions. 🟢 Beginner
  • Coval: How to Evaluate Voice Agents (Practical Guide) - One of the most cited 2025 guides on probabilistic vs deterministic evaluation. 🟢 Beginner
  • Cekura Metrics Overview - Predefined metrics, instruction-following checks, and simulation framework. 🟢 Beginner
  • Cekura: Performance Testing for Voice Agents - Practical 2025 guide on multi-turn simulation and edge-case generation. 🟡 Intermediate
  • Hamming AI - Production-focused QA platform with simulation, load testing, and 50+ metrics. 🟡 Intermediate
  • Hamming Voice Agent Evaluation Metrics Guide - Reference of latency percentiles, WER, MOS-style quality, and task completion with formulas. 🟡 Intermediate
  • LiveKit: Understand and Improve Agent Latency - Per-turn latency metrics (e2e, LLM TTFT, TTS TTFB) and where to optimize. 🟡 Intermediate
  • Twilio: How Do You Know if Your Voice AI Agents Are Working? - Vendor-neutral 2025 guide arguing for business-outcome metrics over raw WER/latency. 🟢 Beginner
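
Because the same scenario can pass on one run and fail on the next, a single pass/fail result says little. One simple statistical treatment, sketched below, is to rerun a scenario many times and report the pass rate with a Wilson score confidence interval. This is a generic statistics technique chosen for illustration, not something prescribed by the tools listed above, and the counts in the example are made up.

```python
import math


def wilson_interval(passes, runs, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passes <= runs:
        raise ValueError("passes must be between 0 and runs")

    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical result: a simulated call scenario passed 18 of 20 runs.
    low, high = wilson_interval(passes=18, runs=20)
    print(f"pass rate 90%, 95% interval {low:.0%} to {high:.0%}")
```

With only 20 runs the interval is wide, which is the point: a scenario that "passed 18 of 20" is compatible with a true pass rate anywhere from roughly 70% to the high 90s.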

14. Production, deployment, and scaling

Real production voice infrastructure is the hardest unsolved problem in this space. Read these before quoting anyone a per-minute price.

  • LiveKit: Deploy and scale agents on LiveKit Cloud - Real-world write-up on stateful load balancing, autoscaling, and warm pools. 🟡 Intermediate
  • LiveKit: Why You Shouldn't Build Voice Agents Directly on Model APIs - Honest breakdown of what raw model APIs don't give you. 🟡 Intermediate
  • Latent Space: OpenAI Realtime API: The Missing Manual - Field-tested guide from Pipecat's creator on Realtime API production realities. 🟡 Intermediate
  • TWIML: Building Voice AI Agents That Don't Suck (Kwindla Kramer) - One-hour discussion on real production architecture and turn-taking. 🟡 Intermediate
  • AWS: Voice Agents with Pipecat and Amazon Bedrock - Full architecture walkthrough including latency optimization and Nova Sonic. 🟡 Intermediate
  • Deepgram STT API Pricing Breakdown - Vendor-by-vendor per-minute economics; required reading before signing any contract. 🟢 Beginner
  • Sierra: Shipping and Scaling AI Agents - Case study on Sonos, SiriusXM, and OluKai voice deployments. 🟡 Intermediate
  • Sierra: Constellation of Models - How a leading CX company composes 15+ models per agent. 🟡 Intermediate
  • LiveKit Agent Observability - Built-in tracing, transcripts, and per-stage latency for LiveKit Cloud. 🟢 Beginner

15. Ethics, safety, and regulation

If you're shipping a voice agent in 2026, disclosure and consent are no longer optional. The FCC and EU AI Act both have teeth.

  • FCC: AI-Generated Voices in Robocalls Illegal (Feb 2024) - The landmark TCPA ruling every U.S. voice-agent dev must read. 🟢 Beginner
  • EU AI Act Article 50 (Transparency for Deepfakes & AI Interactions) - Authoritative text of EU disclosure rules; takes effect August 2026. 🟡 Intermediate
  • European Commission: Code of Practice on AI-Generated Content - Official EU implementation guidance on watermarking and labelling. 🟡 Intermediate
  • FTC: Approaches to Address AI-Enabled Voice Cloning - Plain-English summary of the Voice Cloning Challenge winners and Impersonation Rule. 🟢 Beginner
  • FTC: Final Impersonation Rule (Feb 2024) - Direct source on U.S. impersonation-fraud rules covering AI deepfakes. 🟢 Beginner
  • Pindrop 2025 Voice Intelligence & Security Report - Industry report documenting a 1,300% rise in deepfake fraud attempts. 🟢 Beginner
  • Voice Cloning Ethics (CAMB.AI) - Practical overview of consent frameworks, ELVIS Act, and EU AI Act. 🟢 Beginner
  • NCLC: Top Six TCPA/Robocall Developments 2024/2025 - Consumer-protection lens on what's actually being enforced. 🟡 Intermediate

16. Blogs and newsletters

Subscribe to two or three to stay current; the field moves quickly.

  • LiveKit Blog - Engineering deep-dives on WebRTC, agents framework releases, and production patterns.
  • Deepgram Learn - Tutorials on STT/TTS, voice agent design, evals, and pipeline architecture.
  • Cartesia Blog - State-space TTS models, Sonic releases, and yearly "State of Voice AI" reports.
  • ElevenLabs Blog - Product and research announcements with implementation notes.
  • Daily.co Blog (Pipecat) - Posts from Pipecat's maintainers covering scaling and feature releases.
  • Voice AI & Voice Agents Illustrated Primer - Free, regularly updated long-form primer.
  • Latent Space (swyx & Alessio) - AI Engineer newsletter and podcast with frequent voice-AI episodes.
  • Voice AI Newsletter (Krisp) - "Future of Voice AI" interview series with founders; published weekly in 2025.
  • Voice AI Weekly (Vapi) - Weekly Substack rounding up news, products, and tools.
  • Voicebot.ai (Synthedia) - Long-running daily news and paid newsletter on industry trends.

17. Podcasts

  • The Voicebot Podcast (Bret Kinsella) - Longest-running serious voice-tech podcast; weekly founder interviews.
  • Latent Space: The AI Engineer Podcast - Top US tech podcast; regularly covers Realtime API, Pipecat, Voxtral, Gemini Live.
  • The Future of Voice AI (Krisp) - Weekly founder interviews focused on enterprise voice AI architecture.
  • TWIML AI Podcast voice episodes - Strong technical interviews; the Kwin Kramer episode is a great starting point.
  • This Week In Voice (Project Voice) - News-roundtable format covering conversational AI.

18. Communities

  • LiveKit Community Slack - Direct access to maintainers and other agent builders.
  • Pipecat Discord - Active community with weekly office hours; invite link from the homepage.
  • HuggingFace Discord #ml-for-audio-and-speech - 200k-member server with strong audio/speech channels.
  • Vapi Discord - Builder community for Vapi voice agents; invite from the homepage.
  • Retell AI Discord - Discord for Retell developers building phone-call voice agents.
  • ElevenLabs Discord - Large TTS, voice cloning, and Conversational AI community with daily help threads.
  • Deepgram Discord - STT/TTS/Voice Agent API support and build-with-us threads.
  • Reddit r/LocalLLaMA - Active threads on local Whisper/Parakeet, on-device TTS, and end-to-end voice stacks.
  • Reddit r/AI_Agents - General AI-agent community where voice topics surface frequently.

19. Conferences and events

  • AI Engineer World's Fair - Biggest AI-engineering conference; the Voice track has hosted major launches from ElevenLabs, Vapi, LiveKit, Pipecat, and Cartesia. 🟢 Beginner
  • AI Engineer YouTube channel - All World's Fair and Summit talks are posted free; the best library of recent voice-AI talks. 🟢 Beginner
  • AI Engineer Summit Online Voice playlist - Curated playlist including voice-track sessions from leading labs. 🟢 Beginner
  • AIEWF 2025 Recap (Latent Space) - Written deep-dive into 2025's voice-track talks and major launches. 🟢 Beginner
  • VOICE & AI (Modev) - Long-running voice technology conference with broader CX and voicebot focus. 🟢 Beginner
  • Project Voice - Main U.S. event for conversational AI across voice, text, and chat. 🟢 Beginner
  • Interspeech - Top academic speech-science conference; intimidating but worth knowing, since most landmark papers debut here. 🔴 Advanced

20. Hackathons and competitions

  • ElevenLabs Worldwide Hackathon - Flagship global hackathon for conversational agents; 30+ cities and a $200K+ prize pool. 🟢 Beginner
  • ElevenHacks (weekly sprints) - Weekly themed challenges with credits and prizes; low-pressure way to ship one project per week. 🟢 Beginner
  • AI Engineer World's Fair Hackathon - Co-located with the conference; $10K prizes judged by 3,000+ AI engineers, with a strong voice track. 🟡 Intermediate
  • lablab.ai AI Hackathons - Continuous calendar of short online hackathons frequently sponsored by voice-AI vendors. 🟢 Beginner
  • Devpost Voice AI Hackathons - Centralized search for active voice-AI hackathons; the best way to find what's open right now. 🟢 Beginner

  1. Week 1, Foundations: Read the LiveKit pipeline post and the Voice AI Illustrated Primer (sections 1, 7).
  2. Week 2, First agent: Finish the LiveKit or Pipecat quickstart end-to-end (sections 2, 9).
  3. Week 3, Components: Swap STT, TTS, and LLM providers; benchmark latency (sections 3, 4, 5).
  4. Week 4, Turn-taking & telephony: Add Silero VAD and a turn detector; connect a SIP trunk (sections 6, 8).
  5. Week 5, Production: Add evaluation and observability, and read the FCC/EU AI Act material (sections 13, 14, 15).
  6. Ongoing: Subscribe to two newsletters and join a voice AI community on LinkedIn (sections 16, 17, 18).

Pull requests welcome. Resources must be active in the last 12 months, accessible to developers, and vendor-neutral or clearly labeled when authored by a commercial party. Open an issue to suggest additions or removals.

About

Set of 📝 with 🔗 to help those building Voice AI agents 🎙️🤖

mahimairaja.github.io/voiceai/

Topics

text-to-speech awesome webrtc tts speech-synthesis speech-recognition awesome-list speech-to-text beginners asr voice-assistant ai-agents conversational-ai learning-resources voice-ai livekit llm pipecat voice-agents realtime-ai


License

MIT license
