
# audio-language-models

by yonatangross

The Complete AI Development Toolkit for Claude Code: 159 skills, 34 agents, 20 commands, 144 hooks. Production-ready patterns for FastAPI, React 19, LangGraph, security, and testing.
## SKILL.md

```yaml
---
name: audio-language-models
description: Gemini Live API, Grok Voice Agent, GPT-4o-Transcribe, AssemblyAI patterns for real-time voice, speech-to-text, and TTS. Use when implementing voice agents, audio transcription, or conversational AI.
context: fork
agent: multimodal-specialist
version: 1.1.0
author: OrchestKit
user-invocable: false
tags: [audio, multimodal, gemini-live, grok-voice, whisper, tts, speech, voice-agent, 2026]
---
```
# Audio Language Models (2026)

Build real-time voice agents and audio-processing pipelines with the latest native speech-to-speech models.

## Overview
- Real-time voice assistants and agents
- Live conversational AI (phone agents, support bots)
- Audio transcription with speaker diarization
- Multilingual voice interactions
- Text-to-speech generation
- Voice-to-voice translation
## Model Comparison (January 2026)

### Real-Time Voice (Speech-to-Speech)
| Model | Latency | Languages | Price | Best For |
|---|---|---|---|---|
| Grok Voice Agent | <1s TTFA | 100+ | $0.05/min | Fastest, #1 on Big Bench Audio |
| Gemini Live API | Low | 24 (30 voices) | Usage-based | Emotional awareness |
| OpenAI Realtime | ~1s | 50+ | $0.10/min | Ecosystem integration |
### Speech-to-Text Only
| Model | WER | Latency | Best For |
|---|---|---|---|
| Gemini 2.5 Pro | ~5% | Medium | 9.5hr audio, diarization |
| GPT-4o-Transcribe | ~7% | Medium | Accuracy + accents |
| AssemblyAI Universal-2 | 8.4% | 200ms | Best features |
| Deepgram Nova-3 | ~18% | <300ms | Lowest latency |
| Whisper Large V3 | 7.4% | Slow | Self-host, 99+ langs |
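For the self-host row, here is a minimal local sketch using the open-source `openai-whisper` package. This is one assumption for how you might self-host; `faster-whisper` is a common lower-latency alternative.

```python
# Minimal self-hosted transcription sketch with the openai-whisper package.
# pip install openai-whisper  (also requires ffmpeg on PATH)
import whisper

model = whisper.load_model("large-v3")  # downloads ~3 GB on first run; GPU recommended


def transcribe_local(audio_path: str) -> dict:
    result = model.transcribe(audio_path)  # auto-detects language
    return {
        "text": result["text"],
        "segments": result["segments"],   # timestamped segments
        "language": result["language"],
    }
```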
## Grok Voice Agent API (xAI) - Fastest

```python
import asyncio
import base64
import json
import os

import websockets

XAI_API_KEY = os.environ["XAI_API_KEY"]


async def grok_voice_agent(audio_stream):
    """Real-time voice agent with Grok - #1 on Big Bench Audio.

    Features:
    - <1 second time-to-first-audio (5x faster than competitors)
    - Native speech-to-speech (no transcription intermediary)
    - 100+ languages, $0.05/min
    - OpenAI Realtime API compatible

    Yields raw PCM16 audio chunks as they arrive.
    """
    uri = "wss://api.x.ai/v1/realtime"
    headers = {"Authorization": f"Bearer {XAI_API_KEY}"}

    # Note: on websockets>=14 the keyword is additional_headers
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "model": "grok-4-voice",
                "voice": "Aria",  # or "Eve", "Leo"
                "instructions": "You are a helpful voice assistant.",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad"},
            },
        }))

        # Stream input audio to the server in the background
        async def send_audio():
            async for chunk in audio_stream:
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode(),
                }))

        sender = asyncio.create_task(send_audio())
        try:
            # Yield decoded audio deltas as they arrive
            async for message in ws:
                data = json.loads(message)
                if data["type"] == "response.audio.delta":
                    yield base64.b64decode(data["delta"])
        finally:
            sender.cancel()


# Expressive voice with auditory cues
async def expressive_response(ws, text: str):
    """Use auditory cues for natural speech.

    Supported cues: [whisper], [sigh], [laugh], [pause]
    """
    await ws.send(json.dumps({
        "type": "response.create",
        "response": {
            "instructions": f"[sigh] Let me think about that... [pause] {text}",
        },
    }))
```
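To drive the generator above, a hypothetical wiring where `mic_chunks()` yields PCM16 frames from your capture device and `play_pcm16()` writes to your output device; both are placeholders for your audio I/O layer.

```python
# Hypothetical audio I/O: mic_chunks() and play_pcm16() stand in for your
# capture/playback layer (e.g. sounddevice, pyaudio).
async def main():
    async for chunk in grok_voice_agent(mic_chunks()):
        play_pcm16(chunk)

asyncio.run(main())
```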
## Gemini Live API (Google) - Emotional Awareness

```python
import google.generativeai as genai
from google.generativeai import live

genai.configure(api_key="YOUR_API_KEY")

# Shared by both functions below
model = genai.GenerativeModel("gemini-2.5-flash-live")
config = live.LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=live.SpeechConfig(
        voice_config=live.VoiceConfig(
            prebuilt_voice_config=live.PrebuiltVoiceConfig(
                voice_name="Puck"  # or Charon, Kore, Fenrir, Aoede
            )
        )
    ),
    system_instruction="You are a friendly voice assistant.",
)


async def gemini_live_voice():
    """Real-time voice with emotional understanding.

    Features:
    - 30 HD voices in 24 languages
    - Affective dialog (understands emotions)
    - Barge-in support (interrupt anytime)
    - Proactive audio (responds only when relevant)
    """
    async with model.connect(config=config) as session:
        # Send audio (call from your capture loop)
        async def send_audio(audio_chunk: bytes):
            await session.send(
                input=live.LiveClientContent(
                    realtime_input=live.RealtimeInput(
                        media_chunks=[live.MediaChunk(
                            data=audio_chunk,
                            mime_type="audio/pcm",
                        )]
                    )
                )
            )

        # Receive audio responses
        async for response in session.receive():
            if response.data:
                yield response.data  # Audio bytes


# With transcription
async def gemini_live_with_transcript():
    """Get both audio and text transcripts."""
    async with model.connect(config=config) as session:
        async for response in session.receive():
            if response.server_content:
                # Text transcript
                if response.server_content.model_turn:
                    for part in response.server_content.model_turn.parts:
                        if part.text:
                            print(f"Transcript: {part.text}")
            if response.data:
                yield response.data  # Audio
```
## Gemini Audio Transcription (Long-Form)

```python
import google.generativeai as genai


def transcribe_with_gemini(audio_path: str) -> dict:
    """Transcribe up to 9.5 hours of audio with speaker diarization.

    Gemini 2.5 Pro handles long-form audio natively.
    """
    model = genai.GenerativeModel("gemini-2.5-pro")

    # Upload audio file via the Files API
    audio_file = genai.upload_file(audio_path)

    response = model.generate_content([
        audio_file,
        """Transcribe this audio with:
        1. Speaker labels (Speaker 1, Speaker 2, etc.)
        2. Timestamps for each segment
        3. Punctuation and formatting

        Format:
        [00:00:00] Speaker 1: First statement...
        [00:00:15] Speaker 2: Response...""",
    ])

    return {
        "transcript": response.text,
        "audio_duration": audio_file.duration,
    }
```
## Gemini TTS (Text-to-Speech)

```python
import google.generativeai as genai


def gemini_text_to_speech(text: str, voice: str = "Kore") -> bytes:
    """Generate speech with Gemini 2.5 TTS.

    Features:
    - Enhanced expressivity with style prompts
    - Precision pacing (context-aware speed)
    - Multi-speaker dialogue consistency
    """
    model = genai.GenerativeModel("gemini-2.5-flash-tts")

    response = model.generate_content(
        contents=text,
        generation_config=genai.GenerationConfig(
            response_mime_type="audio/mp3",
            speech_config=genai.SpeechConfig(
                voice_config=genai.VoiceConfig(
                    prebuilt_voice_config=genai.PrebuiltVoiceConfig(
                        voice_name=voice  # Puck, Charon, Kore, Fenrir, Aoede
                    )
                )
            ),
        ),
    )
    return response.audio
```
## OpenAI GPT-4o-Transcribe

```python
from openai import OpenAI

client = OpenAI()


def transcribe_openai(audio_path: str, language: str | None = None) -> dict:
    """Transcribe with GPT-4o-Transcribe (enhanced accuracy)."""
    with open(audio_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=audio_file,
            language=language,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"],
        )

    return {
        "text": response.text,
        "words": response.words,
        "segments": response.segments,
        "duration": response.duration,
    }
```
## AssemblyAI (Best Features)

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"


def transcribe_assemblyai(audio_url: str) -> dict:
    """Transcribe with speaker diarization, sentiment, entities."""
    config = aai.TranscriptionConfig(
        speaker_labels=True,
        sentiment_analysis=True,
        entity_detection=True,
        auto_highlights=True,
        language_detection=True,
    )

    transcriber = aai.Transcriber()
    transcript = transcriber.transcribe(audio_url, config=config)

    return {
        "text": transcript.text,
        "speakers": transcript.utterances,
        "sentiment": transcript.sentiment_analysis,
        "entities": transcript.entities,
    }
```
## Real-Time Streaming Comparison

```python
def choose_realtime_provider(requirements: dict) -> str:
    """Select the best real-time voice provider for a set of requirements."""
    if requirements.get("fastest_latency"):
        return "grok"  # <1s TTFA, 5x faster
    if requirements.get("emotional_understanding"):
        return "gemini"  # Affective dialog
    if requirements.get("openai_ecosystem"):
        return "openai"  # Compatible tools
    if requirements.get("lowest_cost"):
        return "grok"  # $0.05/min (half of OpenAI)
    return "gemini"  # Best overall for 2026
```
## API Pricing (January 2026)

| Provider | Type | Price | Notes |
|---|---|---|---|
| Grok Voice Agent | Real-time | $0.05/min | Cheapest real-time |
| Gemini Live | Real-time | Usage-based | 30 HD voices |
| OpenAI Realtime | Real-time | $0.10/min | |
| Gemini 2.5 Pro | Transcription | $1.25/M tokens | 9.5hr audio |
| GPT-4o-Transcribe | Transcription | $0.01/min | |
| AssemblyAI | Transcription | ~$0.15/hr | Best features |
| Deepgram | Transcription | ~$0.0043/min | |
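The units above are mixed (per minute, per hour, per million tokens). A quick sketch normalizing everything to dollars per audio-hour, with rates hardcoded from the table; the ~32 tokens/second audio tokenization rate assumed for Gemini should be verified against current docs.

```python
# Normalize the pricing table's mixed units to $/hour of audio.
AUDIO_TOKENS_PER_SEC = 32  # assumed Gemini audio tokenization rate


def cost_per_hour(provider: str) -> float:
    per_min = {
        "grok": 0.05,
        "openai-realtime": 0.10,
        "gpt-4o-transcribe": 0.01,
        "deepgram": 0.0043,
    }
    if provider in per_min:
        return per_min[provider] * 60
    if provider == "assemblyai":
        return 0.15  # already quoted per hour
    if provider == "gemini-2.5-pro":
        # $1.25 per 1M input tokens -> ~$0.14/hr at 32 tokens/sec
        return AUDIO_TOKENS_PER_SEC * 3600 * 1.25 / 1_000_000
    raise ValueError(f"unknown provider: {provider}")
```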
## Key Decisions
| Scenario | Recommendation |
|---|---|
| Voice assistant | Grok Voice Agent (fastest) |
| Emotional AI | Gemini Live API |
| Long audio (hours) | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Lowest latency STT | Deepgram Nova-3 |
| Self-hosted | Whisper Large V3 |
## Common Mistakes
- Using STT+LLM+TTS pipeline instead of native speech-to-speech
- Not leveraging emotional understanding (Gemini)
- Ignoring barge-in support for natural conversations (see the sketch after this list)
- Using deprecated Whisper-1 instead of GPT-4o-Transcribe
- Not testing latency with real users
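For barge-in, here is a minimal sketch assuming OpenAI Realtime-style event names (`input_audio_buffer.speech_started`, `response.cancel`), which Grok advertises compatibility with; `playback` is a hypothetical wrapper around your audio output. Verify the exact event names against your provider's reference.

```python
import base64
import json


async def handle_events(ws, playback):
    """Stop playback and cancel the in-flight response when the user speaks."""
    async for message in ws:
        event = json.loads(message)
        if event["type"] == "input_audio_buffer.speech_started":
            playback.stop()  # hypothetical: halt local audio immediately
            await ws.send(json.dumps({"type": "response.cancel"}))
        elif event["type"] == "response.audio.delta":
            playback.play(base64.b64decode(event["delta"]))  # hypothetical playback
```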
## Related Skills

- `vision-language-models` - Image/video processing
- `multimodal-rag` - Audio + text retrieval
- `streaming-api-patterns` - WebSocket patterns
## Capability Details

### real-time-voice

Keywords: voice agent, real-time, conversational, live audio

Solves:
- Build voice assistants
- Phone agents and support bots
- Interactive voice response (IVR)
### speech-to-speech

Keywords: native audio, speech-to-speech, no transcription

Solves:
- Low-latency voice responses
- Natural conversation flow
- Emotional voice interactions
### transcription

Keywords: transcribe, speech-to-text, STT, convert audio

Solves:
- Convert audio files to text
- Generate meeting transcripts
- Process long-form audio
### voice-tts

Keywords: TTS, text-to-speech, voice synthesis

Solves:
- Generate natural speech
- Multi-voice dialogue
- Expressive audio output