Speech-to-Speech
Speech-to-speech is an AI technology that converts spoken input directly into spoken output, without an intermediate text step. Voice agents use it to hold natural, real-time phone conversations.
Also known as: S2S, Speech to Speech
How speech-to-speech works
Classical voice pipelines run in three steps: speech-to-text, language model, text-to-speech. Each step adds latency, and the handoff to text discards information such as tone and pauses. A speech-to-speech model takes audio in and returns audio directly, with no intermediate text stage.
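The difference between the two approaches can be sketched in a few lines of Python. This is a toy model, not LoyJoy's implementation: the `Audio` type, the stage functions, and every latency figure are hypothetical placeholders chosen only to show how per-stage latency adds up and where tone gets lost.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    words: str  # what was said
    tone: str   # stand-in for prosody (tone, pauses)


@dataclass
class Result:
    audio: Audio
    latency_ms: int


# Hypothetical stages of a classical pipeline. Latencies are made-up numbers.
def speech_to_text(audio: Audio) -> tuple[str, int]:
    # Only the words survive the conversion to text; the tone is dropped here.
    return audio.words, 300


def language_model(text: str) -> tuple[str, int]:
    return f"reply to: {text}", 500


def text_to_speech(text: str) -> tuple[Audio, int]:
    # The synthesizer never saw the caller's tone, so it uses a default one.
    return Audio(words=text, tone="neutral"), 300


def cascaded_pipeline(audio: Audio) -> Result:
    """Three steps in sequence: the latencies of all stages add up."""
    text, t1 = speech_to_text(audio)
    reply, t2 = language_model(text)
    spoken, t3 = text_to_speech(reply)
    return Result(audio=spoken, latency_ms=t1 + t2 + t3)


def speech_to_speech(audio: Audio) -> Result:
    """One model, audio in and audio out: a single latency, tone still available."""
    reply = Audio(words=f"reply to: {audio.words}", tone=audio.tone)
    return Result(audio=reply, latency_ms=600)  # assumed figure


caller = Audio(words="hello", tone="cheerful")
print(cascaded_pipeline(caller))
print(speech_to_speech(caller))
```

In this sketch the cascaded pipeline pays for three stages and answers in a neutral tone, while the single-model path pays once and can still react to how the caller spoke. Real systems differ in the details, but the structural point is the same.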
Advantages over classical pipelines
- Latency under one second, suitable for natural dialogue
- Tone and pauses are preserved
- Robust to accents, background noise, and interruptions
Speech-to-speech at LoyJoy
The LoyJoy voice agent uses speech-to-speech and brings the same AI agent customers know from chat to the phone channel.