This repository demonstrates the integration between Agent Voice Response (AVR) and OpenAI's Real-time Speech-to-Speech API. The application sends user audio to an OpenAI realtime model and returns the model's context-aware responses as real-time audio.
To set up and run this project, you will need:
- Node.js and npm installed
- An OpenAI API key with access to the real-time API
- WebSocket support in your environment
git clone https://github.com/agentvoiceresponse/avr-sts-openai.git
cd avr-sts-openai
npm install
Create a .env file in the root of the project to store your API key and configuration. You will need to add the following variables:
OPENAI_API_KEY=your_openai_api_key
PORT=6030
OPENAI_MODEL=gpt-4o-realtime-preview # Optional, defaults to gpt-4o-realtime-preview
OPENAI_INSTRUCTIONS="You are a helpful assistant that can answer questions and help with tasks." # Optional
OPENAI_TEMPERATURE=0.8 # Optional, controls randomness (0.0-1.0), defaults to 0.8
OPENAI_MAX_TOKENS=100 # Optional, controls response length, defaults to "inf"
Replace your_openai_api_key with your actual OpenAI API key.
Start the application by running the following command:
node index.js
The server will start on the port defined by the PORT environment variable (default: 6030).
The Agent Voice Response system integrates with OpenAI's Real-time Speech-to-Speech API to provide intelligent audio-based responses to user queries. The server receives audio input from users, forwards it to OpenAI's API, and then returns the model's response as audio in real-time using WebSocket communication.
- Express.js Server: Handles incoming audio streams from clients
- WebSocket Communication: Manages real-time communication with OpenAI's API
- Audio Processing: Handles audio format conversion between 8kHz and 24kHz
- Real-time Streaming: Processes and streams audio data in real-time
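The WebSocket exchange described above can be pictured as a handful of JSON messages. The following is a minimal sketch, not the project's actual code: the helper names (buildSessionUpdate, buildAudioAppend, extractAudioDelta) are hypothetical, and the event names (session.update, input_audio_buffer.append, response.audio.delta) are taken from the preview version of OpenAI's Realtime API, so check them against the current API reference before relying on them.

```javascript
// Hypothetical helpers illustrating the message shapes exchanged over the
// WebSocket. Event names follow the preview Realtime API (an assumption).

// Configure the session with the values read from the environment.
function buildSessionUpdate({ instructions, temperature, maxTokens }) {
  return JSON.stringify({
    type: "session.update",
    session: {
      instructions,
      temperature,
      max_response_output_tokens: maxTokens,
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
    },
  });
}

// Wrap a chunk of 24kHz 16-bit PCM audio as a base64 append event.
function buildAudioAppend(pcm24k) {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcm24k.toString("base64"),
  });
}

// Return the decoded audio bytes of an audio delta event, or null for
// any other event type.
function extractAudioDelta(rawMessage) {
  const event = JSON.parse(rawMessage);
  if (event.type !== "response.audio.delta") return null;
  return Buffer.from(event.delta, "base64");
}
```

Keeping the message construction in small pure functions like these makes the protocol layer easy to test without opening a real connection.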
The application includes two main audio processing functions:
- Upsampling (8kHz to 24kHz):
  - Converts client audio from 8kHz to 24kHz using linear interpolation
  - Required for OpenAI's API which expects 24kHz input
- Downsampling (24kHz to 8kHz):
  - Converts OpenAI's 24kHz output back to 8kHz
  - Ensures compatibility with client audio systems
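As an illustration of the two conversions, here is a minimal sketch that assumes mono, 16-bit little-endian PCM held in Node.js Buffers. The function names are hypothetical and the project's own implementation may differ. In particular, the downsampler below uses plain decimation (keeping every third sample) with no anti-aliasing filter, which is a simplification chosen for brevity.

```javascript
// Upsample 16-bit LE mono PCM from 8kHz to 24kHz (factor 3) by linear
// interpolation between each sample and the one that follows it.
function upsample8kTo24k(input) {
  const inSamples = Math.floor(input.length / 2);
  const output = Buffer.alloc(inSamples * 3 * 2);
  for (let i = 0; i < inSamples; i++) {
    const current = input.readInt16LE(i * 2);
    // The last sample has no successor, so it is simply repeated.
    const next = i + 1 < inSamples ? input.readInt16LE((i + 1) * 2) : current;
    for (let k = 0; k < 3; k++) {
      const value = Math.round(current + ((next - current) * k) / 3);
      output.writeInt16LE(value, (i * 3 + k) * 2);
    }
  }
  return output;
}

// Downsample 16-bit LE mono PCM from 24kHz to 8kHz by keeping every
// third sample (simple decimation, no low-pass filter).
function downsample24kTo8k(input) {
  const outSamples = Math.floor(input.length / 6);
  const output = Buffer.alloc(outSamples * 2);
  for (let i = 0; i < outSamples; i++) {
    output.writeInt16LE(input.readInt16LE(i * 6), i * 2);
  }
  return output;
}
```

With these definitions, downsampling the upsampled signal returns the original samples, because every third output sample of the interpolator is an untouched input sample.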
This endpoint accepts an audio stream and returns a streamed audio response generated by OpenAI.
Request:
- Content-Type: audio/x-raw
- Format: 16-bit PCM at 8kHz
- Method: POST
Response:
- Content-Type: text/event-stream
- Format: 16-bit PCM at 8kHz
- Streamed audio data in real-time
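A client call matching the request and response described above might look like the following sketch, which uses only Node's built-in http module. The function names and the example URL path are placeholders, since the endpoint path is not stated here. The response chunks are passed to the caller as received, so any decoding of the text/event-stream framing is left to the caller.

```javascript
const http = require("node:http");

// Build request options for a POST with Content-Type audio/x-raw.
// endpointUrl is a placeholder, e.g. "http://localhost:6030/<endpoint-path>".
function buildRequestOptions(endpointUrl) {
  const url = new URL(endpointUrl);
  return {
    hostname: url.hostname,
    port: url.port,
    path: url.pathname,
    method: "POST",
    headers: { "Content-Type": "audio/x-raw" },
  };
}

// Stream 16-bit PCM at 8kHz from a readable source to the server and hand
// each response chunk to onChunk as soon as it arrives.
function streamAudio(endpointUrl, pcmSource, onChunk) {
  return new Promise((resolve, reject) => {
    const req = http.request(buildRequestOptions(endpointUrl), (res) => {
      res.on("data", onChunk);
      res.on("end", resolve);
      res.on("error", reject);
    });
    req.on("error", reject);
    pcmSource.pipe(req); // ends the request when the source ends
  });
}
```

Here pcmSource can be any readable stream of raw audio, such as fs.createReadStream on a PCM file.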
You can customize the application behavior using the following environment variables:
- OPENAI_API_KEY: Your OpenAI API key (required)
- PORT: The port on which the server will listen (default: 6030)
- OPENAI_MODEL: The OpenAI model to use (default: gpt-4o-realtime-preview)
- OPENAI_INSTRUCTIONS: Custom instructions for the AI (optional)
- OPENAI_TEMPERATURE: Controls randomness in responses (0.0-1.0, default: 0.8)
- OPENAI_MAX_TOKENS: Controls the maximum length of the response (default: "inf")
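One way to read these variables with their documented defaults is sketched below. The loadConfig function and the shape of the object it returns are hypothetical, and the sketch assumes the variables are already present in process.env (for example, loaded from the .env file by a tool of your choice).

```javascript
// Hypothetical helper: read the documented variables and apply defaults.
function loadConfig(env = process.env) {
  if (!env.OPENAI_API_KEY) {
    throw new Error("OPENAI_API_KEY is required");
  }
  const maxTokens = env.OPENAI_MAX_TOKENS;
  return {
    apiKey: env.OPENAI_API_KEY,
    port: Number(env.PORT) || 6030,
    model: env.OPENAI_MODEL || "gpt-4o-realtime-preview",
    instructions: env.OPENAI_INSTRUCTIONS, // optional, may be undefined
    temperature: env.OPENAI_TEMPERATURE ? Number(env.OPENAI_TEMPERATURE) : 0.8,
    // "inf" means no explicit limit on response length.
    maxTokens: !maxTokens || maxTokens === "inf" ? "inf" : Number(maxTokens),
  };
}
```

Failing fast on a missing API key surfaces misconfiguration at startup rather than on the first request.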
The application includes comprehensive error handling for:
- WebSocket connection issues
- Audio processing errors
- OpenAI API errors
- Stream processing errors
All errors are logged to the console and appropriate error messages are returned to the client.
- GitHub: https://github.com/agentvoiceresponse - Report issues, contribute code.
- Discord: https://discord.gg/DFTU69Hg74 - Join the community discussion.
- Docker Hub: https://hub.docker.com/u/agentvoiceresponse - Find Docker images.
- NPM: https://www.npmjs.com/~agentvoiceresponse - Browse our packages.
- Wiki: https://wiki.agentvoiceresponse.com/en/home - Project documentation and guides.
AVR is free and open-source. If you find it valuable, consider supporting its development.
MIT License - see the LICENSE file for details.