Agent Voice Response - OpenAI Speech-to-Speech Integration

This repository showcases the integration between Agent Voice Response and OpenAI's Real-time Speech-to-Speech API. The application sends audio input from users to an OpenAI realtime model and returns context-aware responses as audio in real time.

Prerequisites

To set up and run this project, you will need:

  1. Node.js and npm installed
  2. An OpenAI API key with access to the real-time API
  3. WebSocket support in your environment

Setup

1. Clone the Repository

git clone https://github.com/agentvoiceresponse/avr-sts-openai.git
cd avr-sts-openai

2. Install Dependencies

npm install

3. Configure Environment Variables

Create a .env file in the root of the project to store your API key and configuration. Only OPENAI_API_KEY is required; the other variables are optional:

OPENAI_API_KEY=your_openai_api_key
PORT=6030
OPENAI_MODEL=gpt-4o-realtime-preview  # Optional, defaults to gpt-4o-realtime-preview
OPENAI_INSTRUCTIONS="You are a helpful assistant that can answer questions and help with tasks."  # Optional
OPENAI_TEMPERATURE=0.8  # Optional, controls randomness (0.0-1.0), defaults to 0.8
OPENAI_MAX_TOKENS=100  # Optional, controls response length, defaults to "inf"

Replace your_openai_api_key with your actual OpenAI API key.
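
As a rough illustration of how these variables and their documented defaults fit together, the sketch below reads them from an environment object. The loadConfig function and its property names are hypothetical and are not taken from the project's index.js.

```javascript
// Hypothetical helper: reads the variables listed above and applies the
// defaults documented in this README. Not the project's actual code.
function loadConfig(env = process.env) {
  if (!env.OPENAI_API_KEY) {
    throw new Error("OPENAI_API_KEY is required");
  }
  const maxTokens = env.OPENAI_MAX_TOKENS;
  return {
    apiKey: env.OPENAI_API_KEY,
    port: Number(env.PORT || 6030),
    model: env.OPENAI_MODEL || "gpt-4o-realtime-preview",
    instructions: env.OPENAI_INSTRUCTIONS, // undefined when unset
    temperature: Number(env.OPENAI_TEMPERATURE || 0.8),
    // "inf" (the documented default) means no explicit cap on response length.
    maxTokens: !maxTokens || maxTokens === "inf" ? "inf" : Number(maxTokens),
  };
}
```

Passing the environment as a parameter keeps the helper easy to exercise without touching the real process.env.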

4. Run the Application

Start the application by running the following command:

node index.js

The server listens on the port set in the PORT environment variable (default: 6030).

How It Works

The Agent Voice Response system integrates with OpenAI's Real-time Speech-to-Speech API to provide intelligent audio-based responses to user queries. The server receives audio input from users, forwards it to OpenAI's API over a WebSocket connection, and streams the model's audio response back in real time.

Key Components

  • Express.js Server: Handles incoming audio streams from clients
  • WebSocket Communication: Manages real-time communication with OpenAI's API
  • Audio Processing: Handles audio format conversion between 8kHz and 24kHz
  • Real-time Streaming: Processes and streams audio data in real-time
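
To make the WebSocket leg more concrete, the sketch below shows how audio might be wrapped into and unwrapped from JSON events. The event names input_audio_buffer.append and response.audio.delta are assumed from OpenAI's Realtime API (beta) documentation, and the helper names are hypothetical; the project's own code may be organized differently.

```javascript
// Sketch only: event names are assumed from OpenAI's Realtime API (beta)
// docs; the helper names are hypothetical and not taken from this project.

// Wrap a chunk of 24kHz 16-bit PCM as a client event for the WebSocket.
function encodeAppendEvent(pcmChunk) {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: Buffer.from(pcmChunk).toString("base64"),
  });
}

// Extract PCM bytes from a server message, or null for non-audio events.
function decodeAudioDelta(rawMessage) {
  const event = JSON.parse(rawMessage);
  if (event.type !== "response.audio.delta") return null;
  return Buffer.from(event.delta, "base64");
}
```

Returning null for other event types lets the caller skip non-audio messages without a separate type check.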

Audio Processing

The application includes two main audio processing functions:

  1. Upsampling (8kHz to 24kHz):

    • Converts client audio from 8kHz to 24kHz using linear interpolation
    • Required because OpenAI's API expects 24kHz input
  2. Downsampling (24kHz to 8kHz):

    • Converts OpenAI's 24kHz output back to 8kHz
    • Ensures compatibility with client audio systems
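
The two conversions can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names are hypothetical, the 3x factor follows from the 8kHz and 24kHz rates, and the averaging used for downsampling is an assumption, since the method is not stated here.

```javascript
// Upsample 16-bit PCM by an integer factor using linear interpolation.
function upsample(input, factor = 3) {
  const output = new Int16Array(input.length * factor);
  for (let i = 0; i < input.length; i++) {
    const current = input[i];
    // Hold the last sample flat because it has no successor to interpolate to.
    const next = i + 1 < input.length ? input[i + 1] : current;
    for (let k = 0; k < factor; k++) {
      output[i * factor + k] = Math.round(current + ((next - current) * k) / factor);
    }
  }
  return output;
}

// Downsample by averaging each group of `factor` samples.
// Averaging is an assumed choice; it acts as a crude low-pass filter.
function downsample(input, factor = 3) {
  const output = new Int16Array(Math.floor(input.length / factor));
  for (let i = 0; i < output.length; i++) {
    let sum = 0;
    for (let k = 0; k < factor; k++) sum += input[i * factor + k];
    output[i] = Math.round(sum / factor);
  }
  return output;
}
```

For example, upsample(Int16Array.of(0, 300)) yields the samples 0, 100, 200, 300, 300, 300, and downsampling that result gives 100, 300.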

API Endpoints

POST /speech-to-speech-stream

This endpoint accepts an audio stream and returns a streamed audio response generated by OpenAI.

Request:

  • Content-Type: audio/x-raw
  • Format: 16-bit PCM at 8kHz
  • Method: POST

Response:

  • Content-Type: text/event-stream
  • Format: 16-bit PCM at 8kHz
  • Streamed audio data in real-time
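
A client call against this endpoint might look like the sketch below. It assumes Node.js 18 or later (for the global fetch), a server running locally on the default port, and that the response body carries the raw PCM bytes described above. The speechToSpeech function is hypothetical, and it sends the whole recording in one request for simplicity, whereas a real caller may stream the audio.

```javascript
// Sketch of a client; assumes Node.js 18+ (global fetch) and a local server.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function speechToSpeech(pcm8k, options = {}) {
  const url = options.url || "http://localhost:6030/speech-to-speech-stream";
  const fetchImpl = options.fetchImpl || fetch;
  const response = await fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "audio/x-raw" },
    body: pcm8k, // 16-bit PCM at 8kHz
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const chunks = [];
  // In Node.js 18+, response.body is an async-iterable stream of byte chunks.
  for await (const chunk of response.body) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks); // 16-bit PCM at 8kHz, per the response format
}
```

Collecting the chunks into one buffer keeps the example short; a telephony client would more likely play or forward each chunk as it arrives.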

Customizing the Application

Environment Variables

You can customize the application behavior using the following environment variables:

  • OPENAI_API_KEY: Your OpenAI API key (required)
  • PORT: The port on which the server will listen (default: 6030)
  • OPENAI_MODEL: The OpenAI model to use (default: gpt-4o-realtime-preview)
  • OPENAI_INSTRUCTIONS: Custom instructions for the AI (optional)
  • OPENAI_TEMPERATURE: Controls randomness in responses (0.0-1.0, default: 0.8)
  • OPENAI_MAX_TOKENS: Controls the maximum length of the response (default: "inf")
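
One plausible way these settings reach the model is through a session.update event sent after the WebSocket opens. The sketch below is an assumption based on OpenAI's Realtime API (beta) documentation, where instructions, temperature, and max_response_output_tokens are session fields; buildSessionUpdate is a hypothetical name, and the project may configure the session differently.

```javascript
// Sketch: maps the environment variables above onto a `session.update` event.
// Field names are assumed from OpenAI's Realtime API (beta) documentation;
// the function itself is hypothetical.
function buildSessionUpdate(env = process.env) {
  const session = {
    input_audio_format: "pcm16",
    output_audio_format: "pcm16",
    temperature: Number(env.OPENAI_TEMPERATURE || 0.8),
    max_response_output_tokens:
      !env.OPENAI_MAX_TOKENS || env.OPENAI_MAX_TOKENS === "inf"
        ? "inf"
        : Number(env.OPENAI_MAX_TOKENS),
  };
  // Only send instructions when the variable is set.
  if (env.OPENAI_INSTRUCTIONS) session.instructions = env.OPENAI_INSTRUCTIONS;
  return { type: "session.update", session };
}
```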

Error Handling

The application handles errors from the following sources:

  • WebSocket connection issues
  • Audio processing errors
  • OpenAI API errors
  • Stream processing errors

All errors are logged to the console, and an error message is returned to the client.
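
The log-then-respond pattern can be sketched as below. The reportError helper is hypothetical and only illustrates the idea; it relies on the headersSent, status, json, and end members of an Express response, and the 500 status code is an assumption.

```javascript
// Hypothetical helper illustrating "log the error, then tell the client".
// `res` is an Express response (or any object with the same members).
function reportError(err, res, logger = console) {
  logger.error("Stream error:", err.message);
  if (res.headersSent) {
    // Streaming has started, so the status can no longer change; end the stream.
    res.end();
    return;
  }
  res.status(500).json({ message: err.message });
}
```

Checking headersSent matters for a streaming endpoint, because an error can occur after part of the audio response has already been written.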

Support & Community

Support AVR

AVR is free and open-source. If you find it valuable, consider supporting its development:

Support us on Ko-fi

License

MIT License - see the LICENSE file for details.
