Phoneme Explorer

An exploration of phoneme generation using OpenAI's structured outputs and Text-to-Speech (TTS) features. This tool tries to break words down into their constituent phonemes and generate audio pronunciations for each sound.

The results were... delightfully mediocre! But interesting.

What It Does

  1. Phoneme Analysis: Uses OpenAI's GPT-4o with structured outputs to analyze any word and break it down into phonemes
  2. Audio Generation: Creates individual MP3 files for each phoneme using OpenAI's TTS API
  3. Interactive Visualization: Generates HTML pages where you can listen to individual phonemes or play them sequentially

Prerequisites

  • Python 3.7+
  • An OpenAI API key
  • Required Python packages:
    pip install openai pydantic

Setup

  1. Set your OpenAI API key as an environment variable:

    export OPENAI_API_KEY="your-api-key-here"
  2. Clone or download this repository

  3. Install the previously mentioned packages (openai and pydantic)

Usage

Run the script with any word you want to analyze:

python pronounce.py hello

This will:

  • Analyze the word "hello" and identify its phonemes in the structured output: h, ə, l, oʊ
  • Generate MP3 files for each phoneme in the sounds/ directory
  • Create an HTML file at words/hello.html with an interactive interface

Example Output

For the word "hello", the script should generate:

  • sounds/h.mp3 - pronunciation of the "h" sound
  • sounds/ə.mp3 - pronunciation of the schwa sound
  • sounds/l.mp3 - pronunciation of the "l" sound
  • sounds/oʊ.mp3 - pronunciation of the "oʊ" diphthong
  • words/hello.html - interactive webpage

Viewing Results

Open the generated HTML file in your browser:

open words/hello.html

Note: If you're loading the files locally, your browser may or may not care about the security implications of loading audio files via a relative path. You can probably get around the issue by using Chrome or by running a local web server to serve the resulting files.
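If you go the web-server route, Python's built-in server is a simple option (this is the standard library, not part of this project); run it from the project root:

python -m http.server 8000

Then open http://localhost:8000/words/hello.html so the relative paths to the sounds/ files resolve over HTTP.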

The interface allows you to:

  • See the word and its phoneme breakdown
  • Play individual phonemes by clicking their "Play" buttons
  • Play all phonemes in sequence with the "Play All Phonemes" button

Project Structure

phonemes/
├── pronounce.py        # Main script
├── sounds/            # Generated phoneme audio files
│   ├── h.mp3
│   ├── ə.mp3
│   └── ...
└── words/             # Generated HTML visualization files  
    ├── hello.html
    ├── world.html
    └── ...

How It Works

  1. Structured Output: Uses OpenAI's responses.parse() with a Pydantic model to ensure the API returns phonemes in a structured format
  2. Phoneme Generation: Leverages GPT-4o's linguistic knowledge to break words into International Phonetic Alphabet (IPA) symbols
  3. Audio Synthesis: Uses OpenAI's TTS model (gpt-4o-mini-tts) with the "coral" voice to pronounce individual phonemes
  4. Caching: Reuses existing audio files and HTML pages to avoid redundant API calls
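For step 1, here is a minimal sketch of what the responses.parse() call might look like; the Pydantic field names and the prompt wording are assumptions, not necessarily what pronounce.py actually uses:

from openai import OpenAI
from pydantic import BaseModel

class PhonemeBreakdown(BaseModel):
    word: str
    phonemes: list[str]  # IPA symbols, e.g. ["h", "ə", "l", "oʊ"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.parse(
    model="gpt-4o-2024-08-06",
    input=[
        {"role": "system", "content": "Break the given word into its IPA phonemes."},
        {"role": "user", "content": "hello"},
    ],
    text_format=PhonemeBreakdown,
)

print(response.output_parsed.phonemes)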

Technical Details

  • Model: GPT-4o (gpt-4o-2024-08-06) for phoneme analysis
  • TTS Model: gpt-4o-mini-tts with "coral" voice
  • Output Format: MP3 audio files, HTML with embedded JavaScript
  • Phoneme Notation: International Phonetic Alphabet (IPA) symbols
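As a rough sketch of the TTS step with these settings (the exact code in pronounce.py may differ; the phoneme and output path below are just the ones from the example above):

from openai import OpenAI

client = OpenAI()

# One request per phoneme; the MP3 is written where the generated HTML expects it.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="ə",  # a single IPA symbol
    response_format="mp3",
)
speech.write_to_file("sounds/ə.mp3")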

Examples

Try these interesting words:

python pronounce.py hello
python pronounce.py world
python pronounce.py george
python pronounce.py continental
python pronounce.py mediocrity

Each will generate its own set of phoneme audio files and an interactive HTML page. The last word is an intentional hint at the quality of results you should expect (lol).

I am not an expert in phonetics; I set up this project on a lark to explore the concept, so I can't speak to the accuracy of the phonemes in the structured outputs. They appear generally accurate in my cursory testing. The audio output, however, has a decidedly larger gap to bridge. That makes sense to me, given that the TTS models are probably optimized to produce entire words rather than parts of words.

Hopefully this project is still interesting!
