The Enhanced Hotword Detection System is an innovative, machine learning-based solution for recognizing specific trigger words or phrases in continuous audio streams. Developed by a team of ambitious first-semester B.Tech CSE students specializing in AI and ML, this project showcases the potential of emerging talent in the field of artificial intelligence and speech recognition.
The system aims to provide high accuracy, low latency, and robust performance across varied acoustic environments. Target applications include voice assistants, automotive systems, industrial control, accessibility tools, smart home devices, security systems, educational technology, gaming, and telehealth.
- Advanced neural network architectures including CNN, RNN, CRNN, and Transformer for accurate hotword detection
- Robust audio preprocessing and feature extraction pipeline
- Real-time processing capabilities with low latency
- Adaptive noise cancellation and speaker normalization
- Advanced data augmentation techniques
- Support for continuous learning and model updates
- Customizable for multiple hotwords and languages
- Comprehensive API for easy integration into existing systems
- Extensive performance metrics and benchmarking tools
- Cross-platform compatibility (Windows, macOS, Linux)
- GPU optimization and mixed precision training
- Efficient data loading using TensorFlow's data pipeline
- Transfer learning capabilities using pre-trained models
- False positive reduction techniques
The Enhanced Hotword Detection System consists of several key components:
- Audio Preprocessing Module: Handles input audio streams, applying noise reduction, speaker normalization, and segmentation.
- Feature Extraction Engine: Extracts relevant acoustic features including MFCCs and Mel spectrograms.
- Neural Network Model: Multiple architecture options including CNN, RNN, CRNN, and Transformer, optimized for hotword detection.
- Post-processing Module: Applies decision thresholding and smoothing to raw model outputs.
- Continuous Learning System: Enables model updates with new data to improve performance over time.
- API Layer: Provides interfaces for easy integration with other software systems.
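As an illustration of the post-processing step, here is a minimal sketch of score smoothing followed by thresholding. The function name `detect_hotword`, its parameters, and the cooldown behaviour are assumptions made for this example and are not taken from the project's code.

```python
from collections import deque


def detect_hotword(scores, threshold=0.5, window=5, cooldown=10):
    """Return frame indices where the smoothed score reaches the threshold.

    scores: per-frame hotword probabilities in [0, 1] from the model.
    window: number of recent frames averaged together (moving average).
    cooldown: frames ignored after a detection, so that one utterance
        does not trigger several times (hypothetical parameter).
    """
    recent = deque(maxlen=window)
    detections = []
    frames_to_skip = 0
    for i, score in enumerate(scores):
        recent.append(score)
        if frames_to_skip > 0:
            frames_to_skip -= 1
            continue
        smoothed = sum(recent) / len(recent)
        if smoothed >= threshold:
            detections.append(i)
            frames_to_skip = cooldown
            recent.clear()  # start the next average from fresh frames
    return detections
```

Averaging over a short window means a single noisy frame with a high score is not enough to fire a detection, while a sustained run of high scores is.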
This project is the result of collaborative efforts by a team of first-semester B.Tech CSE students specializing in AI and ML. The team structure includes:
- 1 Head Coder: Responsible for overall architecture and core algorithm development
- 5-6 Major Team Members: Focused on various aspects such as:
  - Data collection and preprocessing
  - Model training and optimization
  - Performance evaluation and benchmarking
  - API development and integration
  - Documentation and project management
As this is an ongoing project by students in the early stages of their academic journey, we acknowledge that the system is not perfect and is still evolving. We welcome feedback, suggestions, and contributions from the community to help improve it.
While our system is still under development, we are continuously working to improve its performance. Current metrics:
- False Acceptance Rate (FAR): < 0.5%
- False Rejection Rate (FRR): < 3%
- Response Time: < 500ms
Please note that these metrics are subject to change as we refine our algorithms and expand our training data. Detailed benchmarking results and comparisons with other systems are available in the docs/benchmarks.md file.
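For reference, FAR and FRR can be computed from labelled evaluation clips as in the sketch below. It assumes a per-clip definition (false accepts divided by non-hotword clips, false rejects divided by hotword clips); the project's benchmarks may define the rates differently, for example as false accepts per hour of audio. The function name `far_frr` is hypothetical.

```python
def far_frr(labels, predictions):
    """Compute (FAR, FRR) from binary ground truth and binary decisions.

    labels, predictions: sequences of 0 (non-hotword) and 1 (hotword).
    FAR = false accepts / number of non-hotword samples
    FRR = false rejects / number of hotword samples
    """
    pairs = list(zip(labels, predictions))
    false_accepts = sum(1 for y, p in pairs if y == 0 and p == 1)
    false_rejects = sum(1 for y, p in pairs if y == 1 and p == 0)
    negatives = sum(1 for y, _ in pairs if y == 0)
    positives = sum(1 for y, _ in pairs if y == 1)
    # Guard against an evaluation set that lacks one of the classes.
    far = false_accepts / negatives if negatives else 0.0
    frr = false_rejects / positives if positives else 0.0
    return far, frr
```

The two rates trade off against each other: raising the decision threshold lowers FAR and raises FRR, so both should be reported at the same threshold.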
- Python 3.10.1
- TensorFlow 2.18.0
- CUDA-compatible GPU (recommended for training and high-performance inference)
- Additional libraries: librosa, numpy, scipy, soundfile, pyaudio, tqdm, scikit-learn, matplotlib
- Clone the repository.
- Install the required dependencies: pip install -r requirements.txt
- Prepare your dataset:
  - Place hotword audio samples in data/hotword/
  - Place non-hotword audio samples in data/non_hotword/
- Run the main script: python main.py
- src/
  - audio_utils.py: Contains utility functions for audio processing and feature extraction
  - data_collection.py: Handles data collection and preprocessing
  - features.py: Handles feature extraction from audio
  - model.py: Defines the base neural network architecture
  - train.py: Main script for training the hotword detection model
  - evaluate.py: Script for evaluating model performance
- data/
  - hotword/: Directory for hotword audio samples
  - non_hotword/: Directory for non-hotword audio samples
- models/: Directory for saving trained models and checkpoints
- docs/: Project documentation
- config.py: Configuration file for system parameters
- main.py: Main execution script
The system is designed to utilize available GPUs for faster training and inference. It includes:
- Automatic GPU detection and configuration
- Mixed precision training for improved performance
- Memory growth settings for NVIDIA GPUs
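The GPU setup described above might look roughly like the following sketch, which uses standard TensorFlow 2.x calls. The function name `configure_gpus` is an assumption and the project's own setup code may differ. TensorFlow is imported inside the function only so that the snippet can be loaded on a machine without it.

```python
def configure_gpus(use_mixed_precision=True):
    """Enable memory growth on visible GPUs and optionally mixed precision.

    Returns the number of GPUs TensorFlow can see.
    """
    import tensorflow as tf  # imported lazily; see the note above

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Allocate GPU memory on demand instead of reserving it all up front.
        # This must run before the GPU is first used.
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus and use_mixed_precision:
        # Compute in float16 while keeping variables in float32.
        tf.keras.mixed_precision.set_global_policy("mixed_float16")
    return len(gpus)
```

When the mixed_float16 policy is active, it is common to keep the model's final output layer in float32 for numerical stability.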
The system employs various data augmentation techniques to improve model robustness:
- Pitch shifting
- Time stretching
- Volume variation
- Room reverberation simulation
- Background noise injection
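Two of these augmentations, volume variation and background noise injection, can be sketched with NumPy alone. The function names, the decibel ranges, and the assumption that waveforms are float arrays are illustrative choices, not the project's actual implementation. Pitch shifting and time stretching usually rely on an audio library and are not shown.

```python
import numpy as np


def random_gain(audio, rng, min_db=-6.0, max_db=6.0):
    """Scale the waveform by a random gain drawn uniformly in decibels."""
    gain_db = rng.uniform(min_db, max_db)
    return audio * 10.0 ** (gain_db / 20.0)


def add_noise(audio, noise, snr_db):
    """Mix noise into audio so the result has the requested SNR in dB."""
    # Tile or trim the noise to the length of the speech clip.
    reps = int(np.ceil(len(audio) / len(noise)))
    noise = np.tile(noise, reps)[: len(audio)]
    signal_power = np.mean(audio ** 2)
    noise_power = np.mean(noise ** 2)
    if signal_power == 0 or noise_power == 0:
        return audio.copy()
    # Scale so that signal_power / scaled_noise_power == 10^(snr_db / 10).
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return audio + scale * noise
```

Passing an explicit `np.random.default_rng(seed)` generator keeps augmentation reproducible across training runs.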
The system supports incremental learning, allowing the model to adapt to new data over time:
- Checkpointing of model and optimizer states
- Resumable training sessions
- Daily training time limits to prevent overtraining
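One way to combine resumable sessions with a daily time limit is to record the training time used per calendar day in a small state file, as in the sketch below. The class `DailyTrainingBudget`, the JSON state file, and the `run_epoch` callback are all hypothetical; the callback stands in for whatever trains one epoch and writes a checkpoint.

```python
import json
import time
from datetime import date
from pathlib import Path


class DailyTrainingBudget:
    """Track training seconds used per calendar day in a JSON file."""

    def __init__(self, state_path, limit_seconds):
        self.state_path = Path(state_path)
        self.limit_seconds = limit_seconds

    def _load(self):
        today = date.today().isoformat()
        if self.state_path.exists():
            state = json.loads(self.state_path.read_text())
            if state.get("day") == today:
                return state
        # New day or no state yet: start from zero.
        return {"day": today, "used_seconds": 0.0}

    def remaining(self):
        return max(0.0, self.limit_seconds - self._load()["used_seconds"])

    def record(self, seconds):
        state = self._load()
        state["used_seconds"] += seconds
        self.state_path.write_text(json.dumps(state))


def train_with_budget(budget, run_epoch, max_epochs):
    """Run epochs until the daily budget or max_epochs is exhausted.

    Returns the number of epochs completed in this session.
    """
    completed = 0
    for epoch in range(max_epochs):
        if budget.remaining() <= 0:
            break
        start = time.monotonic()
        run_epoch(epoch)  # expected to train one epoch and save a checkpoint
        budget.record(time.monotonic() - start)
        completed += 1
    return completed
```

Because the budget is checked between epochs, a session can overrun the limit by at most one epoch; the next session that day will then stop immediately.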
- Efficient data loading using TensorFlow's data pipeline
- Multi-core CPU optimization
- Early stopping and learning rate reduction on plateau
- Class weight balancing for imbalanced datasets
- Quantization-aware training for latency optimization
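Of the techniques listed above, class weight balancing is simple enough to sketch in a few lines. The version below uses the common "balanced" heuristic, n_samples / (n_classes * class_count), which is also what scikit-learn's `compute_class_weight` uses in its "balanced" mode. Whether the project uses exactly this formula is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

With 100 hotword clips and 900 non-hotword clips, the hotword class receives weight 5.0 and the non-hotword class roughly 0.56. A dictionary of this shape can be passed to Keras through the `class_weight` argument of `Model.fit`.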
Custom loss functions and training techniques are employed to minimize false activations, crucial for hotword detection systems.
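The README does not specify which custom loss is used. As one plausible example, the sketch below shows a binary cross-entropy in which the non-hotword term is scaled by a factor `fp_weight`, so that confident false activations are penalized more heavily than misses. The function name and the default weight are assumptions; it is written in NumPy to show the arithmetic rather than as a drop-in training loss.

```python
import numpy as np


def fp_weighted_bce(y_true, y_prob, fp_weight=3.0, eps=1e-7):
    """Binary cross-entropy with the negative-class term scaled by fp_weight.

    y_true: 0/1 labels (1 = hotword). y_prob: predicted hotword probability.
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0).
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    positive_term = y_true * np.log(y_prob)
    negative_term = fp_weight * (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(positive_term + negative_term))
```

With `fp_weight=1.0` this reduces to the standard binary cross-entropy. Larger values push the model toward fewer false activations at the cost of more false rejections.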
To train for a specific hotword (e.g., "kurma"):
- Add multiple recordings of the hotword "kurma" to the data/hotword/ directory.
- Add various non-hotword audio samples to the data/non_hotword/ directory.
- Adjust MODEL_TYPE in config.py if you want to experiment with different model architectures.
- Run the main.py script to start the training process.
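Before starting a training run, it can help to confirm that both dataset directories actually contain audio. The following is a small pre-flight check written for this README; it is not part of the project, and the helper names and the list of accepted file extensions are assumptions.

```python
from pathlib import Path

# Assumed set of audio formats; adjust to match the files you record.
AUDIO_EXTENSIONS = {".wav", ".flac", ".ogg"}


def count_audio_files(directory):
    """Count files with a known audio extension under directory (recursive)."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )


def check_dataset(data_root="data", min_per_class=1):
    """Return per-class counts; raise ValueError if a class has too few files."""
    counts = {
        "hotword": count_audio_files(Path(data_root) / "hotword"),
        "non_hotword": count_audio_files(Path(data_root) / "non_hotword"),
    }
    for name, count in counts.items():
        if count < min_per_class:
            raise ValueError(
                f"{name}: found {count} audio files, need at least {min_per_class}"
            )
    return counts
```

Calling `check_dataset("data", min_per_class=50)` from the repository root, for example, would fail early with a clear message instead of partway through training.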
- Implement more sophisticated room simulation techniques for data augmentation
- Explore TTS-based data generation for expanding the dataset
- Implement additional advanced model architectures
- Further optimize the training process with techniques like curriculum learning
- Develop a user-friendly interface for model customization and deployment
- Implement a streaming inference pipeline for real-time detection
[Insert your chosen license here]
- TensorFlow team for their excellent deep learning framework
- The open-source community for various audio processing libraries
We welcome contributions and suggestions to help improve this project!