Master Thesis Project by Dilbar Isakova
Erasmus Mundus Joint Master's Degree in Big Data Management and Analytics
LISN & Inria AVIZ
A Multi-Modal Machine Learning Approach for Spatial Audio-Visual Analytics in Meeting Room Environments
- Project Overview
- Research Questions
- System Architecture
- Prerequisites
- Hardware Requirements
- Software Installation
- Project Structure
- Getting Started
- Running the System
- Machine Learning Pipeline
- Visualization Modes
- Data Collection
- Troubleshooting
- Future Work
- Acknowledgments
This project explores ambient data visualization on non-planar (spherical) displays for meeting room environmental awareness. The system combines:
- IoT Hardware: ESP32 microcontroller with dual-microphone spatial audio array
- Signal Processing: Real-time FFT analysis and spatial audio processing
- Machine Learning: Random Forest and Gradient Boosting models for meeting analytics
- 3D Visualization: Three.js with WebGL rendering for spherical ambient displays
- Real-time Communication: WebSocket protocol for low-latency data streaming
The system provides ambient environmental awareness through spatial audio visualization, automatically classifying meeting characteristics including speaker count, meeting type, energy level, and engagement scores.
- Real-time spatial audio capture and processing (5 kHz sampling rate)
- Multiple visualization modes (Audio 3D, Waves, Stereo Chart, Activity Chart)
- Machine learning classification of meeting characteristics
- Web-based 3D visualization with parametric display positioning
- Dual-mode operation: live ESP32 data or pre-recorded datasets
This thesis addresses three primary research questions:
RQ1: Which display form factors best suit meeting room environments?
RQ2: How can non-planar displays effectively support ambient environmental awareness in collaborative meeting environments?
RQ3: How can machine learning classification enhance the meaningfulness of real-time meeting analytics for ambient display applications?
┌─────────────────────────────────────────────────────────────┐
│ Meeting Room Environment │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ MAX4466 │ │ MAX4466 │ │
│ │ Microphone │◄───────┤ Microphone │ │
│ │ (Left) │ │ (Right) │ │
│ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │
│ │ Analog Audio │ │
│ └───────┬───────────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ ESP32 Board │ │
│ │ - ADC (12-bit)│ │
│ │ - FFT (512pt) │ │
│ │ - WiFi Module │ │
│ └───────┬────────┘ │
│ │ │
│ │ WebSocket (JSON) │
│ │ Port 81 │
└─────────────────┼───────────────────────────────────────────┘
│
│ WiFi Network
│
┌─────────────────▼─────────────────────────────────────────────┐
│ Web Browser Visualization │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Three.js Rendering Engine │ │
│ │ - WebGL Shaders │ │
│ │ - Spherical Geometry (128×128 subdivisions │ │
│ │ - Real-time Audio Visualization │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Machine Learning Integration │ │
│ │ - Speaker Count Classifier │ │
│ │ - Meeting Type Classifier │ │
│ │ - Energy Level Classifier │ │
│ │ - Engagement Score Regressor │ │
│ └──────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
- Audio Capture: Dual MAX4466 microphones → ESP32 ADC (5 kHz, 12-bit)
- Signal Processing: Windowed FFT (512 samples, Hamming window, 300-3400 Hz voice filtering)
- Feature Extraction: Energy levels, stereo difference, spatial positioning
- WebSocket Transmission: JSON-formatted real-time data (~0.5 s updates; see the client sketch after this list)
- Visualization Update: Three.js shader uniforms, real-time rendering
- ML Classification: Optional meeting analytics (speaker count, type, energy, engagement)
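As a quick sanity check of the WebSocket stream, the snippet below reads a single JSON message from the ESP32 using the websocket-client package listed in the Python installation step. The field names (left_level, right_level, stereo_diff) are illustrative placeholders, not the firmware's exact schema; inspect a real message to see the actual keys.

# Minimal sketch: read one JSON message from the ESP32 WebSocket stream.
# Field names below are assumptions for illustration only.
import json
import websocket  # from the websocket-client package

ws = websocket.create_connection("ws://192.168.0.110:81", timeout=5)
message = json.loads(ws.recv())
print("Left level:", message.get("left_level"))
print("Right level:", message.get("right_level"))
print("Stereo difference:", message.get("stereo_diff"))
ws.close()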
- Operating System: Windows, macOS, or Linux
- Arduino IDE: Version 1.8.19 or later (for ESP32 programming)
- Python: Version 3.8 or later (for machine learning pipeline)
- Web Browser: Modern browser with WebGL support (Chrome, Firefox, Edge recommended)
- Git: For cloning the repository
- Basic understanding of embedded systems
- Familiarity with JavaScript and web development
- Python programming for data science
- Understanding of audio signal processing concepts (helpful but not required)
Component | Specification | Quantity | Purpose |
---|---|---|---|
ESP32 Development Board | Dual-core, WiFi-enabled | 1 | Microcontroller for audio processing |
MAX4466 Microphone Amplifier | Electret microphone with adjustable gain | 2 | Spatial audio capture |
Breadboard | Standard size | 1 | Circuit prototyping |
Jumper Wires | Male-to-male, male-to-female | ~10 | Connections |
USB Cable | Micro-USB or USB-C (depending on ESP32) | 1 | Power and programming |
Power Supply | 5V, 500mA minimum | 1 | Optional: for standalone operation |
ESP32 Pin Connections:
┌─────────────────┐
│ ESP32 │
│ │
│ GPIO 35 ◄──────┼──── Left Microphone (MAX4466 OUT)
│ GPIO 34 ◄──────┼──── Right Microphone (MAX4466 OUT)
│ 3.3V ──────────┼──── Both MAX4466 VCC
│ GND ──────────┼──── Both MAX4466 GND
│ │
└─────────────────┘
- Positioning: Place microphones 5-10 cm apart for optimal stereo separation
- Gain Adjustment: Use onboard potentiometers to match sensitivity between channels
- Spatial Testing: Verify left-right discrimination with test audio sources
git clone https://github.com/isakovaad/sphereDisplay.git
cd sphereDisplay
Download and install from: https://www.arduino.cc/en/software
- Open Arduino IDE
- Go to File → Preferences
- Add to Additional Board Manager URLs:
https://raw.githubusercontent.com/espressif/arduino-esp32/gh-pages/package_esp32_index.json
- Go to Tools → Board → Boards Manager
- Search for "ESP32" and install "esp32 by Espressif Systems"
Go to Sketch → Include Library → Manage Libraries and install:
- ArduinoFFT (v1.5.6 or later) - For FFT audio analysis
- WebSocketsServer (v2.3.6 or later) - For WebSocket communication
- WiFi (included with ESP32 board package)
- ArduinoJson (v6.19.4 or later) - For JSON serialization
- Open sketch_jun13a/sketch_jun13a.ino in Arduino IDE
- Locate the WiFi configuration section:
const char* ssid = "YOUR_WIFI_SSID";
const char* password = "YOUR_WIFI_PASSWORD";
- Replace with your WiFi network credentials
- Note: Ensure ESP32 and your computer are on the same network
- Connect ESP32 via USB
- Select Tools → Board → ESP32 Dev Module (or your specific board)
- Select Tools → Port → [Your ESP32 COM port]
- Click Upload button
- Wait for compilation and upload to complete
- Open Tools → Serial Monitor (115200 baud) to view ESP32's IP address
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
pip install numpy pandas scikit-learn matplotlib seaborn
pip install librosa soundfile audiomentations
pip install jupyter notebook
pip install websocket-client
pip install xgboost # For Gradient Boosting
pip install scipy # For signal processing
The web visualization requires no additional installation; simply open index.html in a modern web browser.
Required Files:
- index.html - Main HTML page
- main.js - Three.js visualization logic
- init.js - WebSocket and initialization
- style.css - Styling
- ml-styles.css - Machine learning UI styling
- ml-integration.js - ML model integration (if available)
sphereDisplay/
│
├── sketch_jun13a/ # Arduino/ESP32 Code
│ └── sketch_jun13a.ino # Main ESP32 firmware
│
├── trained_models/ # Machine Learning Models
│ ├── speaker_count_model.pkl # Speaker count classifier
│ ├── meeting_type_model.pkl # Meeting type classifier
│ ├── energy_level_model.pkl # Energy level classifier
│ └── engagement_model.pkl # Engagement score regressor
│
├── data_analysis.ipynb # Jupyter notebook for data exploration
├── data_augmenter.ipynb # Audio data augmentation pipeline
├── data_collector.ipynb # Real-time data collection tool
│
├── index.html # Main web visualization page
├── main.js # Three.js rendering and visualization
├── init.js # WebSocket and initialization logic
├── style.css # Main stylesheet
├── ml-styles.css # ML interface styling
├── ml-integration.js # ML model integration (deployment)
├── ml_model_server.py # Python ML inference server (future)
│
├── .DS_Store # macOS metadata (ignore)
└── ml_server.log # ML server logs (if deployed)
NOTE: The ML model deployment is not yet integrated. For details, refer to the Machine Learning Pipeline section below.
- sketch_jun13a.ino: ESP32 firmware implementing:
- Dual-microphone audio capture (5 kHz sampling)
- FFT analysis with Hamming windowing
- Voice frequency filtering (300-3400 Hz)
- WebSocket server for real-time data streaming
- JSON message formatting
- data_collector.ipynb: Interactive tool for collecting labeled audio data from ESP32
- Real-time WebSocket connection to ESP32
- Live labeling interface for speaker count, meeting type, energy level
- Automatic session management and CSV export
- data_augmenter.ipynb: Audio data augmentation pipeline
- Time stretching (0.8x-1.2x)
- Pitch shifting (±1-2 semitones)
- Stereo positioning (8 variants)
- Background noise mixing
- Meeting flow simulation
- data_analysis.ipynb: Exploratory data analysis and model training
- Feature engineering (36 audio features)
- Model training (Random Forest, Gradient Boosting)
- Hyperparameter optimization via GridSearchCV
- Performance evaluation and visualization
- trained_models/: Serialized scikit-learn models (.pkl files)
- index.html: Main application interface with:
- WebSocket connection controls
- Visualization mode switcher
- Display height adjustment
- Meeting simulation controls
- main.js: Core Three.js visualization
- Spherical geometry rendering (128×128 subdivisions)
- Custom WebGL shaders for audio visualization
- Real-time uniform updates
- Multiple visualization modes
- init.js: WebSocket client implementation
- Connection management
- JSON message parsing
- Real-time data forwarding to visualization
- Connect microphones to ESP32 as per wiring diagram
- Adjust microphone gain potentiometers to medium position
- Connect ESP32 to computer via USB
- Open sketch_jun13a/sketch_jun13a.ino in Arduino IDE
- Configure WiFi credentials
- Upload to ESP32
- Note the IP address displayed in Serial Monitor (e.g., 192.168.0.110)
- Open index.html in a web browser
- Enter the ESP32 IP address in the connection field (e.g., 192.168.0.110)
- Click "Connect to ESP32"
- Verify "Connected" status and live audio data display
- Make sounds near the microphones
- Observe real-time visualization changes
- Test left/right spatial audio by making sounds on different sides
- Switch between visualization modes to explore different representations
Best for: Real-time meeting monitoring and system testing
- Ensure ESP32 is powered and connected to WiFi
# Check ESP32 serial output
# Should show: "WiFi connected" and "IP address: X.X.X.X"
- Open the web visualization
- Launch index.html in a browser
- Enter the ESP32 IP address
- Click "Connect to ESP32"
- Verify the real-time data stream
- Audio data display should update ~2 times per second
- Left/Right mic levels should respond to sounds
- Stereo difference should reflect speaker position
- Explore visualization modes
- Audio 3D: Directional color mapping (orange = left, green = right)
- Waves: Flowing audio waveform visualization
- Stereo Chart: Left-right channel comparison over time
- Activity Chart: Average audio level histogram
Best for: Demonstrations, analysis, and consistent testing
- Load sample dataset (if available)
- Click "Load Real Data" button in web interface
- System will replay pre-recorded audio session
- Useful for presentations and debugging
- Analyze recorded sessions
- Open data_analysis.ipynb in Jupyter
- Load CSV files from data collection
- Visualize audio patterns and meeting characteristics
Note: ML model deployment is marked as future work in the thesis. Current implementation includes trained models but not real-time inference.
To use trained models for analysis:
# In Jupyter notebook
import pickle
import pandas as pd
# Load trained models
speaker_model = pickle.load(open('trained_models/speaker_count_model.pkl', 'rb'))
meeting_model = pickle.load(open('trained_models/meeting_type_model.pkl', 'rb'))
energy_model = pickle.load(open('trained_models/energy_level_model.pkl', 'rb'))
engagement_model = pickle.load(open('trained_models/engagement_model.pkl', 'rb'))
# Load and prepare your audio features
features = pd.read_csv('your_features.csv')
# Make predictions
speaker_count = speaker_model.predict(features)
meeting_type = meeting_model.predict(features)
energy_level = energy_model.predict(features)
engagement_score = engagement_model.predict(features)
The ML pipeline classifies meeting characteristics from dual-microphone audio features:
- Speaker Count: 1, 2, or 3+ speakers
- Meeting Type: Discussion, Presentation, Brainstorm, Argument
- Energy Level: Low, Medium, High
- Engagement Score: Continuous 0-100 scale (synthetic)
Metric | Value |
---|---|
Original Recordings | 16 sessions |
Total Duration | ~91 minutes |
Augmented Samples | 960 |
Features per Sample | 36 |
Sampling Rate | 5 kHz |
FFT Window Size | 512 samples |
36 Audio Features grouped into categories (a simplified extraction sketch follows the list):
- Volume variance, mean, median
- Volume trend (slope)
- Speaker change rate
- Activity duration statistics
- Stereo difference (mean, std, range)
- Stereo switches (position changes)
- Left/Right dominance patterns
- Spectral centroid, bandwidth, rolloff
- Zero-crossing rate
- MFCC coefficients (3)
- Energy distribution
- Peak density
- Silence ratio
- High-activity ratio
- Engagement complexity
- Dynamic range
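To make the feature categories above concrete, here is a minimal sketch of how a few of them could be computed with numpy and librosa. It is not the notebook's exact feature code; the function name, windowing, and aggregation choices are simplified assumptions.

import numpy as np
import librosa

def sketch_features(left, right, sr=5000):
    # left/right: 1-D numpy float arrays holding a few seconds of audio per microphone
    eps = 1e-9
    stereo_diff = 20 * np.log10((np.abs(left) + eps) / (np.abs(right) + eps))
    mono = (left + right) / 2.0
    return {
        "volume_mean": float(np.mean(np.abs(mono))),
        "volume_variance": float(np.var(np.abs(mono))),
        "stereo_diff_mean": float(np.mean(stereo_diff)),
        "stereo_diff_std": float(np.std(stereo_diff)),
        "spectral_centroid": float(np.mean(librosa.feature.spectral_centroid(y=mono, sr=sr))),
        "zero_crossing_rate": float(np.mean(librosa.feature.zero_crossing_rate(mono))),
    }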
Use data_collector.ipynb to gather labeled training data:
# Start Jupyter
jupyter notebook data_collector.ipynb
# In notebook:
# 1. Enter ESP32 IP address
# 2. Click 'Start Recording'
# 3. Label data in real-time:
# - Speaker count (1, 2, 3+)
# - Meeting type (discussion, presentation, etc.)
# - Energy level (low, medium, high)
# - Background noise level
# 4. Save session data to CSV
Run data_augmenter.ipynb to expand the dataset (a minimal augmentation sketch follows the summary below):
# Augmentation techniques applied:
# - Time stretching: 0.8x, 0.9x, 1.1x, 1.2x (4 variants)
# - Pitch shifting: ±1, ±2 semitones (4 variants)
# - Stereo positioning: 8 panning variants
# - Background noise: 6 ambient noise types
# - Meeting flow simulation: 4 conversation patterns
#
# Total: ~60x dataset expansion
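For illustration, a small subset of these transformations can be expressed with the audiomentations package installed earlier. The parameters below are assumptions for demonstration and do not reproduce the notebook's full ~60x pipeline; stereo panning and meeting-flow simulation are omitted, and Gaussian noise stands in for the ambient noise mixing.

import numpy as np
from audiomentations import Compose, TimeStretch, PitchShift, AddGaussianNoise

# Time stretching and pitch shifting as described above, plus simple noise injection.
augment = Compose([
    TimeStretch(min_rate=0.8, max_rate=1.2, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.01, p=0.3),
])

samples = np.random.uniform(-1.0, 1.0, 5 * 5000).astype(np.float32)  # 5 s of placeholder audio at 5 kHz
augmented = augment(samples=samples, sample_rate=5000)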
Run data_analysis.ipynb for the complete pipeline (a simplified training sketch follows the outline below):
# 1. Load and preprocess data
# 2. Engineer 36 audio features
# 3. Split train/validation/test sets
# 4. Train Random Forest classifiers
# 5. Train Gradient Boosting regressor
# 6. Hyperparameter optimization via GridSearchCV
# 7. Evaluate performance
# 8. Save trained models
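The sketch below illustrates steps 3-6 for one target (speaker count) with scikit-learn. File and column names are placeholders; the notebook's actual parameter grids and preprocessing may differ.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder file/column names; substitute the engineered 36-feature matrix.
data = pd.read_csv("augmented_features.csv")
X = data.drop(columns=["speaker_count"])
y = data["speaker_count"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

param_grid = {"n_estimators": [100, 150, 200],
              "max_depth": [10, None],
              "min_samples_split": [2, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))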
Model | Algorithm | Accuracy/RMSE | Notes |
---|---|---|---|
Speaker Count | Random Forest | 73.3% | 3-class problem |
Meeting Type | Random Forest | 95.3% | 4-class problem |
Energy Level | Random Forest | 97.9% | 3-class problem |
Engagement Score | Gradient Boosting | 1.28 RMSE | Continuous 0-100 |
{
'n_estimators': 100,
'max_depth': 10,
'min_samples_split': 2
}
{
'n_estimators': 200,
'max_depth': None,
'min_samples_split': 2
}
{
'n_estimators': 150,
'max_depth': 10,
'min_samples_split': 5
}
{
'n_estimators': 100,
'learning_rate': 0.1,
'max_depth': 5
}
Purpose: Real-time spatial audio awareness
Visual Encoding (a sketch of the left/right color mapping follows the use cases below):
- Orange regions: Left-dominant audio sources
- Green regions: Right-dominant audio sources
- Gradient blending: Balanced/centered audio
- Intensity: Mapped to audio energy level
Use Cases:
- Monitor speaker positioning in real-time
- Detect conversation balance
- Ambient awareness during meetings
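The left/right color mapping is implemented in the WebGL shaders in main.js; the Python sketch below only illustrates the idea, assuming a linear blend between orange and green driven by the stereo level difference.

import numpy as np

def stereo_to_rgb(stereo_diff_db, max_db=20.0):
    # Positive values = left-dominant (orange), negative = right-dominant (green).
    # Assumed linear blend; the real mapping lives in the shaders.
    t = np.clip((stereo_diff_db + max_db) / (2 * max_db), 0.0, 1.0)
    orange = np.array([1.0, 0.55, 0.0])
    green = np.array([0.0, 0.8, 0.3])
    return t * orange + (1.0 - t) * green

print(stereo_to_rgb(15.0))   # mostly orange (left-dominant)
print(stereo_to_rgb(-15.0))  # mostly green (right-dominant)
print(stereo_to_rgb(0.0))    # balanced blend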
Purpose: Organic, flowing audio representation
Visual Encoding:
- Wave patterns: Multiple layered sinusoidal flows
- Color palette: Purple → Magenta → Cyan
- Movement: Audio-reactive wave propagation
- Noise layers: Organic texture variations
Use Cases:
- Aesthetic ambient displays
- Non-distracting environmental awareness
- Meeting energy visualization
Purpose: Analytical left-right comparison
Visual Encoding:
- Left channel: Orange bars
- Right channel: Green bars
- Time axis: Horizontal progression
- Level axis: Decibel scale
Use Cases:
- Detailed audio analysis
- Spatial audio debugging
- Meeting documentation
Purpose: Meeting engagement overview
Visual Encoding:
- Bars: Average audio level per time segment
- Height: Audio energy intensity
- Color: Activity level gradient
- Timeline: Meeting progression
Use Cases:
- Post-meeting analysis
- Engagement assessment
- Meeting quality metrics
Height Adjustment: 5 preset positions
- Lowest: Eye-level for seated participants
- Medium: Above table surface
- Highest: Ceiling-mounted configuration
Meeting Simulation: "Add Person" button
- Simulates occupancy changes
- Updates CO2 visualization (demo mode)
- Tests display responsiveness
- ESP32 connected and streaming audio data
- Jupyter Notebook running
- Quiet meeting room for controlled recordings
- Launch Data Collector
jupyter notebook data_collector.ipynb
- Configure Connection
# Enter ESP32 IP address when prompted
ESP32_IP = "192.168.0.110"  # Your ESP32's IP
- Start Recording Session
- Click [s] Start/Stop recording
- System begins capturing audio features
- Real-time audio levels displayed
- Apply Labels During Recording
- Click [l] Add label at any time
- Enter current meeting characteristics:
- Speaker Count: 1, 2, 3, 4, or 5+
- Meeting Type: discussion, presentation, brainstorm, argument
- Energy Level: low, medium, high
- Background Noise: none, low, medium, high
- Monitor Data Quality
- View recent audio data with [v]
- Check stereo difference for spatial accuracy
- Verify audio levels are within expected range (40-80 dB)
- Stop and Save Session
- Click [s] again to stop recording
- Data automatically saved to CSV with timestamp
- Session metadata logged to labels/sessions_master.csv
- Recording Duration: 3-5 minutes per session minimum
- Variety: Capture diverse meeting scenarios
- Label Granularity: Apply labels every 30-60 seconds
- Spatial Coverage: Record audio from different positions
- Energy Levels: Include quiet, moderate, and energetic discussions
data/
├── recordings/
│ ├── session_20250620_130013.csv
│ ├── session_20250620_145139.csv
│ └── ...
└── labels/
└── sessions_master.csv # Master label registry
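Once a few sessions exist, a quick way to inspect them is the pandas sketch below. It assumes the directory layout shown above; the column contents depend on data_collector.ipynb, so it only prints whatever is present.

import glob
import pandas as pd

# Load every recorded session plus the master label registry.
session_files = sorted(glob.glob("data/recordings/session_*.csv"))
sessions = [pd.read_csv(path) for path in session_files]
labels = pd.read_csv("data/labels/sessions_master.csv")

print(f"{len(sessions)} sessions loaded")
if sessions:
    print(sessions[0].head())
print(labels.head())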
After collecting baseline data, run augmentation pipeline:
jupyter notebook data_augmenter.ipynb
This expands your dataset by ~60x through:
- Time-domain transformations
- Pitch variations
- Spatial positioning changes
- Background noise injection
- Meeting flow simulations
Solutions:
- Verify ESP32 is on the same WiFi network
// Check serial monitor output
// Should show: "WiFi connected" and IP address
- Check firewall settings
- Allow WebSocket connections on port 81
- Temporarily disable firewall to test
- Verify IP address
- Use Serial Monitor to confirm ESP32 IP
- Try ping from command line:
ping 192.168.0.110
- WebSocket port conflict
- Ensure no other service is using port 81
- Try changing port in both ESP32 code and web interface
Solutions:
- Check microphone connections
- Verify GPIO 34 and 35 connections
- Ensure VCC and GND are properly connected
- Adjust microphone gain
- Turn potentiometers clockwise to increase sensitivity
- Test with louder sounds first
- Inspect serial monitor
- Look for FFT processing errors
- Check for ADC overflow warnings
- Restart ESP32
- Press reset button or power cycle
- Allow 5-10 seconds for WiFi reconnection
Solutions:
- Check WebGL support
- Visit: https://get.webgl.org/
- Update graphics drivers if needed
- Browser console errors
- Press F12 to open Developer Tools
- Check Console tab for JavaScript errors
- Clear browser cache
- Three.js loading
- Verify main.js and init.js are loading
- Check Network tab for 404 errors
Solutions:
- Reduce geometry resolution
// In main.js, change:
const geometry = new THREE.SphereGeometry(sphereRadius, 64, 64); // Instead of 128, 128
- Close other browser tabs
- WebGL is resource-intensive
- Free up GPU memory
- Lower update frequency
- Reduce WebSocket message rate in ESP32 code
Solutions:
- Insufficient data
- Collect at least 16 diverse recording sessions
- Ensure label balance across classes
- Feature extraction errors
- Check for NaN values in feature matrix
- Verify audio data is properly formatted
- Overfitting warnings
- Increase training data via augmentation
- Reduce model complexity (max_depth, n_estimators)
- Poor model performance
- Verify label accuracy in training data
- Check feature normalization
- Try different hyperparameters
Solutions:
- Version mismatch
# Check scikit-learn version
import sklearn
print(sklearn.__version__)
# Ensure same version as the model was trained with
- Corrupted pickle files
- Retrain and save models again
- Check file integrity
Solutions:
- Microphone gain too high
- Reduce gain using potentiometers
- Aim for 50-70 dB range for normal speech
- Electrical interference
- Keep wires away from power sources
- Add small capacitor (0.1µF) between VCC and GND
- Poor grounding
- Ensure solid GND connection
- Use twisted pair wiring for analog signals
Solutions:
- Microphone sensitivity mismatch
- Calibrate using identical test sounds
- Adjust gain potentiometers independently
- Physical positioning
- Maintain equal distance from sound sources during calibration
- Ensure microphones face same direction
Solutions:
- Reinstall dependencies
pip install --upgrade -r requirements.txt
- Virtual environment not activated
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
- Jupyter kernel mismatch
python -m ipykernel install --user --name=venv
# Select 'venv' kernel in Jupyter
Current State: Models are trained but not integrated into live visualization
Implementation Path:
# Future ml_model_server.py architecture
from flask import Flask, request, jsonify
import pickle
import numpy as np
app = Flask(__name__)
# Load trained models
models = {
    'speaker_count': pickle.load(open('trained_models/speaker_count_model.pkl', 'rb')),
    'meeting_type': pickle.load(open('trained_models/meeting_type_model.pkl', 'rb')),
    'energy_level': pickle.load(open('trained_models/energy_level_model.pkl', 'rb')),
    'engagement': pickle.load(open('trained_models/engagement_model.pkl', 'rb'))
}

@app.route('/predict', methods=['POST'])
def predict():
    features = request.json['features']

    def to_native(value):
        # Convert numpy scalars to plain Python types so jsonify can serialize them
        return value.item() if hasattr(value, 'item') else value

    predictions = {
        'speaker_count': to_native(models['speaker_count'].predict([features])[0]),
        'meeting_type': to_native(models['meeting_type'].predict([features])[0]),
        'energy_level': to_native(models['energy_level'].predict([features])[0]),
        'engagement_score': to_native(models['engagement'].predict([features])[0])
    }
    return jsonify(predictions)

if __name__ == '__main__':
    app.run(port=5000)
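Once such a server runs, the endpoint can be exercised from Python before any web integration. The requests package is an assumed extra dependency (it is not in the install list above), and the placeholder feature vector must be replaced with real values in the 36-feature training order.

import requests  # assumed extra dependency: pip install requests

# Send one placeholder 36-element feature vector to the local inference server.
features = [0.0] * 36  # replace with real extracted features, in training order
response = requests.post("http://localhost:5000/predict", json={"features": features})
print(response.json())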
Web Integration:
- Add HTTP requests from ml-integration.js to the Flask server
- Display ML predictions in real-time visualization UI
- Update spherical display based on classified meeting characteristics
Sequential Models for Temporal Patterns:
- LSTM Networks: Capture conversation flow and turn-taking dynamics
- Transformer Models: Attention mechanisms for long-range dependencies
- Temporal CNNs: Multi-scale temporal feature extraction
Multimodal Integration:
- Combine audio with video (facial expressions, gestures)
- Environmental sensors (temperature, CO2, humidity)
- Physiological monitoring (heart rate, galvanic skin response)
Diversity Requirements:
- Multiple languages and accents
- Different room acoustics and sizes
- Cultural conversation patterns
- Industry-specific meeting types (medical, legal, technical)
- Remote/hybrid meeting scenarios
Scale Goals:
- 500+ recording sessions
- 50+ hours of labeled audio
- 10,000+ augmented training samples
Proposed Enhancements:
- Particle systems for engagement bursts
- Fluid dynamics simulations for conversation flow
- Heat maps for speaker position history
- Frequency spectrum visualizations (3D spectrograms)
Use Case: Large meeting rooms with multiple displays
// Synchronization protocol
class DisplaySync {
  constructor(displayId, coordinatorIP) {
    this.displayId = displayId;
    // Each display opens a WebSocket to a coordinator on port 82 (distinct from the ESP32's port 81)
    this.coordinator = new WebSocket(`ws://${coordinatorIP}:82`);
    this.syncState();
  }

  syncState() {
    // Apply shared state broadcast by the coordinator so all displays stay in sync
    this.coordinator.onmessage = (event) => {
      const syncData = JSON.parse(event.data);
      this.updateVisualization(syncData); // updateVisualization to be provided by the host page
    };
  }
}
Context-Aware Display Modes:
- Automatic mode switching based on meeting type
- Brightness adaptation to room lighting
- Complexity adjustment based on viewing distance
- Color palette adaptation for accessibility
Ambient visualization operates in peripheral awareness, providing environmental information without demanding explicit attention. Unlike traditional dashboards requiring focused interaction, ambient displays follow "calm technology" principles where:
- Information is glanceable and non-intrusive
- Visual encoding supports peripheral perception
- Primary tasks remain uninterrupted
- Awareness is maintained without cognitive overhead
The dual-microphone system captures spatial characteristics through:
Stereo Difference Calculation:
Level_difference = 20 × log₁₀(|L[k]| / |R[k]|)
Voice Frequency Filtering: 300-3400 Hz band captures human speech while rejecting environmental noise
FFT Analysis: A 512-point window with a Hamming function balances frequency resolution (~10 Hz bins) against temporal responsiveness (~0.1 s windows)
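For reference, the same windowed analysis and per-bin level difference can be reproduced offline in numpy (the firmware itself uses arduinoFFT on the ESP32). The sine-wave frame below is a placeholder for real microphone samples.

import numpy as np

fs = 5000   # sampling rate (Hz)
n = 512     # FFT window size
eps = 1e-12

# Placeholder stereo frame: the same 440 Hz tone, louder on the left channel.
t = np.arange(n) / fs
left = 0.8 * np.sin(2 * np.pi * 440 * t)
right = 0.4 * np.sin(2 * np.pi * 440 * t)

window = np.hamming(n)
L = np.abs(np.fft.rfft(left * window))
R = np.abs(np.fft.rfft(right * window))
freqs = np.fft.rfftfreq(n, d=1 / fs)       # ~9.8 Hz per bin

voice = (freqs >= 300) & (freqs <= 3400)   # voice-band filtering
level_diff_db = 20 * np.log10((L[voice] + eps) / (R[voice] + eps))
print("Mean level difference (dB):", level_diff_db.mean())  # ~6 dB, left-dominant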
Ensemble Methods combine multiple weak learners:
- Random Forest: Bootstrap aggregating (bagging) of decision trees
- Gradient Boosting: Sequential training where each tree corrects previous errors
Feature Importance: Models identify which audio characteristics most predict meeting type, enabling interpretable analytics
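Assuming the saved .pkl files are fitted scikit-learn ensembles (not wrapped in pipelines), their feature importances can be inspected directly, as sketched below; the feature-name source is a placeholder reused from the usage example above.

import pickle
import pandas as pd

# Inspect which audio features drive the meeting-type classifier.
model = pickle.load(open('trained_models/meeting_type_model.pkl', 'rb'))

# Placeholder feature-name source: reuse the feature CSV from the usage example above.
feature_names = list(pd.read_csv('your_features.csv').columns)

# Pair names with importances and show the ten strongest predictors.
ranked = sorted(zip(feature_names, model.feature_importances_), key=lambda kv: kv[1], reverse=True)
for name, importance in ranked[:10]:
    print(f"{name}: {importance:.3f}")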
Metric | Value | Notes |
---|---|---|
Latency (E2E) | ~500ms | ESP32 processing + network + rendering |
Audio Sampling | 5 kHz | Sufficient for voice (Nyquist: 2.5 kHz max) |
Update Rate | ~2 Hz | Balance of statistical robustness and responsiveness |
WebSocket Bandwidth | ~5 KB/s | JSON messages with audio features |
Browser FPS | 60 FPS | Three.js rendering with WebGL |
ML Inference | N/A | Currently offline (future: <100ms target) |
#include <WiFi.h> // ESP32 WiFi library
#include <WebSocketsServer.h> // Real-time communication
#include <arduinoFFT.h> // Fast Fourier Transform
#include <ArduinoJson.h> // JSON serialization
#include <driver/adc.h> // Low-level ADC control
import numpy as np # Numerical computing
import pandas as pd # Data manipulation
import sklearn # Machine learning
import librosa # Audio processing
import soundfile # Audio I/O
import audiomentations # Data augmentation
import matplotlib, seaborn # Visualization
// Three.js (r128) - 3D rendering engine
// WebSocket API - Real-time communication
// Native WebGL - Hardware-accelerated graphics
If you use this project in your research, please cite:
@mastersthesis{isakova2025ambient,
title={Intelligent Ambient Data Visualization on Non-Planar Displays:
A Multi-Modal Machine Learning Approach for Spatial Audio-Visual Analytics},
author={Isakova, Dilbar},
year={2025},
school={Universit\'e Paris-Saclay, CentraleSup\'elec},
type={Master's Thesis},
note={Erasmus Mundus Joint Master's Degree in Big Data Management and Analytics}
}
- Prof. Anastasia Bezerianos - Université Paris-Saclay
- Dr. Petra Isenberg - Inria, Research Director
- Dr. Tobias Isenberg - Inria, Research Director
- Erwan Achat - INRIA Laboratory Colleague (Dual-microphone array configuration)
- tVISt Project Partners - Valuable design feedback and project guidance
- INRIA AVIZ Team - Survey participants and data collection support
- Université Paris-Saclay - Academic institution
- CentraleSupélec - Partner institution
- Inria Saclay - Research laboratory
- Erasmus Mundus BDMA Program - Master's program support
This project is part of academic research conducted at LISN and Inria AVIZ.
For Academic Use: Citation and acknowledgment required
For Commercial Use: Please contact the author and supervisors
Copyright © 2025 Dilbar Isakova
Author: Dilbar Isakova
Email: [email protected]
Institution: Université Paris-Saclay, Inria AVIZ
GitHub Repository: https://github.com/isakovaad/sphereDisplay
Thesis Date: August 31, 2025
- White, S., Feiner, S. (2009). "SiteLens: Situated Visualization Techniques for Urban Site Visits"
- Ren, Y., et al. (2025). "Multi-dimensional Feature Extraction for Audio Classification"
- Zhang, Z., et al. (2013). "Ensemble Methods in Environmental Audio Classification"
- ESP32 Datasheet: https://www.espressif.com/sites/default/files/documentation/esp32_datasheet_en.pdf
- Three.js Documentation: https://threejs.org/docs/
- ArduinoFFT Library: https://github.com/kosme/arduinoFFT
- WebSocket Protocol: https://datatracker.ietf.org/doc/html/rfc6455
- Arduino IDE: https://www.arduino.cc/en/software
- scikit-learn: https://scikit-learn.org/
- librosa (Audio Analysis): https://librosa.org/
- Jupyter Notebook: https://jupyter.org/
# Compile without uploading
arduino-cli compile --fqbn esp32:esp32:esp32 sketch_jun13a/
# Upload to ESP32
arduino-cli upload -p /dev/ttyUSB0 --fqbn esp32:esp32:esp32 sketch_jun13a/
# Monitor serial output
arduino-cli monitor -p /dev/ttyUSB0 -c baudrate=115200
# Start Jupyter
jupyter notebook
# Run specific notebook
jupyter nbconvert --execute --to html data_analysis.ipynb
# Export trained model
python -c "import pickle; pickle.dump(model, open('model.pkl', 'wb'))"
# Load and test model
python -c "import pickle; model = pickle.load(open('model.pkl', 'rb')); print(model.score(X_test, y_test))"
# Start local HTTP server (for testing)
python -m http.server 8000
# Then open: http://localhost:8000/index.html
# Live reload development (using VS Code Live Server extension)
# Right-click index.html → "Open with Live Server"
# Clone repository
git clone https://github.com/isakovaad/sphereDisplay.git
# Check status
git status
# Pull latest changes
git pull origin main
# Commit changes
git add .
git commit -m "Description of changes"
git push origin main
- "The Scientist and Engineer's Guide to Digital Signal Processing" by Steven W. Smith (Free online)
- YouTube: 3Blue1Brown - "But what is the Fourier Transform?"
- Course: Coursera - "Audio Signal Processing for Music Applications"
- ESP32 Official Documentation: https://docs.espressif.com/, https://documentation.espressif.com/en/home
- Book: "Programming ESP32 with Arduino IDE" by Kolban
- Forum: ESP32.com community forums
- Book: "Neural Networks for Audio Signal Processing" by Mann & Haykin
- Tutorial: Librosa audio feature extraction tutorials
- Course: Fast.ai - "Practical Deep Learning for Coders"
- Three.js Journey: https://threejs-journey.com/
- WebGL Fundamentals: https://webglfundamentals.org/
- Book: "Interactive Computer Graphics with WebGL" by Angel & Shreiner
Last Updated: September 30, 2025