95 changes: 89 additions & 6 deletions README.md
@@ -14,12 +14,66 @@ Work in progress. Developed to quickly test new models running DeepSpeech in [Wi
- Streaming inference via DeepSpeech v0.2+
- Multi-user (only decodes one stream at a time, but can block until decoding is available)
- Tested and works with DeepSpeech v0.5.1 on Windows
- Mode for JSON return and enhanced/rich metadata on timing of each word
* Client
- Streams raw audio data from microphone to server via WebSocket
- Voice activity detection (VAD) to ignore noise and segment microphone input into separate utterances
- Hypnotizing spinner to indicate voice activity is detected!
- Option to automatically save each utterance to a separate .wav file, for later testing
- Need to pause/unpause listening? [See here](https://github.com/daanzu/deepspeech-websocket-server/issues/6).
- A POST endpoint to push files directly (warning, limited file upload size)


### Server Endpoints

Functionality has been expanded with a few additional endpoints around the same server wrapper.

* `/recognize` - WebSocket-based traditional recognition (plain text result)
* `/recognize_meta` - WebSocket-based enhanced recognition that includes JSON results for probability, timing, etc.
- example JSON result:
```json
{
"probability": 53.0922,
"text": "your power is sufficient i said",
"duration": 5.36,
"items": [
{
"text": "your",
"start": 0.68,
"duration": 0.18
},
{
"text": "power",
"start": 0.92,
"duration": 0.50
},
{
"text": "is",
"start": 1.24,
"duration": 0.66
},
{
"text": "sufficient",
"start": 1.38,
"duration": 1.32
},
{
"text": "i",
"start": 1.86,
"duration": 1.32
},
{
"text": "said",
"start": 2.04,
"duration": 1.38
}
],
"start": 0.68
}
```
* `/recognize_file` - POST endpoint that recognizes an uploaded file, returning either enhanced (JSON) or text-only (string) results (see [Audio File Processing](#audio-file-processing))
  - accepts web-form or query-string submissions with the parameters `audio` (a WAV file) and `enhanced` (integer `0` or `1`)
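The per-word `items` in the `/recognize_meta` result can be post-processed client-side, e.g. to derive end timestamps. A minimal sketch — the JSON shape follows the example above, and the post-processing shown is illustrative, not part of the server:

```python
import json

# Sample result in the shape returned by /recognize_meta (shortened from
# the example above; field names are assumed to match the server's output).
raw = """
{"probability": 53.0922, "text": "your power is sufficient i said",
 "duration": 5.36, "start": 0.68,
 "items": [{"text": "your", "start": 0.68, "duration": 0.18},
           {"text": "said", "start": 2.04, "duration": 1.38}]}
"""
result = json.loads(raw)

# Derive an end timestamp for each word from its start and duration.
words = [(w["text"], w["start"], round(w["start"] + w["duration"], 2))
         for w in result["items"]]
for text, start, end in words:
    print(f"{start:5.2f}-{end:5.2f}  {text}")
```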


## Installation

@@ -48,7 +102,7 @@ On MacOS, try installing portaudio with brew: `brew install portaudio`.

## Server

```
```bash
> python server.py --model ../models/daanzu-6h-512l-0001lr-425dr/ -l -t
Initializing model...
2018-10-06 AM 05:55:16.357: __main__: INFO: <module>(): args.model: ../models/daanzu-6h-512l-0001lr-425dr/output_graph.pb
@@ -69,7 +123,7 @@ Hit Ctrl-C to quit.
^CKeyboardInterrupt
```

```
```bash
> python server.py -h
usage: server.py [-h] -m MODEL [-a [ALPHABET]] [-l [LM]] [-t [TRIE]] [--lw LW]
[--vwcw VWCW] [--bw BW] [-p PORT]
@@ -99,18 +153,18 @@ optional arguments:

## Client

```
```bash
λ py client.py
Listening...
Recognized: alpha bravo charlie
Recognized: delta echo foxtrot
^C
```

```
```bash
λ py client.py -h
usage: client.py [-h] [-s SERVER] [-a AGGRESSIVENESS] [--nospinner]
[-w SAVEWAV]
[-w SAVEWAV] [-d DEVICE] [-v]

Streams raw audio data from microphone with VAD to server via WebSocket

@@ -124,7 +178,28 @@ optional arguments:
speech, 3 the most aggressive. Default: 3
--nospinner Disable spinner
-w SAVEWAV, --savewav SAVEWAV
Save .wav files of utterances to given directory
Save .wav files of utterances to given directory.
Example for current directory: -w .
-d DEVICE, --device DEVICE
Set audio input device index, as reported by the
system. The default uses the system default
recording device.
-v, --verbose Print debugging info

```

### Audio File Processing
Want to send a file directly to the server instead of from a live source?

```bash
# process a single file for text alone; must be a WAV file
curl -X POST -F file=@../audio/8455-210777-0068.wav http://localhost:8787/recognize_file

# process a single file with enhanced return; must be a WAV file
curl -X POST -F file=@../audio/8455-210777-0068.wav -F enhanced=1 http://localhost:8787/recognize_file

# process a single file with enhanced return; must be a WAV file (alternative using a URL query parameter)
curl -X POST -F file=@../audio/8455-210777-0068.wav "http://localhost:8787/recognize_file?enhanced=1"

```
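The curl calls above need a WAV file on disk. For quick endpoint testing without recorded audio, a one-second silent clip in the format DeepSpeech models typically expect (16-bit, 16 kHz, mono) can be generated with the Python standard library — a sketch; the filename is arbitrary and the sample rate should match your model:

```python
import wave

# Write one second of 16-bit silence at 16 kHz mono for exercising
# the /recognize_file endpoint.
with wave.open("test_silence.wav", "wb") as wf:
    wf.setnchannels(1)        # mono
    wf.setsampwidth(2)        # 16-bit samples
    wf.setframerate(16000)    # 16 kHz
    wf.writeframes(b"\x00\x00" * 16000)  # 16000 frames = 1 s of silence

# Verify what was written.
with wave.open("test_silence.wav", "rb") as wf:
    print(wf.getnframes(), wf.getframerate())
```

The resulting `test_silence.wav` can then be substituted for the sample path in the curl commands above.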

## Contributions
@@ -133,3 +208,11 @@ Pull requests welcome.

Contributors:
* [@Zeddy913](https://github.com/Zeddy913)


## Changes

Coarse description of significant modifications as they come.

- 190905 - add POST API for file endpoint; enhanced mode for server returns; launch server at `0.0.0.0` instead of localhost
- 190903 - add device index for pyaudio so you can use other loopback devices (e.g. [MacOS Soundflower](https://github.com/mattingalls/Soundflower))
50 changes: 41 additions & 9 deletions client.py
@@ -3,6 +3,7 @@
import threading, collections, queue, os, os.path
import wave
import pyaudio
import pprint
import webrtcvad
from lomond import WebSocket, events
from halo import Halo
@@ -25,7 +26,7 @@ class Audio(object):
CHANNELS = 1
BLOCKS_PER_SECOND = 50

def __init__(self, callback=None, buffer_s=0, flush_queue=True):
def __init__(self, callback=None, buffer_s=0, flush_queue=True, device_index=None):
def proxy_callback(in_data, frame_count, time_info, status):
callback(in_data)
return (None, pyaudio.paContinue)
@@ -38,6 +39,7 @@ def proxy_callback(in_data, frame_count, time_info, status):
channels=self.CHANNELS,
rate=self.sample_rate,
input=True,
input_device_index=device_index,
frames_per_buffer=self.block_size,
stream_callback=proxy_callback)
self.stream.start_stream()
@@ -56,11 +58,27 @@ def read(self):
else:
return None


def read_loop(self, callback):
"""Block looping reading, repeatedly passing a block of audio data to callback."""
for block in iter(self):
callback(block)

@staticmethod
def device_list():
"""Iterate and return the audio devices in the system."""
local_pa = pyaudio.PyAudio()
device_info = { "Input":[], "Output":[] }
device_count = local_pa.get_device_count()
for idx_dev in range(device_count):
local_info = local_pa.get_device_info_by_index(idx_dev)
for local_type in ["Output", "Input"]:
local_channels = f"max{local_type}Channels"
if local_channels in local_info and local_info[local_channels] > 0:
device_info[local_type].append({"device":idx_dev, "name":local_info["name"],
"channels":local_info[local_channels]})
return device_info

def __iter__(self):
"""Generator that yields all audio blocks from microphone."""
while True:
@@ -89,8 +107,8 @@ def write_wav(self, filename, data):
class VADAudio(Audio):
"""Filter & segment audio with voice activity detection."""

def __init__(self, aggressiveness=3):
super().__init__()
def __init__(self, aggressiveness=3, device_index=None):
super().__init__(device_index=device_index)
self.vad = webrtcvad.Vad(aggressiveness)

def vad_collector_simple(self, pre_padding_ms, blocks=None):
Expand Down Expand Up @@ -209,7 +227,11 @@ def on_event(event):
print_output("Connected!")
ready = True
elif isinstance(event, events.Text):
if 1: print_output("Recognized: %s" % event.text)
# TODO: modify for inclusion of timing information?
# TODO: what do we do with a rich / metadata return instead?

if len(event.text):
print_output("Recognized: %s" % event.text)
elif 1:
logging.debug(event)

@@ -221,15 +243,21 @@ def on_event(event):
websocket.close()

def main():
if ARGS.listdevice:
dict_devices = Audio.device_list()
print_output("Available devices...")
print_output(pprint.pformat(dict_devices))
return 0

vad_audio = VADAudio(aggressiveness=ARGS.aggressiveness, device_index=ARGS.device)

websocket = WebSocket(ARGS.server)
# TODO: compress?
print_output("Connecting to '%s'..." % websocket.url)

vad_audio = VADAudio(aggressiveness=ARGS.aggressiveness)
print_output("Listening (ctrl-C to exit)...")
audio_consumer_thread = threading.Thread(target=lambda: audio_consumer(vad_audio, websocket))
print_output("Listening (ctrl-C to exit)...")
audio_consumer_thread.start()

print_output("Connecting to '%s'..." % websocket.url)
websocket_runner(websocket)


@@ -246,7 +274,7 @@ def consumer(self, blocks):
else:
print('.', end='', flush=True)
length_ms = 0
VADAudio(consumer)
VADAudio(consumer, device_index=ARGS.device)
elif 1:
VADAudio.test_vad(3)

@@ -261,8 +289,12 @@ def consumer(self, blocks):
help="Disable spinner")
parser.add_argument('-w', '--savewav',
help="Save .wav files of utterances to given directory. Example for current directory: -w .")
parser.add_argument('-d', '--device', type=int, default=None,
help="Set audio input device index, as reported by the system. The default uses the system default recording device.")
parser.add_argument('-v', '--verbose', action='store_true',
help="Print debugging info")
parser.add_argument('-l', '--listdevice', action='store_true',
help="List available devices for live capture")
ARGS = parser.parse_args()

if ARGS.verbose: logging.getLogger().setLevel(10)
1 change: 1 addition & 0 deletions requirements-server.txt
@@ -1,3 +1,4 @@
numpy>=1.15.1
bottle>=0.12.13
bottle-websocket>=0.2.9
scipy>=0.12.0