95 changes: 89 additions & 6 deletions README.md
@@ -14,12 +14,66 @@ Work in progress. Developed to quickly test new models running DeepSpeech in [Wi
- Streaming inference via DeepSpeech v0.2+
- Multi-user (only decodes one stream at a time, but can block until decoding is available)
- Tested and works with DeepSpeech v0.5.1 on Windows
- Mode for JSON return and enhanced/rich metadata on timing of each word
* Client
- Streams raw audio data from microphone to server via WebSocket
- Voice activity detection (VAD) to ignore noise and segment microphone input into separate utterances
- Hypnotizing spinner to indicate voice activity is detected!
- Option to automatically save each utterance to a separate .wav file, for later testing
- Need to pause/unpause listening? [See here](https://github.com/daanzu/deepspeech-websocket-server/issues/6).
- A POST endpoint to push files directly (warning, limited file upload size)


### Server Endpoints

Functionality has been expanded with a few additional endpoints around the same server wrapper.

* `/recognize` - WebSocket-based traditional recognition (plain text result)
* `/recognize_meta` - WebSocket-based enhanced recognition that includes JSON results for probability, timing, etc.
- example JSON result:
```json
{
"probability": 53.0922,
"text": "your power is sufficient i said",
"duration": 5.36,
"items": [
{
"text": "your",
"start": 0.68,
"duration": 0.18
},
{
"text": "power",
"start": 0.92,
"duration": 0.50
},
{
"text": "is",
"start": 1.24,
"duration": 0.66
},
{
"text": "sufficient",
"start": 1.38,
"duration": 1.32
},
{
"text": "i",
"start": 1.86,
"duration": 1.32
},
{
"text": "said",
"start": 2.04,
"duration": 1.38
}
],
"start": 0.68
}
```
* `/recognize_file` - POST endpoint that recognizes an uploaded file, returning either enhanced (JSON) or text-only (string) results (see [Audio File Processing](#audio-file-processing))
  - accepts web-form or query-string submissions with the parameters `audio` (a WAV file) and `enhanced` (integer `0` or `1`)
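The per-word `items` in the `/recognize_meta` result can be post-processed client-side, e.g. to derive end timestamps. A minimal sketch — the JSON shape follows the example above, and the post-processing shown is illustrative, not part of the server:

```python
import json

# Sample result in the shape returned by /recognize_meta (shortened from
# the example above; field names are assumed to match the server's output).
raw = """
{"probability": 53.0922, "text": "your power is sufficient i said",
 "duration": 5.36, "start": 0.68,
 "items": [{"text": "your", "start": 0.68, "duration": 0.18},
           {"text": "said", "start": 2.04, "duration": 1.38}]}
"""
result = json.loads(raw)

# Derive an end timestamp for each word from its start and duration.
words = [(w["text"], w["start"], round(w["start"] + w["duration"], 2))
         for w in result["items"]]
for text, start, end in words:
    print(f"{start:5.2f}-{end:5.2f}  {text}")
```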


## Installation

@@ -48,7 +102,7 @@ On MacOS, try installing portaudio with brew: `brew install portaudio`.

## Server

```
```bash
> python server.py --model ../models/daanzu-6h-512l-0001lr-425dr/ -l -t
Initializing model...
2018-10-06 AM 05:55:16.357: __main__: INFO: <module>(): args.model: ../models/daanzu-6h-512l-0001lr-425dr/output_graph.pb
@@ -69,7 +123,7 @@ Hit Ctrl-C to quit.
^CKeyboardInterrupt
```

```
```bash
> python server.py -h
usage: server.py [-h] -m MODEL [-a [ALPHABET]] [-l [LM]] [-t [TRIE]] [--lw LW]
[--vwcw VWCW] [--bw BW] [-p PORT]
@@ -99,18 +153,18 @@ optional arguments:

## Client

```
```bash
λ py client.py
Listening...
Recognized: alpha bravo charlie
Recognized: delta echo foxtrot
^C
```

```
```bash
λ py client.py -h
usage: client.py [-h] [-s SERVER] [-a AGGRESSIVENESS] [--nospinner]
[-w SAVEWAV]
[-w SAVEWAV] [-d DEVICE] [-v]

Streams raw audio data from microphone with VAD to server via WebSocket

@@ -124,7 +178,28 @@ optional arguments:
speech, 3 the most aggressive. Default: 3
--nospinner Disable spinner
-w SAVEWAV, --savewav SAVEWAV
Save .wav files of utterances to given directory
Save .wav files of utterances to given directory.
Example for current directory: -w .
-d DEVICE, --device DEVICE
Set audio input device index, as reported by the
system. The default uses the system default
recording device.
-v, --verbose Print debugging info

```

### Audio File Processing
Want to send a file directly to the server instead of from a live source?

```bash
# process a single file for text alone; must be a WAV file
curl -X POST -F file=@../audio/8455-210777-0068.wav http://localhost:8787/recognize_file

# process a single file with enhanced return; must be a WAV file
curl -X POST -F file=@../audio/8455-210777-0068.wav -F enhanced=1 http://localhost:8787/recognize_file

# process a single file with enhanced return; must be a WAV file (alternative using a URL query parameter)
curl -X POST -F file=@../audio/8455-210777-0068.wav "http://localhost:8787/recognize_file?enhanced=1"

```
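The curl calls above need a WAV file on disk. For quick endpoint testing without recorded audio, a one-second silent clip in the format DeepSpeech models typically expect (16-bit, 16 kHz, mono) can be generated with the Python standard library — a sketch; the filename is arbitrary and the sample rate should match your model:

```python
import wave

# Write one second of 16-bit silence at 16 kHz mono for exercising
# the /recognize_file endpoint.
with wave.open("test_silence.wav", "wb") as wf:
    wf.setnchannels(1)        # mono
    wf.setsampwidth(2)        # 16-bit samples
    wf.setframerate(16000)    # 16 kHz
    wf.writeframes(b"\x00\x00" * 16000)  # 16000 frames = 1 s of silence

# Verify what was written.
with wave.open("test_silence.wav", "rb") as wf:
    print(wf.getnframes(), wf.getframerate())
```

The resulting `test_silence.wav` can then be substituted for the sample path in the curl commands above.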

## Contributions
@@ -133,3 +208,11 @@ Pull requests welcome.

Contributors:
* [@Zeddy913](https://github.com/Zeddy913)


## Changes

Coarse description of significant modifications as they come.

- 190905 - add POST API for file endpoint; enhanced mode for server returns; launch server at `0.0.0.0` instead of localhost
- 190903 - add device index for pyaudio so you can use other loopback devices (e.g. [MacOS Soundflower](https://github.com/mattingalls/Soundflower))
50 changes: 41 additions & 9 deletions client.py
@@ -3,6 +3,7 @@
import threading, collections, queue, os, os.path
import wave
import pyaudio
import pprint
import webrtcvad
from lomond import WebSocket, events
from halo import Halo
@@ -25,7 +26,7 @@ class Audio(object):
CHANNELS = 1
BLOCKS_PER_SECOND = 50

def __init__(self, callback=None, buffer_s=0, flush_queue=True):
def __init__(self, callback=None, buffer_s=0, flush_queue=True, device_index=None):
def proxy_callback(in_data, frame_count, time_info, status):
callback(in_data)
return (None, pyaudio.paContinue)
@@ -38,6 +39,7 @@ def proxy_callback(in_data, frame_count, time_info, status):
channels=self.CHANNELS,
rate=self.sample_rate,
input=True,
input_device_index=device_index,
frames_per_buffer=self.block_size,
stream_callback=proxy_callback)
self.stream.start_stream()
@@ -56,11 +58,27 @@ def read(self):
else:
return None


def read_loop(self, callback):
"""Block looping reading, repeatedly passing a block of audio data to callback."""
for block in iter(self):
callback(block)

@staticmethod
def device_list():
"""Iterate and return the audio devices in the system."""
local_pa = pyaudio.PyAudio()
device_info = { "Input":[], "Output":[] }
device_count = local_pa.get_device_count()
for idx_dev in range(device_count):
local_info = local_pa.get_device_info_by_index(idx_dev)
for local_type in ["Output", "Input"]:
local_channels = f"max{local_type}Channels"
if local_channels in local_info and local_info[local_channels] > 0:
device_info[local_type].append({"device":idx_dev, "name":local_info["name"],
"channels":local_info[local_channels]})
return device_info

def __iter__(self):
"""Generator that yields all audio blocks from microphone."""
while True:
@@ -89,8 +107,8 @@ def write_wav(self, filename, data):
class VADAudio(Audio):
"""Filter & segment audio with voice activity detection."""

def __init__(self, aggressiveness=3):
super().__init__()
def __init__(self, aggressiveness=3, device_index=None):
super().__init__(device_index=device_index)
self.vad = webrtcvad.Vad(aggressiveness)

def vad_collector_simple(self, pre_padding_ms, blocks=None):
Expand Down Expand Up @@ -209,7 +227,11 @@ def on_event(event):
print_output("Connected!")
ready = True
elif isinstance(event, events.Text):
if 1: print_output("Recognized: %s" % event.text)
# TODO: modify for inclusion of timing information?
# TODO: what do we do with a rich / metadata return instead?

if len(event.text):
print_output("Recognized: %s" % event.text)
elif 1:
logging.debug(event)

@@ -221,15 +243,21 @@ def on_event(event):
websocket.close()

def main():
if ARGS.listdevice:
dict_devices = Audio.device_list()
print_output("Available devices...")
print_output(pprint.pformat(dict_devices))
return 0

vad_audio = VADAudio(aggressiveness=ARGS.aggressiveness, device_index=ARGS.device)

websocket = WebSocket(ARGS.server)
# TODO: compress?
print_output("Connecting to '%s'..." % websocket.url)

vad_audio = VADAudio(aggressiveness=ARGS.aggressiveness)
print_output("Listening (ctrl-C to exit)...")
audio_consumer_thread = threading.Thread(target=lambda: audio_consumer(vad_audio, websocket))
print_output("Listening (ctrl-C to exit)...")
audio_consumer_thread.start()

print_output("Connecting to '%s'..." % websocket.url)
websocket_runner(websocket)


@@ -246,7 +274,7 @@ def consumer(self, blocks):
else:
print('.', end='', flush=True)
length_ms = 0
VADAudio(consumer)
VADAudio(consumer, device_index=ARGS.device)
elif 1:
VADAudio.test_vad(3)

@@ -261,8 +289,12 @@ def consumer(self, blocks):
help="Disable spinner")
parser.add_argument('-w', '--savewav',
help="Save .wav files of utterances to given directory. Example for current directory: -w .")
parser.add_argument('-d', '--device', type=int, default=None,
help="Set audio input device index, as reported by the system. The default uses the system default recording device.")
parser.add_argument('-v', '--verbose', action='store_true',
help="Print debugging info")
parser.add_argument('-l', '--listdevice', action='store_true',
help="List available devices for live capture")
ARGS = parser.parse_args()

if ARGS.verbose: logging.getLogger().setLevel(10)
1 change: 1 addition & 0 deletions requirements-server.txt
@@ -1,3 +1,4 @@
numpy>=1.15.1
bottle>=0.12.13
bottle-websocket>=0.2.9
scipy>=0.12.0