If we sort out #10, that will leave us with the following share of time spent in language model lookups:
- 38.3 / (100 − 47.6) = 73.1% of time spent in language model lookup in the `diacritics_h2mor.conf` configuration
- 30.6 / (100 − 44.2) = 54.8% of time spent in language model lookup in the `spellchecking_h2mor.conf` configuration
One way to go is to use #3 and hope KenLM is faster.
Alternatively, we could improve the existing LM implementation:
- one possible improvement is to allow searching the LM using a "state" plus "the next word in the sequence", instead of supplying all N words every time (i.e., the sequence being searched would be identified by a state; supplying another word would yield a new state together with information about the last N-gram)
- what I like better is to change the LM representation entirely (this would also cover the previous point). Instead of a tree structure, we would have a hash table for every order (i.e., one for unigrams, another for bigrams, etc.). Each hash table would store only hashes, not the n-grams themselves, i.e., a hash position H would contain information about all n-grams with hash H merged together. Although this merges some of the n-grams, it would probably have little impact on accuracy. Searching this structure would be really fast (and the structure can be quite large, as it contains only an array of n-gram statistics, so the number of collisions could be quite small). Also, the "state" of the current sentence can simply be the hash of the last N-gram, and with an appropriate hash we can update it to the hash of the N-gram resulting from appending a word, in constant time. I would guess that such an implementation could be something like 5-10 times faster than the current one (and the models would probably also be smaller than the current LM models).
 [Edited] The above does not deal with non-existing n-grams, but these can be handled in several ways (a global Bloom filter; a small Bloom filter for every hash table; or perfect hashing with the hashes stored).
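A minimal sketch of the hashed representation described above (all names are illustrative, not from any real code base; a real implementation would also handle backoff for absent n-grams):

```python
B = 1_000_003          # polynomial rolling-hash base
MASK = (1 << 64) - 1   # emulate 64-bit unsigned wraparound

def word_hash(word):
    # FNV-1a over the word's bytes, truncated to 64 bits.
    h = 14695981039346656037
    for c in word.encode():
        h = ((h ^ c) * 1099511628211) & MASK
    return h

class HashedLM:
    def __init__(self, order):
        self.order = order
        self.pow = pow(B, order, 1 << 64)             # B**order mod 2**64
        # One hash table per order; tables[k] holds (k+1)-grams, keyed only
        # by the 64-bit hash, so colliding n-grams share a single entry.
        self.tables = [dict() for _ in range(order)]

    def add(self, words, logprob):
        h = 0
        for w in words:
            h = (h * B + word_hash(w)) & MASK
        self.tables[len(words) - 1][h] = logprob

    def start_state(self):
        return (0, ())                                # (rolling hash, window)

    def append(self, state, word):
        """Constant-time state update: append a word, drop the oldest one
        from the rolling hash once the window exceeds the model order."""
        h, window = state
        wh = word_hash(word)
        h = (h * B + wh) & MASK
        window = window + (wh,)
        if len(window) > self.order:
            h = (h - window[0] * self.pow) & MASK     # remove oldest word
            window = window[1:]
        # Absent n-grams return None; a real implementation would back off
        # to lower orders (or consult a Bloom filter first, as noted above).
        return (h, window), self.tables[len(window) - 1].get(h)
```

Usage: `add(["the", "cat"], -1.5)` on a bigram model, then two `append` calls for "the" and "cat" reach the stored entry through the rolling hash alone, without ever materializing the n-gram.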
But using KenLM is definitely easier, more manageable and probably just as fast :-)