The idea is to use regex patterns for tokenization and deterministic tagging. A classifier (an LSTM, for example) then fills in the tags for ambiguous tokens.
We are trying to define a set of token classes that should work across most languages:
kwfl
: flow keyword. if, for, return, try, except
kwop
: operator keyword. Used like an operator. in, is, select, new, echo
kwmo
: modifier keyword. pub, private, static, final, volatile
kwde
: declaration keyword. Declares a variable, class, or function
kwim
: import keyword. import, from, #include (?), use
id
: indentation. space/tab at beginning of line
ws
: whitespace. space, tab
nl
: new-line.
brop
: opening brackets
brcl
: closing brackets
sy
: syntax features. :, ::, ->, =>, >>>, also <> in types
pu
: punctuation.
co
: comments (inline/multiline/single line)
nu
: number. dec, int, scientific, hex, bin, percent.
st
: string.
bo
: boolean literals.
li
: other literal. null, None, undefined, built in constant values
opbi
: binary operator. Other binary operators
opun
: unary operator. &ref, !not, X', x++, --x
opas
: assignment operators. =, <-, +=, etc.
opmo
: modifier operators. references, pointers etc.
pa
: parameter. a variable defined together with a function.
ty
: type keyword. int, f64, void
tyco
: type keyword constructor.
cl
: class. Non-primitive, user-defined types, also traits.
clco
: class constructor. class name used as a function
mo
: module/namespace.
fnme
: method. A function on an object instance
fnas
: associated/static method/function. On module or class
fnfr
: standalone function.
fnto
: function tear-off.
an
: annotation. @Override, #[ allow() ], @property, Rust lifetimes
va
: variable or similar user defined identifier.
at
: attribute. a variable/constant on some object or module.
uk
: unknown.
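
The two-stage idea can be sketched roughly as follows. This is a minimal illustration, not the project's actual rule set: the `RULES` table, the `tag_tokens` and `ambiguous` helpers, and the Python-flavoured patterns are all assumptions made for the example. Tokens that the regex can tag deterministically get their class directly; everything else is tagged `uk` and left for the classifier.

```python
import re

# Hypothetical, deliberately tiny rule table: (tag, regex).
# Order matters: the first alternative that matches at a position wins.
RULES = [
    ("co", r"#[^\n]*"),                                   # single-line comment
    ("st", r'"(?:\\.|[^"\\\n])*"' + "|" + r"'(?:\\.|[^'\\\n])*'"),
    ("nu", r"\d+(?:\.\d+)?"),
    ("kwfl", r"\b(?:if|for|return|try|except)\b"),
    ("kwop", r"\b(?:in|is)\b"),
    ("bo", r"\b(?:True|False)\b"),
    ("li", r"\bNone\b"),
    ("nl", r"\n"),
    ("id", r"^[ \t]+"),                                   # only at line start
    ("ws", r"[ \t]+"),
    ("brop", r"[(\[{]"),
    ("brcl", r"[)\]}]"),
    ("opbi", r"==|!=|<=|>="),
    ("sy", r"->|=>|::|:"),
    ("opas", r"\+=|-=|="),
    ("pu", r"[,;.]"),
    # Identifiers and anything unrecognised stay ambiguous for the classifier.
    ("uk", r"[A-Za-z_]\w*|."),
]

# One named group per tag; MULTILINE lets ^ match after every newline.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{tag}>{rx})" for tag, rx in RULES),
    re.MULTILINE,
)


def tag_tokens(source):
    """Split source into (text, tag) pairs using the rule table."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(source)]


def ambiguous(tokens):
    """Indices of tokens the classifier still has to label."""
    return [i for i, (_, tag) in enumerate(tokens) if tag == "uk"]


if __name__ == "__main__":
    src = 'if x in items:\n    return "a" # hi\n'
    for text, tag in tag_tokens(src):
        print(f"{tag:5} {text!r}")
```

Because the last rule matches any remaining character, concatenating the token texts reproduces the input exactly, which keeps the tagging step lossless. In this sketch the identifiers `x` and `items` come out as `uk`; deciding whether they are `va`, `pa`, `at` and so on would be the classifier's job.
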
- ✅ LSTM Tagger 24-12-07
- ✅ Render HTML preview 25-01-19
- ✅ NDJSON dataset 25-08-30
- ✅ Cleanup labels, linting 25-09-03
- ✅ Optuna, settle for a good LSTM model 25-09-20
- ❓ Reset indentation: avoid unnecessary indentation of all lines
- ❓ RNN variant comparison
- ❓ Feature based classifier
- ❓ Character-level LM
- ❓ Inline mode: try to catch code fragments in text?
- ❓ Language classifier?
- ❓ Highlighting inside strings?