The best lexicon type/format to use #1

Open
@nciric

Lexicons are a critical part of the inflection project. They need to be used at runtime, and our tools will also use them for potential ML training.

We need to decide on the format we collect the data in. This decision needs to be based on multiple criteria:

  1. Is the format open, or under a friendly license?
  2. Can other lexicons be converted into that format, so we have consistent data?
  3. Is the format efficient, to reduce size and allow quick lookup?
  4. Are there good-quality existing tools that operate on the lexicon?
  5. Can the lexicon data be easily pruned to what the user needs, to reduce deployment size?

An example tool and format, used in some universities (see languages):

  1. Unitex/GramLab, from a university in France, https://unitexgramlab.org/ (LGPL)
  2. Unitex lexicons (22 languages, with varied coverage), https://unitexgramlab.org/language-resources (LGPLLR)

They use the DELA class of dictionaries (I couldn't find a better link that describes the DELA format).
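
To make the format discussion more concrete, here is a minimal sketch of reading one inflected-form (DELAF-style) entry. It assumes the simplified line layout `inflected,lemma.POS+trait:inflection`, with an empty lemma meaning "same as the inflected form". The names `DelafEntry` and `parse_delaf_line` are hypothetical, and the sketch ignores escaped separators and comments, so treat it as an illustration rather than a complete DELA reader.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DelafEntry:
    """One inflected form (hypothetical container, not part of Unitex)."""
    inflected: str
    lemma: str
    pos: str
    traits: list[str] = field(default_factory=list)       # codes after '+'
    inflections: list[str] = field(default_factory=list)  # codes after ':'


def parse_delaf_line(line: str) -> DelafEntry:
    """Parse a line shaped like 'mice,mouse.N+Anl:p'.

    Assumption: no escaped ',' or '.' inside the word forms.
    """
    inflected, comma, rest = line.strip().partition(",")
    if not comma:
        raise ValueError(f"missing ',' in entry: {line!r}")
    lemma, dot, codes = rest.partition(".")
    if not dot:
        raise ValueError(f"missing '.' in entry: {line!r}")
    # Grammatical part comes first, then zero or more ':'-separated
    # inflectional codes.
    grammar, *inflections = codes.split(":")
    pos, *traits = grammar.split("+")
    # An empty lemma is taken to mean "identical to the inflected form".
    return DelafEntry(inflected, lemma or inflected, pos, traits, inflections)


entry = parse_delaf_line("mice,mouse.N+Anl:p")
print(entry.lemma, entry.pos, entry.inflections)
```

A line-oriented text format like this is easy to convert and prune (criteria 2 and 5), but it would likely need to be compiled into a more compact structure for fast runtime lookup (criterion 3).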

What are other options we can use? Other criteria for selecting a lexicon/dictionary?
