English language support #1

@mthebaud

English should be the next language to be implemented in Predict4All.
Implementing English support is mostly a matter of data and a few small implementations, as the structure of English is similar to French. The one case that may need specific handling in English is the apostrophe, which might require some tweaks to be handled well: most of the "may have to" items in the following lists are driven by this point.

A good start would be to create the package org.predict4all.nlp.language.english, using org.predict4all.nlp.language.french as a template.

Keep in mind that any language-specific code should sit behind interfaces: if something previously implemented for French has to behave differently in English, add something related to the LanguageModel. Never use if(language instanceof FrenchLanguageModel) ;-)
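As a minimal sketch of that principle, the snippet below lets the caller ask the language abstraction how to behave instead of testing its concrete type. All names here (LanguageModelSketch, keepsInnerApostrophe, ApostropheSplitterSketch) are hypothetical and simplified; the real LanguageModel interface in Predict4All may expose different methods.

```java
// Hypothetical, simplified stand-in; the real Predict4All LanguageModel
// interface may expose different methods.
interface LanguageModelSketch {
    /** Whether an apostrophe between two letters stays inside a single token. */
    boolean keepsInnerApostrophe();
}

final class FrenchModelSketch implements LanguageModelSketch {
    @Override
    public boolean keepsInnerApostrophe() {
        // Simplification: French elision ("l'arbre") is treated as two tokens.
        return false;
    }
}

final class EnglishModelSketch implements LanguageModelSketch {
    @Override
    public boolean keepsInnerApostrophe() {
        // Simplification: contractions and possessives ("don't", "John's") stay whole.
        return true;
    }
}

final class ApostropheSplitterSketch {
    private final LanguageModelSketch language;

    ApostropheSplitterSketch(LanguageModelSketch language) {
        this.language = language;
    }

    /** Splits a word on its first apostrophe only if the language asks for it. */
    String[] split(String word) {
        int index = word.indexOf('\'');
        if (index < 0 || language.keepsInnerApostrophe()) {
            return new String[] { word };
        }
        // Keep the apostrophe attached to the first part, e.g. "l'" + "arbre".
        return new String[] { word.substring(0, index + 1), word.substring(index + 1) };
    }
}
```

The splitter never checks which language it was given, so adding a third language would only require a new implementation of the interface.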

These are the steps to implement English prediction:

  • Find an open English dictionary with unigram data to replace the French Lexique.org
  • Create a clean English corpus (Wikipedia, plus subtitles or another language corpus); it should reach at least 20 million words
  • Implement specific TokenMatcher classes if needed (the list still has to be determined, as most of the French token matchers are directly correct for English)
  • Create unit tests for English
  • You may then have to modify the following (depending on whether English-specific problems appear):
    • Tokenizer: if the apostrophe should be handled differently in English
    • WordPrefixDetector: again, the apostrophe could cause problems
    • WordPredictor
  • (optional) Find an English stop-word dictionary (not used right now because it is only useful with semantics)
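To illustrate why the apostrophe matters for prefix detection, here is a rough sketch of how the word currently being typed could be extracted from English text. It is not the actual WordPrefixDetector; the class and method names are hypothetical, and treating both the ASCII and the typographic apostrophe as word characters is an assumption.

```java
// Hypothetical sketch, not the Predict4All WordPrefixDetector.
final class EnglishPrefixSketch {

    /** Returns the trailing run of letters and apostrophes, i.e. the word being typed. */
    static String currentPrefix(String typedText) {
        int start = typedText.length();
        while (start > 0) {
            char c = typedText.charAt(start - 1);
            // Assumption: both ' and U+2019 count as part of an English word.
            if (Character.isLetter(c) || c == '\'' || c == '\u2019') {
                start--;
            } else {
                break;
            }
        }
        return typedText.substring(start);
    }
}
```

With this sketch, `currentPrefix("I don'")` returns `don'`, so a predictor could still propose "don't". If the apostrophe were treated as a separator, the prefix would be empty. One known limitation: an opening single quote, as in `'hel`, would also be kept in the prefix, which is the kind of case the unit tests for English should cover.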

These are the steps to implement English correction rules:

  • Transfer the French rules that can be used directly (e.g. space, azerty, etc.)
  • Find other rules to implement (working with OT/ST is essential at this step!)
  • Verify WordCorrectionGenerator: some specific parts of the algorithm may not be fully compatible with English
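As an illustration of a rule that should carry over between languages, the sketch below proposes corrections for a missing space by splitting the input into two dictionary words. This is only an assumption about what a "space" rule might do; it is not the Predict4All correction rule format, and MissingSpaceRuleSketch and splitCandidates are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a language-independent "missing space" correction.
final class MissingSpaceRuleSketch {

    /** Returns every "left right" split where both halves are known words. */
    static List<String> splitCandidates(String input, Set<String> dictionary) {
        List<String> candidates = new ArrayList<>();
        for (int i = 1; i < input.length(); i++) {
            String left = input.substring(0, i);
            String right = input.substring(i);
            if (dictionary.contains(left) && dictionary.contains(right)) {
                candidates.add(left + " " + right);
            }
        }
        return candidates;
    }
}
```

For example, `splitCandidates("thecat", Set.of("the", "cat", "he"))` yields a single candidate, `the cat`. Only the dictionary changes between French and English here, which is why such rules are good candidates for direct transfer.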

Labels: enhancement (New feature or request)
