BPE: don't merge categories #28

Open
thammegowda opened this issue Aug 10, 2020 · 1 comment
Comments

@thammegowda
Member

Keep certain categories of characters separate: don't merge them even when the pair is frequent enough to qualify for a BPE merge.

  1. digits
  2. punctuation
  3. dates, months, years
  4. ... anything else?

Watch out: be language agnostic. Use the Unicode character tables to decide whether a character is a digit or punctuation.
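One way this could look, as a rough sketch and not the project's implementation: classify each character by its Unicode general category using Python's standard `unicodedata` module, then skip any candidate pair that touches a protected class when picking the next merge. The names `char_class`, `is_protected`, `can_merge`, and `best_pair` are hypothetical, and the `pair_counts` dictionary stands in for whatever pair statistics the BPE trainer keeps. It covers only digits and punctuation; month names are ordinary letters, so item 3 would need a separate rule.

```python
import unicodedata


def char_class(ch: str) -> str:
    """Coarse class from the Unicode general category: 'digit', 'punct', or 'other'."""
    cat = unicodedata.category(ch)
    if cat.startswith("N"):  # Nd, Nl, No: decimal digits and other numeric characters
        return "digit"
    if cat.startswith("P"):  # Pc, Pd, Ps, Pe, Pi, Pf, Po: punctuation
        return "punct"
    # Symbols such as '$' (Sc) or '+' (Sm) are not punctuation in Unicode terms;
    # treating them as 'other' here is an assumption of this sketch.
    return "other"


def is_protected(token: str) -> bool:
    """True if the token contains any digit or punctuation character."""
    return any(char_class(ch) != "other" for ch in token)


def can_merge(left: str, right: str) -> bool:
    """Allow a merge only when neither side contains a protected character."""
    return not (is_protected(left) or is_protected(right))


def best_pair(pair_counts):
    """Most frequent pair that is allowed to merge, or None if every pair is blocked."""
    allowed = {pair: n for pair, n in pair_counts.items() if can_merge(*pair)}
    if not allowed:
        return None
    return max(allowed, key=allowed.get)


# Hypothetical pair statistics: the digit pair is the most frequent but is skipped.
counts = {("1", "9"): 100, (",", "a"): 60, ("t", "h"): 40}
print(best_pair(counts))
```

Because the check relies on general categories and not on ASCII ranges, it also treats digits and punctuation from other scripts (for example Arabic-Indic digits or CJK full stops) as protected.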

@thammegowda
Member Author

Related:
- Do NLP Models Know Numbers? Probing Numeracy in Embeddings; Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, Matt Gardner. https://arxiv.org/pdf/1909.07940.pdf
