79603556

Date: 2025-05-02 14:25:30
Score: 1
Natty:
Report link

Yeah, this actually comes up a lot when training a tokeniser from scratch. Just because a word shows up in your training data doesn’t mean it will end up in the vocab as its own token. It depends on how the tokeniser builds its vocabulary.

Even if “awesome” appears a fair number of times, it might not make it into the vocab as a full word. WordPiece tokenisers don’t just add whole words automatically; they try to balance coverage and compression, so a moderately common word can end up represented by subword pieces instead of a single token.

If you want common words like that to stay intact, here are a few things you can try (there’s a training sketch right after this list):

  • Increase the vocab_size you pass to the trainer; the bigger the vocab, the more whole words make it in.
  • If you set a min_frequency on the trainer, lower it so moderately common words aren’t filtered out before merging.
  • Explicitly add the words you care about with add_tokens(); added tokens are matched before the model runs, so they never get split.
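
For reference, here’s a minimal sketch of training a WordPiece tokeniser with the tokenizers library, which is where those knobs live. The corpus path, vocab size, and special-token list are placeholders for whatever your setup actually uses:

    from tokenizers import Tokenizer
    from tokenizers.models import WordPiece
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import WordPieceTrainer

    # Bare WordPiece model with a simple whitespace pre-tokenizer
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    trainer = WordPieceTrainer(
        vocab_size=16000,  # bigger vocab -> more whole words survive as single tokens
        min_frequency=2,   # pairs must appear at least this often to be merged
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )

    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus path
    tokenizer.save("tokenizer.json")

    # Quick sanity check: does the word come out as one piece or several?
    print(tokenizer.encode("this is awesome").tokens)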

Another thing to be aware of: when you load the tokeniser with BertTokenizer.from_pretrained(), it expects more than just a vocab file. It also looks for tokenizer_config.json, special_tokens_map.json, and possibly a few others, and if those aren’t there the load can behave strangely. You could try PreTrainedTokenizerFast instead, especially if you trained the tokeniser with the tokenizers library directly.
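
If you did train it with the tokenizers library, the fast-tokenizer route looks roughly like this (the tokenizer.json filename and the special tokens are assumptions based on how the tokeniser was saved):

    from transformers import PreTrainedTokenizerFast

    # Wraps the tokenizers-library output directly; no tokenizer_config.json
    # or special_tokens_map.json needed up front.
    hf_tok = PreTrainedTokenizerFast(
        tokenizer_file="tokenizer.json",  # whatever file your training step saved
        unk_token="[UNK]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        mask_token="[MASK]",
    )

    print(hf_tok.tokenize("this is awesome"))

    # save_pretrained() then writes tokenizer_config.json and
    # special_tokens_map.json for you, so later loads are clean.
    hf_tok.save_pretrained("my-tokeniser")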

You can also just open vocab.txt and search for “awesome”. If it isn’t in there as a full token, that would explain the split you’re seeing.
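
Something like this is enough to check, assuming your training step wrote out a vocab.txt (adjust the path to wherever yours lives):

    # Is "awesome" a full token in the trained vocab?
    with open("vocab.txt", encoding="utf-8") as f:
        vocab = {line.strip() for line in f}

    print("awesome" in vocab)  # False would explain the subword split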

Nothing looks broken in your code; this is just standard behaviour for how WordPiece handles vocab limits and slightly less common words. When I want to avoid unnecessary token splits, I’ve usually had better results with vocab sizes in the 8k to 16k range.

Reasons:
  • RegEx Blacklisted phrase (1): I want
  • Long answer (-1):
  • No code block (0.5):
  • Low reputation (0.5):
Posted by: Dmitry543