79603556

Date: 2025-05-02 14:25:30
Score: 1
Natty:
Report link

Yeah, this actually comes up a lot when training a tokeniser from scratch. Just because a word shows up in your training data doesn’t mean it will end up in the vocab as its own token. It depends on how the tokeniser builds its vocabulary.

Even if “awesome” appears a fair number of times, it might not make it into the vocab as a full word. WordPiece tokenisers don’t just add whole words automatically; they try to balance coverage and compression, so a moderately common word can end up represented by subword pieces instead of a single token.

If you want common words like that to stay intact, here are a few things you can try (there’s a training sketch right after this list):

  • Increase the vocab_size you pass to the trainer; the bigger the vocab, the more whole words make it in.
  • If you set a min_frequency on the trainer, lower it so moderately common words aren’t filtered out before merging.
  • Explicitly add the words you care about with add_tokens(); added tokens are matched before the model runs, so they never get split.
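
For reference, here’s a minimal sketch of training a WordPiece tokeniser with the tokenizers library, which is where those knobs live. The corpus path, vocab size, and special-token list are placeholders for whatever your setup actually uses:

    from tokenizers import Tokenizer
    from tokenizers.models import WordPiece
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import WordPieceTrainer

    # Bare WordPiece model with a simple whitespace pre-tokenizer
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    trainer = WordPieceTrainer(
        vocab_size=16000,  # bigger vocab -> more whole words survive as single tokens
        min_frequency=2,   # pairs must appear at least this often to be merged
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )

    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus path
    tokenizer.save("tokenizer.json")

    # Quick sanity check: does the word come out as one piece or several?
    print(tokenizer.encode("this is awesome").tokens)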

Another thing to be aware of: when you load the tokeniser with BertTokenizer.from_pretrained(), it expects more than just a vocab file. It also looks for tokenizer_config.json, special_tokens_map.json, and possibly a few others, and if those aren’t there the load can behave strangely. You could try PreTrainedTokenizerFast instead, especially if you trained the tokeniser with the tokenizers library directly.
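
If you did train it with the tokenizers library, the fast-tokenizer route looks roughly like this (the tokenizer.json filename and the special tokens are assumptions based on how the tokeniser was saved):

    from transformers import PreTrainedTokenizerFast

    # Wraps the tokenizers-library output directly; no tokenizer_config.json
    # or special_tokens_map.json needed up front.
    hf_tok = PreTrainedTokenizerFast(
        tokenizer_file="tokenizer.json",  # whatever file your training step saved
        unk_token="[UNK]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        mask_token="[MASK]",
    )

    print(hf_tok.tokenize("this is awesome"))

    # save_pretrained() then writes tokenizer_config.json and
    # special_tokens_map.json for you, so later loads are clean.
    hf_tok.save_pretrained("my-tokeniser")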

You can also just open vocab.txt and search for “awesome”. If it isn’t in there as a full token, that would explain the split you’re seeing.
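
Something like this is enough to check, assuming your training step wrote out a vocab.txt (adjust the path to wherever yours lives):

    # Is "awesome" a full token in the trained vocab?
    with open("vocab.txt", encoding="utf-8") as f:
        vocab = {line.strip() for line in f}

    print("awesome" in vocab)  # False would explain the subword split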

Nothing looks broken in your code; this is just standard behaviour for how WordPiece handles vocab limits and slightly less common words. When I want to avoid unnecessary token splits, I’ve usually had better results with vocab sizes in the 8k to 16k range.

Reasons:
  • RegEx Blacklisted phrase (1): I want
  • Long answer (-1):
  • No code block (0.5):
  • Low reputation (0.5):
Posted by: Dmitry543