
Byte-level BPE (BBPE) uses UTF-8 to encode every character into 1 to 4 bytes. To keep the base vocabulary size at 256 (the number of values a single byte can take), BBPE uses exactly 1 byte per base token. So when a character needs 2 or more bytes, BBPE splits those bytes into individual tokens (which means 1 character becomes 2, 3, or 4 separate tokens).

For example, the UTF-8 code of character "の" is E3 81 AE (3 bytes), so in BBPE, "の" is written as 3 different tokens: E3, 81, and AE.

(Note that these 3 tokens are independent of each other; the BPE merging step may or may not pair them up again, depending on how often they occur together.)
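The byte split in the example above can be checked with a minimal Python sketch. The helper name `to_byte_tokens` is made up for illustration and does not come from any tokenizer library; it only shows the base tokens before any merges are applied.

```python
def to_byte_tokens(text: str) -> list[str]:
    """Return one base token per UTF-8 byte, written as two-digit hex.

    Hypothetical helper for illustration: a real BBPE tokenizer would then
    run learned merges over these byte tokens.
    """
    return [f"{b:02X}" for b in text.encode("utf-8")]


print(to_byte_tokens("の"))  # ['E3', '81', 'AE']
print(to_byte_tokens("a"))   # ['61']
```

An ASCII character such as "a" stays a single token, while "の" expands to three.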

A BBPE tokenizer can make the initial token sequence up to 4x longer than a character-level BPE tokenizer would (the worst case is when every character takes 4 bytes in UTF-8), but that is the trade-off for keeping the base vocabulary as small as 256.
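To get a feel for the length increase, here is a rough sketch that compares the number of characters with the number of UTF-8 bytes for a few sample strings. It assumes no merges have been applied yet, so each byte counts as one token; `byte_token_count` is an illustrative name, not a library function.

```python
def byte_token_count(text: str) -> int:
    """Number of base byte tokens, assuming one token per UTF-8 byte (no merges)."""
    return len(text.encode("utf-8"))


for sample in ["hello", "のです", "😀😀"]:
    chars = len(sample)                 # Unicode code points
    tokens = byte_token_count(sample)   # UTF-8 bytes
    print(f"{sample!r}: {chars} chars -> {tokens} byte tokens ({tokens / chars:.1f}x)")
```

ASCII text stays at 1x, the Japanese sample grows to 3x, and the emoji sample hits the 4x worst case. After training, learned merges usually shrink these sequences again.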

The example above is taken from Figure 1 of the original paper on Byte-level Text Representation.
