79536267

Date: 2025-03-26 11:37:55
Score: 4

I am also looking for the same information.

I am trying to convert the tokenizer.json of the HF model https://huggingface.co/nickypro/tinyllama-15M/tree/main into a tokenizer.model in order to run Karpathy's llama2.c (see https://github.com/karpathy/llama2.c/blob/master/doc/train_llama_tokenizer.md).

I tried the following steps:

1. Extract the vocabulary from tokenizer.json.
2. Train a sentencepiece tokenizer with spm_train on the extracted vocabulary (vocab_size = 32000). This generates tokenizer.model.
3. Use tokenizer.py from llama2.c to convert tokenizer.model to tokenizer.bin.
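
For reference, step 1 can be sketched roughly as below. This is only a minimal sketch: it assumes the usual Hugging Face tokenizer.json layout, where `model.vocab` is either a `{token: id}` mapping (BPE) or a list of `[token, score]` pairs (Unigram). The function names `extract_vocab` and `load_vocab` are made up for illustration.

```python
import json
from typing import List


def extract_vocab(tokenizer_json: dict) -> List[str]:
    """Return the tokens of a parsed tokenizer.json, ordered by token ID."""
    vocab = tokenizer_json["model"]["vocab"]
    if isinstance(vocab, dict):
        # Assumed BPE-style layout: {token: id}. Sort by ID to recover the order.
        return [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
    # Assumed Unigram-style layout: [[token, score], ...], already in ID order.
    return [entry[0] for entry in vocab]


def load_vocab(path: str) -> List[str]:
    """Read tokenizer.json from disk and extract its vocabulary."""
    with open(path, encoding="utf-8") as f:
        return extract_vocab(json.load(f))
```

Calling `load_vocab("tokenizer.json")` returns a list whose index is the token ID, which makes it easy to compare against another tokenizer later.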

Although all of the steps above completed without errors, inference produced gibberish. I assume the problem lies in the generated tokenizer.model, possibly because the token IDs it assigns no longer match the ones in the original tokenizer.json. If anyone could help with this, I would really appreciate it.
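
One way to check whether the generated tokenizer.model is the culprit would be to compare its ID-to-piece mapping with the original vocabulary. The sketch below is an assumption-laden illustration, not a confirmed diagnosis: `find_id_mismatches` is a hypothetical helper, `original_vocab` is the `{token: id}` mapping from tokenizer.json, and `new_pieces` is the list of pieces from the new model in ID order.

```python
from typing import Dict, List, Optional, Tuple


def find_id_mismatches(
    original_vocab: Dict[str, int], new_pieces: List[str]
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Return (id, original_token, new_piece) for every ID where the two differ.

    A missing entry on either side is reported as None.
    """
    id_to_token = {idx: tok for tok, idx in original_vocab.items()}
    all_ids = sorted(set(id_to_token) | set(range(len(new_pieces))))
    mismatches = []
    for idx in all_ids:
        original = id_to_token.get(idx)
        new = new_pieces[idx] if idx < len(new_pieces) else None
        if original != new:
            mismatches.append((idx, original, new))
    return mismatches


# With the sentencepiece package installed, new_pieces could be built like this:
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
#   new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
```

If this returns a long list, the model would be looking up embeddings for different tokens than the ones it was trained on, which would be consistent with gibberish output.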

@user2741831

Reasons:
  • Blacklisted phrase (1): I am trying to
  • Blacklisted phrase (2): I am also looking
  • Long answer (-0.5):
  • No code block (0.5):
  • Low reputation (1):
Posted by: Neetha Jyothish