I am also looking for the same information.
I am trying to convert the tokenizer.json of this HF model (https://huggingface.co/nickypro/tinyllama-15M/tree/main) into a tokenizer.model in order to run Karpathy's llama2.c - https://github.com/karpathy/llama2.c/blob/master/doc/train_llama_tokenizer.md.
I tried the following steps (a rough sketch is included after the list):
1. Extract the vocabulary from tokenizer.json.
2. Train a SentencePiece tokenizer with spm_train using the extracted vocabulary (vocab_size = 32000); this produces tokenizer.model.
3. Use tokenizer.py from the llama2.c repo to convert tokenizer.model into tokenizer.bin.
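For reference, here is a minimal sketch of what those steps looked like on my side. The file names (vocab.txt, corpus.txt) are placeholders from my setup, and the Python SentencePiece API call stands in for the spm_train CLI invocation I actually ran:

```python
# Rough sketch of the conversion attempt; paths and options are illustrative.
import json
import sentencepiece as spm

# Step 1: pull the vocabulary out of the HF tokenizer.json
with open("tokenizer.json", "r", encoding="utf-8") as f:
    tok = json.load(f)
vocab = tok["model"]["vocab"]  # dict: token string -> id
with open("vocab.txt", "w", encoding="utf-8") as f:
    for token in sorted(vocab, key=vocab.get):  # write tokens in id order
        f.write(token + "\n")

# Step 2: train a SentencePiece BPE model at the target vocab size.
# corpus.txt is a placeholder for the training text I fed in; this Python call
# is the equivalent of the spm_train command line.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",   # writes tokenizer.model / tokenizer.vocab
    vocab_size=32000,
    model_type="bpe",
)

# Step 3: export tokenizer.model -> tokenizer.bin with llama2.c's tokenizer.py,
# roughly:  python tokenizer.py --tokenizer-model=tokenizer.model
# (check the script's argparse options for the exact flag)
```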
Although the steps above completed without errors, inference produced gibberish. I assume this has something to do with the generated tokenizer.model. Any help with this would be much appreciated.