79161090

Date: 2024-11-06 01:06:02
Score: 1.5
Natty:
Report link

As @AKX commented, you'll need to profile and tell us the time taken in different parts of your code, for example the document indexing line and the querying line.
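A minimal timing sketch with Python's time.perf_counter is shown below; the SimpleDirectoryReader/VectorStoreIndex calls, the "./data" folder, and the query string are placeholders for your own indexing and querying lines, and the import path assumes a recent llama-index version:

import time
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

t0 = time.perf_counter()
documents = SimpleDirectoryReader("./data").load_data()   # placeholder for your document loading
index = VectorStoreIndex.from_documents(documents)        # your indexing line
print(f"Indexing took {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
response = index.as_query_engine().query("Your question here")   # your querying line
print(f"Query took {time.perf_counter() - t0:.2f}s")

This uses whatever LLM and embedding model you have configured; the point is only to wrap each stage with timers so you can see where the time goes.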

You can try using a lighter LLM like Llama 2 7B Q4_K_M.
If you're using LlamaCPP, pass model_kwargs={"n_gpu_layers": -1} to offload all layers to the GPU for faster inference. For example:

from llama_index.llms.llama_cpp import LlamaCPP  # in older llama-index versions: from llama_index.llms import LlamaCPP

llm_url = 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf'
llm = LlamaCPP(model_url=llm_url, temperature=0.7, max_new_tokens=256, context_window=4096, generate_kwargs={"stop": ["</s>", "[INST]", "[/INST]"]}, model_kwargs={"n_gpu_layers": -1}, verbose=True)
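
To make sure the query step actually uses this LLM (rather than the default), set it globally before building the index. A minimal sketch, assuming llama-index >= 0.10's Settings API and a local HuggingFace embedding model; the embedding model name and the "./data" folder are just examples:

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.llm = llm                                   # the LlamaCPP instance from above
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")  # example local embedding model

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())
query_engine = index.as_query_engine(streaming=True)  # streaming prints tokens as they are generated
query_engine.query("Your question here").print_response_stream()

With streaming on, the gap before the first printed token is a reasonable proxy for your generation latency.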

You can refer to a full working script I made a while ago that takes ~3 seconds to index documents and ~5 seconds to produce the first generated token: https://colab.research.google.com/github/kazcfz/LlamaIndex-RAG/blob/main/LlamaIndex_RAG.ipynb

Reasons:
  • Contains signature (1):
  • Long answer (-0.5):
  • Has code block (-0.5):
  • User mentioned (1): @AKX
  • Low reputation (0.5):
Posted by: kazcfz