79161090

Date: 2024-11-06 01:06:02
Score: 1.5
Natty:
Report link

As @AKX commented, you'll need to profile and tell us the time taken in different parts of your code, for example the document indexing line and the querying line.
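A minimal timing sketch with Python's time.perf_counter is shown below; the SimpleDirectoryReader/VectorStoreIndex calls, the "./data" folder, and the query string are placeholders for your own indexing and querying lines, and the import path assumes a recent llama-index version:

import time
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

t0 = time.perf_counter()
documents = SimpleDirectoryReader("./data").load_data()   # placeholder for your document loading
index = VectorStoreIndex.from_documents(documents)        # your indexing line
print(f"Indexing took {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
response = index.as_query_engine().query("Your question here")   # your querying line
print(f"Query took {time.perf_counter() - t0:.2f}s")

This uses whatever LLM and embedding model you have configured; the point is only to wrap each stage with timers so you can see where the time goes.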

You can try using a lighter LLM like Llama 2 7B Q4_K_M.
If you're using LlamaCPP, pass model_kwargs={"n_gpu_layers": -1} to offload all layers to the GPU for faster inference. For example:

from llama_index.llms.llama_cpp import LlamaCPP  # in older llama-index versions: from llama_index.llms import LlamaCPP

llm_url = 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf'
llm = LlamaCPP(model_url=llm_url, temperature=0.7, max_new_tokens=256, context_window=4096, generate_kwargs={"stop": ["</s>", "[INST]", "[/INST]"]}, model_kwargs={"n_gpu_layers": -1}, verbose=True)
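
To make sure the query step actually uses this LLM (rather than the default), set it globally before building the index. A minimal sketch, assuming llama-index >= 0.10's Settings API and a local HuggingFace embedding model; the embedding model name and the "./data" folder are just examples:

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.llm = llm                                   # the LlamaCPP instance from above
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")  # example local embedding model

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())
query_engine = index.as_query_engine(streaming=True)  # streaming prints tokens as they are generated
query_engine.query("Your question here").print_response_stream()

With streaming on, the gap before the first printed token is a reasonable proxy for your generation latency.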

You can refer to a full working script I made a while ago that takes ~3 seconds to index documents and ~5 seconds to produce the first generated token: https://colab.research.google.com/github/kazcfz/LlamaIndex-RAG/blob/main/LlamaIndex_RAG.ipynb

Reasons:
  • Contains signature (1):
  • Long answer (-0.5):
  • Has code block (-0.5):
  • User mentioned (1): @AKX
  • Low reputation (0.5):
Posted by: kazcfz