I would start by looking at Retrieval-Augmented Generation (RAG): retrieve only the parts of the video transcript that are relevant to a query and send those to the model, instead of sending the full transcript.
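To make the idea concrete, here is a minimal sketch of the retrieval step in plain Python. It is only an illustration under some assumptions: the transcript is available as one string, chunks are fixed-size word windows, and relevance is scored with a simple bag-of-words cosine similarity as a stand-in for a real embedding model. All function names (`chunk_transcript`, `retrieve`, `build_prompt`) and the prompt wording are hypothetical, not part of any library.

```python
import math
import re
from collections import Counter


def chunk_transcript(transcript: str, max_words: int = 40) -> list[str]:
    """Split a transcript into consecutive chunks of at most max_words words."""
    words = transcript.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; an assumed stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, bag_of_words(c)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Assemble a prompt containing only the retrieved excerpts."""
    context = "\n---\n".join(retrieve(query, chunks, top_k))
    return (
        "Answer using only these transcript excerpts:\n"
        f"{context}\n\nQuestion: {query}"
    )
```

You would call `chunk_transcript` once per video, then `build_prompt(query, chunks)` per question, and send the resulting string to whichever model you use. In practice you would likely swap `bag_of_words` for an embedding model and store the chunk vectors in a vector index, but the overall flow (chunk, score, keep the top few, build the prompt) stays the same.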