Very intriguing question. I guess there are a number of ways to solve this, but for practicality's sake let's keep it simple.

I think you can benefit from indexing your markdown files in Chroma first, then searching them, and finally asking an LLM (e.g. OpenAI GPT-4o) to generate the markdown for you. A typical RAG app.

A side note: you can also embed the images themselves for even better retrieval context, but I won't include that part here for brevity. Feel free to join the Chroma Discord and we can discuss it further (look for @taz).

My suggestion is to process the MD files, extract the images of each MD file as metadata, and store that in Chroma; the metadata can then be passed on to the LLM for generation. Since this is simpler to illustrate in Python, I'll assume you can either convert the following code to TS or use a Python backend that handles the ingestion of the markdown files.

With the above out of the way, let's dive in. First we'll create a custom Langchain🦜🔗 Markdown loader. We need a custom one because the off-the-shelf loaders can't handle image tags, or at least don't know what to do with them.

from typing import Dict, Iterator, Union, Any, Optional, List
from langchain_core.documents import Document
import json
from langchain_community.document_loaders.base import BaseLoader


class CustomMDLoader(BaseLoader):
    def __init__(
            self,
            markdown_content: str,
            *,
            images_as_metadata: bool = False,  # store <img> src/alt pairs in the Document metadata
            beautifulsoup_kwargs: Optional[Dict[str, Any]] = None,
            split_by: Optional[str] = None,  # e.g. "h1" to produce one Document per heading
    ) -> None:
        try:
            from bs4 import BeautifulSoup
        except ImportError:
            raise ImportError(
                "beautifulsoup4 package not found, please install it with "
                "`pip install beautifulsoup4`"
            )
        try:
            import mistune
        except ImportError:
            raise ImportError(
                "mistune package not found, please install it with "
                "`pip install mistune`"
            )
        
        self._markdown_content = markdown_content
        self._images_as_metadata = images_as_metadata
        self._beautifulsoup_kwargs = beautifulsoup_kwargs or {"features": "html.parser"}
        self._split_by = split_by
        
    def get_metadata_for_element(self, element: "PageElement") -> Dict[str, Union[str, None]]:
        # Collect the <img> tags under this element. The list is stored as a JSON string
        # because Chroma metadata values must be scalars (str, int, float, bool).
        metadata: Dict[str, Union[str, None]] = {}
        if hasattr(element, "find_all") and self._images_as_metadata:
            metadata["images"] = json.dumps(
                [{"src": img.get("src"), "alt": img.get("alt")} for img in element.find_all("img")]
            )
        return metadata
    
    def get_document_for_elements(self, elements: List["PageElement"]) -> Document:
        # Merge the text of all elements into a single Document and accumulate their image metadata.
        text = " ".join([el.get_text() for el in elements])
        metadata: Dict[str, Union[str, None]] = {}
        for el in elements:
            new_meta = self.get_metadata_for_element(el)
            if "images" in new_meta and "images" in metadata:
                old_list = json.loads(metadata["images"])
                new_list = json.loads(new_meta["images"])
                metadata["images"] = json.dumps(old_list + new_list)
            elif "images" in new_meta:
                metadata["images"] = new_meta["images"]
        return Document(page_content=text, metadata=metadata)
        
    def split_by(self, parent_page_element: "PageElement", tag: Optional[str] = None) -> Iterator[Document]:
        # Without a split tag (or with fewer than two occurrences) return the whole page as one Document.
        if tag is None or len(parent_page_element.find_all(tag)) < 2:
            yield self.get_document_for_elements([parent_page_element])
        else:
            found_tags = parent_page_element.find_all(tag)
            # Emit one Document per section between consecutive split tags (e.g. between two h1 headings).
            for start_tag, end_tag in zip(found_tags, found_tags[1:]):
                elements_between = []
                # Iterate through siblings of the start tag until the next split tag is reached
                for element in start_tag.next_siblings:
                    if element == end_tag:
                        break
                    elements_between.append(element)
                doc = self.get_document_for_elements(elements_between)
                doc.metadata["split"] = start_tag.get_text()
                yield doc
            # The last split tag has no end tag, so take everything after it.
            last_tag = found_tags[-1]
            elements_between = []
            for element in last_tag.next_siblings:
                elements_between.append(element)
            doc = self.get_document_for_elements(elements_between)
            doc.metadata["split"] = last_tag.get_text()
            yield doc
     
            
    def lazy_load(self) -> Iterator[Document]:
        import mistune
        from bs4 import BeautifulSoup

        # Convert the markdown to HTML, then parse it so we can walk the elements and extract images.
        html = mistune.create_markdown()(self._markdown_content)
        soup = BeautifulSoup(html, **self._beautifulsoup_kwargs)
        if self._split_by is not None:
            for doc in self.split_by(soup, tag=self._split_by):
                yield doc
        else:
            for doc in self.split_by(soup):
                yield doc

Note: To use the above you'll have to install the following libs: pip install beautifulsoup4 mistune langchain langchain-community

The above takes the content of your MD file, converts it to HTML, and processes it with beautifulsoup4. It can also split the MD file by a tag such as a heading (e.g. h1). Here's the resulting Langchain🦜🔗 document:

Document(id='4ce64f5c-7873-4c3d-a17f-5531486d3312', metadata={'images': '[{"src": "https://images.example.com/image1.png", "alt": "Image"}]', 'split': 'Chapter 1: Dogs'}, page_content='\n In this chapter we talk about dogs. Here is an image of a dog  \n Dogs make for good home pets. They are loyal and friendly. They are also very playful. \n')

We can then ingest some data into Chroma using the following Python script (you can add this to a Python backend so it runs automatically whenever an MD file is uploaded). Here's a sample MD file we can use:

# Chapter 1: Dogs
In this chapter we talk about dogs. Here is an image of a dog ![Image](https://images.example.com/image1.png)

Dogs make for good home pets. They are loyal and friendly. They are also very playful.

# Chapter 2: Cats

In this chapter we talk about cats. Here is an image of a cat ![Image](https://images.example.com/image2.png)

Cats are very independent animals. They are also very clean and like to groom themselves.

# Chapter 3: Birds

In this chapter we talk about birds. Here is an image of a bird ![Image](https://images.example.com/image3.png)

import uuid

import chromadb

# Load and split the sample markdown file by h1 headings
loader = CustomMDLoader(
    markdown_content=open("test.md").read(),
    images_as_metadata=True,
    beautifulsoup_kwargs={"features": "html.parser"},
    split_by="h1",
)
docs = loader.load()

client = chromadb.HttpClient(host="localhost", port=8000)
col = client.get_or_create_collection("test")

col.add(
    # depending on your langchain version doc.id may be None, so fall back to a generated UUID
    ids=[doc.id or str(uuid.uuid4()) for doc in docs],
    documents=[doc.page_content for doc in docs],
    metadatas=[doc.metadata for doc in docs],
)
# resulting docs: [Document(id='4ce64f5c-7873-4c3d-a17f-5531486d3312', metadata={'images': '[{"src": "https://images.example.com/image1.png", "alt": "Image"}]', 'split': 'Chapter 1: Dogs'}, page_content='\n In this chapter we talk about dogs. Here is an image of a dog  \n Dogs make for good home pets. They are loyal and friendly. They are also very playful. \n'), Document(id='7e1c3ab1-f737-42ea-85cc-9ac21bfd9b8b', metadata={'images': '[{"src": "https://images.example.com/image2.png", "alt": "Image"}]', 'split': 'Chapter 2: Cats'}, page_content='\n In this chapter we talk about cats. Here is an image of a cat  \n Cats are very independent animals. They are also very clean and like to groom themselves. \n'), Document(id='4d111946-f52e-4ce0-a9ff-5ffde8536736', metadata={'images': '[{"src": "https://images.example.com/image3.png", "alt": "Image"}]', 'split': 'Chapter 3: Birds'}, page_content='\n In this chapter we talk about birds. Here is an image of a bird  \n')]

As a last step, in your TS (React) chatbot use the Chroma TS client to search for the content you want (see the official docs).

import { ChromaClient } from "chromadb";

const client = new ChromaClient(); // defaults to http://localhost:8000
const collection = await client.getOrCreateCollection({ name: "test" }); // same collection we ingested into
const results = await collection.query({
  queryTexts: ["I want to learn about dogs"],
  nResults: 1, // how many results to return
});

From the query results, build a meta-prompt for the LLM, something like this:

based on the following content generate a markdown output that includes the text content and the image or images:

Text Content:  
 In this chapter we talk about dogs. Here is an image of a dog  
 Dogs make for good home pets. They are loyal and friendly. They are also very playful. 

Images:  [{'src': 'https://images.example.com/image1.png', 'alt': 'Image'}]
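
In code, a minimal sketch of that step could look like the following. I'm assuming the official openai Node SDK, the gpt-4o model name, an OPENAI_API_KEY environment variable, and that results is the query result from the snippet above; swap in whatever LLM setup you actually use.

import OpenAI from "openai";

// Assumes OPENAI_API_KEY is set in the environment and "results" comes from collection.query above
const openai = new OpenAI();

const textContent = results.documents[0][0] ?? "";       // top matching chunk
const images = results.metadatas[0][0]?.images ?? "[]";  // JSON string of images stored by the loader

const prompt = [
  "based on the following content generate a markdown output that includes the text content and the image or images:",
  "",
  "Text Content:",
  textContent,
  "",
  `Images: ${images}`,
].join("\n");

const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: prompt }],
});
const markdown = completion.choices[0].message.content;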

If you're using OpenAI GPT-4o you should get something like this:

# Chapter: Dogs

In this chapter, we talk about dogs. Here is an image of a dog:

![Image](https://images.example.com/image1.png)

Dogs make for good home pets. They are loyal and friendly. They are also very playful.

You can then render the markdown in your chat response to the user.
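
For the rendering step, one common option in a React app is the react-markdown package (that library choice is my assumption, any markdown renderer will do). A minimal sketch:

import ReactMarkdown from "react-markdown";

// Renders the LLM's markdown answer (text plus image links) inside the chat UI.
// The component name and prop are purely illustrative.
function AssistantMessage({ markdown }: { markdown: string }) {
  return <ReactMarkdown>{markdown}</ReactMarkdown>;
}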

I wanted to keep this short, but it seems there isn't a super short way of describing even one of the many approaches you could take to solve this challenge.
