I´m trying to create an Azure Search vector index as well in the Azure ML UI (Prompt flow) portal but having an error in the component "LLM - Crack and Chunk Data": enter image description here
The error says: User program failed with BaseRagServiceError: Rag system error
Part of the logs is:
input_data=/mnt/azureml/cr/j/60652b595f69/cap/data-capability/wd/INPUT_input_data
input_glob=**/*
allowed_extensions=.txt,.md,.html,.htm,.py,.pdf,.ppt,.pptx,.doc,.docx,.xls,.xlsx,.csv,.json
chunk_size=1024
chunk_overlap=0
output_chunks=/mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/output_chunks
data_source_url=azureml://locations/XXXXX/workspaces/04XXXX0/data/vector-index-input-1734572551882/versions/1
document_path_replacement_regex=None
max_sample_files=-1
use_rcts=True
output_format=jsonl
custom_loader=None
doc_intel_connection_id=None
output_title_chunk=None
openai_api_version=None
openai_api_type=None
[2024-12-19 01:43:28] INFO azureml.rag.crack_and_chunk.crack_and_chunk - ActivityStarted, crack_and_chunk (activity.py:108)
[2024-12-19 01:43:28] INFO azureml.rag.crack_and_chunk - Processing file: What is prompt flow.pdf (crack_and_chunk.py:127)
/azureml-envs/rag-embeddings/lib/python3.9/site-packages/pypdf/_crypt_providers/_cryptography.py:32: CryptographyDeprecationWarning: ARC4 has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.ARC4 and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
from cryptography.hazmat.primitives.ciphers.algorithms import AES, ARC4
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.chunking - No file_chunks to yield, continuing (chunking.py:237)
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.chunking - No file_chunks to yield, continuing (chunking.py:237)
[2024-12-19 01:43:31] INFO azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Filtered 0 files out of 1 (crack_and_chunk.py:129)
[2024-12-19 01:43:31] INFO azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Skipped extensions: {} (crack_and_chunk.py:130)
[2024-12-19 01:43:31] INFO azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Kept extensions: {
".pdf": 1
} (crack_and_chunk.py:133)
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.cracking - [DocumentChunksIterator::crack_documents] Total time to load files: 0.30446887016296387
{
".txt": 0.0,
".md": 0.0,
".html": 0.0,
".htm": 0.0,
".py": 0.0,
".pdf": 1.0,
".ppt": 0.0,
".pptx": 0.0,
".doc": 0.0,
".docx": 0.0,
".xls": 0.0,
".xlsx": 0.0,
".csv": 0.0,
".json": 0.0
} (cracking.py:381)
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.cracking - [DocumentChunksIterator::crack_documents] Total time to load files: 0.30446887016296387
{
".txt": 0.0,
".md": 0.0,
".html": 0.0,
".htm": 0.0,
".py": 0.0,
".pdf": 1.0,
".ppt": 0.0,
".pptx": 0.0,
".doc": 0.0,
".docx": 0.0,
".xls": 0.0,
".xlsx": 0.0,
".csv": 0.0,
".json": 0.0
} (cracking.py:381)
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.chunking - [DocumentChunksIterator::split_documents] Total time to split 1 documents into 0 chunks: 0.9676399230957031 (chunking.py:247)
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.chunking - [DocumentChunksIterator::split_documents] Total time to split 1 documents into 0 chunks: 0.9676399230957031 (chunking.py:247)
[2024-12-19 01:43:31] INFO azureml.rag.crack_and_chunk - Processed 0 files (crack_and_chunk.py:208)
[2024-12-19 01:43:31] INFO azureml.rag.crack_and_chunk - No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/* (crack_and_chunk.py:215)
[2024-12-19 01:43:31] ERROR azureml.rag.crack_and_chunk.crack_and_chunk - ServiceError: intepreted error = Rag system error, original error = No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/*. (exceptions.py:124)
[2024-12-19 01:43:36] ERROR azureml.rag.crack_and_chunk.crack_and_chunk - crack_and_chunk failed with exception: Traceback (most recent call last):
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk.py", line 229, in main_wrapper
map_exceptions(main, activity_logger, args, logger, activity_logger)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 126, in map_exceptions
raise e
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 118, in map_exceptions
return func(*func_args, **kwargs)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk.py", line 220, in main
raise ValueError(f"No chunked documents found in {args.input_data} with glob {args.input_glob}.")
ValueError: No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/*.
(crack_and_chunk.py:231) ...................................
It seems the chunk is not doing nothing. My file is PDF format file with only one page without images to let it more easy.
Someone has a suggestion? thank you in advanced!!