Hi,
Thanks for reaching out to Microsoft Q&A.
The error means that the document cracking (parsing) step that runs before the Azure built-in cognitive skills failed to extract any meaningful content from the PDFs. As a result, the 'text' input is null or empty, and downstream skills (such as language detection or key phrase extraction) have nothing to work on.
Likely Reasons for Failure:
- PDFs are scanned images or non-text PDFs
  - If the PDFs contain only images (scanned pages) with no embedded text, the indexer cannot extract text unless OCR is explicitly enabled via a custom skillset.
- Corrupted or malformed PDFs
  - Some PDFs may not conform well to the standard. Adobe Reader may still open them, but Azure Search’s document cracking library (based on PDFium or iFilter) might fail.
- Large embedded images or unusual encoding
  - PDFs with heavy embedded images or uncommon encodings might trip the parser.
- Document splitting did not preserve structure
  - Splitting PDFs manually might have resulted in chunks with no textual content.
Fixes and Recommendations to try:
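1. Enable OCR for image-based PDFs
If the PDFs are scanned or image-based, note that OCR is not enabled by default in basic configurations. You will need to:
Create a custom skillset with OCR enabled. The indexer must also generate normalized images (`imageAction` set to `generateNormalizedImages`), because the OCR skill reads its input from `/document/normalized_images/*`.
Example skillset configuration: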
{
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "context": "/document/normalized_images/*",
      "defaultLanguageCode": "en",
      "inputs": [
        { "name": "image", "source": "/document/normalized_images/*" }
      ],
      "outputs": [
        { "name": "text", "targetName": "ocrText" }
      ]
    }
  ]
}
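The OCR skill only receives images if the indexer is configured to produce them. Below is a minimal sketch of a matching indexer definition, built as a plain Python dict that can be serialized and sent to the REST API. The indexer, data source, index, and skillset names are placeholders for illustration, not values from your setup.

```python
import json


def build_indexer_definition(name, data_source, index, skillset):
    """Return an indexer definition that extracts images so OCR can run."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "parameters": {
            "configuration": {
                # Extract both text and metadata during document cracking.
                "dataToExtract": "contentAndMetadata",
                # Populate /document/normalized_images/* for the OCR skill.
                "imageAction": "generateNormalizedImages",
            }
        },
    }


if __name__ == "__main__":
    # Placeholder names; replace with your own resources.
    definition = build_indexer_definition(
        "pdf-indexer", "pdf-blob-datasource", "pdf-index", "pdf-ocr-skillset"
    )
    print(json.dumps(definition, indent=2))
```

Without `imageAction`, `/document/normalized_images/*` stays empty and the OCR skill has nothing to process.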
2. Try opening the PDFs with Adobe Acrobat Reader or `pdftotext`
Confirm whether the text can be selected and copied manually.
Run `pdftotext` (a command-line tool, also usable via a Python wrapper) to see whether content is extractable.
If no text comes out, it is an image-based PDF.
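The check in step 2 can be scripted. This is a rough sketch that assumes the `pdftotext` binary is installed and on the PATH. The function names and the `min_chars` threshold of 20 are illustrative choices, not an official rule.

```python
import shutil
import subprocess


def extract_with_pdftotext(pdf_path, timeout=60):
    """Run `pdftotext <pdf> -` and return the text written to stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the output to stdout
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        timeout=timeout,
        check=True,
    )
    return result.stdout


def looks_image_based(text, min_chars=20):
    """Heuristic: almost no non-whitespace output suggests a scanned PDF."""
    visible = "".join(text.split())  # drops spaces, newlines, form feeds
    return len(visible) < min_chars
```

For example, `looks_image_based(extract_with_pdftotext("sample.pdf"))` returning `True` suggests the file needs OCR. Here `sample.pdf` is a placeholder path.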
3. Preprocess PDFs using Azure AI Document Intelligence (formerly Form Recognizer)
Upload a PDF to Document Intelligence Studio (formerly Form Recognizer Studio) to see whether it can extract the text.
If it can, consider adding a custom pre-processing step before ingestion.
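A pre-processing step of this kind could look roughly like the sketch below. It assumes the `azure-ai-formrecognizer` package and its `DocumentAnalysisClient`, with the `prebuilt-read` model. The client is passed in so the function stays easy to test, and the function name is hypothetical.

```python
def extract_text_with_document_intelligence(client, pdf_bytes, model_id="prebuilt-read"):
    """Send PDF bytes to the service and return the recognized text."""
    # begin_analyze_document starts a long-running operation and returns a poller.
    poller = client.begin_analyze_document(model_id, document=pdf_bytes)
    result = poller.result()  # blocks until the analysis finishes
    return result.content or ""


# Possible usage (endpoint and key are placeholders for your own resource):
#
#   from azure.core.credentials import AzureKeyCredential
#   from azure.ai.formrecognizer import DocumentAnalysisClient
#
#   client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
#   with open("scanned.pdf", "rb") as f:
#       text = extract_text_with_document_intelligence(client, f.read())
```

The returned text can then be written to a `.txt` blob and indexed in place of the original PDF.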
4. Upgrade to a higher tier (Standard and above)
The Basic tier has limitations, including:
Maximum skill execution time and parallelism.
OCR might perform worse or be throttled.
5. Log and Retry Mechanism
Set up logging for skillset execution using Application Insights.
Mark and retry documents that fail parsing via a manual or script-based pipeline.
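As a sketch of the "mark and retry" idea: the snippet below collects failing document keys from an indexer status payload and retries them through a callback. It assumes the payload has a `lastResult.errors` list whose items carry `key` and `errorMessage`. The `reprocess` callback is hypothetical and stands in for whatever re-ingestion step you choose.

```python
def failed_document_keys(status):
    """Map each failing document key to its list of error messages."""
    last = status.get("lastResult") or {}
    failed = {}
    for err in last.get("errors") or []:
        key = err.get("key")
        if key:  # some errors are not tied to a single document
            failed.setdefault(key, []).append(err.get("errorMessage", ""))
    return failed


def retry_failed(keys, reprocess, max_attempts=3):
    """Call reprocess(key) up to max_attempts times; return keys still failing."""
    still_failing = []
    for key in keys:
        for _ in range(max_attempts):
            try:
                reprocess(key)
                break  # success, move on to the next key
            except Exception:
                continue  # try again
        else:
            # Loop finished without a successful call.
            still_failing.append(key)
    return still_failing
```

Documents that remain in the returned list after all attempts are good candidates for manual inspection or OCR-based preprocessing.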
---
Temporary Workaround:
If you urgently need to ingest these PDFs for a demo or POC:
- Open the PDFs manually.
- Copy and paste the text into `.txt` files, or use Python (`PyMuPDF` or `pdfminer`) to extract the text.
- Upload these `.txt` files to the blob container and reindex them.
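For the Python route, a minimal sketch with `PyMuPDF` (imported as `fitz`) might look like this. The function names and folder layout are illustrative. Note that image-only PDFs will still produce empty text, since this does not perform OCR.

```python
from pathlib import Path


def txt_path_for(pdf_path, out_dir):
    """Return the .txt output path for a given PDF, inside out_dir."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def pdf_to_txt(pdf_path, out_dir):
    """Extract embedded text from every page and save it as a .txt file."""
    import fitz  # PyMuPDF; imported lazily so txt_path_for works without it

    out_path = txt_path_for(pdf_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with fitz.open(pdf_path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling `pdf_to_txt("docs/report.pdf", "txt_out")` would write `txt_out/report.txt`, which can then be uploaded to the blob container. Both paths are placeholders.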
Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.