Indexer error on Azure search

Question

Kelvin Kuria

My Azure Search service refuses to index 3 PDFs that are in my blob storage. I have isolated these PDFs and have even tried splitting them into smaller PDFs and ingesting them through the Azure OpenAI foundry, but I'm getting this error:

Your data was connected with the following warnings

Could not execute skill because one or more skill input was invalid. Required skill input is missing or empty. Name: 'text', Source: '$(/document/content)'. (5 item(s) impacted)

After indexing, when I try the chat playground, I get this error:

The requested information is not found in the retrieved data. Please try another query or topic.

I'm on the Basic tier of Azure Search.

Does anyone have a solution?

1 answer

Answer 1

Hi,

Thanks for reaching out to Microsoft Q&A.

The error means that the document cracking (parsing) step in the Azure built-in cognitive skills pipeline did not extract any text from the PDFs. As a result, `/document/content` is null or empty, so the 'text' input is missing and downstream skills (language detection, key phrase extraction, and so on) have nothing to work on.

Likely reasons for failure:

PDFs are scanned images or non-text PDFs
- If the PDFs contain only images (scanned pages) with no embedded text, the indexer cannot extract text unless OCR is explicitly enabled via a custom skillset.
Corrupted or malformed PDFs
- Some PDFs do not fully conform to the PDF standard. Adobe Reader may still open them, but Azure Search’s document cracking library (based on PDFium or iFilter) might fail.
Large embedded images or unusual encoding
- PDFs with heavy embedded images or uncommon encodings might trip the parser.
Document splitting did not preserve structure
- Splitting the PDFs manually might have produced chunks with no textual content.

Fixes and Recommendations to try:

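1. Enable OCR for image-based PDFs

OCR is not enabled by default, so scanned or image-based PDFs yield no text. To enable it:

Create a custom skillset that includes an OCR skill, and set `imageAction` to `generateNormalizedImages` in the indexer configuration so that `/document/normalized_images` is populated.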

Example skillset configuration:

  
  {
    "skills": [
      {
        "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
        "context": "/document/normalized_images/*",
        "defaultLanguageCode": "en",
        "inputs": [
          { "name": "image", "source": "/document/normalized_images/*" }
        ],
        "outputs": [
          { "name": "text", "targetName": "ocrText" }
        ]
      }
    ]
  }
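
The OCR skill only receives images if the indexer is configured to extract them. Here is a minimal Python sketch of the two definitions involved. The resource names (`my-service`, `pdf-ocr-skillset`, `pdf-indexer`, `pdf-datasource`, `pdf-index`) and the API version are assumptions for illustration only; the helper builds the PUT requests but does not send them.

```python
import json
import urllib.request

API_VERSION = "2024-07-01"  # assumption: use whichever version your service supports


def build_ocr_skillset(name: str) -> dict:
    """Skillset with one OCR skill that runs once per extracted image."""
    return {
        "name": name,
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
                "context": "/document/normalized_images/*",
                "defaultLanguageCode": "en",
                "inputs": [
                    {"name": "image", "source": "/document/normalized_images/*"}
                ],
                "outputs": [{"name": "text", "targetName": "ocrText"}],
            }
        ],
    }


def build_indexer(name: str, data_source: str, index: str, skillset: str) -> dict:
    """Indexer definition that asks document cracking to emit normalized images."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "parameters": {
            "configuration": {
                "dataToExtract": "contentAndMetadata",
                "imageAction": "generateNormalizedImages",
            }
        },
    }


def build_put_request(service: str, collection: str, definition: dict,
                      admin_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a create-or-update request for a definition."""
    url = (
        f"https://{service}.search.windows.net/{collection}/"
        f"{definition['name']}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(definition).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": admin_key},
    )
```

Passing the returned request to `urllib.request.urlopen` would submit it. Note that this sketch leaves out how the OCR output reaches a searchable index field (for example through output field mappings or a Merge skill), which you would still need to configure.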

  
2. Check the PDFs with `pdftotext` or Adobe Acrobat Reader

Open each PDF and confirm that the text can be selected and copied manually.

Run `pdftotext` (a command-line tool from Poppler/Xpdf, also available through Python wrappers) to see whether any content is extractable.

If no text comes out, the PDF is most likely image-based.
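
The check above could be scripted roughly as follows. This is only a sketch: it assumes the `pdftotext` binary is on your PATH, and the 20-character threshold in `has_text_layer` is an arbitrary value chosen for illustration, not a documented cutoff.

```python
import shutil
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return what it extracts ('-' sends output to stdout)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install Poppler (poppler-utils) first")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(text: str, min_chars: int = 20) -> bool:
    """Heuristic: text-based if enough non-whitespace characters were extracted."""
    # str.split() with no arguments also drops the form feeds pdftotext
    # emits between pages.
    return len("".join(text.split())) >= min_chars
```

For example, `has_text_layer(extract_text("report.pdf"))` (with a hypothetical `report.pdf`) returning `False` suggests the file needs OCR before the indexer can use it.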

3. Preprocess PDFs using Azure Document Intelligence (Form Recognizer)

Upload the PDF to Document Intelligence Studio (formerly Form Recognizer Studio) to see whether it can extract the text.

If successful, consider adding a custom pre-processing step before ingestion.
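
One possible shape for such a pre-processing step is sketched below. It assumes the `azure-ai-formrecognizer` Python package with its `prebuilt-read` model; the endpoint, key, and file names are placeholders you would replace with your own. Treat it as a starting point, not a tested pipeline.

```python
from pathlib import Path


def analyze_pdf(pdf_path: str, endpoint: str, key: str) -> str:
    """Send one PDF to the prebuilt read model and return the recognized text."""
    # Imported lazily so save_text() below works without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(pdf_path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-read", document=f)
        return poller.result().content


def save_text(pdf_path: str, text: str) -> Path:
    """Write the extracted text next to the PDF as a .txt file."""
    out = Path(pdf_path).with_suffix(".txt")
    out.write_text(text, encoding="utf-8")
    return out
```

The resulting `.txt` files can then be uploaded to the blob container in place of (or alongside) the original PDFs.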

4. Upgrade to a higher tier (Standard and above)

The Basic tier has limitations, including:

- Maximum skill execution time and parallelism.
- OCR might perform worse or be throttled.

      
5. Log and Retry Mechanism

Set up logging for skillset execution using Application Insights.

Mark and retry documents that fail parsing via a manual or script-based pipeline.
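
To find which documents to retry, one option is to read the indexer's last execution result. Note that this sketch uses the indexer status REST endpoint rather than Application Insights; the service name, indexer name, and API version are assumptions you would replace.

```python
import json
import urllib.request

API_VERSION = "2024-07-01"  # assumption: use whichever version your service supports


def fetch_indexer_status(service: str, indexer: str, admin_key: str) -> dict:
    """GET the indexer status, which includes per-document errors and warnings."""
    url = (
        f"https://{service}.search.windows.net/indexers/{indexer}/status"
        f"?api-version={API_VERSION}"
    )
    req = urllib.request.Request(url, headers={"api-key": admin_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def failed_items(status: dict) -> list[tuple[str, str, str]]:
    """Return (kind, document key, message) for each error or warning in the last run."""
    last = status.get("lastResult") or {}
    items = []
    for err in last.get("errors", []):
        items.append(("error", err.get("key") or "", err.get("errorMessage") or ""))
    for warn in last.get("warnings", []):
        items.append(("warning", warn.get("key") or "", warn.get("message") or ""))
    return items
```

The keys returned by `failed_items` identify the documents worth routing through OCR or manual extraction before the next indexer run.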

---
Temporary Workaround:

If you urgently need to ingest these PDFs for a demo or POC:

- Open the PDFs manually and copy and paste the text into `.txt` files, or use Python (`PyMuPDF` or `pdfminer`) to extract the text.

- Upload these `.txt` files to the blob container and reindex them.
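
A rough sketch of the Python route, assuming PyMuPDF is installed (`pip install pymupdf`) and that the folder names are your own. PDFs that yield no text are skipped, since those would need OCR instead.

```python
from pathlib import Path
from typing import Callable


def extract_with_pymupdf(pdf_path: Path) -> str:
    """Concatenate the text of every page using PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the rest works without it

    with fitz.open(str(pdf_path)) as doc:
        return "\n".join(page.get_text() for page in doc)


def convert_folder(
    src: Path,
    dst: Path,
    extract: Callable[[Path], str] = extract_with_pymupdf,
) -> list[Path]:
    """Write one .txt file per PDF in src that has extractable text."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src.glob("*.pdf")):
        text = extract(pdf)
        if not text.strip():
            continue  # no text layer; this file needs OCR instead
        out = dst / (pdf.stem + ".txt")
        out.write_text(text, encoding="utf-8")
        written.append(out)
    return written
```

Calling `convert_folder(Path("pdfs"), Path("txt"))` with hypothetical `pdfs` and `txt` folders would produce the `.txt` files to upload.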

Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.
