OCR issues extracting math formulas with Document Intelligence

Clemens Fiedler 0 Reputation points
2025-04-24T12:11:49.24+00:00

I want to extract mathematical formulas from scientific/engineering PDF documents. The extracted formulas are later injected as LaTeX math code back into the extracted text, which is then provided to a GPT model used by a chatbot AI app.

With Azure Document Intelligence, in a Python script, we are using "begin_analyze_document" with the "prebuilt-layout" model and the features "high resolution" and "formula".

Extracting "display" formulas set in a "serif" font like Times New Roman mostly works.

However, I have encountered OCR issues with Unicode symbols inside "inline" formulas set in "sans serif" fonts like Arial:

  • Greek symbols like "theta" ϑ are falsely extracted as "9" or "O":
    e.g. "temperature ϑ fla" is extracted as "temperature \( 9 f l a \)",
    or "Δ Hm" is extracted as \( A H _ { m } \),
    ... and many other similar examples with Greek math symbols.
  • The copyright symbol © is falsely recognized as "O", e.g. "© Name" is recognized as "O Name".

Therefore my questions are:

  • Is there any option or setting that we as developers can use to further improve the OCR results and prevent such issues?
  • Could the OCR model be further improved to detect (Unicode) symbols more reliably and prevent misrecognitions like the ones above?
Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Vinodh247 32,531 Reputation points MVP
    2025-04-24T16:13:59.3466667+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    This is a known limitation of OCR systems in general, and of Azure Document Intelligence in particular, when it comes to inline mathematical expressions, especially when they involve:

    • Sans-serif fonts (e.g., Arial), where glyphs like "ϑ" (theta) can look like "9" or "O".
    • Small font sizes, often used in inline equations in dense academic/engineering text.
    • Visually ambiguous Unicode characters, such as:
        `ϑ` vs `9`
        `Δ` vs `A`
        `μ` vs `u`
        `©` vs `O`

    These are not edge cases. They are common OCR issues, well documented in both academic papers and community feedback for Azure, Tesseract, Google Vision, and other OCR platforms. In other words, the problem:

    1. Is not caused by your code or by misuse of the SDK.
    2. Is due to the underlying OCR model misinterpreting glyphs, especially when dealing with non-standard fonts and math-rich Unicode.

    Use a hybrid post-processing pipeline:

    Leverage AI + heuristics + post-cleaning:

    • Apply post-OCR symbol correction using a custom dictionary of common math symbols, e.g. if A appears in a context that expects Δ, correct it based on the surrounding tokens (A H_{m} -> \Delta H_{m}).
    • Use regular expressions plus symbol validation to catch unlikely combinations (such as a digit like 9 directly followed by letters inside a formula).
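The two correction ideas above can be sketched in Python roughly as follows. This is a minimal illustration, not a tested production pipeline: the rule list `RULES`, the function name `correct_formula`, and the choice of context letters are assumptions made for this example, and the input strings follow the spaced token style shown in the question.

```python
import re

# Hypothetical, domain-specific rules: (pattern, replacement).
# Each rule only fires in a narrow context to limit damage to valid math.
RULES = [
    # A standalone "A" followed by a quantity symbol such as H, S, G, T, U or p
    # is read as a difference: "A H _ { m }" -> "\Delta H _ { m }".
    (re.compile(r"(?<![A-Za-z\\])A(?=\s+[HSGTUp](?![A-Za-z]))"), r"\\Delta"),
    # A leading "9" followed by a letter token is unlikely to be a number:
    # "9 f l a" -> "\vartheta f l a".
    (re.compile(r"^\s*9(?=\s+[A-Za-z])"), r"\\vartheta"),
]


def correct_formula(latex: str) -> tuple[str, bool]:
    """Apply every rule to one extracted formula.

    Returns the (possibly) corrected string and a flag telling whether
    anything changed, so changed formulas can be logged or reviewed.
    """
    fixed = latex
    for pattern, replacement in RULES:
        fixed = pattern.sub(replacement, fixed)
    return fixed, fixed != latex


if __name__ == "__main__":
    for sample in ["A H _ { m }", "9 f l a", "x = 9"]:
        print(sample, "->", correct_formula(sample))
```

Rules like these can also corrupt correct formulas (a genuine "9 x" at the start of an expression, or a matrix named A), so it is probably safer to log or review every changed formula than to replace silently.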

    If applicable, pass extracted expressions through a LaTeX validator to catch malformed math.
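As a rough idea of what such a validation step could look like, here is a small structural check using only the Python standard library. It is a sketch under assumptions: the function name `latex_problems` and the three checks are chosen for illustration, and it does not parse LaTeX fully. A stricter check would compile the expression with a TeX engine or use a dedicated parser.

```python
import re


def latex_problems(latex: str) -> list[str]:
    """Return a list of structural problems found in one formula string."""
    problems = []
    depth = 0
    i = 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            # Skip the backslash and the next character, so escaped braces
            # such as \{ and \} are not counted as grouping braces.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                problems.append("unmatched closing brace")
                depth = 0
        i += 1
    if depth > 0:
        problems.append("unclosed opening brace")
    # \b keeps \leftarrow and \rightarrow from being counted as delimiters.
    lefts = len(re.findall(r"\\left\b", latex))
    rights = len(re.findall(r"\\right\b", latex))
    if lefts != rights:
        problems.append(r"\left and \right counts differ")
    if re.search(r"[_^]\s*$", latex):
        problems.append("dangling subscript or superscript")
    return problems


if __name__ == "__main__":
    for sample in [r"\Delta H _ { m }", r"\frac { a } { b", "x _"]:
        print(repr(sample), latex_problems(sample))
```

Formulas that return a non-empty list could be kept as plain text or flagged, rather than injected as LaTeX into the text sent to the GPT model.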

    Train a custom model (workaround):

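    Use a custom neural model via Document Intelligence Studio (formerly Form Recognizer Studio), or a fine-tuned vision model, with:

    • Tagged samples of your problematic fonts and formulas.
    • Greek letters and special symbols labeled explicitly.

    This may give better results than the default layout model if your use case is narrow and well-labeled. Custom models in Document Intelligence are primarily designed for labeled field extraction, so test on a small labeled sample first to confirm that symbol recognition actually improves before investing in a larger training set.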

    Preprocess PDFs:

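    Before OCR:

    • Convert PDFs to high-resolution TIFF or PNG (300+ DPI).
    • Apply font substitution where possible (replace sans-serif with serif fonts if you have editable PDFs), using a tool such as Ghostscript, which can substitute fonts that are not embedded in the PDF. Note that pdftk and pdfplumber handle page manipulation and text extraction and cannot rewrite fonts.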
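The rasterization step could be scripted from Python by calling Ghostscript, for example as sketched below. This assumes Ghostscript is installed and available as `gs` (on Windows the executable is typically `gswin64c`). The function names, the file names, and the `page-%03d.png` output pattern are placeholders chosen for this example.

```python
import subprocess
from pathlib import Path


def ghostscript_png_command(pdf_path: str, out_dir: str, dpi: int = 300) -> list[str]:
    """Build a Ghostscript command that renders each PDF page to a PNG file."""
    out_pattern = str(Path(out_dir) / "page-%03d.png")  # %03d = page counter
    return [
        "gs",                          # Ghostscript executable (assumed on PATH)
        "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",             # 24-bit RGB PNG output device
        f"-r{dpi}",                    # rendering resolution in DPI
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def rasterize(pdf_path: str, out_dir: str, dpi: int = 300) -> None:
    """Create the output folder and run Ghostscript; raises if gs fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_png_command(pdf_path, out_dir, dpi), check=True)


if __name__ == "__main__":
    # Only prints the command; call rasterize(...) to actually render pages.
    print(" ".join(ghostscript_png_command("paper.pdf", "pages", dpi=300)))
```

The resulting page images can then be sent to the layout model one by one. Whether 300 DPI or a higher value works better for small inline formulas is something to test on your own documents.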


