OCR issues extracting math formulas with Document Intelligence

Clemens Fiedler 0

I want to extract mathematical formulas from scientific/engineering PDF documents. The extracted formulas are later injected as LaTeX math code back into the extracted text, which is then provided to a GPT model used by a chatbot AI app.

With Azure Document Intelligence, in a Python script, we are using "begin_analyze_document" with the "prebuilt-layout" model and the features "high resolution" and "formula".

Extracting "display" formulas set in a "serif" font like Times New Roman mostly works.

However, I have encountered OCR issues with Unicode symbols inside "inline" formulas set in "sans serif" fonts like Arial:

Greek symbols such as theta (ϑ) are wrongly extracted as "9" or "O":
e.g. "temperature ϑ fla" is extracted as "temperature \( 9 f l a \)",
or "Δ Hm" is extracted as "\( A H _ { m } \)",
... and many other similar examples with Greek math symbols.
The copyright symbol © is wrongly recognized as "O", e.g. "© Name" is recognized as "O Name".

Therefore my questions are:

Is there any option or setting that we as developers can use to further improve OCR results and prevent such issues?
Could the OCR model be further improved to detect (Unicode) symbols more reliably and prevent misrecognitions like the ones above?

Answer 1

Hi,

Thanks for reaching out to Microsoft Q&A.

This is a known limitation of OCR systems in general, and of Azure Document Intelligence in particular, when it comes to inline mathematical expressions, especially when they involve:

Sans-serif fonts (e.g., Arial), where glyphs like "ϑ" (theta) can look like "9" or "O".

Small font sizes, often used in inline equations in dense academic/engineering text.

Visually ambiguous Unicode characters, such as:

  `ϑ` vs `9`

  `Δ` vs `A`

  `μ` vs `u`

  `©` vs `O`

These are not edge cases; they are common OCR issues, well documented in both academic papers and community feedback for Azure, Tesseract, Google Vision, and other OCR platforms.

This is not caused by your code or by misuse of the SDK.
It is due to the underlying OCR model misinterpreting glyphs, especially when dealing with non-standard fonts and math-rich Unicode.

Use a hybrid post-processing pipeline:

Leverage AI + heuristics + post-cleaning:

Apply post-OCR symbol correction using a custom dictionary of common math symbols. For example, if A appears in a context that expects Δ, correct it based on the surrounding tokens (A H_{m} -> \Delta H_{m}).
Use regular expressions and symbol validation to catch unlikely combinations (for example, a digit like 9 followed by math letters).
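
As a rough illustration of the two points above, here is a minimal Python sketch of context-based correction. The rule table, the function name `correct_formula`, and the two regular expressions are assumptions made for this example, built only from the misreadings reported in the question; they are not part of the Document Intelligence SDK and would need tuning against your own documents.

```python
import re

# Hypothetical rule table: (pattern, replacement, note). The patterns assume
# the space-separated token style seen in this thread, e.g. "\( A H _ { m } \)".
RULES = [
    # "A" directly before a capital letter that carries a subscript: likely a Delta.
    (re.compile(r"(?<![A-Za-z\\])A(?=\s+[A-Z]\s*_)"), r"\Delta", "A -> \\Delta"),
    # A lone "9" directly before a letter: likely a theta. This can misfire on
    # genuine products such as "9 x", so review the reported changes.
    (re.compile(r"(?<![\d.,])(?<![\d.,]\s)9(?=\s+[A-Za-z])"), r"\vartheta", "9 -> \\vartheta"),
]


def correct_formula(latex):
    """Apply every rule and return (corrected_text, list_of_applied_rules)."""
    changes = []
    for pattern, replacement, note in RULES:
        # A function replacement avoids backslash-escape handling in re.sub.
        latex, count = pattern.subn(lambda match, r=replacement: r, latex)
        if count:
            changes.append(f"{note} (x{count})")
    return latex, changes


if __name__ == "__main__":
    for sample in (r"\( A H _ { m } \)", r"\( 9 f l a \)", r"\( A = 1 9 . 5 \)"):
        print(sample, "=>", correct_formula(sample))
```

Returning the list of applied rules alongside the corrected text makes it possible to log or manually review each substitution instead of trusting the heuristics blindly.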

If applicable, pass extracted expressions through a LaTeX validator to catch malformed math.
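
A full LaTeX validator would mean compiling or parsing the expression. As a lighter stand-in, the sketch below only checks structure: balanced braces, matching `\(` and `\)` delimiters, and subscripts or superscripts that have an argument. The function name `validate_latex` and the choice of checks are assumptions for illustration, not an established tool.

```python
import re


def validate_latex(expr):
    """Return a list of structural problems found in a LaTeX math snippet."""
    problems = []
    # Escaped braces \{ and \} are literal characters, so ignore them.
    body = expr.replace(r"\{", "").replace(r"\}", "")

    depth = 0
    for ch in body:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                problems.append("closing brace without opening brace")
                break
    if depth > 0:
        problems.append("unclosed opening brace")

    if body.count(r"\(") != body.count(r"\)"):
        problems.append("unbalanced inline math delimiters")

    # "_" or "^" followed by end of input, "\)", "}", or another "_" / "^".
    if re.search(r"[_^]\s*(?:$|\\\)|[_^}])", body):
        problems.append("subscript or superscript with no argument")

    return problems


if __name__ == "__main__":
    print(validate_latex(r"\( \Delta H _ { m } \)"))  # no problems expected
    print(validate_latex(r"\( H _ { m \)"))           # reports the unclosed brace
```

Expressions that fail these checks can be routed to manual review or re-extracted, rather than being injected into the text passed to the GPT model.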

Train a custom model (workaround):

Use a custom neural model via Azure Form Recognizer Studio, or a fine-tuned vision AI model, with:

Tagged samples of your problematic fonts and formulas.

Explicit labels for Greek letters and special symbols.

This may give better results than the default layout model if your use case is narrow and well-labeled.

Preprocess PDFs:

Before OCR:

Convert PDFs to high-resolution TIFF or PNG (300+ DPI).

Apply font substitution where possible (replace sans-serif with serif fonts using tools like pdftk, Ghostscript, or pdfplumber if you have editable PDFs).
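
For the rasterization step, it helps to know what "300+ DPI" means in pixels. PDF page sizes are expressed in points (by default 1 pt = 1/72 inch), so the scale factor handed to a renderer is `dpi / 72`. The helper below is a small sketch of that arithmetic; the name `render_size` is made up for this example, and the actual rendering call depends on whichever PDF library you use.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 pt = 1/72 inch


def render_size(width_pt, height_pt, dpi=300):
    """Return (zoom, width_px, height_px) for rendering a page at the given DPI."""
    zoom = dpi / POINTS_PER_INCH
    return zoom, round(width_pt * zoom), round(height_pt * zoom)


if __name__ == "__main__":
    # US Letter is 612 x 792 pt, which becomes 2550 x 3300 px at 300 DPI.
    print(render_size(612, 792))
    # A4 is roughly 595 x 842 pt.
    print(render_size(595, 842, dpi=300))
```

Checking the resulting pixel dimensions before uploading is a cheap way to confirm that small inline symbols are rendered with enough pixels to be distinguishable.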

Ravada Shivaprasad 30 Reputation points Microsoft External Staff

2025-04-28T10:48:35.19+00:00

Hi Clemens Fiedler

I hope the above answer by @Vinodh247 helps.

Thanks!
Ravada Shivaprasad 30 Reputation points Microsoft External Staff

2025-04-30T18:44:15.9533333+00:00

Hi Clemens Fiedler

Did you get a chance to check the response?

Thank you!
