Hi,
Thanks for reaching out to Microsoft Q&A.
This is a known limitation of OCR systems in general, and of Azure Document Intelligence in particular, when it comes to inline mathematical expressions, especially when they involve:
- Sans-serif fonts (e.g., Arial), where glyphs like "ϑ" (theta) can look like "9" or "O".
- Small font sizes, often used for inline equations in dense academic or engineering text.
- Visually ambiguous Unicode characters, such as:
  - `ϑ` vs `9`
  - `Δ` vs `A`
  - `μ` vs `u`
  - `©` vs `O`
These are not edge cases. They are common OCR issues, well documented in academic papers and in community feedback for Azure, Tesseract, Google Vision, and other OCR platforms.
The misrecognition you are seeing:
- Is not caused by your code or by misuse of the SDK.
- Is due to the underlying OCR model misinterpreting glyphs, especially when dealing with non-standard fonts and math-rich Unicode.
Use a hybrid post-processing pipeline:
Leverage AI + heuristics + post-cleaning:
- Post-OCR symbol correction using a custom dictionary of common math symbols. For example, if `A` appears in a context that expects `Δ`, correct it based on surrounding token context (`A H_{m}` -> `\Delta H_{m}`).
- Use regular expressions + symbol validation to catch unlikely combinations (a digit like `9` followed by math letters).
- If applicable, pass extracted expressions through a LaTeX validator to catch malformed math.
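The post-processing steps above can be sketched roughly as follows. This is a minimal illustration, not a production pipeline: the function names (`correct_symbols`, `flag_suspicious`, `braces_balanced`) and the specific rules are hypothetical, and the brace check is only a basic sanity test standing in for a real LaTeX validator. You would need to tune the patterns to the notation in your own documents.

```python
import re

# Hypothetical context rules (pattern, replacement); illustrative only.
CORRECTIONS = [
    # A lone "A" right before a subscripted quantity such as H_{m}
    # is likely a misread Delta.
    (re.compile(r"(?<![A-Za-z])A(?=\s+[HSGU]_\{)"), "\\Delta"),
    # "um" after a number and a space is likely a misread micro sign.
    (re.compile(r"(?<=\d )um\b"), "μm"),
]

# A lone 9 (not part of a longer number) directly followed by a letter,
# "_" or "^" is unusual in math text and may be a misread theta.
SUSPICIOUS_NINE = re.compile(r"(?<![\d.])9(?=[A-Za-z_^])")


def correct_symbols(text: str) -> str:
    """Apply each context rule in order and return the corrected text."""
    for pattern, replacement in CORRECTIONS:
        # A function replacement avoids backslash-escape handling in re.sub.
        text = pattern.sub(lambda _m, r=replacement: r, text)
    return text


def flag_suspicious(text: str) -> list[int]:
    """Return character offsets of tokens that deserve manual review."""
    return [m.start() for m in SUSPICIOUS_NINE.finditer(text)]


def braces_balanced(expr: str) -> bool:
    """Basic sanity check: every "{" has a matching "}" in order."""
    depth = 0
    for ch in expr:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

For instance, `correct_symbols("A H_{m}")` returns `\Delta H_{m}`, while ordinary prose such as `Area A is large` is left unchanged because the `A` is not followed by a subscripted symbol. Flagging rather than auto-correcting the `9` case is deliberate here, since a wrong automatic substitution inside a real number would be harder to spot than a review flag.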
Train a custom model (workaround):
Use a custom neural model via Azure Document Intelligence Studio (formerly Form Recognizer Studio), or a fine-tuned vision model, with:
- Tagged samples of your problematic fonts and formulas.
- Explicit labels for Greek letters and special symbols.
This may give better results than the default layout model if your use case is narrow and well labeled.
Preprocess PDFs:
Before OCR:
- Convert PDFs to high-resolution TIFF or PNG (300+ DPI).
- Apply font substitution where possible (replace sans-serif with serif fonts using tools like `pdftk`, `ghostscript`, or `pdfplumber` if you have editable PDFs).
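As a rough sketch of the rasterization step, the snippet below calls Ghostscript from Python to render each page as a 300 DPI PNG. It assumes the `gs` executable is installed and on your PATH (on Windows the binary is usually named `gswin64c`), and the helper names `build_gs_command` and `rasterize` are hypothetical. It does not attempt font substitution.

```python
import subprocess
from pathlib import Path


def build_gs_command(pdf_path: str, out_dir: str, dpi: int = 300) -> list[str]:
    """Build a Ghostscript command that renders one PNG per page."""
    # %03d is expanded by Ghostscript into the page number (001, 002, ...).
    out_pattern = str(Path(out_dir) / "page-%03d.png")
    return [
        "gs",                   # assumed to be on PATH
        "-dNOPAUSE",            # do not pause between pages
        "-dBATCH",              # exit when processing finishes
        "-dSAFER",              # restrict file system access
        "-sDEVICE=png16m",      # 24-bit colour PNG output
        f"-r{dpi}",             # rendering resolution in DPI
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def rasterize(pdf_path: str, out_dir: str, dpi: int = 300) -> None:
    """Create the output folder and run Ghostscript, raising on failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_gs_command(pdf_path, out_dir, dpi), check=True)
```

Calling `rasterize("paper.pdf", "pages")` would write `pages/page-001.png`, `pages/page-002.png`, and so on, which you can then submit to Document Intelligence one image at a time. Keeping the command construction separate from the execution makes it easy to inspect or adjust the flags (for example, a higher DPI for very small inline math).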
Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.