False positives when extracting math formulas

Question

Clemens Fiedler 0

In my Python script, which detects math formulas for an AI chat app, I am using the Azure Document Intelligence method "begin_analyze_document" with the "formula" feature.

Sometimes it returns false positives that are clearly not formulas:

A very common practice in scientific papers is to use square brackets with dots, like "[...]", to mark omitted text in a quotation. However, these characters are sometimes falsely returned as a formula, like: \( \left[ \ldots \right] \)
Phone numbers and other strings containing "+" or "-", like "Tel: +1 123-456 789", are also detected as formulas, like: "\(T e l: + 1 1 2 3 - 4 5 6 7 8 9 \)"
Sometimes even ordinary words in plain-text paragraphs without any symbols are returned as formulas, like (German): "optimierte Ergebnisse \(z u\) erzielen" (EN: "to achieve optimized results"). The German word "zu" just means "to" in English and is clearly not a formula.

Therefore my questions are:

Is there any option or setting available to developers to further improve the OCR results and prevent such issues?
Could the OCR model itself be improved so that it detects (Unicode) symbols more reliably and avoids false positives like the ones above?

Answer 1

Sina Salam 19,616

Hello Clemens Fiedler,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that you are using the begin_analyze_document function with the "formula" feature to detect math formulas in scientific papers, and that it is giving you false positives.

Regarding your questions, you will need to do the following:

Clean and preprocess the text to remove or replace characters likely to cause false positives. For example, you can use regular expressions to filter out phone numbers and bracketed text before passing it to the OCR model. See this link for more details: https://learn.microsoft.com/en-us/answers/questions/2031813/the-begin-analyze-document-method-from-the-documen
Train a custom model tailored to your specific use case. Azure Document Intelligence allows you to create custom models that can be more accurate for your specific document types: https://learn.microsoft.com/en-us/python/api/azure-ai-documentintelligence/azure.ai.documentintelligence.documentintelligenceclient?view=azure-python
Implement post-processing steps to filter out false positives, using regular expressions to identify and exclude patterns that are unlikely to be mathematical formulas: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/read?view=doc-intel-4.0.0
Incorporate more diverse and representative training data that includes various Unicode symbols and non-formula text patterns: https://learn.microsoft.com/en-us/dotnet/api/overview/azure/ai.documentintelligence-readme?view=azure-dotnet
Use techniques like deskewing, denoising, and binarization to improve the quality of the input images: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/use-sdk-rest-api?view=doc-intel-4.0.0
Then, fine-tune pre-trained models on your specific dataset to improve their performance in detecting and recognizing symbols accurately.
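
The regex-based post-processing suggested above could look roughly like the sketch below. It assumes the formula strings have already been pulled out of the analysis result as LaTeX text. The function name looks_like_false_positive and both patterns are illustrative assumptions tuned only to the examples in the question; they are not part of the Azure SDK.

```python
import re

# Bracketed ellipsis, e.g. "\left[ \ldots \right]" or a plain "[...]".
ELLIPSIS_RE = re.compile(
    r"^(?:\\left)?\s*\[\s*(?:\\ldots|\\dots|\.\.\.)\s*(?:\\right)?\s*\]$"
)

# Phone-like content once whitespace is removed: an optional "Label:",
# an optional "+", then at least seven digits/hyphens starting with a digit.
PHONE_RE = re.compile(r"^(?:[A-Za-z]+:)?\+?\d[\d\-]{6,}$")


def strip_delimiters(value: str) -> str:
    """Drop surrounding \\( ... \\) delimiters if they are present."""
    value = value.strip()
    if value.startswith(r"\(") and value.endswith(r"\)"):
        value = value[2:-2].strip()
    return value


def looks_like_false_positive(latex_value: str) -> bool:
    """Heuristic check for the false positives described in the question."""
    inner = strip_delimiters(latex_value)
    if ELLIPSIS_RE.match(inner):
        return True
    # The OCR output spaces out every character ("T e l: + 1 1 2 3"),
    # so remove all whitespace before testing the phone pattern.
    compact = re.sub(r"\s+", "", inner)
    return bool(PHONE_RE.match(compact))
```

This is only a heuristic: a genuine expression made of nothing but a long run of digits and hyphens would also be rejected, so the patterns should be checked against real documents before being relied on.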

I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.

Please don't forget to close out the thread by upvoting this and accepting it as the answer if it is helpful.

Clemens Fiedler 0 Reputation points

2025-04-25T12:51:26.36+00:00

"fine-tune pre-trained models on your specific dataset to improve their performance in detecting and recognizing symbols accurately."

(How) can I fine-tune the pre-trained models provided by Microsoft like "prebuilt-layout" or the model of the formula feature add-on?
Do you have any resources or how-to pages for this?
Saideep Anchuri 6,690 Reputation points Microsoft External Staff

2025-04-25T16:28:18.8433333+00:00
Hello Clemens Fiedler,

To fine-tune pre-trained models provided by Microsoft, such as the "prebuilt-layout" or the formula feature add-on, you can utilize the Azure AI Foundry platform. Fine-tuning allows you to customize these models on your specific dataset, which can improve their performance in detecting and recognizing symbols accurately.

Here are some steps to fine-tune a model:

Choose a model that supports your task.

Prepare and upload your training data.

(Optional) Prepare and upload validation data.

(Optional) Configure task parameters.

Train your model.

Review metrics and evaluate the model's performance.

Use your fine-tuned model.

It's important to provide high-quality training data, ideally hundreds or thousands of examples, to achieve the best results.

Kindly refer below link: https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/fine-tuning-overview

Thank You.
Sina Salam 19,616 Reputation points

2025-04-25T19:26:16.98+00:00
Hello Clemens Fiedler,

Thank you for your follow-up and feedback.

Sorry for the misunderstanding. The prebuilt models you referred to cannot be fine-tuned.

The guidance from @Saideep Anchuri above is misleading and not practical for this case: Azure AI Foundry fine-tuning does not apply to Document Intelligence prebuilt models, and the linked page does not address them. Prebuilt models like prebuilt-layout are static (Microsoft explicitly states this) and cannot be fine-tuned.

The best-practice solution is to combine preprocessing, post-processing, custom models, and language filtering.

Refer to the following links:

https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-custom

https://learn.microsoft.com/en-us/azure/ai-services/language-service/overview

https://docs.python.org/3/library/re.html

Follow the steps here to solve your issue:

For preprocessing, filter non-formula patterns by using regex to exclude common false positives before sending text to Azure:

    import re

    def filter_non_formulas(text):
        # Remove bracketed ellipses such as [...]
        text = re.sub(r'\[\.\.\.\]', '', text)
        # Remove phone numbers (e.g., +1 123-456 789)
        text = re.sub(r'\+\d[\d\- ]+', '', text)
        return text

https://docs.python.org/3/library/re.html

For post-processing, flag low-confidence formulas by using Azure’s confidence scores to discard weak formula predictions (e.g., formulas with only brackets/dots):

    # result is the AnalyzeResult returned by begin_analyze_document(...).result()
    formulas = [f for page in result.pages for f in (page.formulas or []) if f.confidence > 0.8]
https://learn.microsoft.com/en-us/python/api/azure-ai-documentintelligence/azure.ai.documentintelligence.models.documentformula?view=azure-python
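
To make the confidence filter easier to try out, here is a minimal sketch that runs offline. It assumes each page object has a formulas attribute (possibly None) and each formula has value and confidence attributes, as described on the DocumentFormula page linked above. The helper name confident_formulas and the demo confidence values are made up for illustration.

```python
from types import SimpleNamespace


def confident_formulas(pages, threshold=0.8):
    """Return the formulas whose confidence is above `threshold`.

    `pages` is assumed to be an iterable of page-like objects with a
    `formulas` attribute, such as the pages of an analysis result.
    """
    kept = []
    for page in pages:
        # A page without detected formulas may carry None instead of a list.
        for formula in page.formulas or []:
            if formula.confidence is not None and formula.confidence > threshold:
                kept.append(formula)
    return kept


# Stand-in objects shaped like the SDK models; the confidence values
# are invented for this demo and do not come from a real analysis.
demo_pages = [
    SimpleNamespace(formulas=[
        SimpleNamespace(value=r"\left[ \ldots \right]", confidence=0.35),
        SimpleNamespace(value=r"a^{2} + b^{2} = c^{2}", confidence=0.97),
    ]),
    SimpleNamespace(formulas=None),
]

print([f.value for f in confident_formulas(demo_pages)])
```

With a real analysis result, the same function would be called as confident_formulas(result.pages). A threshold alone may not remove every false positive, since a misdetected string can still receive a high score, so it is probably best combined with the pattern-based checks.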

Then, train a custom formula model using Document Intelligence Studio to prioritize LaTeX-like structures and exclude non-technical text. Include examples of [...], phone numbers, and German text labeled as non-formulas. https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-custom

Lastly, for language-specific filtering, use Azure’s language detection API to detect formula candidates that are really natural-language text (e.g., German) and exclude them:

    from azure.ai.textanalytics import TextAnalyticsClient

    def is_german(text, endpoint, credential):
        # endpoint and credential are those of your Azure Language resource
        client = TextAnalyticsClient(endpoint, credential)
        result = client.detect_language([text])
        return result[0].primary_language.iso6391_name == "de"
https://learn.microsoft.com/en-us/azure/ai-services/language-service/language-detection/overview
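
Language detection on a two-letter fragment such as "z u" may not be reliable, so an offline check could complement the API call. The sketch below collapses the spaced-out letters and looks the result up in a word list. The name is_spaced_out_word and the tiny COMMON_WORDS set are hypothetical placeholders; a real list would be loaded from a dictionary for the languages in your documents.

```python
import re

# Tiny illustrative vocabulary (assumption); replace with a real word list.
COMMON_WORDS = {"zu", "tel", "und", "der", "die", "das", "to", "the", "and"}


def is_spaced_out_word(latex_value, vocabulary=COMMON_WORDS):
    """True if the value is only letters that spell a known word
    once the whitespace between them is removed."""
    compact = re.sub(r"\s+", "", latex_value)
    # Any digit, operator, brace, or LaTeX command means the value is not
    # a plain word; an empty string is not a word either.
    if not compact.isalpha():
        return False
    return compact.lower() in vocabulary
```

For example, is_spaced_out_word("z u") is true with the list above, while a value containing any operator or LaTeX command is never flagged. Note that "z u" could also be a legitimate product of two variables in some papers, so flagged values may be better reviewed in context than dropped blindly.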

Success
Saideep Anchuri 6,690 Reputation points Microsoft External Staff

2025-04-28T12:40:12.4266667+00:00

Hello Clemens Fiedler,

We haven’t heard from you since the last response and were just checking back to see if you have a resolution yet.

Thank You.
