False positives when extracting math formulas

Clemens Fiedler 0 Reputation points
2025-04-24T12:29:03.1666667+00:00

In my Python script, which detects math formulas for an AI chat app, I am using the Azure Document Intelligence method "begin_analyze_document" with the "formula" feature.

Sometimes it returns false positives that are clearly not formulas:

  1. A very common practice in scientific papers is to use square brackets with dots, like "[...]", to mark omitted text in a quotation. However, these characters are sometimes falsely returned as a formula, like: \( \left[ \ldots \right] \)
  2. Phone numbers and other strings containing "+" or "-", like "Tel: +1 123-456 789", are also detected as formulas, like: "\(T e l: + 1 1 2 3 - 4 5 6 7 8 9 \)"
  3. Sometimes even ordinary words in plain-text paragraphs without any symbols are returned as formulas, like (German): "optimierte Ergebnisse \(z u\) erzielen" (EN: "to achieve optimized results"). The German word "zu" just means "to" in English and is clearly not a formula.

Therefore my questions are:

  • Is there any option or setting available to developers to improve the OCR results and prevent such issues?
  • Could the OCR model be improved to detect (Unicode) symbols more reliably and prevent false positives like the ones above?
Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Sina Salam 19,616 Reputation points
    2025-04-25T01:41:50.5633333+00:00

    Hello Clemens Fiedler,

    Welcome to Microsoft Q&A, and thank you for posting your questions here.

    I understand that you are using the begin_analyze_document function with the "formula" feature to detect math formulas in scientific papers, and that it is giving you false positives.

    Regarding your questions, you will need to do the following:

    1. For your first question (options available to developers):
      • Clean and preprocess the input to remove or replace characters likely to cause false positives. For example, you can use regular expressions to filter out phone numbers and bracketed text before it is passed to the OCR model. See this thread for more details - https://learn.microsoft.com/en-us/answers/questions/2031813/the-begin-analyze-document-method-from-the-documen
      • Train a custom model tailored to your specific use case. Azure Document Intelligence allows you to create custom models that can be more accurate for your specific document types - https://learn.microsoft.com/en-us/python/api/azure-ai-documentintelligence/azure.ai.documentintelligence.documentintelligenceclient?view=azure-python
      • Implement post-processing steps that filter out false positives, using regular expressions to identify and exclude patterns that are unlikely to be mathematical formulas - https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/read?view=doc-intel-4.0.0
    2. For your second question (improving symbol detection):
      • Incorporate more diverse and representative training data that includes various Unicode symbols and non-formula text patterns - https://learn.microsoft.com/en-us/dotnet/api/overview/azure/ai.documentintelligence-readme?view=azure-dotnet
      • Use techniques such as deskewing, denoising, and binarization to improve the quality of the input images - https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/use-sdk-rest-api?view=doc-intel-4.0.0
      • Then fine-tune pre-trained models on your specific dataset to improve their accuracy in detecting and recognizing symbols.
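
    As a rough illustration of the post-processing idea, here is a minimal Python sketch that drops the three kinds of false positives from the question. Everything in it is an assumption for illustration and not part of the Azure SDK: the helper names (strip_delimiters, is_probable_false_positive, filter_formulas), the regular expressions, the known_words vocabulary, and the min_confidence threshold are all hypothetical. It also assumes each formula object exposes "value" and "confidence" attributes, so please check that against the objects your SDK version returns.

```python
import re
from types import SimpleNamespace

# Hypothetical heuristics for illustration; none of this is part of the Azure SDK.
# "[...]" written as LaTeX, e.g. \left[ \ldots \right], after whitespace is removed.
ELLIPSIS_RE = re.compile(r"^(?:\\left)?\[(?:\\ldots|\\cdots|\\dots|\.{3})(?:\\right)?\]$")
# Optional label such as "Tel:", optional "+", then at least 7 digit-like characters.
PHONE_RE = re.compile(r"^(?:[A-Za-z]+:)?\+?\d[\d\-().]{6,}$")


def strip_delimiters(value: str) -> str:
    """Remove surrounding \\( \\), $$ $$ or $ $ if present."""
    value = value.strip()
    for left, right in (("\\(", "\\)"), ("$$", "$$"), ("$", "$")):
        if (value.startswith(left) and value.endswith(right)
                and len(value) >= len(left) + len(right)):
            return value[len(left):-len(right)].strip()
    return value


def is_probable_false_positive(value: str, known_words=frozenset()) -> bool:
    """Return True if the formula text looks like ordinary text, not math."""
    # The OCR output spaces out characters ("z u"), so remove all whitespace first.
    text = re.sub(r"\s+", "", strip_delimiters(value))
    if not text:
        return True
    if ELLIPSIS_RE.match(text) or PHONE_RE.match(text):
        return True
    # Plain words are only rejected when they appear in a caller-supplied
    # vocabulary, because short letter runs such as "xy" can be real formulas.
    return text.lower() in known_words


def filter_formulas(formulas, known_words=frozenset(), min_confidence=0.0):
    """Keep formulas that pass the confidence threshold and the heuristics."""
    kept = []
    for formula in formulas:
        confidence = getattr(formula, "confidence", None)
        if confidence is not None and confidence < min_confidence:
            continue
        if is_probable_false_positive(formula.value, known_words):
            continue
        kept.append(formula)
    return kept


# Stand-in objects; in a real script these would come from the analysis result.
samples = [
    SimpleNamespace(value=r"\left[ \ldots \right]", confidence=0.9),
    SimpleNamespace(value="T e l: + 1 1 2 3 - 4 5 6 7 8 9", confidence=0.8),
    SimpleNamespace(value="z u", confidence=0.7),
    SimpleNamespace(value="E = m c ^ { 2 }", confidence=0.95),
    SimpleNamespace(value="x + y", confidence=0.2),
]
kept = filter_formulas(samples, known_words={"zu"}, min_confidence=0.5)
print([f.value for f in kept])
```

    The word check is deliberately limited to a vocabulary you supply (for example, a list of German stop words), since a blanket "letters only" rule would also discard legitimate formulas such as a product of two variables. The phone-number pattern is similarly coarse and will need tuning for your documents.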

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification.


    Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.

