Hello Clemens Fiedler,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are using the begin_analyze_document function with the "formula" feature to detect math formulas in scientific papers, and that this is giving you false positives.
To address this, I suggest the following:
- Clean and preprocess the text to remove or replace characters that are likely to cause false positives. For example, you can use regular expressions to filter out phone numbers and bracketed text before passing it to the OCR model. See this thread for more details: https://learn.microsoft.com/en-us/answers/questions/2031813/the-begin-analyze-document-method-from-the-documen
- Train a custom model tailored to your specific use case. Azure Document Intelligence allows you to create custom models, which can be more accurate for your specific document types: https://learn.microsoft.com/en-us/python/api/azure-ai-documentintelligence/azure.ai.documentintelligence.documentintelligenceclient?view=azure-python
- Implement post-processing steps to filter out false positives, using regular expressions to identify and exclude patterns that are unlikely to be mathematical formulas: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/read?view=doc-intel-4.0.0
- Incorporate more diverse and representative training data that includes various Unicode symbols and non-formula text patterns: https://learn.microsoft.com/en-us/dotnet/api/overview/azure/ai.documentintelligence-readme?view=azure-dotnet
- Use techniques like deskewing, denoising, and binarization to improve the quality of the input images: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/use-sdk-rest-api?view=doc-intel-4.0.0
- Then, fine-tune pre-trained models on your specific dataset to improve their performance in detecting and recognizing symbols accurately.
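To make the regular-expression post-processing idea more concrete, here is a minimal sketch in Python. It assumes you have already copied each detected formula into a plain dict with a "value" string and an optional "confidence" number. The helper names (is_probable_formula, filter_formulas), the patterns, and the 0.5 threshold are all hypothetical and only for illustration, so you will need to tune them against your own papers.

```python
import re

# Illustrative patterns only (assumptions, not part of any Azure API).
# Text that looks like a phone number, e.g. "+49 (0)30 1234-5678".
PHONE_RE = re.compile(r"^\+?\d[\d\s().-]{6,}\d$")
# Plain bracketed text such as "[12]" or "(see Fig. 3)".
BRACKETED_RE = re.compile(r"^[\[(][A-Za-z0-9 ,.;-]*[\])]$")
# Characters that hint at real math content (operators, LaTeX backslash).
MATH_HINT_RE = re.compile(r"[=+*/^_<>≤≥±∑∫√\\]")


def is_probable_formula(value, confidence=1.0, min_confidence=0.5):
    """Return True if the candidate text still looks like a formula."""
    text = value.strip()
    if confidence < min_confidence:
        return False
    if PHONE_RE.match(text) or BRACKETED_RE.match(text):
        return False
    return bool(MATH_HINT_RE.search(text))


def filter_formulas(candidates, min_confidence=0.5):
    """Keep only the candidate dicts that pass is_probable_formula."""
    return [
        c for c in candidates
        if is_probable_formula(c["value"], c.get("confidence", 1.0), min_confidence)
    ]


if __name__ == "__main__":
    detected = [
        {"value": "E = mc^2", "confidence": 0.9},
        {"value": "+49 (0)30 1234-5678", "confidence": 0.8},
        {"value": "[12]", "confidence": 0.7},
    ]
    for item in filter_formulas(detected):
        print(item["value"])
```

You would run this kind of filter on the formula values returned by the service, and then review what it rejects so that you do not accidentally drop real formulas.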
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close out the thread by upvoting this answer and accepting it if it is helpful.