Azure AI services help developers and organizations create intelligent, cutting-edge, market-ready, and responsible applications with out-of-the-box, prebuilt, and customizable APIs and models.
This article covers AI services that provide video and image processing capabilities, such as visual analysis and generation of images, object detection, image classification, and facial recognition.
Services
The following Azure AI services provide video and image processing capabilities:
Azure OpenAI
Use Azure OpenAI for image generation from natural language by using pretrained generative imaging models. For example, you can use Azure OpenAI to generate custom art on demand.
Use Azure OpenAI when you need to perform nonspecific, broad analysis on images. For example, you can use Azure OpenAI to generate accessibility descriptions.
Don't use Azure OpenAI if you want to use open-source image generation models that are available in Azure Machine Learning.
Don't use Azure OpenAI if you need to perform specific types of image processing like form extraction, face recognition, or domain-specialized image characteristic detection. For these scenarios, use or build AI solutions designed specifically for those purposes.
Microsoft Azure AI Vision
Use Vision when you need basic optical character recognition (OCR), image analysis, or basic video analysis to detect motion and other events.
Don't use Vision for analysis that large multimodal foundation models already support.
Don't use Vision to moderate content. Use Microsoft Azure AI Content Safety instead.
Microsoft Azure AI Custom Vision
Use Custom Vision for specific requirements that can't be met by the image analysis that Vision provides. For example, Custom Vision can recognize unusual objects and manufacturing defects. It can also provide detailed custom classifications.
Don't use Custom Vision if you need basic object detection or face detection. Use Azure AI Face or Vision instead.
Don't use Custom Vision for basic visual analysis. Use vision-capable models from Azure OpenAI or open-source models in Machine Learning instead.
Azure AI Face
Use Azure AI Face when you need to check whether faces are live or spoofed or to identify, group, or find similar faces.
Don't use Azure AI Face to detect emotions in faces or perform other high-level reasoning about faces. Use multimodal language models for those tasks instead.
Microsoft Azure AI Video Indexer
Use Video Indexer for advanced video analysis tasks that can't be handled by the basic video analysis in Vision.
Don't use Video Indexer for basic video analysis tasks like people counting and motion and event detection. The basic video analysis in Vision is more cost-effective for these tasks.
Azure OpenAI
Azure OpenAI provides access to OpenAI's powerful language models, including the latest generation of GPT models. These models support visual analysis and image generation. DALL-E models also support image generation.
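As a minimal sketch of what an image-generation call looks like, the snippet below builds the request URL and JSON body for the Azure OpenAI image-generation REST operation. The resource endpoint, deployment name, and API version are placeholder assumptions; replace them with your own resource's values, and check your resource's supported API versions.

```python
import json

# Placeholder values -- substitute your own resource, deployment, and API version.
endpoint = "https://YOUR-RESOURCE.openai.azure.com"
deployment = "dall-e-3"
api_version = "2024-02-01"  # assumed; verify against your resource

url = f"{endpoint}/openai/deployments/{deployment}/images/generations?api-version={api_version}"
payload = {
    "prompt": "A watercolor painting of a lighthouse at dawn",
    "n": 1,
    "size": "1024x1024",
}

# The request would be sent with an `api-key` header, for example via
# requests.post(url, headers={"api-key": key}, json=payload). Here we only
# show the payload that would be serialized.
print(json.dumps(payload))
```

The response contains a URL (or base64 data) for each generated image, which you can then download or display.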
Vision
Vision provides advanced algorithms that process images and return information based on the visual features that you specify. It provides four services: OCR, Azure AI Face, image analysis, and spatial analysis.
Capabilities
The following table provides a list of capabilities available in Vision.
Capability | Description |
---|---|
OCR | OCR extracts text from images. You can use the Read API to extract printed and handwritten text from photos and documents. It uses deep-learning-based models to process text across a variety of surfaces and backgrounds. These materials include business documents, invoices, receipts, posters, business cards, letters, and whiteboards. The OCR APIs support printed text extraction in several languages. |
Azure AI Vision Image Analysis | Image Analysis extracts many visual features from images, such as objects, faces, and autogenerated text descriptions. You can create custom image identifier models by using Image Analysis 4.0, which is based on the Florence foundation model. |
Video Analysis | Video Analysis includes video-related features like Spatial Analysis and Video Retrieval. Spatial Analysis analyzes the presence and movement of people on a video feed and produces events that other systems can respond to. |
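To make the OCR capability concrete, the sketch below extracts the recognized text lines from a Read API result. The JSON is a hand-written example of the v3.2 response shape, not real service output; a production app would receive this document from the service after polling the analyze operation.

```python
# Hand-written example of a Read API (v3.2) result shape.
sample_result = {
    "status": "succeeded",
    "analyzeResult": {
        "readResults": [
            {"page": 1, "lines": [
                {"text": "INVOICE #1234", "boundingBox": [10, 10, 200, 10, 200, 30, 10, 30]},
                {"text": "Total: $56.00", "boundingBox": [10, 40, 180, 40, 180, 60, 10, 60]},
            ]}
        ]
    },
}

def extract_lines(result):
    """Flatten every recognized text line across all pages of the result."""
    return [
        line["text"]
        for page in result["analyzeResult"]["readResults"]
        for line in page["lines"]
    ]

print(extract_lines(sample_result))  # → ['INVOICE #1234', 'Total: $56.00']
```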
Custom Vision
Custom Vision is an image recognition service that you can use to build, deploy, and improve your image identifier models. An image identifier applies labels to images according to their visual characteristics. Each label represents a classification or object. Use Custom Vision to specify your own labels and train custom models to detect them.
Custom Vision uses a machine learning algorithm to analyze images for custom features. You submit sets of images that do and don't have the visual characteristics that you want. Then you label the images with your own labels, or tags, at the time of submission. The algorithm trains on this data and calculates its own accuracy by testing itself on the same images. After you train your model, you can test, retrain, and eventually use the model in your image recognition app to classify images or detect objects. You can also export the model for offline use.
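The label-and-score loop described above can be illustrated with a toy evaluation. This is not Custom Vision's actual algorithm, just a sketch of how precision and recall are computed for one tag when predictions are compared against your labels; the tags and data are invented for the example.

```python
# Invented labels and predictions for a hypothetical "defect" tag.
labels =      ["defect", "ok", "ok", "defect", "ok"]
predictions = ["defect", "ok", "defect", "defect", "ok"]

def precision_recall(labels, predictions, positive_tag):
    """Score predictions for one tag: precision = tp/(tp+fp), recall = tp/(tp+fn)."""
    tp = sum(1 for l, p in zip(labels, predictions) if l == p == positive_tag)
    fp = sum(1 for l, p in zip(labels, predictions) if l != positive_tag and p == positive_tag)
    fn = sum(1 for l, p in zip(labels, predictions) if l == positive_tag and p != positive_tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(labels, predictions, "defect"))  # precision 2/3, recall 1.0
```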
Capabilities
The following table provides a list of capabilities available in Custom Vision.
Capability | Description |
---|---|
Image classification | Predict a category, or class, based on a set of inputs, which are called features. Calculate a probability score for each possible class and return a label that indicates the class that the object most likely belongs to. To use this model, you need data that consists of features and their labels. |
Object detection | Get the coordinates of an object in an image. To use this model, you need data that consists of features and their labels. |
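The object-detection capability returns normalized coordinates, so a common first step is filtering low-confidence predictions and scaling the boxes to pixels. The response below is a hand-written example of the prediction shape (probability, tag name, normalized bounding box), not real service output.

```python
# Hand-written example of an object-detection prediction response.
image_width, image_height = 800, 600
response = {
    "predictions": [
        {"probability": 0.94, "tagName": "logo",
         "boundingBox": {"left": 0.10, "top": 0.20, "width": 0.25, "height": 0.15}},
        {"probability": 0.12, "tagName": "logo",
         "boundingBox": {"left": 0.60, "top": 0.55, "width": 0.10, "height": 0.10}},
    ]
}

def to_pixel_boxes(response, width, height, threshold=0.5):
    """Keep confident detections and scale their normalized boxes to pixels."""
    boxes = []
    for p in response["predictions"]:
        if p["probability"] < threshold:
            continue
        b = p["boundingBox"]
        boxes.append((p["tagName"],
                      round(b["left"] * width), round(b["top"] * height),
                      round(b["width"] * width), round(b["height"] * height)))
    return boxes

print(to_pixel_boxes(response, image_width, image_height))  # → [('logo', 80, 120, 200, 90)]
```

The 0.5 threshold is a policy choice: raise it to trade recall for precision.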
Use cases
The following table provides a list of possible use cases for Custom Vision.
Use case | Description |
---|---|
Use Custom Vision with an IoT device to report visual states. | Use Custom Vision to train a device that has a camera to detect visual states. You can run this detection scenario on an IoT device by using an exported ONNX model. A visual state describes the content of an image, such as an empty room versus a room with people, or an empty driveway versus a driveway with a truck. |
Classify images and objects. | Analyze photos and scan for specific logos by training a custom model. |
Azure AI Face
Azure AI Face provides AI algorithms that detect, recognize, and analyze human faces in images. Facial recognition software is important in various scenarios, such as identification, touchless access control, and automatic face blurring for privacy.
Capabilities
The following table provides a list of capabilities available in Azure AI Face.
Capability | Description |
---|---|
Face detection and analysis | Identify the regions of an image that contain a human face, typically by returning bounding-box coordinates that form a rectangle around the face. |
Find similar faces | The Find Similar operation matches a target face with a set of candidate faces. It identifies a smaller group of faces that closely resemble the target face. This functionality is useful for doing a face search by image. |
Group faces | The Group operation divides a set of unknown faces into several smaller groups based on similarity. Each group is a disjoint proper subset of the original set of faces. It also returns a single messyGroup array that contains the face IDs for which no similarities were found. |
Identification | Face identification can address one-to-many matching of one face in an image to a set of faces in a secure repository. Match candidates are returned based on how closely their face data matches the query face. |
Face recognition operations | Modern enterprises and apps can use the Azure AI Face recognition technologies, including face verification (or one-to-one matching) and face identification (or one-to-many matching), to confirm that a user is who they claim to be. |
Liveness detection | Liveness detection is an anti-spoofing feature that checks whether a user is physically present in front of the camera. It's used to prevent spoofing attacks that use a printed photo, recorded video, or a 3D mask of the user's face. |
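Face detection returns a rectangle per face, which downstream logic can sort or filter. The sketch below picks the largest detected face, a common heuristic for finding the subject of a portrait. The list is a hand-written example of the detection result shape (a face ID plus a `faceRectangle`), not real service output.

```python
# Hand-written example of a face-detection result.
detected_faces = [
    {"faceId": "aaaa-1111",
     "faceRectangle": {"top": 50, "left": 60, "width": 100, "height": 100}},
    {"faceId": "bbbb-2222",
     "faceRectangle": {"top": 30, "left": 300, "width": 60, "height": 60}},
]

def largest_face(faces):
    """Pick the face covering the most pixels -- the likely subject of the image."""
    return max(faces, key=lambda f: f["faceRectangle"]["width"] * f["faceRectangle"]["height"])

print(largest_face(detected_faces)["faceId"])  # → aaaa-1111
```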
Use cases
The following table provides a list of possible use cases for Azure AI Face.
Use case | Description |
---|---|
Verify user identity | Verify a person against a trusted face image. This verification can be used to grant access to digital or physical properties. In most scenarios, the trusted face image comes from a government-issued ID, such as a passport or driver's license, or from an enrollment photo taken in person. During verification, liveness detection can play a crucial role in verifying that the image comes from a real person and not a printed photo or mask. |
Face redaction | Redact or blur detected faces of people recorded in a video to protect their privacy. |
Touchless access control | Compared to methods like cards or tickets, opt-in face identification enables an enhanced access control experience while reducing the hygiene and security risks from physical media sharing, loss, or theft. Facial recognition assists the check-in process with a human in the loop for check-ins in airports, stadiums, theme parks, buildings, reception kiosks at offices, hospitals, gyms, clubs, or schools. |
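For the identity-verification use case above, an access decision typically combines the verification result with the liveness check. The sketch below shows that gating logic; the result shapes are illustrative, and the confidence threshold is a policy choice you would tune for your scenario.

```python
# Illustrative verification result and liveness outcome.
verify_result = {"isIdentical": True, "confidence": 0.82}
liveness_passed = True

def grant_access(verify_result, liveness_passed, min_confidence=0.7):
    """Grant access only to a live user whose face matched with high confidence."""
    return (liveness_passed
            and verify_result["isIdentical"]
            and verify_result["confidence"] >= min_confidence)

print(grant_access(verify_result, liveness_passed))  # → True
```

Requiring liveness before verification closes the spoofing gap that a printed photo or replayed video would otherwise exploit.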
Video Indexer
Video Indexer is a cloud app that's part of AI services. It's built by using Azure AI tools like Face, Translator, Vision, and Speech. It enables you to extract insights from your videos by using Video Indexer video and audio models.
Capabilities
The following table provides a list of some of the capabilities available in Video Indexer.
Capability | Description |
---|---|
Multiple-language speech identification and transcription | Identifies the spoken language in different segments from audio. It sends each segment of the media file to be transcribed and then combines the transcription back to one unified transcription. |
Face detection | Detects and groups faces that appear in the video. |
Celebrity identification | Identifies over 1 million celebrities, like world leaders, actors, artists, athletes, researchers, and business and tech leaders across the globe. The data about these celebrities can also be found on various websites, such as IMDb and Wikipedia. |
Account-based face identification | Trains a model for a specific account. It then recognizes faces in the video based on the trained model. |
Observed people tracking (preview) | Detects observed people in videos. It provides information such as the person's location within the video frame by using bounding boxes. It also includes the exact start and end timestamps for when a person appears and a confidence level for the detection. |
Audio transcription | Converts speech to text across more than 50 languages and allows extensions. |
Language detection | Identifies the dominant spoken language. |
Noise reduction | Clears up telephony audio or noisy recordings (based on Skype filters). |
Translation | Creates translations of the audio transcript to multiple languages. |
For more information, see Video Indexer documentation.
Use cases
The following table provides a list of possible use cases for Video Indexer.
Use case | Description |
---|---|
Deep search | Use the insights extracted from the video to enhance the search experience across a video library. For example, indexing spoken words and faces can enable the search experience of finding moments in a video where a person spoke certain words or when two people were seen together. Search based on such insights from videos is applicable to news agencies, educational institutes, broadcasters, entertainment content owners, enterprise line-of-business apps, and generally to any industry that has a video library that users need to search against. |
Content creation | Create trailers, highlight reels, social media content, or news clips based on the insights Video Indexer extracts from your content. Keyframes, scene markers, and timestamps of people and label appearances simplify the creation process. These elements help you quickly locate the parts of the video that you need when you create content. |
Accessibility | Whether you want to make your content available for people with disabilities or you want your content to be distributed to different regions that use different languages, you can use the transcription and translation that Video Indexer provides in multiple languages. |
Monetization | Video Indexer can help increase the value of videos. For example, industries that rely on ad revenue, such as news media and social media, can deliver relevant ads by using the extracted insights as additional signals to the ad server. |
Content moderation | Use textual and visual content moderation models to keep your users safe from inappropriate content and validate that the content that you publish matches your organization's values. You can automatically block certain videos or alert your users about the content. |
Recommendations | Video insights can be used to improve user engagement by highlighting the relevant video moments to users. By tagging each video with extra metadata, you can recommend to users the most relevant videos and highlight the parts of the video that match their needs. |
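The deep-search use case above amounts to scanning extracted insights for a keyword and returning the moments where it occurs. The sketch below does this over a transcript insight; the fragment is a hand-written example of a transcript shape (text plus timestamped instances), not real Video Indexer output.

```python
# Hand-written example of a transcript insight.
transcript = [
    {"text": "Welcome to the quarterly review.",
     "instances": [{"start": "0:00:02", "end": "0:00:05"}]},
    {"text": "Revenue grew twelve percent.",
     "instances": [{"start": "0:00:06", "end": "0:00:09"}]},
    {"text": "Questions on revenue?",
     "instances": [{"start": "0:01:40", "end": "0:01:43"}]},
]

def find_moments(transcript, keyword):
    """Return (start time, text) for every transcript line mentioning the keyword."""
    keyword = keyword.lower()
    return [(inst["start"], entry["text"])
            for entry in transcript if keyword in entry["text"].lower()
            for inst in entry["instances"]]

print(find_moments(transcript, "revenue"))
# → [('0:00:06', 'Revenue grew twelve percent.'), ('0:01:40', 'Questions on revenue?')]
```

The same pattern extends to other insights, such as face appearances, to answer queries like "when were these two people seen together?"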
Next steps
- What is Vision?
- Learning path: Develop natural language processing solutions with AI services
- Learning path: Get started with AI services
- Learning path: Microsoft Azure AI fundamentals: Computer vision
- Learning path: Create computer vision solutions with Vision
- Learning path: Create an image recognition solution with Azure IoT Edge and AI services