Azure AI services help developers and organizations create intelligent, cutting-edge, market-ready, and responsible applications with out-of-the-box, prebuilt, and customizable APIs and models.
This article covers AI services that provide video and image processing capabilities, such as visual analysis and generation of images, object detection, image classification, and facial recognition.
Services
The following Azure AI services provide video and image processing capabilities:
Azure OpenAI
Use Azure OpenAI for image generation from natural language by using pretrained generative imaging models. For example, you can use Azure OpenAI to generate custom art on demand.
Use Azure OpenAI when you need to perform nonspecific, broad analysis on images. For example, you can use Azure OpenAI to generate accessibility descriptions.
Don't use Azure OpenAI if you want to use open-source image generation models that are available in Azure Machine Learning.
Don't use Azure OpenAI if you need to perform specific types of image processing like form extraction, face recognition, or domain-specialized image characteristic detection. For these scenarios, use or build AI solutions designed specifically for those purposes.
Microsoft Azure AI Vision
Use Vision when you need basic optical character recognition (OCR), image analysis, or basic video analysis to detect motion and other events.
Don't use Vision for analysis that large multimodal foundation models already support.
Don't use Vision to moderate content. Use Microsoft Azure AI Content Safety instead.
Microsoft Azure AI Custom Vision
Use Custom Vision for specific requirements that can't be met by the image analysis that Vision provides. For example, Custom Vision can recognize unusual objects and manufacturing defects. It can also provide detailed custom classifications.
Don't use Custom Vision if you need basic object detection or face detection. Use Azure AI Face or Vision instead.
Don't use Custom Vision for basic visual analysis. Use vision-capable models from Azure OpenAI or open-source models in Machine Learning instead.
Azure AI Face
Use Azure AI Face when you need to check whether faces are live or spoofed or to identify, group, or find similar faces.
Don't use Azure AI Face to detect emotions in faces or perform other high-level reasoning about faces. Use multimodal language models for those tasks instead.
Microsoft Azure AI Video Indexer
Use Video Indexer for advanced video analysis tasks that can't be handled by the basic video analysis in Vision.
Don't use Video Indexer for basic video analysis tasks like people counting and motion and event detection. The basic video analysis in Vision is more cost-effective for these tasks.
Azure OpenAI
Azure OpenAI provides access to OpenAI's powerful language models, including the latest generation of GPT models. These models support visual analysis and image generation. DALL-E models also support image generation.
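As a minimal sketch of what an image-generation call looks like, the snippet below builds the request URL and JSON body for the Azure OpenAI image-generation REST operation. The resource endpoint, deployment name, and API version are placeholder assumptions; replace them with your own resource's values, and check your resource's supported API versions.

```python
import json

# Placeholder values -- substitute your own resource, deployment, and API version.
endpoint = "https://YOUR-RESOURCE.openai.azure.com"
deployment = "dall-e-3"
api_version = "2024-02-01"  # assumed; verify against your resource

url = f"{endpoint}/openai/deployments/{deployment}/images/generations?api-version={api_version}"
payload = {
    "prompt": "A watercolor painting of a lighthouse at dawn",
    "n": 1,
    "size": "1024x1024",
}

# The request would be sent with an `api-key` header, for example via
# requests.post(url, headers={"api-key": key}, json=payload). Here we only
# show the payload that would be serialized.
print(json.dumps(payload))
```

The response contains a URL (or base64 data) for each generated image, which you can then download or display.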
Vision
Vision provides advanced algorithms that process images and return information based on the visual features that you specify. It provides four services: OCR, Azure AI Face, image analysis, and spatial analysis.
Capabilities
The following table provides a list of capabilities available in Vision.
Capability | Description |
---|---|
OCR | OCR extracts text from images. You can use the Read API to extract printed and handwritten text from photos and documents. It uses deep-learning-based models to process text across a variety of surfaces and backgrounds. These materials include business documents, invoices, receipts, posters, business cards, letters, and whiteboards. The OCR APIs support printed text extraction in several languages. |
Azure AI Vision Image Analysis | Image Analysis extracts many visual features from images, such as objects, faces, and autogenerated text descriptions. You can create custom image identifier models by using Image Analysis 4.0, which is based on the Florence foundation model. |
Video Analysis | Video Analysis includes video-related features like Spatial Analysis and Video Retrieval. Spatial Analysis analyzes the presence and movement of people on a video feed and produces events that other systems can respond to. |
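To make the OCR capability concrete, the sketch below extracts the recognized text lines from a Read API result. The JSON is a hand-written example of the v3.2 response shape, not real service output; a production app would receive this document from the service after polling the analyze operation.

```python
# Hand-written example of a Read API (v3.2) result shape.
sample_result = {
    "status": "succeeded",
    "analyzeResult": {
        "readResults": [
            {"page": 1, "lines": [
                {"text": "INVOICE #1234", "boundingBox": [10, 10, 200, 10, 200, 30, 10, 30]},
                {"text": "Total: $56.00", "boundingBox": [10, 40, 180, 40, 180, 60, 10, 60]},
            ]}
        ]
    },
}

def extract_lines(result):
    """Flatten every recognized text line across all pages of the result."""
    return [
        line["text"]
        for page in result["analyzeResult"]["readResults"]
        for line in page["lines"]
    ]

print(extract_lines(sample_result))  # → ['INVOICE #1234', 'Total: $56.00']
```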
Custom Vision
Custom Vision is an image recognition service that you can use to build, deploy, and improve your image identifier models. An image identifier applies labels to images according to their visual characteristics. Each label represents a classification or object. Use Custom Vision to specify your own labels and train custom models to detect them.
Custom Vision uses a machine learning algorithm to analyze images for custom features. You submit sets of images that do and don't have the visual characteristics that you want. Then you label the images with your own labels, or tags, at the time of submission. The algorithm trains on this data and calculates its own accuracy by testing itself on the same images. After you train your model, you can test, retrain, and eventually use the model in your image recognition app to classify images or detect objects. You can also export the model for offline use.
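The label-and-score loop described above can be illustrated with a toy evaluation. This is not Custom Vision's actual algorithm, just a sketch of how precision and recall are computed for one tag when predictions are compared against your labels; the tags and data are invented for the example.

```python
# Invented labels and predictions for a hypothetical "defect" tag.
labels =      ["defect", "ok", "ok", "defect", "ok"]
predictions = ["defect", "ok", "defect", "defect", "ok"]

def precision_recall(labels, predictions, positive_tag):
    """Score predictions for one tag: precision = tp/(tp+fp), recall = tp/(tp+fn)."""
    tp = sum(1 for l, p in zip(labels, predictions) if l == p == positive_tag)
    fp = sum(1 for l, p in zip(labels, predictions) if l != positive_tag and p == positive_tag)
    fn = sum(1 for l, p in zip(labels, predictions) if l == positive_tag and p != positive_tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(labels, predictions, "defect"))  # precision 2/3, recall 1.0
```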
Capabilities
The following table provides a list of capabilities available in Custom Vision.
Capability | Description |
---|---|
Image classification | Predict a category, or class, based on a set of inputs, which are called features. Calculate a probability score for each possible class and return a label that indicates the class that the object most likely belongs to. To use this model, you need data that consists of features and their labels. |
Object detection | Get the coordinates of an object in an image. To use this model, you need data that consists of features and their labels. |
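The object-detection capability returns normalized coordinates, so a common first step is filtering low-confidence predictions and scaling the boxes to pixels. The response below is a hand-written example of the prediction shape (probability, tag name, normalized bounding box), not real service output.

```python
# Hand-written example of an object-detection prediction response.
image_width, image_height = 800, 600
response = {
    "predictions": [
        {"probability": 0.94, "tagName": "logo",
         "boundingBox": {"left": 0.10, "top": 0.20, "width": 0.25, "height": 0.15}},
        {"probability": 0.12, "tagName": "logo",
         "boundingBox": {"left": 0.60, "top": 0.55, "width": 0.10, "height": 0.10}},
    ]
}

def to_pixel_boxes(response, width, height, threshold=0.5):
    """Keep confident detections and scale their normalized boxes to pixels."""
    boxes = []
    for p in response["predictions"]:
        if p["probability"] < threshold:
            continue
        b = p["boundingBox"]
        boxes.append((p["tagName"],
                      round(b["left"] * width), round(b["top"] * height),
                      round(b["width"] * width), round(b["height"] * height)))
    return boxes

print(to_pixel_boxes(response, image_width, image_height))  # → [('logo', 80, 120, 200, 90)]
```

The 0.5 threshold is a policy choice: raise it to trade recall for precision.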
Use cases
The following table provides a list of possible use cases for Custom Vision.
Use case | Description |
---|---|
Use Custom Vision with an IoT device to report visual states. | Use Custom Vision to train a device that has a camera to detect visual states. You can run this detection scenario on an IoT device by using an exported ONNX model. A visual state describes the content of an image, such as an empty room versus a room with people, or an empty driveway versus a driveway with a truck. |
Classify images and objects. | Analyze photos and scan for specific logos by training a custom model. |
Azure AI Face
Azure AI Face provides AI algorithms that detect, recognize, and analyze human faces in images. Facial recognition software is important in various scenarios, such as identification, touchless access control, and automatic face blurring for privacy.
Capabilities
The following table provides a list of capabilities available in Azure AI Face.
Capability | Description |
---|---|
Face detection and analysis | Identify the regions of an image that contain a human face, typically by returning bounding-box coordinates that form a rectangle around the face. |
Find similar faces | The Find Similar operation matches a target face with a set of candidate faces. It identifies a smaller group of faces that closely resemble the target face. This functionality is useful for doing a face search by image. |
Group faces | The Group operation divides a set of unknown faces into several smaller groups based on similarity. Each group is a disjoint proper subset of the original set of faces. It also returns a single messyGroup array that contains the face IDs for which no similarities were found. |
Identification | Face identification can address one-to-many matching of one face in an image to a set of faces in a secure repository. Match candidates are returned based on how closely their face data matches the query face. |
Face recognition operations | Modern enterprises and apps can use the Azure AI Face recognition technologies, including face verification (or one-to-one matching) and face identification (or one-to-many matching), to confirm that a user is who they claim to be. |
Liveness detection | Liveness detection is an anti-spoofing feature that checks whether a user is physically present in front of the camera. It's used to prevent spoofing attacks that use a printed photo, recorded video, or a 3D mask of the user's face. |
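Face detection returns a rectangle per face, which downstream logic can sort or filter. The sketch below picks the largest detected face, a common heuristic for finding the subject of a portrait. The list is a hand-written example of the detection result shape (a face ID plus a `faceRectangle`), not real service output.

```python
# Hand-written example of a face-detection result.
detected_faces = [
    {"faceId": "aaaa-1111",
     "faceRectangle": {"top": 50, "left": 60, "width": 100, "height": 100}},
    {"faceId": "bbbb-2222",
     "faceRectangle": {"top": 30, "left": 300, "width": 60, "height": 60}},
]

def largest_face(faces):
    """Pick the face covering the most pixels -- the likely subject of the image."""
    return max(faces, key=lambda f: f["faceRectangle"]["width"] * f["faceRectangle"]["height"])

print(largest_face(detected_faces)["faceId"])  # → aaaa-1111
```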
Use cases
The following table provides a list of possible use cases for Azure AI Face.
Use case | Description |
---|---|
Verify user identity | Verify a person against a trusted face image. This verification can be used to grant access to digital or physical properties. In most scenarios, the trusted face image comes from a government-issued ID, such as a passport or driver's license, or from an enrollment photo taken in person. During verification, liveness detection can play a crucial role in verifying that the image comes from a real person and not a printed photo or mask. |
Face redaction | Redact or blur detected faces of people recorded in a video to protect their privacy. |
Touchless access control | Compared to methods like cards or tickets, opt-in face identification enables an enhanced access control experience while reducing the hygiene and security risks from physical media sharing, loss, or theft. Facial recognition assists the check-in process with a human in the loop for check-ins in airports, stadiums, theme parks, buildings, reception kiosks at offices, hospitals, gyms, clubs, or schools. |
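For the identity-verification use case above, an access decision typically combines the verification result with the liveness check. The sketch below shows that gating logic; the result shapes are illustrative, and the confidence threshold is a policy choice you would tune for your scenario.

```python
# Illustrative verification result and liveness outcome.
verify_result = {"isIdentical": True, "confidence": 0.82}
liveness_passed = True

def grant_access(verify_result, liveness_passed, min_confidence=0.7):
    """Grant access only to a live user whose face matched with high confidence."""
    return (liveness_passed
            and verify_result["isIdentical"]
            and verify_result["confidence"] >= min_confidence)

print(grant_access(verify_result, liveness_passed))  # → True
```

Requiring liveness before verification closes the spoofing gap that a printed photo or replayed video would otherwise exploit.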
Video Indexer
Video Indexer is a cloud app that's part of AI services. It's built by using Azure AI tools like Face, Translator, Vision, and Speech. It enables you to extract insights from your videos by using Video Indexer video and audio models.
Capabilities
The following table provides a list of some of the capabilities available in Video Indexer.
Capability | Description |
---|---|
Multiple-language speech identification and transcription | Identifies the spoken language in different segments from audio. It sends each segment of the media file to be transcribed and then combines the transcription back to one unified transcription. |
Face detection | Detects and groups faces that appear in the video. |
Celebrity identification | Identifies over 1 million celebrities, like world leaders, actors, artists, athletes, researchers, and business and tech leaders across the globe. The data about these celebrities can also be found on various websites, such as IMDb and Wikipedia. |
Account-based face identification | Trains a model for a specific account. It then recognizes faces in the video based on the trained model. |
Observed people tracking (preview) | Detects observed people in videos. It provides information such as the person's location within the video frame by using bounding boxes. It also includes the exact start and end timestamps for when a person appears and a confidence level for the detection. |
Audio transcription | Converts speech to text across more than 50 languages and allows extensions. |
Language detection | Identifies the dominant spoken language. |
Noise reduction | Clears up telephony audio or noisy recordings (based on Skype filters). |
Translation | Creates translations of the audio transcript to multiple languages. |
For more information, see Video Indexer documentation.
Use cases
The following table provides a list of possible use cases for Video Indexer.
Use case | Description |
---|---|
Deep search | Use the insights extracted from the video to enhance the search experience across a video library. For example, indexing spoken words and faces can enable the search experience of finding moments in a video where a person spoke certain words or when two people were seen together. Search based on such insights from videos is applicable to news agencies, educational institutes, broadcasters, entertainment content owners, enterprise line-of-business apps, and generally to any industry that has a video library that users need to search against. |
Content creation | Create trailers, highlight reels, social media content, or news clips based on the insights Video Indexer extracts from your content. Keyframes, scene markers, and timestamps of people and label appearances simplify the creation process. These elements help you quickly locate the parts of the video that you need when you create content. |
Accessibility | Whether you want to make your content available for people with disabilities or you want your content to be distributed to different regions that use different languages, you can use the transcription and translation that Video Indexer provides in multiple languages. |
Monetization | Video Indexer can help increase the value of videos. For example, industries that rely on ad revenue, such as news media and social media, can deliver relevant ads by using the extracted insights as additional signals to the ad server. |
Content moderation | Use textual and visual content moderation models to keep your users safe from inappropriate content and validate that the content that you publish matches your organization's values. You can automatically block certain videos or alert your users about the content. |
Recommendations | Video insights can be used to improve user engagement by highlighting the relevant video moments to users. By tagging each video with extra metadata, you can recommend to users the most relevant videos and highlight the parts of the video that match their needs. |
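The deep-search use case above amounts to scanning extracted insights for a keyword and returning the moments where it occurs. The sketch below does this over a transcript insight; the fragment is a hand-written example of a transcript shape (text plus timestamped instances), not real Video Indexer output.

```python
# Hand-written example of a transcript insight.
transcript = [
    {"text": "Welcome to the quarterly review.",
     "instances": [{"start": "0:00:02", "end": "0:00:05"}]},
    {"text": "Revenue grew twelve percent.",
     "instances": [{"start": "0:00:06", "end": "0:00:09"}]},
    {"text": "Questions on revenue?",
     "instances": [{"start": "0:01:40", "end": "0:01:43"}]},
]

def find_moments(transcript, keyword):
    """Return (start time, text) for every transcript line mentioning the keyword."""
    keyword = keyword.lower()
    return [(inst["start"], entry["text"])
            for entry in transcript if keyword in entry["text"].lower()
            for inst in entry["instances"]]

print(find_moments(transcript, "revenue"))
# → [('0:00:06', 'Revenue grew twelve percent.'), ('0:01:40', 'Questions on revenue?')]
```

The same pattern extends to other insights, such as face appearances, to answer queries like "when were these two people seen together?"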
Next steps
- What is Vision?
- Learning path: Develop natural language processing solutions with AI services
- Learning path: Get started with AI services
- Learning path: Microsoft Azure AI fundamentals: Computer vision
- Learning path: Create computer vision solutions with Vision
- Learning path: Create an image recognition solution with Azure IoT Edge and AI services