Create an autonomous vehicle operations (AVOps) solution

This architecture provides guidance and recommendations for developing an automated driving solution.

Architecture

Diagram that shows an AVOps architecture.

Dataflow

  1. The following steps outline how measurement data for vehicles is extracted and stored.

    (a) The measurement data for vehicles originates from data streams for sensors such as cameras, radio detection and ranging (RADAR), ultrasound, light detection and ranging (LiDAR), and vehicle telemetry. Data loggers in the vehicle store measurement data on logger storage devices.

    (b) The logger storage data is then uploaded to the landing data lake. A service such as Azure Data Box or Azure Stack Edge, or a dedicated connection like Azure ExpressRoute, ingests data into Azure.

    (c) The measurement data can also be synthetic data from simulations or from other sources. Common formats for this data include Measurement Data Format version 4 (MDF4), Technical Data Management Streaming (TDMS), and ROS bag. In the data operations (DataOps) stage, ingested measurements are processed. Validation and data quality checks, such as checksum verification, are performed to remove low-quality data. In this stage, metadata that a test driver records during a test drive is also extracted from the raw information and stored in a centralized metadata catalog. This information helps downstream processes identify specific scenes and sequences. (A minimal validation and cataloging sketch follows this list.)

  2. An Azure Data Factory extract, transform, and load (ETL) pipeline processes the data. The output is stored as raw and binary data in Azure Data Lake. Azure Cosmos DB stores the metadata. Depending on the scenario, the metadata might then be sent to Azure Data Explorer or Azure AI Search. For more information, see Semantic scene search.

  3. Additional information, insights, and context are added to the data to improve its accuracy and reliability.

  4. The extracted measurement data is provided to labeling partners (human-in-the-loop) through Azure Data Share. Third-party partners use a separate data lake account to automatically label, store, and access the data.

  5. Labeled datasets flow to downstream MLOps processes, mainly to create perception and sensor fusion models. These models perform functions that autonomous vehicles use to detect scenes (for example, lane changes, blocked roads, pedestrians, traffic lights, and traffic signs).

  6. In the validation operations (ValOps) stage, trained models are validated via open-loop and closed-loop testing.

  7. Tools such as Foxglove or Voxel51, running on Azure Kubernetes Service, visualize ingested and processed data.
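
The following sketch illustrates the kind of checksum validation and metadata cataloging described in steps 1(c) and 2. It's a minimal example under stated assumptions, not the reference implementation: the account, container, and field names are hypothetical, and it uses the azure-storage-blob and azure-cosmos Python SDKs.

```python
import hashlib
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient
from azure.cosmos import CosmosClient

# Hypothetical names; replace with your own storage account, Cosmos DB account, and containers.
BLOB_URL = "https://landingdatalake.blob.core.windows.net/raw/drive_0001/front_camera.mf4"
COSMOS_URL = "https://avops-metadata.documents.azure.com:443/"

def sha256_of_blob(blob_url: str) -> str:
    """Stream a measurement file from the landing zone and compute its checksum."""
    blob = BlobClient.from_blob_url(blob_url, credential=DefaultAzureCredential())
    digest = hashlib.sha256()
    for chunk in blob.download_blob().chunks():
        digest.update(chunk)
    return digest.hexdigest()

def register_measurement(cosmos_key: str, measurement: dict) -> None:
    """Write a metadata record for a validated measurement to the centralized catalog."""
    client = CosmosClient(COSMOS_URL, credential=cosmos_key)
    container = client.get_database_client("avops").get_container_client("measurements")
    container.upsert_item(measurement)

# Example catalog entry produced after a successful checksum validation.
measurement = {
    "id": "drive_0001_front_camera",
    "vehicleId": "test-vehicle-17",
    "format": "MDF4",
    "blobUrl": BLOB_URL,
    "sha256": sha256_of_blob(BLOB_URL),
    "recordedScenes": ["lane_change", "pedestrian_crossing"],  # from driver notes
}
```

Downstream processes can then query the catalog by scene type or vehicle to find the sequences they need.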

Data collection

Data collection is one of the main challenges of autonomous vehicle operations (AVOps). The following diagram shows an example of how offline and online vehicle data can be collected and stored in a data lake.

Diagram that shows offline and online data collection.

DataOps

Data operations (DataOps) is a set of practices, processes, and tools for improving the quality, speed, and reliability of data operations. The goal of the DataOps flow for autonomous driving (AD) is to ensure that the data used to control the vehicle is accurate, reliable, and of high quality. By using a consistent DataOps flow, you can improve the speed and accuracy of your data operations and make better decisions about how to control your autonomous vehicles.

DataOps components

  • Data Box transfers the collected vehicle data to Azure via a regional carrier.
  • ExpressRoute extends the on-premises network into the Microsoft cloud over a private connection.
  • Azure Data Lake Storage stores data based on stages, for example, raw or extracted.
  • Azure Data Factory performs ETL using batch compute and creates data-driven workflows for orchestrating data movement and transforming data.
  • Azure Batch runs large-scale applications for tasks like data wrangling, filtering, and preparing data, and extracting metadata.
  • Azure Cosmos DB stores metadata results, like stored measurements.
  • Data Share helps in sharing data with partner organizations, such as labeling companies, with enhanced security.
  • Azure Databricks provides a set of tools for maintaining enterprise-grade data solutions at scale. Azure Databricks is essential for long-running operations on large amounts of vehicle data. Data engineers use Azure Databricks as an analytics workbench and to run Spark jobs (for scenarios such as scene understanding annotations).
  • Azure AI Foundry is a platform for building and operating AI applications in a safe, secure, and responsible way. In the context of AVOps, for example, Azure AI Foundry helps integrate language models with vision capabilities, like Azure OpenAI GPT-4o, to perform offline scene analysis (a minimal sketch follows this list).
  • Azure Data Explorer is a fully managed, high-performance, big data analytics platform that simplifies the analysis of high volumes of metadata, like scene annotations, and also supports vector search over image and text embeddings.
  • Azure AI Search provides data catalog search services.
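
As a minimal sketch of the offline scene analysis mentioned above, the following example sends a camera frame to an Azure OpenAI GPT-4o deployment and asks for a structured scene description. The endpoint, deployment name, and prompt are placeholders, and the sketch assumes the openai Python package with Azure support.

```python
import base64
from openai import AzureOpenAI

# Placeholder endpoint and key; use your own Azure OpenAI resource.
client = AzureOpenAI(
    azure_endpoint="https://avops-openai.openai.azure.com/",
    api_key="<api-key>",
    api_version="2024-06-01",
)

with open("front_camera_frame.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # the deployment name in your Azure OpenAI resource
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this driving scene as JSON: objects, pedestrians, "
                         "traffic lights, traffic signs, and any lane change."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The resulting annotations can be written back to the metadata catalog so that downstream search and labeling processes can find scenes by content.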

MLOps

Machine learning operations (MLOps) include:

  • Feature extraction models such as CLIP and YOLO for classifying scenes (for example, to understand whether a pedestrian is in the scene) during the DataOps pipeline.
  • Auto-labeling models for labeling ingested images and LiDAR and RADAR data.
  • Perception and computer vision models for detecting objects and scenes.
  • A sensor fusion model that combines sensor streams.

The perception model is an important component of this architecture. This model is trained in Azure Machine Learning on detected and extracted scenes to perform object detection.
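
One possible way to submit such a training run is with the Azure Machine Learning Python SDK (v2), sketched below. The workspace, compute, environment, and data paths are placeholders, and the training script (train.py) is assumed to exist in the project folder.

```python
from azure.ai.ml import MLClient, Input, command
from azure.identity import DefaultAzureCredential

# Placeholder workspace details.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="rg-avops",
    workspace_name="mlw-avops",
)

# Command job that trains an object detection model on labeled scenes.
job = command(
    code="./perception",  # folder that contains train.py
    command="python train.py --data ${{inputs.scenes}} --epochs 50",
    inputs={
        "scenes": Input(
            type="uri_folder",
            path="azureml://datastores/labeled_scenes/paths/2024/",  # hypothetical datastore
        )
    },
    environment="azureml:pytorch-gpu@latest",  # hypothetical registered environment
    compute="gpu-cluster",
    display_name="perception-object-detection",
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)
```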

The MLOps pipeline also converts the containerized machine learning model into a format that system on a chip (SoC) hardware and validation/simulation software can read. This step requires the support of the SoC manufacturer.
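
The exact toolchain is vendor specific, but a common intermediate step is to export the trained model to a portable format such as ONNX before the manufacturer's tools compile it for the target chip. A minimal sketch, assuming a PyTorch-based perception model (the torchvision detector here is only a stand-in for your own network):

```python
import torch
import torchvision

# Stand-in for the trained perception model; in practice, load your own weights.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
model.eval()

# Export to ONNX as a hardware-neutral intermediate format. The SoC vendor's
# toolchain then compiles the ONNX graph for the target hardware.
dummy_input = [torch.randn(3, 720, 1280)]  # one RGB camera frame
torch.onnx.export(
    model,
    dummy_input,
    "perception.onnx",
    opset_version=17,
    input_names=["camera_frame"],
    output_names=["boxes", "labels", "scores"],
)
```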

MLOps components

  • Azure Machine Learning is used to develop machine learning algorithms, like feature extraction, auto labeling, object detection and classification, and sensor fusion.
  • Azure DevOps provides support for DevOps tasks like CI/CD, testing, and automation.
  • GitHub for enterprises is an alternative choice for DevOps tasks like CI/CD, testing, and automation.
  • Azure Container Registry enables you to build, store, and manage container images and artifacts in a private registry.

ValOps

Validation operations (ValOps) is the process of testing developed models in simulated environments with the help of managed scenarios before you perform expensive real-world environmental testing. ValOps tests help to ensure that the models meet your desired performance standards, accuracy standards, and safety requirements. The goal of the validation process in the cloud is to identify and address any potential issues before you deploy the autonomous vehicle in a live environment.

Validation operations include:

  • Simulation validation: Cloud-based simulation (open-loop and closed-loop testing) environments enable virtual testing of autonomous vehicle models. This testing runs at scale and is less expensive than real-world testing.
  • Performance validation: Cloud-based infrastructure can run large-scale tests to evaluate the performance of autonomous vehicle models. Performance validation can include stress tests, load tests, and benchmarks.

Using ValOps can help you take advantage of the scalability, flexibility, and cost-effectiveness of a cloud-based infrastructure and reduce time-to-market for autonomous vehicle models.

Open-loop testing

Re-simulation, or sensor processing, is an open-loop test and validation system for automatic driving functions. It's a complex process, and there might be regulatory requirements for safety, data privacy, data versioning, and auditing. Re-simulation processes recorded raw data from various car sensors through a graph in the cloud. Re-simulation validates data processing algorithms or detects regressions. OEMs combine sensors in a directed acyclic graph that represents a real-world vehicle.

Re-simulation is a large-scale parallel compute job. It processes tens or hundreds of PBs of data by using tens of thousands of cores. It requires I/O throughput of more than 30 GB/s. Data from multiple sensors is combined into datasets that represent a view of what the on-vehicle computer vision systems record when the vehicle navigates the real world. An open-loop test validates the performance of the algorithms against ground truth by using replay and scoring. The output is used later in the workflow for algorithm training.
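
Replay and scoring can be reduced to comparing the re-simulated detections against ground truth, frame by frame. The sketch below shows one simplified scoring approach based on intersection-over-union matching; the data structures and threshold are illustrative, not part of the reference architecture.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned 2D bounding box: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float
    label: str

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0

def score_frame(detections: list[Box], ground_truth: list[Box], threshold: float = 0.5):
    """Return (true positives, false positives, false negatives) for one frame."""
    matched, tp = set(), 0
    for det in detections:
        candidates = [
            i for i, gt in enumerate(ground_truth)
            if gt.label == det.label and i not in matched
        ]
        best = max(candidates, key=lambda i: iou(det, ground_truth[i]), default=None)
        if best is not None and iou(det, ground_truth[best]) >= threshold:
            matched.add(best)
            tp += 1
    return tp, len(detections) - tp, len(ground_truth) - tp
```

Aggregated over a re-simulation run, these counts yield the precision and recall values that are compared across software versions.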

  • Datasets are sourced from test fleet vehicles that collect raw sensor data (for example, camera, LiDAR, RADAR, and ultrasonic data).
  • Data volume depends on camera resolution and the number of sensors on the vehicle.
  • Raw data is reprocessed against different software releases of the devices.
  • Raw sensor data is sent to the sensor input interface of the sensor software.
  • Output is compared with the output of previous software versions and is checked against bug fixes or new features, like detecting new object types.
  • A second re-injection of the job is performed after the model and software are updated.
  • Ground truth data is used to validate the results.
  • The results are written to storage and offloaded to Azure Data Explorer for visualization and interactive search. For example, you can search the collected scenes to select training and validation data for the ADAS perception stack (see the query sketch after this list).
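
For example, a validation engineer might query scene annotations in Azure Data Explorer as shown below. It's a sketch only: the cluster URL, database, table, and column names are hypothetical, and it assumes the azure-kusto-data package.

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Hypothetical cluster and database names.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://avops-adx.westeurope.kusto.windows.net"
)
client = KustoClient(kcsb)

# Find drives where pedestrian detection recall regressed for a given software version.
query = """
SceneAnnotations
| where SceneType == "pedestrian_crossing" and SoftwareVersion == "2.4.1"
| summarize Recall = avg(TruePositives * 1.0 / (TruePositives + FalseNegatives)) by DriveId
| where Recall < 0.9
"""
response = client.execute("avops", query)
for row in response.primary_results[0]:
    print(row["DriveId"], row["Recall"])
```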

Closed-loop testing and simulation

Closed-loop testing of autonomous vehicles is the process of testing vehicle capabilities while including real-time feedback from the environment. The vehicle's actions are based both on its pre-programmed behavior and on the dynamic conditions that it encounters, and it adjusts its actions accordingly. Closed-loop testing runs in a more complex and realistic environment. It's used to assess the vehicle's ability to handle real-world scenarios, including how it reacts to unexpected situations. The goal of closed-loop testing is to verify that the vehicle can operate safely and effectively in various conditions, and to refine its control algorithms and decision-making processes as needed.

The ValOps pipeline integrates closed-loop testing, third-party simulations, and independent software vendor (ISV) applications.

Scenario management

During the ValOps stage, a catalog of real scenarios helps validate the autonomous driving solution's ability to simulate the behavior of autonomous vehicles. The objective is to speed up the creation of scenario catalogs by automatically reading the route network, which is part of a scenario, from publicly accessible and freely available digital maps. Use third-party tools for scenario management or a lightweight open-source simulator like CARLA, which supports the OpenDRIVE (.xodr) format. For more information, see ScenarioRunner for CARLA.
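
For example, CARLA can build a simulation world directly from an OpenDRIVE route network and then run scenarios against it. A minimal sketch, assuming a locally running CARLA server and the carla Python package; the file name and vehicle blueprint are illustrative.

```python
import carla

# Connect to a locally running CARLA server (default port 2000).
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)

# Build the simulation world from an OpenDRIVE (.xodr) route network.
with open("route_network.xodr") as f:
    world = client.generate_opendrive_world(f.read())

# Use synchronous mode so that the test harness controls simulation time.
settings = world.get_settings()
settings.synchronous_mode = True
settings.fixed_delta_seconds = 0.05
world.apply_settings(settings)

# Spawn an ego vehicle at the first available spawn point.
blueprint = world.get_blueprint_library().filter("vehicle.tesla.model3")[0]
ego = world.spawn_actor(blueprint, world.get_map().get_spawn_points()[0])

# Step the simulation. In closed-loop testing, the driving stack under test
# reads sensor data and sends control commands back to the ego vehicle here.
for _ in range(100):
    world.tick()
```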

ValOps components

  • Azure Kubernetes Service runs large-scale batch inference for open-loop validation within a re-simulation framework. We recommend that you use BlobFuse2 to access the measurement files. You can also use NFS, but you need to evaluate performance for the use case.
  • Azure Batch runs large-scale batch inference for open-loop validation within a re-simulation framework (see the sketch after this list).
  • Azure Data Explorer provides an analytics service for measurements and KPIs (that is, re-simulation and job runs).
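
One way to fan out open-loop re-simulation work across an Azure Batch pool is sketched below, with one task per measurement file. The account, pool, and command line are hypothetical, and the sketch assumes the azure-batch Python SDK and an existing pool whose nodes already have the re-simulation tooling installed.

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Hypothetical account, pool, and job names.
credentials = SharedKeyCredentials("avopsbatch", "<account-key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://avopsbatch.westeurope.batch.azure.com"
)

job_id = "resim-run-001"
batch_client.job.add(
    batchmodels.JobAddParameter(
        id=job_id,
        pool_info=batchmodels.PoolInformation(pool_id="resim-pool"),
    )
)

# One task per recorded measurement file.
measurement_files = ["drive_0001.mf4", "drive_0002.mf4", "drive_0003.mf4"]
tasks = [
    batchmodels.TaskAddParameter(
        id=f"resim-{i:04d}",
        command_line=f"/opt/resim/run --input {name} --output results/{name}.json",
    )
    for i, name in enumerate(measurement_files)
]
batch_client.task.add_collection(job_id, tasks)
```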

Centralized AVOps functions

An AVOps architecture is complex and involves various third parties, roles, and development stages, so it's important to implement a good governance model.

We recommend that you create a centralized team to handle functions such as infrastructure provisioning, cost management, metadata and data catalog, lineage, and overall orchestration and event handling. Centralizing these services is efficient and simplifies operations.

We recommend that you use a centralized team to handle these responsibilities:

  • Providing Azure Resource Manager (ARM) and Bicep templates, including templates for standard services like storage and compute used by each area and subarea of the AVOps architecture
  • Implementing central Azure Service Bus and Azure Event Hubs instances for event-driven orchestration of the AVOps data loop (see the sketch after this list)
  • Owning the metadata catalog
  • Providing capabilities for end-to-end lineage and traceability across all AVOps components
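
For example, the DataOps stage could publish an event to a central Service Bus topic when a measurement finishes ingestion, so that downstream MLOps and ValOps pipelines can react to it. A minimal sketch, assuming the azure-servicebus package; the topic name and payload fields are hypothetical.

```python
import json
from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONNECTION_STRING = "<service-bus-connection-string>"
TOPIC_NAME = "avops-data-loop"  # hypothetical central topic

event = {
    "eventType": "MeasurementIngested",
    "measurementId": "drive_0001_front_camera",
    "storageUrl": "https://landingdatalake.blob.core.windows.net/raw/drive_0001/",
    "stage": "DataOps",
}

with ServiceBusClient.from_connection_string(CONNECTION_STRING) as client:
    sender = client.get_topic_sender(topic_name=TOPIC_NAME)
    with sender:
        sender.send_messages(ServiceBusMessage(json.dumps(event)))
```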

Diagram that shows centralized AVOps functions.

Scenario details

You can use this architecture to build an automated driving solution on Azure.

Potential use cases

This architecture benefits automotive OEMs, Tier 1 vendors, and ISVs that develop solutions for automated driving.

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that you can use to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.

Security

Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.

It's important to understand the division of responsibility between the automotive OEM and the cloud provider. In the vehicle, the OEM owns the whole stack, but as the data moves to the cloud, some responsibilities transfer to the cloud provider. Azure platform as a service (PaaS) provides built-in, enhanced security on the physical stack, including the operating system. You can apply the following improvements in addition to the infrastructure security components. These improvements enable a Zero Trust approach.

Cost optimization

Cost optimization is about reducing unnecessary expenses and improving operational efficiencies. For more information, see Overview of the cost optimization pillar.

You can use these strategies to reduce the costs that are associated with developing autonomous driving solutions:

  • Optimize cloud infrastructure. Careful planning and management of cloud infrastructure can help you reduce costs. For example, use cost-effective instance types and scale infrastructure to meet changing workloads. Follow the guidance in the Azure Cloud Adoption Framework.
  • Use Spot Virtual Machines. You can determine which workloads in your AVOps deployment don't require processing within a specific time frame and use Spot Virtual Machines for these workloads. Spot Virtual Machines allow you to take advantage of unused Azure capacity for significant cost savings. If Azure needs the capacity back, the Azure infrastructure evicts Spot Virtual Machines.
  • Use autoscaling. Autoscaling enables you to automatically adjust your cloud infrastructure based on demand, reducing the need for manual intervention and helping you reduce costs. For more information, see Design for scaling.
  • Consider using hot, cool, and archive tiers for storage. Storage can be a significant cost in an autonomous driving solution, so you need to choose cost-effective storage options, like the cool or archive tiers, for data that's accessed infrequently. For more information, see data lifecycle management. (A minimal tiering sketch follows this list.)
  • Use cost management and optimization tools. Microsoft Cost Management provides tools that can help you identify and address areas for cost reduction, like unused or underutilized resources.
  • Consider using Azure services. For example, you can use Azure Machine Learning to build and train autonomous driving models. Using these services can be more cost-effective than building and maintaining in-house infrastructure.
  • Use shared resources. When possible, you can use shared resources, such as shared databases or shared compute resources, to reduce the costs that are associated with autonomous driving development. The centralized functions in this architecture, for example, implement a central bus, event hub, and metadata catalog. Services like Azure Data Share can also help you achieve this goal.
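
As a small illustration of tiering, the following sketch moves older measurement blobs to the archive tier with the azure-storage-blob SDK. In practice, a lifecycle management policy on the storage account is usually preferable; the account, container name, and age threshold here are arbitrary.

```python
from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

# Hypothetical landing-zone container with raw measurement data.
container = ContainerClient(
    account_url="https://landingdatalake.blob.core.windows.net",
    container_name="raw",
    credential=DefaultAzureCredential(),
)

cutoff = datetime.now(timezone.utc) - timedelta(days=180)
for blob in container.list_blobs():
    # Archive raw measurements that haven't been modified in six months.
    if blob.last_modified < cutoff:
        container.get_blob_client(blob.name).set_standard_blob_tier("Archive")
```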

Next steps

For more information about developing DataOps for an automated driving system, see:

You might also be interested in these related articles: