Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
In a big data architecture, there is often a need for an analytical data store that serves processed data in a structured format that can be queried using analytical tools. Analytical data stores that support querying of both hot-path and cold-path data are collectively referred to as the serving layer, or data serving storage.
The serving layer deals with processed data from both the hot path and cold path. In the lambda architecture, the serving layer is subdivided into a speed serving layer, which stores data that has been processed incrementally, and a batch serving layer, which contains the batch-processed output. The serving layer requires strong support for random reads with low latency. Data storage for the speed layer should also support random writes, because batch loading data into this store would introduce undesired delays. On the other hand, data storage for the batch layer does not need to support random writes, but batch writes instead.
There is no single best data management choice for all data storage tasks. Different data management solutions are optimized for different tasks. Most real-world cloud apps and big data processes have a variety of data storage requirements and often use a combination of data storage solutions.
Modern analytical solutions, such as Microsoft Fabric, provide a comprehensive platform that integrates various data services and tools to meet diverse analytical needs. Microsoft Fabric includes OneLake, a single, unified, logical data lake for your entire organization, designed to store, manage, and secure all organizational data in one location. This flexibility allows organizations to address a wide range of data storage and processing requirements within a unified framework.
What are your options when choosing an analytical data store?
There are several options for data serving storage in Azure, depending on your needs:
- Microsoft Fabric
- Azure Synapse Analytics
- Azure Synapse Spark pools
- Azure Databricks
- Azure Data Explorer
- Azure SQL Database
- SQL Server in Azure VM
- HBase/Phoenix on HDInsight
- Hive LLAP on HDInsight
- Azure Analysis Services
- Azure Cosmos DB
These options provide various database models that are optimized for different types of tasks:
- Key/value databases hold a single serialized object for each key value. They're good for storing large volumes of data where you want to get one item for a given key value and you don't have to query based on other properties of the item.
- Document databases are key/value databases in which the values are documents. A "document" in this context is a collection of named fields and values. The database typically stores the data in a format such as XML, YAML, JSON, or binary JSON (BSON), but may use plain text. Document databases can query on non-key fields and define secondary indexes to make querying more efficient. This makes a document database more suitable for applications that need to retrieve data based on criteria more complex than the value of the document key. For example, you could query on fields such as product ID, customer ID, or customer name.
- Column store databases are key/value data stores that store each column separately on disk. A wide column store database is a type of column store database that stores column families, not just single columns. For example, a census database might have a column family for a person's name (first, middle, last), a family for the person's address, and a family for the person's profile information (date of birth, gender). The database can store each column family in a separate partition, while keeping all the data for one person related to the same key. An application can read a single column family without reading through all of the data for an entity.
- Graph databases store information as a collection of objects and relationships. A graph database can efficiently perform queries that traverse the network of objects and the relationships between them. For example, the objects might be employees in a human resources database, and you might want to facilitate queries such as "find all employees who directly or indirectly work for Scott."
- Telemetry and time-series databases are an append-only collection of objects. Telemetry databases efficiently index data in a variety of column stores and in-memory structures, making them the optimal choice for storing and analyzing vast quantities of telemetry and time series data.
- Microsoft Fabric supports various database models, including key/value, document, column store, graph, and telemetry databases, providing flexibility and scalability for different analytical tasks.
Key selection criteria
To narrow the choices, start by answering these questions:
Do you need serving storage that can serve as a hot path for your data? If yes, narrow your options to those that are optimized for a speed serving layer.
Do you need massively parallel processing (MPP) support, where queries are automatically distributed across several processes or nodes? If yes, select an option that supports query scale-out.
Do you prefer to use a relational data store? If so, narrow your options to those with a relational database model. However, note that some non-relational stores support SQL syntax for querying, and tools such as PolyBase can be used to query non-relational data stores.
Do you collect time series data? Do you use append-only data?
Microsoft Fabric's OneLake supports multiple analytical engines, including Analysis Services, T-SQL, and Apache Spark, making it suitable for various data processing and querying needs
Capability matrix
The following tables summarize the key differences in capabilities.
General capabilities
Capability | SQL Database | Azure Synapse SQL pool | Azure Synapse Spark pool | Azure Data Explorer | HBase/Phoenix on HDInsight | Hive LLAP on HDInsight | Azure Analysis Services | Azure Cosmos DB | Microsoft Fabric |
---|---|---|---|---|---|---|---|---|---|
Is managed service | Yes | Yes | Yes | Yes | Yes 1 | Yes 1 | Yes | Yes | Yes |
Primary database model | Relational (column store format when using columnstore indexes) | Relational tables with column storage | Wide column store | Relational (column store), telemetry, and time series store | Wide column store | Hive/In-Memory | Tabular semantic models | Document store, graph, key-value store, wide column store | Unified data lake, relational, telemetry, time series, document store, graph, key-value store |
SQL language support | Yes | Yes | Yes | Yes | Yes (using Phoenix JDBC driver) | Yes | No | Yes | Yes |
Optimized for speed serving layer | Yes 2 | Yes 3 | Yes | Yes | Yes | Yes | No | Yes | Yes |
[1] With manual configuration and scaling.
[2] Using memory-optimized tables and hash or nonclustered indexes.
[3] Supported as an Azure Stream Analytics output.
Scalability capabilities
Capability | SQL Database | Azure Synapse SQL pool | Azure Synapse Spark pool | Azure Data Explorer | HBase/Phoenix on HDInsight | Hive LLAP on HDInsight | Azure Analysis Services | Azure Cosmos DB | Microsoft Fabric |
---|---|---|---|---|---|---|---|---|---|
Redundant regional servers for high availability | Yes | No | No | Yes | Yes | No | Yes | Yes | Yes |
Supports query scale-out | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Dynamic scalability (scale up) | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes |
Supports in-memory caching of data | Yes | Yes | Yes | Yes | No | Yes | Yes | No | Yes |
Security capabilities
Capability | SQL Database | Azure Synapse | Azure Data Explorer | HBase/Phoenix on HDInsight | Hive LLAP on HDInsight | Azure Analysis Services | Azure Cosmos DB | Microsoft Fabric |
---|---|---|---|---|---|---|---|---|
Authentication | SQL / Microsoft Entra ID | SQL / Microsoft Entra ID | Microsoft Entra ID | local / Microsoft Entra ID 1 | local / Microsoft Entra ID 1 | Microsoft Entra ID | database users / Microsoft Entra ID via access control (identity and access management (IAM)) | Microsoft Entra ID |
Data encryption at rest | Yes 2 | Yes 2 | Yes | Yes 1 | Yes 1 | Yes | Yes | Yes |
Row-level security | Yes | Yes 3 | Yes | Yes 1 | Yes 1 | Yes | No | Yes |
Supports firewalls | Yes | Yes | Yes | Yes 4 | Yes 4 | Yes | Yes | Yes |
Dynamic data masking | Yes | Yes | Yes | Yes 1 | Yes | No | No | Yes |
[1] Requires using a domain-joined HDInsight cluster.
[2] Requires using transparent data encryption to encrypt and decrypt your data at rest.
[3] Filter predicates only. See Row-Level Security
[4] When used within an Azure virtual network. For more information, see Extend Azure HDInsight using an Azure Virtual Network.
Contributors
This article is maintained by Microsoft.
Next steps
- Analyze data in a relational data warehouse
- Create a single database - Azure SQL Database
- Create an Azure Databricks workspace
- Create Apache Spark cluster in Azure HDInsight using Azure portal
- Creating a Synapse workspace
- Explore Azure data services for modern analytics
- Explore Azure database and analytics services
- Query Azure Cosmos DB by using the API for NoSQL