Troubleshoot high data ingestion in Application Insights

An increase in billing charges for Application Insights or Log Analytics often occurs due to high data ingestion. This article helps you troubleshoot this issue and provides methods to reduce data ingestion costs.

General troubleshooting steps

Step 1: Identify resources presenting high data ingestion

In the Azure portal, navigate to your subscription and select Cost Management > Cost analysis. This blade offers cost analysis views to chart costs per resource, as follows:

Screenshot that shows the 'cost analysis' blade.
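
If you prefer to stay in the query editor, you can sketch a similar per-resource breakdown directly against the Log Analytics workspace by using the standard _ResourceId, _IsBillable, and _BilledSize columns. The following is a minimal example; the find operator scans every table, so keep the time range short:

find where TimeGenerated > ago(1d) project _ResourceId, _IsBillable, _BilledSize
| where _IsBillable == true
| summarize BilledBytes = sum(_BilledSize) by _ResourceId
| extend IngestedVolumeGB = format_bytes(BilledBytes, 1, "GB")
| sort by BilledBytes desc
| project-away BilledBytes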

Step 2: Identify costly tables with high data ingestion

Once you've identified an Application Insights resource or a Log Analytics workspace, analyze the data and determine where the highest ingestion occurs. Consider the approach that best suits your scenario:

  • Based on raw record count

    Use the following query to compare record counts across tables:

    search *
    | where timestamp > ago(7d)
    | summarize count() by $table
    | sort by count_ desc
    

    This query can help identify the noisiest tables. From there, you can refine your queries to narrow down the investigation.
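
    For example, if traces turns out to be the noisiest table, a quick refinement (a minimal sketch that anticipates Step 3; substitute the table and column that match your case) is to break it down by role name:

    traces
    | where timestamp > ago(7d)
    | summarize count() by cloud_RoleName
    | sort by count_ desc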

  • Based on consumed bytes

    Determine tables with the highest byte ingestion using the format_bytes() scalar function:

    systemEvents
    | where timestamp > ago(7d)
    | where type == "Billing"
    | extend BillingTelemetryType = tostring(dimensions["BillingTelemetryType"])
    | extend BillingTelemetrySizeInBytes = todouble(measurements["BillingTelemetrySize"])
    | summarize TotalBillingTelemetrySize = sum(BillingTelemetrySizeInBytes) by BillingTelemetryType
    | extend BillingTelemetrySizeGB = format_bytes(TotalBillingTelemetrySize, 1, "GB")
    | sort by TotalBillingTelemetrySize desc
    | project-away TotalBillingTelemetrySize
    

    Similar to the record count query, the preceding query can help identify the most active tables, allowing you to pinpoint specific tables for further investigation.
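
    To see how those byte volumes trend over time, you can bucket the same billing telemetry by day (a minimal sketch based on the preceding query):

    systemEvents
    | where timestamp > ago(7d)
    | where type == "Billing"
    | extend BillingTelemetryType = tostring(dimensions["BillingTelemetryType"])
    | extend BillingTelemetrySizeInBytes = todouble(measurements["BillingTelemetrySize"])
    | summarize DailyBilledBytes = sum(BillingTelemetrySizeInBytes) by bin(timestamp, 1d), BillingTelemetryType
    | sort by timestamp desc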

  • Using Log Analytics workspace Workbooks

    In the Azure portal, navigate to your Log Analytics workspace, select Monitoring > Workbooks, and then select Usage under Log Analytics Workspace Insights.

    Screenshot that shows the Log Analytics workbook pane.

    This workbook provides valuable insights, such as the percentage of data ingestion for each table and detailed ingestion statistics for each resource reporting to the same workspace.

Step 3: Determine factors contributing to high data ingestion

After identifying the tables with high data ingestion, focus on the table with the highest activity and determine factors contributing to high data ingestion. This might be a specific application that generates more data than the others, an exception message that is logged too frequently, or a new logger category that emits too much information.

Here are some sample queries you can use for this identification:

requests
| where timestamp > ago(7d)
| summarize count() by cloud_RoleInstance
| sort by count_ desc
requests
| where timestamp > ago(7d)
| summarize count() by operation_Name
| sort by count_ desc
dependencies
| where timestamp > ago(7d)
| summarize count() by cloud_RoleName
| sort by count_ desc
dependencies
| where timestamp > ago(7d)
| summarize count() by type
| sort by count_ desc
traces
| where timestamp > ago(7d)
| summarize count() by message
| sort by count_ desc
exceptions
| where timestamp > ago(7d)
| summarize count() by message
| sort by count_ desc
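
If a noisy logger category is a suspect, you can also group traces by the category recorded in custom dimensions. This is a sketch; the exact key (CategoryName here) depends on the SDK and logging framework in use:

traces
| where timestamp > ago(7d)
| extend LoggerCategory = tostring(customDimensions["CategoryName"]) // key name varies by SDK
| summarize count() by LoggerCategory
| sort by count_ desc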

You can try different telemetry fields. For example, perhaps you first run the following query and observe there's no obvious cause for the excessive telemetry:

dependencies
| where timestamp > ago(7d)
| summarize count() by target
| sort by count_ desc

However, you can try another telemetry field instead of target, such as type:

dependencies
| where timestamp > ago(7d)
| summarize count() by type
| sort by count_ desc

In some scenarios, you might need to investigate a specific application or instance further. Use the following queries to identify noisy messages or exception types:

traces
| where timestamp > ago(7d)
| where cloud_RoleName == 'Specify a role name'
| summarize count() by message
| sort by count_ desc
exceptions
| where timestamp > ago(7d)
| where cloud_RoleInstance == 'Specify a role instance'
| summarize count() by type
| sort by count_ desc

Step 4: Investigate the evolution of ingestion over time

Examine the evolution of ingestion over time based on the factors identified previously. This way, you can determine whether the behavior has been consistent or whether it changed at a specific point. By analyzing data in this way, you can pinpoint when the change happened and gain a clearer understanding of the causes behind the high data ingestion. This insight is important for addressing the issue and implementing effective solutions.

In the following queries, the bin() Kusto Query Language (KQL) scalar function is used to segment data into one-day intervals. This approach facilitates trend analysis because you can see how the data has (or hasn't) changed over time.

dependencies
| where timestamp > ago(30d)
| summarize count() by bin(timestamp, 1d), operation_Name
| sort by timestamp desc
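
To visualize the same trend, you can render the daily counts as a chart (a minimal sketch; run it in the query editor):

dependencies
| where timestamp > ago(30d)
| summarize count() by bin(timestamp, 1d), operation_Name
| render timechart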

Use the min() aggregation function to identify the earliest recorded timestamp for specific factors. This approach helps establish a baseline and offers insights into when events or changes first occurred.

dependencies
| where timestamp > ago(30d)
| where type == 'Specify dependency type being investigated'
| summarize min(timestamp) by type
| sort by min_timestamp desc

Troubleshooting steps for specific scenarios

Scenario 1: High data ingestion in Log Analytics

  1. Query all tables within a Log Analytics workspace:

    search *
    | where TimeGenerated > ago(7d)
    | where _IsBillable == true
    | summarize TotalBilledSize = sum(_BilledSize) by $table
    | extend IngestedVolumeGB = format_bytes(TotalBilledSize, 1, "GB")
    | sort by TotalBilledSize desc
    | project-away TotalBilledSize
    

    This query shows which table is the biggest contributor to costs. Here's an example in which AppTraces is the largest:

    Screenshot that shows that the AppTraces table is the biggest contributor to costs.

  2. Query the specific application driving the costs for traces:

    AppTraces
    | where TimeGenerated > ago(7d)
    | where _IsBillable == true
    | summarize TotalBilledSize = sum(_BilledSize) by AppRoleName
    | extend IngestedVolumeGB = format_bytes(TotalBilledSize, 1, "GB")
    | sort by TotalBilledSize desc
    | project-away TotalBilledSize
    

    Screenshot that shows the specific application driving the costs for traces.

  3. Run the following query specific to that application and look further into the specific logger categories sending telemetry to the AppTraces table:

    AppTraces
    | where TimeGenerated > ago(7d)
    | where _IsBillable == true
    | where AppRoleName contains 'transformation'
    | extend LoggerCategory = Properties['Category']
    | summarize TotalBilledSize = sum(_BilledSize) by tostring(LoggerCategory)
    | extend IngestedVolumeGB = format_bytes(TotalBilledSize, 1, "GB")
    | sort by TotalBilledSize desc
    | project-away TotalBilledSize
    

    The result shows two main categories responsible for the costs:

    Screenshot that shows the specific logger categories sending telemetry to the AppTraces table.

Scenario 2: High data ingestion in Application Insights

To determine the factors contributing to the costs, follow these steps:

  1. Query the telemetry across all tables and obtain a record count per table and SDK version:

    search *
    | where TimeGenerated > ago(7d)
    | summarize count() by $table, SDKVersion
    | sort by count_ desc
    

    The following example shows that Azure Functions is generating a large volume of trace and exception telemetry:

    Screenshot that shows which table and SDK is generating the most Trace and Exception telemetry.

  2. Run the following query to identify the specific app that's generating more traces than the others:

    AppTraces
    | where TimeGenerated > ago(7d)
    | where SDKVersion == 'azurefunctions: 4.34.2.22820'
    | summarize count() by AppRoleName
    | sort by count_ desc
    

    Screenshot that shows which app is generating the most traces.

  3. Refine the query to include that specific app and generate a count of records per individual message:

    AppTraces
    | where TimeGenerated > ago(7d)
    | where SDKVersion == 'azurefunctions: 4.34.2.22820'
    | where AppRoleName contains 'inbound'
    | summarize count() by Message
    | sort by count_ desc
    

    The result can show the specific message that's increasing ingestion costs:

    Screenshot that shows a count of records per each message.

Scenario 3: Daily cap reached unexpectedly

Assume you reached the daily cap unexpectedly on September 4. Use the following query to obtain a count of custom events and identify the most recent timestamp associated with each event:

customEvents
| where timestamp between (datetime(2024-08-25) .. 15d)
| summarize count(), min(timestamp) by name

This analysis indicates that certain events started being ingested on September 4 and subsequently became noisy very quickly.

Screenshot that shows a count of custom events.
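
To confirm how quickly the ingestion of those events ramped up, you can chart their daily volume. This is a sketch; replace the placeholder with an event name identified in the previous query:

customEvents
| where timestamp between (datetime(2024-08-25) .. 15d)
| where name == 'Specify an event name identified above'
| summarize count() by bin(timestamp, 1d)
| sort by timestamp asc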

Reduce data ingestion costs

After identifying the factors in the Azure Monitor tables that are responsible for unexpected data ingestion, use the following methods to reduce data ingestion costs, depending on your scenario.

Method 1: Update the daily cap configuration

Adjust the daily cap to prevent excessive telemetry ingestion.
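
To verify when the cap was hit, you can query the workspace's operational log. The following is a sketch that assumes the daily-cap events are surfaced in the _LogOperation table under the Ingestion category:

_LogOperation
| where TimeGenerated > ago(31d)
| where Category == "Ingestion"
| where Detail contains "quota" // assumption: daily-cap events mention the quota in the Detail column
| project TimeGenerated, Operation, Level, Detail
| sort by TimeGenerated desc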

Method 2: Switch the table plan

Switch to another supported table plan for Application Insights. Billing for data ingestion depends on the table plan and the region of the Log Analytics workspace. See Table plans and Tables that support the Basic table plan in Azure Monitor Logs.

Method 3: Use telemetry SDK features for the Java agent

The default recommended solution is to use sampling overrides. The Application Insights Java agent provides two types of sampling, and a common use case for sampling overrides is suppressing the collection of telemetry for health checks.

Supplemental methods are also available to use alongside sampling overrides.
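
Before configuring a sampling override, it can help to quantify how much request telemetry the health checks actually generate. The following is a sketch; the 'health' filter is an assumption, so adjust it to match your endpoints:

requests
| where timestamp > ago(7d)
| where name contains "health" or url contains "health" // assumption: health-check endpoints contain "health"
| summarize HealthCheckCount = count() by name
| sort by HealthCheckCount desc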

Method 4: Update the application code (log levels or exceptions)

In some scenarios, updating the application code directly might help reduce the amount of telemetry generated and consumed by the Application Insights backend service. A common example might be a noisy exception surfaced by the application.
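
To decide which exception to address in code first, a quick ranking of the noisiest exception types can help (a minimal sketch using the problemId and outerMessage columns):

exceptions
| where timestamp > ago(7d)
| summarize ExceptionCount = count() by problemId, outerMessage
| top 10 by ExceptionCount desc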

Third-party contact disclaimer

Microsoft provides third-party contact information to help you find additional information about this topic. This contact information may change without notice. Microsoft does not guarantee the accuracy of third-party contact information.

Contact us for help

If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.