title | description | ms.topic | ms.date | ms.custom |
---|---|---|---|---|
Metrics, alerts, and diagnostic logs | Record and analyze diagnostic log events for Azure Batch account resources like pools and tasks. | how-to | 04/13/2021 | seodec18 |
Azure Monitor collects metrics and diagnostic logs for resources in your Azure Batch account.
You can collect and consume this data in a variety of ways to monitor your Batch account and diagnose issues. You can also configure metric alerts so you receive notifications when a metric reaches a specified value.
Metrics are Azure telemetry data (also called performance counters) that are emitted by your Azure resources and consumed by the Azure Monitor service. Examples of metrics in a Batch account are Pool Create Events, Low-Priority Node Count, and Task Complete Events. These metrics can help identify trends and can be used for data analysis.
See the list of supported Batch metrics.
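If you prefer to discover the available metrics programmatically rather than from the documentation, the sketch below enumerates a Batch account's metric definitions. It assumes the azure-identity and azure-monitor-query Python packages and a placeholder resource ID; neither package choice is mandated by Batch.

```python
# Hedged sketch: enumerate metric definitions for a Batch account.
# Assumes the azure-identity and azure-monitor-query packages are installed
# and that DefaultAzureCredential can sign in to your subscription.
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

# Placeholder: replace with your Batch account's full resource ID.
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Batch/batchAccounts/<batch-account>"
)

client = MetricsQueryClient(DefaultAzureCredential())

# Print the name and unit of every metric the account emits.
for definition in client.list_metric_definitions(resource_id):
    print(definition.name, definition.unit)
```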
Metrics are:
- Enabled by default in each Batch account without additional configuration
- Generated every 1 minute
- Not persisted automatically, but have a 30-day rolling history. You can persist activity metrics as part of diagnostic logging.
In the Azure portal, the Overview page for the Batch account will show key node, core, and task metrics by default.
To view additional metrics for a Batch account:
1. In the Azure portal, select All services > Batch accounts, and then select the name of your Batch account.
2. Under Monitoring, select Metrics.
3. Select Add metric and then choose a metric from the dropdown list.
4. Select an Aggregation option for the metric. For count-based metrics (like "Dedicated Core Count" or "Low-Priority Node Count"), use the Avg aggregation. For event-based metrics (like "Pool Resize Complete Events"), use the Count aggregation. Avoid using the Sum aggregation, which adds up the values of all data points received over the period of the chart.
5. To add additional metrics, repeat steps 3 and 4.
You can also retrieve metrics programmatically with the Azure Monitor APIs. For an example, see Retrieve Azure Monitor metrics with .NET.
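The linked example uses .NET; as a rough equivalent, here is a minimal Python sketch built on the azure-monitor-query package. The metric name (CoreCount, the dedicated core count) and the one-hour window are illustrative assumptions; substitute any metric from the supported list.

```python
# Hedged sketch: query one Batch metric through Azure Monitor.
# Assumes the azure-identity and azure-monitor-query packages; the metric
# name and time range below are illustrative, not prescribed by Batch.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Batch/batchAccounts/<batch-account>"
)

client = MetricsQueryClient(DefaultAzureCredential())

# Average the dedicated core count over the last hour, one point per minute.
response = client.query_resource(
    resource_id,
    metric_names=["CoreCount"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),
    aggregations=[MetricAggregationType.AVERAGE],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.average)
```

The Avg aggregation used here mirrors the guidance above for count-based metrics such as Dedicated Core Count.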
Note
Metrics emitted in the last 3 minutes may still be aggregating, so values may be under-reported during this timeframe. Metric delivery is not guaranteed, and may be affected by out-of-order delivery, data loss, or duplication.
You can configure near real-time metric alerts that trigger when the value of a specified metric crosses a threshold that you assign. The alert generates a notification when the alert is "Activated" (when the threshold is crossed and the alert condition is met) as well as when it is "Resolved" (when the threshold is crossed again and the condition is no longer met).
Because metric delivery can be subject to inconsistencies such as out-of-order delivery, data loss, or duplication, we recommend avoiding alerts that trigger on a single data point. Instead, use thresholds evaluated over a period of time so that these inconsistencies are smoothed out.
For example, you might want to configure a metric alert when your low-priority core count falls to a certain level, so you can adjust the composition of your pools. For best results, set a period of 10 or more minutes, where the alert is triggered if the average low-priority core count falls below the threshold value for the entire period. This allows time for metrics to aggregate so that you get more accurate results.
To configure a metric alert in the Azure portal:
1. Select All services > Batch accounts, and then select the name of your Batch account.
2. Under Monitoring, select Alerts, then select New alert rule.
3. Select Add condition, then choose a metric.
4. Select the desired values for Chart period, Threshold, Operator, and Aggregation type.
5. Enter a Threshold value and select the Unit for the threshold. Then select Done.
6. Add an action group to the alert either by selecting an existing action group or creating a new action group.
7. In the Alert rule details section, enter an Alert rule name and Description. If you want the alert to be enabled immediately, ensure that the Enable alert rule upon creation box is checked.
8. Select Create alert rule.
For more information about creating metric alerts, see Understand how metric alerts work in Azure Monitor and Create, view, and manage metric alerts using Azure Monitor.
You can also configure a near real-time alert using the Azure Monitor REST API. For more information, see Overview of alerts in Microsoft Azure. To include job, task, or pool-specific information in your alerts, see Azure Monitor log Alerts.
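As an illustration of the non-portal route, the following hedged Python sketch creates a comparable alert rule with the azure-mgmt-monitor management package rather than calling the REST API directly. The rule name, metric name, threshold, and action group ID are placeholder assumptions.

```python
# Hedged sketch: create a metric alert rule for a Batch account.
# Assumes the azure-identity and azure-mgmt-monitor packages; all names,
# IDs, and thresholds below are placeholders for illustration only.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertAction,
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
)

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
batch_account_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    "/providers/Microsoft.Batch/batchAccounts/<batch-account>"
)

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

# Fire when the average low-priority core count stays below 8 for 10 minutes.
criteria = MetricAlertSingleResourceMultipleMetricCriteria(
    all_of=[
        MetricCriteria(
            name="LowPriorityCoreCountLow",
            metric_name="LowPriorityCoreCount",
            time_aggregation="Average",
            operator="LessThan",
            threshold=8,
        )
    ]
)

alert_rule = MetricAlertResource(
    location="global",
    description="Low-priority core count fell below the expected level.",
    severity=3,
    enabled=True,
    scopes=[batch_account_id],
    evaluation_frequency="PT1M",
    window_size="PT10M",
    criteria=criteria,
    actions=[MetricAlertAction(action_group_id="<action-group-resource-id>")],
)

client.metric_alerts.create_or_update(
    resource_group, "low-priority-core-alert", alert_rule
)
```

The 10-minute window_size reflects the earlier recommendation to evaluate over a period of time rather than on a single data point.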
Diagnostic logs contain information emitted by Azure resources that describe the operation of each resource. For Batch, you can collect the following logs:
- ServiceLog: Events emitted by the Batch service during the lifetime of an individual resource, such as a pool or task.
- AllMetrics: Metrics at the Batch account level.
You must explicitly enable diagnostic settings for each Batch account you want to monitor.
A common scenario is to select an Azure Storage account as the log destination. To store logs in Azure Storage, create the account before enabling collection of logs. If you associated a storage account with your Batch account, you can choose that account as the log destination.
Alternatively, you can:
- Stream Batch diagnostic log events to an Azure Event Hub. Event Hubs can ingest millions of events per second, which you can then transform and store using any real-time analytics provider.
- Send diagnostic logs to Azure Monitor logs, where you can analyze them or export them for analysis in Power BI or Excel.
Note
You may incur additional costs to store or process diagnostic log data with Azure services.
To create a new diagnostic setting in the Azure portal, follow the steps below.
1. In the Azure portal, select All services > Batch accounts, and then select the name of your Batch account.
2. Under Monitoring, select Diagnostic settings.
3. In Diagnostic settings, select Add diagnostic setting.
4. Enter a name for the setting.
5. Select a destination: Send to Log Analytics, Archive to a storage account, or Stream to an event hub. If you select a storage account, you can optionally select the number of days to retain data for each log. If you don't specify a number of days for retention, data is retained during the life of the storage account.
6. Select ServiceLog, AllMetrics, or both.
7. Select Save to create the diagnostic setting.
In addition to the Azure portal, you can create diagnostic settings by using a Resource Manager template, Azure PowerShell, or the Azure CLI. For more information, see Overview of Azure platform logs.
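For example, here is a minimal sketch, assuming the azure-mgmt-monitor Python package, that creates a diagnostic setting equivalent to the portal steps above, archiving both ServiceLog and AllMetrics to a storage account. The setting name and resource IDs are placeholders.

```python
# Hedged sketch: enable ServiceLog and AllMetrics collection for a Batch
# account, archiving to a storage account. Assumes the azure-identity and
# azure-mgmt-monitor packages; all IDs and names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    DiagnosticSettingsResource,
    LogSettings,
    MetricSettings,
)

subscription_id = "<subscription-id>"
batch_account_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/<resource-group>"
    "/providers/Microsoft.Batch/batchAccounts/<batch-account>"
)
storage_account_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
)

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

setting = DiagnosticSettingsResource(
    storage_account_id=storage_account_id,
    logs=[LogSettings(category="ServiceLog", enabled=True)],
    metrics=[MetricSettings(category="AllMetrics", enabled=True)],
)

client.diagnostic_settings.create_or_update(
    resource_uri=batch_account_id,
    name="my-batch-diagnostics",
    parameters=setting,
)
```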
If you archive Batch diagnostic logs in a storage account, a storage container is created in the storage account as soon as a related event occurs. Blobs are created according to the following naming pattern:
```
insights-{log category name}/resourceId=/SUBSCRIPTIONS/{subscription ID}/
RESOURCEGROUPS/{resource group name}/PROVIDERS/MICROSOFT.BATCH/
BATCHACCOUNTS/{Batch account name}/y={four-digit numeric year}/
m={two-digit numeric month}/d={two-digit numeric day}/
h={two-digit 24-hour clock hour}/m=00/PT1H.json
```
For example:
```
insights-metrics-pt1m/resourceId=/SUBSCRIPTIONS/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/
RESOURCEGROUPS/MYRESOURCEGROUP/PROVIDERS/MICROSOFT.BATCH/
BATCHACCOUNTS/MYBATCHACCOUNT/y=2018/m=03/d=05/h=22/m=00/PT1H.json
```
Each `PT1H.json` blob file contains JSON-formatted events that occurred within the hour specified in the blob URL (for example, `h=12`). During the present hour, events are appended to the `PT1H.json` file as they occur. The minute value (`m=00`) is always `00`, since diagnostic log events are broken into individual blobs per hour. (All times are in UTC.)
Below is an example of a `PoolResizeCompleteEvent` entry in a `PT1H.json` log file. It includes information about the current and target number of dedicated and low-priority nodes, as well as the start and end time of the operation:
{ "Tenant": "65298bc2729a4c93b11c00ad7e660501", "time": "2019-08-22T20:59:13.5698778Z", "resourceId": "/SUBSCRIPTIONS/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/RESOURCEGROUPS/MYRESOURCEGROUP/PROVIDERS/MICROSOFT.BATCH/BATCHACCOUNTS/MYBATCHACCOUNT/", "category": "ServiceLog", "operationName": "PoolResizeCompleteEvent", "operationVersion": "2017-06-01", "properties": {"id":"MYPOOLID","nodeDeallocationOption":"Requeue","currentDedicatedNodes":10,"targetDedicatedNodes":100,"currentLowPriorityNodes":0,"targetLowPriorityNodes":0,"enableAutoScale":false,"isAutoPool":false,"startTime":"2019-08-22 20:50:59.522","endTime":"2019-08-22 20:59:12.489","resultCode":"Success","resultMessage":"The operation succeeded"}}
To access the logs in your storage account programmatically, use the Storage APIs.
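For example, the following minimal sketch uses the azure-storage-blob Python package to list and download the archived blobs. The container name and connection string are placeholders; check your storage account for the actual insights-* container that Azure Monitor created.

```python
# Hedged sketch: enumerate and download archived Batch log blobs.
# Assumes the azure-storage-blob package; the connection string and
# container name are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("<insights-log-container-name>")

# Hourly PT1H.json blobs sit under the resourceId=/... path shown above.
for blob in container.list_blobs(name_starts_with="resourceId=/SUBSCRIPTIONS/"):
    contents = container.download_blob(blob.name).readall()
    print(blob.name, len(contents), "bytes")
```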
Azure Batch service logs contain events emitted by the Batch service during the lifetime of an individual Batch resource, such as a pool or task. Each event emitted by Batch is logged in JSON format. For example, this is the body of a sample pool create event:
```json
{
  "id": "myPool1",
  "displayName": "Production Pool",
  "vmSize": "Standard_F1s",
  "imageType": "VirtualMachineConfiguration",
  "cloudServiceConfiguration": {
    "osFamily": "3",
    "targetOsVersion": "*"
  },
  "networkConfiguration": {
    "subnetId": " "
  },
  "virtualMachineConfiguration": {
    "imageReference": {
      "publisher": " ",
      "offer": " ",
      "sku": " ",
      "version": " "
    },
    "nodeAgentId": " "
  },
  "resizeTimeout": "300000",
  "targetDedicatedNodes": 2,
  "targetLowPriorityNodes": 2,
  "taskSlotsPerNode": 1,
  "vmFillType": "Spread",
  "enableAutoScale": false,
  "enableInterNodeCommunication": false,
  "isAutoPool": false
}
```
Service log events emitted by the Batch service include the following:
- Pool create
- Pool delete start
- Pool delete complete
- Pool resize start
- Pool resize complete
- Pool autoscale
- Task start
- Task complete
- Task fail
- Task schedule fail
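Once you've downloaded an archived PT1H.json blob, a little client-side filtering can isolate any one of these event types. The helper below is a hypothetical sketch; it assumes one JSON event per line, matching the sample entry shown earlier.

```python
# Hypothetical sketch: filter downloaded PT1H.json contents by event type.
# Assumes one JSON-formatted event per line, as in the sample entry above.
import json

def events_by_operation(pt1h_contents: str, operation_name: str):
    """Yield events whose operationName matches, such as 'PoolResizeCompleteEvent'."""
    for line in pt1h_contents.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("operationName") == operation_name:
            yield event

# Example usage with contents downloaded from a storage blob:
# for event in events_by_operation(contents, "PoolResizeCompleteEvent"):
#     props = event["properties"]
#     print(props["startTime"], props["endTime"], props["resultCode"])
```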
- Learn about the Batch APIs and tools available for building Batch solutions.
- Learn more about monitoring Batch solutions.