New Project Flash Update: Advancing Azure Virtual Machine availability monitoring | Azure Blog and Updates



“Earlier this year, we introduced Project Flash in the Advancing Reliability blog series, to reaffirm our commitment to empowering Azure customers in monitoring virtual machine (VM) availability in a robust and comprehensive manner. Today, we’re excited to share the progress we’ve made since then in developing holistic monitoring offerings to meet customers’ distinct needs. I’ve asked Senior Technical Program Manager, Pujitha Desiraju, from the Azure Core Production Quality Engineering team to share the latest investments as part of Project Flash, to deliver the best monitoring experience for customers.”—Mark Russinovich, CTO, Azure.


Flash, as the project is internally known, is a collection of efforts across Azure Engineering that aims to evolve Azure's virtual machine (VM) availability monitoring ecosystem into a centralized, holistic, and intelligible solution that customers can rely on to meet their specific observability needs. As part of this multi-year endeavor, we're excited to announce the:

  • General availability of VM availability information in Azure Resource Graph for efficient and at-scale monitoring, convenient for detailed downtime investigations and impact assessment.
  • Public preview of a VM availability metric in Azure Monitor for quick debugging, trend analysis of VM availability over time, and setting up threshold-based alerts on scenarios that impact workload performance.
  • Private preview of VM availability status change events via Azure Event Grid for instantaneous notifications on critical changes in VM availability, to quickly trigger remediation actions that prevent end-user impact.

We remain committed to maintaining data consistency and similarly rigorous quality standards across all the monitoring solutions that are part of Flash, including existing solutions like Resource Health or Activity Log, so we deliver a consistent and cohesive experience to customers.

VM availability information in Azure Resource Graph for at-scale analysis

In addition to the already flowing VM availability states, we recently published VM health annotations to Azure Resource Graph (ARG) for detailed failure attribution and downtime analysis, along with enabling a 14-day change tracking mechanism to trace historical changes in VM availability for quick debugging. With these new additions, we're excited to announce the general availability of VM availability information in the HealthResources dataset in ARG! With this offering, users can:

  • Efficiently query the latest snapshot of VM availability across all Azure subscriptions at once and at low latencies for periodic and fleetwide monitoring.
  • Accurately assess the impact to fleetwide business SLAs and quickly trigger decisive mitigation actions, in response to disruptions and the type of failure signature.
  • Set up custom dashboards to supervise the comprehensive health of applications by joining VM availability information with additional resource metadata present in ARG.
  • Track relevant changes in VM availability across a rolling 14-day window, by using the change-tracking mechanism for conducting detailed investigations.

Getting started

Users can query ARG via PowerShell, REST API, Azure CLI, or even the Azure Portal. The following steps detail how the data can be accessed from the Azure Portal.

  1. Once on the Azure Portal, navigate to Resource Graph Explorer, which will look like the image below:

Portal view of Azure Resource Graph displaying the list of datasets including the HealthResources table, along with a query window for Kusto queries to fetch results

Figure 1: Azure Resource Graph Explorer landing page on Azure Portal.

  2. Select the Table tab and (single) click on the HealthResources table to retrieve the latest snapshot of VM availability information (availability state and health annotations).

Portal view of Azure Resource Graph displaying both VM availability states and annotations across all resources at once in the results window, along with showcasing the 2 event types in the HealthResources table

Figure 2: Azure Resource Graph Explorer window depicting the latest VM availability states and VM health annotations in the HealthResources table.
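The portal steps above have a programmatic counterpart through the REST API. The following is a minimal sketch of how a request body for such a query might be assembled. The endpoint URL, the API version, and the helper name build_availability_query are assumptions made for illustration, and the Kusto text assumes the availability properties are named targetResourceId, availabilityState, and occurredTime. No request is sent; authentication and the HTTP call are left out.

```python
import json

# Assumed endpoint and API version for the Resource Graph "resources" operation.
ARG_URL = (
    "https://management.azure.com/providers/"
    "Microsoft.ResourceGraph/resources?api-version=2021-03-01"
)

# The three availability states described in this post.
VALID_STATES = {"Available", "Unavailable", "Unknown"}


def build_availability_query(subscription_ids, state=None):
    """Build a request body that queries VM availability in HealthResources.

    `state` optionally narrows the result to one availability state.
    """
    if state is not None and state not in VALID_STATES:
        raise ValueError("unknown availability state: {!r}".format(state))

    query = (
        "healthresources"
        " | where type =~ 'microsoft.resourcehealth/availabilitystatuses'"
    )
    if state is not None:
        query += " | where properties.availabilityState =~ '{}'".format(state)
    query += (
        " | project targetResourceId = tostring(properties.targetResourceId),"
        " availabilityState = tostring(properties.availabilityState),"
        " occurredTime = todatetime(properties.occurredTime)"
    )
    return {"subscriptions": list(subscription_ids), "query": query}


# Placeholder subscription ID, kept as in the samples of this post.
body = build_availability_query(["<subscriptionId>"], state="Unavailable")
print("POST", ARG_URL)
print(json.dumps(body, indent=2))
```

Restricting the state argument to the three known values keeps arbitrary text out of the query string, which matters if the value ever comes from user input.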

There will be two types of events populated in the HealthResources table:

Portal view of the left-hand pane in Azure Resource Graph displaying the 2 types of events within the HealthResources table along with the type of all fields embedded within each type
 
Figure 3: Snapshot of the types of events present in the HealthResources table, as shown in Resource Graph Explorer on the Azure Portal.

The first event type, microsoft.resourcehealth/availabilitystatuses, denotes the latest availability status of a VM, based on the health checks performed by the underlying Azure platform. Below are the availability states we currently emit for VMs:

  • Available: The VM is up and running as expected.
  • Unavailable: We've detected disruptions to the normal functioning of the VM, and therefore applications will not run as expected.
  • Unknown: The platform is unable to accurately detect the health of the VM. Users can usually check back in a few minutes for an updated state.

To poll the latest VM availability state, refer to the properties field, which contains the details below:

Sample

{
       "targetResourceType": "Microsoft.Compute/virtualMachines",
       "previousAvailabilityState": "Available",
       "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
       "occurredTime": "2022-10-11T11:13:59.9570000Z",
       "availabilityState": "Unavailable"
}

Property descriptions

| Field | Description | Corresponding RHC field |
|---|---|---|
| targetResourceType | Type of resource for which health data is flowing | resourceType |
| targetResourceId | Resource ID | resourceId |
| occurredTime | Timestamp when the latest availability state is emitted by the platform | eventTimestamp |
| previousAvailabilityState | Previous availability state of the VM | previousHealthStatus |
| availabilityState | Current availability state of the VM | currentHealthStatus |
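As an illustration of how the properties payload might be consumed once it has been retrieved, here is a small sketch that detects a transition into the Unavailable state. It assumes the field names targetResourceId, previousAvailabilityState, availabilityState, and occurredTime; the helper names became_unavailable and describe_transition are hypothetical.

```python
import json

# Sample properties payload, assuming the field names listed in the lead-in.
SAMPLE = """
{
  "targetResourceType": "Microsoft.Compute/virtualMachines",
  "previousAvailabilityState": "Available",
  "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
  "occurredTime": "2022-10-11T11:13:59.9570000Z",
  "availabilityState": "Unavailable"
}
"""


def became_unavailable(properties):
    """True when the VM has just moved into Unavailable from another state."""
    return (
        properties.get("availabilityState") == "Unavailable"
        and properties.get("previousAvailabilityState") != "Unavailable"
    )


def describe_transition(properties):
    """One-line, human-readable description of the state change."""
    # The VM name is the last segment of the resource ID.
    vm_name = properties["targetResourceId"].rsplit("/", 1)[-1]
    return "{}: {} -> {} at {}".format(
        vm_name,
        properties.get("previousAvailabilityState"),
        properties.get("availabilityState"),
        properties.get("occurredTime"),
    )


properties = json.loads(SAMPLE)
if became_unavailable(properties):
    print(describe_transition(properties))
```

Comparing against the previous state avoids flagging a VM repeatedly while it remains Unavailable across successive polls.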

Refer to this doc for a list of starter queries to further explore this data.

The second event type, microsoft.resourcehealth/resourceannotations, contextualizes any changes to VM availability by detailing the failure attributes users need to investigate and mitigate the disruption. See the full list of VM health annotations emitted by the platform.
These annotations can be broadly classified into three buckets:

  • Downtime Annotations: These annotations are emitted when the platform detects VM availability transitioning to Unavailable (for example, during unexpected host crashes or rebootful repair operations).
  • Informational Annotations: These annotations are emitted during control plane activities with no impact to VM availability (such as VM allocation, Stop, Delete, or Start). Usually, no additional customer action is required in response.
  • Degraded Annotations: These annotations are emitted when VM availability is detected to be at risk (for example, when failure prediction models predict a degraded hardware component that can cause the VM to reboot at any given time). We strongly urge users to redeploy by the deadline specified in the annotation message, to avoid any unanticipated loss of data or downtime.

To poll the associated VM health annotations for a resource, if any, refer to the properties field, which contains the following details:

Sample

{
      "targetResourceType": "Microsoft.Compute/virtualMachines",
      "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
      "annotationName": "VirtualMachineHostRebootedForRepair",
      "occurredTime": "2022-09-25T20:21:37.5280000Z",
      "category": "Unplanned",
      "summary": "We're sorry, your virtual machine isn't available because of an unexpected failure on the host server. Azure has begun the auto-recovery process and is currently rebooting the host server. No additional action is required from you at this time. The virtual machine will be back online after the reboot completes.",
      "context": "Platform Initiated",
      "reason": "Unexpected host failure"
}

Property descriptions

| Field | Description | Corresponding RHC field |
|---|---|---|
| targetResourceType | Type of resource for which health data is flowing | resourceType |
| targetResourceId | Resource ID | resourceId |
| occurredTime | Timestamp when the latest availability state is emitted by the platform | eventTimestamp |
| annotationName | Name of the annotation emitted | eventName |
| reason | Brief overview of the availability impact observed by the customer | title |
| category | Denotes whether the platform activity triggering the annotation was planned maintenance or unplanned repair. This field is not applicable to customer- or VM-initiated events. Possible values: Planned, Unplanned, Not Applicable, Null | category |
| context | Denotes whether the activity triggering the annotation was due to an authorized user or process (customer-initiated), due to the Azure platform (platform-initiated), or due to activity in the guest OS that resulted in availability impact (VM-initiated). Possible values: Platform-initiated, User-initiated, VM-initiated, Not Applicable, Null | context |
| summary | Statement detailing the cause of the annotation emission, along with remediation steps that can be taken by users | summary |

Refer to this doc for a list of starter queries to further explore this data.

Looking ahead to 2023, we have several enhancements planned for the annotation metadata that is surfaced in the HealthResources dataset. These enrichments will give users access to richer failure attributes to decisively prepare a response to a disruption. In parallel, we aim to extend the duration of the historical lookback to a minimum of 30 days so users can comprehensively track past changes in VM availability.

Public preview of the VM availability metric in Azure Monitor

We're excited to share that the out-of-box VM availability metric is now available as a preview for all users! This metric displays the trend of VM availability over time, so users can:

  • Set up threshold-based metric alerts on dipping VM availability to quickly trigger appropriate mitigation actions.
  • Correlate the VM availability metric with existing platform metrics like memory, network, or disk for deeper insights into concerning changes that impact the overall performance of workloads.
  • Easily interact with and chart metric data across any relevant time window on Metrics Explorer, for quick and easy debugging.
  • Route metrics to downstream tooling like Grafana dashboards, for constructing custom visualizations and dashboards.

Getting started

Users can either consume the metric programmatically via the Azure Monitor REST API or directly from the Azure Portal. The following steps highlight metric consumption from the Azure Portal.

  1. Once on the Azure Portal, navigate to the VM overview blade. The new metric will be displayed as VM Availability (Preview), along with other platform metrics under the Monitoring tab.

Portal view of the VM overview page, with the newly added VM availability metric highlighted

Figure 4: View the newly added VM availability metric on the VM overview page on Azure Portal.

  2. Select (single click) the VM availability metric chart on the overview page, to navigate to Metrics Explorer for further analysis.

Portal view of VM availability metric on Metric Explorer, displaying availability as a trend in the form of a blue line, over time with occasional dips

Figure 5: View the newly added VM availability metric on Metrics Explorer on Azure Portal.

Metric description:

  • Display Name: VM Availability (preview)
  • Metric Values: 1 during expected behavior, which corresponds to the VM in the Available state. 0 when the VM is impacted by rebootful disruptions, which corresponds to the VM in the Unavailable state. NULL (shown as a dotted or dashed line on charts) when the Azure service that emits the metric is down or is unaware of the exact status of the VM, which corresponds to the VM in the Unknown state.
  • Aggregation: The default aggregation of the metric is Average, for prioritized investigations based on the extent of downtime incurred. The other aggregations available are Min, to immediately pinpoint all the times the VM was unavailable, and Max, to immediately pinpoint all the instances where the VM was available. Refer here for more details on chart range, granularity, and data aggregation.
  • Data Retention: Data for the VM availability metric will be stored for 93 days to assist in trend analysis and historical lookback.
  • Pricing: Please refer to the Pricing breakdown, specifically the "Metrics" and "Alert Rules" sections.
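To make the metric values and aggregations concrete, here is a small worked sketch over a hypothetical series of samples, using 1 for Available, 0 for Unavailable, and None for NULL. It assumes NULL samples are simply left out of the aggregation, which is a simplification and may not match how Azure Monitor treats gaps. The function names and the alert threshold are illustrative, not part of any Azure API.

```python
def aggregate(samples, how="average"):
    """Aggregate availability samples (1, 0, or None) over a time window.

    None stands for NULL (Unknown state) and is skipped, which is an assumption.
    Returns None when the window has no known samples.
    """
    known = [s for s in samples if s is not None]
    if not known:
        return None
    if how == "average":
        return sum(known) / len(known)
    if how == "min":
        return min(known)
    if how == "max":
        return max(known)
    raise ValueError("unsupported aggregation: {!r}".format(how))


def should_alert(samples, threshold=1.0):
    """Threshold-style check: fire when average availability dips below the threshold."""
    average = aggregate(samples, "average")
    return average is not None and average < threshold


# Hypothetical window: one unavailable sample and one unknown sample.
window = [1, 1, 0, None, 1]

print("average:", aggregate(window, "average"))  # 3 of 4 known samples are 1
print("min:", aggregate(window, "min"))          # 0 means the VM was unavailable at some point
print("max:", aggregate(window, "max"))          # 1 means the VM was available at some point
print("alert:", should_alert(window))
```

In this window Average gives the share of known samples in which the VM was available, while Min answers the simpler question of whether any downtime occurred at all.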

Looking ahead to 2023, we plan to include impact details (user versus platform initiated, planned versus unplanned) as dimensions of the metric, so users are well equipped to interpret dips and set up much more targeted metric alerts. With the emission of dimensions in 2023, we also anticipate transitioning the offering to general availability status.

Introducing instantaneous notifications on changes in VM availability via Event Grid

We're thrilled to introduce our latest monitoring offering: the private preview of VM availability status change events in an Event Grid System Topic, which uses the low-latency technology of Azure Event Grid! Users can now subscribe to the system topic and route these events to their downstream tooling using any of the available event handlers (such as Azure Functions, Logic Apps, Event Hubs, and Storage queues). This solution uses an event-driven architecture to communicate scoped changes in VM availability to end users in less than 5 seconds from the occurrence of the disruption. This empowers users to take instantaneous mitigation actions to prevent end-user impact.

As part of the private preview, we'll emit events scoped to changes in VM availability states, with the sample schema below:

Sample

{
      "id": "4c70abbc-4aeb-4cac-b0eb-ccf06c7cd102",
      "topic": "/subscriptions/<subscriptionId>",
      "subject": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>/providers/Microsoft.ResourceHealth/AvailabilityStatuses/current",
      "data": {
         "resourceInfo": {
            "id": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>/providers/Microsoft.ResourceHealth/AvailabilityStatuses/current",
            "properties": {
               "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
               "targetResourceType": "Microsoft.Compute/virtualMachines",
               "occurredTime": "2022-09-25T20:21:37.5280000Z",
               "previousAvailabilityState": "Available",
               "availabilityState": "Unavailable"
            }
         },
         "apiVersion": "2020-09-01"
      },
      "eventType": "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusesChanged",
      "dataVersion": "1",
      "metadataVersion": "1",
      "eventTime": "2022-09-25T20:21:37.5280000Z"
}

The properties field is fully consistent with the microsoft.resourcehealth/availabilitystatuses event in ARG. The Event Grid solution offers near-real-time alerting capabilities on the data present in ARG.
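As a rough sketch of what a downstream handler might do with such an event, the code below pulls the availability transition out of the payload and decides whether remediation should be triggered. It assumes the Event Grid field names eventType, data, and resourceInfo, along with the properties names targetResourceId, previousAvailabilityState, and availabilityState. The function names are hypothetical, and since the offering is in private preview the schema may change.

```python
# Event type string as it appears in the sample schema of this post.
AVAILABILITY_CHANGED = (
    "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusesChanged"
)


def extract_transition(event):
    """Return the availability transition carried by an event, or None for other event types."""
    if event.get("eventType") != AVAILABILITY_CHANGED:
        return None
    properties = event["data"]["resourceInfo"]["properties"]
    return {
        "vm": properties["targetResourceId"],
        "previous": properties.get("previousAvailabilityState"),
        "current": properties.get("availabilityState"),
        "occurredTime": properties.get("occurredTime"),
    }


def needs_remediation(event):
    """True when the event reports a VM moving from Available to Unavailable."""
    transition = extract_transition(event)
    return (
        transition is not None
        and transition["previous"] == "Available"
        and transition["current"] == "Unavailable"
    )


# Trimmed version of the sample event, keeping only the fields used here.
sample_event = {
    "eventType": AVAILABILITY_CHANGED,
    "data": {
        "resourceInfo": {
            "properties": {
                "targetResourceId": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>",
                "occurredTime": "2022-09-25T20:21:37.5280000Z",
                "previousAvailabilityState": "Available",
                "availabilityState": "Unavailable",
            }
        }
    },
}

if needs_remediation(sample_event):
    print("Trigger remediation for", extract_transition(sample_event)["vm"])
```

In practice this logic would sit inside one of the event handlers mentioned above, such as an Azure Function subscribed to the system topic.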

We're currently releasing the private preview to a small subset of users to rigorously test the solution and collect iterative feedback. This approach enables us to publicly preview and even announce the general availability of a high-quality and well-rounded offering in 2023. As we look toward the general availability of this solution, users can expect to receive events when annotations and automated RCAs are emitted by the platform.

What's next?

We'll be heavily focused on strengthening our monitoring platform to continuously improve the experience for customers based on ongoing feedback collected from the community (such as aggregated VMSS health showing degraded inaccurately, VM unavailable for 15 minutes, and missing VM downtimes in Activity Log). By streamlining our internal message pipeline, we aim to not only improve data quality, but also maintain data consistency across our offerings and expand the scope of failure scenarios surfaced.

Introducing Degraded VM Availability state

In light of our upcoming efforts to centralize our monitoring architecture, we'll be well-positioned to introduce a Degraded VM availability state for virtual machines in 2023. This state will be extremely useful in setting up targeted alerts on predicted hardware failure scenarios where there is imminent risk to VM availability. This state will also allow users to efficiently track cases of degraded hardware or software failures that require a redeploy, which today don't cause a corresponding change in VM availability. We will also aim to emit reminder annotations for as long as the VM is marked Degraded, to prevent users from overlooking the request to redeploy.

Expand scope of failure attribution to include application freeze events

In 2023, we plan to expand our scope of failure attribution and emission to also include application freeze events that may be caused by network agent updates, host OS updates lasting thirty seconds, and freeze-causing repair operations. This will ensure users have enhanced visibility into freeze impact and will be applied across our monitoring offerings, including Resource Health and Activity Logs.

Learn More

Please stay tuned for more announcements on the Flash initiative by following updates to the Advancing Reliability series!
