New Project Flash Update: Advancing Azure Virtual Machine availability monitoring | Azure Blog and Updates


“Earlier this year, we introduced Project Flash in the Advancing Reliability blog series, to reaffirm our commitment to empowering Azure customers in monitoring virtual machine (VM) availability in a robust and comprehensive manner. Today, we’re excited to share the progress we’ve made since then in developing holistic monitoring offerings to meet customers’ distinct needs. I’ve asked Senior Technical Program Manager, Pujitha Desiraju, from the Azure Core Production Quality Engineering team to share the latest investments as part of Project Flash, to deliver the best monitoring experience for customers.” - Mark Russinovich, CTO, Azure.

Flash, as the project is internally known, is a collection of efforts across Azure Engineering that aims to evolve Azure’s virtual machine (VM) availability monitoring ecosystem into a centralized, holistic, and intelligible solution customers can rely on to meet their specific observability needs. As part of this multi-year endeavor, we’re excited to announce the:

General availability of VM availability information in Azure Resource Graph for efficient and at-scale monitoring, convenient for detailed downtime investigations and impact assessment.

Public preview of a VM availability metric in Azure Monitor for quick debugging, trend analysis of VM availability over time, and setting up threshold-based alerts on scenarios that impact workload performance.

Private preview of VM availability status change events via Azure Event Grid for instantaneous notifications on critical changes in VM availability, to quickly trigger remediation actions to prevent end-user impact.

We remain committed to maintaining data consistency and similarly rigorous quality standards across all the monitoring solutions that are part of Flash, including existing solutions like Resource Health or Activity Log, so we deliver a consistent and cohesive experience to customers.

VM availability information in Azure Resource Graph for at-scale analysis

In addition to already flowing VM availability states, we recently published VM health annotations to Azure Resource Graph (ARG) for detailed failure attribution and downtime analysis, along with enabling a 14-day change tracking mechanism to trace historical changes in VM availability for quick debugging. With these new additions, we’re excited to announce the general availability of VM availability information in the HealthResources dataset in ARG! With this offering users can:

Efficiently query the latest snapshot of VM availability across all Azure subscriptions at once and at low latencies for periodic and fleetwide monitoring.

Accurately assess the impact to fleetwide business SLAs and quickly trigger decisive mitigation actions, in response to disruptions and type of failure signature.

Set up custom dashboards to supervise the comprehensive health of applications by joining VM availability information with additional resource metadata present in ARG.

Track relevant changes in VM availability across a rolling 14-day window, by using the change-tracking mechanism for conducting detailed investigations.

Getting started

Users can query ARG via PowerShell, REST API, Azure CLI, or even the Azure Portal. The following steps detail how data can be accessed from the Azure Portal.

Once on the Azure Portal, navigate to Resource Graph Explorer, which will look like the image below:

Portal view of Azure Resource Graph displaying the list of datasets including the HealthResources table, along with a query window for Kusto queries to fetch results.

Figure 1: Azure Resource Graph Explorer landing page on Azure Portal.

Select the Table tab and (single) click on the HealthResources table to retrieve the latest snapshot of VM availability information (availability state and health annotations).

Portal view of Azure Resource Graph displaying both VM availability states and annotations across all resources at once in the results window, along with showcasing the 2 event types in the HealthResources table.

Figure 2: Azure Resource Graph Explorer window depicting the latest VM availability states and VM health annotations in the HealthResources table.

There will be two types of events populated in the HealthResources table:

Portal view of the left-hand pane in Azure Resource Graph displaying the 2 types of events within the HealthResources table along with the type of all fields embedded within each type.

Figure 3: Snapshot of the type of events present in the HealthResources table, as shown in Resource Graph Explorer on the Azure Portal.

The microsoft.resourcehealth/availabilitystatuses event denotes the latest availability status of a VM, based on the health checks performed by the underlying Azure platform. Below are the availability states we currently emit for VMs:

Available: The VM is up and running as expected.

Unavailable: We’ve detected disruptions to the normal functioning of the VM and therefore applications will not run as expected.

Unknown: The platform is unable to accurately detect the health of the VM. Users can usually check back in a few minutes for an updated state.

To poll the latest VM availability state, refer to the properties field, which contains the details below:

Property descriptions

| Field | Description | Corresponding RHC field |
| --- | --- | --- |
| targetResourceType | Type of resource for which health data is flowing | resourceType |
| targetResourceId | Resource Id | resourceId |
| occurredTime | Timestamp when the latest availability state is emitted by the platform | eventTimestamp |
| previousAvailabilityState | Previous availability state of the VM | previousHealthStatus |
| availabilityState | Current availability state of the VM | currentHealthStatus |

Refer to this document for a list of starter queries to further explore this data.
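As a minimal sketch of polling the latest availability states, the Python snippet below holds a Kusto query against the HealthResources table and tallies the availabilityState values from rows that have already been fetched. The query uses the event type and property names described above. The helper name count_states, the sample rows, and the choice to count a missing state as Unknown are assumptions made for illustration, and submitting the query (via PowerShell, REST API, Azure CLI, or the Portal) is deliberately left out.

```python
from collections import Counter

# Kusto query that can be pasted into Resource Graph Explorer:
# counts VMs per availability state.
AVAILABILITY_QUERY = """
healthresources
| where type =~ 'microsoft.resourcehealth/availabilitystatuses'
| summarize count() by tostring(properties.availabilityState)
""".strip()


def count_states(rows):
    """Tally availabilityState across already-fetched HealthResources rows.

    Each row is assumed to be a dict with a 'properties' dict. Rows without
    an availabilityState are counted as 'Unknown' (an assumption of this sketch).
    """
    return Counter(
        row.get("properties", {}).get("availabilityState", "Unknown")
        for row in rows
    )


if __name__ == "__main__":
    # Hypothetical rows for illustration only.
    sample_rows = [
        {"properties": {"availabilityState": "Available"}},
        {"properties": {"availabilityState": "Unavailable"}},
        {"properties": {"availabilityState": "Available"}},
    ]
    print(AVAILABILITY_QUERY)
    print(dict(count_states(sample_rows)))
```

Keeping the tally on the client side is only useful when the raw rows are needed anyway; for fleetwide counts, the summarize clause in the query does the same work server-side.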

The microsoft.resourcehealth/resourceannotations event contextualizes any changes to VM availability, by detailing important failure attributes to help users investigate and mitigate the disruption as needed. See the full list of VM health annotations emitted by the platform.

These annotations can be broadly classified into three buckets:

Downtime Annotations: These annotations are emitted when the platform detects VM availability transitioning to Unavailable (for example, during unexpected host crashes or rebootful repair operations).

Informational Annotations: These annotations are emitted during control plane activities with no impact to VM availability (such as VM allocation, stop, delete, or start). Usually, no additional customer action is required in response.

Degraded Annotations: These annotations are emitted when VM availability is detected to be at risk (for example, when failure prediction models predict a degraded component that can cause the VM to reboot at any given time). We strongly urge users to redeploy by the deadline specified in the annotation message, to avoid any unanticipated loss of data or downtime.

To poll the associated VM health annotations for a resource, if any, refer to the properties field, which contains the following details:

Property descriptions

| Field | Description | Corresponding RHC field |
| --- | --- | --- |
| targetResourceType | Type of resource for which health data is flowing | resourceType |
| targetResourceId | Resource Id | resourceId |
| occurredTime | Timestamp when the latest availability state is emitted by the platform | eventTimestamp |
| annotationName | Name of the annotation emitted | eventName |
| reason | Brief overview of the availability impact observed by the customer | title |
| category | Denotes whether the platform activity triggering the annotation was either planned maintenance or unplanned repair. This field is not applicable to customer/VM-initiated events. Possible values: Planned, Unplanned, Not Applicable, Null | category |
| context | Denotes whether the activity triggering the annotation was due to an authorized user or process (customer-initiated), the Azure platform (platform-initiated), or activity in the guest OS that has resulted in availability impact (VM-initiated). Possible values: Platform-initiated, User-initiated, VM-initiated, Not Applicable, Null | context |
| summary | Statement detailing the cause for annotation emission, along with remediation steps that can be taken by users | summary |

Refer to this document for a list of starter queries to further explore this data.
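To illustrate how the category and context fields could drive triage, here is a small Python sketch that picks out annotations caused by unplanned, platform-initiated activity from rows that have already been fetched. The function name, the row shape (a dict with a properties dict), and the annotation names in the sample rows are hypothetical placeholders; only the field names and possible values come from the property descriptions above.

```python
def is_unplanned_platform_event(annotation):
    """Return True when an annotation reports unplanned, platform-initiated activity."""
    props = annotation.get("properties", {})
    return (
        props.get("context") == "Platform-initiated"
        and props.get("category") == "Unplanned"
    )


def triage(annotations):
    """Return (annotationName, reason) pairs for annotations worth a closer look."""
    return [
        (a["properties"].get("annotationName"), a["properties"].get("reason"))
        for a in annotations
        if is_unplanned_platform_event(a)
    ]


if __name__ == "__main__":
    # Hypothetical rows; the annotation names are placeholders, not real platform values.
    rows = [
        {"properties": {"annotationName": "ExampleDowntimeAnnotation",
                        "reason": "Example unplanned repair",
                        "category": "Unplanned",
                        "context": "Platform-initiated"}},
        {"properties": {"annotationName": "ExampleInformationalAnnotation",
                        "reason": "Example user stop",
                        "category": "Not Applicable",
                        "context": "User-initiated"}},
    ]
    for name, reason in triage(rows):
        print(name, "-", reason)
```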

Looking ahead to 2023, we have multiple enhancements planned for the annotation metadata that’s surfaced in the HealthResources dataset. These enrichments will give users access to richer failure attributes to decisively prepare a response to a disruption. In parallel, we aim to extend the duration of historical lookback to a minimum of 30 days so users can comprehensively track past changes in VM availability.

VM availability metric in Azure Monitor Preview

We’re excited to share that the out-of-box VM availability metric is now available as a public preview for all users! This metric displays the trend of VM availability over time, so users can:

Set up threshold-based metric alerts on dipping VM availability to quickly trigger appropriate mitigation actions.

Correlate the VM availability metric with existing platform metrics like memory, network, or disk for deeper insights into concerning changes that impact the overall performance of workloads.

Easily interact with and chart metric data across any relevant time window on Metrics Explorer, for quick and easy debugging.

Route metrics to downstream tooling like Grafana dashboards, for constructing custom visualizations and dashboards.

Getting started

Users can either consume the metric programmatically via the Azure Monitor REST API or directly from the Azure Portal. The following steps highlight metric consumption from the Azure Portal.

Once on the Azure Portal, navigate to the VM overview blade. The new metric will display as VM Availability (Preview), along with other platform metrics under the Monitoring tab.

Portal view of the VM overview page, with the newly added VM availability metric highlighted.

Figure 4: View the newly added VM Availability metric on the VM overview page on Azure Portal.

Select (single click) the VM availability metric chart on the overview page, to navigate to Metrics Explorer for further analysis.

Portal view of VM availability metric on Metric Explorer, displaying availability as a trend in the form of a blue line, over time with occasional dips.

Figure 5: View the newly added VM availability metric on Metrics Explorer on Azure Portal.

Metric description:

Display Name

VM Availability (preview)

Metric Values

1 during expected behavior; corresponds to VM in Available state.

0 when VM is impacted by rebootful disruptions; corresponds to VM in Unavailable state.

NULL (shows a dotted or dashed line on charts) when the Azure service that’s emitting the metric is down or is unaware of the exact status of the VM; corresponds to VM in Unknown state.

Aggregation

The default aggregation of the metric is Average, for prioritized investigations based on extent of downtime incurred.

The other aggregations available are:

Min, to immediately pinpoint all the times where the VM was unavailable.

Max, to immediately pinpoint all the instances where the VM was Available.

Refer here for more details on chart range, granularity, and data aggregation.

Data Retention

Data for the VM availability metric will be stored for 93 days to assist in trend analysis and historical lookback.

Pricing

Please refer to the Pricing breakdown, specifically in the “Metrics” and “Alert Rules” sections.
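As a rough worked example of how the three aggregations read over a time window, the Python sketch below aggregates a series of metric samples where 1 means Available, 0 means Unavailable, and None stands in for NULL. Skipping NULL samples when aggregating is an assumption of this sketch, not a documented behavior of Azure Monitor, and the function name is hypothetical.

```python
def aggregate_availability(samples):
    """Compute Average, Min, and Max over 1/0 availability samples.

    None models a NULL sample (Unknown state) and is skipped here;
    that treatment is an assumption made for illustration.
    """
    values = [s for s in samples if s is not None]
    if not values:
        return None  # nothing but NULL samples in this window
    return {
        "average": sum(values) / len(values),  # fraction of samples in Available state
        "min": min(values),                    # 0 if the VM was ever Unavailable
        "max": max(values),                    # 1 if the VM was ever Available
    }


if __name__ == "__main__":
    # Hypothetical window: three Available samples, one Unavailable, one NULL.
    print(aggregate_availability([1, 1, 0, None, 1]))
```

With these samples the average is 0.75, which is why Average suits ranking VMs by extent of downtime, while Min flags any window that contained an Unavailable sample.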

Looking ahead to 2023, we plan to include impact details (user vs platform initiated, planned vs unplanned) as dimensions to the metric, so users are well equipped to interpret dips and set up much more targeted metric alerts. With the emission of dimensions in 2023, we also anticipate transitioning the offering to general availability.

Introducing instantaneous notifications on changes in VM availability via Event Grid

We’re thrilled to introduce our latest monitoring offering: the private preview of VM availability status change events in an Event Grid System Topic, which uses the low-latency technology of Azure Event Grid! Users can now subscribe to the system topic and route these events to their downstream tooling using any of the available event handlers (such as Azure Functions, Logic Apps, Event Hubs, and Storage queues). This solution uses an event-driven architecture to communicate scoped changes in VM availability to end users in less than 5 seconds from the disruption occurrence. This empowers users to take instantaneous mitigation actions to prevent end user impact.

As part of the private preview, we’ll emit events scoped to changes in VM availability states, with the sample schema below:

Sample

{
  "id": "4c70abbc-4aeb-4cac-b0eb-ccf06c7cd102",
  "topic": "/subscriptions/<subscriptionId>",
  "subject": "/subscriptions/<subscriptionId>/resourceGroups/<ResourceGroupName>/providers/Microsoft.Compute/virtualMachines/<VMName>/providers/Microsoft.ResourceHealth/AvailabilityStatuses/current",
  "data": ,
  "eventType": "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusesChanged",
  "dataVersion": "1",
  "metadataVersion": "1",
  "eventTime": "2022-09-25T20:21:37.5280000Z"
}

The properties field is fully consistent with the microsoft.resourcehealth/availabilitystatuses event in ARG. The Event Grid solution provides near-real-time alerting capabilities on the data present in ARG.
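A downstream handler might start by checking the event type and recovering the VM resource ID from the event envelope. The Python sketch below does only that, assuming the envelope carries the standard Event Grid eventType and subject fields and that the subject ends with the Resource Health suffix shown in the sample. It does not read the data payload, whose contents are not shown here, and the function name is hypothetical.

```python
EVENT_TYPE = "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusesChanged"
HEALTH_SUFFIX = "/providers/Microsoft.ResourceHealth/AvailabilityStatuses/current"


def vm_id_from_event(event):
    """Return the VM resource ID for an availability status change event, else None.

    Only the envelope fields 'eventType' and 'subject' are inspected.
    """
    if event.get("eventType") != EVENT_TYPE:
        return None
    subject = event.get("subject", "")
    # Resource IDs are compared case-insensitively in this sketch.
    if not subject.lower().endswith(HEALTH_SUFFIX.lower()):
        return None
    return subject[: -len(HEALTH_SUFFIX)]


if __name__ == "__main__":
    # Hypothetical event with placeholder subscription, resource group, and VM names.
    event = {
        "eventType": EVENT_TYPE,
        "subject": "/subscriptions/sub1/resourceGroups/rg1/providers/"
                   "Microsoft.Compute/virtualMachines/vm1" + HEALTH_SUFFIX,
        "eventTime": "2022-09-25T20:21:37.5280000Z",
    }
    print(vm_id_from_event(event))
```

Returning None for anything that does not match keeps the handler safe to attach to a topic that may later carry other event types.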

We’re currently releasing the preview to a small subset of users to rigorously test the solution and collect iterative feedback. This approach enables us to preview and even announce the general availability of a high-quality and well-rounded offering in 2023. As we look towards the general availability of this solution, users can expect to receive events when annotations and automated RCAs are emitted by the platform.

What’s next?

We’ll be heavily focused on strengthening our monitoring platform to continuously improve the experience for customers based on ongoing feedback collected from the community (such as aggregated VMSS health showing degraded inaccurately, VM unavailable for 15 minutes, and missing VM downtimes in Activity Log). By streamlining our internal message pipeline, we aim to not only improve data quality, but also maintain data consistency across our offerings and expand the scope of failure scenarios surfaced.

Introducing Degraded VM Availability state

In light of our upcoming efforts to centralize our monitoring architecture, we’ll be well-positioned to introduce a Degraded VM availability state for virtual machines in 2023. This state will be extremely useful in setting up targeted alerts on predicted failure scenarios where there is imminent risk to VM availability. This state will also allow users to efficiently track cases of degradation or software failure that require a redeploy, which today do not cause a corresponding change in VM availability. We will also aim to emit reminder annotations for the duration of the VM being marked Degraded, to prevent users from overlooking the request to redeploy.

Expand scope of failure attribution to include application freeze events

In 2023, we plan to expand our scope of failure attribution and emission to also include application freeze events that may be triggered due to network agent updates, host OS updates lasting thirty seconds, and freeze-causing repair operations. This will ensure users have enhanced visibility into freeze impact and will be applied across our monitoring offerings, including Resource Health and Activity Logs.