AI-Powered Azure VM Downtime Investigation: Automating Root Cause Analysis with MCP and Agentic AI

Introduction:

When an Azure Virtual Machine experiences unexpected downtime, the clock starts ticking: every minute of recovery translates directly into business impact and operational cost. Traditional incident investigation relies on a Designated Responsible Individual (DRI) manually correlating logs, timestamps, repair service activity, and Kusto queries across disparate systems, a process that can add many minutes to recovery. The Azure Compute team has built an AI-powered approach that encodes the team’s best recovery knowledge into a repeatable pipeline, automating evidence gathering while keeping human engineers focused on the final decision.

Learning Objectives:

  • Understand how Model Context Protocol (MCP) and AI agents automate VM downtime investigation pipelines
  • Learn to implement automated log correlation and root cause analysis using Kusto Query Language (KQL) and Azure Monitor
  • Master the step-by-step workflow for parsing ticket metadata, querying VM health events, and building recovery timelines
  • Gain practical skills in configuring Azure Automation, Logic Apps, and MCP-based agents for incident response
  • Explore mitigation strategies for common Azure VM recovery bottlenecks and long-duration events

You Should Know:

  1. The AI-Powered Investigation Pipeline: From Alert to Structured Report

The Azure Compute team’s solution mirrors the way experienced engineers triage incidents—but runs the entire pipeline automatically. When an alert fires for a recovery event taking longer than expected, the system kicks off without requiring a DRI to open a ticket and manually correlate data. The investigation flow follows a structured sequence:

  • Parse ticket metadata: Extract VM ID, region, timestamp, and alert context from the incoming incident ticket
  • Query VM health events: Pull health status changes, heartbeat failures, and guest OS metrics from Azure Resource Health and Azure Monitor
  • Check host health: Assess underlying physical host status, including network connectivity, storage latency, and hypervisor events
  • Correlate repair logs: Cross-reference automated recovery actions (reboot, re-deploy, migration) with their timestamps and outcomes
  • Build the recovery timeline: Construct a chronological sequence of events from alert to resolution
  • Identify the bottleneck and likely cause: Pinpoint which stage consumed the most time and suggest probable root causes

This agentic approach encodes the team’s best recovery knowledge into a repeatable pipeline, letting the system handle the most time-consuming part (evidence gathering) while keeping the human focused on the final call.
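
The sequence above can be sketched as an ordered list of stages that each enrich a shared context. This is a minimal Python illustration, not the Azure Compute team's implementation: the ticket shape, the stage names, and the `stub_stage` placeholders are all assumptions made for the example.

```python
from typing import Callable

# A stage takes the shared investigation context and returns it enriched.
Stage = Callable[[dict], dict]


def parse_ticket(ctx: dict) -> dict:
    # Hypothetical ticket shape: a flat dict with vm_id, region, and timestamp.
    ticket = ctx["ticket"]
    ctx.update(
        vm_id=ticket["vm_id"],
        region=ticket["region"],
        alert_time=ticket["timestamp"],
    )
    return ctx


def stub_stage(name: str) -> Stage:
    # Stand-in for a stage that would query a real data source.
    def stage(ctx: dict) -> dict:
        ctx.setdefault("completed", []).append(name)
        return ctx
    return stage


# Stage order mirrors the sequence listed above.
PIPELINE = [
    parse_ticket,
    stub_stage("query_vm_health"),
    stub_stage("check_host_health"),
    stub_stage("correlate_repair_logs"),
    stub_stage("build_timeline"),
    stub_stage("identify_bottleneck"),
]


def run_pipeline(ticket: dict, stages=PIPELINE) -> dict:
    ctx = {"ticket": ticket}
    for stage in stages:
        ctx = stage(ctx)
    return ctx
```

Keeping every stage behind the same signature means a stub can be swapped for a real query without touching the runner.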

Step-by-Step Guide: Manual Correlation vs. Automated Pipeline

To understand what the AI automates, here’s how a DRI would manually investigate a prolonged VM recovery event:

Step 1: Gather Incident Context

# Azure CLI: get VM details and current power state (--show-details is required for powerState)
az vm show --name <VM_NAME> --resource-group <RG> --show-details --query "{id:id, location:location, powerState:powerState}"
# List instance-view statuses other than "running"
az vm get-instance-view --name <VM_NAME> --resource-group <RG> --query "instanceView.statuses[?code!='PowerState/running']"
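
If the ticket carries a full VM resource ID rather than a separate name and resource group, it has to be split before commands like these can be run. A minimal Python sketch, assuming the standard Azure Resource Manager ID layout; `parse_vm_resource_id` is a hypothetical helper written for this example, not part of any Azure SDK.

```python
import re

# Standard ARM layout for a VM:
# /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<name>
_VM_ID = re.compile(
    r"^/subscriptions/(?P<subscription>[^/]+)"
    r"/resourceGroups/(?P<resource_group>[^/]+)"
    r"/providers/Microsoft\.Compute/virtualMachines/(?P<vm_name>[^/]+)$",
    re.IGNORECASE,  # ARM resource IDs are case-insensitive
)


def parse_vm_resource_id(resource_id: str) -> dict:
    """Split a VM resource ID into subscription, resource group, and VM name."""
    match = _VM_ID.match(resource_id.strip())
    if match is None:
        raise ValueError(f"not a VM resource ID: {resource_id!r}")
    return match.groupdict()
```

Rejecting anything that does not match keeps a malformed ticket from silently producing an empty `--name` or `--resource-group` value.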

Step 2: Query VM Health History (Kusto Query Language – KQL)

// Resource Health events for the VM (assumes the activity log is exported to this Log Analytics workspace)
AzureActivity
| where CategoryValue == "ResourceHealth"
| where _ResourceId endswith "/virtualMachines/<VM_NAME>"
| project TimeGenerated, Level, OperationNameValue, ActivityStatusValue, Properties
| sort by TimeGenerated desc

Step 3: Check Host and Infrastructure Metrics

// Check for host-related issues
AzureMetrics
| where Resource == "<VM_NAME>"
| where MetricName in ("Network In Total", "Network Out Total", "OS Disk Latency")
| where TimeGenerated > ago(1h)
| summarize avg(Average) by bin(TimeGenerated, 5m), MetricName

Step 4: Correlate Repair Actions

# Check the VM's current provisioning state
az resource list --resource-group <RG> --resource-type Microsoft.Compute/virtualMachines --query "[?name=='<VM_NAME>'].{provisioningState:provisioningState}"
# Check the activity log for repair events
az monitor activity-log list --resource-group <RG> --query "[?contains(operationName.value, 'repair')]"

Step 5: Build Timeline and Identify Bottleneck

Manually compare timestamps from health events, repair logs, and metrics to identify gaps—for example, a 10-minute delay between host failure detection and migration initiation might indicate a bottleneck in the orchestration layer.
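
The timestamp comparison amounts to sorting events and finding the largest gap between neighbours. The Python sketch below shows one way to do that; the event labels and times in `EVENTS` are made-up sample data chosen to reproduce the 10-minute example, and both function names are hypothetical.

```python
from datetime import datetime, timedelta

# Illustrative sample data only: (ISO 8601 timestamp, event label).
EVENTS = [
    ("2024-05-01T10:11:00+00:00", "migration initiated"),
    ("2024-05-01T10:00:00+00:00", "alert fired"),
    ("2024-05-01T10:14:00+00:00", "VM running"),
    ("2024-05-01T10:01:00+00:00", "host failure detected"),
]


def build_timeline(events):
    """Parse timestamps and return (datetime, label) pairs in chronological order."""
    parsed = [(datetime.fromisoformat(ts), label) for ts, label in events]
    return sorted(parsed, key=lambda item: item[0])


def find_bottleneck(timeline):
    """Return (gap, from_label, to_label) for the longest gap between consecutive events."""
    if len(timeline) < 2:
        raise ValueError("need at least two events to measure a gap")
    gaps = [
        (later[0] - earlier[0], earlier[1], later[1])
        for earlier, later in zip(timeline, timeline[1:])
    ]
    return max(gaps, key=lambda gap: gap[0])
```

With the sample data, the longest gap is the 10 minutes between "host failure detected" and "migration initiated", which is the stage an engineer would examine first.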

Step 6: Generate Report

Document findings, correlate evidence, and propose mitigation. The AI system automates Steps 1–5, drafts this report, and attaches it to the incident, cutting investigation time from minutes to seconds.
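
A structured report can be as simple as a typed record serialized to JSON. The following Python sketch is an assumption about what such a record might contain; the `InvestigationReport` class and its field names are hypothetical and do not describe the team's actual report schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class InvestigationReport:
    # All field names are illustrative assumptions.
    incident_id: str
    vm_id: str
    timeline: list = field(default_factory=list)   # ordered event descriptions
    bottleneck: str = ""                            # slowest stage of the recovery
    likely_cause: str = ""                          # hypothesis for the DRI to confirm
    evidence: list = field(default_factory=list)   # query names or links backing the claim

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


report = InvestigationReport(
    incident_id="INC-0001",
    vm_id="web-01",
    timeline=["10:01 host failure detected", "10:11 migration initiated"],
    bottleneck="host failure detected -> migration initiated (10 min)",
    likely_cause="delay in the orchestration layer",
    evidence=["vm_health_events.kql"],
)
```

Recording the evidence alongside the conclusion lets the DRI check the reasoning instead of taking the likely cause on trust.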

  2. MCP (Model Context Protocol) in Action: The AI Agent Framework

The Azure Compute team’s solution is powered by the Model Context Protocol (MCP), an open protocol that standardizes how AI agents connect to data sources and tools in a structured, context-aware manner. In this pipeline, MCP is the integration layer through which the agent:

  • Connects to Azure APIs: Retrieves VM metadata, health status, and repair logs via REST API calls
  • Executes Kusto queries: Runs diagnostic queries against Azure Monitor and Resource Health data
  • Correlates structured and unstructured data: Combines log data, ticket descriptions, and historical incident patterns
  • Generates human-readable reports: Produces Markdown or JSON summaries with evidence and recommendations
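
Under MCP, a tool invocation travels as a JSON-RPC 2.0 request whose method is `tools/call`. The Python sketch below builds such a message; the tool name `run_kusto_query` and its arguments are hypothetical, since the article does not say which tools the team's MCP servers expose.

```python
import itertools
import json

# JSON-RPC requests need an id that is unique within the session.
_request_ids = itertools.count(1)


def tool_call_request(tool_name: str, arguments: dict) -> dict:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool and arguments, for illustration only.
request = tool_call_request(
    "run_kusto_query",
    {"template": "vm_health_events.kql", "vm_name": "web-01"},
)
wire_message = json.dumps(request)
```

Because every tool is called through the same envelope, adding a new data source only requires registering another tool, not changing the agent's calling code.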

MCP Agent Configuration Example (Conceptual)

{
  "agent": {
    "name": "VM_Downtime_Investigator",
    "protocol": "MCP",
    "tools": [
      {
        "name": "AzureResourceGraph",
        "endpoint": "https://management.azure.com/providers/Microsoft.ResourceGraph/resources",
        "auth": "ManagedIdentity"
      },
      {
        "name": "KustoQueryEngine",
        "endpoint": "https://api.loganalytics.io/v1/workspaces/<WORKSPACE_ID>/query",
        "query_templates": [
          "vm_health_events.kql",
          "host_health_metrics.kql"
        ]
      }
    ],
    "workflow": "downtime_investigation.yaml"
  }
}

Step-by-Step: Deploying an MCP-Based Investigation Agent

  1. Define the investigation workflow in YAML or JSON, specifying each stage (parse ticket, query health, check host, correlate logs, build timeline, identify bottleneck)
  2. Configure tool connections to Azure Resource Graph, Log Analytics, and Activity Log APIs using Managed Identity or Service Principal authentication
  3. Set up event triggers—the agent should be invoked automatically when a “prolonged recovery” alert is fired from Azure Monitor
  4. Test the pipeline with historical incident data to validate correlation logic and report accuracy
  5. Integrate with incident management (e.g., ServiceNow, PagerDuty) to attach reports directly to tickets
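
The trigger in step 3 reduces to a duration check on the alert. A minimal Python sketch follows; the five-minute default is an assumed threshold for illustration, and `is_prolonged_recovery` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def is_prolonged_recovery(
    recovery_started: datetime,
    now: datetime,
    threshold: timedelta = timedelta(minutes=5),  # assumed default, tune per workload
) -> bool:
    """Return True when a recovery has been running longer than the threshold."""
    return (now - recovery_started) > threshold


started = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
checked = datetime(2024, 5, 1, 10, 7, tzinfo=timezone.utc)
should_investigate = is_prolonged_recovery(started, checked)
```

Passing `now` in explicitly, rather than reading the clock inside the function, makes the check easy to replay against historical incidents as step 4 suggests.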

  3. Automating Evidence Gathering with Azure Monitor and Kusto Queries

The backbone of automated investigation is Azure Monitor’s rich telemetry and Kusto Query Language (KQL). The AI system pre-defines a library of queries that extract exactly the evidence needed for each investigation stage.

Essential KQL Queries for VM Downtime Investigation

// 1. VM Health Status Timeline (column names follow the current AzureActivity schema)
AzureActivity
| where ResourceProviderValue =~ "Microsoft.Compute"
| where _ResourceId endswith "/virtualMachines/<VM_NAME>"
| where OperationNameValue in~ ("Microsoft.Compute/virtualMachines/write", "Microsoft.Compute/virtualMachines/restart/action")
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Properties
| sort by TimeGenerated asc

// 2. Host-Level Failures (if available via Azure Resource Health)
ResourceHealth
| where ResourceType == "virtualMachines"
| where Resource == "<VM_NAME>"
| where EventType in ("Downtime", "Degraded")
| project EventTimestamp, EventType, EventSubType, Summary
| sort by EventTimestamp desc

// 3. Network Connectivity Checks
AzureMetrics
| where Resource == "<VM_NAME>"
| where MetricName in ("Network In Total", "Network Out Total")
| where TimeGenerated > ago(30m)
| summarize avg(Total) by bin(TimeGenerated, 1m), MetricName

// 4. Storage Latency (OS Disk)
AzureMetrics
| where Resource == "<VM_NAME>"
| where MetricName == "OS Disk Latency (Read/Write)"
| where TimeGenerated > ago(30m)
| summarize avg(Average) by bin(TimeGenerated, 1m)

// 5. Recovery Action Timeline
AzureActivity
| where _ResourceId endswith "/virtualMachines/<VM_NAME>"
| where OperationNameValue contains "repair" or OperationNameValue contains "redeploy"
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Properties
| sort by TimeGenerated asc

Step-by-Step: Integrating Kusto Queries into the AI Pipeline

  1. Create a query library in Log Analytics workspace, storing each query as a saved function or template
  2. Parameterize queries to accept VM ID, time range, and incident ID as input variables
  3. Use Azure Automation Runbooks or Logic Apps to execute queries via the Log Analytics REST API
  4. Parse query results into structured JSON objects for the AI agent to correlate
  5. Set up alerts that trigger the investigation pipeline when certain conditions are met (e.g., VM recovery > 5 minutes)
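
Steps 2 and 4 can be sketched in Python as a query template plus a converter for the API response. The response shape (`tables`, `columns`, `rows`) follows the Log Analytics query API; the template text, the helper names, and the sample response are assumptions for illustration.

```python
from string import Template

# Hypothetical parameterized query; $vm_id, $start, and $end are filled per incident.
VM_ACTIVITY_QUERY = Template(
    "AzureActivity\n"
    '| where _ResourceId =~ "$vm_id"\n'
    "| where TimeGenerated between (datetime($start) .. datetime($end))\n"
    "| project TimeGenerated, OperationNameValue, ActivityStatusValue"
)


def render_query(template: Template, **params: str) -> str:
    """Fill a query template, rejecting values that could break out of a string literal."""
    for name, value in params.items():
        if '"' in value or "\n" in value:
            raise ValueError(f"unsafe value for {name}")
    return template.substitute(**params)


def rows_to_dicts(response: dict) -> list:
    """Convert the tables/columns/rows response into one dict per row."""
    records = []
    for table in response.get("tables", []):
        names = [column["name"] for column in table["columns"]]
        records.extend(dict(zip(names, row)) for row in table["rows"])
    return records


# Made-up response in the documented shape, for illustration only.
SAMPLE_RESPONSE = {
    "tables": [{
        "name": "PrimaryResult",
        "columns": [
            {"name": "TimeGenerated", "type": "datetime"},
            {"name": "OperationNameValue", "type": "string"},
        ],
        "rows": [["2024-05-01T10:11:00Z", "MICROSOFT.COMPUTE/VIRTUALMACHINES/REDEPLOY/ACTION"]],
    }]
}
```

The quote check is only a basic guard; where the query engine supports declared query parameters, those are a safer way to pass incident values.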

  4. Cloud Hardening and Proactive Mitigation Strategies

While AI-powered investigation accelerates root cause analysis, the ultimate goal is to prevent prolonged downtime events. Based on common recovery bottlenecks identified by the Azure Compute team, here are proactive hardening measures:

Network Resilience

  • Deploy Azure Load Balancer or Application Gateway with health probes to automatically route traffic away from unhealthy VMs
  • Use Availability Sets or Availability Zones to distribute VMs across fault domains
  • Implement Azure Site Recovery for cross-region failover

Storage Performance

  • Monitor disk latency and IOPS using Azure Metrics; set alerts for thresholds exceeding 100ms read latency
  • Use Premium SSDs or Ultra Disks for I/O-intensive workloads
  • Regularly review and optimize disk caching configurations

Guest OS and Application Health

  • Install and configure Azure Monitor Agent (AMA) for in-guest monitoring
  • Implement custom health probes for application-level availability (e.g., HTTP endpoints)
  • Use Azure Update Manager (the successor to the retired Azure Automation Update Management) to keep guest OS patches current

Recovery Automation

  • Configure Azure Automated VM Recovery to trigger automatic mitigation actions
  • Define runbooks that attempt graceful shutdown, reboot, or redeploy based on failure type
  • Test recovery playbooks regularly using Azure Chaos Studio to simulate failures

Windows/Linux Commands for Proactive Monitoring

# Linux: Check system health and recent errors
journalctl -u cloud-init --since "1 hour ago"
dmesg | tail -20
vmstat 1 5

# Linux: Check disk health and performance
iostat -x 1 3
smartctl -a /dev/sda

# Windows: Check the System event log for recent critical errors (Level 1 = Critical)
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Level = 1 } -MaxEvents 50

# Windows: Check disk performance
Get-Counter -Counter "\PhysicalDisk(*)\Avg. Disk sec/Read" -SampleInterval 1 -MaxSamples 5

  5. Security and API Hardening for AI-Driven Incident Response

Automating incident investigation introduces new security considerations. The MCP-based agent must securely access Azure APIs, Kusto workspaces, and incident management systems. Implement these hardening measures:

Authentication and Authorization

  • Use Azure Managed Identity for the automation account running the investigation pipeline—no credentials stored in code
  • Assign minimum required permissions: `Reader` on VM resources, `Log Analytics Reader` on workspaces, and a write role such as `Virtual Machine Contributor` only if mitigation actions are enabled
  • Implement Azure RBAC with custom roles scoped to specific resource groups or subscriptions

API Security

  • All API calls should use TLS 1.2 or higher
  • Implement retry policies with exponential backoff so that throttled requests (HTTP 429) are retried without adding load
  • Use Azure API Management as a gateway to log and audit all agent API requests
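
Exponential backoff can be sketched in a few lines of Python. This is a generic illustration rather than any Azure SDK's retry policy: `ThrottledError` and `call_with_backoff` are hypothetical names, and the delay values are arbitrary defaults.

```python
import random
import time


class ThrottledError(Exception):
    """Raised by the wrapped operation when the API answers with HTTP 429."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on ThrottledError with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps parallel agents from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists so the function can be exercised in tests without real waiting. If the API returns a `Retry-After` header, honouring it is usually preferable to a computed delay.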

Data Privacy and Compliance

  • Anonymize or redact sensitive data (PII, IP addresses) before storing investigation reports
  • Ensure Log Analytics workspace has data retention policies aligned with compliance requirements
  • Regularly audit agent access logs using Microsoft Sentinel (formerly Azure Sentinel) or Azure Monitor
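
Redaction before storage can start with pattern matching over the report text. The Python sketch below masks IPv4 addresses and email addresses only; the patterns and placeholder strings are assumptions, and a production pipeline would need broader coverage (IPv6, hostnames, user names).

```python
import re

# Deliberately simple patterns: the IPv4 one also matches dotted version strings like 1.2.3.4.
_IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
_EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact(text: str) -> str:
    """Mask email addresses and IPv4 addresses in report text."""
    text = _EMAIL.sub("[REDACTED_EMAIL]", text)
    return _IPV4.sub("[REDACTED_IP]", text)


sample = redact("Host 10.0.0.4 unreachable; contact ops@example.com")
```

Running redaction as the last step before the report is written keeps raw values available to the correlation stages while keeping them out of long-term storage.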

Example: Secure API Call with Managed Identity (PowerShell)

# Connect to Azure using Managed Identity
Connect-AzAccount -Identity

# Query VM power state via Azure Resource Graph
$query = "Resources | where type =~ 'Microsoft.Compute/virtualMachines' and name =~ '<VM_NAME>' | project name, location, powerState = properties.extended.instanceView.powerState.code"
$result = Search-AzGraph -Query $query
$report = $result

# Send the report to the incident management system; the API key is read from Key Vault
$secureToken = Get-AzKeyVaultSecret -VaultName "<KV_NAME>" -Name "IncidentAPIKey" -AsPlainText
Invoke-RestMethod -Uri "https://incident.example.com/api/reports" -Method Post -Headers @{"Authorization"="Bearer $secureToken"} -ContentType "application/json" -Body ($report | ConvertTo-Json -Depth 5)

What Undercode Say:

  • AI-powered investigation pipelines are not about replacing human engineers—they’re about eliminating the repetitive, time-consuming evidence-gathering phase so that DRIs can focus on the final diagnostic call and mitigation strategy
  • The most effective agentic AI implementations are those that encode the team’s existing tribal knowledge and best practices into structured, repeatable workflows, rather than trying to build everything from scratch
  • Success depends on high-quality telemetry—if your monitoring data is incomplete or inaccurate, even the smartest AI agent will produce misleading conclusions

The Azure Compute team’s approach demonstrates a pragmatic evolution in cloud operations: moving from reactive, manual triage to proactive, automated intelligence. By leveraging MCP and Kusto queries, they’ve transformed a process that used to take DRIs 15–30 minutes per incident into a near-instantaneous structured report. This doesn’t just save time—it reduces cognitive load during high-pressure outages, minimizes mean time to resolution (MTTR), and enables teams to scale their incident response without proportional headcount growth. The same pattern can be applied to other cloud providers and even on-premises environments, provided you have the telemetry foundation and a clear understanding of your recovery workflows.

Expected Output:

  • Faster MTTR: Automated evidence gathering reduces investigation time from minutes to seconds, allowing DRIs to make faster, more informed decisions
  • Consistent Investigations: Every incident receives the same rigorous analysis, reducing human error and variability in root cause identification
  • Continuous Learning: The pipeline can be updated as new failure patterns emerge, encoding institutional knowledge that persists beyond individual team members
  • Scalable Incident Response: Teams can handle more incidents without increasing headcount, as the AI handles the heavy lifting of data correlation

Prediction:

  • +1 Cloud operations teams will increasingly adopt MCP-based agents for incident investigation, reducing average MTTR by 40–60% within the next 18 months
  • +1 The integration of AI with observability platforms (Azure Monitor, Datadog, Splunk) will become a standard feature, not a differentiator
  • +1 Open-source frameworks for agentic incident response will emerge, allowing teams to build custom pipelines without vendor lock-in
  • -1 Over-reliance on AI-generated reports may lead to “automation blindness,” where engineers fail to question or validate AI conclusions, potentially missing novel failure modes
  • -1 The complexity of configuring and maintaining these pipelines will create a skills gap, requiring investment in training for cloud engineers and SREs
  • +1 As MCP and similar protocols mature, we’ll see cross-cloud investigation agents that can correlate incidents across AWS, Azure, and GCP from a single pane of glass


IT/Security Reporter URL:

Reported By: Matthansen0 Azure – Hackers Feeds