Codename MDASH: Microsoft’s Multi-Model Agentic Scanning System Redefining Vulnerability Discovery At AI Speed + Video

Introduction:

Every vulnerability has two clocks running—one belongs to the defender racing to find it, the other to the attacker hoping to discover it first. For decades, the attacker has held the advantage because modern code is vast, interconnected, and changing daily, while security reviews happen at fixed moments in time. Microsoft’s Codename MDASH—a multi-model agentic scanning system—flips this dynamic by orchestrating a panel of specialized AI agents to discover, validate, and help remediate software vulnerabilities end-to-end at enterprise scale.

Learning Objectives:

Understand the architecture and orchestration logic behind Microsoft’s multi-model agentic scanning system and how specialized AI agents collaborate in a structured pipeline.
Learn how MDASH integrates with Microsoft Defender, GitHub Advanced Security, and Azure DevOps to create a closed-loop vulnerability management workflow.
Gain practical insights into AI-driven vulnerability discovery, validation, and proof-of-concept generation, including real-world failure modes and mitigation strategies.

1. The MDASH Pipeline: From Code to Fix in a Closed Loop

MDASH is not a standalone scanner—it plugs directly into the tools security teams and developers already use. The system operates through a structured pipeline with distinct stages: prepare, scan, validate, and prove. Each stage is handled by specialized AI agents, with a routing mechanism that filters out irrelevant agents while preserving strong candidates, allowing the system to scale across diverse targets.

Validated findings surface as code scanning alerts in GitHub Advanced Security (GHAS), appearing inline on pull requests and in the repository’s security tab. The same findings flow into Azure DevOps, where they can gate pipeline builds and open work items for remediation, and into Microsoft Defender, where they are prioritized alongside threat intelligence and runtime signals. This closed loop connects discovery, validation, proof, and fix across the Microsoft stack.
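
GitHub code scanning accepts third-party results as SARIF 2.1.0 uploads, so one plausible way for a scanner's validated findings to surface as alerts is to serialize them in that format. The sketch below assumes that route; the article does not say how MDASH publishes its findings, and the finding fields (`rule_id`, `message`, `path`, `line`) and the tool name are hypothetical.

```python
import json


def to_sarif(findings, tool_name="example-scanner"):
    """Wrap a list of finding dicts in a minimal SARIF 2.1.0 log."""
    rule_ids = sorted({f["rule_id"] for f in findings})
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [
            {
                "tool": {
                    "driver": {
                        "name": tool_name,  # hypothetical tool name
                        "rules": [{"id": rule_id} for rule_id in rule_ids],
                    }
                },
                "results": [
                    {
                        "ruleId": f["rule_id"],
                        "level": "error",
                        "message": {"text": f["message"]},
                        "locations": [
                            {
                                "physicalLocation": {
                                    "artifactLocation": {"uri": f["path"]},
                                    "region": {"startLine": f["line"]},
                                }
                            }
                        ],
                    }
                    for f in findings
                ],
            }
        ],
    }


# Hypothetical finding, used only to show the resulting shape.
example = to_sarif(
    [
        {
            "rule_id": "heap-overflow",
            "message": "Length field is not checked against the buffer size",
            "path": "src/parser.c",
            "line": 42,
        }
    ]
)
print(json.dumps(example, indent=2))
```

A file in this shape is what a code scanning upload step would consume; the alert then appears on the pull request line named in `region.startLine`.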

Step‑by‑step guide to understanding the pipeline:

Prepare Stage: The system distinguishes the code under audit from contextual code, defining dependencies based on their role rather than origin. It generates a comprehensive threat model and identifies entry points for untrusted input, including maintainer-defined entry points like fuzz harnesses that may reside outside the primary codebase.
Scan Stage: Specialized agents analyze the scoped codebase, identifying potential vulnerabilities. The system prioritizes which files and functions to analyze first, though it can sometimes de-emphasize less obvious components.
Validate Stage: The system attempts to confirm whether identified vulnerabilities are genuinely exploitable by reasoning about reachability and concrete execution paths.
Prove Stage: The system generates working proof-of-concept exploits to demonstrate exploitability—the most challenging stage, accounting for 65.4% of failures in benchmark evaluations.
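
The four stages above can be sketched as a small orchestration loop. This is a minimal illustration under stated assumptions, not MDASH's implementation: the `Finding` and `Agent` types, the tag-based `route` function, and the `validate` and `prove` callbacks are all hypothetical stand-ins for what the article describes only at a high level.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass
class Finding:
    path: str
    description: str
    validated: bool = False
    proven: bool = False


@dataclass
class Agent:
    name: str
    tags: FrozenSet[str]                  # hypothetical relevance tags
    scan: Callable[[str], List[Finding]]  # candidate findings for one file


def route(agents, target_tags):
    """Drop agents that share no tag with the target; keep the rest."""
    return [a for a in agents if a.tags & target_tags]


def run_pipeline(roles, target_tags, agents, validate, prove):
    # Prepare: scope by role ("audit" vs "context"), not by where a file lives.
    scoped = [path for path, role in roles.items() if role == "audit"]
    # Scan: only the routed agents look at the scoped files.
    findings = [f for a in route(agents, target_tags) for p in scoped for f in a.scan(p)]
    # Validate: keep only findings the validator can confirm.
    for f in findings:
        f.validated = validate(f)
    confirmed = [f for f in findings if f.validated]
    # Prove: attempt a proof of concept for each confirmed finding.
    for f in confirmed:
        f.proven = prove(f)
    return confirmed


c_agent = Agent(
    "c-memory",
    frozenset({"c"}),
    lambda p: [Finding(p, "unchecked length")] if p.endswith(".c") else [],
)
web_agent = Agent("web-injection", frozenset({"javascript"}), lambda p: [Finding(p, "not routed")])

roles = {
    "src/parser.c": "audit",
    "vendor/zlib/inflate.c": "context",  # dependency by role: reference only
    "fuzz/harness.c": "audit",           # entry point outside the main tree
}
results = run_pipeline(
    roles,
    {"c"},
    [c_agent, web_agent],
    validate=lambda f: "parser" in f.path,  # stand-in for reachability reasoning
    prove=lambda f: True,                   # stand-in for PoC generation
)
```

In this toy run the web agent is filtered out before scanning, the vendored file is never scanned, and only the parser finding survives validation.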

2. Real-World Impact: What MDASH Caught This Month

The measure of any security system is what it catches. This month’s Patch Tuesday cohort includes a set of vulnerability discoveries across the Windows ecosystem, Hyper-V, the Windows kernel, Active Directory Domain Services, Remote Desktop Client, HTTP.sys, DNS Client, and DHCP Client.

Several findings involve high-severity remote code execution vulnerabilities in core infrastructure layers that are difficult to scrutinize using manual approaches alone. Each was identified before exploitation, in areas of the codebase that would traditionally demand significant manual effort to review.

Windows commands for general diagnostics on the affected components (they inspect system state; they do not detect the specific vulnerabilities):

Windows kernel: In WinDbg, run `!analyze -v` to analyze a crash dump and `lm` to list loaded modules.
HTTP.sys: Run `netsh http show servicestate` to examine HTTP.sys state and `sc query http` to check the HTTP service status.
DNS Client: Run `ipconfig /displaydns` to view the DNS resolver cache and `nslookup` to test name resolution.
Active Directory: Run `Get-ADObject -Filter * -Properties *` to enumerate AD objects with all their properties (narrow it with `-SearchBase` in large directories). `Set-ADObject` modifies objects, so restrict it to accounts that hold the necessary permissions.

3. Performance Benchmarking: 96.5% on CyberGym

CyberGym, an industry benchmark built on 1,507 real-world vulnerabilities, gave the team a way to iterate quickly and measure progress. The latest version of MDASH achieved 96.5% (any crash) on CyberGym, including both target and non-target vulnerabilities.

The gains were concentrated in the earliest stages of the pipeline: prepare and scan—improvements that directly raise the quality of everything downstream, including validation and proof generation.

Step‑by‑step guide to understanding the performance improvements:

Sharper scoping—The system more clearly distinguishes the code under audit from contextual code, defining dependencies based on their role rather than origin.
More comprehensive threat modeling—The system has a fuller view of a target repository’s attack surface, particularly in identifying entry points for untrusted input.
A more reliable call graph—The correctness and robustness of the call graph has been strengthened, improving the system’s ability to reason about code interactions, especially for reachability analysis during validation.
Smarter routing to specialized agents—A new routing mechanism filters out clearly irrelevant agents while preserving strong candidates, reducing unnecessary computation while maintaining coverage.
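
Reachability analysis over a call graph, mentioned in the list above, can be illustrated with a breadth-first search from the entry points for untrusted input. This is a generic sketch, not MDASH's call-graph implementation; the function names in the example graph are hypothetical.

```python
from collections import deque


def reachable_from(call_graph, entry_points):
    """Return every function reachable from the entry points via call edges."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        caller = queue.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "fuzz_entry": ["parse_header"],
    "parse_header": ["read_chunk"],
    "read_chunk": ["copy_payload"],
    "debug_dump": ["format_report"],
}
reachable = reachable_from(call_graph, ["fuzz_entry"])
```

A candidate bug in `copy_payload` is reachable from untrusted input here, while one in `format_report` is not, because nothing on the path from `fuzz_entry` calls `debug_dump`. A missing or wrong edge in the graph changes that answer, which is why call-graph correctness matters for validation.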

4. Understanding the Remaining 3.5%: Failure Mode Analysis

While the 96.5% score represents a significant step forward, the system missed 52 tasks (3.5% of cases). Understanding these failures is critical for improvement:

Scan Stage Failures (8 cases, 15.4%):

Incorrect scope from ambiguous descriptions—When bug descriptions are too general, especially in repositories with multiple modules, precise localization becomes difficult.
Missed prioritization of vulnerable components—The system can de-emphasize less obvious components, such as lexer/parser components in favor of other C code paths.

Validate Stage Failures (10 cases, 19.2%):

Hypothetical descriptions and code misinterpretation—Scan results sometimes include hypothetical descriptions rather than concrete execution paths. When validation cannot confirm a concrete path, it may reject the finding.

Prove Stage Failures (34 cases, 65.4%):

Highly structured input requirements—Targets like IVF/AV1, fonts, and PDFs require complex inputs that satisfy format validation while reaching vulnerable code paths.
Fuzzing until timeout—The system sometimes found crashes but failed to generate inputs accepted as valid within time constraints.
Environment mismatch—Crashes reproduced locally sometimes didn’t transfer to the evaluation harness due to build configuration mismatches.
Build complexity and time constraints—Build processes sometimes failed, ran too long, or exceeded execution budgets.
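
The per-stage percentages above follow directly from the case counts. The short check below recomputes them from the figures given in the text (8, 10, and 34 cases out of 1,507 tasks); nothing else is assumed.

```python
TOTAL_TASKS = 1507
failures_by_stage = {"scan": 8, "validate": 10, "prove": 34}

total_failures = sum(failures_by_stage.values())

# Share of all failures attributed to each stage, in percent.
share_of_failures = {
    stage: round(100 * count / total_failures, 1)
    for stage, count in failures_by_stage.items()
}

# Overall miss rate across the benchmark, in percent.
miss_rate = round(100 * total_failures / TOTAL_TASKS, 1)

print(total_failures, share_of_failures, miss_rate)
```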

Step‑by‑step guide to addressing these failure modes:

For scan-stage failures: Improve scope generation by providing more detailed bug descriptions and implementing better component prioritization logic.
For validate-stage failures: Require concrete execution paths in scan-stage findings rather than hypothetical descriptions.
For prove-stage failures: Integrate with existing fuzzing ecosystems like OSS-Fuzz to reuse build pipelines and draw on existing seed corpora. Extend analysis to support non-traditional code artifacts like lex/yacc-generated code.
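
Requiring a concrete execution path can be made mechanical: a finding carries the chain of calls it claims, and the validator rejects any chain that does not start at an entry point or that uses an edge missing from the call graph. The sketch below is one way to express that check; the function names and graph are hypothetical, and the article does not describe how MDASH represents paths.

```python
def is_concrete_path(path, call_graph, entry_points):
    """True if `path` starts at an entry point and every hop is a real call edge."""
    if not path or path[0] not in entry_points:
        return False
    return all(
        callee in call_graph.get(caller, ())
        for caller, callee in zip(path, path[1:])
    )


# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "fuzz_entry": ["parse_header"],
    "parse_header": ["read_chunk"],
    "read_chunk": ["copy_payload"],
}
entry_points = {"fuzz_entry"}

concrete = is_concrete_path(
    ["fuzz_entry", "parse_header", "read_chunk", "copy_payload"],
    call_graph,
    entry_points,
)
hypothetical = is_concrete_path(
    ["fuzz_entry", "copy_payload"],  # skips the intermediate calls
    call_graph,
    entry_points,
)
```

The second path describes a plausible bug but not a chain the code actually executes, which is the kind of finding the validate stage may reject.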

5. The Model Evolution: Newer Models Add 1–2% Improvement

To isolate the impact of system-level improvements, the core evaluation intentionally used the same model configuration as the previous benchmark, attributing gains directly to pipeline improvements. However, modern foundation models continue to evolve, and additional experiments on the 52 previously failed cases revealed meaningful gains:

Experiment 1 (newer OpenAI models for discovery, Claude Opus 4.6 for prove):

Configuration: Prepare/Scan/Validate with GPT-5.4, GPT-5.5, GPT-5.4-mini, and GPT-5.3-codex; Prove with Claude Opus 4.6
Result: 19 of 52 cases solved (36.5%), projected success rate: 97.8%
The primary gain came from higher-quality scan-stage findings with concrete execution details

Experiment 2 (GPT-5.5 / GPT-5.5-cyber for prove):

Result (GPT-5.5): 21 of 52 cases solved (40.4%), projected success rate: 97.9%
Result (GPT-5.5-cyber): 23 of 52 cases solved (44.2%), projected success rate: 98.1%
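
The projected success rates can be reproduced from the numbers in the text. The calculation below assumes, as the figures imply, that a projection adds the newly solved cases to the 1,455 tasks the baseline already solved (1,507 minus the 52 misses) and that no previously solved task regresses.

```python
TOTAL_TASKS = 1507
PREVIOUSLY_FAILED = 52
baseline_solved = TOTAL_TASKS - PREVIOUSLY_FAILED


def recovered_share(newly_solved):
    """Percent of the 52 previously failed cases that an experiment solved."""
    return round(100 * newly_solved / PREVIOUSLY_FAILED, 1)


def projected_rate(newly_solved):
    """Projected benchmark score in percent, assuming no regressions."""
    return round(100 * (baseline_solved + newly_solved) / TOTAL_TASKS, 1)


for label, solved in [("baseline", 0), ("experiment 1", 19),
                      ("GPT-5.5", 21), ("GPT-5.5-cyber", 23)]:
    print(label, recovered_share(solved), projected_rate(solved))
```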

Three distinct proof-generation strategies emerged:

1. Code-based: reasoning over code paths to craft inputs

2. Fuzzing-based: searching the input space for crashes

3. Custom instrumentation-based: exposing vulnerability-relevant variables and using them as feedback signals to guide input generation
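
The custom instrumentation-based strategy can be illustrated with a toy target. In the sketch below, the "instrumentation" exposes one variable, the distance between a claimed length and the buffer capacity, and the search keeps any mutation that shrinks it. Everything here is hypothetical (the `MD` file magic, the 64-byte buffer, the byte-at-a-time search); it shows the feedback idea only, not how MDASH generates proofs.

```python
BUFFER_SIZE = 64  # capacity of the toy target's fixed buffer


def target(data: bytes):
    """Toy parser standing in for an instrumented build.

    Returns (crashed, distance). `distance` is the exposed variable: how many
    bytes remain before the claimed length overflows the buffer. It is None
    when the input fails format validation.
    """
    if len(data) < 4 or data[:2] != b"MD":
        return False, None
    claimed_len = (data[2] << 8) | data[3]  # big-endian 16-bit length field
    return claimed_len > BUFFER_SIZE, BUFFER_SIZE - claimed_len


def guided_search(seed: bytes, mutable_positions, max_rounds: int = 4):
    """Mutate one byte at a time, keeping changes that shrink `distance`."""
    best = bytearray(seed)
    crashed, best_distance = target(bytes(best))
    if crashed:
        return bytes(best)
    if best_distance is None:
        raise ValueError("seed must pass format validation")
    for _ in range(max_rounds):
        for pos in mutable_positions:
            for value in range(256):
                candidate = bytearray(best)
                candidate[pos] = value
                crashed, distance = target(bytes(candidate))
                if crashed:
                    return bytes(candidate)
                # Feedback signal: keep the input that gets closest to overflow.
                if distance is not None and distance < best_distance:
                    best, best_distance = candidate, distance
    return None


# Start from a valid seed and mutate only the two length bytes.
poc = guided_search(b"MD\x00\x00", mutable_positions=(3, 2))
```

The magic bytes are never touched, so every candidate stays format-valid while the length field is pushed toward the overflow. Real structured formats such as fonts or PDFs need far richer feedback, which is consistent with the prove stage being the hardest.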

6. Integration with Existing Security Tools and Workflows

MDASH is designed to strengthen the software development lifecycle from the inside, not to add one more tool for teams to tend. The system integrates with:

GitHub Advanced Security (GHAS):

Findings appear inline on pull requests
Visible in the repository’s security tab
Engineers triage in the same place they review code

Azure DevOps:

Findings can gate pipeline builds
Work items automatically opened for remediation

Microsoft Defender:

Findings prioritized alongside threat intelligence
Runtime signals incorporated into prioritization

Azure OpenAI and Copilot Integration:

Security teams can use natural language to query MDASH findings, ask for remediation guidance, and generate fixes using Copilot integration that wraps the system in a chat interface.

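Commands for pulling security findings into CI/CD:

GitHub Advanced Security API: Use `gh api /repos/{owner}/{repo}/code-scanning/alerts` to retrieve alerts programmatically; inside a cloned repository, `gh` fills in `{owner}` and `{repo}` automatically.
Azure DevOps REST API: `az devops invoke --area code --resource scanning --http-method GET` queries scanning results. Confirm the `--area` and `--resource` values against the API areas available in your organization before relying on it.
Defender for Cloud API: In PowerShell, use `Get-AzSecurityAlert` to retrieve security alerts for the current subscription, optionally filtered by resource group with `-ResourceGroupName`.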
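
Once alerts have been retrieved, gating a build comes down to a severity threshold. The sketch below assumes the alerts were saved as JSON from the GitHub code scanning alerts endpoint and reads two fields from each alert, `state` and `rule.security_severity_level`; the threshold policy and the sample data are hypothetical.

```python
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def blocking_alerts(alerts, threshold="high"):
    """Return open alerts whose security severity is at or above the threshold."""
    floor = SEVERITY_ORDER.index(threshold)
    blocking = []
    for alert in alerts:
        if alert.get("state") != "open":
            continue  # dismissed and fixed alerts never block a build
        level = (alert.get("rule") or {}).get("security_severity_level")
        if level in SEVERITY_ORDER and SEVERITY_ORDER.index(level) >= floor:
            blocking.append(alert)
    return blocking


# Hypothetical alerts, trimmed to the fields the gate reads.
sample_alerts = [
    {"number": 1, "state": "open", "rule": {"security_severity_level": "critical"}},
    {"number": 2, "state": "open", "rule": {"security_severity_level": "medium"}},
    {"number": 3, "state": "dismissed", "rule": {"security_severity_level": "high"}},
    {"number": 4, "state": "open", "rule": {"security_severity_level": None}},
]

blocking = blocking_alerts(sample_alerts, threshold="high")
exit_code = 1 if blocking else 0  # a CI step would exit with this code
```

Alerts without a security severity are ignored here; a stricter policy could treat them as blocking instead.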

7. Future Directions: From Benchmark Excellence to Real-World Impact

Microsoft is charting its course in two directions:

First—Advancing to real-world environments:

Targeting cost-efficient discovery of previously unknown vulnerabilities
Integrated capabilities to triage and fix issues at scale
Finding the bug is half the job—closing it is the other half

Second—Advancing the benchmark:

Capturing the complexity, ambiguity, and end-to-end workflows of real-world vulnerability discovery
Pushing the frontier from benchmark excellence to real-world impact

The model variation experiments point toward the same conclusion: the system and the models improve in complementary ways. The additional gains from newer models were real, especially in the precision of scan-stage findings—and that is not a complication in interpreting results but a roadmap.

What Undercode Says:

Key Takeaway 1: The MDASH architecture represents a fundamental shift from single-model AI security tools to multi-agent orchestration—a panel of specialized agents working in concert outperforms any single model, with the system architecture itself being the product, not just the model.
Key Takeaway 2: The 3.5% failure rate analysis reveals that proof-of-concept generation (65.4% of failures) is the primary bottleneck, driven by highly structured input requirements, fuzzing timeouts, environment mismatches, and build complexity—areas where integrating existing fuzzing ecosystems like OSS-Fuzz and extending analysis beyond source code will yield the greatest gains.

Analysis:

The MDASH system exemplifies a mature approach to AI-driven security that moves beyond proof-of-concept demonstrations to production-grade defense at enterprise scale. By orchestrating specialized agents rather than relying on a single model, Microsoft has created a system that can reason through the complexity of proprietary code and platforms such as Windows, Hyper-V, Azure, and identity systems. Reasoning about these components requires understanding kernel calling conventions, object lifetime invariants, and trust boundaries that language models are unlikely to have encountered in their training data, because the code is proprietary.

The integration with GitHub Advanced Security, Azure DevOps, and Microsoft Defender creates a closed-loop workflow that transforms vulnerability discovery from a standalone activity into an integral part of the software development lifecycle. Findings travel the same path as every other code change—with an owner, a pull request, and a fix on the other side—landing as actionable engineering work rather than stalling in a backlog.

Perhaps most significant is the measured approach to performance evaluation. By holding the model configuration constant to isolate pipeline improvements, then testing newer models separately, Microsoft demonstrates rigorous engineering discipline. The 96.5% CyberGym score with projected improvements to 98.1% using newer models shows that both system architecture and model evolution matter—and they grow stronger together.

Prediction:

+1 The multi-agent orchestration paradigm will become the industry standard for AI-driven security tools within 18–24 months, as organizations recognize that specialized agents working in concert outperform monolithic models across diverse codebases and vulnerability classes.
+1 Integration with CI/CD pipelines and existing security workflows will accelerate adoption, with MDASH-style closed-loop systems reducing mean time to remediate (MTTR) by 40–60% for organizations that implement similar architectures.
-1 The 3.5% failure rate, while impressive, highlights persistent challenges in proof-of-concept generation for targets requiring highly structured inputs—attackers may continue to exploit these gaps until fuzzing integration and artifact support mature.
+1 The complementary improvement of system architecture and foundation models suggests a virtuous cycle: better models enable better system design, which in turn generates higher-quality training data for future model iterations, creating a self-reinforcing security capability.
+1 Microsoft’s investment in agentic security positions the company to lead the next generation of DevSecOps, potentially reshaping how vulnerability discovery, validation, and remediation are performed across the software industry—not just within Microsoft’s own ecosystem.

▶️ Related Video (82% Match):

https://www.youtube.com/watch?v=4TB6mrpHt4g

IT/Security Reporter URL:

Reported By: Markolauren Mdash – Hackers Feeds