StepFly is Microsoft’s AI agent framework that automates troubleshooting guide execution for IT incident management. In Microsoft’s evaluation, the system reaches a success rate of about 94% on GPT-4.1 and, for parallelizable guides, cuts execution time by 32.9% to 70.4% by running independent steps in parallel.
| Released by | Microsoft Research |
|---|---|
| Release date | |
| What it is | AI agent framework for automating IT troubleshooting guides |
| Who it is for | Site reliability engineers and IT operations teams |
| Where to get it | GitHub open source |
| Price | Free |
- StepFly automates troubleshooting guide execution with a success rate of about 94% on GPT-4.1
- Three-stage workflow includes guide quality improvement, offline preprocessing, and online execution
- Parallel execution reduces troubleshooting time by 32.9% to 70.4% for parallelizable guides
- Framework converts unstructured guides into directed acyclic graphs for systematic execution
- Open source code and sample data available on GitHub
- Manual troubleshooting guide execution is slow and error-prone in large-scale IT systems
- StepFly addresses key challenges including guide quality, complex control flow, and data-intensive queries
- The framework enables parallel execution of independent troubleshooting steps
- Microsoft’s empirical study analyzed 92 real-world troubleshooting guides to inform design
- A DAG-guided scheduler-executor framework with a memory system ensures correct workflow execution
What is StepFly
StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. The system converts unstructured troubleshooting guides into structured workflows that AI agents can execute automatically. AI agents have complex goal structures, natural language interfaces, and the capacity to act independently of user supervision [6]. StepFly specifically targets three challenges in large-scale IT environments: managing troubleshooting guide quality, interpreting complex control flow, and handling data-intensive queries.
What is new vs previous approaches
StepFly introduces specialized capabilities that existing LLM-based incident management solutions lack.
| Feature | Previous LLM Solutions | StepFly |
|---|---|---|
| Guide Quality Management | Limited support | TSG Mentor tool for quality improvement |
| Control Flow Interpretation | Basic sequential processing | Structured DAG extraction and execution |
| Parallel Execution | Not supported | Independent step parallelization |
| Data Query Handling | Generic approach | Dedicated Query Preparation Plugins |
| Memory System | Limited context retention | Comprehensive workflow state management |
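To make the "Data Query Handling" row more concrete, here is a minimal sketch of what a query preparation plugin could look like. It assumes a plugin is simply a function that fills a parameterized query template from incident parameters. The template, the table and column names, and the function `prepare_error_count_query` are all hypothetical and are not taken from the StepFly code base.

```python
from string import Template

# Hypothetical query template; table and column names are illustrative only.
ERROR_COUNT_QUERY = Template(
    "SELECT COUNT(*) FROM $table "
    "WHERE service = '$service' AND ts >= '$start' AND ts < '$end'"
)

REQUIRED_PARAMS = {"table", "service", "start", "end"}


def prepare_error_count_query(params):
    """Build a query string from incident parameters (assumed plugin shape)."""
    missing = REQUIRED_PARAMS - set(params)
    if missing:
        # Fail early so an agent never runs a half-filled query.
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return ERROR_COUNT_QUERY.substitute(params)


query = prepare_error_count_query(
    {
        "table": "request_logs",
        "service": "checkout",
        "start": "2025-01-01T00:00",
        "end": "2025-01-01T01:00",
    }
)
print(query)
```

The point of the design is that the agent only supplies a handful of parameters, while the query text itself stays fixed and reviewable. A production version would also need to escape or bind the parameters rather than substitute them as raw strings.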
How does StepFly work
StepFly operates through a three-stage workflow that transforms manual troubleshooting into automated execution.
- Guide Quality Enhancement: TSG Mentor assists site reliability engineers in improving troubleshooting guide quality and structure
- Offline Preprocessing: LLMs extract structured execution directed acyclic graphs from unstructured guides and create Query Preparation Plugins
- Online Execution: A DAG-guided scheduler-executor framework with a memory system ensures correct workflow execution and enables parallel execution of independent steps
The framework converts troubleshooting guides into directed acyclic graphs that represent step dependencies and execution order. Agentic engineering operates at a higher level of abstraction as a control plane that orchestrates cross-team workflows and maintains long-term memory across agents [7].
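The online execution stage can be sketched in a few lines of Python. This is an illustration of the general idea only, not StepFly's implementation: the step names, the `run_step` stand-in, and the dictionary used as "memory" are assumptions made for the example. It uses the standard library's `graphlib.TopologicalSorter` to release steps as soon as their dependencies finish, and a thread pool to run independent steps side by side.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical guide: each step maps to the set of steps it depends on.
DEPENDENCIES = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_disk": {"check_alert"},
    "summarize": {"query_cpu", "query_disk"},
}


def run_step(name, memory):
    """Stand-in executor; a real one would call an LLM or run a query."""
    inputs = {dep: memory[dep] for dep in DEPENDENCIES[name]}
    return f"{name} done (inputs: {sorted(inputs)})"


def execute(dependencies, max_workers=4):
    """Run steps in dependency order, starting independent steps together."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    memory = {}       # results of finished steps, readable by later steps
    waves = []        # groups of steps that became runnable at the same time
    running = {}      # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()
            if ready:
                waves.append(sorted(ready))
            for step in ready:
                running[pool.submit(run_step, step, memory)] = step
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in finished:
                step = running.pop(future)
                memory[step] = future.result()
                sorter.done(step)  # may unblock dependent steps
    return memory, waves


memory, waves = execute(DEPENDENCIES)
print(waves)
```

In this toy guide, `query_cpu` and `query_disk` become runnable together once `check_alert` finishes, and `summarize` waits for both. That ordering is what a DAG-guided scheduler is meant to guarantee.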
Benchmarks and evidence
Microsoft’s evaluation demonstrates StepFly’s effectiveness across multiple performance metrics.
| Metric | Result | Context |
|---|---|---|
| Success Rate on GPT-4.1 | 94% | Microsoft Research evaluation |
| Execution Time Reduction | 32.9% to 70.4% | Parallelizable TSG performance |
| Real-world TSGs Analyzed | 92 guides | Empirical study foundation |
| Token Consumption | Lower than baselines | Comparative evaluation |
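The execution time reduction depends on how much of a guide can run in parallel. The sketch below shows one way to reason about that, using made-up step durations and dependencies (none of these numbers come from Microsoft's evaluation). It compares the sum of all step durations with the longest dependency chain, which is the best case for parallel execution when enough workers are available.

```python
from graphlib import TopologicalSorter

# Hypothetical step durations (seconds) and dependencies, for illustration only.
DURATION = {"triage": 10, "query_a": 40, "query_b": 30, "report": 20}
DEPENDS_ON = {
    "triage": set(),
    "query_a": {"triage"},
    "query_b": {"triage"},
    "report": {"query_a", "query_b"},
}


def sequential_time(duration):
    """Total time when every step runs one after another."""
    return sum(duration.values())


def parallel_time(duration, depends_on):
    """Length of the longest dependency chain (critical path)."""
    finish = {}
    for step in TopologicalSorter(depends_on).static_order():
        start = max((finish[dep] for dep in depends_on[step]), default=0)
        finish[step] = start + duration[step]
    return max(finish.values())


def reduction_percent(duration, depends_on):
    """Percentage of sequential time saved by ideal parallel execution."""
    seq = sequential_time(duration)
    par = parallel_time(duration, depends_on)
    return 100 * (seq - par) / seq


print(sequential_time(DURATION))                # 100
print(parallel_time(DURATION, DEPENDS_ON))      # 70
print(reduction_percent(DURATION, DEPENDS_ON))  # 30.0
```

A guide whose steps form a single chain gains nothing from this, which is consistent with the reduction being reported only for parallelizable guides.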
Who should care
Builders
Software engineers building incident management systems can use StepFly’s open source framework for automated troubleshooting. Software engineering is the most common deployment area for AI agents, accounting for roughly half of use cases [2]. The DAG-based execution model provides a foundation for building reliable automation systems.
Enterprise
Large organizations with complex IT infrastructure benefit from StepFly’s ability to standardize and accelerate incident response. Agentic AI enhances incident response speed while providing more specific and in-depth post-incident analysis [3]. The framework reduces dependency on manual troubleshooting expertise.
End users
Site reliability engineers and IT operations teams gain tools for improving troubleshooting guide quality and execution efficiency. The TSG Mentor component specifically addresses guide quality issues that impact automation success rates.
Investors
The incident management automation market represents a significant opportunity as organizations seek to reduce operational costs and improve system reliability. StepFly’s open source availability could accelerate adoption and ecosystem development.
How to use StepFly today
StepFly is available as open source software with complete implementation and sample data.
- Clone the repository: git clone https://github.com/microsoft/StepFly
- Install dependencies according to the project requirements
- Prepare troubleshooting guides using the TSG Mentor tool for quality improvement
- Run offline preprocessing to convert guides into structured DAGs
- Configure the scheduler-executor framework for your environment
- Execute troubleshooting workflows through the agentic system
StepFly vs competitors
StepFly competes with other AI-powered incident management solutions in the market.
| Feature | StepFly | AWS DevOps Agent | Neubird AI SRE |
|---|---|---|---|
| Open Source | Yes | No | No |
| Parallel Execution | Yes | Not disclosed | Not disclosed |
| Success Rate | 94% on GPT-4.1 | Not disclosed | Not disclosed |
| Guide Quality Tools | TSG Mentor | Not disclosed | Not disclosed |
| Structured Workflows | DAG-based | Not disclosed | Telemetry analysis focus |
Risks, limits, and myths
- Quality Dependency: Success rates depend heavily on troubleshooting guide quality and structure
- LLM Limitations: Performance varies across different language models and may require model-specific tuning
- Complex Dependencies: Some troubleshooting scenarios may have dependencies too complex for DAG representation
- Data Requirements: Query Preparation Plugins require access to relevant data sources and APIs
- Myth: Complete Automation: StepFly assists rather than replaces human expertise in complex incident scenarios
- Myth: Universal Application: Framework works best with well-structured guides rather than ad-hoc troubleshooting
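One concrete case of the "Complex Dependencies" limit is a guide that contains a loop, for example "restart the service, check health, and restart again if unhealthy". A loop is a cycle, so it cannot be expressed directly as an acyclic graph. The sketch below shows a simple pre-check under that assumption. The step names and the `find_cycle` helper are hypothetical, and the check relies on Python's `graphlib`, not on anything specific to StepFly.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(depends_on):
    """Return the steps that form a cycle, or None if the graph is a DAG."""
    try:
        TopologicalSorter(depends_on).prepare()
    except CycleError as err:
        # graphlib reports the cycle as the second element of err.args;
        # the first and last node in that list are the same.
        return err.args[1]
    return None


# Hypothetical guides: one linear, one with a retry loop.
linear_guide = {"check_health": set(), "restart_service": {"check_health"}}
looping_guide = {
    "restart_service": {"check_health"},
    "check_health": {"restart_service"},
}

print(find_cycle(linear_guide))
print(find_cycle(looping_guide))
```

A guide that fails this check would need to be restructured, for instance by unrolling the loop into a bounded number of steps, before it could be scheduled as a DAG.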
FAQ
What is StepFly and how does it work?
StepFly is Microsoft’s AI agent framework that automates troubleshooting guide execution through a three-stage workflow including guide quality improvement, offline preprocessing into DAGs, and online execution with parallel processing capabilities.
What success rate does StepFly achieve?
StepFly achieves a success rate of approximately 94% on GPT-4.1 while outperforming baseline approaches with lower time and token consumption.
How much does StepFly reduce execution time?
For parallelizable troubleshooting guides, StepFly reduces execution time by 32.9% to 70.4% through its DAG-guided parallel execution framework.
Is StepFly open source and free to use?
Yes, StepFly is available as open source software on GitHub with complete code and sample data at no cost.
What makes StepFly different from other AI incident management tools?
StepFly provides specialized support for troubleshooting guide quality management, complex control flow interpretation, parallel execution, and dedicated query preparation plugins.
Who should use StepFly for incident management?
Site reliability engineers, IT operations teams, and organizations with large-scale IT infrastructure benefit most from StepFly’s automated troubleshooting capabilities.
What are the main components of StepFly’s architecture?
StepFly includes TSG Mentor for guide quality improvement, offline preprocessing for DAG extraction, Query Preparation Plugins, and a scheduler-executor framework with a memory system.
How many real-world troubleshooting guides did Microsoft analyze for StepFly?
Microsoft conducted an empirical study on 92 real-world troubleshooting guides to inform StepFly’s design and identify key automation challenges.
Can StepFly handle complex troubleshooting scenarios with dependencies?
StepFly converts troubleshooting guides into directed acyclic graphs to manage step dependencies and enable parallel execution of independent operations.
What language models does StepFly support?
StepFly demonstrates a 94% success rate on GPT-4.1, though specific support for other language models is not yet disclosed in available documentation.
Glossary
- Agentic AI: AI systems that can act independently with complex goal structures, natural language interfaces, and autonomous decision-making capabilities
- DAG (Directed Acyclic Graph): A graph structure with directed edges and no cycles, used by StepFly to represent troubleshooting step dependencies and execution order
- Query Preparation Plugins (QPPs): Specialized components in StepFly that handle data-intensive queries during troubleshooting guide execution
- Site Reliability Engineer (SRE): An IT professional responsible for maintaining system reliability, availability, and performance in large-scale environments
- TSG Mentor: StepFly’s tool that assists site reliability engineers in improving troubleshooting guide quality and structure
- Troubleshooting Guide (TSG): Structured documentation that provides step-by-step procedures for diagnosing and resolving IT system incidents
Sources
1. Neubird AI SRE – Autonomous Incident Resolution. https://neubird.ai/
2. Best practices for building agentic systems. InfoWorld. https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
3. What is Agentic AI? AWS. https://aws.amazon.com/what-is/agentic-ai/
4. AWS Announces General Availability of DevOps Agent for Automated Incident Investigation. InfoQ. https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
5. An MSP’s guide to agentic AI. SuperOps. https://superops.com/blog/an-msps-guide-to-agentic-ai
6. AI agent. Wikipedia. https://en.wikipedia.org/wiki/AI_agent
7. Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering. LangChain. https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
8. Incident response for AI: Same fire, different fuel. Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/