StepFly: AI Agent Automates IT Troubleshooting Guides

StepFly is Microsoft’s AI agent framework that automates troubleshooting guide execution for IT incident management. In Microsoft’s evaluation, the system achieves a 94% success rate on GPT-4.1 and cuts execution time by 32.9% to 70.4% on parallelizable guides by running independent steps in parallel.

Released by	Microsoft Research
Release date	April 22, 2026
What it is	AI agent framework for automating IT troubleshooting guides
Who it is for	Site reliability engineers and IT operations teams
Where to get it	GitHub open source
Price	Free

StepFly automates troubleshooting guide execution with a 94% success rate on GPT-4.1
Three-stage workflow includes guide quality improvement, offline preprocessing, and online execution
Parallel execution reduces troubleshooting time by 32.9% to 70.4% for compatible guides
Framework converts unstructured guides into directed acyclic graphs for systematic execution
Open source code and sample data available on GitHub

What is StepFly
What is new vs previous approaches
How does StepFly work
Benchmarks and evidence
Who should care
How to use StepFly today
StepFly vs competitors
Risks, limits, and myths

Manual troubleshooting guide execution is slow and error-prone in large-scale IT systems
StepFly addresses key challenges including guide quality, complex control flow, and data-intensive queries
The framework enables parallel execution of independent troubleshooting steps
Microsoft’s empirical study analyzed 92 real-world troubleshooting guides to inform design
A DAG-guided scheduler-executor framework with a memory system ensures correct workflow execution

What is StepFly

StepFly is an end-to-end agentic framework that automates troubleshooting guide (TSG) execution for IT incident management. The system converts unstructured troubleshooting guides into structured workflows that AI agents can execute automatically. AI agents possess complex goal structures, natural language interfaces, and the capacity to act independently of user supervision [6]. StepFly specifically targets three challenges in large-scale IT environments: managing troubleshooting guide quality, interpreting complex control flow, and handling data-intensive queries.

What is new vs previous approaches

StepFly introduces specialized capabilities that existing LLM-based incident management solutions lack.

Feature	Previous LLM Solutions	StepFly
Guide Quality Management	Limited support	TSG Mentor tool for quality improvement
Control Flow Interpretation	Basic sequential processing	Structured DAG extraction and execution
Parallel Execution	Not supported	Independent step parallelization
Data Query Handling	Generic approach	Dedicated Query Preparation Plugins
Memory System	Limited context retention	Comprehensive workflow state management

How does StepFly work

StepFly operates through a three-stage workflow that transforms manual troubleshooting into automated execution.

Guide Quality Enhancement: TSG Mentor assists site reliability engineers in improving troubleshooting guide quality and structure
Offline Preprocessing: LLMs extract a structured execution graph, a directed acyclic graph (DAG), from each unstructured guide and create Query Preparation Plugins (QPPs)
Online Execution: A DAG-guided scheduler-executor framework with a memory system enforces the correct workflow order and runs independent steps in parallel
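The online execution stage can be pictured with a minimal sketch. This is not StepFly’s actual implementation: the step names, the `run_step` stand-in, and the dictionary used as "memory" are all hypothetical, and a real executor would call an LLM or a data source where this code only builds a string. It shows the general scheduler-executor pattern, in which a scheduler releases steps once their dependencies finish and an executor runs the released steps concurrently.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical guide: each step maps to the set of steps it depends on.
DEPENDENCIES = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_disk": {"check_alert"},
    "summarize": {"query_cpu", "query_disk"},
}


def run_step(name, memory):
    """Stand-in for a real step; reads results of earlier steps from memory."""
    inputs = {dep: memory[dep] for dep in DEPENDENCIES[name]}
    return f"{name} done (inputs: {sorted(inputs)})"


def execute(dependencies, max_workers=4):
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    memory = {}       # shared workflow state: step name -> result
    running = {}      # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Scheduler: submit every step whose dependencies are finished.
            for name in sorter.get_ready():
                running[pool.submit(run_step, name, memory)] = name
            # Executor: block until at least one running step completes.
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in finished:
                name = running.pop(future)
                memory[name] = future.result()
                sorter.done(name)  # unlocks steps that depend on this one
    return memory


results = execute(DEPENDENCIES)
for step, outcome in results.items():
    print(step, "->", outcome)
```

In this sketch `query_cpu` and `query_disk` are submitted together because neither depends on the other, while `summarize` waits until both results are in memory.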

The framework converts troubleshooting guides into directed acyclic graphs that represent step dependencies and execution order. Agentic engineering operates at a higher level of abstraction as a control plane that orchestrates cross-team workflows and maintains long-term memory across agents [7].
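As an illustration of what a DAG of step dependencies buys you, the sketch below groups the steps of a made-up guide into "waves" that could run in parallel, and rejects a graph that contains a cycle. The guide contents and the function names `parallel_waves` and `is_valid_dag` are assumptions for this example and are not taken from the StepFly code base; only Python’s standard `graphlib` module is used.

```python
from graphlib import CycleError, TopologicalSorter


def parallel_waves(dependencies):
    """Group steps into waves; steps in the same wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


def is_valid_dag(dependencies):
    """Return False when the dependency graph contains a cycle."""
    try:
        TopologicalSorter(dependencies).prepare()
        return True
    except CycleError:
        return False


# Hypothetical guide: step -> set of steps that must finish first.
guide = {
    "confirm_incident": set(),
    "check_deployments": {"confirm_incident"},
    "check_error_logs": {"confirm_incident"},
    "decide_mitigation": {"check_deployments", "check_error_logs"},
}

waves = parallel_waves(guide)
print(waves)

# A guide whose steps wait on each other cannot be expressed as a DAG.
looping_guide = {"restart_service": {"verify_health"}, "verify_health": {"restart_service"}}
print(is_valid_dag(guide), is_valid_dag(looping_guide))
```

The two log and deployment checks land in the same wave, which is the kind of independence a parallel executor can exploit.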

Benchmarks and evidence

Microsoft’s evaluation demonstrates StepFly’s effectiveness across multiple performance metrics.

Metric	Result	Context
Success Rate on GPT-4.1	94%	Microsoft Research evaluation
Execution Time Reduction	32.9% to 70.4%	Parallelizable TSG performance
Real-world TSGs Analyzed	92 guides	Empirical study foundation
Token Consumption	Lower than baselines	Comparative evaluation
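To see why the time reduction depends on how parallelizable a guide is, consider a small worked example. The step durations below are invented for illustration and are not StepFly measurements, and the source does not state how its reduction figures were computed. The sketch assumes the usual definition, reduction = 1 - (parallel time / sequential time), and an idealized executor with no scheduling overhead, so the parallel time equals the longest dependency chain.

```python
from graphlib import TopologicalSorter

# Illustrative step durations in seconds (made-up numbers).
durations = {"confirm": 10, "logs": 30, "metrics": 20, "decide": 10}
# step -> set of steps that must finish first
deps = {
    "confirm": set(),
    "logs": {"confirm"},
    "metrics": {"confirm"},
    "decide": {"logs", "metrics"},
}


def critical_path_seconds(deps, durations):
    """Earliest possible finish time when independent steps run in parallel."""
    finish = {}
    for step in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps[step]), default=0)
        finish[step] = start + durations[step]
    return max(finish.values())


sequential = sum(durations.values())              # 10 + 30 + 20 + 10 = 70
parallel = critical_path_seconds(deps, durations)  # 10 + 30 + 10 = 50
reduction = 100 * (1 - parallel / sequential)

print(f"sequential={sequential}s parallel={parallel}s reduction={reduction:.1f}%")
# prints: sequential=70s parallel=50s reduction=28.6%
```

A guide with more independent branches, or with branches of similar length, would show a larger reduction; a strictly linear guide would show none.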

Who should care

Builders

Software engineers building incident management systems can leverage StepFly’s open source framework for automated troubleshooting. AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases [2]. The DAG-based execution model provides a foundation for building reliable automation systems.

Enterprise

Large organizations with complex IT infrastructure benefit from StepFly’s ability to standardize and accelerate incident response. Agentic AI enhances incident response speed while providing more specific and in-depth post-incident analysis [3]. The framework reduces dependency on manual troubleshooting expertise.

End users

Site reliability engineers and IT operations teams gain tools for improving troubleshooting guide quality and execution efficiency. The TSG Mentor component specifically addresses guide quality issues that impact automation success rates.

Investors

The incident management automation market represents a significant opportunity as organizations seek to reduce operational costs and improve system reliability. StepFly’s open source availability accelerates adoption and ecosystem development.

How to use StepFly today

StepFly is available as open source software with complete implementation and sample data.

Clone the repository: git clone https://github.com/microsoft/StepFly
Install dependencies according to the project requirements
Prepare troubleshooting guides using the TSG Mentor tool for quality improvement
Run offline preprocessing to convert guides into structured DAGs
Configure the scheduler-executor framework for your environment
Execute troubleshooting workflows through the agentic system

StepFly vs competitors

StepFly competes with other AI-powered incident management solutions in the market.

Feature	StepFly	AWS DevOps Agent	Neubird AI SRE
Open Source	Yes	No	No
Parallel Execution	Yes	Not disclosed	Not disclosed
Success Rate	94% on GPT-4.1	Not disclosed	Not disclosed
Guide Quality Tools	TSG Mentor	Not disclosed	Not disclosed
Structured Workflows	DAG-based	Not disclosed	Telemetry analysis focus

Risks, limits, and myths

Quality Dependency: Success rates depend heavily on troubleshooting guide quality and structure
LLM Limitations: Performance varies across different language models and may require model-specific tuning
Complex Dependencies: Some troubleshooting scenarios, such as procedures that loop back to an earlier step, may involve dependencies that a directed acyclic graph cannot represent directly
Data Requirements: Query Preparation Plugins require access to relevant data sources and APIs
Myth: Complete Automation: StepFly assists rather than replaces human expertise in complex incident scenarios
Myth: Universal Application: Framework works best with well-structured guides rather than ad-hoc troubleshooting

FAQ

What is StepFly and how does it work?

StepFly is Microsoft’s AI agent framework that automates troubleshooting guide execution through a three-stage workflow including guide quality improvement, offline preprocessing into DAGs, and online execution with parallel processing capabilities.

What success rate does StepFly achieve?

StepFly achieves a success rate of approximately 94% on GPT-4.1 and outperforms baseline approaches while using less time and fewer tokens.

How much faster is StepFly compared to manual troubleshooting?

For parallelizable troubleshooting guides, StepFly’s DAG-guided parallel execution reduces execution time by 32.9% to 70.4%. The figure applies only to guides that contain independent steps that can run in parallel.

Is StepFly open source and free to use?

Yes, StepFly is available as open source software on GitHub with complete code and sample data at no cost.

What makes StepFly different from other AI incident management tools?

StepFly provides specialized support for troubleshooting guide quality management, complex control flow interpretation, parallel execution, and dedicated query preparation plugins.

Who should use StepFly for incident management?

Site reliability engineers, IT operations teams, and organizations with large-scale IT infrastructure benefit most from StepFly’s automated troubleshooting capabilities.

What are the main components of StepFly’s architecture?

StepFly includes TSG Mentor for guide quality improvement, offline preprocessing for DAG extraction, Query Preparation Plugins, and a scheduler-executor framework with a memory system.

How many real-world troubleshooting guides did Microsoft analyze for StepFly?

Microsoft conducted an empirical study on 92 real-world troubleshooting guides to inform StepFly’s design and identify key automation challenges.

Can StepFly handle complex troubleshooting scenarios with dependencies?

StepFly converts troubleshooting guides into directed acyclic graphs to manage step dependencies and enable parallel execution of independent operations.

What language models does StepFly support?

StepFly demonstrates a 94% success rate on GPT-4.1, though specific support for other language models is not yet disclosed in available documentation.

Glossary

Agentic AI: AI systems that can act independently with complex goal structures, natural language interfaces, and autonomous decision-making capabilities
DAG (Directed Acyclic Graph): A graph structure with directed edges and no cycles, used by StepFly to represent troubleshooting step dependencies and execution order
Query Preparation Plugins (QPPs): Specialized components in StepFly that handle data-intensive queries during troubleshooting guide execution
Site Reliability Engineer (SRE): An IT professional responsible for maintaining system reliability, availability, and performance in large-scale environments
TSG Mentor: StepFly’s tool that assists site reliability engineers in improving troubleshooting guide quality and structure
Troubleshooting Guide (TSG): Structured documentation that provides step-by-step procedures for diagnosing and resolving IT system incidents

Visit the StepFly GitHub repository at https://github.com/microsoft/StepFly to access the open source code and begin implementing automated troubleshooting in your environment.

Sources

[1] Neubird AI SRE – Autonomous Incident Resolution. https://neubird.ai/
[2] Best practices for building agentic systems. InfoWorld. https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
[3] What is Agentic AI? AWS. https://aws.amazon.com/what-is/agentic-ai/
[4] AWS Announces General Availability of DevOps Agent for Automated Incident Investigation. InfoQ. https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
[5] An MSP’s guide to agentic AI. SuperOps. https://superops.com/blog/an-msps-guide-to-agentic-ai
[6] AI agent. Wikipedia. https://en.wikipedia.org/wiki/AI_agent
[7] Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering. LangChain. https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
[8] Incident response for AI: Same fire, different fuel. Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/

Author

Siegfried Kamgo

Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.