StepFly is Microsoft’s agentic AI framework for automating the execution of IT troubleshooting guides (TSGs) on behalf of site reliability engineers. In Microsoft Research’s evaluation it reached a 94% success rate on GPT-4.1 and cut execution time by 32.9% to 70.4% on parallelizable guides by running independent steps in parallel within a structured workflow.
| Released by | Microsoft Research |
|---|---|
| Release date | |
| What it is | AI agent framework for automating IT troubleshooting guides |
| Who it is for | Site reliability engineers and IT operations teams |
| Where to get it | GitHub repository at microsoft/StepFly |
| Price | Open source |
- StepFly automates troubleshooting guide execution with a 94% success rate on GPT-4.1
- Three-stage workflow includes quality improvement, preprocessing, and parallel execution
- Reduces execution time by 32.9-70.4% for parallelizable troubleshooting guides
- Converts unstructured guides into directed acyclic graphs for systematic execution
- Open source framework available on GitHub with sample data and documentation
- Manual troubleshooting guide execution is slow and error-prone in large-scale IT systems
- StepFly addresses TSG quality issues, complex control flow, and data-intensive queries
- The framework enables parallel execution of independent troubleshooting steps
- Empirical study analyzed 92 real-world troubleshooting guides to inform design
- DAG-guided scheduler-executor framework ensures correct workflow execution
What is StepFly
StepFly is an end-to-end agentic framework that automates troubleshooting guide (TSG) execution for IT incident management. It uses large language models to convert unstructured troubleshooting guides into structured execution workflows. More generally, agentic AI can speed up incident response and support more specific, in-depth post-incident analysis, which helps prevent the same errors from recurring [3]. StepFly targets one problem in particular: manual TSG execution, which is slow and error-prone in large-scale IT environments.
What is new vs previous approaches
StepFly introduces specialized capabilities that existing LLM-based incident management solutions lack.
| Feature | Previous LLM Solutions | StepFly |
|---|---|---|
| TSG Quality Management | Limited support | TSG Mentor tool for quality improvement |
| Control Flow Interpretation | Basic sequential processing | Directed acyclic graph extraction and execution |
| Data-Intensive Queries | Generic handling | Dedicated Query Preparation Plugins |
| Parallel Execution | Not supported | Scheduler-executor framework with memory system |
| Workflow Structure | Ad-hoc processing | Three-stage systematic workflow |
How does StepFly work
StepFly operates through a three-stage workflow that transforms unstructured troubleshooting guides into automated execution systems.
- Quality Improvement Stage: TSG Mentor tool assists site reliability engineers in improving troubleshooting guide quality and completeness
- Offline Preprocessing Stage: LLMs extract structured directed acyclic graphs from unstructured guides and create Query Preparation Plugins for data handling
- Online Execution Stage: DAG-guided scheduler-executor framework with memory system executes workflows and supports parallel processing of independent steps
The system maintains workflow correctness through its memory system while enabling parallel execution of independent troubleshooting steps. AI agents can detect issues like CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [6].
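
The online execution stage described above can be illustrated with a small scheduler-executor loop. This is a minimal sketch, not StepFly's implementation: the step names, the `run_step` stand-in, and the plain dictionary used as "memory" are all assumptions made for illustration. It relies only on Python's standard `graphlib` and `concurrent.futures` modules.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical TSG graph: each step maps to the set of steps it depends on.
TSG_DAG = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_logs": {"check_alert"},
    "diagnose": {"query_cpu", "query_logs"},
}

def run_step(name, deps, memory):
    """Stand-in executor; a real one might call an LLM or run a query."""
    inputs = {dep: memory[dep] for dep in deps}  # read prerequisite outputs
    return f"{name} done (inputs: {sorted(inputs)})"

def execute(dag, max_workers=4):
    """Run each step once its prerequisites finish; independent steps overlap."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    memory = {}       # step outputs, written only by the scheduler thread
    order = []        # completion order, kept for inspection
    running = {}      # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose prerequisites are all complete.
            for step in sorter.get_ready():
                running[pool.submit(run_step, step, dag[step], memory)] = step
            # Block until at least one running step finishes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                step = running.pop(future)
                memory[step] = future.result()
                order.append(step)
                sorter.done(step)  # may unlock dependent steps
    return memory, order

memory, order = execute(TSG_DAG)
print(order)
```

In this sketch `query_cpu` and `query_logs` can run at the same time because neither depends on the other, while `diagnose` waits for both. Writing to the shared store only from the scheduler thread is one simple way to keep results consistent; StepFly's actual memory system may work differently.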
Benchmarks and evidence
Microsoft Research’s evaluation reports the following results for StepFly on real-world troubleshooting scenarios.
| Metric | StepFly Performance | Source |
|---|---|---|
| Success Rate | 94% on GPT-4.1 | Microsoft Research evaluation |
| Execution Time Reduction | 32.9% to 70.4% for parallelizable TSGs | Microsoft Research evaluation |
| Token Consumption | Lower than baseline approaches | Microsoft Research evaluation |
| Real-world TSGs Analyzed | 92 troubleshooting guides | Empirical study foundation |
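
To see why the time reduction applies only to parallelizable TSGs, it helps to compare the sum of all step durations with the length of the longest dependency chain. The sketch below is a worked example under stated assumptions: the step durations are invented for illustration and are not StepFly measurements, and the reduction is assumed to be computed as (sequential - parallel) / sequential, which may differ from the metric used in the evaluation.

```python
from graphlib import TopologicalSorter

# Hypothetical step durations in seconds; not measurements from StepFly.
DURATION = {"check_alert": 10, "query_cpu": 40, "query_logs": 30, "diagnose": 20}
DEPENDS_ON = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_logs": {"check_alert"},
    "diagnose": {"query_cpu", "query_logs"},
}

def sequential_time(duration):
    """Total time when steps run one after another."""
    return sum(duration.values())

def parallel_time(duration, depends_on):
    """Critical-path length: earliest finish time with unlimited workers."""
    finish = {}
    for step in TopologicalSorter(depends_on).static_order():
        # A step starts when its slowest prerequisite has finished.
        start = max((finish[dep] for dep in depends_on[step]), default=0)
        finish[step] = start + duration[step]
    return max(finish.values())

def reduction_percent(sequential, parallel):
    """Relative saving of the parallel schedule over the sequential one."""
    return 100 * (sequential - parallel) / sequential

seq = sequential_time(DURATION)             # 10 + 40 + 30 + 20 = 100
par = parallel_time(DURATION, DEPENDS_ON)   # 10 + 40 + 20 = 70
print(f"{reduction_percent(seq, par):.1f}% reduction")  # 30.0% reduction
```

A guide whose steps form a single chain has a critical path equal to the sequential total, so the same calculation yields a 0% reduction for it.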
Who should care
Builders
Software engineers and DevOps professionals can integrate StepFly’s open-source framework into existing incident management workflows. The system provides APIs and tools for customizing troubleshooting guide automation. According to Anthropic, a provider of large language models (LLMs), AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases, followed by back-office automation, marketing, sales, finance, and data analysis [2].
Enterprise
Large organizations with complex IT infrastructures can reduce incident resolution time and human error through automated troubleshooting. The framework addresses scalability challenges in manual incident management processes. Automation is essential at this scale because manual review cannot keep pace with the volume of incidents in enterprise environments [7].
End users
Site reliability engineers and IT operations teams benefit from reduced manual workload and faster incident resolution. The system maintains human oversight while automating repetitive troubleshooting tasks.
Investors
The incident management automation market represents a significant opportunity as organizations seek to reduce operational costs and improve system reliability through AI-powered solutions.
How to use StepFly today
StepFly is available as an open-source framework with complete implementation and sample data.
- Clone the repository: `git clone https://github.com/microsoft/StepFly`
- Install dependencies according to the provided requirements file
- Prepare troubleshooting guides using the TSG Mentor tool for quality improvement
- Run offline preprocessing to convert guides into directed acyclic graphs
- Configure the scheduler-executor framework for your IT environment
- Deploy the system for automated troubleshooting guide execution
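
Between the preprocessing and deployment steps above, it can be useful to confirm that an extracted graph really is acyclic and refers only to known steps. The following is a minimal sketch of such a check, assuming the preprocessing output can be reduced to a mapping from each step to its prerequisites. The `validate_dag` helper and that mapping format are hypothetical and are not part of StepFly's documented interface.

```python
from graphlib import CycleError, TopologicalSorter

def validate_dag(depends_on):
    """Return a list of problems found in a step -> prerequisites mapping."""
    problems = []
    # Every prerequisite must itself be a declared step.
    for step, deps in depends_on.items():
        for dep in deps:
            if dep not in depends_on:
                problems.append(f"{step} depends on unknown step {dep}")
    # graphlib raises CycleError from prepare() when a cycle exists.
    try:
        TopologicalSorter(depends_on).prepare()
    except CycleError as err:
        cycle = " -> ".join(err.args[1])  # second arg holds the cycle's nodes
        problems.append(f"cycle detected: {cycle}")
    return problems

# Hypothetical graphs for illustration.
print(validate_dag({"check_alert": set(), "diagnose": {"check_alert"}}))
print(validate_dag({"a": {"b"}, "b": {"a"}}))
```

An empty list means the graph passed both checks. A non-empty list suggests the guide should go back through quality improvement before execution.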
StepFly vs competitors
StepFly competes with other AI-powered incident management and troubleshooting automation solutions.
| Feature | StepFly | Traditional LLM Solutions | Manual TSG Execution |
|---|---|---|---|
| Success Rate | 94% on GPT-4.1 | Not yet disclosed | Variable, error-prone |
| Parallel Execution | Supported with 32.9-70.4% time reduction | Not supported | Not supported |
| Quality Management | TSG Mentor tool included | Limited | Manual review |
| Structured Workflow | DAG-based execution | Sequential processing | Ad-hoc execution |
| Open Source | Yes | Varies | N/A |
Risks, limits, and myths
- System performance depends on troubleshooting guide quality and completeness
- Complex IT environments may require extensive customization and configuration
- LLM accuracy limitations can affect automated decision-making in critical incidents
- Parallel execution benefits only apply to troubleshooting guides with independent steps
- Human oversight remains necessary for high-stakes incident resolution scenarios
- Integration complexity may require significant engineering resources for deployment
FAQ
What is StepFly and how does it automate troubleshooting?
StepFly is Microsoft’s agentic AI framework that converts unstructured IT troubleshooting guides into automated workflows, achieving 94% success rates through structured execution and parallel processing capabilities.
How much faster is StepFly compared to manual troubleshooting?
StepFly reduces execution time by 32.9% to 70.4% for parallelizable troubleshooting guides while consuming fewer tokens than baseline approaches according to Microsoft’s evaluation.
What makes StepFly different from other AI incident management tools?
StepFly provides specialized support for TSG quality management, complex control flow interpretation, data-intensive queries, and parallel execution that existing LLM-based solutions lack.
Is StepFly available for commercial use?
StepFly is open source and available on GitHub at microsoft/StepFly with complete implementation code and sample data for commercial and research use.
What are the three stages of StepFly’s workflow?
StepFly operates through quality improvement using TSG Mentor, offline preprocessing with DAG extraction, and online execution with parallel processing capabilities.
How does StepFly handle complex troubleshooting guide structures?
StepFly extracts directed acyclic graphs from unstructured guides and uses a DAG-guided scheduler-executor framework with memory system to ensure correct workflow execution.
What types of IT environments can benefit from StepFly?
Large-scale IT systems with complex troubleshooting procedures benefit most from StepFly’s automation, particularly environments with parallelizable troubleshooting steps and quality management needs.
Does StepFly require human oversight for incident management?
StepFly automates troubleshooting guide execution but maintains integration points for human oversight, particularly for high-stakes incidents and quality improvement processes through TSG Mentor.
Which LLMs does StepFly support?
StepFly achieved a 94% success rate on GPT-4.1 according to Microsoft’s evaluation. Support for other models is not yet disclosed.
How does StepFly improve troubleshooting guide quality?
StepFly includes TSG Mentor, a dedicated tool that assists site reliability engineers in improving troubleshooting guide quality and completeness before automated execution.
Glossary
- Agentic AI: AI systems that can act independently with complex goal structures, natural language interfaces, and integration of software tools or planning systems
- DAG (Directed Acyclic Graph): A structured representation of workflows where tasks have dependencies but no circular references, enabling parallel execution of independent steps
- TSG (Troubleshooting Guide): Structured documentation that provides step-by-step procedures for diagnosing and resolving IT incidents and system issues
- SRE (Site Reliability Engineer): IT professionals responsible for maintaining system reliability, performance, and incident response in large-scale technology environments
- Query Preparation Plugins (QPPs): Specialized components in StepFly that handle data-intensive queries during troubleshooting guide execution
- TSG Mentor: StepFly’s tool for assisting site reliability engineers in improving troubleshooting guide quality and completeness
Sources
- [1] Neubird AI SRE – Autonomous Incident Resolution
- [2] InfoWorld – Best practices for building agentic systems
- [3] AWS – What is Agentic AI?
- [4] InfoQ – AWS Announces General Availability of DevOps Agent
- [5] Wikipedia – AI agent
- [6] SuperOps – An MSP’s guide to agentic AI
- [7] Microsoft Security Blog – Incident response for AI
- [8] Automation Anywhere – What is Agentic AI?