StepFly is an AI agent framework that automates the execution of IT troubleshooting guides for incident management. It reports a success rate of about 94% on GPT-4.1 and cuts execution time by 32.9% to 70.4% for parallelizable guides through structured workflow automation and parallel step execution.
| Released by | Microsoft Research |
|---|---|
| Release date | |
| What it is | AI agent framework for automating IT troubleshooting guides |
| Who it is for | Site reliability engineers and IT operations teams |
| Where to get it | GitHub open source repository |
| Price | Free |
- StepFly automates manual troubleshooting guides that are typically slow and error-prone for IT incidents
- The framework uses a three-stage workflow with guide quality improvement, offline preprocessing, and online execution
- It achieves a success rate of about 94% on GPT-4.1 while consuming fewer tokens than baseline approaches
- Parallel execution reduces troubleshooting time by 32.9% to 70.4% for parallelizable guides
- The system is open-sourced on GitHub with sample data for implementation
- StepFly addresses four key challenges in automated incident management: TSG quality issues, complex control flow interpretation, data-intensive queries, and execution parallelism
- The framework was developed based on empirical analysis of 92 real-world troubleshooting guides
- It features the TSG Mentor tool, which helps site reliability engineers improve guide quality before automation
- The system extracts structured execution DAGs from unstructured troubleshooting guides using LLMs
- Query Preparation Plugins handle data-intensive operations while maintaining workflow integrity
What is StepFly
StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. The system addresses the manual, slow, and error-prone nature of traditional troubleshooting guide execution in large-scale IT systems.
Traditional troubleshooting guides require manual execution by site reliability engineers, leading to delays and human errors during critical incidents. Agentic AI enhances incident response speed while also providing a more specific and in-depth post-incident analysis to prevent the same errors from recurring in the future [3].
The framework leverages large language models to interpret unstructured troubleshooting documentation and convert it into automated workflows. AI agents possess several key attributes, including complex goal structures, natural language interfaces, the capacity to act independently of user supervision, and the integration of software tools or planning systems [6].
What is new vs previous approaches
StepFly introduces specialized capabilities that existing LLM-based incident management solutions lack.
| Feature | Previous LLM Solutions | StepFly |
|---|---|---|
| TSG Quality Management | No specialized support | TSG Mentor tool for quality improvement |
| Control Flow Interpretation | Limited complex workflow handling | Structured DAG extraction from unstructured guides |
| Data-Intensive Queries | Basic query processing | Dedicated Query Preparation Plugins (QPPs) |
| Execution Parallelism | Sequential execution only | DAG-guided scheduler with parallel step execution |
| Memory System | Limited workflow state tracking | Comprehensive memory system for workflow integrity |
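To make the "Structured DAG extraction" row more concrete, here is a minimal sketch of what a guide might look like once its steps and branches are structured. It is not StepFly's actual data model: the `Step` class, its fields, the outcome labels, and the helper functions are all hypothetical. The sketch only shows two ideas from the table: branches depend on a step's outcome, and the resulting graph must be free of loops.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Step:
    """One troubleshooting step. All field names are hypothetical."""
    step_id: str
    description: str
    # Maps an outcome label (e.g. "high") to the IDs of the steps it unlocks.
    branches: dict[str, list[str]] = field(default_factory=dict)


def validate_acyclic(steps: dict[str, Step]) -> list[str]:
    """Return one valid execution order; raise ValueError on a loop or unknown target."""
    predecessors: dict[str, set[str]] = {sid: set() for sid in steps}
    for step in steps.values():
        for targets in step.branches.values():
            for target in targets:
                if target not in steps:
                    raise ValueError(f"{step.step_id} points to unknown step {target}")
                predecessors[target].add(step.step_id)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        # The second element of args holds the detected cycle.
        raise ValueError(f"guide contains a loop: {exc.args[1]}") from exc


def next_steps(step: Step, outcome: str) -> list[str]:
    """Follow only the branch that matches the observed outcome."""
    return step.branches.get(outcome, [])


# A made-up four-step guide with one conditional branch.
guide = {
    "check_cpu": Step("check_cpu", "Check CPU usage",
                      {"high": ["find_process"], "normal": ["check_disk"]}),
    "find_process": Step("find_process", "Find the busiest process", {"done": ["summarize"]}),
    "check_disk": Step("check_disk", "Check disk usage", {"done": ["summarize"]}),
    "summarize": Step("summarize", "Write the incident summary"),
}
order = validate_acyclic(guide)
```

Validating the graph before execution means a malformed guide fails early, during preprocessing, rather than in the middle of an incident.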
How does StepFly work
StepFly operates through a three-stage workflow that transforms manual troubleshooting guides into automated execution systems.
- Guide Quality Improvement Stage: The TSG Mentor tool helps site reliability engineers identify and fix quality issues in existing troubleshooting guides before automation begins.
- Offline Preprocessing Stage: Large language models extract structured execution directed acyclic graphs (DAGs) from unstructured troubleshooting guides and create dedicated Query Preparation Plugins for data-intensive operations.
- Online Execution Stage: A DAG-guided scheduler-executor framework with a memory system ensures correct workflow execution and supports parallel processing of independent troubleshooting steps.
The system maintains workflow integrity through its memory system while enabling parallel execution of independent troubleshooting steps. Agentic engineering operates at a higher level of abstraction: it’s a control plane that orchestrates cross-team workflows, maintains long-term memory across agents, and manages state and traceability across the full software delivery lifecycle [7].
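The online execution stage can be illustrated with a small scheduler that starts each step as soon as its dependencies have finished and records results in a shared memory. This is a sketch under stated assumptions, not StepFly's implementation: `run_dag`, the step functions, and the dictionary-based memory are hypothetical, and Python's standard `graphlib` and `concurrent.futures` modules stand in for the real scheduler-executor.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter
from typing import Callable

# A step receives a snapshot of the memory (results of finished steps)
# and returns its own result.
StepFn = Callable[[dict[str, object]], object]


def run_dag(
    steps: dict[str, StepFn],
    depends_on: dict[str, set[str]],
    max_workers: int = 4,
) -> dict[str, object]:
    """Run every step once, starting each as soon as all its dependencies are done."""
    sorter = TopologicalSorter({name: depends_on.get(name, set()) for name in steps})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a loop
    memory: dict[str, object] = {}
    running = {}  # Future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            for name in sorter.get_ready():
                # Independent steps become ready together and run in parallel.
                running[pool.submit(steps[name], dict(memory))] = name
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in finished:
                name = running.pop(future)
                memory[name] = future.result()  # re-raises if the step failed
                sorter.done(name)  # may unlock dependent steps
    return memory


# Hypothetical steps: two independent queries feed one diagnosis.
def check_alert(mem): return "cpu_alert"
def query_cpu(mem): return 97
def query_disk(mem): return 40
def diagnose(mem): return "cpu" if mem["query_cpu"] > mem["query_disk"] else "disk"


results = run_dag(
    steps={"check_alert": check_alert, "query_cpu": query_cpu,
           "query_disk": query_disk, "diagnose": diagnose},
    depends_on={"query_cpu": {"check_alert"}, "query_disk": {"check_alert"},
                "diagnose": {"query_cpu", "query_disk"}},
)
```

In this sketch only the scheduling thread writes to the memory, and each step gets its own copy, so parallel steps cannot overwrite each other's results. A step always sees the outputs of its dependencies because it is submitted only after they are marked done.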
Benchmarks and evidence
In the reported evaluation, StepFly outperforms baseline approaches on success rate and token consumption, and shortens execution time for guides whose steps can run in parallel.
| Metric | StepFly Performance | Source |
|---|---|---|
| Success Rate on GPT-4.1 | ~94% | Microsoft Research evaluation |
| Execution Time Reduction | 32.9% to 70.4% for parallelizable TSGs | Microsoft Research evaluation |
| Token Consumption | Lower than baseline approaches | Microsoft Research evaluation |
| Real-world TSG Analysis | 92 troubleshooting guides studied | Microsoft Research empirical study |
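The size of the execution-time reduction depends on how much of a guide can run in parallel. As a rough illustration (not the method used in the evaluation), the saving can be estimated by comparing the sum of all step durations with the longest dependency chain. The function name, step names, and durations below are made up, and the sketch assumes enough workers to run every ready step at once.

```python
from graphlib import TopologicalSorter


def time_reduction(durations: dict[str, float], depends_on: dict[str, set[str]]) -> float:
    """Fraction of time saved by running independent steps in parallel."""
    graph = {name: depends_on.get(name, set()) for name in durations}
    finish: dict[str, float] = {}
    for name in TopologicalSorter(graph).static_order():
        # A step starts when its slowest dependency has finished.
        start = max((finish[dep] for dep in graph[name]), default=0.0)
        finish[name] = start + durations[name]
    sequential = sum(durations.values())  # one step after another
    parallel = max(finish.values())       # length of the longest chain
    return 1 - parallel / sequential


# Hypothetical guide: two 30-second queries can run side by side.
saved = time_reduction(
    durations={"triage": 10, "query_a": 30, "query_b": 30, "report": 10},
    depends_on={"query_a": {"triage"}, "query_b": {"triage"},
                "report": {"query_a", "query_b"}},
)
print(f"{saved:.1%}")  # prints 37.5%
```

A guide that is a single chain of steps yields a reduction of zero under this estimate, which is consistent with the reported gains applying only to parallelizable guides.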
Who should care
Builders
Software engineers and DevOps professionals can integrate StepFly’s open-source framework into existing incident management workflows. According to Anthropic, a provider of large language models (LLMs), AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases, followed by back-office automation, marketing, sales, finance, and data analysis [2].
Enterprise
Large-scale IT organizations can reduce incident response times and human errors through automated troubleshooting guide execution. AI agents can detect issues like CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [5].
End users
Site reliability engineers and IT operations teams benefit from reduced manual workload and faster incident resolution times during critical system outages.
Investors
The framework represents Microsoft’s investment in agentic AI for enterprise operations, potentially reducing operational costs and improving system reliability across cloud infrastructure.
How to use StepFly today
StepFly is available as an open-source framework with implementation guidance and sample data.
- Access the repository: Visit https://github.com/microsoft/StepFly to download the framework code and documentation.
- Review sample data: Examine provided troubleshooting guide examples to understand the expected input format and structure.
- Install dependencies: Set up required Python packages and large language model access according to the repository documentation.
- Prepare troubleshooting guides: Use the TSG Mentor tool to improve the quality of existing guides before automating them.
- Configure execution environment: Set up the DAG-guided scheduler-executor framework with appropriate memory system configuration.
- Test with sample incidents: Run the framework against provided sample incidents to validate installation and configuration.
StepFly vs competitors
StepFly competes with other AI-powered incident management and troubleshooting automation solutions.
| Feature | StepFly | AWS DevOps Agent | Neubird AI SRE |
|---|---|---|---|
| Open Source | Yes | No | No |
| Parallel Execution | Yes, DAG-guided | Not yet disclosed | Not yet disclosed |
| Success Rate | ~94% on GPT-4.1 | Not yet disclosed | Not yet disclosed |
| Guide Quality Tools | TSG Mentor included | Not yet disclosed | Not yet disclosed |
| Execution Time Reduction | 32.9-70.4% | Not yet disclosed | Not yet disclosed |
Risks, limits, and myths
- Quality dependency: StepFly’s effectiveness depends on the quality of input troubleshooting guides, requiring initial manual review and improvement.
- LLM limitations: The framework inherits potential biases and errors from underlying large language models used for guide interpretation.
- Complex incident handling: Highly complex or novel incidents may require human intervention beyond automated troubleshooting capabilities.
- Integration complexity: Organizations need existing monitoring and telemetry systems to provide data for automated troubleshooting execution.
- Myth – Complete automation: StepFly augments rather than replaces human site reliability engineers, requiring oversight for critical incidents.
- Myth – Universal applicability: Not all troubleshooting guides are suitable for automation, particularly those requiring subjective judgment or manual hardware intervention.
FAQ
- What is StepFly and how does it work for IT incidents?
- StepFly is an AI agent framework that automates troubleshooting guide execution for IT incidents using a three-stage workflow with guide improvement, preprocessing, and parallel execution capabilities.
- How much faster is StepFly compared to manual troubleshooting?
- StepFly reduces execution time by 32.9% to 70.4% for parallelizable troubleshooting guides while achieving a success rate of about 94% on GPT-4.1.
- Is StepFly open source and free to use?
- Yes, StepFly is available as an open-source framework on GitHub with sample data and implementation documentation at no cost.
- What makes StepFly different from other AI incident management tools?
- StepFly provides specialized support for troubleshooting guide quality management, complex control flow interpretation, data-intensive queries, and parallel execution that other solutions lack.
- Who developed StepFly and when was it released?
- Microsoft Research developed StepFly and released it as an open-source framework on GitHub. The release date is not specified.
- What are the main components of StepFly’s architecture?
- StepFly includes TSG Mentor for guide quality improvement, DAG extraction for workflow structuring, Query Preparation Plugins for data operations, and a scheduler-executor with a memory system.
- Can StepFly handle all types of IT troubleshooting scenarios?
- StepFly works best with structured troubleshooting guides but may require human intervention for highly complex incidents or those requiring subjective judgment.
- How does StepFly ensure troubleshooting workflow accuracy?
- StepFly uses a comprehensive memory system and DAG-guided execution to maintain workflow integrity while supporting parallel processing of independent troubleshooting steps.
- What prerequisites are needed to implement StepFly?
- Organizations need existing troubleshooting guides, monitoring systems that supply data, a Python environment, and access to a large language model.
- How was StepFly’s performance validated?
- Microsoft Research based the design on an empirical study of 92 real-world troubleshooting guides and evaluated the framework against baseline approaches, reporting a success rate of about 94% on GPT-4.1 with lower token consumption.
Glossary
- Agentic AI
- Artificial intelligence systems that can act autonomously to achieve goals without constant human supervision, using natural language interfaces and integrated tools.
- DAG (Directed Acyclic Graph)
- A structured representation of workflow steps and dependencies that prevents circular execution loops while enabling parallel processing of independent tasks.
- Query Preparation Plugins (QPPs)
- Specialized components in StepFly that handle data-intensive operations and queries during troubleshooting guide execution.
- Site Reliability Engineer (SRE)
- IT professionals responsible for maintaining system reliability, availability, and performance through monitoring, incident response, and automation practices.
- TSG (Troubleshooting Guide)
- Documented procedures that provide step-by-step instructions for diagnosing and resolving specific IT system issues or incidents.
- TSG Mentor
- A tool within StepFly that assists site reliability engineers in identifying and improving quality issues in troubleshooting guides before automation.
Sources
- [1] Neubird AI. “AI SRE – Autonomous Incident Resolution.” https://neubird.ai/
- [2] InfoWorld. “Best practices for building agentic systems.” https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
- [3] AWS. “What is Agentic AI? – Agentic AI Explained.” https://aws.amazon.com/what-is/agentic-ai/
- [4] InfoQ. “AWS Announces General Availability of DevOps Agent for Automated Incident Investigation.” https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
- [5] SuperOps. “An MSP’s guide to agentic AI.” https://superops.com/blog/an-msps-guide-to-agentic-ai
- [6] Wikipedia. “AI agent.” https://en.wikipedia.org/wiki/AI_agent
- [7] LangChain. “Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering.” https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
- [8] Microsoft Security Blog. “Incident response for AI: Same fire, different fuel.” https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/