StepFly is an AI-powered framework that automates the execution of troubleshooting guides (TSGs) for IT incident management. It reports a 94% success rate with GPT-4.1 and reduces execution time by 32.9% to 70.4% on parallelizable guides through parallel processing and structured workflow automation.
| Field | Details |
|---|---|
| Released by | Microsoft Research |
| Release date | |
| What it is | AI-powered framework for automating IT troubleshooting guides |
| Who it is for | Site reliability engineers and IT operations teams |
| Where to get it | GitHub repository at microsoft/StepFly |
| Price | Open source |
- StepFly automates manual troubleshooting guide execution using a three-stage agentic AI workflow
- The framework achieves a 94% success rate with GPT-4.1, outperforming baseline approaches
- Parallel execution capabilities reduce troubleshooting time by 32.9% to 70.4% for compatible guides
- TSG Mentor tool helps site reliability engineers improve troubleshooting guide quality
- System processes unstructured guides into structured execution graphs for automated workflow management
- Manual troubleshooting guide execution is slow and error-prone in large-scale IT systems
- StepFly addresses key challenges including TSG quality issues, complex control flow, and data-intensive queries
- The framework uses directed acyclic graphs to structure and parallelize troubleshooting workflows
- An empirical study of 92 real-world troubleshooting guides informed the design, and evaluation on real-world guides and incidents shows significant performance improvements
- Open source availability enables widespread adoption across IT operations teams
What is StepFly
StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. The system transforms manual, error-prone troubleshooting processes into automated workflows using large language models and structured execution graphs. More generally, AI agents in IT operations can detect issues such as CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [5].
The framework addresses critical challenges in incident management where manual execution of troubleshooting guides creates bottlenecks and introduces human error. StepFly leverages agentic AI principles to provide autonomous decision-making capabilities while maintaining oversight and control for site reliability engineers.
What is new vs previous approaches
StepFly introduces specialized support for troubleshooting guide automation that existing LLM-based solutions lack.
| Feature | Previous LLM Solutions | StepFly |
|---|---|---|
| TSG Quality Management | No specialized support | TSG Mentor tool for guide improvement |
| Control Flow Interpretation | Limited complex workflow handling | Structured DAG extraction and execution |
| Data-Intensive Queries | Basic query processing | Dedicated Query Preparation Plugins |
| Parallel Execution | Sequential processing only | DAG-guided parallel step execution |
| Memory System | Limited workflow state tracking | Comprehensive memory for workflow continuity |
How does StepFly work
StepFly operates through a three-stage workflow that transforms unstructured troubleshooting guides into automated execution systems.
- Guide Quality Enhancement: TSG Mentor assists site reliability engineers in improving troubleshooting guide quality and structure before automation
- Offline Preprocessing: LLMs extract structured execution directed acyclic graphs from unstructured guides and create Query Preparation Plugins for data operations
- Online Execution: DAG-guided scheduler-executor framework with memory system ensures correct workflow execution and supports parallel processing of independent steps
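
The online execution stage can be illustrated with a minimal sketch. This is not StepFly's implementation: the step names, the `run_step` placeholder, and the plain dictionary standing in for the memory system are all assumptions made for illustration. It only shows the general idea of a scheduler that releases a step once its predecessors finish, so independent steps can run at the same time.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical TSG: each step maps to the set of steps it depends on.
TSG_DAG = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_logs": {"check_alert"},
    "diagnose": {"query_cpu", "query_logs"},
}


def run_step(name, memory):
    """Placeholder step body; a real step might call an LLM or a data source."""
    inputs = sorted(key for key in memory if key in TSG_DAG[name])
    return f"{name} done (inputs: {inputs})"


def execute(dag, max_workers=4):
    """Run each step once its predecessors finish; independent steps overlap."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    memory = {}       # step outputs so far (stand-in for a memory system)
    order = []        # completion order, kept for traceability
    pending = {}      # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose dependencies are all complete.
            for step in sorter.get_ready():
                pending[pool.submit(run_step, step, dict(memory))] = step
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                step = pending.pop(future)
                memory[step] = future.result()
                order.append(step)
                sorter.done(step)  # may unlock downstream steps
    return order, memory


order, memory = execute(TSG_DAG)
print(order)
```

In this sketch `query_cpu` and `query_logs` are submitted together because neither depends on the other, while `diagnose` waits for both. The relative order of the two middle steps can vary between runs.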
More broadly, agentic AI can speed up incident response and support more specific, in-depth post-incident analysis that helps prevent the same errors from recurring [3]. StepFly maintains state and traceability across the full troubleshooting lifecycle.
Benchmarks and evidence
StepFly demonstrates significant performance improvements across multiple metrics in empirical evaluations.
| Metric | StepFly Performance | Source |
|---|---|---|
| Success Rate on GPT-4.1 | 94% | Microsoft Research evaluation |
| Execution Time Reduction | 32.9% to 70.4% for parallelizable TSGs | Microsoft Research evaluation |
| Real-world TSGs Analyzed | 92 troubleshooting guides | Empirical study dataset |
| Token Consumption | Lower than baseline approaches | Microsoft Research comparison |
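
To make the execution-time figures concrete, the sketch below works through the arithmetic. The 30-minute baseline, the four-step graph, and the step durations are hypothetical values chosen for illustration, not measurements from the evaluation. The second function shows why the gain depends on guide structure: parallel execution cannot finish faster than the longest dependency chain.

```python
from graphlib import TopologicalSorter


def reduced_time(baseline_minutes, reduction_percent):
    """Time remaining after cutting the baseline by the given percentage."""
    return baseline_minutes * (1 - reduction_percent / 100)


def critical_path(dag, durations):
    """Length of the longest dependency chain (lower bound on parallel time)."""
    finish = {}
    for step in TopologicalSorter(dag).static_order():
        start = max((finish[dep] for dep in dag[step]), default=0)
        finish[step] = start + durations[step]
    return max(finish.values())


# Reported range applied to a hypothetical 30-minute manual run.
low = reduced_time(30, 32.9)    # about 20.13 minutes
high = reduced_time(30, 70.4)   # about 8.88 minutes

# Hypothetical guide: steps b and c are independent once a has finished.
dag = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
durations = {"a": 2, "b": 5, "c": 4, "d": 1}  # minutes, illustrative only

sequential = sum(durations.values())                  # 12
parallel = critical_path(dag, durations)              # 2 + 5 + 1 = 8
saving = 100 * (sequential - parallel) / sequential   # about 33.3 percent
print(round(low, 2), round(high, 2), sequential, parallel, round(saving, 1))
```

A guide whose steps form a single chain would show no saving at all under this model, which is consistent with the reduction being reported only for parallelizable TSGs.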
Who should care
Builders
Software engineers and DevOps professionals can integrate StepFly into existing incident management workflows. According to Anthropic, a provider of large language models, AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases [2]. The open source framework enables customization for specific infrastructure requirements.
Enterprise
Large-scale IT operations teams benefit from reduced mean time to resolution and more consistent incident response. This fits a broader view of agentic engineering as a control plane that orchestrates cross-team workflows, maintains long-term memory across agents, and manages state and traceability across the full software delivery lifecycle [7].
End users
System users experience fewer service disruptions and faster resolution times when incidents occur. Automated troubleshooting reduces the human error factor that contributes to extended outages.
Investors
The framework represents Microsoft’s investment in agentic AI for enterprise operations, demonstrating practical applications of LLM technology in critical infrastructure management.
How to use StepFly today
StepFly is available as an open source project. A typical adoption path looks like this:
- Clone the repository from GitHub at microsoft/StepFly
- Install required dependencies including LLM access credentials
- Prepare existing troubleshooting guides using the TSG Mentor tool
- Configure Query Preparation Plugins for your data sources
- Deploy the DAG-guided scheduler-executor framework in your environment
- Test automation with non-critical troubleshooting scenarios
- Gradually expand to production incident management workflows
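
The plugin configuration step above can be pictured with a small sketch. This is not StepFly's plugin interface, which is not described here: the `QueryPlugin` class, its fields, the SQL text, and the incident fields are all hypothetical. The idea it illustrates is that a data-intensive query is prepared ahead of time as a template, so that at execution time only the incident-specific parameters need to be filled in.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class QueryPlugin:
    """Hypothetical plugin: a named query template plus the fields it needs."""
    name: str
    template: Template
    required: tuple

    def prepare(self, incident):
        """Fill the template from incident fields, failing loudly on gaps."""
        missing = [key for key in self.required if key not in incident]
        if missing:
            raise KeyError(f"{self.name}: missing incident fields {missing}")
        values = {key: incident[key] for key in self.required}
        return self.template.substitute(values)


# Illustrative query; table and column names are invented for this sketch.
cpu_plugin = QueryPlugin(
    name="cpu_usage",
    template=Template(
        "SELECT ts, cpu FROM metrics WHERE host = '$host' AND ts >= '$start'"
    ),
    required=("host", "start"),
)

incident = {"host": "web-01", "start": "2024-01-01T00:00:00Z", "severity": 2}
query = cpu_plugin.prepare(incident)
print(query)
```

String substitution keeps the sketch short. A production plugin would more likely pass parameters through the data source's own parameterized-query mechanism to avoid injection problems.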
StepFly vs competitors
StepFly competes with other AI-powered incident management solutions in the enterprise market.
| Feature | StepFly | AWS DevOps Agent | Neubird AI SRE |
|---|---|---|---|
| Open Source | Yes | No | No |
| Parallel Execution | Yes, DAG-guided | Not yet disclosed | Not yet disclosed |
| TSG Quality Tools | TSG Mentor included | Not yet disclosed | Not yet disclosed |
| Success Rate | 94% on GPT-4.1 | Not yet disclosed | Not yet disclosed |
| Platform Integration | Multi-platform | AWS-focused | Multi-platform |
Risks, limits, and myths
- LLM dependency creates potential points of failure if model services become unavailable
- Complex troubleshooting scenarios may require human oversight despite automation capabilities
- Initial setup requires significant investment in guide preparation and system configuration
- Parallel execution benefits only apply to troubleshooting guides with independent steps
- Success rates may vary significantly based on troubleshooting guide quality and complexity
- Myth: Complete replacement of human SREs – StepFly augments rather than replaces human expertise
- Myth: Universal compatibility – System requires structured input and may not work with all existing guides
FAQ
What is StepFly and how does it work?
StepFly is an AI-powered framework that automates IT troubleshooting guide execution through a three-stage workflow involving guide quality enhancement, offline preprocessing, and online execution with parallel processing capabilities.
How accurate is StepFly for troubleshooting automation?
StepFly reports a 94% success rate with GPT-4.1 in the Microsoft Research evaluation. Its design was informed by an empirical study of 92 real-world troubleshooting guides.
Can StepFly reduce incident resolution time?
Yes, StepFly reduces execution time by 32.9% to 70.4% for troubleshooting guides that support parallel processing of independent steps.
Is StepFly available for free?
StepFly is open source and available at no cost through the GitHub repository at microsoft/StepFly.
What makes StepFly different from other AI incident management tools?
StepFly provides specialized support for troubleshooting guide quality management, complex control flow interpretation, data-intensive queries, and parallel execution that existing LLM-based solutions lack.
Do I need technical expertise to implement StepFly?
Implementation requires software engineering knowledge for system integration, LLM configuration, and troubleshooting guide preparation using the included TSG Mentor tool.
Can StepFly work with existing troubleshooting documentation?
StepFly transforms unstructured troubleshooting guides into structured execution graphs, but guides may require quality improvements using the TSG Mentor tool before automation.
What are the main limitations of StepFly?
StepFly depends on LLM availability, requires structured input preparation, and may need human oversight for complex scenarios despite its 94% success rate.
How does StepFly handle parallel troubleshooting steps?
StepFly uses directed acyclic graphs to identify independent troubleshooting steps and execute them in parallel, which reduced execution time by 32.9% to 70.4% on parallelizable guides.
What infrastructure is needed to run StepFly?
StepFly requires access to large language models, integration with existing monitoring systems, and deployment of the DAG-guided scheduler-executor framework.
Glossary
- Agentic AI: AI systems with complex goal structures, natural language interfaces, and the capacity to act independently with integrated software tools
- DAG (Directed Acyclic Graph): Structured representation of workflow steps that enables parallel execution of independent tasks
- TSG (Troubleshooting Guide): Documentation that provides step-by-step instructions for diagnosing and resolving IT system issues
- SRE (Site Reliability Engineer): IT professional responsible for maintaining system reliability, availability, and performance
- Query Preparation Plugins: Specialized components that handle data-intensive operations within troubleshooting workflows
- TSG Mentor: Tool within StepFly that assists engineers in improving troubleshooting guide quality before automation
Sources
- [1] AI SRE – Autonomous Incident Resolution – Neubird. https://neubird.ai/
- [2] Best practices for building agentic systems | InfoWorld. https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
- [3] What is Agentic AI? – Agentic AI Explained – AWS. https://aws.amazon.com/what-is/agentic-ai/
- [4] AWS Announces General Availability of DevOps Agent for Automated Incident Investigation – InfoQ. https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
- [5] An MSP’s guide to agentic AI. https://superops.com/blog/an-msps-guide-to-agentic-ai
- [6] AI agent – Wikipedia. https://en.wikipedia.org/wiki/AI_agent
- [7] Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering. https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
- [8] Incident response for AI: Same fire, different fuel | Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/