StepFly is Microsoft’s agentic AI framework that automates the execution of IT troubleshooting guides. It achieves a 94% success rate on GPT-4.1 and reduces execution time by 32.9-70.4% for parallelizable guides by running independent troubleshooting steps in parallel.
| Released by | Microsoft Research |
|---|---|
| Release date | |
| What it is | Agentic AI framework for automating IT troubleshooting guides |
| Who it’s for | Site reliability engineers and IT operations teams |
| Where to get it | GitHub (microsoft/StepFly) |
| Price | Open source |
- StepFly automates troubleshooting guide execution with a 94% success rate on GPT-4.1
- Three-stage workflow includes guide quality improvement, offline preprocessing, and online execution
- Reduces execution time by 32.9-70.4% through parallel processing of independent steps
- Tested on 92 real-world troubleshooting guides from enterprise IT environments
- Open source framework available on GitHub with sample data and documentation
- Manual troubleshooting guide execution is slow and error-prone in large-scale IT systems
- StepFly addresses key challenges including TSG quality issues and complex control flow interpretation
- The framework uses directed acyclic graphs (DAGs) to structure unstructured troubleshooting guides
- Parallel execution capabilities significantly reduce time-to-resolution for IT incidents
- Agentic AI systems are increasingly deployed in software engineering and IT operations
What is StepFly
StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. It targets a specific problem: in large-scale IT environments, executing troubleshooting guides by hand is slow and error-prone.
The framework emerged from an empirical study of 92 real-world troubleshooting guides (TSGs). Agentic AI enhances incident response speed while also providing more specific and in-depth post-incident analysis [3]. StepFly specifically targets site reliability engineers (SREs) who manage complex IT infrastructure troubleshooting workflows.
Microsoft Research developed StepFly to handle specialized challenges including TSG quality management, complex control flow interpretation, data-intensive queries, and execution parallelism. The system integrates with existing IT monitoring and management tools through its plugin architecture.
What is new vs previous solutions
StepFly introduces specialized capabilities that existing LLM-based incident management solutions lack.
| Feature | Previous Solutions | StepFly |
|---|---|---|
| TSG Quality Management | Manual review required | TSG Mentor tool for automated quality assistance |
| Control Flow Handling | Linear execution only | DAG-guided scheduler with complex workflow support |
| Data Query Processing | Generic LLM queries | Dedicated Query Preparation Plugins (QPPs) |
| Execution Model | Sequential step processing | Parallel execution of independent steps |
| Memory System | Limited context retention | Comprehensive memory system for workflow state |
How does StepFly work
StepFly operates through a three-stage workflow that transforms unstructured troubleshooting guides into automated execution systems.
- Guide Quality Improvement Stage: TSG Mentor tool assists SREs in identifying and fixing quality issues in existing troubleshooting guides before automation.
- Offline Preprocessing Stage: LLMs extract structured directed acyclic graphs (DAGs) from unstructured TSGs and create dedicated Query Preparation Plugins for data-intensive operations.
- Online Execution Stage: DAG-guided scheduler-executor framework with memory system ensures correct workflow execution and supports parallel processing of independent troubleshooting steps.
- Parallel Processing: Independent troubleshooting steps execute simultaneously, reducing overall incident resolution time by 32.9-70.4% for parallelizable TSGs.
- Memory Management: Comprehensive memory system maintains workflow state and execution context across complex multi-step troubleshooting procedures.
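The online execution stage described above can be illustrated with a minimal sketch. It is not StepFly's actual scheduler: the step names, the `run_step` stand-in, and the `memory` dictionary are all hypothetical, and the standard-library `graphlib` module is used here only to show how a DAG makes independent steps visible to a scheduler.

```python
import asyncio
from graphlib import TopologicalSorter

# Hypothetical TSG expressed as a DAG: each step maps to the steps it depends on.
TSG_DAG = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_disk": {"check_alert"},
    "query_network": {"check_alert"},
    "diagnose": {"query_cpu", "query_disk", "query_network"},
}


async def run_step(name: str, memory: dict) -> None:
    """Stand-in for a real step executor (an LLM call or a monitoring query)."""
    await asyncio.sleep(0.01)  # simulate I/O-bound work
    memory[name] = f"result of {name}"


async def execute(dag: dict) -> tuple:
    memory = {}  # shared workflow state, visible to every step
    waves = []   # records which steps were started together
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        # Every step whose dependencies have all completed is ready to run.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        # Run all ready steps concurrently, then mark them as done.
        await asyncio.gather(*(run_step(step, memory) for step in ready))
        sorter.done(*ready)
    return memory, waves


memory, waves = asyncio.run(execute(TSG_DAG))
print(waves)
# [['check_alert'], ['query_cpu', 'query_disk', 'query_network'], ['diagnose']]
```

The three query steps share no dependencies on each other, so they run in the same wave. This sketch waits for a whole wave to finish before starting the next one; a finer-grained scheduler could start each step as soon as its own dependencies complete.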
Benchmarks and evidence
Reported results for StepFly cover success rate, execution time, and token consumption, measured on real-world troubleshooting guides.
| Metric | StepFly Performance | Source |
|---|---|---|
| Success Rate | 94% on GPT-4.1 | Microsoft Research evaluation |
| Time Reduction | 32.9-70.4% for parallelizable TSGs | Empirical evaluation results |
| Token Consumption | Lower than baseline methods | Comparative analysis |
| Real-world TSGs Tested | 92 troubleshooting guides | Empirical study dataset |
| Execution Time | Faster than baseline approaches | Performance benchmarking |
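To see why the time reduction is reported only for parallelizable TSGs, it can help to compare sequential execution time (the sum of all step durations) with the best-case parallel time (the longest dependency chain). The sketch below does this for two hypothetical DAGs with made-up durations. It assumes unlimited workers and no scheduling overhead, and it does not reproduce how the reported 32.9-70.4% figures were measured.

```python
from graphlib import TopologicalSorter


def time_reduction(dag, durations):
    """Return (sequential_time, parallel_time, percent_reduction)."""
    sequential = sum(durations.values())
    finish = {}
    # Visit steps so that every dependency is processed before its dependents.
    for step in TopologicalSorter(dag).static_order():
        start = max((finish[dep] for dep in dag.get(step, ())), default=0)
        finish[step] = start + durations[step]
    parallel = max(finish.values())  # length of the longest dependency chain
    return sequential, parallel, 100.0 * (sequential - parallel) / sequential


# Hypothetical step durations in seconds.
DURATIONS = {"a": 10, "b": 30, "c": 20, "d": 10}

# Steps b and c are independent of each other and can overlap.
FAN_OUT = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}

# Every step depends on the previous one, so nothing can overlap.
CHAIN = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"}}

for label, dag in (("fan-out", FAN_OUT), ("chain", CHAIN)):
    seq, par, pct = time_reduction(dag, DURATIONS)
    print(f"{label}: sequential={seq}s parallel={par}s reduction={pct:.1f}%")
# fan-out: sequential=70s parallel=50s reduction=28.6%
# chain: sequential=70s parallel=70s reduction=0.0%
```

The chain case shows the limit: when every step depends on the one before it, parallel execution cannot shorten the run at all.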
Who should care
Builders
Software engineers building incident management systems can leverage StepFly’s open-source framework. AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases [2]. The GitHub repository provides implementation guidance and sample data for integration projects.
Enterprise
Large-scale IT operations teams managing complex infrastructure can reduce incident resolution time significantly. AI agents can detect issues like CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [5]. Enterprise adoption requires integration with existing monitoring and ticketing systems.
End users
Business users experience faster service restoration when IT incidents occur. StepFly’s automation reduces mean time to resolution (MTTR) for common infrastructure issues. End users benefit indirectly through improved system reliability and reduced downtime duration.
Investors
The incident management automation market represents significant opportunity as enterprises seek to reduce operational costs. Microsoft’s open-source approach may accelerate adoption while building ecosystem partnerships. Investment opportunities exist in complementary tooling and managed service providers.
How to use StepFly today
StepFly is available as an open-source framework through Microsoft’s GitHub repository.
- Clone Repository: Access the codebase at https://github.com/microsoft/StepFly with sample data and documentation included.
- Install Dependencies: Follow setup instructions for Python environment and required LLM API access (GPT-4.1 recommended for optimal performance).
- Prepare TSGs: Use TSG Mentor tool to review and improve existing troubleshooting guide quality before automation.
- Configure Plugins: Set up Query Preparation Plugins (QPPs) for your specific monitoring tools and data sources.
- Test Execution: Run sample troubleshooting scenarios to validate DAG extraction and parallel execution capabilities.
- Deploy Framework: Integrate with existing incident management workflows and monitoring systems for production use.
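The plugin configuration step can be pictured with a small sketch. StepFly's real QPP interface is not described here, so everything below is an assumption made for illustration: the `QueryPreparationPlugin` class, its `prepare` method, the parameter names, and the query text are all hypothetical. The sketch only shows the general idea of preparing a data query from values held in workflow state.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class QueryPreparationPlugin:
    """Hypothetical QPP: fills a fixed query template from workflow state."""

    name: str
    template: Template
    required: tuple

    def prepare(self, memory: dict) -> str:
        # Fail early if an earlier step has not produced a needed value.
        missing = [key for key in self.required if key not in memory]
        if missing:
            raise KeyError(f"{self.name}: missing parameters {missing}")
        return self.template.substitute({key: memory[key] for key in self.required})


# Hypothetical plugin for a made-up metrics store and query syntax.
cpu_plugin = QueryPreparationPlugin(
    name="cpu_usage",
    template=Template(
        "SELECT avg(cpu) FROM metrics WHERE host = '$host' AND ts > now() - $window"
    ),
    required=("host", "window"),
)

# Values that earlier troubleshooting steps might have written to memory.
memory = {"host": "web-01", "window": "15m"}
query = cpu_plugin.prepare(memory)
print(query)
# SELECT avg(cpu) FROM metrics WHERE host = 'web-01' AND ts > now() - 15m
```

In this sketch the query text lives in a reviewed template and only the parameters vary per incident, which is one plausible way to keep data-intensive queries predictable.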
StepFly vs competitors
StepFly competes with other AI-powered incident management solutions in the enterprise market.
| Feature | StepFly | ServiceNow AI | AWS DevOps Agent |
|---|---|---|---|
| Open Source | Yes | No | No |
| Parallel Execution | Yes (32.9-70.4% time reduction) | Limited | Not yet disclosed |
| TSG Quality Tools | TSG Mentor included | Manual process | Not yet disclosed |
| Success Rate | 94% on GPT-4.1 | Not yet disclosed | Not yet disclosed |
| Deployment Model | Self-hosted | SaaS platform | AWS cloud service |
| Pricing | Free (open source) | Enterprise licensing | Pay-per-use |
Risks, limits, and myths
- Quality Dependency: StepFly performance relies heavily on troubleshooting guide quality, requiring upfront investment in TSG improvement.
- LLM Costs: High-performance models like GPT-4.1 may incur significant API costs for large-scale deployments.
- Integration Complexity: Connecting with existing monitoring tools and ticketing systems requires custom plugin development.
- False Automation Myth: StepFly requires human oversight and cannot replace all manual troubleshooting expertise.
- Parallel Processing Limits: Not all troubleshooting guides benefit from parallelization due to sequential dependencies.
- Data Security Concerns: Sending sensitive infrastructure data to external LLM APIs may violate security policies.
- Training Requirements: SRE teams need training on TSG Mentor tools and DAG-based workflow concepts.
FAQ
What is StepFly and how does it work?
StepFly is Microsoft’s agentic AI framework that automates IT troubleshooting guide execution through a three-stage workflow: quality improvement, offline preprocessing, and online execution with parallel processing capabilities.
How much faster is StepFly compared to manual troubleshooting?
Compared to baseline methods, StepFly reduces execution time by 32.9-70.4% for parallelizable troubleshooting guides, while achieving a 94% success rate on GPT-4.1.
Is StepFly free to use?
Yes, StepFly is open source and available free on GitHub at microsoft/StepFly, though LLM API costs may apply depending on usage volume.
What troubleshooting guides work best with StepFly?
StepFly works best with structured troubleshooting guides that have clear steps and decision points, particularly those with independent steps that can execute in parallel.
Do I need GPT-4.1 to use StepFly effectively?
StepFly achieved its 94% success rate on GPT-4.1. The framework supports other LLMs, but performance metrics for alternative models are not yet disclosed.
How does StepFly handle sensitive IT infrastructure data?
StepFly processes troubleshooting workflows through LLM APIs, so organizations must evaluate data privacy policies and consider on-premises LLM deployment for sensitive environments.
What skills do SREs need to implement StepFly?
SREs need familiarity with Python development, LLM API integration, and understanding of directed acyclic graph (DAG) concepts for workflow management.
Can StepFly replace human site reliability engineers?
No, StepFly automates routine troubleshooting guide execution but requires human oversight, TSG quality management, and expertise for complex incident scenarios.
How does StepFly compare to ServiceNow or AWS incident management?
StepFly offers open-source deployment and specialized parallel execution capabilities, while ServiceNow and AWS provide managed platform solutions with different pricing models.
What monitoring tools integrate with StepFly?
StepFly uses Query Preparation Plugins (QPPs) to integrate with various monitoring tools, though specific supported platforms are not yet disclosed in available documentation.
How long does it take to implement StepFly in production?
Implementation timeline depends on troubleshooting guide quality, existing tool integrations, and team familiarity with agentic AI frameworks – specific timeframes are not yet disclosed.
What happens if StepFly fails to resolve an incident?
A 94% success rate means some executions will not succeed. StepFly may escalate unresolved incidents to human SREs through standard incident management workflows, and it maintains execution logs for analysis.
Glossary
- Agentic AI: AI systems that can act independently with complex goal structures, natural language interfaces, and integration of software tools for autonomous task execution.
- Directed Acyclic Graph (DAG): A graph structure with directed edges and no cycles, used by StepFly to represent troubleshooting workflow dependencies and enable parallel execution.
- Query Preparation Plugins (QPPs): Specialized components in StepFly that handle data-intensive queries by preparing and formatting requests to monitoring tools and data sources.
- Site Reliability Engineer (SRE): IT professionals responsible for maintaining system reliability, availability, and performance through monitoring, incident response, and automation practices.
- Troubleshooting Guide (TSG): Structured documentation that provides step-by-step procedures for diagnosing and resolving specific IT system issues and incidents.
- TSG Mentor: StepFly’s tool that assists site reliability engineers in identifying and improving troubleshooting guide quality before automation implementation.
Sources
- [1] AI SRE – Autonomous Incident Resolution – Neubird. https://neubird.ai/
- [2] Best practices for building agentic systems | InfoWorld. https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
- [3] What is Agentic AI? – Agentic AI Explained – AWS. https://aws.amazon.com/what-is/agentic-ai/
- [4] AWS Announces General Availability of DevOps Agent for Automated Incident Investigation – InfoQ. https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
- [5] An MSP’s guide to agentic AI. https://superops.com/blog/an-msps-guide-to-agentic-ai
- [6] AI agent – Wikipedia. https://en.wikipedia.org/wiki/AI_agent
- [7] Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering. https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
- [8] Incident response for AI: Same fire, different fuel | Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/