Frontier Signal

StepFly: AI Agent Automates IT Troubleshooting Guides

StepFly achieves a 94% success rate automating troubleshooting guides with AI agents and reduces execution time by 32.9% to 70.4% for parallelizable guides through DAG workflows and parallel processing.

StepFly is Microsoft’s AI agent framework that automates troubleshooting guide execution for IT incident management. The system achieves a 94% success rate on GPT-4.1 and reduces execution time by 32.9% to 70.4% for parallelizable guides through parallel processing and structured workflow automation.

Released by: Microsoft Research
Release date: (not given)
What it is: AI agent framework for automating IT troubleshooting guides
Who it is for: Site reliability engineers and IT operations teams
Where to get it: GitHub (open source)
Price: Free
  • StepFly automates troubleshooting guide execution with a 94% success rate on GPT-4.1
  • Three-stage workflow includes guide quality improvement, offline preprocessing, and online execution
  • Parallel execution reduces troubleshooting time by 32.9% to 70.4% for compatible guides
  • Framework converts unstructured guides into directed acyclic graphs for systematic execution
  • Open source code and sample data available on GitHub
  • Manual troubleshooting guide execution is slow and error-prone in large-scale IT systems
  • StepFly addresses key challenges including guide quality, complex control flow, and data-intensive queries
  • The framework enables parallel execution of independent troubleshooting steps
  • Microsoft’s empirical study analyzed 92 real-world troubleshooting guides to inform design
  • A DAG-guided scheduler-executor framework with a memory system ensures correct workflow execution

What is StepFly

StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. The system converts unstructured troubleshooting guides into structured workflows that AI agents can execute automatically. AI agents have complex goal structures, natural language interfaces, and the capacity to act independently of user supervision [6]. StepFly targets three challenges in large-scale IT environments: managing troubleshooting guide quality, interpreting complex control flow, and handling data-intensive queries.

What is new vs previous approaches

StepFly introduces specialized capabilities that existing LLM-based incident management solutions lack.

Feature | Previous LLM Solutions | StepFly
Guide Quality Management | Limited support | TSG Mentor tool for quality improvement
Control Flow Interpretation | Basic sequential processing | Structured DAG extraction and execution
Parallel Execution | Not supported | Independent step parallelization
Data Query Handling | Generic approach | Dedicated Query Preparation Plugins
Memory System | Limited context retention | Comprehensive workflow state management

How does StepFly work

StepFly operates through a three-stage workflow that transforms manual troubleshooting into automated execution.

  1. Guide Quality Enhancement: TSG Mentor assists site reliability engineers in improving troubleshooting guide quality and structure
  2. Offline Preprocessing: LLMs extract structured execution directed acyclic graphs from unstructured guides and create Query Preparation Plugins
  3. Online Execution: A DAG-guided scheduler-executor framework with a memory system ensures the workflow runs in the correct order and enables parallel execution of independent steps

The framework converts troubleshooting guides into directed acyclic graphs that represent step dependencies and execution order. Agentic engineering operates at a higher level of abstraction as a control plane that orchestrates cross-team workflows and maintains long-term memory across agents [7].
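To make the DAG idea concrete, here is a minimal sketch of how step dependencies could be represented and grouped into batches that are safe to run together. It uses Python's standard `graphlib` module. The step names and the `execution_waves` helper are hypothetical illustrations, not taken from the StepFly codebase.

```python
from graphlib import TopologicalSorter

# Hypothetical guide: each step maps to the set of steps it depends on.
tsg_dag = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_disk": {"check_alert"},
    "summarize": {"query_cpu", "query_disk"},
}


def execution_waves(dag):
    """Group steps into waves; every step in a wave has all dependencies met."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(tsg_dag)
print(waves)
# [['check_alert'], ['query_cpu', 'query_disk'], ['summarize']]
```

The two query steps land in the same wave because neither depends on the other, which is the property a scheduler can exploit for parallel execution.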

Benchmarks and evidence

Microsoft’s evaluation demonstrates StepFly’s effectiveness across multiple performance metrics.

Metric | Result | Source
Success Rate on GPT-4.1 | 94% | Microsoft Research evaluation
Execution Time Reduction | 32.9% to 70.4% | Parallelizable TSG performance
Real-world TSGs Analyzed | 92 guides | Empirical study foundation
Token Consumption | Lower than baselines | Comparative evaluation
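As a rough illustration of where a time reduction for parallelizable guides can come from, the sketch below compares running every step back to back against the longest dependency chain (the best case with unlimited parallelism). The step names and durations are invented for the example and are unrelated to the figures in the table; the helper names are hypothetical.

```python
def sequential_time(durations):
    """Total time when every step runs one after another."""
    return sum(durations.values())


def critical_path_time(dag, durations):
    """Longest dependency chain: best-case wall time with unlimited parallelism."""
    finish = {}

    def finish_time(step):
        if step not in finish:
            start = max((finish_time(dep) for dep in dag[step]), default=0.0)
            finish[step] = start + durations[step]
        return finish[step]

    return max(finish_time(step) for step in dag)


# Hypothetical guide and per-step durations in seconds (illustrative only).
dag = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_disk": {"check_alert"},
    "summarize": {"query_cpu", "query_disk"},
}
durations = {"check_alert": 10.0, "query_cpu": 40.0, "query_disk": 30.0, "summarize": 20.0}

seq = sequential_time(durations)          # 100.0
par = critical_path_time(dag, durations)  # 70.0 (check_alert -> query_cpu -> summarize)
reduction_pct = 100.0 * (seq - par) / seq
print(f"{reduction_pct:.1f}% reduction")  # 30.0% reduction
```

A guide with no independent steps has a critical path equal to its sequential time, so it gains nothing from parallel execution. This is consistent with the reduction being reported only for parallelizable TSGs.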

Who should care

Builders

Software engineers building incident management systems can leverage StepFly’s open source framework for automated troubleshooting. AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases [2]. The DAG-based execution model provides a foundation for building reliable automation systems.

Enterprise

Large organizations with complex IT infrastructure benefit from StepFly’s ability to standardize and accelerate incident response. Agentic AI enhances incident response speed while providing more specific and in-depth post-incident analysis [3]. The framework reduces dependency on manual troubleshooting expertise.

End users

Site reliability engineers and IT operations teams gain tools for improving troubleshooting guide quality and execution efficiency. The TSG Mentor component specifically addresses guide quality issues that impact automation success rates.

Investors

The incident management automation market represents significant opportunity as organizations seek to reduce operational costs and improve system reliability. StepFly’s open source availability accelerates adoption and ecosystem development.

How to use StepFly today

StepFly is available as open source software with complete implementation and sample data.

  1. Clone the repository: git clone https://github.com/microsoft/StepFly
  2. Install dependencies according to the project requirements
  3. Prepare troubleshooting guides using the TSG Mentor tool for quality improvement
  4. Run offline preprocessing to convert guides into structured DAGs
  5. Configure the scheduler-executor framework for your environment
  6. Execute troubleshooting workflows through the agentic system
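For intuition about what the final step involves, the following is a minimal sketch of a scheduler-executor loop with a shared memory, assuming each step is a plain Python callable. The names `run_workflow`, `actions`, and `memory` are hypothetical and do not reflect the actual interfaces in the StepFly repository; consult the project's own documentation for real usage.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_workflow(dag, actions, max_workers=4):
    """Run steps in dependency order; steps that are ready together share a pool."""
    memory = {}  # step name -> result, visible to later steps
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()
            # Each step receives a snapshot of results produced so far.
            futures = {step: pool.submit(actions[step], dict(memory)) for step in ready}
            for step, future in futures.items():
                memory[step] = future.result()
                sorter.done(step)
    return memory


# Hypothetical guide and step implementations (illustrative only).
dag = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_disk": {"check_alert"},
    "summarize": {"query_cpu", "query_disk"},
}
actions = {
    "check_alert": lambda mem: "cpu_high",
    "query_cpu": lambda mem: 93,
    "query_disk": lambda mem: 41,
    "summarize": lambda mem: (
        f"{mem['check_alert']}: cpu={mem['query_cpu']} disk={mem['query_disk']}"
    ),
}

result = run_workflow(dag, actions)
print(result["summarize"])  # cpu_high: cpu=93 disk=41
```

This sketch waits for a whole batch of ready steps before scheduling the next batch, which keeps the code short. A production scheduler would more likely release each dependent step as soon as its own prerequisites finish.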

StepFly vs competitors

StepFly competes with other AI-powered incident management solutions in the market.

Feature | StepFly | AWS DevOps Agent | Neubird AI SRE
Open Source | Yes | No | No
Parallel Execution | Yes | Not disclosed | Not disclosed
Success Rate | 94% on GPT-4.1 | Not disclosed | Not disclosed
Guide Quality Tools | TSG Mentor | Not disclosed | Not disclosed
Structured Workflows | DAG-based | Not disclosed | Telemetry analysis focus

Risks, limits, and myths

  • Quality Dependency: Success rates depend heavily on troubleshooting guide quality and structure
  • LLM Limitations: Performance varies across different language models and may require model-specific tuning
  • Complex Dependencies: Some troubleshooting scenarios may have dependencies too complex for DAG representation
  • Data Requirements: Query Preparation Plugins require access to relevant data sources and APIs
  • Myth: Complete Automation: StepFly assists rather than replaces human expertise in complex incident scenarios
  • Myth: Universal Application: Framework works best with well-structured guides rather than ad-hoc troubleshooting

FAQ

What is StepFly and how does it work?

StepFly is Microsoft’s AI agent framework that automates troubleshooting guide execution through a three-stage workflow: guide quality improvement, offline preprocessing of guides into DAGs, and online execution with parallel processing.

What success rate does StepFly achieve?

StepFly achieves a success rate of approximately 94% on GPT-4.1 and outperforms baseline approaches while using less time and fewer tokens.

How much faster is StepFly compared to manual troubleshooting?

For parallelizable troubleshooting guides, StepFly’s DAG-guided parallel execution reduces execution time by 32.9% to 70.4%. The figure describes automated execution time; no direct comparison against manual troubleshooting is cited.

Is StepFly open source and free to use?

Yes, StepFly is available as open source software on GitHub with complete code and sample data at no cost.

What makes StepFly different from other AI incident management tools?

StepFly provides specialized support for troubleshooting guide quality management, complex control flow interpretation, parallel execution, and dedicated query preparation plugins.

Who should use StepFly for incident management?

Site reliability engineers, IT operations teams, and organizations with large-scale IT infrastructure benefit most from StepFly’s automated troubleshooting capabilities.

What are the main components of StepFly’s architecture?

StepFly includes TSG Mentor for guide quality improvement, offline preprocessing for DAG extraction, Query Preparation Plugins, and a scheduler-executor framework with a memory system.

How many real-world troubleshooting guides did Microsoft analyze for StepFly?

Microsoft conducted an empirical study on 92 real-world troubleshooting guides to inform StepFly’s design and identify key automation challenges.

Can StepFly handle complex troubleshooting scenarios with dependencies?

StepFly converts troubleshooting guides into directed acyclic graphs to manage step dependencies and enable parallel execution of independent operations.

What language models does StepFly support?

StepFly demonstrates a 94% success rate on GPT-4.1. Support for other language models is not yet disclosed in the available documentation.

Glossary

Agentic AI
AI systems that can act independently with complex goal structures, natural language interfaces, and autonomous decision-making capabilities
DAG (Directed Acyclic Graph)
A graph structure with directed edges and no cycles, used by StepFly to represent troubleshooting step dependencies and execution order
Query Preparation Plugins (QPPs)
Specialized components in StepFly that handle data-intensive queries during troubleshooting guide execution
Site Reliability Engineer (SRE)
IT professionals responsible for maintaining system reliability, availability, and performance in large-scale environments
TSG Mentor
StepFly’s tool that assists site reliability engineers in improving troubleshooting guide quality and structure
Troubleshooting Guide (TSG)
Structured documentation that provides step-by-step procedures for diagnosing and resolving IT system incidents

Visit the StepFly GitHub repository at https://github.com/microsoft/StepFly to access the open source code and begin implementing automated troubleshooting in your environment.

Sources

  1. Neubird AI SRE – Autonomous Incident Resolution. https://neubird.ai/
  2. Best practices for building agentic systems. InfoWorld. https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
  3. What is Agentic AI? AWS. https://aws.amazon.com/what-is/agentic-ai/
  4. AWS Announces General Availability of DevOps Agent for Automated Incident Investigation. InfoQ. https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
  5. An MSP’s guide to agentic AI. SuperOps. https://superops.com/blog/an-msps-guide-to-agentic-ai
  6. AI agent. Wikipedia. https://en.wikipedia.org/wiki/AI_agent
  7. Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering. LangChain. https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
  8. Incident response for AI: Same fire, different fuel. Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.
