StepFly: AI Agent Automates IT Troubleshooting Guides

StepFly is Microsoft’s AI agent framework that automates IT troubleshooting guides. It achieves a 94% success rate with GPT-4.1 and cuts execution time by 32.9% to 70.4% on parallelizable guides through parallel processing and structured workflow automation.

Released by	Microsoft Research
Release date	October 22, 2025
What it is	AI agent framework for automating IT troubleshooting guides
Who it is for	Site reliability engineers and IT operations teams
Where to get it	GitHub repository
Price	Open source

StepFly automates manual troubleshooting guide execution with a 94% success rate on real-world incidents
Three-stage workflow includes guide quality improvement, offline preprocessing, and parallel execution
Reduces execution time by 32.9% to 70.4% for parallelizable troubleshooting scenarios
Features TSG Mentor tool to help site reliability engineers improve guide quality
Open source framework available on GitHub with sample data and implementation code

What is StepFly
What is new vs previous approaches
How does StepFly work
Benchmarks and evidence
Who should care
How to use StepFly today
StepFly vs competitors
Risks, limits, and myths

StepFly addresses critical gaps in existing LLM-based incident management solutions through specialized troubleshooting guide automation
The framework processes unstructured troubleshooting guides into structured execution graphs for reliable automation
Parallel execution capabilities significantly reduce incident resolution time compared to sequential manual processes
TSG Mentor component helps improve troubleshooting guide quality before automation deployment
An empirical study of 92 real-world troubleshooting guides grounds the design in the automation barriers found in enterprise environments

What is StepFly

StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. The system transforms manual troubleshooting processes into automated workflows using large language models and structured execution graphs. More generally, AI agents in IT operations can detect issues like CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [5].

The framework addresses four key challenges in troubleshooting guide automation: managing guide quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. Microsoft researchers developed StepFly after conducting an empirical study on 92 real-world troubleshooting guides to identify common automation barriers.

What is new vs previous approaches

StepFly introduces specialized capabilities not found in existing LLM-based incident management solutions.

Feature	Previous LLM Solutions	StepFly
Guide Quality Management	No specialized support	TSG Mentor tool for quality improvement
Control Flow Interpretation	Limited structured processing	Directed acyclic graph extraction
Data Query Handling	Basic query processing	Dedicated Query Preparation Plugins
Execution Parallelism	Sequential processing	DAG-guided parallel execution
Memory System	Limited workflow state	Comprehensive memory for workflow continuity
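To make the "directed acyclic graph extraction" row concrete, here is a minimal sketch of how an extracted guide might be represented and checked for cycles. The step names and the `validate_dag` helper are hypothetical illustrations, not StepFly's actual data model; the check uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical TSG steps: each key is a step, each value lists the steps it depends on.
tsg_dag = {
    "check_alert": [],
    "query_cpu": ["check_alert"],
    "query_memory": ["check_alert"],
    "correlate": ["query_cpu", "query_memory"],
    "mitigate": ["correlate"],
}

def validate_dag(graph):
    """Return one valid execution order.

    Raises ValueError if a step depends on an undefined step
    or if the dependencies form a cycle.
    """
    known = set(graph)
    for step, deps in graph.items():
        missing = set(deps) - known
        if missing:
            raise ValueError(f"{step} depends on undefined steps: {sorted(missing)}")
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # graphlib stores the offending cycle as the second exception argument.
        raise ValueError(f"cycle detected: {exc.args[1]}") from exc

order = validate_dag(tsg_dag)
print(order)
```

In this toy graph, `query_cpu` and `query_memory` share a single dependency and do not depend on each other, which is the kind of structure a DAG-guided scheduler can run in parallel.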

How does StepFly work

StepFly operates through a three-stage workflow that transforms manual troubleshooting into automated execution.

Guide Quality Improvement: TSG Mentor assists site reliability engineers in identifying and fixing quality issues in existing troubleshooting guides before automation
Offline Preprocessing: Large language models extract structured directed acyclic graphs from unstructured troubleshooting guides and create Query Preparation Plugins for data-intensive operations
Online Execution: DAG-guided scheduler-executor framework runs troubleshooting steps with memory system support and parallel execution of independent operations

More broadly, agentic AI can speed up incident response and produce more specific, in-depth post-incident analysis that helps prevent the same errors from recurring [3]. Within StepFly, the framework maintains state and traceability across the full incident resolution lifecycle.
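The online execution stage can be illustrated with a small scheduler-executor sketch. This is not StepFly's implementation: `run_dag`, the step names, and the plain-dictionary "memory" are assumptions chosen to show the idea of running each step as soon as its dependencies finish, using only the Python standard library.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from graphlib import TopologicalSorter

def run_dag(graph, actions, max_workers=4):
    """Run each step once its dependencies finish; independent steps run concurrently.

    graph:   {step: [dependencies]}
    actions: {step: callable(memory_snapshot) -> result}
    Returns a dict mapping every step to its result.
    """
    memory = {}                      # results of finished steps
    sorter = TopologicalSorter(graph)
    sorter.prepare()                 # raises CycleError if the graph is not a DAG
    running = {}                     # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            for step in sorter.get_ready():
                # Each step receives a snapshot of the results produced so far.
                running[pool.submit(actions[step], dict(memory))] = step
            finished, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in finished:
                step = running.pop(future)
                memory[step] = future.result()  # re-raises if the step failed
                sorter.done(step)               # unlocks dependent steps
    return memory

# Hypothetical guide: two independent queries feed one correlation step.
graph = {
    "check_alert": [],
    "query_cpu": ["check_alert"],
    "query_memory": ["check_alert"],
    "correlate": ["query_cpu", "query_memory"],
}
actions = {
    "check_alert": lambda m: "alert-123",
    "query_cpu": lambda m: f"cpu for {m['check_alert']}",
    "query_memory": lambda m: f"mem for {m['check_alert']}",
    "correlate": lambda m: [m["query_cpu"], m["query_memory"]],
}
result = run_dag(graph, actions)
print(result["correlate"])
```

Passing each step a copy of the results keeps the sketch simple and avoids shared mutable state between threads; a real system would likely persist this memory so that an interrupted workflow can resume.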

Benchmarks and evidence

Microsoft Research’s evaluation on real-world troubleshooting guides and incidents reports the following results for StepFly.

Metric	StepFly Performance	Source
Success Rate	94% on GPT-4.1	Microsoft Research evaluation
Execution Time Reduction	32.9% to 70.4% for parallelizable guides	Empirical study on real-world TSGs
Token Consumption	Lower than baseline approaches	Comparative analysis
Empirical Study Scope	92 real-world troubleshooting guides analyzed	Empirical study of automation barriers
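Why parallelizable guides see a time reduction can be shown with simple arithmetic: sequential execution takes the sum of all step durations, while parallel execution is bounded by the longest dependency chain. The sketch below uses made-up step names and durations and assumes unlimited workers; it does not reproduce the reported 32.9% to 70.4% figures.

```python
from graphlib import TopologicalSorter

def time_reduction(graph, durations):
    """Compare sequential and ideal parallel execution time.

    graph:     {step: [dependencies]}
    durations: {step: seconds}
    Returns (sequential_time, parallel_time, fractional_reduction).
    """
    finish = {}
    for step in TopologicalSorter(graph).static_order():
        # A step starts when its slowest dependency has finished.
        start = max((finish[dep] for dep in graph[step]), default=0.0)
        finish[step] = start + durations[step]
    sequential = sum(durations.values())
    parallel = max(finish.values())   # length of the critical path
    return sequential, parallel, 1 - parallel / sequential

# Hypothetical guide: steps b and c can run at the same time.
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
durations = {"a": 10, "b": 30, "c": 20, "d": 10}

seq, par, red = time_reduction(graph, durations)
print(f"sequential={seq}s parallel={par}s reduction={red:.1%}")
# prints: sequential=70s parallel=50s reduction=28.6%
```

The reduction grows with the share of total work that sits off the critical path, which is consistent with the reported gains applying specifically to parallelizable guides.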

Who should care

Builders

Software engineers and DevOps practitioners can integrate StepFly’s open-source framework into existing incident management workflows. According to Anthropic, a provider of large language models, AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases [2]. The GitHub repository provides implementation code and sample data for development teams.

Enterprise

Large-scale IT organizations benefit from StepFly’s automation of manual troubleshooting processes that are typically slow and error-prone. IT incident resolution is a concrete example of where agentic systems apply [2]: enterprises can reduce mean time to resolution while improving consistency across incident response teams.

End users

Site reliability engineers and IT operations teams gain tools for improving troubleshooting guide quality and automating routine incident response tasks. The TSG Mentor component specifically addresses guide quality issues that prevent effective automation.

Investors

The incident management automation market represents a significant opportunity as organizations seek to reduce operational costs and improve system reliability. StepFly’s reported success rate and execution-time reductions suggest that agentic AI can be commercially viable in enterprise IT operations.

How to use StepFly today

StepFly is available as an open-source framework through Microsoft’s GitHub repository.

Clone the repository: git clone https://github.com/microsoft/StepFly
Install dependencies and configure the framework according to documentation
Use TSG Mentor to assess and improve existing troubleshooting guide quality
Run offline preprocessing to convert guides into structured execution graphs
Deploy the DAG-guided executor for automated incident response

The repository includes sample troubleshooting guides and incident data for testing and development purposes.

StepFly vs competitors

StepFly sits alongside commercial AI-powered incident management products such as AWS DevOps Agent [4] and Neubird AI SRE [1].

Solution	Approach	Parallel Execution	Guide Quality Tools	Open Source
StepFly	DAG-guided agentic framework	Yes	TSG Mentor included	Yes
AWS DevOps Agent	Serverless incident investigation	Not disclosed	Not disclosed	No
Neubird AI SRE	Telemetry analysis and correlation	Not disclosed	Not disclosed	No

Risks, limits, and myths

StepFly requires high-quality troubleshooting guides as input; poor guide quality limits automation effectiveness
The framework depends on large language model performance, which can vary across different incident types
Complex troubleshooting scenarios may still require human intervention despite automation capabilities
Implementation requires technical expertise in AI systems and incident management processes
Token consumption costs may be significant for organizations with high incident volumes
The 94% success rate, while high, means that roughly 6% of cases in the evaluation were not resolved successfully and would still need manual intervention
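The last two points can be turned into a rough planning estimate. In the sketch below only the 94% success rate comes from the reported results; the incident volume, tokens per incident, and token price are placeholder assumptions that each team would replace with its own numbers.

```python
def estimate_load(incidents_per_month, success_rate,
                  tokens_per_incident, usd_per_million_tokens):
    """Return (expected manual incidents, estimated monthly token cost in USD)."""
    manual = incidents_per_month * (1 - success_rate)
    cost = (incidents_per_month * tokens_per_incident
            * usd_per_million_tokens / 1_000_000)
    return manual, cost

# Placeholder inputs: 1,000 incidents per month, 50,000 tokens each, $2 per million tokens.
manual, cost = estimate_load(
    incidents_per_month=1000,
    success_rate=0.94,          # reported success rate with GPT-4.1
    tokens_per_incident=50_000,
    usd_per_million_tokens=2.0,
)
print(f"about {round(manual)} incidents need a human; token cost is about ${cost:.2f}")
```

Treating the evaluation success rate as a production rate is itself an assumption, since results can vary across incident types.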

FAQ

What is StepFly and how does it automate troubleshooting?

StepFly is Microsoft’s AI agent framework that converts manual troubleshooting guides into automated workflows using directed acyclic graphs and parallel execution capabilities.

How accurate is StepFly for incident resolution?

StepFly achieves a 94% success rate with GPT-4.1 when tested on real-world troubleshooting guides and incidents.

Can StepFly reduce incident resolution time?

Yes, StepFly reduces execution time by 32.9% to 70.4% for parallelizable troubleshooting guides through automated parallel processing.

Is StepFly available for commercial use?

StepFly is open source and available through Microsoft’s GitHub repository with sample data and implementation code. Check the license in the repository for the exact terms that apply to commercial use.

What makes StepFly different from other AI incident management tools?

StepFly provides specialized support for troubleshooting guide quality management, complex control flow interpretation, and parallel execution capabilities.

Do I need existing troubleshooting guides to use StepFly?

Yes, StepFly requires existing troubleshooting guides as input and includes TSG Mentor to help improve guide quality before automation.

What technical skills are needed to implement StepFly?

Implementation requires expertise in AI systems, large language models, and incident management processes for effective deployment.

How does StepFly handle complex incident scenarios?

StepFly uses directed acyclic graphs to manage complex control flows and Query Preparation Plugins for data-intensive troubleshooting operations.
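As a rough illustration of the plugin idea, the sketch below fills a fixed query template with incident parameters, so that a long query never has to be regenerated token by token. The template, the Kusto-style query text, the table and column names, and the `prepare_failure_query` function are all hypothetical and are not taken from the StepFly repository.

```python
from string import Template

# Hypothetical query template, as it might appear in a troubleshooting guide.
QUERY_TEMPLATE = Template(
    "Requests | where Cluster == '$cluster' "
    "| where Timestamp between (datetime($start) .. datetime($end)) "
    "| summarize Failures = countif(Status >= 500) by bin(Timestamp, 5m)"
)

def prepare_failure_query(cluster, start, end):
    """Fill the template with incident parameters and return the full query text."""
    return QUERY_TEMPLATE.substitute(cluster=cluster, start=start, end=end)

# Placeholder incident parameters.
query = prepare_failure_query(
    cluster="eastus-01",
    start="2025-01-01T00:00:00Z",
    end="2025-01-01T01:00:00Z",
)
print(query)
```

Keeping the query text in code means the model only has to supply a few parameters, which is one plausible way to reduce both errors and token use on data-intensive steps.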

What are the costs of running StepFly?

While the framework is open source, organizations incur large language model token consumption costs during automated troubleshooting execution.

Can StepFly work with existing monitoring tools?

StepFly focuses on troubleshooting guide automation rather than direct monitoring tool integration, though it can process data from various sources through Query Preparation Plugins.

Glossary

Agentic AI: AI systems that can act independently with complex goal structures, natural language interfaces, and integrated software tools
Directed Acyclic Graph (DAG): A directed graph with no cycles; in StepFly, nodes are troubleshooting steps and edges are dependencies between them, so steps that do not depend on each other can run in parallel
Query Preparation Plugins (QPPs): Specialized components that handle data-intensive queries during troubleshooting guide execution
Site Reliability Engineer (SRE): An IT professional responsible for maintaining system reliability, availability, and performance in production environments
TSG Mentor: StepFly’s tool for assisting site reliability engineers in improving troubleshooting guide quality before automation
Troubleshooting Guide (TSG): Structured documentation that provides step-by-step procedures for diagnosing and resolving IT incidents

Visit the StepFly GitHub repository at https://github.com/microsoft/StepFly to access the open-source framework, sample data, and implementation documentation.

Sources

[1] AI SRE – Autonomous Incident Resolution – Neubird. https://neubird.ai/
[2] Best practices for building agentic systems | InfoWorld. https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
[3] What is Agentic AI? – Agentic AI Explained – AWS. https://aws.amazon.com/what-is/agentic-ai/
[4] AWS Announces General Availability of DevOps Agent for Automated Incident Investigation – InfoQ. https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
[5] An MSP’s guide to agentic AI. https://superops.com/blog/an-msps-guide-to-agentic-ai
[6] AI agent – Wikipedia. https://en.wikipedia.org/wiki/AI_agent
[7] Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering. https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
[8] Incident response for AI: Same fire, different fuel | Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/

Author

Siegfried Kamgo

Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.