StepFly: AI Agent Automates IT Troubleshooting Guides

StepFly is Microsoft’s agentic AI framework for automating the execution of IT troubleshooting guides. In Microsoft Research’s evaluation it reached a 94% success rate on GPT-4.1 and cut execution time by 32.9% to 70.4% on parallelizable guides, using parallel processing and structured workflow management aimed at site reliability engineers.

Released by	Microsoft Research
Release date	April 22, 2026
What it is	AI agent framework for automating IT troubleshooting guides
Who it is for	Site reliability engineers and IT operations teams
Where to get it	GitHub repository at microsoft/StepFly
Price	Open source

StepFly automates troubleshooting guide execution with a 94% success rate on GPT-4.1
Three-stage workflow includes quality improvement, preprocessing, and parallel execution
Reduces execution time by 32.9-70.4% for parallelizable troubleshooting guides
Converts unstructured guides into directed acyclic graphs for systematic execution
Open source framework available on GitHub with sample data and documentation

What is StepFly
What is new vs previous approaches
How does StepFly work
Benchmarks and evidence
Who should care
How to use StepFly today
StepFly vs competitors
Risks, limits, and myths

Manual troubleshooting guide execution is slow and error-prone in large-scale IT systems
StepFly addresses troubleshooting guide (TSG) quality issues, complex control flow, and data-intensive queries
The framework enables parallel execution of independent troubleshooting steps
Empirical study analyzed 92 real-world troubleshooting guides to inform design
DAG-guided scheduler-executor framework ensures correct workflow execution

What is StepFly

StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. It uses large language models to convert unstructured troubleshooting guides into structured execution workflows. More broadly, agentic AI speeds up incident response and supports more specific, in-depth post-incident analysis that helps prevent the same errors from recurring [3]. StepFly specifically targets manual troubleshooting guide execution, which is slow and error-prone in large-scale IT environments.

What is new vs previous approaches

StepFly introduces specialized capabilities that existing LLM-based incident management solutions lack.

Feature	Previous LLM Solutions	StepFly
TSG Quality Management	Limited support	TSG Mentor tool for quality improvement
Control Flow Interpretation	Basic sequential processing	Directed acyclic graph extraction and execution
Data-Intensive Queries	Generic handling	Dedicated Query Preparation Plugins
Parallel Execution	Not supported	Scheduler-executor framework with memory system
Workflow Structure	Ad-hoc processing	Three-stage systematic workflow

How does StepFly work

StepFly operates through a three-stage workflow that transforms unstructured troubleshooting guides into automated execution systems.

Quality Improvement Stage: The TSG Mentor tool helps site reliability engineers improve troubleshooting guide quality and completeness
Offline Preprocessing Stage: LLMs extract structured directed acyclic graphs from unstructured guides and create Query Preparation Plugins for data handling
Online Execution Stage: A DAG-guided scheduler-executor framework with a memory system executes workflows and supports parallel processing of independent steps

The scheduler-executor framework, together with its memory system, keeps workflow execution correct while letting independent troubleshooting steps run in parallel. More generally, AI agents can detect issues like CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [6].
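To make the online execution stage more concrete, here is a minimal Python sketch of DAG-guided scheduling with a shared memory of step results. It is not StepFly's code: the step names, the run_step stand-in, and the execute function are hypothetical, and the sketch relies only on Python's standard graphlib and concurrent.futures modules. It shows how steps whose dependencies are complete can run in parallel while dependent steps wait.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical guide: each step maps to the set of steps it depends on.
DEPENDENCIES = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_logs": {"check_alert"},
    "diagnose": {"query_cpu", "query_logs"},
}

def run_step(name, inputs):
    """Stand-in executor: a real one would run queries or call an LLM."""
    return f"{name}: saw {sorted(inputs)}"

def execute(dependencies, max_workers=4):
    """Run steps in dependency order, in parallel where the graph allows."""
    memory = {}  # step name -> result; written only by this scheduler loop
    order = []   # completion order, kept for inspection
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}
        while sorter.is_active():
            # Submit every step whose dependencies have all finished.
            for step in sorter.get_ready():
                inputs = {dep: memory[dep] for dep in dependencies[step]}
                running[pool.submit(run_step, step, inputs)] = step
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                step = running.pop(future)
                memory[step] = future.result()
                order.append(step)
                sorter.done(step)  # may unblock downstream steps
    return order, memory

order, memory = execute(DEPENDENCIES)
print(order)
```

In this toy guide, query_cpu and query_logs both depend only on check_alert, so they are submitted together; diagnose starts only after both results have been written to memory.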

Benchmarks and evidence

Microsoft Research’s evaluation reports the following results for StepFly on real-world troubleshooting guides.

Metric	StepFly Performance	Source
Success Rate	94% on GPT-4.1	Microsoft Research evaluation
Execution Time Reduction	32.9% to 70.4% for parallelizable TSGs	Microsoft Research evaluation
Token Consumption	Lower than baseline approaches	Microsoft Research evaluation
Real-world TSGs Analyzed	92 troubleshooting guides	Empirical study foundation
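The size of the execution time reduction depends on how much of a guide can run in parallel. As a rough illustration, the Python sketch below compares sequential time (the sum of all step durations) with ideal parallel time (the longest dependency chain). The step names and durations are made up for illustration and are not taken from Microsoft's evaluation; time_reduction is a hypothetical helper, and the model ignores scheduling overhead.

```python
from graphlib import TopologicalSorter

def time_reduction(durations, dependencies):
    """Fraction of time saved by running independent steps in parallel.

    Sequential time is the sum of all durations; ideal parallel time is the
    length of the longest dependency chain (the critical path).
    Every step must appear as a key in both dictionaries.
    """
    finish = {}
    for step in TopologicalSorter(dependencies).static_order():
        start = max((finish[dep] for dep in dependencies[step]), default=0)
        finish[step] = start + durations[step]
    sequential = sum(durations.values())
    parallel = max(finish.values())
    return 1 - parallel / sequential

# Illustrative durations in seconds (assumed, not measured).
DURATIONS = {"check_alert": 10, "query_cpu": 30, "query_logs": 40, "diagnose": 20}
DEPENDENCIES = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_logs": {"check_alert"},
    "diagnose": {"query_cpu", "query_logs"},
}
print(f"{time_reduction(DURATIONS, DEPENDENCIES):.1%}")  # 30.0%

# A strictly linear guide has no independent steps, so nothing is saved.
CHAIN = {"a": set(), "b": {"a"}, "c": {"b"}}
print(f"{time_reduction({'a': 10, 'b': 20, 'c': 30}, CHAIN):.1%}")  # 0.0%
```

Here the two queries overlap, so the guide takes 70 seconds instead of 100. The linear chain shows why the reported reductions apply only to parallelizable guides.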

Who should care

Builders

Software engineers and DevOps professionals can integrate StepFly’s open-source framework into existing incident management workflows. The system provides APIs and tools for customizing troubleshooting guide automation. According to Anthropic, a provider of large language models (LLMs), AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases, followed by back-office automation, marketing, sales, finance, and data analysis [2].

Enterprise

Large organizations with complex IT infrastructures can reduce incident resolution time and human error through automated troubleshooting. The framework addresses scalability challenges in manual incident management processes. Automation is essential here: manual review cannot keep pace with the volume of incidents in enterprise environments [7].

End users

Site reliability engineers and IT operations teams benefit from reduced manual workload and faster incident resolution. The system maintains human oversight while automating repetitive troubleshooting tasks.

Investors

The incident management automation market represents a significant opportunity as organizations seek to reduce operational costs and improve system reliability through AI-powered solutions.

How to use StepFly today

StepFly is available as an open-source framework with complete implementation and sample data.

Clone the repository: git clone https://github.com/microsoft/StepFly
Install dependencies according to the provided requirements file
Prepare troubleshooting guides using the TSG Mentor tool for quality improvement
Run offline preprocessing to convert guides into directed acyclic graphs
Configure the scheduler-executor framework for your IT environment
Deploy the system for automated troubleshooting guide execution
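Before running any converted guide, it can help to sanity-check that its step dependencies really form a directed acyclic graph. The Python sketch below is independent of StepFly's own tooling, and its commands and file formats are not reproduced here. The dictionary layout and the check_step_graph helper are assumptions for illustration, built on Python's standard graphlib module.

```python
from graphlib import CycleError, TopologicalSorter

def check_step_graph(dependencies):
    """Check a step graph and return (ok, detail).

    detail is one valid execution order when ok is True, otherwise a short
    description of the problem (undefined steps or the cycle that was found).
    """
    referenced = {dep for deps in dependencies.values() for dep in deps}
    undefined = sorted(referenced - dependencies.keys())
    if undefined:
        return False, f"undefined steps: {undefined}"
    try:
        return True, list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # graphlib reports the cycle as a node list in the second args element.
        return False, f"cycle: {list(err.args[1])}"

# Hypothetical step graphs: each step maps to the steps it depends on.
print(check_step_graph({"restart_service": {"check_logs"}, "check_logs": set()}))
print(check_step_graph({"a": {"b"}, "b": {"a"}}))
```

A graph that fails this check cannot be scheduled safely, because a cycle means at least one step would wait forever on its own downstream result.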

StepFly vs competitors

The table below compares StepFly with traditional LLM-based solutions and with manual troubleshooting guide execution.

Feature	StepFly	Traditional LLM Solutions	Manual TSG Execution
Success Rate	94% on GPT-4.1	Not yet disclosed	Variable, error-prone
Parallel Execution	Supported with 32.9-70.4% time reduction	Not supported	Not supported
Quality Management	TSG Mentor tool included	Limited	Manual review
Structured Workflow	DAG-based execution	Sequential processing	Ad-hoc execution
Open Source	Yes	Varies	N/A

Risks, limits, and myths

System performance depends on troubleshooting guide quality and completeness
Complex IT environments may require extensive customization and configuration
LLM accuracy limitations can affect automated decision-making in critical incidents
Parallel execution benefits only apply to troubleshooting guides with independent steps
Human oversight remains necessary for high-stakes incident resolution scenarios
Integration complexity may require significant engineering resources for deployment

FAQ

What is StepFly and how does it automate troubleshooting?

StepFly is Microsoft’s agentic AI framework that converts unstructured IT troubleshooting guides into automated workflows. It reached a 94% success rate on GPT-4.1 through structured execution and parallel processing.

How much faster is StepFly compared to manual troubleshooting?

StepFly reduces execution time by 32.9% to 70.4% for parallelizable troubleshooting guides while consuming fewer tokens than baseline approaches according to Microsoft’s evaluation.

What makes StepFly different from other AI incident management tools?

StepFly provides specialized support for TSG quality management, complex control flow interpretation, data-intensive queries, and parallel execution that existing LLM-based solutions lack.

Is StepFly available for commercial use?

StepFly is open source and available on GitHub at microsoft/StepFly with complete implementation code and sample data for commercial and research use.

What are the three stages of StepFly’s workflow?

StepFly operates through quality improvement using TSG Mentor, offline preprocessing with DAG extraction, and online execution with parallel processing capabilities.

How does StepFly handle complex troubleshooting guide structures?

StepFly extracts directed acyclic graphs from unstructured guides and uses a DAG-guided scheduler-executor framework with a memory system to ensure correct workflow execution.

What types of IT environments can benefit from StepFly?

Large-scale IT systems with complex troubleshooting procedures benefit most from StepFly’s automation, particularly environments with parallelizable troubleshooting steps and quality management needs.

Does StepFly require human oversight for incident management?

StepFly automates troubleshooting guide execution but maintains integration points for human oversight, particularly for high-stakes incidents and quality improvement processes through TSG Mentor.

Which LLMs does StepFly support?

StepFly achieved a 94% success rate on GPT-4.1 according to Microsoft’s evaluation, though specific support for other models is not yet disclosed.

How does StepFly improve troubleshooting guide quality?

StepFly includes TSG Mentor, a dedicated tool that assists site reliability engineers in improving troubleshooting guide quality and completeness before automated execution.

Glossary

Agentic AI: AI systems that can act independently with complex goal structures, natural language interfaces, and integration of software tools or planning systems
DAG (Directed Acyclic Graph): A structured representation of workflows where tasks have dependencies but no circular references, enabling parallel execution of independent steps
TSG (Troubleshooting Guide): Structured documentation that provides step-by-step procedures for diagnosing and resolving IT incidents and system issues
SRE (Site Reliability Engineer): An IT professional responsible for maintaining system reliability, performance, and incident response in large-scale technology environments
Query Preparation Plugins (QPPs): Specialized components in StepFly that handle data-intensive queries during troubleshooting guide execution
TSG Mentor: StepFly’s tool for assisting site reliability engineers in improving troubleshooting guide quality and completeness

Visit the StepFly GitHub repository at microsoft/StepFly to download the open-source framework and explore sample troubleshooting guide automation implementations.

Sources

[1] Neubird AI SRE – Autonomous Incident Resolution
[2] InfoWorld – Best practices for building agentic systems
[3] AWS – What is Agentic AI?
[4] InfoQ – AWS Announces General Availability of DevOps Agent
[5] Wikipedia – AI agent
[6] SuperOps – An MSP’s guide to agentic AI
[7] Microsoft Security Blog – Incident response for AI
[8] Automation Anywhere – What is Agentic AI?

Author

Siegfried Kamgo

Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.