StepFly: AI-Powered Troubleshooting Guide Automation

StepFly is an AI-powered framework that automates troubleshooting guide execution for IT incident management. It achieves a 94% success rate with GPT-4.1 and, through structured workflow automation and parallel processing, cuts execution time by 32.9% to 70.4% on guides that can be parallelized.

Released by	Microsoft Research
Release date	October 22, 2024
What it is	AI-powered framework for automating IT troubleshooting guides
Who it is for	Site reliability engineers and IT operations teams
Where to get it	GitHub repository at microsoft/StepFly
Price	Open source

StepFly automates manual troubleshooting guide execution using a three-stage agentic AI workflow
The framework achieves a 94% success rate with GPT-4.1, outperforming existing baseline approaches
Parallel execution capabilities reduce troubleshooting time by 32.9% to 70.4% for compatible guides
TSG Mentor tool helps site reliability engineers improve troubleshooting guide quality
The system converts unstructured guides into structured execution graphs for automated workflow management

What is StepFly
What is new vs previous approaches
How does StepFly work
Benchmarks and evidence
Who should care
How to use StepFly today
StepFly vs competitors
Risks, limits, and myths

Manual troubleshooting guide execution is slow and error-prone in large-scale IT systems
StepFly addresses key challenges including TSG quality issues, complex control flow, and data-intensive queries
The framework uses directed acyclic graphs to structure and parallelize troubleshooting workflows
Empirical evaluation on 92 real-world troubleshooting guides demonstrates significant performance improvements
Open source availability enables widespread adoption across IT operations teams

What is StepFly

StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. It turns manual, error-prone troubleshooting into automated workflows using large language models and structured execution graphs. More generally, AI agents in IT operations can detect issues such as CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [5].

The framework addresses critical challenges in incident management where manual execution of troubleshooting guides creates bottlenecks and introduces human error. StepFly leverages agentic AI principles to provide autonomous decision-making capabilities while maintaining oversight and control for site reliability engineers.

What is new vs previous approaches

StepFly introduces specialized support for troubleshooting guide automation that existing LLM-based solutions lack.

Feature	Previous LLM Solutions	StepFly
TSG Quality Management	No specialized support	TSG Mentor tool for guide improvement
Control Flow Interpretation	Limited complex workflow handling	Structured DAG extraction and execution
Data-Intensive Queries	Basic query processing	Dedicated Query Preparation Plugins
Parallel Execution	Sequential processing only	DAG-guided parallel step execution
Memory System	Limited workflow state tracking	Comprehensive memory for workflow continuity

How does StepFly work

StepFly operates through a three-stage workflow that transforms unstructured troubleshooting guides into automated execution systems.

Guide Quality Enhancement: TSG Mentor assists site reliability engineers in improving troubleshooting guide quality and structure before automation
Offline Preprocessing: LLMs extract structured execution directed acyclic graphs from unstructured guides and create Query Preparation Plugins for data operations
Online Execution: DAG-guided scheduler-executor framework with memory system ensures correct workflow execution and supports parallel processing of independent steps
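
The online execution stage can be illustrated with a minimal sketch of a DAG-guided scheduler. This is not StepFly's actual implementation: the function name `run_dag`, the `steps` and `deps` dictionaries, and the plain-dictionary "memory" are all assumptions made for illustration. It only shows the general idea that a step starts once its dependencies have finished, so independent steps run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
from typing import Callable, Dict, List

# Hypothetical type: a step receives a snapshot of the memory and returns a result.
Step = Callable[[Dict[str, object]], object]


def run_dag(steps: Dict[str, Step], deps: Dict[str, List[str]]) -> Dict[str, object]:
    """Run each step once all of its dependencies have finished."""
    memory: Dict[str, object] = {}  # results of completed steps, keyed by step name
    remaining = {name: set(deps.get(name, [])) for name in steps}
    running = {}  # future -> step name

    with ThreadPoolExecutor() as pool:
        while remaining or running:
            # Schedule every step whose dependencies are all satisfied.
            ready = [name for name, waiting_on in remaining.items() if not waiting_on]
            for name in ready:
                del remaining[name]
                running[pool.submit(steps[name], dict(memory))] = name
            if not running:
                raise ValueError("cycle or missing dependency in step graph")
            # Wait for at least one running step, then record its result.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                memory[name] = future.result()
                for waiting_on in remaining.values():
                    waiting_on.discard(name)
    return memory


# Illustrative steps: two independent checks feed one summary step.
steps = {
    "check_cpu": lambda m: "cpu ok",
    "check_disk": lambda m: "disk ok",
    "summarize": lambda m: f"{m['check_cpu']}, {m['check_disk']}",
}
deps = {"summarize": ["check_cpu", "check_disk"]}
result = run_dag(steps, deps)
```

Here `check_cpu` and `check_disk` have no dependencies and are submitted together, while `summarize` waits for both. Each step receives a copy of the memory so that concurrently running steps never mutate shared state.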

More broadly, agentic AI can speed up incident response and support more specific, in-depth post-incident analysis that helps prevent the same errors from recurring [3]. StepFly maintains state and traceability across the full troubleshooting lifecycle.

Benchmarks and evidence

The reported evaluation covers success rate, execution time, and token consumption.

Metric	StepFly Performance	Source
Success Rate on GPT-4.1	94%	Microsoft Research evaluation
Execution Time Reduction	32.9% to 70.4% for parallelizable TSGs	Microsoft Research evaluation
Real-world TSGs Analyzed	92 troubleshooting guides	Empirical study dataset
Token Consumption	Lower than baseline approaches	Microsoft Research comparison
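
To make the execution time metric concrete, the following sketch shows one way a time reduction could be computed for a single guide. It is a simplified model, not the method used in the evaluation: it assumes an acyclic step graph, unlimited parallel workers, and no scheduling overhead, so the parallel time equals the longest dependency chain. The function name and the durations are made up for illustration.

```python
from functools import lru_cache
from typing import Dict, List


def time_reduction(durations: Dict[str, float], deps: Dict[str, List[str]]) -> float:
    """Fraction of time saved by running independent steps in parallel.

    Assumes an acyclic graph, unlimited workers, and no overhead, so the
    parallel time is the length of the longest dependency chain.
    """

    @lru_cache(maxsize=None)
    def finish(step: str) -> float:
        # A step finishes after its slowest dependency plus its own duration.
        slowest_dep = max((finish(d) for d in deps.get(step, [])), default=0.0)
        return durations[step] + slowest_dep

    sequential = sum(durations.values())
    parallel = max(finish(step) for step in durations)
    return 1 - parallel / sequential


# Made-up durations in seconds; not taken from the StepFly evaluation.
durations = {"check_cpu": 30.0, "check_disk": 20.0, "check_logs": 40.0, "summarize": 10.0}
deps = {"summarize": ["check_cpu", "check_disk", "check_logs"]}
print(f"{time_reduction(durations, deps):.1%}")  # prints 50.0%
```

In this example the sequential time is 100 seconds and the longest chain is 50 seconds (`check_logs` then `summarize`). A guide whose steps form a single chain would show a reduction of zero under the same model, which is why the reported gains apply only to parallelizable guides.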

Who should care

Builders

Software engineers and DevOps professionals can integrate StepFly into existing incident management workflows. According to Anthropic, a provider of large language models, AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases [2]. The open source framework enables customization for specific infrastructure requirements.

Enterprise

Large-scale IT operations teams benefit from reduced mean time to resolution and more consistent incident response. This fits the broader idea of agentic engineering, which operates at a higher level of abstraction: a control plane that orchestrates cross-team workflows, maintains long-term memory across agents, and manages state and traceability across the full software delivery lifecycle [7].

End users

System users experience fewer service disruptions and faster resolution times when incidents occur. Automated troubleshooting reduces the human error factor that contributes to extended outages.

Investors

The framework represents Microsoft’s investment in agentic AI for enterprise operations, demonstrating practical applications of LLM technology in critical infrastructure management.

How to use StepFly today

StepFly is available as an open source project with implementation guidance for immediate deployment.

Clone the repository from GitHub at microsoft/StepFly
Install required dependencies including LLM access credentials
Prepare existing troubleshooting guides using the TSG Mentor tool
Configure Query Preparation Plugins for your data sources
Deploy the DAG-guided scheduler-executor framework in your environment
Test automation with non-critical troubleshooting scenarios
Gradually expand to production incident management workflows
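
As an illustration of the plugin configuration step above, here is a sketch of what a query preparation component might look like. The source does not describe StepFly's plugin interface, so everything here is hypothetical: the `QueryPlugin` class, its `prepare` method, the query text, and the field names. The sketch only shows the general idea of filling a predefined query template from incident context instead of having an LLM write the whole query.

```python
from dataclasses import dataclass
from string import Template
from typing import Dict


@dataclass(frozen=True)
class QueryPlugin:
    """Hypothetical plugin: a named query template with required parameters."""

    name: str
    template: Template

    def prepare(self, context: Dict[str, str]) -> str:
        # substitute() raises KeyError if the incident context lacks a parameter,
        # so a half-filled query is never sent to the data source.
        return self.template.substitute(context)


# Illustrative query text and field names; not taken from StepFly.
# A production plugin should pass values as bound parameters rather than
# interpolating them into the query string.
cpu_query = QueryPlugin(
    name="cpu_usage",
    template=Template(
        "SELECT ts, cpu_pct FROM metrics "
        "WHERE host = '$host' AND ts >= '$start_time'"
    ),
)

query = cpu_query.prepare({"host": "web-01", "start_time": "2024-01-01T00:00:00Z"})
print(query)
```

Failing fast on a missing parameter is a deliberate choice in this sketch: an incomplete query against production telemetry is usually worse than a step that stops and asks for input.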

StepFly vs competitors

StepFly competes with other AI-powered incident management solutions in the enterprise market.

Feature	StepFly	AWS DevOps Agent	Neubird AI SRE
Open Source	Yes	No	No
Parallel Execution	Yes, DAG-guided	Not yet disclosed	Not yet disclosed
TSG Quality Tools	TSG Mentor included	Not yet disclosed	Not yet disclosed
Success Rate	94% on GPT-4.1	Not yet disclosed	Not yet disclosed
Platform Integration	Multi-platform	AWS-focused	Multi-platform

Risks, limits, and myths

LLM dependency creates potential points of failure if model services become unavailable
Complex troubleshooting scenarios may require human oversight despite automation capabilities
Initial setup requires significant investment in guide preparation and system configuration
Parallel execution benefits only apply to troubleshooting guides with independent steps
Success rates may vary significantly based on troubleshooting guide quality and complexity
Myth: Complete replacement of human SREs – StepFly augments rather than replaces human expertise
Myth: Universal compatibility – StepFly accepts unstructured guides, but low-quality or ambiguous guides may need to be improved with TSG Mentor before they can be automated

FAQ

What is StepFly and how does it work?

StepFly is an AI-powered framework that automates IT troubleshooting guide execution through a three-stage workflow involving guide quality enhancement, offline preprocessing, and online execution with parallel processing capabilities.

How accurate is StepFly for troubleshooting automation?

StepFly achieves a 94% success rate when using GPT-4.1, based on empirical evaluation with 92 real-world troubleshooting guides.

Can StepFly reduce incident resolution time?

Yes, StepFly reduces execution time by 32.9% to 70.4% for troubleshooting guides that support parallel processing of independent steps.

Is StepFly available for free?

StepFly is open source and available at no cost through the GitHub repository at microsoft/StepFly.

What makes StepFly different from other AI incident management tools?

StepFly provides specialized support for troubleshooting guide quality management, complex control flow interpretation, data-intensive queries, and parallel execution that existing LLM-based solutions lack.

Do I need technical expertise to implement StepFly?

Implementation requires software engineering knowledge for system integration, LLM configuration, and troubleshooting guide preparation using the included TSG Mentor tool.

Can StepFly work with existing troubleshooting documentation?

StepFly transforms unstructured troubleshooting guides into structured execution graphs, but guides may require quality improvements using the TSG Mentor tool before automation.

What are the main limitations of StepFly?

StepFly depends on LLM availability, requires up-front guide preparation, and may need human oversight for complex scenarios despite its 94% success rate.

How does StepFly handle parallel troubleshooting steps?

StepFly uses directed acyclic graphs to identify independent troubleshooting steps and execute them in parallel, which reduces execution time by 32.9% to 70.4% on parallelizable guides.

What infrastructure is needed to run StepFly?

StepFly requires access to large language models, integration with existing monitoring systems, and deployment of the DAG-guided scheduler-executor framework.

Glossary

Agentic AI: AI systems with complex goal structures, natural language interfaces, and capacity to act independently with integrated software tools
DAG (Directed Acyclic Graph): Structured representation of workflow steps that enables parallel execution of independent tasks
TSG (Troubleshooting Guide): Documentation that provides step-by-step instructions for diagnosing and resolving IT system issues
SRE (Site Reliability Engineer): IT professional responsible for maintaining system reliability, availability, and performance
Query Preparation Plugins: Specialized components that handle data-intensive operations within troubleshooting workflows
TSG Mentor: Tool within StepFly that assists engineers in improving troubleshooting guide quality before automation

Visit the GitHub repository at microsoft/StepFly to download the open source framework and begin implementing automated troubleshooting in your IT environment.

Sources

[1] AI SRE – Autonomous Incident Resolution – Neubird. https://neubird.ai/
[2] Best practices for building agentic systems – InfoWorld. https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
[3] What is Agentic AI? – Agentic AI Explained – AWS. https://aws.amazon.com/what-is/agentic-ai/
[4] AWS Announces General Availability of DevOps Agent for Automated Incident Investigation – InfoQ. https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
[5] An MSP’s guide to agentic AI – SuperOps. https://superops.com/blog/an-msps-guide-to-agentic-ai
[6] AI agent – Wikipedia. https://en.wikipedia.org/wiki/AI_agent
[7] Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering – LangChain. https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
[8] Incident response for AI: Same fire, different fuel – Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/

Author

Siegfried Kamgo

Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.