StepFly: AI Agent Framework for IT Incident Troubleshooting

StepFly is an AI agent framework that automates the execution of IT troubleshooting guides (TSGs) for incident management. Microsoft Research reports a success rate of about 94% on GPT-4.1 and a 32.9% to 70.4% reduction in execution time for parallelizable guides, achieved through parallel step execution and structured workflow automation.

Released by	Microsoft Research
Release date	October 22, 2024
What it is	AI agent framework for automating IT troubleshooting guides
Who it is for	Site reliability engineers and IT operations teams
Where to get it	GitHub open source repository
Price	Free

StepFly automates the execution of troubleshooting guides, which is typically manual, slow, and error-prone during IT incidents
The framework uses a three-stage workflow: guide quality improvement, offline preprocessing, and online execution
It achieves a success rate of about 94% on GPT-4.1 while consuming fewer tokens than baseline approaches
Parallel execution reduces troubleshooting time by 32.9% to 70.4% for parallelizable guides
The system is open-sourced on GitHub with sample data for implementation

What is StepFly
What is new vs previous approaches
How does StepFly work
Benchmarks and evidence
Who should care
How to use StepFly today
StepFly vs competitors
Risks, limits, and myths

StepFly addresses four key challenges in automated incident management: TSG quality issues, complex control flow interpretation, data-intensive queries, and execution parallelism
The framework was developed based on empirical analysis of 92 real-world troubleshooting guides
It features the TSG Mentor tool, which helps site reliability engineers improve guide quality before automation
The system extracts structured execution DAGs from unstructured troubleshooting guides using LLMs
Query Preparation Plugins handle data-intensive operations while maintaining workflow integrity

What is StepFly

StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. The system addresses the manual, slow, and error-prone nature of traditional troubleshooting guide execution in large-scale IT systems.

Traditional troubleshooting guides require manual execution by site reliability engineers, leading to delays and human errors during critical incidents. The broader case for agentic AI applies here: it can speed up incident response and provide a more specific, in-depth post-incident analysis that helps prevent the same errors from recurring [3].

The framework uses large language models to interpret unstructured troubleshooting documentation and convert it into automated workflows. In this it matches the usual profile of an AI agent, whose key attributes include complex goal structures, natural language interfaces, the capacity to act independently of user supervision, and the integration of software tools or planning systems [6].

What is new vs previous approaches

StepFly introduces specialized capabilities that existing LLM-based incident management solutions lack.

Feature	Previous LLM Solutions	StepFly
TSG Quality Management	No specialized support	TSG Mentor tool for quality improvement
Control Flow Interpretation	Limited complex workflow handling	Structured DAG extraction from unstructured guides
Data-Intensive Queries	Basic query processing	Dedicated Query Preparation Plugins (QPPs)
Execution Parallelism	Sequential execution only	DAG-guided scheduler with parallel step execution
Memory System	Limited workflow state tracking	Comprehensive memory system for workflow integrity
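
To make the Query Preparation Plugin row more concrete, here is a minimal sketch. It assumes a plugin is a small component that turns incident context into a ready-to-run query string instead of leaving the LLM to write the query from scratch. The `QueryPreparationPlugin` class, the template, the query syntax, and the field names are all hypothetical and are not taken from the StepFly repository.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class QueryPreparationPlugin:
    """Hypothetical plugin: fills a query template from incident context."""
    name: str
    template: str  # uses $placeholders (standard string.Template syntax)
    required_fields: tuple[str, ...]

    def prepare(self, incident: dict[str, str]) -> str:
        # Fail early if the incident lacks a field the query needs,
        # rather than letting a model guess the value.
        missing = [f for f in self.required_fields if f not in incident]
        if missing:
            raise KeyError(f"incident is missing fields: {missing}")
        return Template(self.template).substitute(incident)


# Illustrative plugin and incident; the query language and names are made up.
cpu_plugin = QueryPreparationPlugin(
    name="cpu_usage_by_node",
    template="Metrics | where Cluster == '$cluster' and Time > ago($window)",
    required_fields=("cluster", "window"),
)

query = cpu_plugin.prepare({"cluster": "prod-west", "window": "30m", "severity": "2"})
print(query)
```

Keeping the query text in a template means the data-intensive part of a step is deterministic, and only the parameter values depend on the incident.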

How does StepFly work

StepFly operates through a three-stage workflow that transforms manual troubleshooting guides into automated execution systems.

Guide Quality Improvement Stage: The TSG Mentor tool assists site reliability engineers in identifying and fixing quality issues in existing troubleshooting guides before automation begins.
Offline Preprocessing Stage: Large language models extract structured execution directed acyclic graphs (DAGs) from unstructured troubleshooting guides and create dedicated Query Preparation Plugins for data-intensive operations.
Online Execution Stage: A DAG-guided scheduler-executor framework with a memory system ensures correct workflow execution and supports parallel processing of independent troubleshooting steps.
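
The output of the offline preprocessing stage can be pictured as a plain dependency map. The sketch below assumes a hypothetical five-step TSG and uses Python's standard `graphlib` module to check that the extracted graph is acyclic and to list which steps could run together. The step names and the `parallel_batches` helper are illustrative only, not StepFly's actual data model.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical extracted DAG: each step maps to the steps it depends on.
tsg_dag = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_disk": {"check_alert"},
    "query_network": {"check_alert"},
    "summarize": {"query_cpu", "query_disk", "query_network"},
}


def parallel_batches(dag: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches whose dependencies are all satisfied."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(parallel_batches(tsg_dag))
# [['check_alert'], ['query_cpu', 'query_disk', 'query_network'], ['summarize']]

# A guide whose steps refer back to each other is rejected before execution.
try:
    parallel_batches({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected: guide needs fixing before automation")
```

The three query steps land in the same batch because none depends on another, which is exactly the property a parallel scheduler can exploit.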

The system maintains workflow integrity through its memory system while enabling parallel execution of independent troubleshooting steps. This design echoes a broader description of agentic engineering as a control plane that orchestrates cross-team workflows, maintains long-term memory across agents, and manages state and traceability across the full software delivery lifecycle [7].
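
A DAG-guided scheduler with a shared memory can be sketched in a few lines. This is a simplified illustration that assumes each step is a Python callable reading earlier results from a dictionary. The `run_dag` function, the step functions, and the memory layout are hypothetical stand-ins for StepFly's scheduler-executor, which this article does not describe at code level.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter
from typing import Callable

Memory = dict[str, object]
Step = Callable[[Memory], object]


def run_dag(dag: dict[str, set[str]], steps: dict[str, Step], workers: int = 4) -> Memory:
    """Run steps in dependency order, starting independent steps concurrently."""
    memory: Memory = {}  # shared record of every finished step's result
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    running = {}  # maps each in-flight future to its step name
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            # Submit every step whose dependencies are already in memory.
            for name in sorter.get_ready():
                running[pool.submit(steps[name], memory)] = name
            finished, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in finished:
                name = running.pop(future)
                memory[name] = future.result()  # re-raises any step error
                sorter.done(name)  # unblocks steps that depend on this one
    return memory


# Hypothetical steps: two independent queries feed one summary step.
dag = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_disk": {"check_alert"},
    "summarize": {"query_cpu", "query_disk"},
}
steps = {
    "check_alert": lambda m: "node-7",
    "query_cpu": lambda m: f"cpu high on {m['check_alert']}",
    "query_disk": lambda m: f"disk ok on {m['check_alert']}",
    "summarize": lambda m: [m["query_cpu"], m["query_disk"]],
}
result = run_dag(dag, steps)
print(result["summarize"])
```

A step is only submitted after all of its dependencies have written their results to `memory`, so the summary step always sees both query results regardless of which query finishes first.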

Benchmarks and evidence

Microsoft Research reports the following results for StepFly, measured against baseline approaches.

Metric	StepFly Performance	Source
Success Rate on GPT-4.1	~94%	Microsoft Research evaluation
Execution Time Reduction	32.9% to 70.4% for parallelizable TSGs	Microsoft Research evaluation
Token Consumption	Lower than baseline approaches	Microsoft Research evaluation
Real-world TSG Analysis	92 troubleshooting guides studied	Microsoft Research empirical study
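
To see where a time reduction for parallelizable TSGs can come from, consider a back-of-the-envelope calculation. The step durations below are invented for illustration and are not StepFly measurements. The sketch compares the total sequential time with the longest dependency chain, which is the best case when independent steps run in parallel.

```python
from functools import lru_cache

# Hypothetical step durations (seconds) and dependencies.
durations = {
    "check_alert": 10,
    "query_cpu": 40,
    "query_disk": 30,
    "query_network": 20,
    "summarize": 10,
}
depends_on = {
    "check_alert": [],
    "query_cpu": ["check_alert"],
    "query_disk": ["check_alert"],
    "query_network": ["check_alert"],
    "summarize": ["query_cpu", "query_disk", "query_network"],
}


@lru_cache(maxsize=None)
def finish_time(step: str) -> int:
    """Earliest finish time if each step starts as soon as its inputs are ready."""
    start = max((finish_time(dep) for dep in depends_on[step]), default=0)
    return start + durations[step]


sequential = sum(durations.values())               # one step at a time
parallel = max(finish_time(s) for s in durations)  # length of the critical path
reduction = 100 * (sequential - parallel) / sequential
print(f"{sequential}s -> {parallel}s, {reduction:.1f}% reduction")
# 110s -> 60s, 45.5% reduction
```

The saving depends entirely on how much of the guide is independent: a strictly linear guide has a critical path equal to its sequential time and gains nothing from parallelism.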

Who should care

Builders

Software engineers and DevOps professionals can integrate StepFly’s open-source framework into existing incident management workflows. According to Anthropic, a provider of large language models (LLMs), AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases, followed by back-office automation, marketing, sales, finance, and data analysis [2].

Enterprise

Large-scale IT organizations can reduce incident response times and human errors through automated troubleshooting guide execution. AI agents can detect issues like CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [5].

End users

Site reliability engineers and IT operations teams benefit from reduced manual workload and faster incident resolution times during critical system outages.

Investors

The framework represents Microsoft’s investment in agentic AI for enterprise operations, potentially reducing operational costs and improving system reliability across cloud infrastructure.

How to use StepFly today

StepFly is available as an open-source framework with implementation guidance and sample data.

Access the repository: Visit https://github.com/microsoft/StepFly to download the framework code and documentation.
Review sample data: Examine provided troubleshooting guide examples to understand the expected input format and structure.
Install dependencies: Set up required Python packages and large language model access according to the repository documentation.
Prepare troubleshooting guides: Use the TSG Mentor tool to improve the quality of existing guides before automating them.
Configure execution environment: Set up the DAG-guided scheduler-executor framework with appropriate memory system configuration.
Test with sample incidents: Run the framework against provided sample incidents to validate installation and configuration.

StepFly vs competitors

StepFly competes with other AI-powered incident management and troubleshooting automation solutions.

Feature	StepFly	AWS DevOps Agent	Neubird AI SRE
Open Source	Yes	No	No
Parallel Execution	Yes, DAG-guided	Not yet disclosed	Not yet disclosed
Success Rate	~94% on GPT-4.1	Not yet disclosed	Not yet disclosed
Guide Quality Tools	TSG Mentor included	Not yet disclosed	Not yet disclosed
Execution Time Reduction	32.9-70.4%	Not yet disclosed	Not yet disclosed

Risks, limits, and myths

Quality dependency: StepFly’s effectiveness depends on the quality of input troubleshooting guides, requiring initial manual review and improvement.
LLM limitations: The framework inherits potential biases and errors from underlying large language models used for guide interpretation.
Complex incident handling: Highly complex or novel incidents may require human intervention beyond automated troubleshooting capabilities.
Integration complexity: Organizations need existing monitoring and telemetry systems to provide data for automated troubleshooting execution.
Myth – Complete automation: StepFly augments rather than replaces human site reliability engineers, requiring oversight for critical incidents.
Myth – Universal applicability: Not all troubleshooting guides are suitable for automation, particularly those requiring subjective judgment or manual hardware intervention.

FAQ

What is StepFly and how does it work for IT incidents? StepFly is an AI agent framework that automates troubleshooting guide execution for IT incidents using a three-stage workflow: guide quality improvement, offline preprocessing, and online execution with parallel processing.
How much faster is StepFly compared to manual troubleshooting? Microsoft Research reports a 32.9% to 70.4% reduction in execution time for parallelizable troubleshooting guides, along with a success rate of about 94% on GPT-4.1. These figures come from an evaluation against baseline approaches, and this article cites no measured comparison with manual execution.
Is StepFly open source and free to use? Yes, StepFly is available as an open-source framework on GitHub with sample data and implementation documentation at no cost.
What makes StepFly different from other AI incident management tools? StepFly provides specialized support for troubleshooting guide quality management, complex control flow interpretation, data-intensive queries, and parallel execution, which previous LLM-based solutions lack.
Who developed StepFly and when was it released? Microsoft Research developed StepFly, releasing it as an open-source framework on October 22, 2024.
What are the main components of StepFly’s architecture? StepFly includes TSG Mentor for guide quality improvement, DAG extraction for workflow structuring, Query Preparation Plugins for data operations, and a scheduler-executor with a memory system.
Can StepFly handle all types of IT troubleshooting scenarios? StepFly works best with structured troubleshooting guides but may require human intervention for highly complex incidents or those requiring subjective judgment.
How does StepFly ensure troubleshooting workflow accuracy? StepFly uses a memory system and DAG-guided execution to maintain workflow integrity while supporting parallel processing of independent troubleshooting steps.
What prerequisites are needed to implement StepFly? Organizations need existing troubleshooting guides, monitoring systems for data input, a Python environment, and access to large language models.
How was StepFly’s performance validated? Microsoft Research grounded the design in an empirical study of 92 real-world troubleshooting guides and evaluated the framework against baseline approaches, reporting a success rate of about 94% on GPT-4.1 with lower token consumption.

Glossary

Agentic AI: Artificial intelligence systems that can act autonomously to achieve goals without constant human supervision, using natural language interfaces and integrated tools.
DAG (Directed Acyclic Graph): A structured representation of workflow steps and dependencies that prevents circular execution loops while enabling parallel processing of independent tasks.
Query Preparation Plugins (QPPs): Specialized components in StepFly that handle data-intensive operations and queries during troubleshooting guide execution.
Site Reliability Engineer (SRE): IT professionals responsible for maintaining system reliability, availability, and performance through monitoring, incident response, and automation practices.
TSG (Troubleshooting Guide): Documented procedures that provide step-by-step instructions for diagnosing and resolving specific IT system issues or incidents.
TSG Mentor: A tool within StepFly that assists site reliability engineers in identifying and improving quality issues in troubleshooting guides before automation.

Visit the StepFly GitHub repository at https://github.com/microsoft/StepFly to download the framework and explore sample troubleshooting guide implementations.

Sources

[1] Neubird AI. “AI SRE – Autonomous Incident Resolution.” https://neubird.ai/
[2] InfoWorld. “Best practices for building agentic systems.” https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
[3] AWS. “What is Agentic AI? – Agentic AI Explained.” https://aws.amazon.com/what-is/agentic-ai/
[4] InfoQ. “AWS Announces General Availability of DevOps Agent for Automated Incident Investigation.” https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
[5] SuperOps. “An MSP’s guide to agentic AI.” https://superops.com/blog/an-msps-guide-to-agentic-ai
[6] Wikipedia. “AI agent.” https://en.wikipedia.org/wiki/AI_agent
[7] LangChain. “Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering.” https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
[8] Microsoft Security Blog. “Incident response for AI: Same fire, different fuel.” https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/

Author

Siegfried Kamgo

Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.
