Frontier Signal

StepFly: AI Agent Automates IT Troubleshooting Guides

StepFly achieves a 94% success rate automating IT troubleshooting guides with AI agents, and cuts execution time by 32.9% to 70.4% on parallelizable guides through DAG-based workflows and parallel processing.

StepFly is Microsoft’s AI agent framework for automating IT troubleshooting guides. It achieves a 94% success rate with GPT-4.1 and reduces execution time by 32.9% to 70.4% on parallelizable guides through parallel processing and structured workflow automation.

Released by: Microsoft Research
Release date: not stated
What it is: AI agent framework for automating IT troubleshooting guides
Who it is for: Site reliability engineers and IT operations teams
Where to get it: GitHub repository
Price: Open source
  • StepFly automates manual troubleshooting guide execution with a 94% success rate on real-world incidents
  • A three-stage workflow covers guide quality improvement, offline preprocessing, and parallel execution
  • Parallel execution reduces execution time by 32.9% to 70.4% for parallelizable troubleshooting scenarios, compared with working through the steps sequentially
  • The TSG Mentor tool helps site reliability engineers improve guide quality before automation is deployed
  • Offline preprocessing turns unstructured troubleshooting guides into structured execution graphs for reliable automation
  • The framework addresses gaps in existing LLM-based incident management solutions through specialized troubleshooting guide automation
  • Real-world evaluation on 92 troubleshooting guides demonstrates practical applicability in enterprise environments
  • The open-source framework is available on GitHub with sample data and implementation code

What is StepFly

StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. The system transforms manual troubleshooting processes into automated workflows using large language models and structured execution graphs. AI agents can detect issues like CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [5].

The framework addresses four key challenges in troubleshooting guide automation: managing guide quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. Microsoft researchers developed StepFly after conducting an empirical study on 92 real-world troubleshooting guides to identify common automation barriers.

What is new vs previous approaches

StepFly introduces specialized capabilities not found in existing LLM-based incident management solutions.

Feature | Previous LLM Solutions | StepFly
Guide Quality Management | No specialized support | TSG Mentor tool for quality improvement
Control Flow Interpretation | Limited structured processing | Directed acyclic graph extraction
Data Query Handling | Basic query processing | Dedicated Query Preparation Plugins
Execution Parallelism | Sequential processing | DAG-guided parallel execution
Memory System | Limited workflow state | Comprehensive memory for workflow continuity

How does StepFly work

StepFly operates through a three-stage workflow that transforms manual troubleshooting into automated execution.

  1. Guide Quality Improvement: TSG Mentor assists site reliability engineers in identifying and fixing quality issues in existing troubleshooting guides before automation
  2. Offline Preprocessing: Large language models extract structured directed acyclic graphs from unstructured troubleshooting guides and create Query Preparation Plugins for data-intensive operations
  3. Online Execution: DAG-guided scheduler-executor framework runs troubleshooting steps with memory system support and parallel execution of independent operations
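
The online execution stage can be illustrated with a minimal sketch. This is not StepFly's implementation: the step names, the `run_step` stand-in, and the dictionary used as "memory" are all assumptions made for illustration. It only shows the general idea of a scheduler that uses a DAG to decide which steps are ready and runs independent steps concurrently, using Python's standard `graphlib` and `concurrent.futures` modules.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical guide: each step maps to the set of steps it depends on.
GUIDE_DAG = {
    "check_cpu": set(),
    "check_disk": set(),
    "check_recent_deploys": set(),
    "correlate": {"check_cpu", "check_disk", "check_recent_deploys"},
    "recommend_fix": {"correlate"},
}


def run_step(name, memory):
    """Stand-in for an LLM-driven step; reads earlier results from memory."""
    inputs = {dep: memory[dep] for dep in GUIDE_DAG[name]}
    return f"{name} done (inputs: {sorted(inputs)})"


def execute(dag, max_workers=4):
    memory = {}  # results of finished steps, keyed by step name
    order = []   # completion order, kept for traceability
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}  # future -> step name
        while sorter.is_active():
            # Submit every step whose dependencies have all finished.
            for step in sorter.get_ready():
                running[pool.submit(run_step, step, memory)] = step
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                step = running.pop(future)
                memory[step] = future.result()
                order.append(step)
                sorter.done(step)  # unlocks steps that depended on this one
    return memory, order


if __name__ == "__main__":
    results, completion_order = execute(GUIDE_DAG)
    print(completion_order)
```

The three checks have no dependencies, so they are submitted together; `correlate` waits for all of them, and `recommend_fix` runs last. A real system would replace `run_step` with model and tool calls and would need error handling for failed steps.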

According to AWS, agentic AI speeds up incident response and supports more specific, in-depth post-incident analysis, which helps prevent the same errors from recurring [3]. The framework maintains state and traceability across the full incident resolution lifecycle.

Benchmarks and evidence

The reported results cover success rate, execution time, and token consumption on real-world troubleshooting scenarios.

Metric | StepFly Performance | Source
Success Rate | 94% on GPT-4.1 | Microsoft Research evaluation
Execution Time Reduction | 32.9% to 70.4% for parallelizable guides | Empirical study on real-world TSGs
Token Consumption | Lower than baseline approaches | Comparative analysis
Guide Coverage | 92 real-world troubleshooting guides tested | Empirical validation study
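
To see why parallel execution shortens a guide at all, it helps to compare the sum of all step durations (sequential execution) with the longest dependency chain (the best case for parallel execution). The sketch below uses made-up step names and durations and assumes unlimited workers and zero scheduling overhead, so it does not reproduce the 32.9% to 70.4% figures above; it only shows how such a percentage is computed.

```python
from graphlib import TopologicalSorter

# Hypothetical guide and step durations in seconds (illustrative values only).
DAG = {
    "check_cpu": set(),
    "check_disk": set(),
    "check_recent_deploys": set(),
    "correlate": {"check_cpu", "check_disk", "check_recent_deploys"},
    "recommend_fix": {"correlate"},
}
DURATIONS = {
    "check_cpu": 30,
    "check_disk": 20,
    "check_recent_deploys": 10,
    "correlate": 20,
    "recommend_fix": 20,
}


def critical_path_seconds(dag, durations):
    """Length of the longest dependency chain: a lower bound on parallel runtime."""
    finish = {}
    # static_order() yields each step only after all of its dependencies.
    for step in TopologicalSorter(dag).static_order():
        start = max((finish[dep] for dep in dag[step]), default=0.0)
        finish[step] = start + durations[step]
    return max(finish.values())


def time_reduction(dag, durations):
    """Fraction of sequential runtime saved by ideal parallel execution."""
    sequential = sum(durations.values())
    parallel = critical_path_seconds(dag, durations)
    return (sequential - parallel) / sequential


if __name__ == "__main__":
    print(f"sequential: {sum(DURATIONS.values())} s")
    print(f"parallel (ideal): {critical_path_seconds(DAG, DURATIONS)} s")
    print(f"reduction: {time_reduction(DAG, DURATIONS):.1%}")
```

With these numbers the sequential run takes 100 seconds and the longest chain (check_cpu, correlate, recommend_fix) takes 70 seconds, a 30% reduction. Guides with more independent branches save more; a guide that is a single chain saves nothing.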

Who should care

Builders

Software engineers and DevOps practitioners can integrate StepFly’s open-source framework into existing incident management workflows. According to Anthropic, a provider of large language models, AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases [2]. The GitHub repository provides implementation code and sample data for development teams.

Enterprise

Large-scale IT organizations benefit from automating manual troubleshooting processes, which are typically slow and error-prone. IT incident resolution is cited as a concrete use case for agentic systems [2]: enterprises can reduce mean time to resolution while improving consistency across incident response teams.

End users

Site reliability engineers and IT operations teams gain tools for improving troubleshooting guide quality and automating routine incident response tasks. The TSG Mentor component specifically addresses guide quality issues that prevent effective automation.

Investors

Incident management automation is a potential market opportunity as organizations seek to reduce operational costs and improve system reliability. StepFly’s reported results suggest that agentic AI can be practical for enterprise IT operations, although the article's sources do not provide market-size data.

How to use StepFly today

StepFly is available as an open-source framework through Microsoft’s GitHub repository.

  1. Clone the repository: git clone https://github.com/microsoft/StepFly
  2. Install dependencies and configure the framework according to documentation
  3. Use TSG Mentor to assess and improve existing troubleshooting guide quality
  4. Run offline preprocessing to convert guides into structured execution graphs
  5. Deploy the DAG-guided executor for automated incident response

The repository includes sample troubleshooting guides and incident data for testing and development purposes.
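
Before deploying an extracted graph (step 5), it is worth checking that it really is a valid DAG. The sketch below is an assumption-heavy illustration: the JSON layout (`{"step": ["dependency", ...]}`) and the `validate_graph` helper are hypothetical and are not StepFly's file format or API, which this article does not describe. It only shows the kind of check that applies to any step graph: every dependency must exist, and there must be no cycles.

```python
import json
from graphlib import CycleError, TopologicalSorter


def validate_graph(raw_json):
    """Return a list of problems in a step graph; an empty list means none found."""
    steps = json.loads(raw_json)  # hypothetical layout: {"step": ["dependency", ...]}
    problems = []
    for step, deps in steps.items():
        for dep in deps:
            if dep not in steps:
                problems.append(f"{step} depends on unknown step {dep}")
    try:
        TopologicalSorter(steps).prepare()
    except CycleError as err:
        # err.args[1] holds the nodes of one detected cycle.
        problems.append("cycle detected: " + " -> ".join(err.args[1]))
    return problems


if __name__ == "__main__":
    good = '{"collect_logs": [], "analyze": ["collect_logs"]}'
    bad = '{"a": ["b"], "b": ["a"]}'
    print(validate_graph(good))
    print(validate_graph(bad))
```

A graph that fails this check cannot be scheduled safely, because a cycle means some step would wait on itself indirectly.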

StepFly vs competitors

StepFly competes with other AI-powered incident management solutions in the enterprise market.

Solution | Approach | Parallel Execution | Guide Quality Tools | Open Source
StepFly | DAG-guided agentic framework | Yes | TSG Mentor included | Yes
AWS DevOps Agent | Serverless incident investigation | Not disclosed | Not disclosed | No
Neubird AI SRE | Telemetry analysis and correlation | Not disclosed | Not disclosed | No

Risks, limits, and myths

  • StepFly requires high-quality troubleshooting guides as input; poor guide quality limits automation effectiveness
  • The framework depends on large language model performance, which can vary across different incident types
  • Complex troubleshooting scenarios may still require human intervention despite automation capabilities
  • Implementation requires technical expertise in AI systems and incident management processes
  • Token consumption costs may be significant for organizations with high incident volumes
  • The 94% success rate, while high, means roughly 6% of evaluated cases did not succeed, so some incidents will still require manual intervention
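
The last two points can be turned into rough planning arithmetic. In the sketch below only the 94% success rate comes from the article; the incident volume, tokens per incident, and token price are placeholder inputs to replace with your own numbers, and both function names are made up for illustration.

```python
def expected_manual_fallbacks(incidents, success_rate=0.94):
    """Expected number of incidents that automation does not resolve."""
    return incidents * (1.0 - success_rate)


def token_cost_usd(incidents, tokens_per_incident, usd_per_million_tokens):
    """Total model spend for a given incident volume, in US dollars."""
    return incidents * tokens_per_incident * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    # Placeholder inputs, not measured values.
    incidents_per_month = 500
    tokens_per_incident = 200_000
    usd_per_million_tokens = 2.0

    fallbacks = expected_manual_fallbacks(incidents_per_month)
    cost = token_cost_usd(incidents_per_month, tokens_per_incident, usd_per_million_tokens)
    print(f"expected manual fallbacks: {fallbacks:.0f}")
    print(f"estimated token spend: ${cost:.2f}")
```

The estimate treats the evaluated success rate as if it carried over unchanged to your own incidents, which is an optimistic assumption; actual rates depend on guide quality and incident mix.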

FAQ

What is StepFly and how does it automate troubleshooting?

StepFly is Microsoft’s AI agent framework that converts manual troubleshooting guides into automated workflows using directed acyclic graphs and parallel execution capabilities.

How accurate is StepFly for incident resolution?

StepFly achieves a 94% success rate on GPT-4.1 when tested on real-world troubleshooting guides and incidents.

Can StepFly reduce incident resolution time?

Yes, StepFly reduces execution time by 32.9% to 70.4% for parallelizable troubleshooting guides through automated parallel processing.

Is StepFly available for commercial use?

StepFly is open source and available through Microsoft’s GitHub repository with sample data and implementation code. Check the license in the repository for the exact terms that apply to commercial use.

What makes StepFly different from other AI incident management tools?

StepFly provides specialized support for troubleshooting guide quality management, complex control flow interpretation, and parallel execution capabilities.

Do I need existing troubleshooting guides to use StepFly?

Yes, StepFly requires existing troubleshooting guides as input and includes TSG Mentor to help improve guide quality before automation.

What technical skills are needed to implement StepFly?

Implementation requires expertise in AI systems, large language models, and incident management processes for effective deployment.

How does StepFly handle complex incident scenarios?

StepFly uses directed acyclic graphs to manage complex control flows and Query Preparation Plugins for data-intensive troubleshooting operations.
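
This article does not describe the plugin interface, so the following is only a guess at the general shape of a query-preparation helper: a small function that fills a long query template from incident fields, so the model does not have to rewrite the query text itself. The template, field names, and `prepare_cpu_query` function are all hypothetical.

```python
from string import Template

# Hypothetical query text as it might appear in a guide step.
QUERY_TEMPLATE = Template(
    "SELECT ts, cpu_pct FROM metrics "
    "WHERE host = '$host' AND ts >= '$start' AND ts < '$end'"
)


def prepare_cpu_query(incident):
    """Hypothetical plugin: build the full query from incident fields.

    Values are assumed to come from a trusted incident record; a production
    system should use parameterized queries instead of text substitution.
    """
    # substitute() raises KeyError if the template names a missing parameter.
    return QUERY_TEMPLATE.substitute(
        host=incident["host"],
        start=incident["window_start"],
        end=incident["window_end"],
    )


if __name__ == "__main__":
    incident = {
        "host": "web-01",
        "window_start": "2025-01-01T00:00",
        "window_end": "2025-01-01T01:00",
    }
    print(prepare_cpu_query(incident))
```

Keeping the template fixed and filling only the parameters makes the resulting query deterministic for a given incident, which is easier to audit than free-form generated query text.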

What are the costs of running StepFly?

While the framework is open source, organizations incur large language model token consumption costs during automated troubleshooting execution.

Can StepFly work with existing monitoring tools?

StepFly focuses on troubleshooting guide automation rather than direct monitoring tool integration, though it can process data from various sources through Query Preparation Plugins.

Glossary

Agentic AI
AI systems that can act independently with complex goal structures, natural language interfaces, and integrated software tools
Directed Acyclic Graph (DAG)
A directed graph with no cycles. Here, nodes are troubleshooting steps and edges are dependencies between them, which rules out circular dependencies and lets independent steps run in parallel
Query Preparation Plugins (QPPs)
Specialized components that handle data-intensive queries during troubleshooting guide execution
Site Reliability Engineer (SRE)
IT professionals responsible for maintaining system reliability, availability, and performance in production environments
TSG Mentor
StepFly’s tool for assisting site reliability engineers in improving troubleshooting guide quality before automation
Troubleshooting Guide (TSG)
Structured documentation that provides step-by-step procedures for diagnosing and resolving IT incidents

Visit the StepFly GitHub repository at https://github.com/microsoft/StepFly to access the open-source framework, sample data, and implementation documentation.

Sources

  1. AI SRE – Autonomous Incident Resolution – Neubird. https://neubird.ai/
  2. Best practices for building agentic systems | InfoWorld. https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
  3. What is Agentic AI? – Agentic AI Explained – AWS. https://aws.amazon.com/what-is/agentic-ai/
  4. AWS Announces General Availability of DevOps Agent for Automated Incident Investigation – InfoQ. https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
  5. An MSP’s guide to agentic AI. https://superops.com/blog/an-msps-guide-to-agentic-ai
  6. AI agent – Wikipedia. https://en.wikipedia.org/wiki/AI_agent
  7. Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering. https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
  8. Incident response for AI: Same fire, different fuel | Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.
