Frontier Signal

StepFly: Microsoft’s AI Agent for Automated IT Troubleshooting

StepFly automates IT troubleshooting guides using AI agents, achieving a 94% success rate with GPT-4.1 and cutting execution time by 32.9% to 70.4% for parallelizable guides.

StepFly is Microsoft’s agentic AI framework for automating IT troubleshooting guides. In the reported evaluation it reaches a 94% success rate with GPT-4.1 and cuts execution time by 32.9% to 70.4% for parallelizable guides, using structured workflow extraction and DAG-guided scheduling.

Released by: Microsoft Research
Release date:
What it is: End-to-end agentic framework for automating IT troubleshooting guides
Who it’s for: Site reliability engineers and IT operations teams
Where to get it: GitHub repository at microsoft/StepFly
Price: Open source
  • StepFly automates manual troubleshooting guide execution using a three-stage agentic workflow
  • The system achieves a 94% success rate with GPT-4.1, with lower time and token consumption than baselines
  • Parallel execution cuts execution time by 32.9% to 70.4% for troubleshooting guides with parallelizable steps
  • TSG Mentor tool helps site reliability engineers improve troubleshooting guide quality
  • Framework processes unstructured guides into structured execution graphs with dedicated query plugins
  • StepFly addresses critical gaps in LLM-based incident management through specialized troubleshooting guide automation
  • The framework’s three-stage approach combines guide quality improvement, offline preprocessing, and online execution
  • Empirical study of 92 real-world troubleshooting guides informed the system’s design and capabilities
  • DAG-guided scheduler-executor framework enables parallel processing of independent troubleshooting steps
  • Open source availability lowers the barrier to adoption for enterprise IT operations teams

What is StepFly

StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for large-scale IT incident management. The system addresses manual execution challenges that make traditional troubleshooting slow and error-prone in enterprise environments.

Microsoft Research developed StepFly after conducting an empirical study on 92 real-world troubleshooting guides. Agentic AI offers autonomy, adaptability, and the ability to handle complex, dynamic environments [8], making it suitable for IT operations automation.

The framework specifically targets troubleshooting guide quality issues, complex control flow interpretation, data-intensive query handling, and execution parallelism exploitation. AI agents can detect issues like CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [6].

What is new vs previous approaches

StepFly introduces specialized capabilities that existing LLM-based incident management solutions lack.

Feature | Previous LLM Solutions | StepFly
Guide Quality Management | Limited support | TSG Mentor tool for quality improvement
Control Flow Handling | Basic sequential processing | Structured DAG extraction and execution
Data Query Processing | Generic query handling | Dedicated Query Preparation Plugins (QPPs)
Execution Parallelism | Sequential execution only | Parallel processing of independent steps
Memory System | Limited context retention | Comprehensive workflow memory system

How does StepFly work

StepFly operates through a three-stage workflow that transforms unstructured troubleshooting guides into automated execution systems.

  1. Guide Quality Improvement: TSG Mentor assists site reliability engineers in identifying and fixing troubleshooting guide quality issues before automation
  2. Offline Preprocessing: LLMs extract structured execution directed acyclic graphs (DAGs) from unstructured guides and create dedicated Query Preparation Plugins
  3. Online Execution: A DAG-guided scheduler-executor framework with a memory system keeps workflow execution correct and runs independent steps in parallel
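
To make the offline preprocessing stage concrete, here is a minimal sketch of what a structured execution graph could look like once it has been extracted from a guide. The `Step` schema, the `validate_dag` helper, and the four example steps are illustrative assumptions, not StepFly's actual data model. The sketch only shows the general idea of recording dependencies and checking that the result is acyclic.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Step:
    """One node of the execution graph (hypothetical schema, not StepFly's)."""
    step_id: str
    description: str
    depends_on: list[str] = field(default_factory=list)


def validate_dag(steps: list[Step]) -> list[str]:
    """Return one valid execution order, or raise ValueError if the graph is not a DAG."""
    ids = {s.step_id for s in steps}
    for s in steps:
        missing = [d for d in s.depends_on if d not in ids]
        if missing:
            raise ValueError(f"{s.step_id} depends on unknown steps: {missing}")
    # Map each step to the set of steps that must finish before it.
    graph = {s.step_id: set(s.depends_on) for s in steps}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"cycle detected: {exc.args[1]}") from exc


# Made-up guide: two independent queries feed one decision step.
tsg = [
    Step("check_alert", "Confirm the alert is still firing"),
    Step("query_cpu", "Query CPU usage for the affected node", ["check_alert"]),
    Step("query_logs", "Query recent error logs", ["check_alert"]),
    Step("decide", "Pick a mitigation from both results", ["query_cpu", "query_logs"]),
]

order = validate_dag(tsg)
print(order)
```

Because `query_cpu` and `query_logs` depend only on `check_alert`, neither has to wait for the other, which is the property a scheduler can exploit.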

The system’s DAG-guided approach enables intelligent scheduling of troubleshooting steps. Independent operations execute in parallel while maintaining dependency relationships between sequential steps.
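
The scheduling idea can be sketched with the Python standard library. This is an assumption-laden illustration rather than StepFly's scheduler: `run_step` is a hypothetical stand-in for real step execution, and the graph is invented. It shows a step starting as soon as all of its predecessors have finished, so independent steps overlap.

```python
import asyncio
from graphlib import TopologicalSorter


async def run_step(step_id: str, log: list[str]) -> None:
    """Hypothetical stand-in for real step execution (an LLM call or a data query)."""
    log.append(f"start {step_id}")
    await asyncio.sleep(0.01)
    log.append(f"done {step_id}")


async def run_dag(graph: dict[str, set[str]]) -> list[str]:
    """Run every step once all of its predecessors have finished."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    log: list[str] = []
    running: dict[asyncio.Task, str] = {}
    while sorter.is_active():
        # Launch every step whose dependencies are already satisfied.
        for step_id in sorter.get_ready():
            running[asyncio.create_task(run_step(step_id, log))] = step_id
        finished, _ = await asyncio.wait(running, return_when=asyncio.FIRST_COMPLETED)
        for task in finished:
            task.result()  # re-raise any step failure
            sorter.done(running.pop(task))  # unblocks dependent steps
    return log


# Each key maps to the set of steps it depends on.
graph = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_logs": {"check_alert"},
    "decide": {"query_cpu", "query_logs"},
}
log = asyncio.run(run_dag(graph))
print(log)
```

In the resulting log both queries start before either one finishes, while `decide` starts only after both are done.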

Query Preparation Plugins handle data-intensive operations by preprocessing common query patterns. The memory system tracks execution state and maintains context across complex troubleshooting workflows.
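
One way to picture the two components together is a plugin that fills a fixed query template from incident context and stores the result in shared memory for later steps. Everything in this sketch is hypothetical: the `WorkflowMemory` shape, the `cpu_query_plugin` function, and the query text, table, and column names are invented for illustration and are not taken from StepFly.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class WorkflowMemory:
    """Shared state across steps (hypothetical shape)."""
    incident: dict[str, str]
    results: dict[str, object] = field(default_factory=dict)


def cpu_query_plugin(memory: WorkflowMemory) -> str:
    """Hypothetical plugin: build a query from a fixed template plus incident context."""
    template = Template(
        "Metrics | where Node == '$node' and Timestamp > ago($window) "
        "| summarize avg(CpuPercent)"
    )
    return template.substitute(
        node=memory.incident["node"],
        window=memory.incident.get("window", "1h"),  # default window is an assumption
    )


memory = WorkflowMemory(incident={"node": "web-07"})
query = cpu_query_plugin(memory)
memory.results["query_cpu"] = query  # later steps can read this from memory
print(query)
```

The design point is that the long, error-prone part of the query lives in a template prepared ahead of time, so the online stage only has to supply a few parameters.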

Benchmarks and evidence

The reported results cover success rate, execution time, and token consumption.

Metric | StepFly Performance | Source
Success Rate on GPT-4.1 | 94% | Microsoft Research evaluation
Execution Time Reduction | 32.9% to 70.4% for parallelizable TSGs | Empirical evaluation
Token Consumption | Reduced vs baselines | Performance comparison
Real-world TSGs Studied | 92 troubleshooting guides | Empirical study

The evaluation used real-world troubleshooting guides and incidents to validate StepFly’s effectiveness. Performance improvements were most significant for troubleshooting guides with parallelizable steps.
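
The arithmetic below shows why the gain depends on how much of a guide can run in parallel. The step durations and dependencies are made up for illustration and are not taken from the evaluation. Sequential time is the sum of all step durations, parallel time is the longest dependency chain, and the reduction is one minus their ratio.

```python
def parallel_time(durations: dict[str, float], deps: dict[str, set[str]]) -> float:
    """Length of the longest dependency chain (the critical path)."""
    memo: dict[str, float] = {}

    def finish(step: str) -> float:
        # A step finishes after its slowest predecessor plus its own duration.
        if step not in memo:
            memo[step] = durations[step] + max(
                (finish(d) for d in deps[step]), default=0.0
            )
        return memo[step]

    return max(finish(s) for s in durations)


# Hypothetical durations in seconds.
durations = {"check_alert": 10.0, "query_cpu": 40.0, "query_logs": 60.0, "decide": 10.0}
deps = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_logs": {"check_alert"},
    "decide": {"query_cpu", "query_logs"},
}

sequential = sum(durations.values())   # 120.0
parallel = parallel_time(durations, deps)  # 10 + 60 + 10 = 80.0
reduction = 1 - parallel / sequential  # about 0.333, i.e. 33.3% less time


def speedup(time_reduction: float) -> float:
    """Convert a fractional time reduction into a speedup factor."""
    return 1 / (1 - time_reduction)


print(f"reduction: {reduction:.1%}, speedup: {speedup(reduction):.2f}x")
```

Read this way, a time reduction is not the same as "percent faster": the reported 32.9% and 70.4% reductions correspond to speedup factors of roughly 1.49x and 3.38x. A guide whose steps form a single chain has a critical path equal to the sequential time, so its reduction is zero.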

Who should care

Builders

Software engineers developing incident management systems can build on StepFly’s open source framework. Its structured approach (DAG extraction, query plugins, scheduled execution) may also carry over to other operational workflows beyond troubleshooting guides.

Enterprise

Agentic AI enhances incident response speed while also providing a more specific and in-depth post-incident analysis to prevent the same errors from recurring in the future [3]. Large-scale IT operations teams benefit from reduced manual troubleshooting overhead and faster incident resolution.

End users

Site reliability engineers gain access to TSG Mentor for improving troubleshooting guide quality. IT operations staff experience faster incident resolution through automated workflow execution.

Investors

The framework addresses the growing market for AI-powered IT operations. AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases [2], which suggests demand for engineering-focused agents.

How to use StepFly today

StepFly is available as an open source project with sample data and implementation guidance.

  1. Access the GitHub repository at https://github.com/microsoft/StepFly
  2. Review the sample troubleshooting guides and incident data provided
  3. Install the framework dependencies and configure your LLM integration
  4. Use TSG Mentor to assess and improve existing troubleshooting guide quality
  5. Process troubleshooting guides through the offline preprocessing stage
  6. Deploy the online execution system with your IT monitoring infrastructure

The repository includes documentation for integrating StepFly with existing incident management workflows. Sample data demonstrates the framework’s capabilities across different troubleshooting scenarios.

StepFly vs competitors

StepFly competes with other AI-powered incident management and troubleshooting automation solutions.

Feature | StepFly | AWS DevOps Agent | Neubird AI SRE
Open Source | Yes | No | No
Parallel Execution | Yes, 32.9-70.4% execution time reduction | Not yet disclosed | Not yet disclosed
Guide Quality Tools | TSG Mentor included | Not yet disclosed | Not yet disclosed
Success Rate | 94% on GPT-4.1 | Not yet disclosed | Not yet disclosed
Structured Workflow | DAG-based execution | Agentic investigation | Autonomous analysis

Risks, limits, and myths

  • Quality Dependency: StepFly’s effectiveness depends on troubleshooting guide quality, requiring TSG Mentor preprocessing for optimal results
  • LLM Limitations: Performance varies with underlying language model capabilities and may require GPT-4.1 or equivalent for best results
  • Parallel Processing Scope: Execution time improvements only apply to troubleshooting guides with parallelizable steps
  • Integration Complexity: Deployment requires integration with existing IT monitoring and incident management systems
  • Myth – Universal Automation: Not all troubleshooting scenarios are suitable for automated execution without human oversight
  • Myth – Immediate Deployment: Organizations need guide quality assessment and preprocessing before production deployment

FAQ

What is StepFly and how does it automate troubleshooting?

StepFly is Microsoft’s agentic AI framework that automates IT troubleshooting guide execution through a three-stage workflow: guide quality improvement, offline preprocessing into structured DAGs, and online execution with parallel processing capabilities.

How accurate is StepFly for automated incident resolution?

StepFly achieves a 94% success rate on GPT-4.1 when evaluated on real-world troubleshooting guides and incidents, outperforming baseline approaches with reduced time and token consumption.

What performance improvements does StepFly provide?

StepFly delivers 32.9% to 70.4% execution time reduction for parallelizable troubleshooting guides through its DAG-guided scheduler-executor framework that processes independent steps simultaneously.

Is StepFly available for commercial use?

StepFly is open source and available through the GitHub repository at microsoft/StepFly, including sample data and implementation documentation. Check the repository’s license for the exact terms that apply to commercial use.

What is TSG Mentor in StepFly?

TSG Mentor is StepFly’s tool that assists site reliability engineers in identifying and improving troubleshooting guide quality issues before automated execution, ensuring better automation outcomes.

How does StepFly handle complex troubleshooting workflows?

StepFly uses directed acyclic graphs (DAGs) to represent troubleshooting workflows, enabling proper dependency management while allowing parallel execution of independent steps through its scheduler-executor framework.

What LLM requirements does StepFly have?

StepFly achieved its 94% success rate using GPT-4.1, though the framework is designed to work with other large language models with appropriate capabilities for workflow understanding and execution.

Can StepFly integrate with existing IT monitoring systems?

Yes, StepFly includes Query Preparation Plugins (QPPs) for handling data-intensive queries and can integrate with existing IT monitoring infrastructure through its online execution framework.

What types of troubleshooting guides work best with StepFly?

StepFly works with various troubleshooting guides but provides maximum benefit for guides with parallelizable steps and well-structured workflows that can be represented as directed acyclic graphs.

How was StepFly validated for real-world use?

Microsoft Research conducted an empirical study on 92 real-world troubleshooting guides and validated StepFly’s performance using actual incidents, demonstrating its effectiveness in production-like scenarios.

What are the main components of StepFly’s architecture?

StepFly consists of TSG Mentor for guide quality improvement, offline preprocessing for DAG extraction and QPP creation, and online execution with a DAG-guided scheduler-executor framework plus memory system.

How does StepFly compare to manual troubleshooting approaches?

StepFly addresses the slow and error-prone nature of manual troubleshooting guide execution by automating workflow interpretation, enabling parallel processing, and maintaining consistent execution quality across incidents.

Glossary

Agentic AI
AI systems that can act autonomously, make decisions, and execute tasks without constant human supervision, often using large language models for control flow
DAG (Directed Acyclic Graph)
A structured representation of workflows where tasks have directional dependencies but no circular references, enabling parallel execution of independent paths
Query Preparation Plugins (QPPs)
Specialized components in StepFly that handle data-intensive queries by preprocessing common patterns and optimizing data retrieval operations
Site Reliability Engineer (SRE)
IT professionals responsible for maintaining system reliability, availability, and performance through operational practices and automation
TSG (Troubleshooting Guide)
Structured documentation that provides step-by-step procedures for diagnosing and resolving specific IT system issues and incidents
TSG Mentor
StepFly’s tool that helps site reliability engineers identify and improve troubleshooting guide quality before automated execution

Visit the StepFly GitHub repository at microsoft/StepFly to access the open source framework, sample data, and implementation documentation for your IT operations automation needs.

Sources

  1. Neubird AI SRE – Autonomous Incident Resolution. https://neubird.ai/
  2. Best practices for building agentic systems | InfoWorld. https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
  3. What is Agentic AI? – Agentic AI Explained – AWS. https://aws.amazon.com/what-is/agentic-ai/
  4. AWS Announces General Availability of DevOps Agent for Automated Incident Investigation – InfoQ. https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
  5. AI agent – Wikipedia. https://en.wikipedia.org/wiki/AI_agent
  6. An MSP’s guide to agentic AI. https://superops.com/blog/an-msps-guide-to-agentic-ai
  7. Incident response for AI: Same fire, different fuel | Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/
  8. What is Agentic AI? Key Benefits & Features. https://www.automationanywhere.com/rpa/agentic-ai

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.
