StepFly: AI Agent Automates IT Troubleshooting Guides

StepFly achieves a 94% success rate automating IT troubleshooting guides with AI agents, and cuts execution time by 32.9% to 70.4% on parallelizable guides through DAG workflows and parallel processing.

StepFly is Microsoft’s agentic AI framework for automating the execution of IT troubleshooting guides (TSGs). In Microsoft Research’s evaluation it reached a 94% success rate on GPT-4.1 and cut execution time by 32.9% to 70.4% on parallelizable guides, using parallel processing and structured workflow management aimed at site reliability engineers.

Released by: Microsoft Research
What it is: AI agent framework for automating IT troubleshooting guides
Who it is for: Site reliability engineers and IT operations teams
Where to get it: GitHub repository at microsoft/StepFly
Price: Open source
  • StepFly automates troubleshooting guide execution with 94% success rate on GPT-4.1
  • Three-stage workflow includes quality improvement, preprocessing, and parallel execution
  • Reduces execution time by 32.9-70.4% for parallelizable troubleshooting guides
  • Converts unstructured guides into directed acyclic graphs for systematic execution
  • Open source framework available on GitHub with sample data and documentation
  • Manual troubleshooting guide execution is slow and error-prone in large-scale IT systems
  • StepFly addresses TSG quality issues, complex control flow, and data-intensive queries
  • The framework enables parallel execution of independent troubleshooting steps
  • Empirical study analyzed 92 real-world troubleshooting guides to inform design
  • DAG-guided scheduler-executor framework ensures correct workflow execution

What is StepFly

StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. The system uses large language models to convert unstructured troubleshooting guides into structured execution workflows. It specifically targets the problems of manual troubleshooting guide execution, which is traditionally slow and error-prone in large-scale IT environments. More broadly, agentic AI can speed up incident response and support more specific, in-depth post-incident analysis that helps prevent the same errors from recurring [3].

What is new vs previous approaches

StepFly introduces specialized capabilities that existing LLM-based incident management solutions lack.

| Feature | Previous LLM Solutions | StepFly |
| --- | --- | --- |
| TSG Quality Management | Limited support | TSG Mentor tool for quality improvement |
| Control Flow Interpretation | Basic sequential processing | Directed acyclic graph extraction and execution |
| Data-Intensive Queries | Generic handling | Dedicated Query Preparation Plugins |
| Parallel Execution | Not supported | Scheduler-executor framework with memory system |
| Workflow Structure | Ad-hoc processing | Three-stage systematic workflow |

How does StepFly work

StepFly operates through a three-stage workflow that transforms unstructured troubleshooting guides into automated execution systems.

  1. Quality Improvement Stage: TSG Mentor tool assists site reliability engineers in improving troubleshooting guide quality and completeness
  2. Offline Preprocessing Stage: LLMs extract structured directed acyclic graphs from unstructured guides and create Query Preparation Plugins for data handling
  3. Online Execution Stage: DAG-guided scheduler-executor framework with memory system executes workflows and supports parallel processing of independent steps

The memory system preserves workflow correctness while independent troubleshooting steps run in parallel. More generally, AI agents can detect issues such as CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [6].
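
The online execution stage can be pictured with a small scheduler-executor sketch. This is not StepFly's implementation: the function `run_dag`, the step names, and the lambda actions are hypothetical, and a plain dictionary stands in for the memory system. The sketch uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) to release a step as soon as its prerequisites finish, and a thread pool to run independent steps in parallel.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(dependencies, actions, max_workers=4):
    """Run each step once all of its prerequisites have finished.

    dependencies: {step: set of steps it depends on}
    actions: {step: callable(memory) -> result}  (hypothetical interface)
    Returns the shared memory dict mapping step -> result.
    """
    memory = {}  # results of finished steps, readable by later steps
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    running = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose prerequisites are all done.
            for step in sorter.get_ready():
                running[pool.submit(actions[step], memory)] = step
            # Block until at least one running step completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                step = running.pop(future)
                memory[step] = future.result()  # re-raises errors from the step
                sorter.done(step)  # may unlock dependent steps
    return memory


# Hypothetical guide: two independent checks feed one diagnosis step.
dependencies = {
    "check_cpu": set(),
    "check_disk": set(),
    "diagnose": {"check_cpu", "check_disk"},
}
actions = {
    "check_cpu": lambda memory: 97,
    "check_disk": lambda memory: 40,
    "diagnose": lambda memory: "cpu" if memory["check_cpu"] > 90 else "disk",
}
result = run_dag(dependencies, actions)
print(result["diagnose"])  # prints "cpu"
```

In this sketch `check_cpu` and `check_disk` can run at the same time because neither depends on the other, while `diagnose` waits for both and reads their results from the shared dictionary.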

Benchmarks and evidence

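In Microsoft Research’s evaluation on real-world troubleshooting scenarios, StepFly reached a 94% success rate on GPT-4.1, used fewer tokens than baseline approaches, and cut execution time by 32.9% to 70.4% on parallelizable TSGs.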

| Metric | StepFly Performance | Source |
| --- | --- | --- |
| Success Rate | 94% on GPT-4.1 | Microsoft Research evaluation |
| Execution Time Reduction | 32.9% to 70.4% for parallelizable TSGs | Microsoft Research evaluation |
| Token Consumption | Lower than baseline approaches | Microsoft Research evaluation |
| Real-world TSGs Analyzed | 92 troubleshooting guides | Empirical study foundation |
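
One way to build intuition for why the time reduction spans such a wide range is to work through the arithmetic on toy graphs. The sketch below assumes the reduction is computed as 1 - (parallel time / sequential time), that workers are unlimited, and that scheduling overhead is zero. Those are simplifying assumptions rather than the evaluation methodology, and the graphs and step durations are made up for illustration.

```python
from graphlib import TopologicalSorter


def critical_path_time(dependencies, durations):
    """Finish time when every step starts as soon as its prerequisites end."""
    finish = {}
    for step in TopologicalSorter(dependencies).static_order():
        # A step can start only after its slowest prerequisite has finished.
        start = max((finish[dep] for dep in dependencies.get(step, ())), default=0)
        finish[step] = start + durations[step]
    return max(finish.values(), default=0)


def time_reduction(dependencies, durations):
    """Fraction of sequential time saved by running independent steps in parallel."""
    sequential = sum(durations.values())
    parallel = critical_path_time(dependencies, durations)
    return 1 - parallel / sequential


# Hypothetical fan-out guide: three independent checks, then one summary step.
fan_out = {"a": set(), "b": set(), "c": set(), "d": {"a", "b", "c"}}
fan_out_durations = {"a": 10, "b": 10, "c": 10, "d": 10}

# Hypothetical chain guide: every step depends on the previous one.
chain = {"a": set(), "b": {"a"}, "c": {"b"}}
chain_durations = {"a": 10, "b": 10, "c": 10}

print(time_reduction(fan_out, fan_out_durations))  # prints 0.5
print(time_reduction(chain, chain_durations))      # prints 0.0
```

Under these assumptions the fan-out guide finishes in half the sequential time, while the purely sequential chain gains nothing, which is why the reported reduction is stated only for parallelizable TSGs.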

Who should care

Builders

Software engineers and DevOps professionals can integrate StepFly’s open-source framework into existing incident management workflows. The system provides APIs and tools for customizing troubleshooting guide automation. According to Anthropic, a provider of large language models (LLMs), AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases, followed by back-office automation, marketing, sales, finance, and data analysis [2].

Enterprise

Large organizations with complex IT infrastructures can reduce incident resolution time and human error through automated troubleshooting. The framework addresses scalability challenges in manual incident management processes. Automation is essential here: manual review cannot keep pace with the volume of incidents in enterprise environments [7].

End users

Site reliability engineers and IT operations teams benefit from reduced manual workload and faster incident resolution. The system maintains human oversight while automating repetitive troubleshooting tasks.

Investors

The incident management automation market represents a significant opportunity as organizations look to AI-powered solutions to reduce operational costs and improve system reliability.

How to use StepFly today

StepFly is available as an open-source framework with complete implementation and sample data.

  1. Clone the repository: git clone https://github.com/microsoft/StepFly
  2. Install dependencies according to the provided requirements file
  3. Prepare troubleshooting guides using the TSG Mentor tool for quality improvement
  4. Run offline preprocessing to convert guides into directed acyclic graphs
  5. Configure the scheduler-executor framework for your IT environment
  6. Deploy the system for automated troubleshooting guide execution
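
Before deploying, it can be worth checking that a graph produced by preprocessing really is acyclic and refers only to steps that exist. The helper below is a hypothetical check written against a plain dictionary of step dependencies. StepFly's actual file format and tooling are not described here, so `validate_workflow` and the step names are assumptions for illustration only.

```python
from graphlib import CycleError, TopologicalSorter


def validate_workflow(dependencies):
    """Check a {step: set_of_prerequisites} mapping.

    Returns (True, execution_order) for a valid DAG, or (False, reason)
    when a prerequisite is undefined or the graph contains a cycle.
    """
    known = set(dependencies)
    for step, prereqs in dependencies.items():
        missing = set(prereqs) - known
        if missing:
            return False, f"{step} depends on undefined steps: {sorted(missing)}"
    try:
        # static_order() raises CycleError if no valid ordering exists.
        order = list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # The second element of args holds the nodes that form the cycle.
        return False, f"cycle detected: {err.args[1]}"
    return True, order


print(validate_workflow({"collect_logs": set(), "analyze": {"collect_logs"}}))
print(validate_workflow({"a": {"b"}, "b": {"a"}}))
```

A check like this catches malformed graphs early, before a scheduler tries to execute them.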

StepFly vs competitors

StepFly competes with other AI-powered incident management and troubleshooting automation solutions.

| Feature | StepFly | Traditional LLM Solutions | Manual TSG Execution |
| --- | --- | --- | --- |
| Success Rate | 94% on GPT-4.1 | Not yet disclosed | Variable, error-prone |
| Parallel Execution | Supported with 32.9-70.4% time reduction | Not supported | Not supported |
| Quality Management | TSG Mentor tool included | Limited | Manual review |
| Structured Workflow | DAG-based execution | Sequential processing | Ad-hoc execution |
| Open Source | Yes | Varies | N/A |

Risks, limits, and myths

  • System performance depends on troubleshooting guide quality and completeness
  • Complex IT environments may require extensive customization and configuration
  • LLM accuracy limitations can affect automated decision-making in critical incidents
  • Parallel execution benefits only apply to troubleshooting guides with independent steps
  • Human oversight remains necessary for high-stakes incident resolution scenarios
  • Integration complexity may require significant engineering resources for deployment
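
The human-oversight point can be made concrete with a small approval gate. This is a generic pattern, not a StepFly feature: the function `run_with_approval`, the `high_risk` set, and the `approve` callback are all hypothetical names chosen for the sketch.

```python
def run_with_approval(step, action, high_risk, approve):
    """Run `action` for `step`, but only after approval if the step is high risk.

    high_risk: set of step names that must not run without a human decision
    approve: callable(step) -> bool, e.g. a prompt or a ticketing-system check
    """
    if step in high_risk and not approve(step):
        # Leave the step for a human instead of executing it automatically.
        return {"step": step, "status": "needs_human", "result": None}
    return {"step": step, "status": "done", "result": action()}


high_risk = {"restart_service"}
deny_all = lambda step: False  # stand-in for a reviewer who declines

print(run_with_approval("collect_logs", lambda: "42 lines", high_risk, deny_all))
print(run_with_approval("restart_service", lambda: "restarted", high_risk, deny_all))
```

Read-only diagnostic steps pass straight through, while a step marked high risk is held back until a person approves it.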

FAQ

What is StepFly and how does it automate troubleshooting?

StepFly is Microsoft’s agentic AI framework that converts unstructured IT troubleshooting guides into automated workflows, achieving a 94% success rate on GPT-4.1 through structured execution and parallel processing.

How much faster is StepFly compared to manual troubleshooting?

StepFly reduces execution time by 32.9% to 70.4% for parallelizable troubleshooting guides while consuming fewer tokens than baseline approaches according to Microsoft’s evaluation.

What makes StepFly different from other AI incident management tools?

StepFly provides specialized support for TSG quality management, complex control flow interpretation, data-intensive queries, and parallel execution that existing LLM-based solutions lack.

Is StepFly available for commercial use?

StepFly is open source and available on GitHub at microsoft/StepFly with complete implementation code and sample data for commercial and research use.

What are the three stages of StepFly’s workflow?

StepFly operates through quality improvement using TSG Mentor, offline preprocessing with DAG extraction, and online execution with parallel processing capabilities.

How does StepFly handle complex troubleshooting guide structures?

StepFly extracts directed acyclic graphs from unstructured guides and uses a DAG-guided scheduler-executor framework with memory system to ensure correct workflow execution.

What types of IT environments can benefit from StepFly?

Large-scale IT systems with complex troubleshooting procedures benefit most from StepFly’s automation, particularly environments with parallelizable troubleshooting steps and quality management needs.

Does StepFly require human oversight for incident management?

StepFly automates troubleshooting guide execution but maintains integration points for human oversight, particularly for high-stakes incidents and quality improvement processes through TSG Mentor.

What LLM models does StepFly support?

StepFly achieved a 94% success rate on GPT-4.1 according to Microsoft’s evaluation; support for other LLMs has not yet been disclosed.

How does StepFly improve troubleshooting guide quality?

StepFly includes TSG Mentor, a dedicated tool that assists site reliability engineers in improving troubleshooting guide quality and completeness before automated execution.

Glossary

Agentic AI
AI systems that can act independently with complex goal structures, natural language interfaces, and integration of software tools or planning systems
DAG (Directed Acyclic Graph)
A structured representation of workflows where tasks have dependencies but no circular references, enabling parallel execution of independent steps
TSG (Troubleshooting Guide)
Structured documentation that provides step-by-step procedures for diagnosing and resolving IT incidents and system issues
SRE (Site Reliability Engineer)
IT professionals responsible for maintaining system reliability, performance, and incident response in large-scale technology environments
Query Preparation Plugins (QPPs)
Specialized components in StepFly that handle data-intensive queries during troubleshooting guide execution
TSG Mentor
StepFly’s tool for assisting site reliability engineers in improving troubleshooting guide quality and completeness

Visit the StepFly GitHub repository at microsoft/StepFly to download the open-source framework and explore sample troubleshooting guide automation implementations.

Sources

  1. Neubird AI SRE – Autonomous Incident Resolution
  2. InfoWorld – Best practices for building agentic systems
  3. AWS – What is Agentic AI?
  4. InfoQ – AWS Announces General Availability of DevOps Agent
  5. Wikipedia – AI agent
  6. SuperOps – An MSP’s guide to agentic AI
  7. Microsoft Security Blog – Incident response for AI
  8. Automation Anywhere – What is Agentic AI?

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.
