Frontier Signal

StepFly: AI-Powered Troubleshooting Guide Automation

StepFly automates IT troubleshooting guides using AI agents, achieving a 94% success rate with GPT-4.1 and reducing execution time by up to 70.4% for guides whose steps can run in parallel.


StepFly is an AI-powered framework that automates troubleshooting guide execution for IT incident management. It achieves a 94% success rate with GPT-4.1 and reduces execution time by up to 70.4% through parallel processing and structured workflow automation.

Released by: Microsoft Research
Release date: (not listed)
What it is: AI-powered framework for automating IT troubleshooting guides
Who it is for: Site reliability engineers and IT operations teams
Where to get it: GitHub repository at microsoft/StepFly
Price: Open source
  • StepFly automates manual troubleshooting guide execution using a three-stage agentic AI workflow
  • The framework achieves a 94% success rate with GPT-4.1, outperforming existing baseline approaches
  • Parallel execution capabilities reduce troubleshooting time by 32.9% to 70.4% for compatible guides
  • The TSG Mentor tool helps site reliability engineers improve troubleshooting guide quality
  • The system converts unstructured guides into structured execution graphs for automated workflow management
  • Manual troubleshooting guide execution is slow and error-prone in large-scale IT systems
  • StepFly addresses key challenges including TSG quality issues, complex control flow, and data-intensive queries
  • The framework uses directed acyclic graphs to structure and parallelize troubleshooting workflows
  • Empirical evaluation on 92 real-world troubleshooting guides demonstrates significant performance improvements
  • Open source availability enables widespread adoption across IT operations teams

What is StepFly

StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. The system transforms manual, error-prone troubleshooting processes into automated workflows using large language models and structured execution graphs. AI agents can detect issues like CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [5].

The framework addresses critical challenges in incident management where manual execution of troubleshooting guides creates bottlenecks and introduces human error. StepFly leverages agentic AI principles to provide autonomous decision-making capabilities while maintaining oversight and control for site reliability engineers.

What is new vs previous approaches

StepFly introduces specialized support for troubleshooting guide automation that existing LLM-based solutions lack.

Feature | Previous LLM Solutions | StepFly
TSG Quality Management | No specialized support | TSG Mentor tool for guide improvement
Control Flow Interpretation | Limited complex workflow handling | Structured DAG extraction and execution
Data-Intensive Queries | Basic query processing | Dedicated Query Preparation Plugins
Parallel Execution | Sequential processing only | DAG-guided parallel step execution
Memory System | Limited workflow state tracking | Comprehensive memory for workflow continuity
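
As a rough illustration of the Query Preparation Plugins row above, the sketch below treats a plugin as a named query template plus the incident fields it needs. The `QueryPlugin` class, the SQL-like query text, and the field names are all hypothetical and are not taken from the StepFly repository; the source only says that such plugins exist to handle data-intensive queries.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class QueryPlugin:
    """Hypothetical plugin: a query template and the incident fields it requires."""
    name: str
    template: Template
    required: tuple

    def prepare(self, incident: dict) -> str:
        # Fail early if the incident record lacks a field the query needs.
        missing = [key for key in self.required if key not in incident]
        if missing:
            raise KeyError(f"{self.name}: missing incident fields {missing}")
        # Plain string substitution keeps the sketch short; a real data source
        # should receive bound parameters rather than interpolated text.
        return self.template.substitute({k: incident[k] for k in self.required})


# Illustrative plugin and incident record (both invented for this example).
cpu_plugin = QueryPlugin(
    name="cpu_by_host",
    template=Template(
        "SELECT ts, cpu_pct FROM metrics WHERE host = '$host' AND ts > '$since'"
    ),
    required=("host", "since"),
)

query = cpu_plugin.prepare({"host": "web-01", "since": "2024-01-01T00:00:00Z"})
print(query)
```

Keeping the template and its required fields together means a missing incident field is reported before any query is sent, which is one plausible reason to prepare queries offline rather than having the model write them during an incident.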

How does StepFly work

StepFly operates through a three-stage workflow that transforms unstructured troubleshooting guides into automated execution systems.

  1. Guide Quality Enhancement: TSG Mentor assists site reliability engineers in improving troubleshooting guide quality and structure before automation
  2. Offline Preprocessing: LLMs extract structured execution directed acyclic graphs from unstructured guides and create Query Preparation Plugins for data operations
  3. Online Execution: DAG-guided scheduler-executor framework with memory system ensures correct workflow execution and supports parallel processing of independent steps
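
The online execution stage can be pictured with a small scheduler sketch. It uses Python's standard `graphlib.TopologicalSorter` and a thread pool to run every step whose dependencies are finished, and it records outputs in a shared dictionary that stands in for the memory system. The step names, the `run_step` stub, and the `execute` function are assumptions made for illustration; this is not StepFly's actual scheduler.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical guide: each step maps to the set of steps it depends on.
TSG_DAG = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_disk": {"check_alert"},
    "diagnose": {"query_cpu", "query_disk"},
}


def run_step(name, memory):
    """Stand-in executor; a real one would call an LLM or a query plugin."""
    inputs = {dep: memory[dep] for dep in TSG_DAG[name]}
    return f"{name} done (inputs: {sorted(inputs)})"


def execute(dag, max_workers=4):
    sorter = TopologicalSorter(dag)
    sorter.prepare()   # raises graphlib.CycleError if the graph has a cycle
    memory = {}        # step name -> output, shared with later steps
    order = []         # completion order, kept for traceability
    running = {}       # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose dependencies have all completed.
            for step in sorter.get_ready():
                running[pool.submit(run_step, step, memory)] = step
            # Block until at least one running step finishes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                step = running.pop(future)
                memory[step] = future.result()
                order.append(step)
                sorter.done(step)  # may unlock dependent steps
    return memory, order


memory, order = execute(TSG_DAG)
print(order)
```

In this example `query_cpu` and `query_disk` become ready at the same time and run concurrently, while `diagnose` waits for both. Their relative completion order can vary between runs.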

More generally, agentic AI can speed up incident response and produce more specific, in-depth post-incident analysis that helps prevent the same errors from recurring [3]. StepFly maintains state and traceability across the full troubleshooting lifecycle.

Benchmarks and evidence

StepFly demonstrates significant performance improvements across multiple metrics in empirical evaluations.

Metric | StepFly Performance | Source
Success Rate on GPT-4.1 | 94% | Microsoft Research evaluation
Execution Time Reduction | 32.9% to 70.4% for parallelizable TSGs | Microsoft Research evaluation
Real-world TSGs Analyzed | 92 troubleshooting guides | Empirical study dataset
Token Consumption | Lower than baseline approaches | Microsoft Research comparison
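
To see where an execution time reduction of this kind comes from, the sketch below compares running every step one after another with running independent steps in parallel, where total time is bounded by the longest dependency chain (the critical path). The step durations and dependencies are invented for illustration and are not StepFly measurements. The calculation also assumes unlimited workers and no scheduling overhead.

```python
# Hypothetical step durations in seconds (not measured data).
DURATIONS = {"check_alert": 10, "query_cpu": 40, "query_disk": 30, "diagnose": 20}
DEPENDS_ON = {
    "check_alert": [],
    "query_cpu": ["check_alert"],
    "query_disk": ["check_alert"],
    "diagnose": ["query_cpu", "query_disk"],
}


def sequential_time(durations):
    """Total time when steps run strictly one after another."""
    return sum(durations.values())


def critical_path_time(durations, depends_on):
    """Finish time of the last step when independent steps run in parallel."""
    finish = {}

    def finish_time(step):
        if step not in finish:
            # A step starts once its slowest dependency has finished.
            start = max((finish_time(d) for d in depends_on[step]), default=0)
            finish[step] = start + durations[step]
        return finish[step]

    return max(finish_time(step) for step in durations)


def reduction_percent(durations, depends_on):
    seq = sequential_time(durations)
    par = critical_path_time(durations, depends_on)
    return 100 * (seq - par) / seq


print(reduction_percent(DURATIONS, DEPENDS_ON))  # 30.0
```

Here the sequential run takes 100 seconds and the critical path takes 70 seconds, so the reduction is 30%. A guide whose steps form a single chain has a critical path equal to its sequential time and sees no reduction, which is consistent with the reported gains applying only to parallelizable guides.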

Who should care

Builders

Software engineers and DevOps professionals can integrate StepFly into existing incident management workflows. According to Anthropic, a provider of large language models, software engineering is the most common deployment area for AI agents, accounting for roughly half of use cases [2]. The open source framework enables customization for specific infrastructure requirements.

Enterprise

Large-scale IT operations teams benefit from reduced mean time to resolution and improved incident response consistency. Agentic engineering operates at a higher level of abstraction: it’s a control plane that orchestrates cross-team workflows, maintains long-term memory across agents, and manages state and traceability across the full software delivery lifecycle [7].

End users

System users experience fewer service disruptions and faster resolution times when incidents occur. Automated troubleshooting reduces the human error factor that contributes to extended outages.

Investors

The framework represents Microsoft’s investment in agentic AI for enterprise operations, demonstrating practical applications of LLM technology in critical infrastructure management.

How to use StepFly today

StepFly is available as an open source project with implementation guidance for immediate deployment.

  1. Clone the repository from GitHub at microsoft/StepFly
  2. Install required dependencies including LLM access credentials
  3. Prepare existing troubleshooting guides using the TSG Mentor tool
  4. Configure Query Preparation Plugins for your data sources
  5. Deploy the DAG-guided scheduler-executor framework in your environment
  6. Test automation with non-critical troubleshooting scenarios
  7. Gradually expand to production incident management workflows

StepFly vs competitors

StepFly competes with other AI-powered incident management solutions in the enterprise market.

Feature | StepFly | AWS DevOps Agent | Neubird AI SRE
Open Source | Yes | No | No
Parallel Execution | Yes, DAG-guided | Not yet disclosed | Not yet disclosed
TSG Quality Tools | TSG Mentor included | Not yet disclosed | Not yet disclosed
Success Rate | 94% on GPT-4.1 | Not yet disclosed | Not yet disclosed
Platform Integration | Multi-platform | AWS-focused | Multi-platform

Risks, limits, and myths

  • LLM dependency creates potential points of failure if model services become unavailable
  • Complex troubleshooting scenarios may require human oversight despite automation capabilities
  • Initial setup requires significant investment in guide preparation and system configuration
  • Parallel execution benefits only apply to troubleshooting guides with independent steps
  • Success rates may vary significantly based on troubleshooting guide quality and complexity
  • Myth: Complete replacement of human SREs – StepFly augments rather than replaces human expertise
  • Myth: Universal compatibility – System requires structured input and may not work with all existing guides
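
The first two risks above (model unavailability and the need for human oversight) suggest wrapping each automated step in a retry-then-escalate guard. The sketch below is one generic way to do that. The `ModelUnavailable` exception, the `call_model` callback, and the result fields are hypothetical and do not describe how StepFly handles failures.

```python
import time


class ModelUnavailable(Exception):
    """Hypothetical error raised when the LLM service cannot be reached."""


def run_with_escalation(step, call_model, retries=2, backoff_s=0.0):
    """Try a step a few times, then hand it to a human instead of failing silently."""
    for attempt in range(retries + 1):
        try:
            result = call_model(step)
            return {"step": step, "status": "automated", "result": result}
        except ModelUnavailable:
            if attempt < retries:
                # Exponential backoff before the next attempt.
                time.sleep(backoff_s * (2 ** attempt))
    return {"step": step, "status": "escalated_to_human", "result": None}


def always_down(step):
    # Simulated outage used only to demonstrate the escalation path.
    raise ModelUnavailable(step)


print(run_with_escalation("restart_service", always_down)["status"])
```

With the simulated outage, all three attempts fail and the step is marked `escalated_to_human`, so an engineer can pick it up from the recorded state.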

FAQ

What is StepFly and how does it work?

StepFly is an AI-powered framework that automates IT troubleshooting guide execution through a three-stage workflow involving guide quality enhancement, offline preprocessing, and online execution with parallel processing capabilities.

How accurate is StepFly for troubleshooting automation?

StepFly achieves a 94% success rate when using GPT-4.1, based on empirical evaluation with 92 real-world troubleshooting guides.

Can StepFly reduce incident resolution time?

Yes, StepFly reduces execution time by 32.9% to 70.4% for troubleshooting guides that support parallel processing of independent steps.

Is StepFly available for free?

StepFly is open source and available at no cost through the GitHub repository at microsoft/StepFly.

What makes StepFly different from other AI incident management tools?

StepFly provides specialized support for troubleshooting guide quality management, complex control flow interpretation, data-intensive queries, and parallel execution that existing LLM-based solutions lack.

Do I need technical expertise to implement StepFly?

Implementation requires software engineering knowledge for system integration, LLM configuration, and troubleshooting guide preparation using the included TSG Mentor tool.

Can StepFly work with existing troubleshooting documentation?

StepFly transforms unstructured troubleshooting guides into structured execution graphs, but guides may require quality improvements using the TSG Mentor tool before automation.

What are the main limitations of StepFly?

StepFly depends on LLM availability, requires structured input preparation, and may need human oversight for complex scenarios despite its 94% success rate.

How does StepFly handle parallel troubleshooting steps?

StepFly uses directed acyclic graphs to identify independent troubleshooting steps and execute them in parallel, which reduced execution time by 32.9% to 70.4% for parallelizable guides in the reported evaluation.

What infrastructure is needed to run StepFly?

StepFly requires access to large language models, integration with existing monitoring systems, and deployment of the DAG-guided scheduler-executor framework.

Glossary

Agentic AI: AI systems with complex goal structures, natural language interfaces, and capacity to act independently with integrated software tools
DAG (Directed Acyclic Graph): Structured representation of workflow steps that enables parallel execution of independent tasks
TSG (Troubleshooting Guide): Documentation that provides step-by-step instructions for diagnosing and resolving IT system issues
SRE (Site Reliability Engineer): IT professional responsible for maintaining system reliability, availability, and performance
Query Preparation Plugins: Specialized components that handle data-intensive operations within troubleshooting workflows
TSG Mentor: Tool within StepFly that assists engineers in improving troubleshooting guide quality before automation

Visit the GitHub repository at microsoft/StepFly to download the open source framework and begin implementing automated troubleshooting in your IT environment.

Sources

  1. AI SRE – Autonomous Incident Resolution – Neubird. https://neubird.ai/
  2. Best practices for building agentic systems | InfoWorld. https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
  3. What is Agentic AI? – Agentic AI Explained – AWS. https://aws.amazon.com/what-is/agentic-ai/
  4. AWS Announces General Availability of DevOps Agent for Automated Incident Investigation – InfoQ. https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
  5. An MSP’s guide to agentic AI. https://superops.com/blog/an-msps-guide-to-agentic-ai
  6. AI agent – Wikipedia. https://en.wikipedia.org/wiki/AI_agent
  7. Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering. https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
  8. Incident response for AI: Same fire, different fuel | Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/

Author

  • siego237

    Writes for FrontierWisdom on AI systems, automation, decentralized identity, and frontier infrastructure, with a focus on turning emerging technology into practical playbooks, implementation roadmaps, and monetization strategies for operators, builders, and consultants.
