Frontier Signal

StepFly: AI Agent Automates IT Troubleshooting Guides

StepFly achieves a 94% success rate automating IT troubleshooting guides with agentic AI, and cuts execution time by 32.9-70.4% for parallelizable guides by running independent steps in parallel.

StepFly is Microsoft’s agentic AI framework for automating IT troubleshooting guides. It achieves a 94% success rate on GPT-4.1 and reduces execution time by 32.9-70.4% for parallelizable guides by running independent troubleshooting steps in parallel.

Released by: Microsoft Research
Release date: not stated
What it is: Agentic AI framework for automating IT troubleshooting guides
Who it’s for: Site reliability engineers and IT operations teams
Where to get it: GitHub (microsoft/StepFly)
Price: Open source
  • StepFly automates troubleshooting guide execution with 94% success rate on GPT-4.1
  • Three-stage workflow includes guide quality improvement, offline preprocessing, and online execution
  • Reduces execution time by 32.9-70.4% through parallel processing of independent steps
  • Tested on 92 real-world troubleshooting guides from enterprise IT environments
  • Open source framework available on GitHub with sample data and documentation
  • Manual troubleshooting guide execution is slow and error-prone in large-scale IT systems
  • StepFly addresses key challenges including TSG quality issues and complex control flow interpretation
  • The framework uses directed acyclic graphs (DAGs) to structure unstructured troubleshooting guides
  • Parallel execution capabilities significantly reduce time-to-resolution for IT incidents
  • Agentic AI systems are increasingly deployed in software engineering and IT operations

What is StepFly

StepFly is an end-to-end agentic framework that automates troubleshooting guide execution for IT incident management. It targets manual guide execution, which is slow and error-prone in large-scale IT environments.

The framework emerged from an empirical study of 92 real-world troubleshooting guides (TSGs). Agentic AI enhances incident response speed while also providing more specific and in-depth post-incident analysis [3]. StepFly specifically targets site reliability engineers (SREs) who manage complex IT infrastructure troubleshooting workflows.

Microsoft Research developed StepFly to handle specialized challenges including TSG quality management, complex control flow interpretation, data-intensive queries, and execution parallelism. The system integrates with existing IT monitoring and management tools through its plugin architecture.

What is new vs previous solutions

StepFly introduces specialized capabilities that existing LLM-based incident management solutions lack.

Feature | Previous Solutions | StepFly
TSG Quality Management | Manual review required | TSG Mentor tool for automated quality assistance
Control Flow Handling | Linear execution only | DAG-guided scheduler with complex workflow support
Data Query Processing | Generic LLM queries | Dedicated Query Preparation Plugins (QPPs)
Execution Model | Sequential step processing | Parallel execution of independent steps
Memory System | Limited context retention | Comprehensive memory system for workflow state

How does StepFly work

StepFly operates through a three-stage workflow that transforms unstructured troubleshooting guides into automated execution systems.

  1. Guide Quality Improvement Stage: The TSG Mentor tool helps SREs identify and fix quality issues in existing troubleshooting guides before automation.
  2. Offline Preprocessing Stage: LLMs extract structured directed acyclic graphs (DAGs) from unstructured TSGs and create dedicated Query Preparation Plugins (QPPs) for data-intensive operations.
  3. Online Execution Stage: A DAG-guided scheduler-executor framework with a memory system ensures correct workflow execution and supports parallel processing of independent troubleshooting steps.

Two mechanisms within the online execution stage deserve a closer look:

  • Parallel Processing: Independent troubleshooting steps execute simultaneously, reducing overall execution time by 32.9-70.4% for parallelizable TSGs.
  • Memory Management: The memory system maintains workflow state and execution context across complex multi-step troubleshooting procedures.
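The scheduling idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not StepFly's actual code: the step names, the `run_step` stub, and the `execute` function are hypothetical, and the sketch assumes the DAG has already been extracted. It uses the standard library's `graphlib.TopologicalSorter` to find steps whose dependencies are satisfied and a thread pool to run them concurrently, with a plain dictionary standing in for the memory system.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from graphlib import TopologicalSorter

# Hypothetical TSG as a DAG: each step maps to the set of steps it depends on.
TSG_DAG = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_disk": {"check_alert"},
    "summarize": {"query_cpu", "query_disk"},
}

def run_step(name, inputs):
    """Stand-in for a real troubleshooting step; inputs are upstream results."""
    return f"{name} done (saw {sorted(inputs)})"

def execute(dag, max_workers=4):
    memory = {}  # workflow state: step name -> result
    order = []   # order in which steps completed
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}  # future -> step name
        while sorter.is_active():
            # Submit every step whose dependencies have all completed.
            for step in sorter.get_ready():
                inputs = {dep: memory[dep] for dep in dag[step]}
                running[pool.submit(run_step, step, inputs)] = step
            # Block until at least one running step finishes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                step = running.pop(future)
                memory[step] = future.result()
                order.append(step)
                sorter.done(step)  # may unlock downstream steps
    return memory, order

memory, order = execute(TSG_DAG)
```

In this toy guide, `query_cpu` and `query_disk` become ready at the same time and run concurrently, while `summarize` waits for both. A real executor would also need error handling, timeouts, and persistence of the memory state.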

Benchmarks and evidence

Microsoft Research's evaluation reports StepFly's success rate, execution time, and token consumption against baseline methods on real-world troubleshooting guides.

Metric | StepFly Performance | Source
Success Rate | 94% on GPT-4.1 | Microsoft Research evaluation
Time Reduction | 32.9-70.4% for parallelizable TSGs | Empirical evaluation results
Token Consumption | Lower than baseline methods | Comparative analysis
Real-world TSGs Tested | 92 troubleshooting guides | Empirical study dataset
Execution Time | Faster than baseline approaches | Performance benchmarking
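To see why the time reduction depends on how parallelizable a guide is, it helps to work the arithmetic. The sketch below assumes an idealized model that is not taken from the evaluation: sequential time is the sum of all step durations, and parallel time is the length of the longest dependency chain (the critical path). The step names and durations are made up for illustration.

```python
def sequential_time(durations):
    """Total time when steps run one after another."""
    return sum(durations.values())

def critical_path_time(dag, durations):
    """Time when every step starts as soon as its dependencies finish."""
    finish = {}

    def finish_time(step):
        if step not in finish:
            start = max((finish_time(dep) for dep in dag[step]), default=0.0)
            finish[step] = start + durations[step]
        return finish[step]

    return max(finish_time(step) for step in dag)

def reduction_percent(dag, durations):
    seq = sequential_time(durations)
    par = critical_path_time(dag, durations)
    return 100.0 * (seq - par) / seq

# Hypothetical guide with two independent queries (durations in seconds).
dag = {
    "check_alert": set(),
    "query_cpu": {"check_alert"},
    "query_disk": {"check_alert"},
    "summarize": {"query_cpu", "query_disk"},
}
durations = {"check_alert": 10, "query_cpu": 30, "query_disk": 20, "summarize": 10}

# Sequential: 10 + 30 + 20 + 10 = 70 s. Critical path: 10 + 30 + 10 = 50 s.
print(round(reduction_percent(dag, durations), 1))  # 28.6

# A purely sequential chain gains nothing from parallel execution.
chain = {"a": set(), "b": {"a"}, "c": {"b"}}
print(reduction_percent(chain, {"a": 5, "b": 5, "c": 5}))  # 0.0
```

Under this model, the saving is bounded by how much work sits off the critical path, which is consistent with the reported range applying only to parallelizable TSGs.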

Who should care

Builders

Software engineers building incident management systems can leverage StepFly’s open-source framework. AI agents are most commonly deployed in software engineering, accounting for roughly half of use cases [2]. The GitHub repository provides implementation guidance and sample data for integration projects.

Enterprise

Large-scale IT operations teams managing complex infrastructure can reduce incident resolution time significantly. AI agents can detect issues like CPU spikes or failed processes, analyze root causes using real-time and historical data, and apply fixes automatically [5]. Enterprise adoption requires integration with existing monitoring and ticketing systems.

End users

Business users experience faster service restoration when IT incidents occur. StepFly’s automation reduces mean time to resolution (MTTR) for common infrastructure issues. End users benefit indirectly through improved system reliability and reduced downtime duration.

Investors

The incident management automation market represents significant opportunity as enterprises seek to reduce operational costs. Microsoft’s open-source approach may accelerate adoption while building ecosystem partnerships. Investment opportunities exist in complementary tooling and managed service providers.

How to use StepFly today

StepFly is available as an open-source framework through Microsoft’s GitHub repository.

  1. Clone Repository: Access the codebase at https://github.com/microsoft/StepFly with sample data and documentation included.
  2. Install Dependencies: Follow setup instructions for Python environment and required LLM API access (GPT-4.1 recommended for optimal performance).
  3. Prepare TSGs: Use TSG Mentor tool to review and improve existing troubleshooting guide quality before automation.
  4. Configure Plugins: Set up Query Preparation Plugins (QPPs) for your specific monitoring tools and data sources.
  5. Test Execution: Run sample troubleshooting scenarios to validate DAG extraction and parallel execution capabilities.
  6. Deploy Framework: Integrate with existing incident management workflows and monitoring systems for production use.
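The plugin configuration step is easier to picture with a concrete shape in mind. The article does not document the QPP interface, so everything below is an assumption: the `QueryPreparationPlugin` class, its fields, and the query text are hypothetical and only illustrate the general idea of turning step parameters into a ready-to-run query, so the LLM does not have to compose the query text itself.

```python
from dataclasses import dataclass
from string import Template

@dataclass(frozen=True)
class QueryPreparationPlugin:
    """Hypothetical plugin shape; not StepFly's actual interface."""
    name: str
    template: Template
    required: tuple

    def prepare(self, params):
        # Fail early if the step did not supply every required parameter.
        missing = [key for key in self.required if key not in params]
        if missing:
            raise ValueError(f"{self.name}: missing parameters {missing}")
        # Note: a real plugin would also validate or escape the values.
        return self.template.substitute(params)

# Illustrative query against a made-up metrics table.
cpu_plugin = QueryPreparationPlugin(
    name="cpu_usage",
    template=Template(
        "SELECT avg(cpu) FROM metrics WHERE host = '$host' AND ts > now() - $window"
    ),
    required=("host", "window"),
)

query = cpu_plugin.prepare({"host": "web-01", "window": "15m"})
```

Consult the repository documentation for the real plugin format before building on this shape.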

StepFly vs competitors

StepFly competes with other AI-powered incident management solutions in the enterprise market.

Feature | StepFly | ServiceNow AI | AWS DevOps Agent
Open Source | Yes | No | No
Parallel Execution | Yes (32.9-70.4% time reduction) | Limited | Not yet disclosed
TSG Quality Tools | TSG Mentor included | Manual process | Not yet disclosed
Success Rate | 94% on GPT-4.1 | Not yet disclosed | Not yet disclosed
Deployment Model | Self-hosted | SaaS platform | AWS cloud service
Pricing | Free (open source) | Enterprise licensing | Pay-per-use

Risks, limits, and myths

  • Quality Dependency: StepFly performance relies heavily on troubleshooting guide quality, requiring upfront investment in TSG improvement.
  • LLM Costs: High-performance models like GPT-4.1 may incur significant API costs for large-scale deployments.
  • Integration Complexity: Connecting with existing monitoring tools and ticketing systems requires custom plugin development.
  • False Automation Myth: StepFly requires human oversight and cannot replace all manual troubleshooting expertise.
  • Parallel Processing Limits: Not all troubleshooting guides benefit from parallelization due to sequential dependencies.
  • Data Security Concerns: Sending sensitive infrastructure data to external LLM APIs may violate security policies.
  • Training Requirements: SRE teams need training on TSG Mentor tools and DAG-based workflow concepts.
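One common mitigation for the data security concern above is to redact obviously sensitive values before any text leaves your environment. The sketch below is a generic illustration, not a StepFly feature: the `redact` function and its patterns are hypothetical, cover only a few simple cases (IPv4 addresses, email addresses, and `key=value` secrets), and would need review against your own security policy.

```python
import re

# Each entry pairs a pattern with its replacement. Patterns are illustrative
# and deliberately simple; they will miss many real-world formats.
PATTERNS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "<EMAIL>"),
    (re.compile(r"(?i)\b(password|token|secret|api_key)\s*=\s*\S+"), r"\1=<REDACTED>"),
]

def redact(text):
    """Replace sensitive-looking substrings before text is sent to an LLM API."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

log_line = "host 10.0.0.12 failed, contact ops@example.com, token=abc123"
print(redact(log_line))
# host <IP> failed, contact <EMAIL>, token=<REDACTED>
```

Redaction reduces exposure but does not remove it, so teams with strict policies may still prefer a self-hosted model.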

FAQ

What is StepFly and how does it work?

StepFly is Microsoft’s agentic AI framework that automates IT troubleshooting guide execution through a three-stage workflow: quality improvement, offline preprocessing, and online execution with parallel processing capabilities.

How much faster is StepFly compared to manual troubleshooting?

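The reported figure is a 32.9-70.4% reduction in execution time for parallelizable troubleshooting guides, attributed to running independent steps in parallel, alongside a 94% success rate on GPT-4.1. The available material does not give a direct comparison against manual execution.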

Is StepFly free to use?

Yes, StepFly is open source and available free on GitHub at microsoft/StepFly, though LLM API costs may apply depending on usage volume.

What troubleshooting guides work best with StepFly?

StepFly works best with structured troubleshooting guides that have clear steps and decision points, particularly those with independent steps that can execute in parallel.
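A guide with "clear steps and decision points" can be pictured as a table of steps where each outcome names the next step to run. The sketch below is a hypothetical illustration of that idea, not StepFly's representation: the `GUIDE` structure, the step names, and the `walk` function are invented, and the guide is assumed to contain no cycles.

```python
# Hypothetical guide: each step maps an outcome label to the next step.
# None means the guide ends; an unlisted outcome also ends the walk.
GUIDE = {
    "check_service": {"down": "restart_service", "up": "check_latency"},
    "restart_service": {"ok": None, "failed": "escalate"},
    "check_latency": {"high": "escalate", "normal": None},
    "escalate": {},
}

def walk(guide, start, run):
    """Follow decision points from `start`, calling run(step) for each outcome."""
    path = []
    step = start
    while step is not None:
        outcome = run(step)
        path.append((step, outcome))
        step = guide[step].get(outcome)
    return path

# Stubbed outcomes standing in for real checks.
outcomes = {"check_service": "down", "restart_service": "failed", "escalate": "paged"}
path = walk(GUIDE, "check_service", outcomes.__getitem__)
# [('check_service', 'down'), ('restart_service', 'failed'), ('escalate', 'paged')]
```

Guides written so that every decision point has explicit, named outcomes map onto a structure like this far more easily than guides that leave the branching implicit in prose.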

Do I need GPT-4.1 to use StepFly effectively?

StepFly achieved its 94% success rate on GPT-4.1. The framework supports other LLMs, but performance metrics for alternative models are not yet disclosed.

How does StepFly handle sensitive IT infrastructure data?

StepFly processes troubleshooting workflows through LLM APIs, so organizations must evaluate data privacy policies and consider on-premises LLM deployment for sensitive environments.

What skills do SREs need to implement StepFly?

SREs need familiarity with Python development, LLM API integration, and understanding of directed acyclic graph (DAG) concepts for workflow management.

Can StepFly replace human site reliability engineers?

No, StepFly automates routine troubleshooting guide execution but requires human oversight, TSG quality management, and expertise for complex incident scenarios.

How does StepFly compare to ServiceNow or AWS incident management?

StepFly offers open-source deployment and specialized parallel execution capabilities, while ServiceNow and AWS provide managed platform solutions with different pricing models.

What monitoring tools integrate with StepFly?

StepFly uses Query Preparation Plugins (QPPs) to integrate with various monitoring tools, though specific supported platforms are not yet disclosed in available documentation.

How long does it take to implement StepFly in production?

Implementation timeline depends on troubleshooting guide quality, existing tool integrations, and team familiarity with agentic AI frameworks – specific timeframes are not yet disclosed.

What happens if StepFly fails to resolve an incident?

StepFly's reported success rate is 94% on GPT-4.1, so some executions will not complete successfully. In those cases the incident may be escalated to human SREs through standard incident management workflows, with execution logs kept for analysis.

Glossary

Agentic AI
AI systems that can act independently with complex goal structures, natural language interfaces, and integration of software tools for autonomous task execution.
Directed Acyclic Graph (DAG)
A graph structure with directed edges and no cycles, used by StepFly to represent troubleshooting workflow dependencies and enable parallel execution.
Query Preparation Plugins (QPPs)
Specialized components in StepFly that handle data-intensive queries by preparing and formatting requests to monitoring tools and data sources.
Site Reliability Engineer (SRE)
IT professionals responsible for maintaining system reliability, availability, and performance through monitoring, incident response, and automation practices.
Troubleshooting Guide (TSG)
Structured documentation that provides step-by-step procedures for diagnosing and resolving specific IT system issues and incidents.
TSG Mentor
StepFly’s tool that assists site reliability engineers in identifying and improving troubleshooting guide quality before automation implementation.

Visit the StepFly GitHub repository at microsoft/StepFly to download the open-source framework and explore sample troubleshooting guide automation implementations.

Sources

  1. AI SRE – Autonomous Incident Resolution – Neubird. https://neubird.ai/
  2. Best practices for building agentic systems | InfoWorld. https://www.infoworld.com/article/4154570/best-practices-for-building-agentic-systems.html
  3. What is Agentic AI? – Agentic AI Explained – AWS. https://aws.amazon.com/what-is/agentic-ai/
  4. AWS Announces General Availability of DevOps Agent for Automated Incident Investigation – InfoQ. https://www.infoq.com/news/2026/04/aws-devops-agent-ga/
  5. An MSP’s guide to agentic AI. https://superops.com/blog/an-msps-guide-to-agentic-ai
  6. AI agent – Wikipedia. https://en.wikipedia.org/wiki/AI_agent
  7. Agentic Engineering: How Swarms of AI Agents Are Redefining Software Engineering. https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering
  8. Incident response for AI: Same fire, different fuel | Microsoft Security Blog. https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/

Author

  • Siegfried Kamgo

    Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.