📣 XDevOps will open for public access soon! Join the launch list: feedback@xdevops.ai

Research-Driven AI DevOps & Platform Engineering

We bridge the gap between AI capabilities and infrastructure governance. As creators of the DevOps SRE AI Atlas 2025, xdevops.ai helps enterprises move beyond basic AIOps to solve Integration Sprawl, Governance, and Adoption Velocity.

The 2025 Atlas is based on analysis of 35+ AI Agent Platforms.

Natural-Language Infrastructure, Ontology‑Driven

XDevOps is a cognitive agent that turns plain English into safe, auditable cloud & on‑prem operations. Every action is validated against a knowledge graph (SHACL), executed via mTLS backends, and traced end‑to‑end.

🤖 Cognitive Agent • 🧭 Ontology + SHACL • 🔐 mTLS Connectors • ⚡ Event-Driven • 🔎 Explainable Plans
Built on RDF/SHACL
OpenTelemetry-native
Milvus‑powered RAG
Redis Streams
All mutating commands automatically carry a unique correlationId tag for lineage and audits.
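
As a minimal sketch of the correlation tagging described above: the `Command` type, the `MUTATING_VERBS` set, and the `tag_for_audit` helper are illustrative assumptions, not the product's actual API. Mutating commands receive a fresh `correlationId`; reads are left untagged.

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical set of verbs treated as mutating; the real classification is not published.
MUTATING_VERBS = {"create", "update", "delete"}

@dataclass
class Command:
    verb: str
    resource: str
    tags: dict = field(default_factory=dict)

def tag_for_audit(cmd: Command) -> Command:
    """Return a copy with a fresh correlationId on mutating commands and no tags on reads."""
    if cmd.verb in MUTATING_VERBS:
        tags = {**cmd.tags, "correlationId": str(uuid.uuid4())}
    else:
        tags = {}  # reads stay untagged
    return Command(cmd.verb, cmd.resource, tags)

created = tag_for_audit(Command("create", "vm/web-01"))
listed = tag_for_audit(Command("list", "vm", tags={"stale": "x"}))
```

Returning a copy rather than mutating the input keeps the original request available for comparison in an audit record.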

At a glance

Start with a goal. The Task engine (LLM) infers intent and emits a structured Task JSON; XDevOps validates with SHACL and routes to the right cognitive agents (Scenario Engine, Shell Coach, Provisioning, Compliance). Execution is mTLS and fully auditable.

  • 🧭 Intent Entry Point (Task): LLM infers mode/stream/owners → Task(JSON) → routed to agents
  • 🧯 Real-time Shell Troubleshooting: diagnose failures, explain root cause, propose safe fixes
  • 🚦 Scenario Engine: runbooks with pre-change policy gates & automatic re-planning
  • 📈 NLP Observability: ask Prometheus/Loki in English; get charts, trends, anomaly alerts
  • ⚙️ Event-Driven Provisioning: materialise infra when events fire—no polling
  • 🧪 Autonomous Diagnostics (ADO): multi‑agent triage under a shared correlation ID (CID): hypotheses → verification → fix

🧠 Explainable fixes
🔏 Policy-first execution
🕵️ Full audit trail

Overview

Natural language → plan → policy check → execution → events

🎯 The Objective of XDevOps

Natural language is the ultimate interface between humans and cognitive agents—rich with nuance, intent, and context.

  • No more YAML wrestling.
  • No rigid forms or brittle scripts.
  • Say what you want; the agent does the rest—safely.

⚡ Why now?

Breakthroughs in LLMs and knowledge graphs make this practical today—what was once sci‑fi is now operational reality.

  • Reasoning agents that understand policies & context.
  • RDF/SHACL graphs to enforce standards before change.
  • Event‑driven execution for real‑time reconciliation.

🚀 Our mission

Make natural language the fastest, safest way to run infrastructure—from design to troubleshooting.

  • Human‑centric, policy‑first automation.
  • Audit‑ready by design.
  • Portable across PaaS • Hybrid • On‑Prem.

Metrics & KPIs — Product-Aligned SRE/DevOps Subset

We focus on the subset of SRE/DevOps metrics that proves value for platform and infra teams.

① Flow unlocks value ⚡

Shift from project outputs to product value streams. Optimize how fast value flows from intent to production.

  • Flow Velocity ↑ (features/week)
  • Lead Time ↓ (idea → prod)
  • Flow Efficiency ↑ (active time / total time)
Flow Metrics: Velocity • Time • Efficiency • Load • Distribution

② Make work visible 🔎

Bottlenecks hide in handoffs. Use end-to-end telemetry and the knowledge graph to surface constraints early.

  • MTTR (median, minutes)
  • Blocked Work Detected (×)
  • Scenario Pass Rate (%)
  • Observability Coverage (%)
Signals: WIP • Blockers • Queue time • Policy failures • Coverage

③ Govern by outcomes 🎯

Budget and governance follow product lines. Enforce policy pre-change and track customer impact.

  • Policy Compliance, pre-flight (%)
  • Change Failure Rate ↓ (%)
  • Cost per Change ↓ (%)
  • Error Budget Burn (rate, ×)
Outcomes: Compliance • Reliability • Unit economics • SLOs

Cognitive Agentic KPIs 🧠

Quantify autonomy, safety, and learning velocity of XDevOps agents.

  • Autonomy Depth Index (ADI)
  • Event→Goal Time (E2GT, p50, minutes)
  • Graph-Validated Safety Coverage (GVSC, %)
  • Zero-Touch Rate (ZTR, %)
  • Preventive Opportunity Capture Rate (POCR, %)

Key Capabilities

Each box is an agent skill with guard‑rails, explanations, and full lineage.

🧭 Intent Entry Point — Task Intelligence

Capture a goal in natural language—an LLM infers the intent and emits a structured Task (JSON) that safely routes to the right cognitive agents (Provisioning, Scenario Engine, Shell Coach, Compliance). Tasks = intent, Agents = action.

⚡ !task create micro "Enable canary for checkout"
→ infers Feature • micro → Scenario Engine

🧯 "Follow up on incident #1423"
→ infers Bug Fix • lite → checklist & owners → Shell Coach

📈 "Migrate our SLOs to 99.9%"
→ infers Tech Debt/Risk • full → dependencies & observability → Provisioning & Observability agents
LLM intent inference • Agent routing • JSON Schema • SHACL pre-flight • Milvus memory • Redis events • FastAPI engine
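
To make the Task idea concrete, here is a rough sketch of what a structured Task and its routing could look like. The field names (`goal`, `stream`, `mode`, `owners`) and the `ROUTES` table are assumptions inferred from the three examples above; the real schema and routing rules are not published, and the LLM inference step is replaced by hand-filled values.

```python
import json
from dataclasses import dataclass, asdict, field

# Routing table assumed from the three examples above; not the product's actual rules.
ROUTES = {
    "Feature": ["Scenario Engine"],
    "Bug Fix": ["Shell Coach"],
    "Tech Debt/Risk": ["Provisioning", "Observability"],
}
MODES = {"micro", "lite", "full"}

@dataclass
class Task:
    goal: str
    stream: str
    mode: str
    owners: list = field(default_factory=list)

def route(task: Task) -> list:
    """Return the agents that should receive this task, rejecting unknown values."""
    if task.mode not in MODES:
        raise ValueError(f"unknown mode: {task.mode}")
    if task.stream not in ROUTES:
        raise ValueError(f"unknown stream: {task.stream}")
    return ROUTES[task.stream]

# In the product an LLM would infer stream and mode; here they are set by hand.
task = Task(goal="Enable canary for checkout", stream="Feature", mode="micro")
payload = json.dumps(asdict(task), sort_keys=True)  # the Task as JSON
agents = route(task)
```

Keeping the Task as plain data means it can be schema-checked before any agent sees it.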

⚙️ Event‑Driven Provisioning

Autonomously creates & reconciles infra the moment a cloud or on‑prem event fires—zero polling with full audit.

ResourceCreated/Deleted • Idempotent plans • Graph lineage
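
One way to picture idempotent, event-driven reconciliation is the in-memory sketch below. The `Event` and `Reconciler` names are hypothetical, and a real deployment would consume events from a stream rather than direct method calls. Replaying an event that was already handled changes nothing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    event_id: str
    kind: str      # "ResourceCreated" or "ResourceDeleted"
    resource: str

class Reconciler:
    """Applies each event at most once and records lineage for audits."""

    def __init__(self):
        self.resources = set()
        self.seen = set()
        self.lineage = []

    def handle(self, event: Event) -> bool:
        if event.kind not in ("ResourceCreated", "ResourceDeleted"):
            raise ValueError(f"unsupported event kind: {event.kind}")
        if event.event_id in self.seen:
            return False  # duplicate delivery: no state change
        self.seen.add(event.event_id)
        if event.kind == "ResourceCreated":
            self.resources.add(event.resource)
        else:
            self.resources.discard(event.resource)
        self.lineage.append((event.event_id, event.kind, event.resource))
        return True

reconciler = Reconciler()
first = reconciler.handle(Event("evt-1", "ResourceCreated", "vm/web-01"))
replay = reconciler.handle(Event("evt-1", "ResourceCreated", "vm/web-01"))
removed = reconciler.handle(Event("evt-2", "ResourceDeleted", "vm/web-01"))
```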

🚦 Scenario Engine

Design repeatable runbooks; the agent executes, adapts & explains each step, gates every change on policy, and re‑plans on failure.

Policy gates (SHACL) • Rollback paths • Explainable steps
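
The gate-and-re-plan loop can be sketched roughly as follows. The function names, the attempt budget, and the "rollback step required" rule are illustrative assumptions; in the product the gate is a SHACL check and the re-planner is an agent.

```python
from typing import Callable, List, Tuple

Plan = List[str]

def run_with_gates(
    plan: Plan,
    policy_gate: Callable[[Plan], List[str]],
    replan: Callable[[Plan, List[str]], Plan],
    max_attempts: int = 3,
) -> Tuple[Plan, List[str]]:
    """Return the first plan that passes the gate, plus a human-readable log."""
    log = []
    for attempt in range(1, max_attempts + 1):
        violations = policy_gate(plan)
        if not violations:
            log.append(f"attempt {attempt}: policy passed")
            return plan, log
        log.append(f"attempt {attempt}: blocked by {', '.join(violations)}")
        plan = replan(plan, violations)
    raise RuntimeError("no compliant plan within the attempt budget")

# Hypothetical policy: every plan needs a rollback step.
def needs_rollback(plan: Plan) -> List[str]:
    has_rollback = any(step.startswith("rollback:") for step in plan)
    return [] if has_rollback else ["missing-rollback-path"]

def add_rollback(plan: Plan, violations: List[str]) -> Plan:
    return plan + ["rollback: restore previous release"]

approved, log = run_with_gates(["deploy: canary 10%"], needs_rollback, add_rollback)
```

The log doubles as the explanation for each step: it records why a plan was blocked and on which attempt it passed.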

🖥️ Interactive Shell Coach

Run commands in your own terminal—the agent annotates, fixes & learns in real time. Safer changes, faster outcomes.

Command fixer • Correlation tags • Run‑command hygiene

📈 Observability via NLP

Query Prometheus & Loki in English—get instant charts, trends & anomaly alerts without memorising DSLs.

Time‑series insights • Anomaly alerts • Root‑cause prompts

🎓 Certification Learning Support

Accelerate certifications—the agent crawls fresh docs, builds adaptive study plans & quizzes you to mastery.

Adaptive quizzes • Doc crawling • Weak‑spot drills

🧠 Multi‑RAG Personalisation

Capabilities, Knowledge & Story corpora tailor every answer to your standards, repos & runbooks.

Org‑specific answers • Vector search • Continuous learning

🔁 Knowledge Transfer

All chats & shell sessions are vectorised, searchable & replayable—perfect for onboarding & audits.

Session memory • Replay & share • Evidence packs

🧩 Git & IaC Intelligence

PRs, commits, Terraform & Helm live in vectors—ask for diffs, impact & drift instantly.

IaC parsing • Impact analysis • Drift checks

🛡️ Policy‑First Automation

A SHACL‑validated knowledge graph enforces tags, budgets & security before every change.

Pre‑flight checks • Standards & tags • Budget guard‑rails
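
As a plain-Python stand-in for the pre-flight idea (this is not SHACL, and it does not touch a graph store), the sketch below checks a proposed change for tags, budget, and one security rule. The required tag names, the change fields, and the rules themselves are assumptions chosen for illustration.

```python
# Hypothetical tag standard; a real deployment would express this as SHACL shapes.
REQUIRED_TAGS = {"owner", "costCenter"}

def preflight(change: dict, budget_remaining: float) -> list:
    """Return a list of violations; an empty list means the change may proceed."""
    violations = []
    missing = REQUIRED_TAGS - set(change.get("tags", {}))
    if missing:
        violations.append("missing tags: " + ", ".join(sorted(missing)))
    if change.get("monthly_cost", 0.0) > budget_remaining:
        violations.append("monthly cost exceeds remaining budget")
    if change.get("public_ingress", False):
        violations.append("public ingress is not allowed")
    return violations

good = {"tags": {"owner": "team-pay", "costCenter": "cc-42"}, "monthly_cost": 80.0}
bad = {"tags": {"owner": "team-pay"}, "monthly_cost": 500.0, "public_ingress": True}

good_result = preflight(good, budget_remaining=100.0)
bad_result = preflight(bad, budget_remaining=100.0)
```

Collecting every violation, rather than stopping at the first, gives the requester one complete list to fix.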

Autonomous Diagnostics Orchestrator (ADO)

Multi‑agent diagnostics for SRE/Platform teams. Hypotheses are generated, verified with data, and summarized with evidence.

How it works (at a glance)

  1. You launch: !diag checkout 5xx spike endpoint_id=42 window=45m
  2. Orchestrator normalizes context and issues tasks with a fresh CID.
  3. KB Agent enriches app context (owners, similar incidents, suspected patterns).
  4. Hypothesis Agent drafts 2–5 likely causes + test plan.
  5. Verification Agent runs PromQL/LogQL/K8s/Git checks and returns verdicts.
  6. Fix‑Proposal Agent synthesizes safe remediation steps with blast‑radius notes.
  7. Orchestrator streams progress and posts a final Markdown summary with confidence.
Traffic is coordinated via Redis Streams; agents resolve credentials locally (secrets masked as <MASKED>).
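
Steps 1 and 2 can be illustrated with a small parser for the launch command. The output shape, the random hex CID, and the `window_minutes` helper are assumptions made for this sketch; the orchestrator's real normalization is not documented here.

```python
import re
import uuid

def parse_diag(line: str) -> dict:
    """Split a !diag command into free-text symptom words and key=value parameters."""
    tokens = line.split()
    if not tokens or tokens[0] != "!diag":
        raise ValueError("expected a command starting with !diag")
    words, params = [], {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep and key and value:
            params[key] = value
        else:
            words.append(token)
    return {
        "cid": uuid.uuid4().hex,  # fresh ID per investigation (format assumed)
        "symptom": " ".join(words),
        "params": params,
    }

def window_minutes(window: str) -> int:
    """Convert a window such as '45m' to minutes; only the minute suffix is handled."""
    match = re.fullmatch(r"(\d+)m", window)
    if match is None:
        raise ValueError(f"unsupported window: {window}")
    return int(match.group(1))

request = parse_diag("!diag checkout 5xx spike endpoint_id=42 window=45m")
minutes = window_minutes(request["params"]["window"])
```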

    Provisioning & Requests — Ontology‑Driven

    Cognitive planning • SHACL validation • mTLS execution • event emission

    How a request flows

    1. Intent capture: You describe the outcome in natural language.
    2. Plan synthesis: Agent generates an ordered CLI plan with dependency checks.
    3. Ontology validation: Plan is validated in RDF via SHACL (tags, budgets, security).
    4. mTLS execution: Commands run via agent backends with no shell substitution.
    5. Event emission: Created/Deleted events flow to the graph for lineage & dashboards.
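
The five steps above can be sketched as a small pipeline in which execution only happens after validation returns no violations. Every function here is a hypothetical stub standing in for an agent or backend; only the ordering and the short-circuit are the point.

```python
from typing import Callable, Dict, List

def handle_request(
    intent: str,
    synthesize: Callable[[str], List[str]],
    validate: Callable[[List[str]], List[str]],
    execute: Callable[[str], Dict[str, str]],
) -> dict:
    """Plan, validate, execute, emit. Stops before execution when validation fails."""
    trace = {"intent": intent, "plan": [], "violations": [], "events": []}
    trace["plan"] = synthesize(intent)
    trace["violations"] = validate(trace["plan"])
    if trace["violations"]:
        return trace  # nothing executed, no events emitted
    for step in trace["plan"]:
        trace["events"].append(execute(step))
    return trace

executed = []

def synthesize(intent: str) -> List[str]:
    return ["create network", "create vm"]  # stub for plan synthesis

def execute(step: str) -> Dict[str, str]:
    executed.append(step)  # stub for the mTLS backend call
    return {"type": "Created", "step": step}

ok = handle_request("a VM for staging", synthesize, lambda plan: [], execute)
blocked = handle_request(
    "a VM for staging", synthesize, lambda plan: ["missing owner tag"], execute
)
```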

    Safety & governance

    • Run‑command hygiene: one script line per --scripts, no chaining (;/&&/|).
    • SSH key policy: provide --ssh-key-values or auto‑generate securely.
    • Non‑mutating ops: strip tags automatically to keep reads pure.
    • Fixer loop: if a step fails, the agent proposes corrected steps—no repetition of failures.
    GraphDB (RDF/SHACL) • Milvus (RAG) • Redis Streams • mTLS Agent
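
A rough sketch of the run-command hygiene rule: reject any script line that chains commands or uses shell substitution. The `FORBIDDEN` table and the function name are assumptions, and the check is deliberately naive (it does not understand quoting, so a `;` inside a quoted string is also rejected).

```python
# Substring checks only; a hypothetical, conservative stand-in for the real rule set.
FORBIDDEN = {
    ";": "chaining with ;",
    "&&": "chaining with &&",
    "|": "pipe or || chaining",
    "$(": "command substitution $(...)",
    "`": "backtick substitution",
    "\n": "more than one line",
}

def hygiene_violations(script: str) -> list:
    """Return the reasons a script line breaks the one-command-per-line rule."""
    return [reason for token, reason in FORBIDDEN.items() if token in script]

clean = hygiene_violations("systemctl restart nginx")
chained = hygiene_violations("apt-get update && apt-get -y upgrade")
substituted = hygiene_violations("echo $(whoami); id")
```

Erring on the side of rejection fits the policy-first stance: a false positive costs a rewrite, while a missed chain could run an unreviewed command.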

    Product-Aligned SRE/DevOps Metrics Subset

    A subset of SRE/DevOps metrics tailored to infra & platform teams, organized by Flow Streams: Feature, Bug Fix, Risk, and Technical Debt.

    Flow Streams & Value Mapping

    Each stream maps to a purpose, leading indicators, SRE/DORA metrics, cognitive KPIs, safeguards, and economics.

    Stream-to-metrics value mapping

    🧩 Feature
    • Purpose: Deliver new user value
    • Leading Indicators: PR cycle time ↓, feature throughput ↑, review latency ↓
    • SRE/DORA: Deploy freq ↑, Lead time ↓, CFR stable, SLO impact ≤ 0
    • Cognitive KPIs: ADI ↑, E2GT ↓, GVSC ≥95%, ZTR ↑
    • Safeguards: Pre‑flight policy, canary, cost/tag gates
    • Economics: $/feature ↓, NPS ↑

    🛠️ Bug Fix
    • Purpose: Restore reliability fast
    • Leading Indicators: MTTA ↓, bug deflection ↑, duplicate pattern match
    • SRE/DORA: MTTR ↓, CFR ↓, incident count ↓
    • Cognitive KPIs: Root‑cause precision ↑, ADI(runbooks) ↑, E2GT ↓
    • Safeguards: Safe rollback, change windows, postmortem required
    • Economics: Incident minutes avoided ↑, cost‑of‑quality ↓

    🛡️ Risk
    • Purpose: Reduce exposure proactively
    • Leading Indicators: Open risks ↓, policy failures ↓, patch lead time ↓
    • SRE/DORA: Error budget burn ↓, CFR —, compliance pass ↑
    • Cognitive KPIs: GVSC ≥99%, POCR ↑, ZTR ↑
    • Safeguards: SHACL policy gates, mandatory controls
    • Economics: Risk $ avoided ↑, audit findings ↓

    🧱 Tech Debt
    • Purpose: Pay down toil & complexity
    • Leading Indicators: Toil hours ↓, hotspot churn ↓, flaky tests ↓
    • SRE/DORA: Lead time ↑ short‑term, then ↓; CFR stable
    • Cognitive KPIs: Self‑Improvement Rate ↑, ACS ↑
    • Safeguards: Contract tests, perf gates, backward‑compat
    • Economics: Cost‑to‑serve ↓, infra efficiency ↑

    1) Discover → Frame

    • Map top user intents & compliance policies.
    • Connect observability & IaC repos to vectors.
    • Define ontology classes for your domain.
    • Tag Streams: Feature • Bug Fix • Risk • Tech Debt.

    2) Pilot → Govern

    • Enable SHACL policies for tags, budget, security.
    • Shadow run vs manual; diff plans & outcomes.
    • Instrument correlation IDs & event lineage.
    • Baseline SRE: Deploy Freq, Lead Time, MTTR, CFR, SLO burn.

    3) Operate → Scale

    • Promote runbooks to Scenario Engine.
    • Shell coaching by default for risky ops.
    • Onboard via replayable sessions & evidence.
    • Add Cognitive KPIs: ADI, E2GT, GVSC, ZTR, POCR.

    KPI Helper

    Quick glossary for Cognitive Agentic KPIs referenced above.

    ADI

    Autonomy Depth Index — how many steps agents complete without intervention.

    E2GT

    Event‑to‑Goal Time — median time from event ingestion to verified outcome.

    GVSC

    Graph‑Validated Safety Coverage — % actions passing SHACL gates.

    ZTR

    Zero‑Touch Rate — share of requests completed with no manual edits.

    POCR

    Preventive Opportunity Capture Rate — % predicted issues acted on early.

    ACS

    Agent Correctness Score — judged accuracy of agent decisions.
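
To show how a few of these KPIs might be computed, here is a sketch over a handful of made-up request records. The record fields and the exact formulas are assumptions based on the glossary wording (for example, ADI is taken as the mean number of steps completed without intervention, and each request is treated as one gated action); the sample numbers are illustrative, not measured results.

```python
from statistics import median

def cognitive_kpis(records: list) -> dict:
    """Compute ADI, E2GT, GVSC and ZTR from per-request records (assumed schema)."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")
    return {
        # Mean steps completed without human intervention.
        "ADI": sum(r["auto_steps"] for r in records) / n,
        # Median seconds from event ingestion to verified outcome.
        "E2GT_s": median([r["event_to_goal_s"] for r in records]),
        # Percentage of actions that passed the SHACL gate.
        "GVSC_pct": 100 * sum(1 for r in records if r["shacl_pass"]) / n,
        # Percentage of requests completed with no manual edits.
        "ZTR_pct": 100 * sum(1 for r in records if r["manual_edits"] == 0) / n,
    }

# Illustrative sample data only.
sample = [
    {"auto_steps": 5, "shacl_pass": True, "manual_edits": 0, "event_to_goal_s": 120},
    {"auto_steps": 3, "shacl_pass": True, "manual_edits": 2, "event_to_goal_s": 300},
    {"auto_steps": 4, "shacl_pass": False, "manual_edits": 0, "event_to_goal_s": 180},
    {"auto_steps": 4, "shacl_pass": True, "manual_edits": 0, "event_to_goal_s": 240},
]
kpis = cognitive_kpis(sample)
```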

    Deployment Targets

    Same cognitive core, different footprints — PaaS • Hybrid • On‑Prem.

    ☁️ PaaS

    Managed multi‑tenant control‑plane. Fastest start, zero infra to manage.

    Free Tier • Multi‑tenant • SLA
    • Org‑scoped tenant & keys.
    • mTLS agent connectors.
    • SLA & security hardening.
    Free Tier (PaaS):
    • Up to 2 agent connectors & 1 environment
    • 300 actions/month, 7‑day event retention
    • 3 seats (SSO ready), community support
    • No credit card during beta

    🏗️ Hybrid

    Cloud control‑plane + on‑prem agents for restricted or air‑gapped workloads.

    • Outbound‑only agents.
    • Private RAG stores.
    • Bring‑your‑KMS.

    🔒 On‑Prem

    Single‑tenant, fully isolated. All data & models within your perimeter.

    • Self‑host GraphDB/Milvus.
    • Offline updates.
    • Custom compliance packs.

    Talk to us

    We’re opening soon. Free Tier available for PaaS. Get early access or schedule time with the team.