Research-Driven AI DevOps & Platform Engineering
We bridge the gap between AI capabilities and infrastructure governance. As creators of the DevOps SRE AI Atlas 2025, xdevops.ai helps enterprises move beyond basic AIOps to solve Integration Sprawl, Governance, and Adoption Velocity.
Based on analysis of 35+ AI Agent Platforms
Natural-Language Infrastructure, Ontology‑Driven
XDevOps is a cognitive agent that turns plain English into safe, auditable cloud & on‑prem operations. Every action is validated against a knowledge graph (SHACL), executed via mTLS backends, and traced end‑to‑end.
At a glance
Start with a goal. The Task engine (LLM) infers intent and emits a structured Task JSON; XDevOps validates it with SHACL and routes it to the right cognitive agents (Scenario Engine, Shell Coach, Provisioning, Compliance). Execution runs over mTLS and is fully auditable.
- 🧭 Intent Entry Point (Task): LLM infers mode/stream/owners → Task(JSON) → routed to agents
- 🧯 Real-time Shell Troubleshooting: diagnose failures, explain root cause, propose safe fixes
- 🚦 Scenario Engine: runbooks with pre-change policy gates & automatic re-planning
- 📈 NLP Observability: ask Prometheus/Loki in English; get charts, trends, anomaly alerts
- ⚙️ Event-Driven Provisioning: materialise infra when events fire—no polling
- 🧪 Autonomous Diagnostics (ADO): multi‑agent triage with CID, hypotheses → verification → fix
Overview
Natural language → plan → policy check → execution → events
🎯 The Objective of XDevOps
Natural language is the ultimate interface between humans and cognitive agents—rich with nuance, intent, and context.
- No more YAML wrestling.
- No rigid forms or brittle scripts.
- Say what you want; the agent does the rest—safely.
⚡ Why now?
Breakthroughs in LLMs and knowledge graphs make this practical today—what was once sci‑fi is now operational reality.
- Reasoning agents that understand policies & context.
- RDF/SHACL graphs to enforce standards before change.
- Event‑driven execution for real‑time reconciliation.
🚀 Our mission
Make natural language the fastest, safest way to run infrastructure—from design to troubleshooting.
- Human‑centric, policy‑first automation.
- Audit‑ready by design.
- Portable across PaaS • Hybrid • On‑Prem.
Metrics & KPIs — Product-Aligned SRE/DevOps Subset
We focus on the subset of SRE/DevOps metrics that proves value for platform and infra teams.
① Flow unlocks value ⚡
Shift from project outputs to product value streams. Optimize how fast value flows from intent to production.
② Make work visible 🔎
Bottlenecks hide in handoffs. Use end-to-end telemetry and the knowledge graph to surface constraints early.
③ Govern by outcomes 🎯
Budget and governance follow product lines. Enforce policy pre-change and track customer impact.
Cognitive Agentic KPIs 🧠
Quantify autonomy, safety, and learning velocity of XDevOps agents.
Key Capabilities
Each box is an agent skill with guard‑rails, explanations, and full lineage.
🧭 Intent Entry Point — Task Intelligence
Capture a goal in natural language—an LLM infers the intent and emits a structured Task (JSON) that safely routes to the right cognitive agents (Provisioning, Scenario Engine, Shell Coach, Compliance). Tasks = intent, Agents = action.
⚡ !task create micro "Enable canary for checkout"
→ infers Feature • micro → Scenario Engine
🧯 "Follow up on incident #1423"
→ infers Bug Fix • lite → checklist & owners → Shell Coach
📈 "Migrate our SLOs to 99.9%"
→ infers Tech Debt/Risk • full → dependencies & observability → Provisioning & Observability agents
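The three examples above can be read as a small routing table. The sketch below is illustrative only: the `Task` fields, the `ROUTES` table, and the `route` function are hypothetical names, and the real Task schema and routing rules are not described on this page.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical routing table built from the three examples above:
# (stream, mode) -> agents. The real routing rules may differ.
ROUTES = {
    ("Feature", "micro"): ["Scenario Engine"],
    ("Bug Fix", "lite"): ["Shell Coach"],
    ("Tech Debt/Risk", "full"): ["Provisioning", "Observability"],
}


@dataclass
class Task:
    goal: str    # the natural-language goal as typed by the user
    stream: str  # inferred flow stream, e.g. "Feature"
    mode: str    # inferred task size: "micro", "lite" or "full"


def route(task: Task) -> list[str]:
    """Return the agents for a task, or raise if no route is known."""
    key = (task.stream, task.mode)
    if key not in ROUTES:
        raise ValueError(f"no route for stream/mode {key}")
    return ROUTES[key]


task = Task(goal="Enable canary for checkout", stream="Feature", mode="micro")
print(json.dumps(asdict(task)))  # the structured Task serialized as JSON
print(route(task))
```

In this sketch the LLM's only job would be to fill in `stream` and `mode`; the routing itself stays deterministic and easy to audit.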
⚙️ Event‑Driven Provisioning
Autonomously creates & reconciles infra the moment a cloud or on‑prem event fires—zero polling with full audit.
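As a rough illustration of "zero polling", the sketch below uses a tiny in-process publish/subscribe bus: the provisioning handler runs when an event is published, and nothing loops to check state. The `EventBus` class, the `team.created` event name, and the handler are assumptions made for the example, not part of XDevOps.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Tiny in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Handlers run as soon as the event arrives; nothing polls for state.
        for handler in self._handlers[event_type]:
            handler(payload)


audit_log: list[str] = []


def provision_namespace(event: dict) -> None:
    # Placeholder for real provisioning; here we only record the action.
    audit_log.append(f"provisioned namespace for {event['team']}")


bus = EventBus()
bus.subscribe("team.created", provision_namespace)
bus.publish("team.created", {"team": "checkout"})
print(audit_log)
```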
🚦 Scenario Engine
Design repeatable runbooks; the agent executes, adapts & explains each step until policy passes—then re‑plans on failure.
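One way to picture the execute, gate, and re-plan cycle is the loop below. It is a minimal sketch under assumptions: `run_scenario`, its callbacks, and the `max_replans` limit are hypothetical, and a real engine would also explain each step.

```python
from typing import Callable

Plan = list[str]


def run_scenario(
    plan: Plan,
    policy_ok: Callable[[str], bool],
    execute: Callable[[str], bool],
    replan: Callable[[Plan], Plan],
    max_replans: int = 3,
) -> list[str]:
    """Run steps in order; gate each on policy and re-plan when one fails."""
    log: list[str] = []
    replans = 0
    i = 0
    while i < len(plan):
        step = plan[i]
        if not policy_ok(step):  # pre-change policy gate
            log.append(f"blocked: {step}")
            break
        if execute(step):
            log.append(f"ok: {step}")
            i += 1
        elif replans < max_replans:
            replans += 1
            log.append(f"failed: {step}")
            # Keep completed steps, swap in a new tail starting at the failure.
            plan = plan[:i] + replan(plan[i:])
        else:
            log.append(f"gave up: {step}")
            break
    return log


# Toy run: the 50% traffic shift fails, so the re-planner tries 10% instead.
log = run_scenario(
    plan=["deploy canary", "shift 50% traffic", "verify SLO"],
    policy_ok=lambda step: "delete" not in step,
    execute=lambda step: step != "shift 50% traffic",
    replan=lambda tail: ["shift 10% traffic"] + tail[1:],
)
print(log)
```

The bounded `max_replans` is a design choice for the sketch: it keeps a persistently failing step from looping forever.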
🖥️ Interactive Shell Coach
Run commands in your own terminal—the agent annotates, fixes & learns in real time. Safer changes, faster outcomes.
📈 Observability via NLP
Query Prometheus & Loki in English—get instant charts, trends & anomaly alerts without memorising DSLs.
🎓 Certification Learning Support
Accelerate certifications—the agent crawls fresh docs, builds adaptive study plans & quizzes you to mastery.
🧠 Multi‑RAG Personalisation
Capabilities, Knowledge & Story corpora tailor every answer to your standards, repos & runbooks.
🔁 Knowledge Transfer
All chats & shell sessions are vectorised, searchable & replayable—perfect for onboarding & audits.
🧩 Git & IaC Intelligence
PRs, commits, Terraform & Helm live in vectors—ask for diffs, impact & drift instantly.
🛡️ Policy‑First Automation
A SHACL‑validated knowledge graph enforces tags, budgets & security before every change.
Autonomous Diagnostics Orchestrator (ADO)
Multi‑agent diagnostics for SRE/Platform teams. Hypotheses are generated, verified with data, and summarized with evidence.
How it works (at a glance)
- You launch: !diag checkout 5xx spike endpoint_id=42 window=45m
- Orchestrator normalizes context and issues tasks with a fresh CID.
- KB Agent enriches app context (owners, similar incidents, suspected patterns).
- Hypothesis Agent drafts 2–5 likely causes + test plan.
- Verification Agent runs PromQL/LogQL/K8s/Git checks and returns verdicts.
- Fix‑Proposal Agent synthesizes safe remediation steps with blast‑radius notes.
- Orchestrator streams progress and posts a final Markdown summary with confidence.
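The steps above can be sketched as a command parser plus a staged pipeline that shares one correlation ID (CID). Everything here is an assumption for illustration: the `Diagnosis` type, the stage names, and the use of a UUID as the CID are not taken from the product, and each stage is a stub that only records that it ran.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Diagnosis:
    cid: str                # correlation ID shared by every stage
    symptom: str            # free-text part of the command
    params: dict[str, str]  # key=value parameters from the command
    trail: list[str] = field(default_factory=list)


def parse_diag(command: str) -> Diagnosis:
    """Split '!diag <symptom words> key=value ...' into symptom and params."""
    tokens = command.split()
    if not tokens or tokens[0] != "!diag":
        raise ValueError("expected a command starting with !diag")
    params = dict(t.split("=", 1) for t in tokens[1:] if "=" in t)
    symptom = " ".join(t for t in tokens[1:] if "=" not in t)
    # A fresh CID per investigation; a UUID is just one possible format.
    return Diagnosis(cid=uuid.uuid4().hex, symptom=symptom, params=params)


# Stage order taken from the list above; the stage names are made up.
STAGES = ["kb", "hypothesis", "verification", "fix-proposal"]


def run(diag: Diagnosis) -> Diagnosis:
    for stage in STAGES:
        # Tagging every entry with the CID lets the whole run be traced later.
        diag.trail.append(f"{diag.cid} {stage}")
    return diag


diag = run(parse_diag("!diag checkout 5xx spike endpoint_id=42 window=45m"))
print(diag.symptom, diag.params)
```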
<MASKED>).
🧪 Diagnostics — Simulation
Pick an example and watch the orchestrator run a simulated investigation with a generated CID.
Provisioning & Requests — Ontology‑Driven
Cognitive planning • SHACL validation • mTLS execution • event emission
How a request flows
- Intent capture: You describe the outcome in natural language.
- Plan synthesis: Agent generates an ordered CLI plan with dependency checks.
- Ontology validation: Plan is validated in RDF via SHACL (tags, budgets, security).
- mTLS execution: Commands run via agent backends with no shell substitution.
- Event emission: Created/Deleted events flow to the graph for lineage & dashboards.
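The flow above can be approximated with a pre-change gate followed by event emission. Note that this sketch does not use SHACL or RDF: plain Python checks stand in for the shapes, and the required tags, the budget limit, and all names are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PlannedResource:
    name: str
    tags: dict[str, str]
    monthly_cost: float


# Hypothetical policy values; in XDevOps these rules live in SHACL shapes.
REQUIRED_TAGS = {"owner", "cost-center"}
BUDGET_LIMIT = 500.0


def validate(resource: PlannedResource) -> list[str]:
    """Return a list of violations; an empty list means the plan may run."""
    violations = []
    missing = REQUIRED_TAGS - set(resource.tags)
    if missing:
        violations.append(f"missing tags: {sorted(missing)}")
    if resource.monthly_cost > BUDGET_LIMIT:
        violations.append("over budget")
    return violations


def handle(resource: PlannedResource, events: list[dict]) -> bool:
    """Validate first; only a passing plan is 'executed' and emits an event."""
    if validate(resource):
        return False
    # Real execution over mTLS is out of scope for this sketch.
    events.append({"type": "Created", "name": resource.name})
    return True


events: list[dict] = []
tagged = PlannedResource("vm-checkout", {"owner": "sre", "cost-center": "42"}, 120.0)
untagged = PlannedResource("vm-untagged", {}, 900.0)
print(handle(tagged, events), handle(untagged, events), events)
```

The point of the ordering is that nothing is executed, and no event is emitted, until validation returns no violations.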
Safety & governance
- Run‑command hygiene: one script line per --scripts, no chaining (;/&&/|).
- SSH key policy: provide --ssh-key-values or auto‑generate securely.
- Non‑mutating ops: strip tags automatically to keep reads pure.
- Fixer loop: if a step fails, the agent proposes corrected steps instead of repeating the failed command.
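The run-command hygiene rule lends itself to a simple pre-flight check. The sketch below is a deliberately naive assumption of how such a check might look: `check_script_line` is a hypothetical helper, and because it matches raw characters it also rejects operators that appear inside quoted strings.

```python
import re

# Operators the hygiene rule forbids: ';', '&&' and '|' (which also covers '||').
CHAINING = re.compile(r";|&&|\|")
# Shell substitution forms: $(...) and backticks.
SUBSTITUTION = re.compile(r"\$\(|`")


def check_script_line(line: str) -> list[str]:
    """Return the reasons a script line is rejected (empty list = accepted)."""
    problems = []
    if "\n" in line:
        problems.append("more than one line")
    if CHAINING.search(line):
        problems.append("command chaining")
    if SUBSTITUTION.search(line):
        problems.append("shell substitution")
    return problems


print(check_script_line("systemctl restart nginx"))
print(check_script_line("apt update && apt upgrade"))
print(check_script_line("echo $(whoami)"))
```

Erring on the side of rejection is the safer default here: a false positive costs a rewrite of one line, while a false negative could let a chained command through.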
Product-Aligned SRE/DevOps Metrics Subset
A subset of SRE/DevOps metrics tailored to infra & platform teams, organized by Flow Streams: Feature, Bug Fix, Risk, and Technical Debt.
Flow Streams & Value Mapping
Pick a stream to highlight its purpose, leading indicators, SRE/DORA metrics, cognitive KPIs, safeguards, and economics.
| Stream | Purpose | Leading Indicators | SRE/DORA | Cognitive KPIs | Safeguards | Economics |
|---|---|---|---|---|---|---|
| 🧩 Feature | Deliver new user value | PR cycle time ↓, feature throughput ↑, review latency ↓ | Deploy freq ↑, Lead time ↓, CFR stable, SLO impact ≤ 0 | ADI ↑, E2GT ↓, GVSC ≥95%, ZTR ↑ | Pre‑flight policy, canary, cost/tag gates | $/feature ↓, NPS ↑ |
| 🛠️ Bug Fix | Restore reliability fast | MTTA ↓, bug deflection ↑, duplicate pattern match | MTTR ↓, CFR ↓, incident count ↓ | Root‑cause precision ↑, ADI(runbooks) ↑, E2GT ↓ | Safe rollback, change windows, postmortem required | Incident minutes avoided ↑, cost‑of‑quality ↓ |
| 🛡️ Risk | Reduce exposure proactively | Open risks ↓, policy failures ↓, patch lead time ↓ | Error budget burn ↓, CFR —, compliance pass ↑ | GVSC ≥99%, POCR ↑, ZTR ↑ | SHACL policy gates, mandatory controls | Risk $ avoided ↑, audit findings ↓ |
| 🧱 Tech Debt | Pay down toil & complexity | Toil hours ↓, hotspot churn ↓, flaky tests ↓ | Lead time ↑ short‑term, then ↓; CFR stable | Self‑Improvement Rate ↑, ACS ↑ | Contract tests, perf gates, backward‑compat | Cost‑to‑serve ↓, infra efficiency ↑ |
1) Discover → Frame
- Map top user intents & compliance policies.
- Connect observability & IaC repos to vectors.
- Define ontology classes for your domain.
- Tag Streams: Feature • Bug Fix • Risk • Tech Debt.
2) Pilot → Govern
- Enable SHACL policies for tags, budget, security.
- Shadow-run the agent alongside manual changes; diff plans & outcomes.
- Instrument correlation IDs & event lineage.
- Baseline SRE: Deploy Freq, Lead Time, MTTR, CFR, SLO burn.
3) Operate → Scale
- Promote runbooks to Scenario Engine.
- Shell coaching by default for risky ops.
- Onboard via replayable sessions & evidence.
- Add Cognitive KPIs: ADI, E2GT, GVSC, ZTR, POCR.
KPI Helper
Quick glossary for Cognitive Agentic KPIs referenced above.
ADI
Autonomy Depth Index — how many steps agents complete without intervention.
E2GT
Event‑to‑Goal Time — median time from event ingestion to verified outcome.
GVSC
Graph‑Validated Safety Coverage — % actions passing SHACL gates.
ZTR
Zero‑Touch Rate — share of requests completed with no manual edits.
POCR
Preventive Opportunity Capture Rate — % predicted issues acted on early.
ACS
Agent Correctness Score — judged accuracy of agent decisions.
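Three of the KPIs in this glossary reduce to simple arithmetic over per-action records. The sketch below assumes a hypothetical `ActionRecord` shape and treats one record as one request; how XDevOps actually collects and aggregates these numbers is not specified here.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ActionRecord:
    passed_shacl: bool      # did the action pass the SHACL gate?
    manual_edits: int       # human edits made before completion
    seconds_to_goal: float  # event ingestion to verified outcome


def gvsc(records: list[ActionRecord]) -> float:
    """Graph-Validated Safety Coverage: % of actions that passed the gate."""
    return 100 * sum(r.passed_shacl for r in records) / len(records)


def ztr(records: list[ActionRecord]) -> float:
    """Zero-Touch Rate: share of records completed with no manual edits."""
    return sum(r.manual_edits == 0 for r in records) / len(records)


def e2gt(records: list[ActionRecord]) -> float:
    """Event-to-Goal Time: median seconds from event to verified outcome."""
    return median(r.seconds_to_goal for r in records)


records = [
    ActionRecord(True, 0, 30.0),
    ActionRecord(True, 0, 50.0),
    ActionRecord(True, 1, 70.0),
    ActionRecord(False, 0, 90.0),
]
print(gvsc(records), ztr(records), e2gt(records))  # 75.0 0.75 60.0
```

All three functions expect a non-empty list; an empty one raises an error rather than reporting a misleading zero.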
Deployment Targets
Same cognitive core, different footprints — PaaS • Hybrid • On‑Prem.
☁️ PaaS
Managed multi‑tenant control‑plane. Fastest start, zero infra to manage.
- Org‑scoped tenant & keys.
- mTLS agent connectors.
- SLA & security hardening.
Free Tier:
- Up to 2 agent connectors & 1 environment
- 300 actions/month, 7‑day event retention
- 3 seats (SSO ready), community support
- No credit card during beta
🏗️ Hybrid
Cloud control‑plane + on‑prem agents for restricted or air‑gapped workloads.
- Outbound‑only agents.
- Private RAG stores.
- Bring‑your‑KMS.
🔒 On‑Prem
Single‑tenant, fully isolated. All data & models within your perimeter.
- Self‑host GraphDB/Milvus.
- Offline updates.
- Custom compliance packs.
Talk to us
We’re opening soon. Free Tier available for PaaS. Get early access or schedule time with the team.