24/7 SRE on-call for Web3 and AI infrastructure

We own the pager. PagerDuty rotations, signed SLA, post-mortem after every Sev-1. Mean time to acknowledge: minutes. Alerts go to engineers, not voicemail.

what's included

The operate phase starts after handoff from deploy, or when a retainer picks up existing infrastructure.

Scope | Owned by us | Owned by you
24/7 on-call, PagerDuty rotations, escalations | ✓ Owned | -
Incident triage and runbook-driven response | ✓ Owned | -
Runbook versioning and updates | ✓ Owned | -
OS patching, security updates, certificate rotation | ✓ Owned | -
Sev-1 post-mortems, trend reports | ✓ Owned | -
Application / contract release decisions | - (we ship on your cue) | ✓ Owned
Architectural change decisions | - (we recommend) | ✓ Owned
Provider and stakeholder billing | - | ✓ Owned

how we respond to incidents

Every alert hits a severity classification before it pages anyone. Sev-1 means money or keys are at risk: validator missing blocks, double-sign risk, GPU OOM in prod, a proof job stalled near a deadline. Sev-2 is SLO degradation without immediate loss. Sev-3 is a bug that can wait for business hours.
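
A minimal sketch of that classification step, assuming each alert carries two hypothetical boolean flags (`funds_or_keys_at_risk`, `slo_degraded`). Real classification rules would be more detailed; the names and the paging rule here are illustrative only.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # money or keys at risk
    SEV2 = 2  # SLO degradation without immediate loss
    SEV3 = 3  # can wait for business hours


def classify(funds_or_keys_at_risk: bool, slo_degraded: bool) -> Severity:
    """Map two hypothetical alert flags to a severity level."""
    if funds_or_keys_at_risk:
        return Severity.SEV1
    if slo_degraded:
        return Severity.SEV2
    return Severity.SEV3


def should_page_now(severity: Severity) -> bool:
    """Assumption: only Sev-3 is deferred to business hours."""
    return severity is not Severity.SEV3
```

For example, `classify(True, False)` yields `Severity.SEV1`, which pages immediately, while `classify(False, False)` yields `Severity.SEV3`, which does not.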

Each Sev-1 has a runbook with recovery steps, escalation owners and side-effects to watch for. Runbooks live in Git next to IaC: every change is reviewed, nothing happens "from memory". The library is currently 147 runbooks across the 4 ICPs we serve.

Time to acknowledge on Sev-1: <15 min at p95. Mean time to restore depends on the failure mode. Both numbers land in monthly SLA reports. After every Sev-1: blameless post-mortem within 24h, with action items and owners.
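
One way to check the p95 acknowledgement figure from a month of incidents, as a sketch. It assumes the nearest-rank percentile definition and a plain list of ack times in minutes; the reporting pipeline may define p95 differently.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest sample with at least
    95% of all samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding float rounding; rank is 1-based.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_ack_target(ack_minutes: list[float], target_minutes: float = 15.0) -> bool:
    """True when the p95 ack time is strictly under the target."""
    return p95(ack_minutes) < target_minutes
```

With 20 samples, the nearest-rank p95 is the 19th smallest value, so a single slow acknowledgement out of 20 does not break the target, but two do.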

Default observability stack: Prometheus + Grafana + Loki + OpenTelemetry. PagerDuty for on-call. If you already run Datadog, Honeycomb, or a custom setup, we work on top of it.

stack we ship with

Web3: Cosmos SDK Geth Reth OP Stack Arbitrum Orbit Polygon CDK EigenDA Celestia
AI / LLM: vLLM Triton TensorRT-LLM NVIDIA H100 / A100 Ray Kubeflow
ZK: SP1 RISC Zero Boundless Brevis Jolt Halo2
DePIN: Filecoin Akash Render io.net Gensyn
Platform: Kubernetes Terraform Ansible Prometheus Grafana Loki OpenTelemetry PagerDuty

engagement models

Operate is a long-running engagement by nature. We match the model to infra maturity and criticality.

severity matrix

What counts as which Sev and what targets apply by default.

Severity | Examples | Ack target (p95) | Post-mortem
Sev-1 | Validator missing blocks, double-sign risk, inference outage, prover offline against a deadline | <15 min, 24/7 | Within 24h
Sev-2 | SLO degradation without loss, single-node failure with redundancy in place, p95 latency drift | <1h business hours, <2h overnight | Within 5 business days
Sev-3 | Backlog bug, pending upgrade, scheduled window | Next business day | -
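
The default ack targets in the matrix can be expressed as a small lookup. This is a sketch: the business-hours window (Mon-Fri, 09:00-18:00) is an assumption, since the page does not define it, and everything outside that window is treated as "overnight".

```python
from datetime import datetime, timedelta
from typing import Optional

BUSINESS_START_HOUR = 9   # assumption, not stated on this page
BUSINESS_END_HOUR = 18    # assumption, not stated on this page


def in_business_hours(ts: datetime) -> bool:
    """Mon-Fri inside the assumed business window."""
    return ts.weekday() < 5 and BUSINESS_START_HOUR <= ts.hour < BUSINESS_END_HOUR


def ack_target(severity: int, raised_at: datetime) -> Optional[timedelta]:
    """Default ack window per the matrix; None means 'next business day'."""
    if severity == 1:
        return timedelta(minutes=15)  # applies 24/7
    if severity == 2:
        if in_business_hours(raised_at):
            return timedelta(hours=1)
        return timedelta(hours=2)
    if severity == 3:
        return None
    raise ValueError(f"unknown severity: {severity}")
```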

what we'd build for you

Real coverage patterns across the four ICPs.

Web3 / Validators

24/7 validator monitoring with auto-failover and missed-block escalation. Alerts on double-sign signals, slashing watch, upgrade-window recommendations.
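
Missed-block escalation can be sketched as a sliding-window counter. The window size and threshold below are hypothetical; in practice they would be tuned per network and its slashing parameters.

```python
from collections import deque


class MissedBlockWatch:
    """Tracks the last `window` blocks and flags when too many were missed."""

    def __init__(self, window: int = 100, max_missed: int = 5) -> None:
        self.max_missed = max_missed
        self._recent = deque(maxlen=window)  # True means the block was missed

    def record(self, signed: bool) -> bool:
        """Record one block; return True when the window breaches the threshold."""
        self._recent.append(not signed)
        return sum(self._recent) > self.max_missed
```

Because the deque has a fixed length, old misses age out on their own: a validator that recovers and signs a full window of blocks stops escalating.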

AI / LLM Inference

GPU health watch, OOM auto-recovery, model-rollback playbooks. p95 latency SLO, cost-per-token monitoring, capacity planning for peaks.
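
Cost-per-token monitoring reduces to simple arithmetic once GPU price and throughput are known. A sketch, assuming a flat hourly GPU price and a measured aggregate tokens-per-second figure (both inputs are hypothetical):

```python
def cost_per_million_tokens(
    gpu_hourly_usd: float, gpu_count: int, tokens_per_second: float
) -> float:
    """USD per one million generated tokens at the given aggregate throughput."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    hourly_cost = gpu_hourly_usd * gpu_count
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000
```

Tracking this value over time shows whether a drop in throughput (for example after a model or batch-size change) is quietly raising serving cost.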

ZK / Prover Farms

Prover liveness checks tied to network deadlines. Proof-job queue monitoring, GPU utilization tracking, escalation when block deadlines are at risk.
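
A deadline-risk check for a proof queue can be as simple as comparing the projected drain time with the time left. This sketch assumes a steady throughput estimate and a hypothetical safety margin; real provers have burstier behaviour.

```python
def deadline_at_risk(
    jobs_queued: int,
    jobs_per_minute: float,
    minutes_to_deadline: float,
    safety_margin_minutes: float = 5.0,  # hypothetical default
) -> bool:
    """True when the queue is not projected to drain before the deadline."""
    if jobs_per_minute <= 0:
        # A stalled prover is at risk as soon as any work is pending.
        return jobs_queued > 0
    eta_minutes = jobs_queued / jobs_per_minute
    return eta_minutes > minutes_to_deadline - safety_margin_minutes
```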

DePIN / Distributed Networks

Per-node uptime SLA tracking with payout reconciliation. Auto-restarts, regional monitoring, weekly reward reconciliation against network dashboards.
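
Reward reconciliation compares what each node should have earned with what the network reports. A sketch, assuming both sides are available as per-node totals keyed by a hypothetical node ID; the tolerance value is illustrative.

```python
from decimal import Decimal


def reconcile(
    expected: dict[str, Decimal],
    observed: dict[str, Decimal],
    tolerance: Decimal = Decimal("0.01"),
) -> dict[str, Decimal]:
    """Return node_id -> (observed - expected) where the gap exceeds tolerance."""
    gaps: dict[str, Decimal] = {}
    for node_id in sorted(expected.keys() | observed.keys()):
        diff = observed.get(node_id, Decimal("0")) - expected.get(node_id, Decimal("0"))
        if abs(diff) > tolerance:
            gaps[node_id] = diff
    return gaps
```

`Decimal` is used instead of `float` so that small token amounts do not pick up rounding noise. Nodes present on only one side show up with their full amount as the gap.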

SLA tiers

Three coverage levels. Pick by criticality and budget.

Tier | Response p95 (Sev-1) | Coverage | Incident report | Engineer hours / mo
Bronze | 30 min | Business hours, 5×8 | Within 48h | 40
Silver | 15 min | 24/7 on-call rotation | Within 24h | 80
Gold | 5 min | 24/7 with dedicated engineer | Within 12h | 160+
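
The tier table can be read as a lookup from requirements to the smallest tier that satisfies them. A sketch with the values above; the selection function and field names are hypothetical, and Gold's "160+" is stored as its floor of 160.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    response_p95_min: int
    around_the_clock: bool
    report_within_hours: int
    engineer_hours: int  # per month; Gold is "160+", stored as 160


# Ordered from smallest to largest commitment.
TIERS = (
    Tier("Bronze", 30, False, 48, 40),
    Tier("Silver", 15, True, 24, 80),
    Tier("Gold", 5, True, 12, 160),
)


def smallest_tier(max_response_min: int, needs_24x7: bool) -> Optional[Tier]:
    """First tier that meets the response target and coverage need, else None."""
    for tier in TIERS:
        if tier.response_p95_min <= max_response_min and (
            tier.around_the_clock or not needs_24x7
        ):
            return tier
    return None
```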

FAQ

Yes. We start with an operate-readiness audit (48h): we check monitoring, runbooks, SLOs, escalation paths. Any gaps get closed before the SLA is signed.

Tiered: Bronze (5×8, 30 min), Silver (24/7, 15 min), Gold (24/7 dedicated, 5 min). Targets cover response, post-mortem timing and monthly hours. Compensation clauses are negotiated case-by-case.

24/7 means 24/7. Holidays and weekends are covered in Silver and Gold. No "we'll try to reach someone": there's a PagerDuty rotation with an explicit escalation chain.

Shared. They live in your Git; we write and maintain them. When the engagement ends, they stay with you: no lock-in to our tooling.

The retainer is re-quoted on meaningful scope changes (e.g. +50% nodes or a new ICP). Small changes inside a Silver/Gold tier are absorbed by the engineer hours included in the tier.
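
The two examples above (+50% nodes, a new ICP) can be written as a simple check. This is a sketch only: the text gives them as examples of a meaningful change, not as contractual thresholds, and the function and parameter names are hypothetical.

```python
def needs_requote(
    baseline_nodes: int,
    current_nodes: int,
    baseline_icps: set[str],
    current_icps: set[str],
) -> bool:
    """True when node count grew by at least 50% or a new ICP was added."""
    # Integer form of current >= 1.5 * baseline, avoiding float comparison.
    grew_by_half = current_nodes * 2 >= baseline_nodes * 3
    new_icp = bool(current_icps - baseline_icps)
    return grew_by_half or new_icp
```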

ready to hand over the pager?

Tell us about the workload. We reply within 24 hours.