24/7 SRE on-call for Web3 and AI infrastructure

We own the pager. PagerDuty rotations, a signed SLA, and a post-mortem after every Sev-1. Default Sev-1 acknowledgement target: under 15 minutes at p95. Alerts go to engineers, not voicemail.

what's included

The operate phase starts after handoff from deploy, or when a retainer picks up existing infrastructure.

Scope	Owned by us	Owned by you
24/7 on-call, PagerDuty rotations, escalations	✓ Owned	-
Incident triage and runbook-driven response	✓ Owned	-
Runbook versioning and updates	✓ Owned	-
OS patching, security updates, certificate rotation	✓ Owned	-
Sev-1 post-mortems, trend reports	✓ Owned	-
Application / contract release decisions	- (we ship on your cue)	✓ Owned
Architectural change decisions	- (we recommend)	✓ Owned
Provider and stakeholder billing	-	✓ Owned

how we respond to incidents

Every alert hits a severity classification before it pages anyone. Sev-1 means money or keys are at risk: validator missing blocks, double-sign risk, GPU OOM in prod, a proof job stalled near a deadline. Sev-2 is SLO degradation without immediate loss. Sev-3 is a bug that can wait for business hours.
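A minimal Python sketch of that classification step. The signal names and the `classify` helper are hypothetical and only mirror the examples above; a production rule set would be more detailed, and defaulting unknown signals to Sev-3 is an assumption of this sketch.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # money or keys at risk
    SEV2 = 2  # SLO degradation without immediate loss
    SEV3 = 3  # can wait for business hours


# Hypothetical signal names, mirroring the examples in the text above.
SEV1_SIGNALS = {
    "validator_missing_blocks",
    "double_sign_risk",
    "gpu_oom_prod",
    "proof_job_stalled_near_deadline",
}
SEV2_SIGNALS = {"slo_degradation"}


def classify(signal: str) -> Severity:
    """Map an alert signal to a severity before anyone is paged."""
    if signal in SEV1_SIGNALS:
        return Severity.SEV1
    if signal in SEV2_SIGNALS:
        return Severity.SEV2
    # Assumption: anything unrecognised is treated as Sev-3.
    return Severity.SEV3
```

For example, `classify("double_sign_risk")` returns `Severity.SEV1`, while an unlisted signal falls through to `Severity.SEV3`.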

Each Sev-1 has a runbook with recovery steps, escalation owners and side-effects to watch for. Runbooks live in Git next to IaC: every change is reviewed, nothing happens "from memory". The library is currently 147 runbooks across the 4 ICPs we serve (Web3, AI / LLM, ZK, DePIN).

Time to acknowledge on Sev-1: <15 min at p95. Time to restore depends on the failure mode, but both numbers land in monthly SLA reports. After every Sev-1: a blameless post-mortem within 24h, with action items and owners.

Default observability stack: Prometheus + Grafana + Loki + OpenTelemetry. PagerDuty for on-call. If you already run Datadog, Honeycomb, or a custom setup, we work on top.

stack we ship with

Web3: Cosmos SDK Geth Reth OP Stack Arbitrum Orbit Polygon CDK EigenDA Celestia

AI / LLM: vLLM Triton TensorRT-LLM NVIDIA H100 / A100 Ray Kubeflow

ZK: SP1 RISC Zero Boundless Brevis Jolt Halo2

DePIN: Filecoin Akash Render io.net Gensyn

Platform: Kubernetes Terraform Ansible Prometheus Grafana Loki OpenTelemetry PagerDuty

engagement models

Operate is a long-running engagement by nature. We match the model to infra maturity and criticality.

[ FIXED PLAN ] →

Operate readiness audit. 48h: we assess current infra, runbook gaps, SLO readiness.

For when you're not sure your infra is ready for a signed SLA.

[ RETAINER ] →

Signed SLA + on-call. Bronze / Silver / Gold tiers. Monthly billing.

The default for operate. Most clients live here.

[ T&M ] →

Burst on-call. For one-off events: mainnet launches, hard-forks, incentivized testnets with deadlines.

When you need to reinforce your team for 2-8 weeks.

severity matrix

What counts as which Sev and what targets apply by default.

Severity	Examples	Ack target (p95)	Post-mortem
Sev-1	Validator missing blocks, double-sign risk, inference outage, prover offline against a deadline	<15 min 24/7	Within 24h
Sev-2	SLO degradation without loss, single-node failure with redundancy in place, p95 latency drift	<1h business hours, <2h overnight	Within 5 business days
Sev-3	Backlog bug, pending upgrade, scheduled window	Next business day	-
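To make the p95 ack target concrete, here is a small Python sketch that checks a month of Sev-1 acknowledgement times against the default 15-minute target. The function names and the nearest-rank percentile method are assumptions of this sketch, not a description of how the SLA reports are computed.

```python
# Default Sev-1 ack target from the matrix above, in minutes.
SEV1_ACK_TARGET_MIN = 15.0


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based nearest rank: ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_sev1_ack_target(ack_minutes: list) -> bool:
    """True when the p95 acknowledgement time is under the target."""
    return p95(ack_minutes) < SEV1_ACK_TARGET_MIN
```

With only a handful of incidents in a month, the nearest-rank p95 is simply the slowest acknowledgement, so a single slow page is enough to miss the target.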

what we'd build for you

Real coverage patterns across the four ICPs.

24/7 validator monitoring with auto-failover and missed-block escalation. Alerts on double-sign signals, slashing watch, upgrade-window recommendations.
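Missed-block escalation can be sketched as a sliding window over recent blocks. The class name, window size and threshold below are illustrative assumptions; real values depend on the chain's slashing parameters.

```python
from collections import deque


class MissedBlockWatch:
    """Track the last `window` blocks and flag when misses reach `threshold`."""

    def __init__(self, window: int = 100, threshold: int = 5) -> None:
        self.threshold = threshold
        # Each entry is True when the validator missed that block.
        self.recent = deque(maxlen=window)

    def record(self, signed: bool) -> bool:
        """Record one block; return True if escalation should fire."""
        self.recent.append(not signed)
        return sum(self.recent) >= self.threshold
```

Old blocks drop out of the window automatically, so the watch stops escalating once the validator has signed enough consecutive blocks again.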

GPU health watch, OOM auto-recovery, model-rollback playbooks. p95 latency SLO, cost-per-token monitoring, capacity planning for peaks.
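Cost-per-token monitoring comes down to simple arithmetic over GPU spend and throughput. A minimal sketch, assuming a flat hourly GPU price and a measured tokens-per-second figure; the function name and all inputs are hypothetical.

```python
def cost_per_million_tokens(
    gpu_hourly_usd: float, gpu_count: int, tokens_per_second: float
) -> float:
    """USD cost to generate one million tokens at the given throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    hourly_cost = gpu_hourly_usd * gpu_count
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000
```

As an illustrative input, one GPU at 3.60 USD per hour serving 1,000 tokens per second produces 3.6 million tokens per hour, which works out to about 1 USD per million tokens.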

Prover liveness checks tied to network deadlines. Proof-job queue monitoring, GPU utilization tracking, escalation when block deadlines are at risk.
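One way to decide when a proof queue puts a deadline at risk is to estimate the drain time and compare it with the time left. The sketch below assumes equal-length jobs, identical provers and a safety factor of 1.5; all names and defaults are hypothetical.

```python
def deadline_at_risk(
    queued_jobs: int,
    avg_proof_seconds: float,
    provers: int,
    seconds_to_deadline: float,
    safety_factor: float = 1.5,
) -> bool:
    """True when the queue is unlikely to drain before the deadline."""
    if queued_jobs <= 0:
        return False
    if provers <= 0:
        return True  # no live provers: nothing will finish
    # Jobs run in "waves" of one job per prover; round up.
    waves = -(-queued_jobs // provers)
    estimated_seconds = waves * avg_proof_seconds
    return estimated_seconds * safety_factor > seconds_to_deadline
```

The safety factor makes the check fire early, so escalation happens while there is still time to add provers or reprioritise the queue.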

Per-node uptime SLA tracking with payout reconciliation. Auto-restarts, regional monitoring, weekly reward reconciliation against network dashboards.
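Reward reconciliation can be sketched as a comparison between what each node should have earned and what the network dashboard reports. The function name, the dictionary inputs and the 1% relative tolerance are assumptions for illustration.

```python
def reconcile(expected: dict, reported: dict, tolerance: float = 0.01) -> dict:
    """Return node_id -> (reported - expected) for nodes outside tolerance.

    `tolerance` is relative to the expected amount; a node missing from
    either side is treated as 0 on that side.
    """
    mismatches = {}
    for node_id in sorted(set(expected) | set(reported)):
        want = expected.get(node_id, 0.0)
        got = reported.get(node_id, 0.0)
        diff = got - want
        if abs(diff) > tolerance * abs(want):
            mismatches[node_id] = diff
    return mismatches
```

A negative value means the node was underpaid relative to expectation; a node that appears only in the reported data is always flagged.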

SLA tiers

Three coverage levels. Pick by criticality and budget.

Tier	Response p95 (Sev-1)	Coverage	Incident report	Engineer hours / mo
Bronze	30 min	Business hours, 5×8	Within 48h	40
Silver	15 min	24/7 on-call rotation	Within 24h	80
Gold	5 min	24/7 with dedicated engineer	Within 12h	160+
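The tier table can be read as a lookup: given a required Sev-1 response time and whether 24/7 coverage is needed, pick the smallest tier that satisfies both. This is a sketch of that reading, with hypothetical names; Gold's "160+" hours are stored as the minimum of 160.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    sev1_response_min: int   # Response p95 (Sev-1), minutes
    covers_24x7: bool
    report_within_h: int     # incident report deadline, hours
    hours_per_month: int     # included engineer hours (minimum for Gold)


# Values copied from the table above, ordered by included hours.
TIERS = [
    Tier("Bronze", 30, False, 48, 40),
    Tier("Silver", 15, True, 24, 80),
    Tier("Gold", 5, True, 12, 160),
]


def smallest_tier(max_response_min: int, needs_24x7: bool) -> Optional[Tier]:
    """Return the first tier meeting both requirements, or None."""
    for tier in TIERS:
        fast_enough = tier.sev1_response_min <= max_response_min
        coverage_ok = tier.covers_24x7 or not needs_24x7
        if fast_enough and coverage_ok:
            return tier
    return None
```

For instance, a requirement of 20 minutes with 24/7 coverage resolves to Silver, because Bronze covers business hours only.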

FAQ

Can on-call be added to existing infra?

Yes. We start with an operate-readiness audit (48h): we check monitoring, runbooks, SLOs, escalation paths. Any gaps get closed before the SLA is signed.

What SLA do you sign?

Tiered: Bronze (5×8, 30 min), Silver (24/7, 15 min), Gold (24/7 dedicated, 5 min). Targets cover response, post-mortem timing and monthly hours. Compensation clauses are negotiated case-by-case.

What about holiday coverage?

24/7 means 24/7. Holidays and weekends are covered in Silver and Gold. No "we'll try to reach someone": there's a PagerDuty rotation with an explicit escalation chain.

Whose runbooks, yours or ours?

Shared. They live in your Git; we write and maintain them. When the engagement ends, they stay with you, with no lock-in to our tooling.

How does pricing change as infra grows?

The retainer is re-quoted on meaningful scope changes (e.g. +50% nodes or a new ICP). Small changes inside a Silver/Gold tier are absorbed by the engineer hours included in the tier.
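As a rough sketch, the two example triggers above (node growth of 50% or more, or a new ICP) could be checked like this. The function name and the exact threshold handling are assumptions; the actual re-quote decision is made case by case.

```python
def needs_requote(
    old_nodes: int,
    new_nodes: int,
    old_icps: set,
    new_icps: set,
    growth_threshold: float = 0.5,
) -> bool:
    """True when scope changed enough to trigger a re-quote in this sketch."""
    added_icp = bool(new_icps - old_icps)
    if old_nodes <= 0:
        # No baseline to compare against: any node or new ICP counts.
        return new_nodes > 0 or added_icp
    growth = (new_nodes - old_nodes) / old_nodes
    return added_icp or growth >= growth_threshold
```

Growing from 10 to 14 nodes stays inside the tier under this rule, while growing to 15 nodes or adding a ZK workload to a Web3-only engagement would trigger a re-quote.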