blog
Field notes on running DePIN infrastructure at scale: validator operations, GPU inference, decentralized storage, and Web3 DevOps. No fluff — just what we ship, what breaks, and what we learn fixing it.
-
Run infra yourself or hand it to a DevOps team: where the line is
The recurring decision: run your nodes, GPUs, and provers yourself or hand them to a team like us. There is a real line, and it is not where people usually draw it. We cover it honestly, including the cases where you should not hire us.
-
Bare metal supply chain 2026: lead times, vendors, deploy window
Teams planning their own infra underestimate one thing: hardware does not arrive when you order it. In 2026, lead times on GPUs and some server parts break deploy schedules. We cover how we plan around the supply chain.
-
Validator observability: what we alert on, and what we don't
Most validator monitoring alerts on 'the process is alive' and drowns the on-call in noise while missing what loses money. Building observability is choosing what NOT to alert on. We cover the discipline.
-
Multi-region in 72 hours: our burst-deploy runbook
A protocol announces an incentivized testnet and wants nodes in several regions within 72 hours. You cannot ship metal that fast. We cover the runbook we use to stand up a multi-region burst and tear it down without a trace.
-
Case study: incentivized testnet, 50-node burst in 72h, top-5 operator
Anonymized case study: a burst engagement for an incentivized testnet. 50 nodes in 72h, finished top-5 by uptime. And the hidden AWS quota that almost killed the first 24 hours.
-
DePIN nodes: what actually breaks in ops
DePIN operations look simple: run lots of small nodes. In reality, reward is tied not to 'the node is alive' but to passing the network's checks, and what usually breaks is not the node software but the operational glue around it. We cover what we monitor.
-
Case study: DePIN sub-operator, 200 nodes across 8 regions, 99.94% uptime
Anonymized case study: ongoing operations for a DePIN sub-operator. 200 nodes across 8 regions, 99.94% uptime, and a 28-hour DNS-cache incident that ate 1.8% of reward.
-
ZK prover farm: where the money leaks
A prover farm is the most expensive infrastructure in a ZK stack, and the money leaks where you do not expect. We cover what actually drives cost-per-proof: the queue, utilization, memory, and card availability.
-
Case study: vLLM inference across 3 regions, -60% cost per token
Anonymized case study: an LLM startup, 4 months on inference infrastructure. -60% cost per token, p99 latency unchanged. And a 36-hour hunt for the missing 30% throughput.
-
Slashing is not about uptime
99.95% uptime is the metric everyone quotes and the wrong one for slashing risk. Uptime and slashing risk are nearly orthogonal. We cover where the danger actually lives and what we monitor instead of an uptime percentage.
-
Bare metal or cloud GPU: the cost-per-token reality
For validators we almost always land on bare metal. For GPU inference the calculus is different. We cover where cloud GPU honestly wins, where bare metal beats it on cost-per-token, and which cloud costs never show up in the price list.
-
Case study: ZK rollup, 6 months of validator ops, slashing: 0
Anonymized case study: validator ops + prover farm for a ZK rollup. 6 months in production, zero slashing, and a 48-hour incident where proof ETA drifted.
-
GPU inference: why vLLM is our default
When a client shows up with 'deploy our LLM,' the first infra question is the inference engine. We explain why our default is vLLM, where it actually wins on cost-per-token, and where we still reach for TensorRT-LLM.
-
Failover that can't double-sign
The scariest thing in validator ops is not a node going down; it is two nodes that both believe they are the active signer. We cover why ordinary failover is dangerous and how we build switching that cannot get slashed.
-
Cloud vs bare metal for L1 validators, mid-2025
By mid-2025, bare metal beats cloud on every axis that actually matters for an L1 validator, except one. Here are the numbers from our fleet and why we still keep some nodes in the cloud.
-
What a node actually costs: TCO without the price list
The hoster's price is one of seven lines. We break down the real TCO of a node and show which lines stay with the hoster and which are carried by the team that runs your infra.
-
Welcome to XIMTRX
Why we built XIMTRX, how 85 nodes across 11 countries came together, and what we're publishing here.
-
Welcome to ZATVA
Why we built ZATVA, how 85 nodes across 11 countries came together, and what we're publishing here.
-
Why we publish our node inventory
Transparency is rare in node operations. Here's why we made our full 85-node inventory public and what it changes for clients.