blog

Field notes on running DePIN infrastructure at scale: validator operations, GPU inference, decentralized storage, and Web3 DevOps. No fluff — just what we ship, what breaks, and what we learn fixing it.

2026-06-09 by ZATVA team

Run infra yourself or hand it to a DevOps team: where the line is

The recurring decision: run your nodes, GPUs, and provers yourself or hand them to a team like us. There is a real line, and it is not where people usually draw it. We cover it honestly, including the cases where you should not hire us.

#operations #devops #managed #web3 #ai
2026-05-28 by ZATVA team

Bare metal supply chain 2026: lead times, vendors, deploy window

Teams planning their own infra underestimate one thing: hardware does not arrive when you order it. In 2026, lead times on GPUs and some server parts break deploy schedules. We cover how we plan around the supply chain.

#infrastructure #supply-chain #bare-metal #deploy #gpu
2026-05-07 by ZATVA team

Validator observability: what we alert on, and what we don't

Most validator monitoring alerts on 'the process is alive' and drowns the on-call in noise while missing what loses money. Building observability is choosing what NOT to alert on. We cover the discipline.

#observability #monitoring #operations #sre #validators
2026-04-09 by ZATVA team

Multi-region in 72 hours: our burst-deploy runbook

A protocol announces an incentivized testnet and wants nodes in several regions within 72 hours. You cannot ship metal that fast. We cover the runbook we use to stand up a multi-region burst and tear it down without a trace.

#deploy #multi-region #burst #scale #automation
2026-03-20 by ZATVA team

Case study: incentivized testnet, 50-node burst in 72h, top-5 operator

Anonymized case study: a burst engagement for an incentivized testnet. 50 nodes in 72h, finished top-5 by uptime. And the hidden AWS quota that almost killed the first 24 hours.

#case-study #testnet #burst #multi-region
2026-02-11 by ZATVA team

DePIN nodes: what actually breaks in ops

DePIN operations look simple: run lots of small nodes. In reality, rewards depend not on 'the node is alive' but on passing the network's checks, and what usually breaks is not the node software but the operational glue. We cover what we monitor.

#depin #operations #monitoring #reward
2026-01-15 by ZATVA team

Case study: DePIN sub-operator, 200 nodes across 8 regions, 99.94% uptime

Anonymized case study: ongoing operations for a DePIN sub-operator. 200 nodes across 8 regions, 99.94% uptime, and a 28-hour DNS-cache incident that ate 1.8% of reward.

#case-study #depin #operations #dns
2025-12-03 by ZATVA team

ZK prover farm: where the money leaks

A prover farm is the most expensive infrastructure in a ZK stack, and the money leaks where you do not expect. We cover what actually drives cost-per-proof: the queue, utilization, memory, and card availability.

#zk #prover #gpu #infrastructure #scale
2025-11-05 by ZATVA team

Case study: vLLM inference across 3 regions, -60% cost per token

Anonymized case study: an LLM startup, 4 months on inference infrastructure. -60% cost per token, p99 latency unchanged. And a 36-hour hunt for the missing 30% throughput.

#case-study #ai #llm #vllm #gpu
2025-10-08 by ZATVA team

Slashing is not about uptime

99.95% uptime is the metric everyone quotes and the wrong one for slashing risk. Uptime and slashing risk are nearly orthogonal. We cover where the danger actually lives and what we monitor instead of an uptime percentage.

#validators #slashing #observability #operations #web3
2025-09-24 by ZATVA team

Bare metal or cloud GPU: the cost-per-token reality

For validators we almost always land on bare metal. For GPU inference the calculus is different. We cover where cloud GPU honestly wins, where bare metal beats it on cost-per-token, and which cloud costs never show up in the price list.

#ai #llm #gpu #infrastructure #cost-per-token
2025-09-10 by ZATVA team

Case study: ZK rollup, 6 months of validator ops, slashing: 0

Anonymized case study: validator ops + prover farm for a ZK rollup. 6 months in production, zero slashing, and a 48-hour incident where proof ETA drifted.

#case-study #zk #validators #prover-farm
2025-08-12 by ZATVA team

GPU inference: why vLLM is our default

When a client shows up with 'deploy our LLM,' the first infra question is the inference engine. We explain why our default is vLLM, where it actually wins on cost-per-token, and where we still reach for TensorRT-LLM.

#ai #llm #vllm #gpu #inference
2025-07-30 by ZATVA team

Failover that can't double-sign

The scariest thing in validator ops is not a node going down; it is two nodes that both believe they are the active signer. We cover why ordinary failover is dangerous and how we build switching that cannot get slashed.

#validators #failover #slashing #operations #web3
2025-07-15 by @drxim

Cloud vs bare metal for L1 validators, mid-2025

By mid-2025, bare metal beats cloud on every axis that actually matters for an L1 validator, except one. Here are the numbers from our fleet and why we still keep some nodes in the cloud.

#validators #infrastructure #cloud #bare-metal
2025-06-18 by ZATVA team

What a node actually costs: TCO without the price list

The hosting provider's price is one of seven line items. We break down the real TCO of a node and show which items stay with the provider and which are carried by the team that runs your infra.

#operations #infrastructure #tco #validators
2025-05-31 by XIMTRX team

Welcome to XIMTRX

Why we built XIMTRX, how 85 nodes across 11 countries came together, and what we're publishing here.

#depin #infrastructure #operations
2025-05-31 by ZATVA team

Welcome to ZATVA

Why we built ZATVA, how 85 nodes across 11 countries came together, and what we're publishing here.

#depin #infrastructure #operations
2025-05-28 by ZATVA team

Why we publish our node inventory

Transparency is rare in node operations. Here's why we made our full 85-node inventory public and what it changes for clients.

#transparency #depin #operations
