Building a Climate Hazard Pipeline: The Infrastructure Work Behind the Science

We model climate risk. That's easy to say. The harder part is that climate data means something very specific: large files from multiple regional climate models, one per emissions scenario per variable, organized into ensembles you have to process in parallel, aggregate, and serve as geospatial outputs. Everything we built this year follows directly from that data reality, not from abstract engineering preference.
What the pipeline actually does
For each climate hazard we run an ensemble of models across several emissions scenarios and time horizons. Each model gets its own parallel processing job: compute the features from raw climate input, write intermediate results to cloud storage, and signal completion so the rest of the ensemble can be waited on. Once all models are done, a second stage reads everything, builds ensemble statistics, classifies results into severity tiers, and pushes geospatial outputs into our modeling database. Two output types (likelihood and severity) across multiple scenarios and years. It adds up fast.
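The aggregation step of the second stage can be sketched in a few lines. Everything here is illustrative: the statistics, the tier labels, and the thresholds are stand-ins, since the real cut-offs are hazard-specific.

```python
import statistics

# Hypothetical severity tiers; the actual labels and thresholds are model-specific.
SEVERITY_TIERS = [(0.75, "high"), (0.5, "medium"), (0.0, "low")]

def aggregate_ensemble(model_outputs):
    """Combine per-model hazard values for one location into ensemble statistics."""
    return {
        "mean": statistics.fmean(model_outputs),
        "spread": max(model_outputs) - min(model_outputs),
    }

def classify_severity(value, tiers=SEVERITY_TIERS):
    """Map an ensemble statistic onto a discrete severity tier."""
    for threshold, label in tiers:
        if value >= threshold:
            return label
    return tiers[-1][1]

stats = aggregate_ensemble([0.42, 0.55, 0.61])
tier = classify_severity(stats["mean"])
```

The point is the shape, not the math: stage two only ever sees the intermediate results each model wrote to storage, never the raw climate input.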
DAGs: what they are and why they matter here
For orchestration we use Apache Airflow. The central concept in Airflow is the DAG (Directed Acyclic Graph), which is just a fancy way of saying: a set of tasks with defined dependencies between them, no cycles. Each task knows what it needs to wait for before it can start. The graph is the pipeline.
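The concept is independent of Airflow and can be shown with nothing but the Python standard library. Task names below are illustrative, not our actual pipeline's:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it must wait for (its dependencies).
pipeline = {
    "compute_model_a": set(),
    "compute_model_b": set(),
    "aggregate": {"compute_model_a", "compute_model_b"},
    "classify": {"aggregate"},
    "ingest": {"classify"},
}

# A valid execution order exists precisely because the graph has no cycles;
# TopologicalSorter raises CycleError if one is introduced.
order = list(TopologicalSorter(pipeline).static_order())
```

"No cycles" is what makes the pipeline runnable at all: it guarantees a valid execution order exists.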
We have one DAG per hazard. Each one follows the same structure: a fan-out stage that spawns a task per climate model, running all of them in parallel; a synchronization point that waits for every model to complete; then ensemble aggregation, classification, and finally ingestion into the database.
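The fan-out/synchronize/aggregate shape can be simulated with a thread pool; in the real pipeline these are Airflow tasks running as pods on Kubernetes, and the functions below are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

MODELS = ["model_a", "model_b", "model_c"]  # stand-ins for ensemble members

def process_model(model):
    """Stage 1: per-model job (stand-in for feature computation + upload)."""
    return {"model": model, "value": len(model)}  # placeholder result

def aggregate(results):
    """Stage 2: runs only once every model has finished."""
    return sum(r["value"] for r in results) / len(results)

# Fan out: one job per model, all running in parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(process_model, MODELS))

# Implicit synchronization point: map() returns only when all jobs are done.
ensemble_mean = aggregate(results)
```

In the DAG this synchronization point is an explicit task with every per-model task as an upstream dependency, which is what lets a single slow or failed model hold back (or fail) the whole ensemble cleanly.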
The infrastructure problem all of this created
We already ran a full production platform on the same Kubernetes cluster providing services for authentication, risk management, content, and modeling, each with background workers. Adding climate processing without thinking it through would have been a mistake. Processing runs are bursty: heavy compute for the duration of a run, then silence. You don't want that on the same nodes serving API requests.
We added two dedicated node pools: one for Airflow's core components and one for the task pods that do the climate computation. The task pool autoscales to zero between runs, so we pay nothing for idle capacity. Kubernetes taints and tolerations enforce the separation. Production workloads don't share resources with climate processing, so a heavy pipeline run can't affect API latency, and a failing task pod is contained.
Identity and secrets
The pipeline has two distinct access patterns: a secrets management operator that reads from the key vault, and task pods that read and write to blob storage. We gave them separate managed identities, each scoped to exactly one resource. We also had an older identity that had accumulated permissions over time. This project was the forcing function to delete it.
As a side note, we also replaced how secrets flow into the cluster. The previous approach tied secret syncing to pod lifecycle, which meant we had a workaround pod whose only job was to stay running so secrets stayed alive. Replacing it with an operator that reconciles continuously against the vault removed the workaround and simplified rotation: update a certificate in the vault, the operator picks it up, the ingress controller sees the new secret. No manual steps, no per-namespace coordination.
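The difference between the old and new approach is the reconciliation model. A reconcile loop in miniature, with plain dicts standing in for the key vault and the cluster's secret store (a real operator watches both and loops continuously):

```python
def reconcile(vault, cluster):
    """One reconciliation pass: make cluster secrets match the vault.

    Returns the names of secrets that were created, updated, or pruned.
    """
    changed = []
    for name, value in vault.items():
        if cluster.get(name) != value:
            cluster[name] = value       # create or update the synced secret
            changed.append(name)
    for name in set(cluster) - set(vault):
        del cluster[name]               # prune secrets removed from the vault
        changed.append(name)
    return changed

vault = {"tls-cert": "v2", "db-password": "s3cret"}
cluster = {"tls-cert": "v1"}
reconcile(vault, cluster)   # picks up the rotated cert and the new secret
```

Because the loop converges on the vault's state rather than reacting to pod lifecycle events, rotation reduces to updating the vault; nothing has to stay running just to keep secrets alive.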
Where things stand
The infrastructure is live. Airflow is running, the node pools are in place, secrets management is sorted. What we have now is a proper foundation for adding and maintaining climate hazard models going forward. Each hazard follows the same structure: a standardized interface for the scientific logic, a DAG that handles orchestration, and a CI/CD pipeline that validates, builds, and promotes changes from development to production through version control.



