Introduction

Backing up a 50GB database is routine.

Backing up a 2TB PostGIS cluster is infrastructure engineering.

PostGIS changes backup planning because:

Rows are larger
WAL volume is higher
TOAST tables grow aggressively
Restore time rarely scales linearly with data size

This post focuses purely on operational reality.

Why Logical Dumps Stop Making Sense

Check database size:

SELECT pg_size_pretty(pg_database_size(current_database()));

If your PostGIS database is:

300GB → pg_dump becomes slow

500GB → restore time becomes operationally risky

1TB → logical restore is rarely acceptable for RTO

Why?

Because logical dump:

Recreates schema
Reloads every row through COPY (or INSERT, if dumped with --inserts)
Rebuilds GiST indexes from scratch
Generates additional WAL during restore

Spatial indexes are expensive to rebuild.

In real systems, a 1.5TB logical restore can take 10–20 hours.

That is not DR. That is downtime.
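
The 10–20 hour figure can be turned into an effective throughput, which is a useful number to compare against your own restore tests. A minimal sketch in Python, assuming decimal units (1 TB = 1,000,000 MB); the function name is hypothetical and not part of any tool:

```python
def effective_throughput_mb_s(size_tb: float, hours: float) -> float:
    """Average throughput in MB/s for restoring size_tb terabytes in `hours` hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    megabytes = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return megabytes / (hours * 3600)


if __name__ == "__main__":
    for hours in (10, 20):
        rate = effective_throughput_mb_s(1.5, hours)
        print(f"1.5 TB in {hours} h = {rate:.1f} MB/s")
```

For 1.5TB this works out to roughly 21–42 MB/s end to end, far below what the underlying disks can usually sustain, because index rebuilds dominate the wall-clock time.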

Physical Backup Is the Baseline

For large PostGIS clusters, physical backup is mandatory.

Base Backup

pg_basebackup -D /backup/full -Fp -Xs -P

Important flags:

-Fp → plain format (filesystem copy)
-Xs → stream WAL
-P → progress

At multi-terabyte scale:

Network throughput becomes the limiting factor
Disk IOPS on the source matter
Backup duration must be measured under load

Do not measure base backup on idle systems.
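
Before the first real run, a back-of-the-envelope estimate helps set expectations. The sketch below assumes the backup is bounded by the slower of network and source disk throughput, both measured under production load; the function name and the example figures are illustrative only:

```python
def base_backup_hours(size_gb: float, network_mb_s: float, disk_read_mb_s: float) -> float:
    """Rough base backup duration, limited by the slowest link in the path."""
    bottleneck = min(network_mb_s, disk_read_mb_s)
    if bottleneck <= 0:
        raise ValueError("throughput values must be positive")
    seconds = size_gb * 1000 / bottleneck  # decimal units: 1 GB = 1000 MB
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 2TB cluster: 100 MB/s usable network, 400 MB/s disk reads.
    print(f"{base_backup_hours(2000, 100, 400):.1f} h")
```

Treat the result as a lower bound. A real backup competes with production I/O, so the measured duration under load is the number that counts.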

WAL Archiving Is Not Optional

Enable WAL archiving:

archive_mode = on
# test ! -f: never overwrite a segment that is already archived
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'

For PostGIS systems, monitor archive volume:

du -sh /archive/

Spatial workloads generate more WAL because:

Every geometry update writes a complete new row version
TOAST data is logged
GiST index updates are logged

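If the archive disk fills, archive_command starts failing, unarchived WAL piles up in pg_wal, and once that volume is full the primary stops.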

Always monitor:

Archive directory growth
Replication slot retention
WAL generation spikes during spatial batch jobs
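
A size snapshot is less useful for alerting than the time left before the disk fills. The sketch below is one way to compute that headroom, assuming you already track a WAL generation rate in bytes per hour; both function names are hypothetical:

```python
import shutil


def hours_until_full(free_bytes: int, wal_bytes_per_hour: float) -> float:
    """Hours of headroom left at the given WAL generation rate."""
    if wal_bytes_per_hour <= 0:
        return float("inf")  # no growth means the disk never fills
    return free_bytes / wal_bytes_per_hour


def archive_headroom_hours(path: str, wal_bytes_per_hour: float) -> float:
    """Headroom for the filesystem that holds `path` (e.g. the archive directory)."""
    return hours_until_full(shutil.disk_usage(path).free, wal_bytes_per_hour)


if __name__ == "__main__":
    # 500 GB free, 25 GB of WAL per hour during a spatial batch job.
    print(hours_until_full(500 * 10**9, 25 * 10**9))
```

Feed it the peak rate observed during spatial batch jobs rather than the daily average, since the spikes are what fill the disk.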

Replication Slots: Hidden Disk Risk

If you use replication slots, physical or logical, check how much WAL each one retains:

SELECT slot_name,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;

If a subscriber is slow or disconnected:

WAL accumulates
Disk fills
Primary shuts down

In PostGIS clusters, this can happen faster because large rows generate WAL more quickly.

Alert on retained WAL size, not just replication delay in seconds.
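
A size-based alert can be sketched as a small filter over the slot query's results. The example assumes a variant of the query above that returns the raw pg_wal_lsn_diff value in bytes (without pg_size_pretty) as (slot_name, retained_bytes) rows; the function name and the threshold are hypothetical:

```python
from typing import Iterable, List, Optional, Tuple

GIB = 1024**3

# retained_bytes is None when a slot has no restart_lsn yet.
SlotRow = Tuple[str, Optional[int]]


def slots_over_limit(rows: Iterable[SlotRow], limit_bytes: int) -> List[Tuple[str, int]]:
    """Return slots retaining more WAL than limit_bytes, largest first."""
    offenders = [
        (name, retained)
        for name, retained in rows
        if retained is not None and retained > limit_bytes
    ]
    return sorted(offenders, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    sample = [("standby_a", 5 * GIB), ("logical_b", 120 * GIB), ("unused_c", None)]
    for name, retained in slots_over_limit(sample, 50 * GIB):
        print(f"ALERT {name}: {retained / GIB:.0f} GiB of WAL retained")
```

Size the limit against free space on the WAL volume, not against a lag figure that looked acceptable on a quiet day.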

Restore Time Is the Real KPI

Everyone measures backup success.

Few measure restore time.

Restore consists of:

Copy base backup to new node
Start recovery
Replay WAL
Verify extensions load
Warm the cache
Reconnect applications

Measure actual restore time:

time rsync -av /backup/full/ /var/lib/pgsql/data/

Then measure WAL replay duration from logs.
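
One way to pull the replay duration out of the server log is to subtract the timestamps of the "redo starts at" and "redo done at" messages. This sketch assumes each log line begins with a "YYYY-MM-DD HH:MM:SS" timestamp (a log_line_prefix that starts with %m or %t); adjust the parsing if your prefix differs. The function names are hypothetical:

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def _timestamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' of a log line (assumed prefix)."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def wal_replay_duration(log_lines: Iterable[str]) -> Optional[timedelta]:
    """Time between the 'redo starts at' and 'redo done at' log messages."""
    start = end = None
    for line in log_lines:
        if "redo starts at" in line:
            start = _timestamp(line)
        elif "redo done at" in line:
            end = _timestamp(line)
    if start is None or end is None:
        return None  # recovery not started or not finished in this log
    return end - start
```

Pass it the lines of the restored node's log, for example from an open file handle, and record the result next to the copy time for each restore test.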

For 2TB clusters, a realistic restore typically takes:

2–6 hours at minimum
Longer with a heavy WAL backlog

If your RTO expectation is 30 minutes and you have never tested a full restore, that RTO is an assumption, not a measurement.
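
Until a real test exists, the restore phases can at least be added up and compared with the RTO. A minimal sketch, assuming copy and replay run one after the other and that the throughput figures come from your own measurements; the function names and the example numbers are illustrative:

```python
def restore_estimate_hours(
    base_backup_gb: float,
    copy_mb_s: float,
    wal_gb: float,
    replay_mb_s: float,
    fixed_overhead_min: float = 0.0,
) -> float:
    """Copy time + WAL replay time + fixed overhead (checks, cache warm-up, reconnect)."""
    if copy_mb_s <= 0 or replay_mb_s <= 0:
        raise ValueError("throughput values must be positive")
    copy_hours = base_backup_gb * 1000 / copy_mb_s / 3600
    replay_hours = wal_gb * 1000 / replay_mb_s / 3600
    return copy_hours + replay_hours + fixed_overhead_min / 60


def meets_rto(estimate_hours: float, rto_minutes: float) -> bool:
    """True if the estimated restore fits inside the RTO."""
    return estimate_hours * 60 <= rto_minutes


if __name__ == "__main__":
    # Hypothetical: 2TB copied at 250 MB/s, 300 GB of WAL replayed at 50 MB/s.
    hours = restore_estimate_hours(2000, 250, 300, 50, fixed_overhead_min=20)
    print(f"estimated restore: {hours:.1f} h, meets 30 min RTO: {meets_rto(hours, 30)}")
```

The estimate only shows whether the RTO is plausible. It does not replace a timed restore on real hardware.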

PostGIS-Specific Restore Considerations

PostGIS is not just tables.

After restore, verify:

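SELECT extname, extversion
FROM pg_extension
WHERE extname = 'postgis';

-- Also reports the GEOS and PROJ versions PostGIS is linked against
SELECT postgis_full_version();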

Confirm:

Extension loads successfully
GEOS / PROJ libraries match
No missing shared library errors

An OS-level library mismatch can break spatial functions even when the restore itself succeeds.
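
Library drift between the source and the restored node can be checked mechanically by comparing the output of PostGIS's postgis_full_version() function on both sides. The sketch assumes that output is a series of KEY="value" pairs (for example GEOS="..." and PROJ="..."), as recent PostGIS releases print; the exact format can vary by version, and the function names here are hypothetical:

```python
import re
from typing import Dict, Optional, Sequence, Tuple


def parse_postgis_versions(full_version: str) -> Dict[str, str]:
    """Extract KEY="value" pairs from a postgis_full_version() string."""
    return dict(re.findall(r'(\w+)="([^"]*)"', full_version))


def library_mismatches(
    source: str,
    restored: str,
    keys: Sequence[str] = ("POSTGIS", "GEOS", "PROJ"),
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (source_value, restored_value)} for every key that differs."""
    before = parse_postgis_versions(source)
    after = parse_postgis_versions(restored)
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }
```

An empty result means the checked components match. Anything else is worth resolving before the node takes traffic.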

Backup Strategy Pattern That Works

For large PostGIS clusters:

Weekly full base backup
Continuous WAL archiving
Replication standby for fast failover
Quarterly full restore testing
WAL retention sized for worst-case replay window

Never rely on a single method.
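
Sizing WAL retention for the worst-case replay window is simple arithmetic. In the sketch below the window is the longest gap recovery may have to replay, which is 168 hours with weekly base backups; the safety factor, the example rate, and the function name are assumptions for illustration:

```python
def wal_retention_gb(
    peak_wal_gb_per_hour: float,
    replay_window_hours: float,
    safety_factor: float = 1.5,
) -> float:
    """Archive capacity needed to hold WAL for the whole replay window."""
    if peak_wal_gb_per_hour < 0 or replay_window_hours < 0 or safety_factor < 1:
        raise ValueError("rates and window must be >= 0, safety_factor >= 1")
    return peak_wal_gb_per_hour * replay_window_hours * safety_factor


if __name__ == "__main__":
    # Hypothetical: 10 GB/h of WAL, weekly full backups (7 * 24 = 168 h).
    print(f"{wal_retention_gb(10, 168):.0f} GB")
```

Using the peak hourly rate for the whole window overestimates on purpose. A smaller number is only safe if your batch jobs are known to be short.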

What Actually Fails in Real Incidents

In real production incidents, failures are rarely:

Corrupted geometry
Broken spatial queries

They are:

Disk full due to WAL retention
Backup incomplete due to I/O bottleneck
Restore too slow for SLA
Archive directory filling silently

Backup strategy is capacity engineering, not checkbox compliance.

Final Operational Principle

If you have never:

Restored your full multi-TB PostGIS cluster
Measured end-to-end restore time
Validated extension loading post-restore
Simulated WAL-heavy recovery

Then you do not know your disaster recovery capability.

You only have backups.

In large PostGIS systems, restore predictability matters more than backup frequency.