This lightweight Change Data Capture (CDC) pipeline streams PostgreSQL logical replication events to BigQuery — no Debezium, no Kafka, just native GCP tools.
🧩 Architecture Overview
[Cloud SQL (PostgreSQL)]
  └─ Logical replication via wal2json plugin
        ↓
[Cloud Scheduler] (runs every 12 minutes)
  └─ Triggers →
[Cloud Run Job (custom CDC listener)]
  ├─ Pulls logical changes from the replication slot
  └─ Publishes change events (JSON) →
[Cloud Pub/Sub Topic]
        ↓
[Dataflow (Apache Beam Flex Template)]
        ↓
[BigQuery]
📌 GitHub: github.com/HenryXiloj/demos-gcp/tree/main/pg-cdc-to-bq-streaming
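For reference, each event published to Pub/Sub carries a wal2json change document. With the plugin's default output (format version 1), an INSERT looks roughly like this; the table and columns here are illustrative, not from the repo:

```json
{
  "change": [
    {
      "kind": "insert",
      "schema": "public",
      "table": "orders",
      "columnnames": ["id", "customer_id", "amount"],
      "columntypes": ["integer", "integer", "numeric(10,2)"],
      "columnvalues": [42, 7, 19.99]
    }
  ]
}
```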
💡 Key Highlights
✅ CDC listener written in Python using psycopg2 with logical replication
✅ Deployed as a Cloud Run Job for short-lived, stateless execution
✅ Orchestrated by Cloud Scheduler (runs every 12 minutes)
✅ Publishes change events as JSON to Pub/Sub
✅ Streaming ingestion from Pub/Sub into BigQuery using Apache Beam (Dataflow Flex Template); a pipeline sketch follows the Performance section
✅ Includes graceful shutdown with SIGTERM handling (see the listener sketch below)
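To make the Cloud Run Job concrete, here is a minimal sketch of such a listener. It is not the repo's exact code: the slot name cdc_slot, topic pg-changes, the PG_DSN and GCP_PROJECT environment variables, and the 10-minute run budget are all illustrative assumptions.

```python
# Minimal CDC listener sketch (illustrative names, not the repo's code).
import os
import select
import signal
import time

import psycopg2
import psycopg2.extras
from google.cloud import pubsub_v1

RUN_BUDGET_SECONDS = 600  # finish well before the next 12-minute trigger
shutting_down = False

def handle_sigterm(signum, frame):
    """Flip a flag so the loop exits after the in-flight message."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(os.environ["GCP_PROJECT"], "pg-changes")

# Logical replication requires psycopg2's special connection class.
conn = psycopg2.connect(
    os.environ["PG_DSN"],
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="cdc_slot", decode=True)

deadline = time.monotonic() + RUN_BUDGET_SECONDS
while not shutting_down and time.monotonic() < deadline:
    msg = cur.read_message()  # non-blocking; None if nothing is pending
    if msg is None:
        select.select([cur], [], [], 1.0)  # wait up to 1s for new WAL data
        continue
    # msg.payload is the wal2json change document as decoded JSON text.
    publisher.publish(topic_path, data=msg.payload.encode("utf-8")).result()
    # Acknowledge the LSN so PostgreSQL can recycle the WAL behind it.
    cur.send_feedback(flush_lsn=msg.data_start)

cur.close()
conn.close()
```

Because the job exits on its own before the next scheduled run, there is no long-lived process to babysit, and the replication slot preserves any changes that arrive between runs.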
🔄 Why Use Cloud Run Jobs?
- Precise control over frequency & duration
- No continuously running services
- Fully stateless and pay-as-you-go
- No external brokers or connectors required
⚖️ Compared to Traditional CDC (e.g., Debezium/Kafka)
🔹 No always-on infrastructure
🔹 No Zookeeper or Kafka to maintain
🔹 Native GCP integration for IAM, networking, and monitoring
🔹 Lower cost, easier to secure and deploy
⚡ Performance & Scalability
- Handles thousands of changes per minute
- Typical end-to-end latency: 10–15 minutes, driven by the 12-minute scheduler interval
- Scales with workload by adjusting Cloud Scheduler frequency
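To make the Dataflow leg concrete, a minimal streaming Beam pipeline of this shape could look like the sketch below. It is not the repo's Flex Template: the project, subscription, table, and schema are placeholders, and the parser assumes wal2json format-version-1 documents like the sample shown earlier.

```python
# Minimal sketch of the Pub/Sub -> BigQuery leg (placeholder resource names).
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def to_rows(message_bytes):
    """Flatten one wal2json document into one BigQuery row per change."""
    doc = json.loads(message_bytes.decode("utf-8"))
    for change in doc.get("change", []):
        yield {
            "kind": change["kind"],          # insert / update / delete
            "table_name": change["table"],
            "payload": json.dumps(change),   # keep the raw change for replay
        }

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/pg-changes-sub")
        | "ParseChanges" >> beam.FlatMap(to_rows)
        | "WriteBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:cdc_dataset.pg_changes",
            schema="kind:STRING,table_name:STRING,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

With streaming inserts, rows land in BigQuery within seconds of reaching Pub/Sub, so the end-to-end latency is dominated by the scheduler interval, as noted above.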
This design offers full control, better security posture, and serverless simplicity, making it ideal for event-driven analytics pipelines.
👉 Full implementation:
🔗 github.com/HenryXiloj/demos-gcp/tree/main/pg-cdc-to-bq-streaming