Sunday, May 18, 2025

🚀 Streaming PostgreSQL Changes to BigQuery using Cloud Run Jobs + Cloud Scheduler 🔄

This lightweight Change Data Capture (CDC) pipeline streams PostgreSQL logical replication events to BigQuery — no Debezium, no Kafka, just native GCP tools.



🧩 Architecture Overview

[Cloud SQL (PostgreSQL)]
   └─ Logical replication via wal2json plugin
        ↓
[Cloud Scheduler] (runs every 12 minutes)
   └─ Triggers →
[Cloud Run Job (custom CDC listener)]
   └─ Pulls logical changes from slot
   └─ Publishes change events (JSON) →
[Cloud Pub/Sub Topic]
   ↓
[Dataflow (Apache Beam Flex Template)]
   ↓
[BigQuery] 


📌 GitHub: github.com/HenryXiloj/demos-gcp/tree/main/pg-cdc-to-bq-streaming

💡 Key Highlights

✅ CDC listener written in Python using psycopg2 with logical replication

✅ Deployed as a Cloud Run Job for short-lived, stateless execution

✅ Orchestrated by Cloud Scheduler (runs every 12 minutes)

✅ Publishes change events as JSON to Pub/Sub

✅ Real-time ingestion into BigQuery using Apache Beam (Dataflow Flex Template)

✅ Includes graceful shutdown with SIGTERM handling


🔄 Why Use Cloud Run Jobs?

  • Precise control over frequency & duration
  • No continuously running services
  • Fully stateless and pay-as-you-go
  • No external brokers or connectors required

⚖️ Compared to Traditional CDC (e.g., Debezium/Kafka)

🔹 No always-on infrastructure

🔹 No Zookeeper or Kafka to maintain

🔹 Native GCP integration for IAM, networking, and monitoring

🔹 Lower cost, easier to secure and deploy

⚡ Performance & Scalability

  • Handles thousands of changes per minute
  • Typical latency: 10–15 minutes
  • Scales with workload by adjusting Cloud Scheduler frequency

This design offers full control, better security posture, and serverless simplicity, making it ideal for event-driven analytics pipelines.

👉 Full implementation:

🔗 github.com/HenryXiloj/demos-gcp/tree/main/pg-cdc-to-bq-streaming


















Friday, May 2, 2025

Dataflow + Terraform: Secure Cloud SQL to BigQuery via PSA/PSC

 

Introduction

Two fully automated, production-grade pipelines demonstrate secure data ingestion from Cloud SQL PostgreSQL into BigQuery, using Apache Beam (Python) and Dataflow Flex Templates — all without public IP exposure.





✅ 1. Private Service Access (PSA) – Internal IP Connectivity

A Cloud SQL instance is provisioned with a private IP via PSA, and Dataflow connects over the internal VPC network.

📌 Key Highlights:

  • Cloud SQL + read replica
  • Private IP via PSA
  • VPC subnet + firewall on port 5432
  • Beam pipeline reads from PostgreSQL → BigQuery
  • Fully managed with Terraform
  • Optional GitHub Actions for CI/CD

🔌 JDBC Format: jdbc:postgresql://<private-ip>:5432/my-database2

🔗 Project Repo: 👉 github.com/.../cloud-sql-psa-with-bigquery


🔒 2. Private Service Connect (PSC) – DNS + SSL Certificate Access

PSC provides hostname-based, IP-less access to Cloud SQL, secured via SSL certificates stored in GCS.

📌 Key Highlights:

  • PSC forwarding rule + private DNS
  • SSL certs: server-ca.pem, client-cert.pem, client-key.pem
  • Enforced permissions (chmod 0600 on private key)
  • Beam pipeline loads certs at runtime from GCS
  • Flex Template accepts PSC host, cert path, and DB credentials
  • psycopg2 uses sslmode=verify-ca
  • Logs: success + errors written to BigQuery

🔗 Project Repo: 👉 github.com/.../cloud-sql-psc-with-bigquery


🛠️ Shared Architecture & Automation (Both Pipelines)

  • terraform/: VPC, Cloud SQL, BigQuery, IAM, DNS
  • sql/: PostgreSQL table schema + init script
  • postgresql_to_bq_flex_template/: Beam logic, Dockerfile, metadata
  • GCS: stores SSL certs + Dataflow staging artifacts
  • Flex Templates fully parameterized via metadata.json
  • Error handling + retry logic + scalable Dataflow workers


📊 Use Case Comparison


Both implementations align with cloud security and automation best practices:

✅ No public IPs

✅ IAM least-privilege access

✅ SSL/TLS enforcement (PSC)

✅ Infrastructure as Code with Terraform

✅ CI/CD extensibility


These blueprints are ideal for teams building secure, private, and scalable data pipelines on Google Cloud Platform using Dataflow, Cloud SQL, and BigQuery.
























🚀 Streaming PostgreSQL Changes to BigQuery using Cloud Run Jobs + Cloud Scheduler 🔄

This lightweight Change Data Capture (CDC) pipeline streams PostgreSQL logical replication events to BigQuery — no Debezium, no Kafka, just ...