SQR-107

TAP_SCHEMA Migration to CloudSQL - Architecture & Implementation Breakdown#

Abstract

Currently the TAP_SCHEMA database is packaged into a Docker image and pre-populated with schema metadata, built from the sdm_schemas YAML schema definitions using Felis. This image is deployed as a separate deployment (tap-schema-db) alongside the main TAP service. This approach requires rebuilding and pushing Docker images for every schema change, couples schema updates to application deployments, and results in ephemeral schema storage that is lost on pod restarts.

Proposed Solution: Migrate TAP_SCHEMA to a persistent CloudSQL instance, where Felis loads the schema metadata directly via a Kubernetes Job triggered by Helm hooks during our regular updates via ArgoCD.

This architecture simplifies schema management by eliminating the containerized TAP_SCHEMA database pod and removing the Docker image build cycle for schema updates. Instead of rebuilding containers for every schema change, a lightweight Helm hook job loads YAML schema definitions directly into CloudSQL during ArgoCD deployments. The TAP service connects to CloudSQL for schema metadata, gaining better backup, recovery, and monitoring capabilities. Although schema updates still flow through Phalanx (by updating schemaVersion), we reduce the maintenance overhead and decouple the tap application from the TAP_SCHEMA database.

Scope: This architecture applies to both the QServ and the Postgres-backed TAP services (the tap & ssotap applications).


2. Current State#

2.1 Existing Architecture#

TAP Service (lsst-tap-service & tap-postgres)#

TAP_SCHEMA Container Image#

The TAP_SCHEMA container image is built from the sdm_schemas repository and includes a MySQL database pre-populated with TAP_SCHEMA metadata. Felis generates SQL files from YAML schema definitions, which are used to prepopulate the image with TAP_SCHEMA metadata. This image is deployed as a separate Kubernetes deployment, and the TAP service connects to it via a Kubernetes service endpoint using the JDBC URL jdbc:mysql://{{ tapSchemaAddress }}/.

Schema Generation Process#

2.2 Problems with Current Approach#

The current containerized approach creates unnecessary friction in schema management. Every schema update requires rebuilding a Docker image, pushing it to a registry, and redeploying pods, which is a time-consuming process for what are essentially metadata changes.

Container storage is also ephemeral, meaning schema data is lost on pod restarts, and it lacks the backup and recovery, logging, monitoring, and robustness capabilities of CloudSQL databases.


3. Proposed Architecture#

3.1 High-Level Design#

Core Changes:

TAP_SCHEMA in CloudSQL: Move from containerized DB to persistent CloudSQL database

Helm Hook Automation: Trigger schema updates automatically during ArgoCD deployments

3.2 TAP_SCHEMA Structure#

TAP_SCHEMA consists of several metadata tables:

-- Core TAP_SCHEMA tables
tap_schema.schemas     -- List of schemas (e.g., dp02_dc2, apdb)
tap_schema.tables      -- Tables in each schema
tap_schema.columns     -- Columns in each table
tap_schema.keys        -- Foreign key relationships
tap_schema.key_columns -- Columns involved in foreign keys
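
For illustration, the kind of metadata lookup a TAP service issues against these tables looks roughly like this (a sketch; the dp02_dc2.Object table name is an assumption, the column names are the TAP-standard ones):

# Illustrative metadata lookup, similar to what a TAP service issues internally
psql -c "SELECT column_name, datatype, description
         FROM tap_schema.columns
         WHERE table_name = 'dp02_dc2.Object';"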

How Felis Populates These:

Felis populates these tables by reading a YAML schema definition (e.g., yml/dp02_dc2.yaml) and converting it to TAP_SCHEMA INSERT/UPDATE statements. In the current setup these are written out as .sql scripts, which are then mounted and executed during startup of the database pod.
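
For reference, a minimal Felis schema definition has roughly the following shape (a purely hypothetical example; real sdm_schemas files carry many more fields):

cat <<'EOF' > example_schema.yaml
# Hypothetical, minimal Felis schema definition; real sdm_schemas files
# carry many more fields (units, UCDs, TAP principal flags, etc.)
name: example_schema
"@id": "#example_schema"
description: An example schema
tables:
  - name: Object
    "@id": "#example_schema.Object"
    columns:
      - name: objectId
        "@id": "#example_schema.Object.objectId"
        datatype: long
        description: Unique object identifier
EOF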

3.3 Proposed Architecture Diagram#

[Developer] → [sdm_schemas repo] → [GitHub Release v1.2.4]
                                           ↓
[Phalanx values.yaml] ← [Manual PR] ← [schemaVersion: v1.2.4]
         ↓
    [ArgoCD Sync]
         ↓
    [Helm pre-upgrade hook]
         ↓
    [Job: felis-updater] → [CloudSQL Proxy] → [CloudSQL: tap_schema DB]
         ↓
    [TAP Service Deployment] → [CloudSQL Proxy] → [CloudSQL: tap_schema + uws DBs]

Fig. 1 Proposed Architecture Diagram#


4. Detailed Design#

4.1 Schema Update Strategies#

One complexity that the new architecture introduces is how to handle schema updates. Previously we simply rebuilt the entire TAP_SCHEMA image from scratch for every change, and then during upgrades via GitOps the new image would replace the old one in a rolling update, transparent to the user.

With CloudSQL this is potentially a bit more complex because we have to consider how to handle changes to existing schemas, additions of new schemas, and removals of old schemas, but also how to do so while minimizing downtime for the user.

Option A: Full Replacement#

  • Drop existing TAP_SCHEMA tables (Or drop TAP_SCHEMA entirely)

  • Recreate from scratch

  • Simplest option

  • Brief TAP service interruption during update

We would have to test whether a full drop of the parent TAP_SCHEMA schema or individual deletes of each schema's metadata is faster. Dropping the entire schema is probably simpler, but would require re-initializing the TAP_SCHEMA tables.
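
A sketch of the two variants (standard TAP_SCHEMA table and column names assumed; ENGINE_URL is a placeholder for the CloudSQL connection string):

# Variant 1: drop the parent schema and re-initialize
psql -c "DROP SCHEMA IF EXISTS tap_schema CASCADE;"
felis init-tap-schema --engine-url="${ENGINE_URL}"

# Variant 2: delete a single schema's metadata, children first to respect
# foreign keys
psql <<'EOF'
BEGIN;
DELETE FROM tap_schema.key_columns
    WHERE key_id IN (SELECT key_id FROM tap_schema.keys
                     WHERE from_table LIKE 'dp02_dc2.%');
DELETE FROM tap_schema.keys    WHERE from_table LIKE 'dp02_dc2.%';
DELETE FROM tap_schema.columns WHERE table_name LIKE 'dp02_dc2.%';
DELETE FROM tap_schema.tables  WHERE schema_name = 'dp02_dc2';
DELETE FROM tap_schema.schemas WHERE schema_name = 'dp02_dc2';
COMMIT;
EOF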

Option B: Incremental Updates (Potential Future Enhancement)#

  • Update only changed schemas

  • Keep existing schemas intact

  • No service interruption

  • More complex logic needed

Incremental updates would be more complex because they would potentially require changes to Felis.

With Felis, if each schema were loaded individually using felis load-tap-schema, a rough outline of what this might require is:

  • Add --update-mode=incremental flag to felis load-tap-schema

  • When loading a schema, check if it exists:

    • If exists, compare tables/columns

    • ALTER existing tables to match new definition

    • Add new tables/columns

    • Drop removed tables/columns (optional)

If on the other hand we chose to generate the SQL files and run the updates in a single transaction, incremental updates would require a custom script that would:

  • Generate SQL for new schema: felis load-tap-schema --dry-run new_version.yaml > new.sql

  • Query current TAP_SCHEMA state to see what exists

  • Write custom diff logic to:

    • Compare new schema definition against current database state

    • Generate UPDATE statements for changed columns/tables

    • Generate INSERT statements only for new tables/columns

    • Generate DELETE statements for removed elements (optional)

  • Execute the diff-generated SQL instead of full DELETE + INSERT

This approach seems more complex because Felis is geared towards full INSERTs, so it would require much more custom logic and carry a greater risk of errors.
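
To make that concrete, the custom diff logic would have to emit statements along these lines (entirely hypothetical output for one changed and one added column):

psql <<'EOF'
-- Hypothetical diff output: one column changed its datatype, one is new
UPDATE tap_schema.columns
    SET datatype = 'double'
    WHERE table_name = 'dp02_dc2.Object' AND column_name = 'psFlux';
INSERT INTO tap_schema.columns (table_name, column_name, datatype, description)
    VALUES ('dp02_dc2.Object', 'newColumn', 'double', 'A newly added column');
EOF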

Option C: Blue-Green Pattern (MVP)#

Another option is following the blue-green pattern where we maintain two sets of schema tables:

  • tap_schema (active)

  • tap_schema_staging (inactive - updated during deployments)

Update Flow:

  1. Load new version into staging:

    • Clear tap_schema_staging (DROP CASCADE + recreate, or DELETE all data)

    • Populate tap_schema_staging with new schema metadata

    • Validate staging

  2. Atomic swap (just rename, no DROP):

   BEGIN;
   ALTER SCHEMA tap_schema RENAME TO tap_schema_temp;
   ALTER SCHEMA tap_schema_staging RENAME TO tap_schema;
   ALTER SCHEMA tap_schema_temp RENAME TO tap_schema_staging;
   COMMIT;
  3. Result:

    • tap_schema now has new version (active)

    • tap_schema_staging now has old version (available for instant rollback)

Rollback (instant):

BEGIN;
ALTER SCHEMA tap_schema RENAME TO tap_schema_temp;
ALTER SCHEMA tap_schema_staging RENAME TO tap_schema;
ALTER SCHEMA tap_schema_temp RENAME TO tap_schema_staging;
COMMIT;

This provides some advantages:

  • Zero-downtime updates (only brief rename)

  • Instant rollback capability (just swap names again)

Postgres should in theory clear existing connections to the old schema during the rename, so queries should be automatically routed to the new schema without requiring a restart of the TAP service. However this would need to be tested to confirm.
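
A simple way to test this would be to hammer the active schema while a swap runs and watch for errors, e.g.:

# Terminal 1: continuously query the active schema while the swap runs
while true; do
    psql -Atc "SELECT COUNT(*) FROM tap_schema.schemas;" || echo "QUERY FAILED at $(date)"
    sleep 0.1
done

# Terminal 2: run the three-way rename swap shown above and watch
# terminal 1 for failures or stale results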

Comparison of single transaction upgrade with DELETE vs blue-green#

In terms of service disruption, both approaches would likely result in no downtime, though the single transaction would block queries for a brief period while the update is in progress. Complexity-wise, the single transaction is simpler to implement. A full replacement would also use less storage, since blue-green has to keep two copies at least temporarily, although this is probably a minimal cost given that the TAP_SCHEMA tables shouldn't take up much disk space.

Conclusion: For MVP, the plan is to implement Option C (blue-green). This gives us a good balance of minimal downtime and simplicity.

4.2 Schema Distribution Methods#

Another aspect of the design to consider is how the Helm hook job gets the schema files to load. A few options exist:

Option A: Download from GitHub Releases

  • Downloads the specified release from GitHub (sdm_schemas releases)

  • Extracts the schemas.tar.gz file containing all YAML schema definitions

  • Validates that the release contains the expected files

Pros: Simple to implement initially, no additional build steps

Cons: Runtime dependency on GitHub API

Option B: Bake into Container Image (Probably Preferred for Long-Term)

Build image: ghcr.io/lsst/tap-schemas:v1.2.4

Contains:

  • All YAML files

  • Felis tool

  • Update scripts

Pros: No runtime GitHub dependency.

Cons: Requires additional CI/CD build step

However, we could also modify the existing sdm_schemas GitHub Actions workflow to build and push this image whenever a new release is created, in place of the current MySQL database image.

Option C: Mount from ConfigMap

Commit the Felis YAML (or SQL) to the Phalanx Git repo, render them into a ConfigMap, and Felis reads locally against that.

Pros: Fully GitOps native

Cons: ConfigMap size limits & probably clutters Phalanx repo

Option D: Store release artifacts in a GCS bucket and download from there.

This would involve modifying the sdm_schemas CI/CD to upload the schemas.tar.gz to a GCS bucket whenever a new release is created. The Helm hook job would then download from this GCS bucket. The URLs to the artifacts would be constructed from the release version, be publicly available, and be discoverable via Repertoire.

Pros: More control over availability, no GitHub dependency

Cons: Requires GCS bucket management, additional complexity

For our MVP, the current plan is to implement Option D and download from GCS.
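
The fetch step could then be as simple as the following sketch (the bucket name and path layout are assumptions to be settled when the upload is implemented):

# Hypothetical bucket name and path layout, keyed on the release version
SCHEMA_VERSION="v1.2.4"
GCS_URL="https://storage.googleapis.com/sdm-schemas-releases/${SCHEMA_VERSION}/schemas.tar.gz"

curl -fsSL "${GCS_URL}" -o /tmp/schemas.tar.gz
mkdir -p /tmp/schemas
tar -xzf /tmp/schemas.tar.gz -C /tmp/schemas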

4.3 Update Script Logic#

The update script (update-tap-schema.sh) will perform the following operations:

1. Fetch Schemas#

Based on selected distribution method (see 4.2)

2. Initialize TAP_SCHEMA Tables#

  • Creates the standard TAP_SCHEMA tables if they don’t exist:

    • tap_schema.schemas

    • tap_schema.tables

    • tap_schema.columns

    • tap_schema.keys

    • tap_schema.key_columns

  • This would probably be a call to the felis init-tap-schema command before we run the updates.

  • Set up appropriate indexes for query performance

3. Validate Configuration#

  • Parses the comma-separated list of schemas from SCHEMAS_TO_LOAD

  • Verifies that each configured schema exists in the downloaded release

  • Reports available schemas if any configured schema is missing

  • If there is a validation issue we probably want to revert the update process
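
A sketch of this validation step in bash (the extracted release layout under /tmp/schemas is an assumption):

IFS=',' read -ra SCHEMAS <<< "${SCHEMAS_TO_LOAD}"
for schema in "${SCHEMAS[@]}"; do
    if [[ ! -f "/tmp/schemas/yml/${schema}.yaml" ]]; then
        echo "ERROR: configured schema '${schema}' not found in release ${SCHEMA_VERSION}"
        echo "Available schemas:"
        ls /tmp/schemas/yml/
        exit 1  # abort before touching the database
    fi
done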

4. Load Each Schema#

For each schema in the configuration:

  • Validates the YAML file using Felis (felis validate)

  • Generates INSERT SQL using felis load-tap-schema --dry-run --output-file

  • Executes DELETE + INSERT in a single transaction (see Section 4.8 for details)

    • Delete existing schema data from all TAP_SCHEMA tables

    • Execute generated INSERT SQL

Note: All schemas' DELETE + INSERT operations are wrapped in a single outer transaction to ensure atomicity across all schemas.

5. Optional Cleanup#

If CLEANUP_OLD_SCHEMAS is enabled:

  • Identifies schemas in CloudSQL not in the configured list

  • Removes obsolete schemas and their metadata

Whether this is necessary depends on whether our full replacement strategy is to delete the metadata for each schema from the TAP_SCHEMA tables, or to drop TAP_SCHEMA altogether and recreate it from scratch. If we drop and recreate, then this step is not needed.
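
If we go the per-schema route, identifying obsolete schemas could look like this sketch:

# Schemas currently present in CloudSQL but absent from the configuration
for loaded in $(psql -Atc "SELECT schema_name FROM tap_schema.schemas;"); do
    if [[ ",${SCHEMAS_TO_LOAD}," != *",${loaded},"* ]]; then
        echo "Schema '${loaded}' is no longer configured; removing its metadata"
        # per-schema DELETEs, children first, as sketched in Section 4.1 Option A
    fi
done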

6. Report Results (Optional)#

  • Should we include some sort of verification step and report somehow?

  • Perhaps the simplest is to exit with appropriate status code and let the Helm job report success/failure.

Idempotency: The script is designed to be idempotent: running it multiple times with the same schema version is safe and should produce the same result.

Transparency: Since the TAP services query TAP_SCHEMA tables at runtime, updates to TAP_SCHEMA in CloudSQL should be transparent, and the next query from TAP will see the new schema.

4.4 Helm Hook Implementation#

We’ll use Helm hooks to trigger schema updates, following the pattern used by other Phalanx apps like wobbly.

Rough draft of the Job template:

# File: phalanx/charts/cadc-tap/templates/job-schema-update.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: tap-schema-update-{{ .Values.tapSchema.schemaVersion | replace "." "-" }}
  namespace: {{ .Release.Namespace }}
  annotations:
    helm.sh/hook: "pre-install,pre-upgrade"
    helm.sh/hook-delete-policy: "before-hook-creation"
    helm.sh/hook-weight: "-5"
  labels:
    {{- include "cadc-tap.labels" . | nindent 4 }}
    app.kubernetes.io/component: "schema-update"
spec:
  ttlSecondsAfterFinished: 86400
  backoffLimit: 3
  template:
    metadata:
      labels:
        {{- include "cadc-tap.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/component: "schema-update"
    spec:
      serviceAccountName: {{ include "cadc-tap.serviceAccountName" . }}
      restartPolicy: OnFailure
      
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
      
      containers:
      - name: felis-updater
        image: ghcr.io/lsst/felis:{{ .Values.tapSchema.felisVersion | default "1.0.0" }}
        imagePullPolicy: IfNotPresent
        
        command: ["/bin/bash", "/scripts/update-tap-schema.sh"]
        
        env:
        - name: PGHOST
          value: "127.0.0.1"
        - name: PGPORT
          value: "5432"
        - name: PGDATABASE
          value: {{ .Values.tapSchema.cloudSqlDatabase | quote }}
        - name: SCHEMA_VERSION
          value: {{ .Values.tapSchema.schemaVersion | quote }}
        - name: SCHEMAS_TO_LOAD
          value: {{ .Values.tapSchema.schemas | join "," | quote }}
        - name: CLEANUP_OLD_SCHEMAS
          value: {{ .Values.tapSchema.cleanupOldSchemas | default "false" | quote }}
        
        volumeMounts:
        - name: update-script
          mountPath: /scripts
          readOnly: true
        
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - all
          readOnlyRootFilesystem: true
      
      # Cloud SQL Proxy sidecar
      - name: cloud-sql-proxy
        image: {{ .Values.cloudsql.image.repository }}:{{ .Values.cloudsql.image.tag }}
        imagePullPolicy: {{ .Values.cloudsql.image.pullPolicy }}
        
        args:
        - "--structured-logs"
        - "--port=5432"
        - "--max-sigterm-delay=30s"
        - {{ .Values.cloudsql.instanceConnectionName | quote }}
        
        securityContext:
          runAsNonRoot: true
          runAsUser: 65532
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - all
          readOnlyRootFilesystem: true
        
        resources:
          {{- toYaml .Values.cloudsql.resources | nindent 10 }}
      
      volumes:
      - name: update-script
        configMap:
          name: tap-schema-update-script
          defaultMode: 0755

How it would work:

  1. Developer updates schema version in Phalanx

  2. Sync using ArgoCD

  3. Helm hook executes automatically:

    • Helm renders template with new schemaVersion: "v1.2.4"

    • pre-upgrade hook ensures job runs BEFORE the main deployment updates

    • Job tap-schema-update-v1-2-4 is created and runs

    • Job loads schemas to CloudSQL

    • After job succeeds, the main TAP deployment proceeds

  4. Job cleanup:

    • hook-delete-policy: "before-hook-creation" deletes previous schema update jobs before creating new ones

    • ttlSecondsAfterFinished: 86400 keeps completed jobs for 24 hours for debugging

4.5 Phalanx Configuration#

Values File Structure#

(Omitting unchanged sections for conciseness)

# applications/tap/values.yaml (base configuration)

cloudsql:
  enabled: false
  instanceConnectionName: ""
  serviceAccount: ""
  image:
    repository: ""
    tag: ""
    pullPolicy: ""

uws:
  useCloudSQL: true
  cloudSqlDatabase: "uws"
  image:
    repository: ""
    pullPolicy: ""
    tag: ""
  
tapSchema:
  useCloudSQL: true
  cloudSqlDatabase: "tap_schema"
  schemaVersion: "v1.2.3"
  felisVersion: "1.0.0"
  schemas:
    - dp02_dc2
    - apdb
  cleanupOldSchemas: false

  # Keep legacy image config for cases where CloudSQL is not enabled
  image:
    repository: ""
    pullPolicy: ""
    tag: ""

Environment-Specific Configuration#

Different environments may serve different data, so each needs different schemas loaded. The idea proposed here is to allow each env to specify which schemas to load in their respective values files.

# applications/tap/values-idfint.yaml
tapSchema:
  schemaVersion: "v1.2.4"
  
  # Define which schemas to load in this environment
  schemas:
    - dp02_dc2
    - dp1

Helm Chart Templates#

# applications/tap/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "cadc-tap.fullname" . }}
  labels:
    {{- include "cadc-tap.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "cadc-tap.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        {{- with .Values.podAnnotations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
      labels:
        {{- include "cadc-tap.selectorLabels" . | nindent 8 }}
    spec:
      {{- if .Values.cloudsql.enabled }}
      serviceAccountName: {{ include "cadc-tap.serviceAccountName" . }}
      {{- end }}
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
      - name: tap
        image: "{{ .Values.config.qserv.image.repository }}:{{ .Values.config.qserv.image.tag }}"
        imagePullPolicy: {{ .Values.config.qserv.image.pullPolicy }}
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        env:
        {{- if .Values.tapSchema.useCloudSQL }}
        - name: TAP_SCHEMA_JDBC_URL
          value: "jdbc:postgresql://127.0.0.1:5432/{{ .Values.tapSchema.cloudSqlDatabase }}"
        {{- else }}
        - name: TAP_SCHEMA_JDBC_URL
          value: "jdbc:mysql://{{ .Values.config.tapSchemaAddress }}"
        {{- end }}
        {{- if .Values.uws.useCloudSQL }}
        - name: UWS_JDBC_URL
          value: "jdbc:postgresql://127.0.0.1:5432/{{ .Values.uws.cloudSqlDatabase }}"
        {{- else }}
        - name: UWS_JDBC_URL
          value: "jdbc:postgresql://cadc-tap-uws-db:5432/uwsdb"
        {{- end }}
        # ... (other env vars)        

      {{- if .Values.cloudsql.enabled }}
      - name: cloud-sql-proxy
        image: "{{ .Values.cloudsql.image.repository }}:{{ .Values.cloudsql.image.tag }}"
        imagePullPolicy: {{ .Values.cloudsql.image.pullPolicy }}
        args:
        - "--structured-logs"
        - "--port=5432"
        - {{ .Values.cloudsql.instanceConnectionName | quote }}
        securityContext:
          runAsNonRoot: true
          runAsUser: 65532
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - all
          readOnlyRootFilesystem: true
        resources:
          {{- toYaml .Values.cloudsql.resources | nindent 10 }}
      {{- end }}
      volumes:
      - name: config
        configMap:
          name: {{ include "cadc-tap.fullname" . }}-config
      - name: gcs-credentials
        secret:
          secretName: {{ include "cadc-tap.fullname" . }}-gcs-credentials
# applications/tap/templates/configmap-update-script.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tap-schema-update-script
  namespace: {{ .Release.Namespace }}
data:
  update-tap-schema.sh: |
{{ .Files.Get "files/update-tap-schema.sh" | indent 4 }}

Note: The tap-schema-db deployment and related resources in the cadc-tap chart need to be optional and disabled when CloudSQL is enabled for the tapSchema database.

Single CloudSQL Instance: The plan is to use one CloudSQL instance that hosts both uws and tap_schema databases so the above changes reflect that. If for some reason we need separate instances we can adjust accordingly in the future.

4.6 Testing and Validation#

The update script would include built-in verification that all configured schemas were actually loaded, and we may want to add post-deployment validation tests later.
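
As a starting point, that verification could combine a direct database check with an end-to-end query through the TAP service (a sketch; the service URL is a placeholder):

# Direct check: every configured schema made it into TAP_SCHEMA
psql -Atc "SELECT schema_name FROM tap_schema.schemas ORDER BY schema_name;"

# End-to-end check: a synchronous ADQL query through the TAP service
curl -fsS "https://data.example.org/api/tap/sync" \
    --data-urlencode "REQUEST=doQuery" \
    --data-urlencode "LANG=ADQL" \
    --data-urlencode "QUERY=SELECT schema_name FROM tap_schema.schemas"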

4.7 Felis Functionality Analysis#

Existing Felis Capabilities#

Based on current Felis documentation and implementation, the following functionality already exists:

1. Schema Validation

  • Command: felis validate validates one or more schema files in YAML format

  • Returns non-zero exit code on validation errors

2. TAP_SCHEMA Initialization

  • Command: felis init-tap-schema creates an empty TAP_SCHEMA database

  • Supports custom schema name via --tap-schema-name option

  • Creates standard TAP_SCHEMA tables (schemas, tables, columns, keys, key_columns)

3. TAP_SCHEMA Population

  • Command: felis load-tap-schema can generate SQL statements or update an existing TAP_SCHEMA database directly

  • Can save SQL to file: --output-file tap_schema.sql

  • Can update existing database: --engine-url=mysql+mysqlconnector://user:password@host/TAP_SCHEMA

  • The felis load-tap-schema command only performs INSERT operations. It does NOT:

    • Check if data already exists

    • Update existing rows

    • Delete old data

4. Database Creation

  • Command: felis create creates database objects from a schema including tables, columns, indexes, and constraints

  • Supports environment variable FELIS_ENGINE_URL for database connection

Required Functionality#

The following exists in felis already:

  • Initialize TAP_SCHEMA tables: felis init-tap-schema

  • Load schema to TAP_SCHEMA: felis load-tap-schema

  • Validate schema YAML: felis validate

The following does NOT exist:

Incremental Updates: No --update-mode=incremental flag exists

  • No built-in mechanism to compare existing schema with new schema

  • No ALTER TABLE support for modifying existing schemas

Multi-Schema Management:

  • Felis loads one schema file at a time

  • Our update script needs to loop through multiple schemas

  • Need custom logic to handle SCHEMAS_TO_LOAD configuration

Selective Schema Cleanup:

  • No built-in way to identify and remove old schemas

  • Our update script needs custom logic for CLEANUP_OLD_SCHEMAS

Transaction Management:

  • Felis wraps each felis load-tap-schema call in its own transaction and commits independently

  • We cannot wrap multiple Felis calls in an outer transaction

  • But we can use --dry-run mode to generate SQL, then execute all SQL in one script-managed transaction (see Section 4.8)

GitHub Release Download:

  • Felis has no built-in functionality to fetch schemas from GitHub

  • Update script needs to handle download and extraction

Implementation Requirements Summary#

What Felis provides out of the box:

# Initialize TAP_SCHEMA
felis init-tap-schema --engine-url=postgresql://user:pass@host:port/tap_schema

# Validate a schema
felis validate --check-description --check-tap-principal dp02_dc2.yaml

# Load schema to TAP_SCHEMA
felis load-tap-schema --engine-url=postgresql://user:pass@host:port/tap_schema dp02_dc2.yaml

What We Need to Build in update-tap-schema.sh:

  • Download and extract schema files from GitHub releases (or use pre-baked container)

  • Initialize TAP_SCHEMA tables if they don’t exist using felis init-tap-schema

  • Parse SCHEMAS_TO_LOAD and iterate through schemas

  • For each schema:

    • Validate with felis validate

    • DELETE existing schema data from all TAP_SCHEMA tables (required because Felis only INSERTs)

    • Execute felis load-tap-schema to INSERT new data

  • Transaction management: wrap DELETE + INSERT operations for ALL schemas in a single BEGIN/COMMIT block

  • Cleanup old schemas if CLEANUP_OLD_SCHEMAS is enabled

  • Error handling and reporting

Future Felis Enhancements (Nice to Have):

felis load-tap-schema --update-mode=incremental to update only changed schemas

felis list-schemas --engine-url=... to show loaded schemas

felis remove-schema schema_name --engine-url=... to remove a schema

4.8 Transaction Strategy#

Felis manages its own transaction internally when loading schemas. Each felis load-tap-schema call commits independently, preventing us from wrapping multiple Felis calls in an outer transaction. The workaround suggested here is to use Felis in dry-run mode to generate SQL, then execute all SQL in a single script-managed transaction:

# Phase 1: Clear and repopulate staging
psql <<EOF
-- Clear staging (keep structure, delete data)
DELETE FROM tap_schema_staging.key_columns;
DELETE FROM tap_schema_staging.keys;
DELETE FROM tap_schema_staging.columns;
DELETE FROM tap_schema_staging.tables;
DELETE FROM tap_schema_staging.schemas;
EOF

# Phase 2: Load new schemas into staging
for schema in $SCHEMAS_TO_LOAD; do
    felis load-tap-schema \
        --engine-url=postgresql://${PGUSER}@${PGHOST}:${PGPORT}/${PGDATABASE} \
        --tap-schema-name=tap_schema_staging \
        ${schema}.yaml
done

# Phase 3: Validate staging
psql -c "SELECT COUNT(*) FROM tap_schema_staging.schemas WHERE schema_name IN ('dp02_dc2', 'apdb');"
# (more validation tests)

# Phase 4: Atomic three-way swap
psql <<EOF
BEGIN;
ALTER SCHEMA tap_schema RENAME TO tap_schema_temp;
ALTER SCHEMA tap_schema_staging RENAME TO tap_schema;
ALTER SCHEMA tap_schema_temp RENAME TO tap_schema_staging;
COMMIT;
EOF

With the blue-green approach, the data load itself does not need a single outer BEGIN/COMMIT block: all schemas in SCHEMAS_TO_LOAD are loaded into the staging schema outside of any transaction, and only the schema rename operations run inside a transaction (milliseconds), so TAP service queries are not blocked during the data load.

The above describes the blue-green approach (Option C in Section 4.1). If we eventually choose full replacement instead, the flow becomes simpler: there is no staging schema, and we can delete the existing schema data directly from tap_schema before inserting the new data.
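
For completeness, a sketch of that simpler full-replacement variant, using the --dry-run SQL generation described in Section 4.7 (file paths are assumptions):

# Generate INSERT SQL for each configured schema without touching the database
for schema in ${SCHEMAS_TO_LOAD//,/ }; do
    felis load-tap-schema --dry-run --output-file "/tmp/${schema}.sql" \
        "/tmp/schemas/yml/${schema}.yaml"
done

# Execute all DELETEs and INSERTs in one script-managed transaction
{
    echo "BEGIN;"
    echo "DELETE FROM tap_schema.key_columns;"
    echo "DELETE FROM tap_schema.keys;"
    echo "DELETE FROM tap_schema.columns;"
    echo "DELETE FROM tap_schema.tables;"
    echo "DELETE FROM tap_schema.schemas;"
    for schema in ${SCHEMAS_TO_LOAD//,/ }; do
        cat "/tmp/${schema}.sql"
    done
    echo "COMMIT;"
} | psql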

4.9 Felis Docker Image#

For the Helm hook job, we would need a container image with Felis installed. This should be created via GitHub Actions and pushed to GHCR.


5. Migration Plan#

Phase 1: Infrastructure Setup#

Objective: Update CloudSQL instance and configure access

TAP_SCHEMA will be stored as a Postgres schema in the existing CloudSQL tap database alongside the UWS schema. This approach reuses existing infrastructure, requires only a single CloudSQL proxy sidecar, and keeps maintenance simple. TAP_SCHEMA's small size and infrequent updates mean it won't impact UWS performance. If future requirements show the need for complete separation, the tap_schema schema can be migrated to a separate database.

Tasks:

  1. Ensure existing:

    • Database tap and schema for tap_schema

    • Service accounts

    • Workload Identities

  2. Verify the service account has CREATE/DROP SCHEMA permissions on the tap database (see the check sketched after this list)

  3. No new CloudSQL instance needed - reusing existing
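
For task 2, a quick permission check might look like this (a sketch; the felis-updater role name is an assumption):

# CREATE on the tap database is needed for the blue-green schema renames
psql -d tap -c "SELECT has_database_privilege('felis-updater', 'tap', 'CREATE');"

# DROP/RENAME SCHEMA also require ownership of the schema, so check owners too
psql -d tap -c "SELECT schema_name, schema_owner FROM information_schema.schemata
                WHERE schema_name LIKE 'tap_schema%';"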

Deliverables:

  • Terraform configuration (not obvious whether anything needs to change)

  • Service accounts configured (May re-use existing)

Phase 2: TAP_SCHEMA Update Mechanism#

Objective: Implement schema loading and update logic

Tasks:

  1. Verify Felis capabilities and commands

  2. Create update-tap-schema.sh script

  3. Implement schema distribution method (Need to choose which option we want to implement first)

  4. Test schema loading in dev environment

  5. Implement validation and error handling

Deliverables:

  • Working update script

  • Schema validation logic

  • Error handling and validated rollback procedures

High-Level Flow of update-tap-schema.sh:#

Parse configuration - Read SCHEMA_VERSION, SCHEMAS_TO_LOAD

Fetch schema files - Download from GCS

Initialize schemas (first time only):
    felis init-tap-schema --tap-schema-name=tap_schema
    felis init-tap-schema --tap-schema-name=tap_schema_staging

Validate all schemas - Run felis validate on each YAML file

Clear staging schema:
    DELETE all rows from tap_schema_staging tables

Load into staging:
    For each schema in SCHEMAS_TO_LOAD:
        felis load-tap-schema --tap-schema-name=tap_schema_staging schema.yaml

Validate staging:
    Verify all expected schemas exist in tap_schema_staging
    Verify each schema has tables
    Check foreign key integrity

Atomic swap:
    BEGIN TRANSACTION
        ALTER SCHEMA tap_schema RENAME TO tap_schema_temp
        ALTER SCHEMA tap_schema_staging RENAME TO tap_schema
        ALTER SCHEMA tap_schema_temp RENAME TO tap_schema_staging
    COMMIT TRANSACTION

Report results

Exit - Return 0 if succeeded, 1 if failed

Note: If we choose to delete the entire TAP_SCHEMA and recreate it from scratch, the script becomes simpler.

Phase 3: Phalanx Configuration Updates#

Objective: Update Helm charts and configuration

Tasks:

  1. Update TAP application values files

  2. Create Helm hook Job template

  3. Create ConfigMap for update script

  4. Update deployment templates for CloudSQL connectivity

  5. Make tap-schema-db deployment conditional/optional

  6. Test on dev / int environments

Deliverables:

  • Updated Helm charts

  • Environment-specific values files

  • Working deployment in idfdev

Phase 4: sdm_schemas CI/CD Updates#

Objective: Automate schema packaging and release

Tasks:

  1. Update GitHub Actions workflow

  2. Install Felis in CI

  3. Validate all schemas in CI

  4. Create release assets:

    • schemas.tar.gz OR

    • Pre-baked container image

  5. Deprecate old Docker image build process

  6. Update documentation

Deliverables:

  • Updated CI/CD pipeline

  • Automated schema releases

  • Deprecated old build process

(Optional) Automating schema version updates in Phalanx with GitHub workflows

If we want to provide further automation, we could set up GitHub Actions in the sdm_schemas repository to automatically create PRs in Phalanx when a new release is tagged.

The workflow would detect new release tags in sdm_schemas, check out the Phalanx repository, update the schemaVersion in the appropriate values files, and create a pull request for review. This would further reduce manual steps but would add complexity to the GitHub workflows. Manually creating a PR in Phalanx seems straightforward enough that this is probably not worth the effort for the MVP.

Phase 5: Production Migration#

Tasks:

  1. Deploy to idfdev, then idfint

  2. Monitor for a week or two

  3. Deploy to production


6. Operations Runbook#

6.1 Standard Schema Update Workflow#

1. Make change in sdm_schemas repo
Example: change sdm_schemas/yml/dp02_dc2.yaml

2. Validate locally
felis validate --check-description yml/dp02_dc2.yaml

3. Create PR and release in sdm_schemas
Example: tag v1.2.3

4. Update Phalanx
In phalanx, change applications/tap/values-idfdev.yaml
Change: schemaVersion: "v1.2.3" then commit/push/PR

5. ArgoCD syncs and runs update job automatically

6. Job downloads v1.2.3 and loads all schemas

7. TAP service automatically uses new schemas

6.2 Selective Schema Update#

Update only specific schemas without changing version:

# In values file, change which schemas are loaded
tapSchema:
  schemaVersion: "v1.2.4"
  schemas:
    - dp02_dc2
    - apdb
    - new_schema  # Add a new schema

# Commit and push - ArgoCD syncs and loads the new schema

6.3 Rollback Procedure#

If issues occur after a schema update, we can roll back using GitOps via an ArgoCD rollback or a git revert + sync. By syncing to a previous version, the Helm hook job will run again and reload the previous schema version. The previous schema version's GitHub release (or Docker image, depending on which approach we go with) must still exist. If for whatever reason the previous version is not available, the option is also there to manually restore the schema from a CloudSQL backup.

In the case of blue-green deployment we also have the option of swapping the schema names in the database, since we keep the previous version around after a new deploy. However, this would be a manual process (or at best a script we run manually) and not part of the automated Helm hook job, unless we add some sort of flag to indicate a rollback.

GitOps Rollback:

  1. Revert schemaVersion in Phalanx

  2. ArgoCD sync reruns Helm hook

  3. Job reloads old version from GCS into staging

  4. Swap completes

Instant Rollback using the staging schema (alternative method):

Since we maintain both schemas permanently, rollback could also be done via:

psql <<EOF
BEGIN;
ALTER SCHEMA tap_schema RENAME TO tap_schema_temp;
ALTER SCHEMA tap_schema_staging RENAME TO tap_schema;
ALTER SCHEMA tap_schema_temp RENAME TO tap_schema_staging;
COMMIT;
EOF

The current plan is to use the GitOps rollback as the primary method, and we can always revisit and add the instant swap method later if needed.


7. Security Considerations#

7.1 IAM Authentication#

We will use IAM authentication for CloudSQL access, following the pattern used by UWS, so no password management is required.
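
With Cloud SQL Proxy v2 this amounts to adding the automatic IAM authentication flag to the proxy invocation, mirroring the sidecar args in the Job template above (a sketch):

# Cloud SQL Proxy v2 invocation with automatic IAM database authentication
cloud-sql-proxy \
    --structured-logs \
    --port=5432 \
    --auto-iam-authn \
    "${INSTANCE_CONNECTION_NAME}"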

7.2 Database Roles#

The current design grants the felis-updater job full access to the tap_schema database. We should probably consider two roles here: a reader role used by the TAP service, and a writer role used by the felis-updater job with full CRUD permissions on the tap_schema tables.
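
A sketch of what those grants might look like (role names are placeholders):

psql -d tap <<'EOF'
-- Read-only access for the TAP service
GRANT USAGE ON SCHEMA tap_schema TO "tap-service";
GRANT SELECT ON ALL TABLES IN SCHEMA tap_schema TO "tap-service";

-- Writer access for the felis-updater job; the blue-green swap additionally
-- requires CREATE on the database and ownership of the schemas being renamed
GRANT CREATE ON DATABASE tap TO "felis-updater";
GRANT ALL ON SCHEMA tap_schema, tap_schema_staging TO "felis-updater";
GRANT ALL ON ALL TABLES IN SCHEMA tap_schema, tap_schema_staging TO "felis-updater";
EOF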


9. Open Questions#

Transaction Size#

Our current design wraps all schemas in a single transaction. Are we concerned about transaction size and potential timeouts? TAP service queries against TAP_SCHEMA might be blocked for the duration of the update. Note: this question is only relevant if we choose the single-transaction approach instead of blue-green.

Schema Distribution#

How should the updater job get the schema files? Download from GitHub? Bake them into a container image? Mount them from a ConfigMap? I was initially thinking of starting with Option A (GitHub download) for the MVP, or Option B (baked container) to avoid the runtime dependency on GitHub, but the current plan (Section 4.2) is Option D (download from GCS).

CloudSQL Proxy Configuration#

Is it OK for UWS and TAP_SCHEMA to share a single CloudSQL proxy sidecar, or do we want separate proxies? The current design uses a single proxy on port 5432, with database selection via the JDBC URL's database parameter. We need to verify this works correctly.

Update Strategy#

How should we handle the update strategy, full replacement vs. incremental updates? I'd probably aim for full replacement (via the blue-green swap of Option C) for the MVP and implement incremental updates as a future enhancement.

Version History#

Do we want to maintain schema version history in the database, or at least record the current schema version somewhere, or does that add unnecessary complexity?


10. Documentation Updates Required#

  • Update TAP service README with new architecture

  • Document new schema update process

  • Add troubleshooting guide for common issues