SQR-107
TAP_SCHEMA Migration to CloudSQL - Architecture & Implementation Breakdown#
Abstract
Currently the TAP_SCHEMA database is packaged into a Docker image and pre-populated with schema metadata, built from the sdm_schemas YAML schema definitions using Felis.
This image is deployed as a separate deployment (tap-schema-db) alongside the main TAP service.
This approach requires rebuilding and pushing Docker images for every
schema change, couples schema updates to application deployments, and results in ephemeral schema storage that is lost on pod restarts.
Proposed Solution: Migrate TAP_SCHEMA to a persistent CloudSQL instance, where Felis loads the schema metadata directly via a Kubernetes Job triggered by Helm hooks during our regular updates via ArgoCD.
This architecture simplifies schema management by eliminating the containerized TAP_SCHEMA database pod and removing the Docker image build cycle for schema updates. Instead of rebuilding containers for every schema change, a lightweight Helm hook job loads YAML schema definitions directly to CloudSQL during ArgoCD deployments. The TAP service connects to CloudSQL for schema metadata, gaining better backup, recovery, and monitoring capabilities. Although schema updates still flow through Phalanx (by updating schemaVersion), we improve the maintenance overhead and get less coupling between the tap application and the TAP_SCHEMA database.
Scope: This architecture applies to both the QServ and the Postgres backed TAP services (tap & ssotap applications).
2. Current State#
2.1 Existing Architecture#
TAP Service (lsst-tap-service & tap-postgres)#
- CADC-based TAP implementation - https://github.com/opencadc/tap 
- Deployed via Phalanx/ArgoCD - https://github.com/lsst-sqre/phalanx 
- Uses: - UWS database (PostgreSQL in CloudSQL, but with an option to configure with a cluster database) 
- TAP_SCHEMA database (Cluster database deployment) 
 
TAP_SCHEMA Container Image#
The TAP_SCHEMA container image is built from the sdm_schemas repository
and includes a MySQL database pre-populated with TAP_SCHEMA metadata.
Felis generates SQL files from YAML schema definitions, which are used to
prepopulate the image with TAP_SCHEMA metadata. This image is deployed as a separate Kubernetes deployment, and the TAP service connects to it via a Kubernetes service endpoint using the JDBC URL jdbc:mysql://{{ tapSchemaAddress }}/.
Schema Generation Process#
- Source: - sdm_schemas/yml/*.yamlfiles (https://github.com/lsst/sdm_schemas/)
- Tool: Felis (Python-based schema management tool) (https://github.com/lsst/felis) 
- Build script: - sdm_schemas/tap-schema/build-all
- Output: Docker image with pre-populated TAP_SCHEMA database 
2.2 Problems with Current Approach#
The current containerized approach creates unnecessary friction in schema management. Every schema update requires rebuilding a Docker image, pushing it to a registry, and redeploying pods which is a time-consuming process for what is essentially metadata changes.
Also container storage is ephemeral, meaning schema data is lost on pod restarts and lacks the backup & recovery, logging and monitoring and robustness capabilities of CloudSQL databases.
3. Proposed Architecture#
3.1 High-Level Design#
Core Changes:
TAP_SCHEMA in CloudSQL: Move from containerized DB to persistent CloudSQL database
Helm Hook Automation: Trigger schema updates automatically during ArgoCD deployments
3.2 TAP_SCHEMA Structure#
TAP_SCHEMA consists of several metadata tables:
-- Core TAP_SCHEMA tables
tap_schema.schemas     -- List of schemas (e.g., dp02_dc2, apdb)
tap_schema.tables      -- Tables in each schema
tap_schema.columns     -- Columns in each table
tap_schema.keys        -- Foreign key relationships
tap_schema.key_columns -- Columns involved in foreign keys
How Felis Populates These:
Felis populates these by reading the YAML schema definition (e.g., yml/dp02_dc2.yaml), converting to TAP_SCHEMA INSERT/UPDATE statements which in the current setup are written as .sql scripts which are then mounted and executed during startup of the database pod.
3.3 Proposed Architecture Diagram#
[Developer] → [sdm_schemas repo] → [GitHub Release v1.2.4]
                                           ↓
[Phalanx values.yaml] ← [Manual PR] ← [schemaVersion: v1.2.4]
         ↓
    [ArgoCD Sync]
         ↓
    [Helm pre-upgrade hook]
         ↓
    [Job: felis-updater] → [CloudSQL Proxy] → [CloudSQL: tap_schema DB]
         ↓
    [TAP Service Deployment] → [CloudSQL Proxy] → [CloudSQL: tap_schema + uws DBs]
 
Fig. 1 Proposed Architecture Diagram#
4. Detailed Design#
4.1 Schema Update Strategies#
One complexity that the new architecture introduces is how to handle schema updates. Previously we simply rebuilt the entire TAP_SCHEMA image from scratch for every change, and then during upgrades via GitOps the new image would replace the old one in a rolling update, transparent to the user.
With CloudSQL this is potentially a bit more complex because we have to consider how to handle changes to existing schemas, additions of new schemas, and removals of old schemas, but also how to do so while minimizing downtime for the user.
Option A: Full Replacement#
- Drop existing TAP_SCHEMA tables (Or drop TAP_SCHEMA entirely) 
- Recreate from scratch 
- Simplest option 
- Brief TAP service interruption during update 
Would have to test whether a full drop of the parent TAP_SCHEMA schema or individual deletes of each schema is faster. Dropping the entire schema is probably simpler, but would require re-initialization of the TAP_SCHEMA tables.
Option B: Incremental Updates (Potential Future Enhancement)#
- Update only changed schemas 
- Keep existing schemas intact 
- No service interruption 
- More complex logic needed 
Incremental updates would be more complex because they would potentially require changes to Felis.
With Felis, if each schema was loaded individually using felis load-tap-schema,
a rough outline of what this may require could be:
- Add - --update-mode=incrementalflag to felis load-tap-schema
- When loading a schema, check if it exists: - If exists, compare tables/columns 
- ALTER existing tables to match new definition 
- Add new tables/columns 
- Drop removed tables/columns (optional) 
 
If on the other hand we chose to generate the SQL files and run the updates in a single transaction, incremental updates would require a custom script that would:
- Generate SQL for new schema: - felis load-tap-schema --dry-run new_version.yaml > new.sql
- Query current TAP_SCHEMA state to see what exists 
- Write custom diff logic to: - Compare new schema definition against current database state 
- Generate UPDATE statements for changed columns/tables 
- Generate INSERT statements only for new tables/columns 
- Generate DELETE statements for removed elements (optional) 
 
- Execute the diff-generated SQL instead of full DELETE + INSERT 
This approach seems more complex because, as Felis seems more geared towards full INSERTs, so this would require much more custom logic and risk of errors.
Option C: Blue-Green Pattern (MVP)#
Another option is following the blue-green pattern where we maintain two sets of schema tables:
- tap_schema(active)
- tap_schema_staging(inactive - updated during deployments)
Update Flow:
- Load new version into staging: - Clear - tap_schema_staging(DROP CASCADE + recreate, or DELETE all data)
- Populate - tap_schema_stagingwith new schema metadata
- Validate staging 
 
- Atomic swap (just rename, no DROP): 
   BEGIN;
   ALTER SCHEMA tap_schema RENAME TO tap_schema_temp;
   ALTER SCHEMA tap_schema_staging RENAME TO tap_schema;
   ALTER SCHEMA tap_schema_temp RENAME TO tap_schema_staging;
   COMMIT;
- Result: - tap_schemanow has new version (active)
- tap_schema_stagingnow has old version (available for instant rollback)
 
Rollback (instant):
BEGIN;
ALTER SCHEMA tap_schema RENAME TO tap_schema_temp;
ALTER SCHEMA tap_schema_staging RENAME TO tap_schema;
ALTER SCHEMA tap_schema_temp RENAME TO tap_schema_staging;
COMMIT;
This provides some advantages:
- Zero-downtime updates (only brief rename) 
- Instant rollback capability (just swap names again) 
Postgres should in theory clear existing connections to the old schema during the rename, so queries should be automatically routed to the new schema without requiring a restart of the TAP service. However this would need to be tested to confirm.
Comparison of single transaction upgrade with DELETE vs blue-green#
In terms of service disruption, both approaches would likely result in no downtime. However the single transaction would block queries for a brief period while the update is in progress. Complexity-wise, single transaction is simpler to implement. A full replacement option would also use less storage compared to the blue-green deployment has to at least temporarily have two copies, although this probably is a minimal cost since the TAP_SCHEMA tables shouldn’t take up too much disk space.
Conclusion: For MVP, the plan is to implement Option C (blue-green). This gives us a good balance of minimal downtime and simplicity.
4.2 Schema Distribution Methods#
Another aspect of the design to consider is how the Helm hook job gets the schema files to load. A few options exist:
Option A: Download from GitHub Releases
- Downloads the specified release from GitHub ( - sdm_schemasreleases)
- Extracts the - schemas.tar.gzfile containing all YAML schema definitions
- Validates that the release contains the expected files 
Pros: Simple to implement initially, no additional build steps
Cons: Runtime dependency on GitHub API
Option B: Bake into Container Image (Probably Preferred for Long-Term)
Build image: ghcr.io/lsst/tap-schemas:v1.2.4
Contains:
- All YAML files 
- Felis tool 
- Update scripts 
Pros: No runtime GitHub dependency.
Cons: Requires additional CI/CD build step
Although we could also modify the existing sdm_schemas GitHub Actions
workflow to build and push this image whenever a new release is created instead of the
MySQL database.
Option C: Mount from ConfigMap
Commit the Felis YAML (or SQL) to the Phalanx Git repo, render them into a ConfigMap, and Felis reads locally against that.
Pros: Fully GitOps native
Cons: ConfigMap size limits & probably clutters Phalanx repo
Option D: Store release artifacts in GCS bucket and download from there.
This would involve modifying the sdm_schemas CI/CD to upload the schemas.tar.gz
to a GCS bucket whenever a new release is created.
The Helm hook job would then download from this GCS bucket.
The urls to the artifacts would be constructed based on the release version,
and would be available publicly, and discoverable via Repertoire.
Pros: More control over availability, no GitHub dependency
Cons: Requires GCS bucket management, additional complexity
For our MVP, the current plan is to implement Option D and download from GCS.
4.3 Update Script Logic#
The update script (update-tap-schema.sh) will perform the following operations:
1. Fetch Schemas#
Based on selected distribution method (see 4.2)
2. Initialize TAP_SCHEMA Tables#
- Creates the standard TAP_SCHEMA tables if they don’t exist: - tap_schema.schemas
- tap_schema.tables
- tap_schema.columns
- tap_schema.keys
- tap_schema.key_columns
 
- This probably would also be a call to a felis command - init-tap-schemabefore we run the updates.
- Set up appropriate indexes for query performance 
3. Validate Configuration#
- Parses the comma-separated list of schemas from - SCHEMAS_TO_LOAD
- Verifies that each configured schema exists in the downloaded release 
- Reports available schemas if any configured schema is missing 
- If there is a validation issue we probably want to revert the update process 
4. Load Each Schema#
For each schema in the configuration:
- Validates the YAML file using Felis ( - felis validate)- Generates INSERT SQL using - felis load-tap-schema --dry-run --output-file
 
- Execute DELETE + INSERT in a single transaction (see Section 4.9 for details) - Delete existing schema data from all TAP_SCHEMA tables 
- Execute generated INSERT SQL 
 
Note: All schemas DELETE + INSERT operations are wrapped in a single outer transaction to ensure atomicity across all schemas.
5. Optional Cleanup#
If CLEANUP_OLD_SCHEMAS is enabled:
- Identifies schemas in CloudSQL not in the configured list 
- Removes obsolete schemas and their metadata 
Whether this is necessary depends on if our full replacement strategy is to delete the metadata for each schema from the TAP_SCHEMA tabless, or to drop TAP_SCHEMA altogether and recreate it from scratch. If we drop and recreate then this step is not needed.
6. Report Results (Optional)#
- Should we include some sort of verification step and report somehow? 
- Perhaps the simplest is to exit with appropriate status code and let the Helm job report success/failure. 
Idempotency: The script is designed to be idempotent, running it multiple times with the same schema version is safe and should produce the same result.
Transparency: Since the TAP services query TAP_SCHEMA tables at runtime, updates to TAP_SCHEMA in CloudSQL should be transparent, and the next query from TAP will see the new schema.
4.4 Helm Hook Implementation#
We’ll use Helm hooks to trigger schema updates, following the pattern used by other Phalanx apps like wobbly.
Rough draft of the Job template:
# File: phalanx/charts/cadc-tap/templates/job-schema-update.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tap-schema-update-{{ .Values.tapSchema.schemaVersion | replace "." "-" }}
  namespace: {{ .Release.Namespace }}
  annotations:
    helm.sh/hook: "pre-install,pre-upgrade"
    helm.sh/hook-delete-policy: "before-hook-creation"
    helm.sh/hook-weight: "-5"
  labels:
    {{- include "cadc-tap.labels" . | nindent 4 }}
    app.kubernetes.io/component: "schema-update"
spec:
  ttlSecondsAfterFinished: 86400
  backoffLimit: 3
  template:
    metadata:
      labels:
        {{- include "cadc-tap.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/component: "schema-update"
    spec:
      serviceAccountName: {{ include "cadc-tap.serviceAccountName" . }}
      restartPolicy: OnFailure
      
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
      
      containers:
      - name: felis-updater
        image: ghcr.io/lsst/felis:{{ .Values.tapSchema.felisVersion | default "1.0.0" }}
        imagePullPolicy: IfNotPresent
        
        command: ["/bin/bash", "/scripts/update-tap-schema.sh"]
        
        env:
        - name: PGHOST
          value: "127.0.0.1"
        - name: PGPORT
          value: "5432"
        - name: PGDATABASE
          value: {{ .Values.tapSchema.cloudSqlDatabase | quote }}
        - name: SCHEMA_VERSION
          value: {{ .Values.tapSchema.schemaVersion | quote }}
        - name: SCHEMAS_TO_LOAD
          value: {{ .Values.tapSchema.schemas | join "," | quote }}
        - name: CLEANUP_OLD_SCHEMAS
          value: {{ .Values.tapSchema.cleanupOldSchemas | default "false" | quote }}
        
        volumeMounts:
        - name: update-script
          mountPath: /scripts
          readOnly: true
        
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - all
          readOnlyRootFilesystem: true
      
      # Cloud SQL Proxy sidecar
      - name: cloud-sql-proxy
        image: {{ .Values.cloudsql.image.repository }}:{{ .Values.cloudsql.image.tag }}
        imagePullPolicy: {{ .Values.cloudsql.image.pullPolicy }}
        
        args:
        - "--structured-logs"
        - "--port=5432"
        - "--max-sigterm-delay=30s"
        - {{ .Values.cloudsql.instanceConnectionName | quote }}
        
        securityContext:
          runAsNonRoot: true
          runAsUser: 65532
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - all
          readOnlyRootFilesystem: true
        
        resources:
          {{- toYaml .Values.cloudsql.resources | nindent 10 }}
      
      volumes:
      - name: update-script
        configMap:
          name: tap-schema-update-script
          defaultMode: 0755
How it would work:
- Developer updates schema version in Phalanx 
- Sync using ArgoCD 
- Helm hook executes automatically: - Helm renders template with new - schemaVersion: "v1.2.4"
- pre-upgradehook ensures job runs BEFORE the main deployment updates
- Job - tap-schema-update-v1-2-4is created and runs
- Job loads schemas to CloudSQL 
- After job succeeds, the main TAP deployment proceeds 
 
- Job cleanup: - hook-delete-policy: "before-hook-creation"deletes previous schema update jobs before creating new ones
- ttlSecondsAfterFinished: 86400keeps completed jobs for 24 hours for debugging
 
4.5 Phalanx Configuration#
Values File Structure#
(Ommmitting unchanged sections for conciseness)
# applications/tap/values.yaml (base configuration)
cloudSQL:
  enabled: false
  instanceConnectionName: ""
  serviceAccount: ""
  image:
    repository: ""
    tag: ""
    pullPolicy: ""
uws:
  useCloudSQL: true
  cloudSqlDatabase: "uws"
  image:
    repository: ""
    pullPolicy: ""
    tag: ""
  
tapSchema:
  useCloudSQL: true
  cloudSqlDatabase: "tap_schema"
  schemaVersion: "v1.2.3"
  felisVersion: "1.0.0"
  schemas:
    - dp02_dc2
    - apdb
  cleanupOldSchemas: false
  # Keep legacy image config for cases where CloudSQL is not enabled
  image:
    repository: ""
    pullPolicy: ""
    tag: ""
Environment-Specific Configuration#
Different environments may serve different data, so each needs different schemas loaded. The idea proposed here is to allow each env to specify which schemas to load in their respective values files.
# applications/tap/values-idfint.yaml
tapSchema:
  schemaVersion: "v1.2.4"
  
  # Define which schemas to load in this environment
  schemas:
    - dp02_dc2
    - dp1
Helm Chart Templates#
# applications/tap/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "cadc-tap.fullname" . }}
  labels:
    {{- include "cadc-tap.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "cadc-tap.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        {{- with .Values.podAnnotations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
      labels:
        {{- include "cadc-tap.selectorLabels" . | nindent 8 }}
    spec:
      {{- if .Values.cloudsql.enabled }}
      serviceAccountName: {{ include "cadc-tap.serviceAccountName" . }}
      {{- end }}
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
      - name: tap
        image: "{{ .Values.config.qserv.image.repository }}:{{ .Values.config.qserv.image.tag }}"
        imagePullPolicy: {{ .Values.config.qserv.image.pullPolicy }}
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        env:
        {{- if .Values.tapSchema.useCloudSQL }}
        - name: TAP_SCHEMA_JDBC_URL
          value: "jdbc:postgresql://127.0.0.1:5432/{{ .Values.tapSchema.cloudSqlDatabase }}"
        {{- else }}
        - name: TAP_SCHEMA_JDBC_URL
          value: "jdbc:mysql://{{ .Values.config.tapSchemaAddress }}"
        {{- end }}
        {{- if .Values.uws.useCloudSQL }}
        - name: UWS_JDBC_URL
          value: "jdbc:postgresql://127.0.0.1:5432/{{ .Values.uws.cloudSqlDatabase }}"
        {{- else }}
        - name: UWS_JDBC_URL
          value: "jdbc:postgresql://cadc-tap-uws-db:5432/uwsdb"
        {{- end }}
        # ... (other env vars)        
      {{- if .Values.cloudsql.enabled }}
      - name: cloud-sql-proxy
        image: "{{ .Values.cloudsql.image.repository }}:{{ .Values.cloudsql.image.tag }}"
        imagePullPolicy: {{ .Values.cloudsql.image.pullPolicy }}
        args:
        - "--structured-logs"
        - "--port=5432"
        - {{ .Values.cloudsql.instanceConnectionName | quote }}
        securityContext:
          runAsNonRoot: true
          runAsUser: 65532
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - all
          readOnlyRootFilesystem: true
        resources:
          {{- toYaml .Values.cloudsql.resources | nindent 10 }}
      {{- end }}
      volumes:
      - name: config
        configMap:
          name: {{ include "cadc-tap.fullname" . }}-config
      - name: gcs-credentials
        secret:
          secretName: {{ include "cadc-tap.fullname" . }}-gcs-credentials
# applications/tap/templates/configmap-update-script.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tap-schema-update-script
  namespace: {{ .Release.Namespace }}
data:
  update-tap-schema.sh: |
{{ .Files.Get "files/update-tap-schema.sh" | indent 4 }}
Note: The tap-schema-db deployment and related resources in the cadc-tap chart need to be optional and disabled when CloudSQL is enabled for the tapSchema database.
Single CloudSQL Instance: The plan is to use one CloudSQL instance that hosts both uws and tap_schema databases so the above changes reflect that. If for some reason we need separate instances we can adjust accordingly in the future.
4.6 Testing and Validation#
The update script would include some sort of built-in verification to ensure the schemas in the configuration are all being loaded, and perhaps we may want to include some sort of post-deployment validation tests later.
4.7 Felis Functionality Analysis#
Existing Felis Capabilities#
Based on current Felis documentation and implementation, the following functionality already exists:
1. Schema Validation
- Command: - felis validatevalidates one or more schema files in YAML format
- Returns non-zero exit code on validation errors 
2. TAP_SCHEMA Initialization
- Command: - felis init-tap-schemacreates an empty TAP_SCHEMA database
- Supports custom schema name via - --tap-schema-nameoption
- Creates standard TAP_SCHEMA tables (schemas, tables, columns, keys, key_columns) 
3. TAP_SCHEMA Population
- Command: - felis load-tap-schemacan generate SQL statements or update an existing TAP_SCHEMA database directly
- Can save SQL to file: - --output-file tap_schema.sql
- Can update existing database: - --engine-url=mysql+mysqlconnector://user:password@host/TAP_SCHEMA
- The - felis load-tap-schemacommand only performs INSERT operations. It does NOT:- Check if data already exists 
- Update existing rows 
- Delete old data 
 
4. Database Creation
- Command: - felis createcreates database objects from a schema including tables, columns, indexes, and constraints
- Supports environment variable - FELIS_ENGINE_URLfor database connection
Required Functionality#
The following exists in felis already:
- Initialize TAP_SCHEMA tables: - felis init-tap-schema
- Load schema to TAP_SCHEMA: - felis load-tap-schema
- Validate schema YAML: - felis validate
The following does NOT exist:
Incremental Updates: No --update-mode=incremental flag exists
- No built-in mechanism to compare existing schema with new schema 
- No ALTER TABLE support for modifying existing schemas 
Multi-Schema Management:
- Felis loads one schema file at a time 
- Our update script needs to loop through multiple schemas 
- Need custom logic to handle - SCHEMAS_TO_LOADconfiguration
Selective Schema Cleanup:
- No built-in way to identify and remove old schemas 
- Our update script needs custom logic for - CLEANUP_OLD_SCHEMAS
Transaction Management:
- Felis wraps each - felis load-tap-schemacall in its own transaction and commits independently
- We cannot wrap multiple Felis calls in an outer transaction 
- But, we can use - --dry-runmode to generate SQL, then execute all SQL in one script-managed transaction (see Section 4.9)
GitHub Release Download:
- Felis has no built-in functionality to fetch schemas from GitHub 
- Update script needs to handle download and extraction 
Implementation Requirements Summary#
What Felis provides out of the box:
# Initialize TAP_SCHEMA
felis init-tap-schema --engine-url=postgresql://user:pass@host:port/tap_schema
# Validate a schema
felis validate --check-description --check-tap-principal dp02_dc2.yaml
# Load schema to TAP_SCHEMA
felis load-tap-schema --engine-url=postgresql://user:pass@host:port/tap_schema dp02_dc2.yaml
What We Need to Build in update-tap-schema.sh:
- Download and extract schema files from GitHub releases (or use pre-baked container) 
- Initialize TAP_SCHEMA tables if they don’t exist using felis - init-tap-schema
- Parse - SCHEMAS_TO_LOADand iterate through schemas
- For each schema: - Validate with felis validate 
- DELETE existing schema data from all TAP_SCHEMA tables (required because Felis only INSERTs) 
- Execute felis load-tap-schema to INSERT new data 
 
- Transaction management: wrap DELETE + INSERT operations for ALL schemas in a single BEGIN/COMMIT block 
- Cleanup old schemas if CLEANUP_OLD_SCHEMAS is enabled 
- Error handling and reporting 
Future Felis Enhancements (Nice to Have):
felis load-tap-schema --update-mode=incremental to update only changed schemas
felis list-schemas --engine-url=... to show loaded schemas
felis remove-schema schema_name --engine-url=... to remove a schema
4.8 Transaction Strategy#
Felis manages its own transaction internally when loading schemas. Each felis load-tap-schema call commits independently, preventing us from wrapping multiple Felis calls in an outer transaction.
The workaround suggested here is to use Felis in dry-run mode to generate SQL,
then execute all SQL in a single script-managed transaction:
# Phase 1: Clear and repopulate staging
psql <<EOF
-- Clear staging (keep structure, delete data)
DELETE FROM tap_schema_staging.key_columns;
DELETE FROM tap_schema_staging.keys;
DELETE FROM tap_schema_staging.columns;
DELETE FROM tap_schema_staging.tables;
DELETE FROM tap_schema_staging.schemas;
EOF
# Phase 2: Load new schemas into staging
for schema in $SCHEMAS_TO_LOAD; do
    felis load-tap-schema \
        --engine-url=postgresql://${PGUSER}@${PGHOST}:${PGPORT}/${PGDATABASE} \
        --tap-schema-name=tap_schema_staging \
        ${schema}.yaml
done
# Phase 3: Validate staging
psql -c "SELECT COUNT(*) FROM tap_schema_staging.schemas WHERE schema_name IN ('dp02_dc2', 'apdb');"
# (more validation tests)
# Phase 4: Atomic three-way swap
psql <<EOF
BEGIN;
ALTER SCHEMA tap_schema RENAME TO tap_schema_temp;
ALTER SCHEMA tap_schema_staging RENAME TO tap_schema;
ALTER SCHEMA tap_schema_temp RENAME TO tap_schema_staging;
COMMIT;
EOF
Essentially all schemas in SCHEMAS_TO_LOAD are updated within a single
BEGIN…COMMIT block.
Only the schema rename operations are in a transaction (milliseconds) and data loading happens outside transaction in staging schema, so there would be no blocking of TAP service queries during data load.
The above describes the blue-green approach (Option C in Section 4.1).
If we eventually instead choose to do a full replacement the above would essentially
be simpler as we would not need to create the staging schema and could just
delete the existing schema data directly from tap_schema before inserting the new data.
4.9 Felis Docker Image#
For the Helm hook job, we would need a container image with Felis installed. This should be created via Github actions and pushed to GHCR.
4.10 Managing Datalink templates#
Currently, the datalink template files (datalink-snippets) are packaged into a tarball and then the TAP service fetches them at startup from github. With the new architecture, the current plan is that these templates would be pushed to GCS as part of the release. Repertoire through some process to be determined would know where these are located based on the schema version and store the URLS to them along with the URL to the schema files.
The TAP service could then request the link to the datalink template files from Repertoire and fetch them at startup, instead of storing a link to the datalink payload URL as it does now.
There is potentially some room for improvement here in terms of how we handle the datalinks in TAP through this template mechanism, as ideally we want better separation between the schema definitions and any other products like the datalink template files.
However this is out of scope for this document, and may be something to consider in the future and outlined in it’s own technote.
5. Migration Plan#
Phase 1: Infrastructure Setup#
Objective: Update CloudSQL instance and configure access
TAP_SCHEMA will be stored as a Postgres schema in the existing CloudSQL tap database alongside the UWS schema.
This approach re-uses existing infrastructure, requires only a single CloudSQL proxy sidecar and keeps maintenance simple.
TAP_SCHEMA’s small size and infrequent updates mean it won’t impact UWS performance.
If future requirements show the need for complete separation, the tap_schema schema can be migrated to a separate database.
Tasks:
- Ensure existing: - Database tap and schema for - tap_schema
- Service accounts 
- Workload Identities 
 
- Verify service account has CREATE/DROP SCHEMA permissions on - tapdatabase
- No new CloudSQL instance needed - reusing existing 
Deliverables:
- Terraform configuration (Not obvious if anything needs to be changed) 
- Service accounts configured (May re-use existing) 
Phase 2: TAP_SCHEMA Update Mechanism#
Objective: Implement schema loading and update logic
Tasks:
- Verify Felis capabilities and commands 
- Create - update-tap-schema.shscript
- Implement schema distribution method (Need to choose which option we want to implement first) 
- Test schema loading in dev environment 
- Implement validation and error handling 
Deliverables:
- Working update script 
- Schema validation logic 
- Error handling and validate rollback procedures 
High-Level Flow of update-tap-schema.sh:#
Parse configuration - Read SCHEMA_VERSION, SCHEMAS_TO_LOAD
Fetch schema files - Download from GCS
Initialize schemas (first time only):
    felis init-tap-schema --tap-schema-name=tap_schema
    felis init-tap-schema --tap-schema-name=tap_schema_staging
Validate all schemas - Run felis validate on each YAML file
Clear staging schema:
    DELETE all rows from tap_schema_staging tables
Load into staging:
    For each schema in SCHEMAS_TO_LOAD:
        felis load-tap-schema --tap-schema-name=tap_schema_staging schema.yaml
Validate staging:
    Verify all expected schemas exist in tap_schema_staging
    Verify each schema has tables
    Check foreign key integrity
Atomic swap:
    BEGIN TRANSACTION
        ALTER SCHEMA tap_schema RENAME TO tap_schema_temp
        ALTER SCHEMA tap_schema_staging RENAME TO tap_schema
        ALTER SCHEMA tap_schema_temp RENAME TO tap_schema_staging
    COMMIT TRANSACTION
Report results
Exit - Return 0 if succeeded, 1 if failed
Note: If we choose to delete the entire TAP_SCHEMA and recreate it from scratch, the script becomes simpler.
Phase 3: Phalanx Configuration Updates#
Objective: Update Helm charts and configuration
Tasks:
- Update TAP application values files 
- Create Helm hook Job template 
- Create ConfigMap for update script 
- Update deployment templates for CloudSQL connectivity 
- Make tap-schema-db deployment conditional/optional 
- Test in on dev / int environments 
Deliverables:
- Updated Helm charts 
- Environment-specific values files 
- Working deployment in idfdev 
Phase 4: sdm_schemas CI/CD Updates#
Objective: Automate schema packaging and release
Tasks:
- Update GitHub Actions workflow 
- Install Felis in CI 
- Validate all schemas in CI 
- Create release assets: - schemas.tar.gzOR
- Pre-baked container image 
 
- Deprecate old Docker image build process 
- Update documentation 
Deliverables:
- Updated CI/CD pipeline 
- Automated schema releases 
- Deprecated old build process 
(Optional) Automating schema version updates in Phalanx with GitHub workflows
If we want to provide further automation, we could set up GitHub Actions in the sdm_schemas repository to automatically create PRs in Phalanx when a new release is tagged.
The workflow would detect new release tags in sdm_schemas, checkout the Phalanx repository, update the schemaVersion in appropriate values files and create a pull request for review.
This would further reduce manual steps but would add complexity to the GitHub workflows. Manually creating a PR in Phalanx seems straightforward enough that probably makes this not worth the effort for MVP.
Phase 5: Production Migration#
Tasks:
- Deploy to idfdev, then idfint 
- Monitor for a week or two 
- Deploy to production 
6. Operations Runbook#
6.1 Standard Schema Update Workflow#
1. Make change in sdm_schemas repo
Example: change sdm_schemas/yml/dp02_dc2.yaml
2. Validate locally
felis validate --check-description yml/dp02_dc2.yaml
3. Create PR and release in sdm_schemas
Example: tag v1.2.3
4. Update Phalanx
In phalanx, change applications/tap/values-idfdev.yaml
Change: schemaVersion: "v1.2.3" then commit/push/PR
5. ArgoCD syncs and runs update job automatically
6. Job downloads v1.2.3 and loads all schemas
7. TAP service automatically uses new schemas
6.2 Selective Schema Update#
Update only specific schemas without changing version:
# In values file, change which schemas are loaded
tap:
  schemaVersion: "v1.2.4"
  schemas:
    - dp02_dc2
    - apdb
    - new_schema  # Add a new schema
# Commit and push - ArgoCD syncs and loads the new schema
6.3 Rollback Procedure#
If issues occur after a schema update, we can rollback using GitOps via a rollback or a git revert + sync. By syncing to a previous version, the Helm hook job will run again and reload the previous schema version. The previous schema version’s GitHub release (or docker image depending on which approach we go with) must still exist. If for whatever reason the previous version is not available the option is also there to manually restore schema from CloudSQL backup.
In the case of blue-green deployment we also have the option of swapping the schema names in the database, since we can keep the previous version after a new deploy around. However this would be a manual process (or at best some script we can run manually) and not part of the automated Helm hook job, unless we have some sort of flag to indicate a rollback.
GitOps Rollback:
- Revert - schemaVersionin Phalanx
- ArgoCD sync reruns Helm hook 
- Job reloads old version from GCS into staging 
- Swap completes 
Instant Rollback using staging schema(Primary Method):
Since we maintain both schemas permanently, rollback could also be done via:
psql <<EOF
BEGIN;
ALTER SCHEMA tap_schema RENAME TO tap_schema_temp;
ALTER SCHEMA tap_schema_staging RENAME TO tap_schema;
ALTER SCHEMA tap_schema_temp RENAME TO tap_schema_staging;
COMMIT;
EOF
The current plan is to use the GitOps rollback as the primary method, and we can always revisit and add the instant swap method later if needed.
7. Security Considerations#
7.1 IAM Authentication#
We will use IAM authentication for CloudSQL access, following the pattern used by UWS so no password management is required.
7.2 Database Roles#
The current design grants the felis-updater job full access to the tap_schema database. We should probably consider two roles here, a reader role used by the TAP service and a writer role used by the felis-updater job with full CRUD permissions on the tap_schema tables.
9. Open Questions#
Transaction Size#
Our current design wraps all schemas in a single transaction. Are we concerned about transaction size and potential timeouts? TAP service queries against TAP_SCHEMA might be blocked during the entire update? Note: This question is only relevant if we choose the single transaction approach instead of blue-green.
Schema Distribution#
How should the job updater get the schema files? Download from GitHub? Bake into a container image? Mount from ConfigMap? I’m thinking start with Option A (GitHub download) for MVP or Option B (baked container) if we want to avoid runtime dependency on GitHub.
CloudSQL Proxy Configuration#
Is it ok for UWS and TAP_SCHEMA to share a single CloudSQL proxy sidecar? Or do we want separate proxies? Current design uses single proxy on port 5432, with database selection via JDBC URL database parameter. Need to verify this works correctly.
Update Strategy#
How should we handle the update strategy? Full replacement vs incremental updates? I’d probably aim for full replacement for MVP, implement incremental updates as future enhancement
Version History#
Do we want to maintain schema version history in the database or record the schema version in the database somehow or does that add unnecessary complexity?
10. Documentation Updates Required#
- Update TAP service README with new architecture 
- Document new schema update process 
- Add troubleshooting guide for common issues