CTA orchestrates partner ETL via Google Cloud Composer, a Kubernetes-backed Airflow setup.
The process is as follows:
- Step 1: Extract (CTA raw data)
CTA uses several platforms to extract data and load it into BigQuery (an example Composer DAG is sketched below, after this list of steps):
- Airbyte
- Google Cloud Composer Operators (native API calls against BigQuery and AWS APIs)
- Google Cloud BigQuery Data Transfer Service
Once data is loaded, the DAG moves on to transformations. This data is housed in a CTA-owned production Google project.
- Step 2: Transform 1 (CTA base tables)
CTA performs any initial transformations of raw JSON or text-file data. CTA base tables are again stored in CTA-owned production Google projects. All of our dbt modeling for transform 1 and transform 2 (see below) can be found at https://github.com/community-tech-alliance/dbt-cta
- Step 3: Transform 2 (CTA-run partner materializations)
- Once CTA base tables are in place, the DAG materializes datasets into partner projects. In this step, tables will appear in your Google Project. Generally, this is where our work stops for partners. Our standard procedure here is to materialize these as Materialized Views with a refresh that is triggered upon a successful base table update.
- Step 4: Downstream transformations (Partner materializations)
- Further downstream analysis and transformations are the responsibility of partners. There are various options depending on product use cases, cost, team specialization, and skill sets. We discuss them in more depth below.
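To make the extract-and-load work in Step 1 concrete, here is a minimal Cloud Composer (Airflow) DAG sketch using a native Google transfer operator. The bucket, project, dataset, and table names are illustrative placeholders, not CTA's actual configuration:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Illustrative only: bucket, project, dataset, and table names are placeholders.
with DAG(
    dag_id="example_partner_raw_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Load newline-delimited JSON files from Cloud Storage into a raw BigQuery table.
    load_raw_events = GCSToBigQueryOperator(
        task_id="load_raw_events",
        bucket="example-vendor-drop-bucket",
        source_objects=["events/{{ ds }}/*.json"],
        destination_project_dataset_table="example-cta-prod-project.raw_vendor.events",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )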
Partner Materializations
CTA makes a few different recommendations for partners who wish to run downstream materializations or automate data transformations, ordered by the level of effort from low to high:
- CTA runs your materializations - The out-of-the-box solution is for CTA to run your dbt materializations alongside our two other steps. To hook into our process, you’d simply check your changes into Git and then push them up. From there, we’d automate the process of syncing your changes and running them. This could be set up in various ways, but the basic flow could be:
- dbt is run locally on development machines for testing
- A PR is opened, and once it is merged, GitHub Actions deploys the changes for our DAG to pick up. From there, we run the DAG and give you access to a centralized logging interface to see the output of the runs.
Scheduled Queries - BigQuery offers out-of-the-box scheduling for SQL queries. You can schedule queries through the Google Cloud Console, the BigQuery CLI, or Google's SDKs. Scheduled queries use BigQuery's native parameterization, so you can include date strings and other values in your queries. Additionally, you can schedule queries as a service account instead of as yourself, so that queries run as a single user with access to all of the resources it needs. Scheduled queries are best for basic transformations that need little to no templating beyond ensuring correct incrementality. CTA can also assist in adding automation to your local development workflows so that users can write, commit, and see new scheduled queries launch; alternatively, you can simply use the Cloud Console UI.
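As a sketch of how a scheduled query can be created programmatically (rather than through the Cloud Console UI), the example below uses the BigQuery Data Transfer Service Python client. The project, dataset, query, and service account names are placeholders:

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder project for illustration.
parent = client.common_project_path("example-partner-project")

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="analytics",
    display_name="daily_event_rollup",
    data_source_id="scheduled_query",
    params={
        # @run_date is BigQuery's native scheduled-query parameter.
        "query": """
            SELECT @run_date AS run_date, COUNT(*) AS event_count
            FROM `example-partner-project.vendor_dataset.events`
            WHERE DATE(event_timestamp) = @run_date
        """,
        "destination_table_name_template": "daily_event_rollup_{run_date}",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

created = client.create_transfer_config(
    bigquery_datatransfer.CreateTransferConfigRequest(
        parent=parent,
        transfer_config=transfer_config,
        # Run as a service account so the query does not depend on one person's access.
        service_account_name="scheduled-queries@example-partner-project.iam.gserviceaccount.com",
    )
)
print(f"Created scheduled query: {created.name}")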
Google Dataform - Dataform is a new tool from Google that directly competes with dbt. It is built into BigQuery’s Cloud Console UI and allows for full SQL workflow support, with added support for custom JavaScript functions for those who want a more expressive language in which to write workflows. It additionally supports version control, allows scheduling (and trigger-based executions), and is free. You only pay for the analysis and storage cost associated with the BigQuery usage and any costs associated with more advanced scheduling functionality (see Google Workflows below).
dbt Cloud - If you’re looking for a fully hosted analytics workflow tool, dbt Cloud could be the answer. It’s hosted outside of Google Cloud, but its cost structure is competitive and it integrates fully with Google BigQuery. dbt Cloud allows you to contribute models in SQL and Python and has a testing interface and version control. It runs up-to-date versions of dbt, so the behavior of code executed in dbt Cloud should be the same as the dbt we execute in Cloud Composer.
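For teams that want to kick off dbt Cloud jobs from their own automation, dbt Cloud also exposes a REST API. The sketch below assumes a hypothetical account ID, job ID, and API token, and uses the default multi-tenant host (your host may differ):

import os

import requests

# Hypothetical IDs; find yours in the dbt Cloud UI. The API token should come from
# a service token stored outside source control.
ACCOUNT_ID = 12345
JOB_ID = 67890
API_TOKEN = os.environ["DBT_CLOUD_API_TOKEN"]

response = requests.post(
    f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
    headers={"Authorization": f"Token {API_TOKEN}"},
    json={"cause": "Triggered from partner automation"},
    timeout=30,
)
response.raise_for_status()
print("Started run:", response.json()["data"]["id"])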
Cloud Build - While we recommend Google Workflows for orchestrating Google APIs (e.g., BigQuery, Cloud Run containers, Cloud Storage), dbt has some unique requirements. Because CTA is responsible for ETL up through step 3 above, your data will already be up to date when it lands in your project, so downstream models do not need to be rebuilt on a schedule. We therefore recommend that any further materializations take the form of views or materialized views: both have set, DDL-based definitions that only change when the underlying dbt code changes, so a Cloud Build job only needs to redeploy them when your dbt repository is updated.
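For example, a downstream materialized view can be defined once with DDL and then left alone; BigQuery keeps it fresh from the base tables CTA maintains. The project, dataset, and column names below are placeholders, and this is only a sketch of the view-based approach:

from google.cloud import bigquery

# Placeholder project, dataset, and table names.
client = bigquery.Client(project="example-partner-project")

ddl = """
CREATE MATERIALIZED VIEW IF NOT EXISTS
  `example-partner-project.analytics.daily_signups` AS
SELECT
  DATE(created_at) AS signup_date,
  COUNT(*) AS signups
FROM `example-partner-project.vendor_dataset.users`
GROUP BY signup_date
"""

# Runs the DDL once; the materialized view then refreshes from the base table
# without this code needing to run again (unless the definition changes).
client.query(ddl).result()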
Google Workflows - Google Workflows enables users to orchestrate processes using a simple YAML file. It’s a serverless solution that you can deploy from the Google Cloud Console, CLI, or API.
It is particularly useful when you are performing tasks that use Google APIs and Google Cloud services. Rather than needing to write and deploy custom code to handle different APIs, Google Workflows makes it simple to configure tasks and set up dependencies between them.
We have an example of two popular use cases for Google Workflows on our Medium: Loading Data to BigQuery using Google Workflows, which covers loading data from S3 to BigQuery and ingesting files stored in Drive. Note: this is different from ingesting a Google Sheet, which is a native feature of BigQuery.
If the Cloud Build-based flow above doesn’t meet your dbt needs (that is, table builds need to be triggered by an event or schedule), then running dbt commands in Workflows is possible using Cloud Run or Google Batch Jobs. The basic usage looks something like this: https://github.com/community-tech-alliance/workflows-cta/blob/main/workflows/example-batch-api-dbt.workflow.yaml
main:
  params: [args]
  steps:
    - init:
        assign:
          - repositoryAndImageName: 'dbt-labs/dbt-bigquery:latest'
    - runDbtInit:
        call: createAndRunContainerJobs
        args:
          repositoryAndImageName: ${repositoryAndImageName}
          script: dbt init
    - runDbtRun:
        call: createAndRunContainerJobs
        args:
          repositoryAndImageName: ${repositoryAndImageName}
          script: dbt run --full-refresh