ETL or ELT? The emergence of dbt


Motivation

What's behind the idea of dbt? Why is it gaining popularity in the world of data engineering and data analytics? Let's talk about the problems we are facing at the moment as data engineers:

  • The amount of analytics SQL we need to write is large and keeps changing because business requirements keep changing, so it is easy to lose track of it.

  • There is a lot of test code to write, and the checks need to be automated as well.

  • Documentation does not live with the code, which makes it harder to consult and keep up to date.

dbt is designed to solve these problems and let data analysts (or, to use the fancy new job title, analytics engineers) work like software engineers by adopting their workflows.

Thoughts on ELT and ETL

ETL is the data engineering paradigm that:

  • extracts the data from various sources (databases, APIs)

  • transforms it outside the warehouse with a distributed computing framework such as Hadoop or Spark

  • loads it into the data warehouse

While the ETL paradigm goes all the way back to SSIS in MS SQL Server, the concept of ELT is relatively new but very popular.

SSIS was released in 2005, although the concept of ETL goes back much further, and Apache Hadoop was released in 2006.

The reason we used the ETL paradigm is that databases back in the day had a single-node architecture, so they did not have enough computing power to do the transformation inside the database. Therefore, we extracted the data from the database, transformed it with an external computing framework, and finally loaded it into the database (data warehouse).

Now, a destination data warehouse like BigQuery is both distributed storage and a monster computing engine.

This means we can do the computation inside the destination data warehouse, which is closer to the end consumers and the BI team. dbt fits into the ELT picture, and it provides other value as well, such as software engineering best practices.

One advantage of the ELT paradigm is that the business transformation logic lives closer to the BI team than in its ETL counterpart.

To summarize

|  | ETL | ELT |
| --- | --- | --- |
| Description | A data integration process that extracts data from multiple sources, transforms it into a structured format, and loads it into a target system. | Extracts data from multiple sources, loads it into a target system, and then transforms it into a structured format. |
| History | The transformation was done outside of the database because of the limits of single-node computing inside the database. | Thanks to modern data warehouses (BigQuery, Redshift, Snowflake), it scales very well. |
| Application | Data is transformed outside the database. This approach can be useful when dealing with large amounts of data or when transforming data into complex, structured formats. | Data is loaded into the target system before being transformed. This approach can be useful for simple transformations or when the target system is optimized for data processing. |

Now that we understand the key differences between ETL and ELT and their history, we can dive into the world of dbt!

What's dbt

dbt is the hot new tool for the "T" (transformation) in ETL or ELT.

It brings software engineering best practices into analytics and saves you a lot of boilerplate and ad-hoc code.

The SQL tests you write for your company are great, but the truth is that many companies need the same tests. Someone out there has almost certainly already written and implemented them for you.

It is the new way of doing SQL.

Setup

dbt Labs officially supports the following databases:

  • BigQuery

  • Postgres

  • Redshift

  • Snowflake

However, community developers build and maintain adapters for other databases, such as dbt-mysql.

To be honest, the officially maintained adapters are more up to date and have fewer issues. I suggest using dbt with an officially supported database platform to minimize the trouble you will run into.

dbt on macOS

Just set it up like any other Python package: create a virtual environment and install it.

# create .venv for your Python virtual env
python -m venv .venv

# activate it; the exact command depends on your shell (bash/zsh vs. fish)
source .venv/bin/activate

# your prompt should now show the environment name
(.venv)

# install the Postgres adapter (it pulls in dbt-core as a dependency)
pip install dbt-postgres

Installing dbt-postgres will install dbt-core for you, which is the core component of dbt.

Now, you can check

dbt --version

It should give you

Core:
  - installed: 1.5.1
  - latest:    1.5.1 - Up to date!

Plugins:
  - postgres: 1.5.1 - Up to date!

VS Code setup for dbt

To learn dbt, you just need to know the following (a basic grasp is fine):

  • SQL (this is the one you need to be really good at)

  • Jinja templates (the templating engine for Python that dbt uses)

  • YAML

To work with dbt more comfortably in VS Code, some extensions are useful.

CLI components

Let's summarize the most important dbt commands:

| Command | Description |
| --- | --- |
| dbt init | Initializes a new dbt project |
| dbt source | Configures the sources the data comes from; used for tables that have just been loaded into the warehouse |
| dbt seed | Loads raw data into source tables; used to define external tables or relatively static (quasi-static) data |
| dbt test | Runs tests on the data to ensure it is clean and consistent; built-in or third-party packages test your SQL with checks such as null checks, accepted-value checks, and even audit packages |
| dbt snapshot | Creates incremental models from the source data, implementing only SCD type 2 |
| dbt docs | Generates documentation for the models |

dbt init

# create a directory called "project-name" in your working directory
dbt init <project-name>

cd <project-name>

Then you will find something like this:

.
├── README.md
├── analyses
├── dbt_packages
├── dbt_project.yml
├── logs
├── macros
├── models
├── seeds
├── snapshots
├── target
└── tests

10 directories, 2 files

The functionality of each directory is summarized here

| Directory | Description |
| --- | --- |
| analyses | Analytical SQL queries that combine models |
| dbt_packages | Packages installed with dbt, such as dbt-utils |
| dbt_project.yml | Configuration file for the dbt project (IMPORTANT) |
| logs | Logs from dbt runs; you usually don't care about these |
| macros | Reusable SQL code snippets, useful for customized settings |
| models | SQL code that defines data models (IMPORTANT!) |
| seeds | Raw data that is loaded into the source tables |
| snapshots | Incremental models created from the source data |
| target | Compiled SQL code and other artifacts |
| tests | SQL code that tests models for correctness |
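
To make the models/ directory more concrete, here is a minimal sketch of what a model file could look like (the file name and columns are hypothetical; ref() is dbt's way of referencing another model and building the dependency graph):

-- models/dim/dim_listings.sql
-- a dbt model is just a SELECT statement; dbt materializes it for you
select
    id as listing_id,
    price as listing_price
from {{ ref('src_listings') }}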

.dbt: where the secrets are stored

After you set up your dbt project, you need to configure a database connection:

cd ~

cd .dbt

You should see a profiles.yml file that contains all your database connections for your current project:

project_name:
  outputs:
    dev:
      type: postgres
      threads: 1
      host: [host]
      port: 5432
      user: [dev_username]
      pass: [dev_password]
      dbname: postgres
      schema: raw

    prod:
      type: postgres
      threads: [1 or more]
      host: [host]
      port: [port]
      user: [prod_username]
      pass: [prod_password]
      dbname: [dbname]
      schema: [prod_schema]

  target: dev

You can put your credentials here.
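
To verify that the connection details in profiles.yml actually work, you can run dbt's built-in connection check from inside the project directory:

# checks dbt_project.yml, profiles.yml, and the database connection
dbt debug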

The dbt_project.yml file

Let's go back to your dbt project and check out the configuration:

# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'project_name'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'dbt_ads'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the example/
# directory as views. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
  dbt_ads:
    # all files under the dbt_ads folder will be materialized as views
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

It is a configuration file that lists everything up front. The most crucial part is the last bit:

models:
  dbt_ads:
    # all files under the dbt_ads folder will be materialized as views
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

It touches upon the concept of materialization in databases.

Concepts of materialization

In the SQL world, materialization refers to the process of creating a physical table or view in a database from the results of a SQL query. Materialization is used to improve query performance by storing the results of a query in a table or view, which can then be accessed more quickly than re-executing the query every time it is needed.

Note

The motivation behind materialization is the trade-off between time and space (reminds you of LeetCode again, huh?).

Materialization can be done in a variety of ways, depending on the database system and the type of query being executed. Common techniques include:

  • Creating a temporary table to store the query results

  • Creating a materialized view, which is a precomputed table that is automatically updated when the underlying data changes

  • Creating a permanent table to store the query results

Materialization can be an effective way to improve query performance, especially for complex queries that involve large amounts of data. However, it can also be resource-intensive, especially for queries that are executed frequently or involve large amounts of data. As with any performance optimization, it is important to carefully evaluate the trade-offs and test the impact of materialization on query performance and resource usage.
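
As a plain-SQL illustration (Postgres-style syntax; the orders table and its columns are hypothetical), the same query can be materialized in different ways:

-- a view: re-executed on every read, cheap to store but potentially slow to query
create view daily_revenue_v as
select order_date, sum(amount) as revenue
from orders
group by order_date;

-- a materialized view: precomputed and stored, fast to query but takes space
-- and must be refreshed when the underlying data changes
create materialized view daily_revenue_mv as
select order_date, sum(amount) as revenue
from orders
group by order_date;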

Materialization in dbt

In dbt, it is similar and defined as

|  | View | Table | Incremental Tables | Ephemeral Tables |
| --- | --- | --- | --- | --- |
| Use it | you want a lightweight representation | you read from this model repeatedly | fact tables; appends to tables | you merely want an alias for your data |
| Don't use it | you read from the same model several times | building single-use models; your model is populated incrementally | you want to update historical records | you read from the same model several times |

Refresh the concept of materialization

Then, if you go back to your yaml file

# file structure for models/
.
├── dim
│   ├── dim_hosts_cleansed.sql
│   └── dim_listings_cleansed.sql
└── src
    ├── src_hosts.sql
    ├── src_listings.sql
    └── src_reviews.sql

# yaml file

name: 'dbtlearn'
version: '1.0.0'
config-version: 2

models:
  project:
    +materialized: view 
    dim:
      +materialized: table

The precedence follows the rule:

The more indentation a setting has, the later it is applied, so deeper settings override shallower ones.

For example, in the file above,

models:
 # all files under this dbt "project" folder will be view
  project:
    +materialized: view 
    # all files under project/dim will be table 
    dim:
      +materialized: table # sets the default materialization inside dim/ to table

If we word it properly, the materialization in this dbt project becomes:

All SQL files in project/ will be generated as views, except for files in ./project/dim, which will be generated as tables.
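
These folder-level defaults can also be overridden per model with the {{ config(...) }} macro mentioned in the generated comments. A sketch (the model name is taken from the tree above):

-- models/dim/dim_hosts_cleansed.sql
-- overrides the dim/ folder default (table) just for this model
{{ config(materialized='view') }}

select *
from {{ ref('src_hosts') }}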

dbt source

A source in a dbt project usually refers to data that has just been ingested into the data warehouse.

By default, the source definitions live in the models folder:

models
├── dim
├── fct
├── sources.yml
└── src
    ├── src_authors.sql
    ├── src_campaigns.sql
    ├── src_commerce_users.sql
    └── src_hashtags.sql

4 directories, 5 files

You can configure your sources.yml file to:

  • define Jinja-templated references to your raw database tables

  • define freshness checks

An example sources.yml file looks like this:

version: 2

sources:
  - name: airbnb
    schema: raw
    tables:
      - name: listings
        identifier: raw_listings

      - name: hosts
        identifier: raw_hosts

      - name: reviews
        identifier: raw_reviews
        loaded_at_field: date
        freshness:
          # if no data inserted to reviews table within 1 hour, give warning
          # if no data inserted to reviews table within 24 hours, give error
          warn_after: {count: 1, period: hour} 
          error_after: {count: 24, period: hour}

You can name the file whatever you want when defining sources; as long as it contains a sources: key, you are good to go.

You can check the freshness by

dbt source freshness
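
Once a source is defined, you reference it in your models with the source() Jinja function instead of hard-coding the schema and table name. A sketch (column names are hypothetical):

-- models/src/src_listings.sql
-- source('airbnb', 'listings') compiles to raw.raw_listings based on sources.yml
select
    id,
    price
from {{ source('airbnb', 'listings') }}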

dbt seeds

Seeds in dbt are for static tables (tables that are never going to change), like the countries table illustrated here:

| country_code | country_name |
| --- | --- |
| CA | Canada |
| CN | China |

To use a seed, just put a CSV file in project/seeds/ and run:

dbt seed

Note:

dbt seed is not designed to handle large datasets. It is designed to load smaller, static datasets.
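
A seed is just a CSV file; once dbt seed has loaded it, you can reference it with ref() like any other model. A sketch, assuming a hypothetical dim_users model with a country_code column:

-- models/dim/dim_users_with_country.sql
-- the seed file seeds/countries.csv becomes a table you can ref() like a model
select
    u.user_id,
    c.country_name
from {{ ref('dim_users') }} u
join {{ ref('countries') }} c
  on u.country_code = c.country_code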

dbt docs

dbt lets you build your documentation with the following commands:

# build the doc
dbt docs generate

# serve it at the default port (8080)
dbt docs serve

# run it at a specified port
dbt docs serve --port 8082

dbt snapshot

The purpose of dbt snapshot is to handle type 2 slowly changing dimensions (SCD).

In this section:

  • understand SCD type 2

  • understand how dbt handles type-2 slowly changing dimensions

  • understand snapshot strategies

  • learn how to create snapshots on top of our listings and hosts models

What's SCD type 2?

Let's consider a table with names and email addresses.

Slowly changing dimensions are about how to handle historical data. Type 2 keeps all the historical data, and the way to track it is with two extra columns: valid_from and valid_to.

  • valid_from: the date from which the record is valid

  • valid_to: the date when the record stops being valid. If it is null, the record is still valid.

| id | name | email | dbt_valid_from | dbt_valid_to |
| --- | --- | --- | --- | --- |
| 1 | Adam | adam.airbnb@gmail.com | 2022-01-01 00:00:00 | null |
| 2 | Bob | bob.airbnb@gmail.com | 2022-01-01 00:00:00 | 2022-03-01 12:53:20 |
| 3 | Bob | bobs.new.address@gmail.com | 2022-03-01 12:53:20 | null |

Instead of writing the code to generate those columns yourself, dbt handles it for you behind the scenes.

Defining a snapshot

Snapshots live in the snapshots folder. A snapshot is defined in your dbt project's snapshots/ folder using SQL and Jinja templates.

-- start of jinja template
{% snapshot scd_authors %}

-- set up config for this snapshot
{{
   config(
       target_schema='snapshots',
       unique_key='id',
       strategy='timestamp',
       updated_at='ingested_date',
   )
}}

-- by default, this is the convention
select * from {{ source('postgres', "tiktok_authors") }}

{% endsnapshot %}

Snapshot configuration

For the comprehensive list of configuration options, please refer to the official docs.

The most important setting is the strategy. It has two options:

  • timestamp: a unique key and an updated_at field are defined on the source model. These columns are used to determine changes.

    • It relies heavily on an ingestion or update timestamp column (ingested_date in the example above).
  • check: any change in a set of columns (or all columns) is picked up as an update (see the sketch below).
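
For comparison, a check-strategy snapshot might look like the sketch below; instead of a timestamp, dbt compares the listed columns (check_cols) to detect changes. The column names are hypothetical:

{% snapshot scd_authors_check %}

{{
   config(
       target_schema='snapshots',
       unique_key='id',
       strategy='check',
       check_cols=['name', 'email'],
   )
}}

select * from {{ source('postgres', "tiktok_authors") }}

{% endsnapshot %}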

dbt test

Overview:

  • Understand how tests can be defined

  • Configure built-in generic tests

  • Create your own singular tests

Test overview in dbt

There are two types of tests in dbt: singular and generic.

flowchart TD
a(tests_in_dbt) --> b(singular) & c(generic)
c --> d(built-in) & e("custom")
d --> f("unique") & g("not_null") & h("accepted_values") & i("relationships")
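
A singular test is just a SQL file in the tests/ directory that selects the rows violating your assumption; the test passes when the query returns no rows. A sketch against the dim_listings_cleansed model from earlier (the price column is hypothetical):

-- tests/assert_positive_price.sql
-- fails if any listing has a non-positive price
select *
from {{ ref('dim_listings_cleansed') }}
where price <= 0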

Generic tests

As for generic tests, you can:

  • use the built-in ones

  • write your own generic tests or import tests from a dbt package

For built-in tests, you just define them in a YAML file; by convention, it is called schema.yml.
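
A minimal schema.yml applying the built-in generic tests could look like this (model names are taken from the earlier examples; the columns and accepted values are hypothetical):

version: 2

models:
  - name: dim_listings_cleansed
    columns:
      - name: listing_id
        tests:
          - unique
          - not_null
      - name: room_type
        tests:
          - accepted_values:
              values: ['Entire home/apt', 'Private room', 'Shared room']
      - name: host_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed')
              field: host_id

Running dbt test then compiles each of these into a SQL query and fails the run if any of them return rows.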

Summary

In this post, we covered the following topics:

  • why dbt?

  • setting up dbt

  • the basic dbt CLI commands
