Lineage modeling

Lineage modeling for complex SQL structures.

Identifying SQL dependencies from an abundance of nested views and tables.

CREATE TABLE first_table AS (
    SELECT a FROM my_table
    WHERE 1=1
);
CREATE TABLE second_table AS (
    SELECT a as b FROM first_table
);
Related: https://getdatascale.com/datalineage

How it works

Data lineage is the process of tracking the flow of data from its origin to its destination. It provides a comprehensive view of your data's lifecycle, helping you understand:

  1. Where your data comes from (sources)

  2. How it's transformed and processed

  3. Where it's stored (destination tables)

  4. How it's used (metrics, BI dashboards, app, etc.)

1. SQL analysis

The first step in mapping data lineage is to analyze SQL code and identify:

  • Input tables (dependencies)

  • Output tables and columns

  • Transformations applied (column-level lineage)

Here's a simple example

CREATE VIEW customer_summary AS
SELECT 
    c.customer_id,
    c.name,
    COUNT(o.order_id) as total_orders,
    SUM(o.order_total) as lifetime_value
FROM 
    customers c
LEFT JOIN 
    orders o ON c.customer_id = o.customer_id
GROUP BY 
    c.customer_id, c.name;

In this query, we can identify:

  • Source tables: customers, orders

  • Target view: customer_summary

  • Transformations: COUNT, SUM, GROUP BY (optional)

2. Dependency mapping

Table-level lineage

| customers | --> |                  |
                  | customer_summary |
   | orders | --> |                  |

Table-level may seems basic, but it can give clear pictures when it comes to complex data models (1,000s lines of codes and 100s of SQL scripts). It also provides a simplified version of the DAG without complicating the diagram.

Column-level lineage

Column-level lineage is more advanced; it helps track how individual columns are used and transformed.

SourceTarget

customers.customer_id

-> customer_summary.customer_id

customers.name

-> customer_summary.name

COUNT(orders.order_id)

-> customer_summary.total_orders

SUM(orders.order_total)

-> customer_summary.lifetime_value

3. Lineage modeling

The above example showed how the dependencies are extracted from a single SQL file.

Connecting the dots

In actual scenarios, we process all of your SQL scripts all at once.

This approach is designed to work out of the box, meaning you don't have to change your existing pipeline or modify your code to reference tables like {{ ref(..) }}.

The system collects all SQL scripts from your project,

  • parses each one to identify tables, views, and their relationships,

  • and then combines this information to create a lineage.


Our goal is to help you save time, reduce errors, and allow you to focus on writing SQL logic while we help maintain the visual representation and documentation.

It offers best value in large projects where manually tracking all dependencies would be impractical.

Fun fact: learn how we display a lineage Auto rearrange


Examples

These are examples from our lineage playground. In real application, there's a data catalog system where you can navigate the tables and view lineage easily.

Example 1 - Financial

Code
-- Raw transaction data from various systems
CREATE TABLE raw_transactions (
    transaction_id BIGINT PRIMARY KEY,
    transaction_date DATE,
    posting_date DATE,
    account_id INT,
    counterparty_id INT,
    amount DECIMAL(15,2),
    currency VARCHAR(3),
    transaction_type VARCHAR(50),
    department_id INT,
    project_id INT,
    raw_transaction_data JSON
);

-- First level: Cleaned and standardized transactions
CREATE TABLE clean_transactions AS
SELECT
    transaction_id,
    transaction_date,
    posting_date,
    account_id,
    counterparty_id,
    amount,
    currency,
    transaction_type,
    department_id,
    project_id,
    raw_transaction_data->>'description' AS transaction_description,
    raw_transaction_data->>'reference_number' AS reference_number
FROM raw_transactions;

-- Second level: Currency conversion to company's base currency (assuming USD)
CREATE TABLE normalized_transactions AS
SELECT
    t.*,
    CASE 
        WHEN t.currency = 'USD' THEN t.amount
        ELSE t.amount * e.exchange_rate
    END AS amount_usd
FROM clean_transactions t
LEFT JOIN exchange_rates e ON t.currency = e.currency AND t.transaction_date = e.date;

-- Third level: Daily account balances
CREATE TABLE daily_account_balances AS
SELECT
    posting_date,
    account_id,
    SUM(amount_usd) AS daily_net_change,
    SUM(SUM(amount_usd)) OVER (PARTITION BY account_id ORDER BY posting_date) AS running_balance
FROM normalized_transactions
GROUP BY posting_date, account_id;

-- Fourth level: Monthly financial statements
CREATE TABLE monthly_financial_statements AS
SELECT
    DATE_TRUNC('month', posting_date) AS month,
    a.account_type,
    SUM(CASE WHEN a.account_type IN ('Asset', 'Expense') THEN daily_net_change ELSE -daily_net_change END) AS net_change,
    SUM(CASE WHEN a.account_type IN ('Asset', 'Expense') THEN running_balance ELSE -running_balance END) AS end_balance
FROM daily_account_balances b
JOIN accounts a ON b.account_id = a.account_id
GROUP BY DATE_TRUNC('month', posting_date), a.account_type;

-- Fifth level: Quarterly financial ratios
CREATE TABLE quarterly_financial_ratios AS
WITH quarterly_financials AS (
    SELECT
        DATE_TRUNC('quarter', month) AS quarter,
        account_type,
        SUM(net_change) AS quarterly_change,
        SUM(end_balance) AS quarter_end_balance
    FROM monthly_financial_statements
    GROUP BY DATE_TRUNC('quarter', month), account_type
)
SELECT
    quarter,
    SUM(CASE WHEN account_type = 'Asset' THEN quarter_end_balance ELSE 0 END) AS total_assets,
    SUM(CASE WHEN account_type = 'Liability' THEN quarter_end_balance ELSE 0 END) AS total_liabilities,
    SUM(CASE WHEN account_type = 'Equity' THEN quarter_end_balance ELSE 0 END) AS total_equity,
    SUM(CASE WHEN account_type = 'Revenue' THEN quarterly_change ELSE 0 END) AS total_revenue,
    SUM(CASE WHEN account_type = 'Expense' THEN quarterly_change ELSE 0 END) AS total_expenses,
    SUM(CASE WHEN account_type = 'Revenue' THEN quarterly_change ELSE 0 END) - 
    SUM(CASE WHEN account_type = 'Expense' THEN quarterly_change ELSE 0 END) AS net_income,
    (SUM(CASE WHEN account_type = 'Revenue' THEN quarterly_change ELSE 0 END) - 
     SUM(CASE WHEN account_type = 'Expense' THEN quarterly_change ELSE 0 END)) /
    NULLIF(SUM(CASE WHEN account_type = 'Revenue' THEN quarterly_change ELSE 0 END), 0) AS profit_margin,
    SUM(CASE WHEN account_type = 'Asset' THEN quarter_end_balance ELSE 0 END) /
    NULLIF(SUM(CASE WHEN account_type = 'Liability' THEN quarter_end_balance ELSE 0 END), 0) AS current_ratio
FROM quarterly_financials
GROUP BY quarter;

Example 2 - Streaming service

Code
-- User information
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(50),
    email VARCHAR(100),
    created_at TIMESTAMP,
    last_login TIMESTAMP,
    is_active BOOLEAN
);

-- User profile details
CREATE TABLE user_profiles (
    profile_id INT PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    date_of_birth DATE,
    country VARCHAR(50),
    preferred_language VARCHAR(20)
);

-- Content catalog
CREATE TABLE content (
    content_id INT PRIMARY KEY,
    title VARCHAR(100),
    description TEXT,
    genre VARCHAR(50),
    release_date DATE,
    duration_minutes INT,
    rating DECIMAL(3,1)
);

-- User viewing history
CREATE TABLE viewing_history (
    view_id INT PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    content_id INT REFERENCES content(content_id),
    view_date TIMESTAMP,
    duration_watched INT,
    completed BOOLEAN
);

-- User subscriptions
CREATE TABLE subscriptions (
    subscription_id INT PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    plan_name VARCHAR(50),
    start_date DATE,
    end_date DATE,
    is_active BOOLEAN
);

-- User reviews and ratings
CREATE TABLE reviews (
    review_id INT PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    content_id INT REFERENCES content(content_id),
    rating INT CHECK (rating BETWEEN 1 AND 5),
    review_text TEXT,
    created_at TIMESTAMP
);

-- View of active users
CREATE VIEW active_users AS
SELECT user_id, username, email, last_login
FROM users
WHERE is_active = TRUE;

-- Materialized view of user engagement metrics
CREATE MATERIALIZED VIEW user_engagement_metrics AS
SELECT 
    u.user_id,
    u.username,
    COUNT(DISTINCT vh.content_id) AS unique_content_watched,
    SUM(vh.duration_watched) AS total_watch_time,
    AVG(vh.duration_watched) AS avg_watch_time,
    COUNT(CASE WHEN vh.completed THEN 1 END) AS completed_views
FROM 
    users u
LEFT JOIN 
    viewing_history vh ON u.user_id = vh.user_id
GROUP BY 
    u.user_id, u.username;

-- Staging table for content performance
CREATE TABLE stg_content_performance AS
SELECT 
    c.content_id,
    c.title,
    c.genre,
    COUNT(DISTINCT vh.user_id) AS unique_viewers,
    SUM(vh.duration_watched) AS total_watch_time,
    AVG(r.rating) AS avg_rating
FROM 
    content c
LEFT JOIN 
    viewing_history vh ON c.content_id = vh.content_id
LEFT JOIN 
    reviews r ON c.content_id = r.content_id
GROUP BY 
    c.content_id, c.title, c.genre;

-- Mart table for genre performance
CREATE TABLE mart_genre_performance AS
SELECT 
    genre,
    COUNT(DISTINCT content_id) AS content_count,
    SUM(unique_viewers) AS total_unique_viewers,
    SUM(total_watch_time) AS total_watch_time,
    AVG(avg_rating) AS avg_genre_rating
FROM 
    stg_content_performance
GROUP BY 
    genre;

-- Mart table for user segments
CREATE TABLE mart_user_segments AS
SELECT 
    u.user_id,
    u.username,
    uem.unique_content_watched,
    uem.total_watch_time,
    s.plan_name,
    up.country,
    CASE 
        WHEN uem.total_watch_time > 1000 AND s.plan_name = 'Premium' THEN 'High Value'
        WHEN uem.unique_content_watched > 50 THEN 'Engaged'
        WHEN s.is_active = FALSE THEN 'Churned'
        ELSE 'Standard'
    END AS user_segment
FROM 
    users u
JOIN 
    user_engagement_metrics uem ON u.user_id = uem.user_id
LEFT JOIN 
    subscriptions s ON u.user_id = s.user_id AND s.is_active = TRUE
LEFT JOIN 
    user_profiles up ON u.user_id = up.user_id;

-- Final reporting table combining user and content insights
CREATE TABLE report_content_user_insights AS
SELECT 
    mus.user_segment,
    cp.genre,
    COUNT(DISTINCT vh.user_id) AS unique_viewers,
    SUM(vh.duration_watched) AS total_watch_time,
    AVG(r.rating) AS avg_rating
FROM 
    mart_user_segments mus
JOIN 
    viewing_history vh ON mus.user_id = vh.user_id
JOIN 
    stg_content_performance cp ON vh.content_id = cp.content_id
LEFT JOIN 
    reviews r ON vh.user_id = r.user_id AND vh.content_id = r.content_id
GROUP BY 
    mus.user_segment, cp.genre;

Example 3 - Supply chain

Code
-- Raw material inventory
CREATE TABLE raw_material_inventory (
    material_id INT PRIMARY KEY,
    material_name VARCHAR(100),
    quantity_on_hand DECIMAL(10,2),
    unit_of_measure VARCHAR(20),
    last_updated TIMESTAMP
);

-- Supplier information
CREATE TABLE suppliers (
    supplier_id INT PRIMARY KEY,
    supplier_name VARCHAR(100),
    contact_person VARCHAR(100),
    email VARCHAR(100),
    phone VARCHAR(20)
);

-- Purchase orders
CREATE TABLE purchase_orders (
    po_id INT PRIMARY KEY,
    supplier_id INT REFERENCES suppliers(supplier_id),
    order_date DATE,
    expected_delivery_date DATE,
    total_amount DECIMAL(12,2)
);

-- Purchase order details
CREATE TABLE po_details (
    po_detail_id INT PRIMARY KEY,
    po_id INT REFERENCES purchase_orders(po_id),
    material_id INT REFERENCES raw_material_inventory(material_id),
    quantity DECIMAL(10,2),
    unit_price DECIMAL(10,2)
);

-- First level: Aggregate purchase order data
CREATE TABLE aggregate_po_data AS
SELECT
    p.po_id,
    p.supplier_id,
    s.supplier_name,
    p.order_date,
    p.expected_delivery_date,
    COUNT(pd.material_id) AS num_materials,
    SUM(pd.quantity) AS total_quantity,
    SUM(pd.quantity * pd.unit_price) AS total_amount
FROM purchase_orders p
JOIN po_details pd ON p.po_id = pd.po_id
JOIN suppliers s ON p.supplier_id = s.supplier_id
GROUP BY p.po_id, p.supplier_id, s.supplier_name, p.order_date, p.expected_delivery_date;

-- Second level: Supplier performance metrics
CREATE TABLE supplier_performance AS
SELECT
    supplier_id,
    supplier_name,
    COUNT(po_id) AS total_orders,
    AVG(total_amount) AS avg_order_value,
    AVG(EXTRACT(DAY FROM (expected_delivery_date - order_date))) AS avg_lead_time,
    SUM(total_amount) AS total_spend
FROM aggregate_po_data
GROUP BY supplier_id, supplier_name;

-- Third level: Material procurement analysis
CREATE TABLE material_procurement_analysis AS
SELECT
    rmi.material_id,
    rmi.material_name,
    COUNT(DISTINCT pd.po_id) AS number_of_orders,
    SUM(pd.quantity) AS total_ordered_quantity,
    AVG(pd.unit_price) AS average_unit_price,
    SUM(pd.quantity * pd.unit_price) AS total_spend
FROM raw_material_inventory rmi
JOIN po_details pd ON rmi.material_id = pd.material_id
GROUP BY rmi.material_id, rmi.material_name;

-- Fourth level: Inventory turnover analysis
CREATE TABLE inventory_turnover_analysis AS
SELECT
    mpa.material_id,
    mpa.material_name,
    mpa.total_ordered_quantity,
    rmi.quantity_on_hand,
    mpa.total_ordered_quantity / NULLIF(rmi.quantity_on_hand, 0) AS turnover_ratio,
    CASE 
        WHEN mpa.total_ordered_quantity / NULLIF(rmi.quantity_on_hand, 0) > 3 THEN 'High'
        WHEN mpa.total_ordered_quantity / NULLIF(rmi.quantity_on_hand, 0) > 1 THEN 'Medium'
        ELSE 'Low'
    END AS turnover_category
FROM material_procurement_analysis mpa
JOIN raw_material_inventory rmi ON mpa.material_id = rmi.material_id;

-- Fifth level: Supplier-Material relationship analysis
CREATE TABLE supplier_material_relationship AS
SELECT
    s.supplier_id,
    s.supplier_name,
    rmi.material_id,
    rmi.material_name,
    COUNT(DISTINCT po.po_id) AS order_count,
    SUM(pd.quantity) AS total_quantity_ordered,
    AVG(pd.unit_price) AS avg_unit_price,
    SUM(pd.quantity * pd.unit_price) AS total_spend
FROM suppliers s
JOIN purchase_orders po ON s.supplier_id = po.supplier_id
JOIN po_details pd ON po.po_id = pd.po_id
JOIN raw_material_inventory rmi ON pd.material_id = rmi.material_id
GROUP BY s.supplier_id, s.supplier_name, rmi.material_id, rmi.material_name;

-- Sixth level: Supply chain risk assessment
CREATE TABLE supply_chain_risk_assessment AS
WITH supplier_diversity AS (
    SELECT 
        material_id,
        COUNT(DISTINCT supplier_id) AS supplier_count
    FROM supplier_material_relationship
    GROUP BY material_id
)
SELECT
    smr.material_id,
    smr.material_name,
    smr.supplier_id,
    smr.supplier_name,
    smr.total_quantity_ordered,
    ita.turnover_category,
    sd.supplier_count,
    CASE
        WHEN sd.supplier_count = 1 THEN 'High Risk'
        WHEN sd.supplier_count = 2 THEN 'Medium Risk'
        ELSE 'Low Risk'
    END AS supplier_diversity_risk,
    CASE
        WHEN ita.turnover_category = 'High' AND sd.supplier_count = 1 THEN 'Critical'
        WHEN ita.turnover_category = 'High' OR sd.supplier_count = 1 THEN 'High'
        WHEN ita.turnover_category = 'Medium' OR sd.supplier_count = 2 THEN 'Medium'
        ELSE 'Low'
    END AS overall_risk_assessment
FROM supplier_material_relationship smr
JOIN inventory_turnover_analysis ita ON smr.material_id = ita.material_id
JOIN supplier_diversity sd ON smr.material_id = sd.material_id;

-- Final level: Strategic procurement insights
CREATE TABLE strategic_procurement_insights AS
SELECT
    scra.material_id,
    scra.material_name,
    scra.overall_risk_assessment,
    COUNT(DISTINCT scra.supplier_id) AS supplier_count,
    AVG(sp.avg_lead_time) AS avg_lead_time,
    SUM(smr.total_spend) AS total_spend,
    MAX(ita.turnover_ratio) AS max_turnover_ratio,
    STRING_AGG(DISTINCT scra.supplier_name, ', ') AS suppliers
FROM supply_chain_risk_assessment scra
JOIN supplier_performance sp ON scra.supplier_id = sp.supplier_id
JOIN supplier_material_relationship smr ON scra.material_id = smr.material_id AND scra.supplier_id = smr.supplier_id
JOIN inventory_turnover_analysis ita ON scra.material_id = ita.material_id
GROUP BY scra.material_id, scra.material_name, scra.overall_risk_assessment
ORDER BY total_spend DESC, overall_risk_assessment DESC;

Last updated