Performance Optimization¶

Strategies for optimizing Dango performance with large datasets.

Overview¶

As your data grows, you may need to optimize: - Sync performance (data loading) - Query performance (DuckDB) - dbt model execution - Memory usage

DuckDB Optimization¶

Understanding DuckDB¶

DuckDB is optimized for analytical workloads: - Columnar storage for fast aggregations - Automatic parallelization - In-memory processing with disk spillover

Version Alignment¶

DuckDB requires that the Python library and the Metabase JDBC driver share the same major.minor version. Dango handles this automatically — the driver version (1.5.1.0) is pinned to match the Python DuckDB dependency (1.5.x).

If you upgrade DuckDB manually, the Metabase driver must also be updated. Dango checks version alignment on startup and warns if there is a mismatch.

Why Version Alignment Matters

DuckDB's read-only mode cannot open database files created by a different major.minor version. If the Python library is 1.5.x but the Metabase JDBC driver bundles 1.4.x, Metabase cannot read the database. Dango's startup check and pre-commit hook prevent this drift.

Memory Configuration¶

-- Check current memory limit
SELECT current_setting('memory_limit');

-- Increase memory for large datasets (run in DuckDB CLI)
SET memory_limit = '8GB';
SET threads = 4;

Configure in dbt:

# dbt/profiles.yml
my_project:
  outputs:
    dev:
      type: duckdb
      path: ../data/warehouse.duckdb
      config:
        memory_limit: '8GB'
        threads: 4

Query Optimization¶

Use EXPLAIN to analyze queries:

EXPLAIN ANALYZE
SELECT date_trunc('day', order_date), SUM(amount)
FROM raw_orders.orders
GROUP BY 1;

Common optimizations:

-- BAD: Full table scan for filter
SELECT * FROM orders WHERE YEAR(order_date) = 2024;

-- GOOD: Range filter
SELECT * FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';

-- BAD: SELECT *
SELECT * FROM orders JOIN customers USING (customer_id);

-- GOOD: Select only needed columns
SELECT o.order_id, o.amount, c.name
FROM orders o JOIN customers c USING (customer_id);

Sync Performance¶

Sync Subprocess Model¶

Syncs run in isolated subprocesses, keeping the web UI responsive. The web server never blocks on a long-running sync — it monitors progress via status files at .dango/state/sync_status_{id}.json and broadcasts updates over WebSocket.

This means you can continue using the dashboard, viewing data, and managing sources while a sync runs in the background.

Single-Writer Constraint¶

DuckDB allows only one writer process at a time. Dango enforces this with DbtLock — a file lock at .dango/state/dbt.lock.

Do not run syncs in parallel. DuckDB's single-writer constraint means parallel syncs will fail:

# WRONG: parallel syncs will fail on the second one
dango sync source1 &
dango sync source2 &

# CORRECT: sync sources sequentially
dango sync source1
dango sync source2

# CORRECT: sync all sources (Dango handles ordering)
dango sync

When a sync is triggered from the web UI while another is running, it waits up to 5 minutes for the lock before failing. The CLI fails immediately.

See DuckDB Single-Writer for details.

Reduce Sync Scope¶

Select specific endpoints:

sources:
  - name: stripe
    type: stripe
    stripe:
      endpoints:
        - charges      # Only what you need
        - customers
        # Skip: products, plans, etc.

Date range limits:

sources:
  - name: google_analytics
    type: google_analytics
    google_analytics:
      start_date: "2024-01-01"  # Don't fetch all history

Full Refresh Scheduling¶

# Regular syncs: incremental
dango sync

# Weekly: full refresh for data quality
dango sync --full-refresh  # Run on weekends

Large Dataset Handling¶

Sync Strategies¶

Incremental Syncs:

# .dango/sources.yml
sources:
  - name: large_source
    type: stripe
    stripe:
      # Only sync recent data after initial load
      start_date: "2024-01-01"

Source Timeouts:

# For slow sources, increase timeout
sources:
  - name: slow_api
    type: rest_api
    rest_api:
      timeout: 600  # 10 minutes

Data Volume Guidelines¶

Data Size	Recommendation
< 100 MB	Default settings work well
100 MB - 1 GB	Consider incremental syncs
1 GB - 10 GB	Use materialized tables, indexes
> 10 GB	Consider partitioning, external tables

dbt Performance¶

Materialization Strategy¶

Dango uses table materialization for all models to ensure Metabase compatibility:

Type	Use When	Performance
`table`	Default for all models	Best query performance, reliable Metabase support
`incremental`	Very large, append-only data	Fast to build, fast to query

Why Tables Only

Dango auto-generates staging models as tables (not views) for reliable Metabase compatibility. While views rebuild faster, they can cause query issues in Metabase.

-- Staging: tables (auto-generated by Dango)
{{ config(materialized='table') }}

-- Intermediate: tables
{{ config(materialized='table') }}

-- Marts: tables (query performance matters)
{{ config(materialized='table') }}

-- Large fact tables: incremental
{{ config(
    materialized='incremental',
    unique_key='id',
    incremental_strategy='merge'
) }}

Incremental Models¶

For large fact tables:

-- dbt/models/marts/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT
    event_id,
    user_id,
    event_type,
    created_at
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
    WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}

Run Specific Models¶

# Don't rebuild everything
cd dbt

# Run only what changed
dbt run --select state:modified+

# Run specific model and downstream
dbt run --select my_model+

Parallel Execution¶

# Run models in parallel (default is 1)
dbt run --threads 4

Single-Worker Architecture¶

Cloud deployments run Dango with a single uvicorn worker. This is intentional — ws_manager (the WebSocket connection manager) is an in-process singleton, so multi-worker mode would break real-time sync progress broadcasting. The workers config option exists for advanced use but defaults to 1.

If you need to handle more concurrent HTTP connections, scale vertically (resize the server) or put a CDN/reverse proxy in front of Caddy for static assets.

Memory Management¶

Monitor Memory Usage¶

# Check DuckDB memory
duckdb data/warehouse.duckdb "SELECT * FROM duckdb_memory()"

# Check system memory during sync
top -pid $(pgrep -f dango)

Reduce Memory Pressure¶

Close unused connections:

# Stop services when not needed
dango stop

Process in batches:

# Custom source with batching
@dlt.resource
def large_api():
    for page in range(100):
        batch = fetch_page(page, size=1000)
        yield batch  # Yield batches, not all at once

Database Maintenance¶

Vacuum and Analyze¶

-- Run periodically in DuckDB CLI
VACUUM;
ANALYZE;

Check Table Sizes¶

-- Find large tables
SELECT
    table_name,
    estimated_size
FROM duckdb_tables()
ORDER BY estimated_size DESC;

Remove Unused Data¶

# Clean up old dlt state
dango db clean

# This removes:
# - DuckDB database
# - dlt pipeline state
# Fresh sync required after

Monitoring Performance¶

Sync Duration¶

# Time a sync
time dango sync my_source

# Check sync logs
cat .dango/logs/activity.jsonl | tail -20

Query Performance¶

-- Enable profiling
PRAGMA enable_profiling;
PRAGMA profile_output='query_profile.json';

-- Run your query
SELECT ...;

-- Check profile
.read query_profile.json

dbt Timing¶

# dbt shows timing by default
dbt run --select my_model

# Detailed timing
dbt run --select my_model --log-level debug

Performance Checklist¶

Before Scaling¶

Use incremental syncs where possible
Materialize slow models as tables
Limit sync scope to needed endpoints
Set appropriate date ranges

For Large Datasets¶

Increase DuckDB memory limit
Use incremental dbt models
Run dbt with multiple threads
Schedule full refreshes off-peak

Troubleshooting Slow Queries¶

Check with EXPLAIN ANALYZE
Verify indexes exist on join columns
Review model materializations
Check for unnecessary SELECT *

Next Steps¶

dbt Workflows - Model optimization
Troubleshooting - Debug slow syncs
Backup & Restore - Large database backups