
Custom Source: Parse PDF Documents

Ingest data trapped in PDF files — supplier invoices, receipts, tax documents, legacy reports.


Overview

Many small businesses have data trapped in PDF files. Dango can ingest it through custom dlt sources: you write the parsing logic for your specific document format, and Dango handles the rest of the pipeline (loading, deduplication, scheduling, staging models, catalog).


Example: Supplier Invoice Parser

Use case: Track supplier spend across all your invoices in one dashboard.

Adapt to your format

This example shows the pattern for a simple tabular invoice. Your invoices will have different layouts — adapt the parsing logic to match your document format. This is not production-ready code.

custom_sources/invoices.py

"""Example: Parse supplier invoices from PDF files.

Adapt the table extraction and field mapping to match your invoice format.
"""
import dlt
import pdfplumber
from pathlib import Path


@dlt.resource(write_disposition="merge", primary_key="invoice_number")
def invoices(folder="data/invoices/"):
    """Extract invoice header data from PDF files."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        with pdfplumber.open(pdf_path) as pdf:
            # Example: first page contains invoice header as a table
            page = pdf.pages[0]
            table = page.extract_table()
            if not table:
                continue
            # Adapt these field positions to your invoice format
            yield {
                "invoice_number": table[0][1],
                "date": table[1][1],
                "vendor_name": table[2][1],
                "total_amount": float(table[3][1].replace(",", "")),
                "tax_amount": float(table[4][1].replace(",", "")),
                "source_file": pdf_path.name,
            }


@dlt.resource(write_disposition="merge", primary_key=["invoice_number", "line_number"])
def line_items(folder="data/invoices/"):
    """Extract line items from invoice PDFs."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        with pdfplumber.open(pdf_path) as pdf:
            line_number = 0  # counts across pages so the merge key stays unique
            for page in pdf.pages:
                table = page.extract_table()
                if not table:
                    continue
                # Adapt: find where line items start in your format.
                # This assumes every page repeats the five header rows.
                for row in table[5:]:  # skip header rows
                    if not row[0]:
                        break
                    line_number += 1
                    yield {
                        "invoice_number": table[0][1],
                        "line_number": line_number,
                        "description": row[0],
                        "quantity": int(row[1]) if row[1] else 1,
                        "unit_price": float(row[2].replace(",", "")),
                        "amount": float(row[3].replace(",", "")),
                    }


@dlt.source
def supplier_invoices():
    return [invoices(), line_items()]
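
The example converts amounts with float(cell.replace(",", "")), which raises an error on empty cells or currency symbols, and it passes the date through as raw text. The sketch below shows one way to make those conversions more forgiving. The helper names parse_amount and parse_date and the DATE_FORMATS list are hypothetical, not part of Dango, dlt, or pdfplumber; adjust them to the cell contents your invoices actually produce.

```python
import re
from datetime import datetime

# Hypothetical list of date layouts; extend it to match your suppliers.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_amount(cell):
    """Convert a cell such as "$1,234.50" to a float, or None if empty or unparseable."""
    if cell is None:
        return None
    # Drop currency symbols, thousands separators, and whitespace.
    cleaned = re.sub(r"[^\d.\-]", "", cell)
    if not cleaned:
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_date(cell):
    """Return an ISO date string for the first layout that matches, or None."""
    if not cell:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cell.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Inside the resource you could then write "total_amount": parse_amount(table[3][1]) and "date": parse_date(table[1][1]). Returning None rather than raising keeps one malformed invoice from stopping the whole sync, at the cost of having to check for missing values downstream.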

Steps

  1. Install the parsing library (in your project venv):

    pip install pdfplumber
    
  2. Create custom_sources/invoices.py with your parsing logic (adapted from the example above)

  3. Register the source:

    dango source add
    # Select "dlt_native" → enter module name
    
  4. Drop PDF files into data/invoices/

  5. Sync:

    dango sync
    

    This parses the PDFs and loads the results into DuckDB. Staging models are auto-generated, profiling runs, and the data appears in the catalog.
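
Both resources use write_disposition="merge", so rows that share a primary key are treated as the same record. Before syncing, it can be worth checking that your parser does not emit the same key twice. The following is a minimal sketch; find_duplicate_keys is a hypothetical helper, and the sample rows stand in for what list(line_items()) would return.

```python
from collections import Counter


def find_duplicate_keys(records, key_fields):
    """Return the key tuples that occur more than once in `records`."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return sorted(key for key, n in counts.items() if n > 1)


# Hand-written sample rows standing in for list(line_items()).
sample = [
    {"invoice_number": "INV-001", "line_number": 1, "amount": 10.0},
    {"invoice_number": "INV-001", "line_number": 2, "amount": 5.0},
    {"invoice_number": "INV-001", "line_number": 2, "amount": 7.5},
]
duplicates = find_duplicate_keys(sample, ["invoice_number", "line_number"])
print(duplicates)  # [('INV-001', 2)]
```

An empty list means every row has a distinct key. Any tuple that shows up points to rows that would collapse into one during the merge.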


What to Do With the Data

After ingestion, your data flows through Dango's standard layers:

  1. Raw: raw_invoices.invoices and raw_invoices.line_items — exactly as parsed from PDFs
  2. Staging: Auto-generated by Dango — cleans column names, applies types (e.g., stg_invoices__invoices)
  3. Intermediate: Create dbt/models/intermediate/int_supplier_spend.sql to join invoices with line items and normalize the different vendors' invoice formats into a common schema
  4. Marts: Create dbt/models/marts/mart_monthly_spend_by_vendor.sql for dashboard-ready aggregations — monthly spend by vendor, category breakdowns, year-over-year comparisons

Use dango model add to create intermediate and marts models with the interactive wizard. See the dbt best practices for advanced modeling patterns.
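
To make the target shape of the monthly spend mart concrete, here is the same aggregation in plain Python. It is only an illustration of the logic: the real model would be SQL in dbt, and the function name, the sample rows, and the assumption that dates are already ISO strings (YYYY-MM-DD) are all hypothetical.

```python
from collections import defaultdict


def monthly_spend_by_vendor(invoices):
    """Sum total_amount per (month, vendor_name) pair."""
    totals = defaultdict(float)
    for inv in invoices:
        month = inv["date"][:7]  # "2024-03-05" -> "2024-03"
        totals[(month, inv["vendor_name"])] += inv["total_amount"]
    return dict(sorted(totals.items()))


# Hand-written sample rows in the shape the invoices resource yields.
sample = [
    {"date": "2024-03-05", "vendor_name": "Acme", "total_amount": 100.0},
    {"date": "2024-03-20", "vendor_name": "Acme", "total_amount": 50.0},
    {"date": "2024-04-02", "vendor_name": "Birch", "total_amount": 75.0},
]
spend = monthly_spend_by_vendor(sample)
print(spend)  # {('2024-03', 'Acme'): 150.0, ('2024-04', 'Birch'): 75.0}
```

Each output row is one vendor in one month, which is the grain a spend dashboard typically charts.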


Adapting for Other Document Types

  • Bank statements: typically have transaction tables — extract date, description, amount per row
  • Receipts: may need OCR for scanned images (pytesseract) — more complex setup
  • Tax documents: often fixed-format with known field positions
  • Each PDF format needs its own parsing logic — there is no universal PDF parser
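
As a sketch of the bank statement case, the function below turns an already extracted table into one record per transaction. It works on a plain list of rows, which is the shape pdfplumber's page.extract_table() returns. The column order (date, description, amount), the single header row, and the name parse_statement_rows are assumptions for illustration.

```python
def parse_statement_rows(table):
    """Yield one dict per transaction from rows of [date, description, amount] cells."""
    for row in table[1:]:  # assumes the first row is a column header
        if not row or len(row) < 3 or not row[0] or not row[2]:
            continue  # skip blank or incomplete rows instead of stopping
        yield {
            "date": row[0].strip(),
            "description": (row[1] or "").strip(),
            "amount": float(row[2].replace(",", "")),
        }


# Same shape as an extracted table: a list of rows of cell strings (or None).
sample_table = [
    ["Date", "Description", "Amount"],
    ["2024-03-01", "Card payment ACME", "-1,250.00"],
    [None, None, None],
    ["2024-03-02", "Transfer in", "300.00"],
]
transactions = list(parse_statement_rows(sample_table))
```

Unlike the invoice example, this skips empty rows rather than breaking out of the loop, because statements often contain blank separator rows between groups of transactions.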

Cloud Deployment

PDF files need to be on the server to sync. Options:

  • Copy files to the server over SSH: scp invoices/*.pdf user@server:/srv/dango/project/data/invoices/
  • Cloud storage: store in S3/DO Spaces and read from there in your source
  • File watcher: can auto-trigger sync when new PDFs are added locally

Prerequisites

  • pdfplumber is not included with Dango — install separately: pip install pdfplumber
  • For scanned PDFs (images, not text-based), you'll need OCR (e.g., pytesseract + Tesseract) — significantly more complex than text-based parsing

Next Steps