Custom Source: Parse PDF Documents
Ingest data trapped in PDF files — supplier invoices, receipts, tax documents, legacy reports.
Overview
Many small businesses have data trapped in PDF files. Dango can ingest these using custom dlt sources. You write the parsing logic for your specific document format; Dango handles the pipeline (loading, dedup, scheduling, staging models, catalog).
Example: Supplier Invoice Parser
Use case: Track supplier spend across all your invoices in one dashboard.
Adapt to your format
This example shows the pattern for a simple tabular invoice. Your invoices will have different layouts — adapt the parsing logic to match your document format. This is not production-ready code.
custom_sources/invoices.py
"""Example: Parse supplier invoices from PDF files.

Adapt the table extraction and field mapping to match your invoice format.
"""
from pathlib import Path

import dlt
import pdfplumber


def _to_float(cell):
    """Convert a cell such as "1,234.50" to float; empty cells become None."""
    return float(cell.replace(",", "")) if cell else None


@dlt.resource(write_disposition="merge", primary_key="invoice_number")
def invoices(folder="data/invoices/"):
    """Extract invoice header data from PDF files."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        with pdfplumber.open(pdf_path) as pdf:
            # Example: first page contains invoice header as a table
            page = pdf.pages[0]
            table = page.extract_table()
            if not table:
                continue
            # Adapt these field positions to your invoice format
            yield {
                "invoice_number": table[0][1],
                "date": table[1][1],
                "vendor_name": table[2][1],
                "total_amount": _to_float(table[3][1]),
                "tax_amount": _to_float(table[4][1]),
                "source_file": pdf_path.name,
            }


@dlt.resource(write_disposition="merge", primary_key=["invoice_number", "line_number"])
def line_items(folder="data/invoices/"):
    """Extract line items from invoice PDFs."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        with pdfplumber.open(pdf_path) as pdf:
            # Count lines across pages so the merge key stays unique
            # for invoices that span more than one page.
            line_number = 0
            for page in pdf.pages:
                table = page.extract_table()
                if not table:
                    continue
                # Adapt: find where line items start in your format.
                # This assumes every page repeats the five header rows.
                for row in table[5:]:  # skip header rows
                    if not row[0]:
                        break
                    line_number += 1
                    yield {
                        "invoice_number": table[0][1],
                        "line_number": line_number,
                        "description": row[0],
                        "quantity": int(row[1]) if row[1] else 1,
                        "unit_price": _to_float(row[2]),
                        "amount": _to_float(row[3]),
                    }


@dlt.source
def supplier_invoices():
    return [invoices(), line_items()]
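The example converts amounts by stripping commas and calling float, which fails on cells that carry a currency symbol, accounting-style parentheses, or no value at all. The block below is a minimal sketch of a more tolerant converter. The name parse_amount is hypothetical, and it assumes "." is the decimal separator and "," groups thousands; adjust it if your invoices use a different locale.

```python
from decimal import Decimal, InvalidOperation
import re


def parse_amount(cell):
    """Convert a table cell such as "$1,234.50" or "(45.00)" to Decimal.

    Returns None for empty or unparseable cells instead of raising.
    Assumes "." is the decimal separator and "," groups thousands.
    """
    if cell is None:
        return None
    text = cell.strip()
    if not text:
        return None
    # Accounting style: an amount wrapped in parentheses is negative.
    negative = text.startswith("(") and text.endswith(")")
    # Drop everything except digits, decimal points, and minus signs.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return -value if negative else value
```

Returning None rather than raising lets one malformed cell pass through as a missing value instead of stopping the whole sync; whether that is the right trade-off depends on how strictly you want to validate invoices.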
Steps
- Install the parsing library (in your project venv): pip install pdfplumber
- Create custom_sources/invoices.py with your parsing logic (adapt from the example above)
- Register the source:
- Drop PDF files into data/invoices/
- Sync: parses the PDFs and loads them into DuckDB. Staging models are auto-generated, profiling runs, and the data appears in the catalog.
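Before syncing, it can help to confirm that the files dropped into the folder really are PDFs. The following is a minimal sketch, not part of Dango: check_pdf_folder is a hypothetical helper that relies on the "%PDF-" signature that PDF files normally begin with, and the default folder simply mirrors the one used in the example source.

```python
from pathlib import Path


def check_pdf_folder(folder="data/invoices/"):
    """Return (valid, suspicious) lists of file names found in the folder.

    A file counts as valid when its first five bytes are the "%PDF-"
    signature; empty or mislabeled files end up in the suspicious list.
    """
    valid, suspicious = [], []
    for path in sorted(Path(folder).glob("*.pdf")):
        with path.open("rb") as handle:
            header = handle.read(5)
        if header == b"%PDF-":
            valid.append(path.name)
        else:
            suspicious.append(path.name)
    return valid, suspicious
```

Calling check_pdf_folder() with no arguments inspects data/invoices/ and returns the two lists, so anything in the second list can be fixed or removed before the sync runs.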
What to Do With the Data
After ingestion, your data flows through Dango's standard layers:
- Raw: raw_invoices.invoices and raw_invoices.line_items, exactly as parsed from PDFs
- Staging: auto-generated by Dango; cleans column names and applies types (e.g., stg_invoices__invoices)
- Intermediate: create dbt/models/intermediate/int_supplier_spend.sql to standardize across vendors, join invoices with line items, and normalize different invoice formats into a common schema
- Marts: create dbt/models/marts/mart_monthly_spend_by_vendor.sql for dashboard-ready aggregations such as monthly spend by vendor, category breakdowns, and year-over-year comparisons
Use dango model add to create intermediate and marts models with the interactive wizard. See the dbt best practices for advanced modeling patterns.
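To make the "monthly spend by vendor" aggregation concrete, here is a minimal Python sketch of the calculation such a mart would perform. It is not the dbt model itself, only a way to cross-check totals against a handful of parsed rows. The function name is hypothetical, and it assumes dates have already been converted to datetime.date and amounts to float.

```python
from collections import defaultdict


def monthly_spend_by_vendor(invoices):
    """Sum total_amount per (month, vendor) from invoice rows.

    Each row is a dict with "date" (a datetime.date), "vendor_name",
    and "total_amount" (a float), mirroring the example parser's fields.
    """
    totals = defaultdict(float)
    for row in invoices:
        month = row["date"].strftime("%Y-%m")  # e.g. "2024-01"
        totals[(month, row["vendor_name"])] += row["total_amount"]
    # Sort by month, then vendor, for stable output.
    return dict(sorted(totals.items()))
```

Comparing this output with the mart for one or two months is a quick way to catch parsing errors such as a misread total or a duplicated invoice.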
Adapting for Other Document Types
- Bank statements: typically have transaction tables — extract date, description, amount per row
- Receipts: may need OCR for scanned images (pytesseract) — more complex setup
- Tax documents: often fixed-format with known field positions
- Each PDF format needs its own parsing logic — there is no universal PDF parser
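As an illustration of the bank statement case, the sketch below turns an already extracted table (a list of rows, each a list of cell strings or None, which is the shape pdfplumber's extract_table() returns) into transaction dicts. The function name, the "Date / Description / Amount" header, and the day/month/year date format are all assumptions; adapt them to your bank's layout.

```python
from datetime import datetime


def parse_statement_rows(table, date_format="%d/%m/%Y"):
    """Yield one dict per transaction row of an extracted table."""
    expected = ["date", "description", "amount"]
    # Locate the header row instead of hard-coding its position.
    header_index = next(
        (
            i
            for i, row in enumerate(table)
            if [(cell or "").strip().lower() for cell in row[:3]] == expected
        ),
        None,
    )
    if header_index is None:
        return  # no recognizable transaction table on this page
    for row in table[header_index + 1:]:
        if len(row) < 3 or not row[0] or not row[2]:
            continue  # skip blank or continuation rows
        try:
            booked = datetime.strptime(row[0].strip(), date_format).date()
        except ValueError:
            continue  # first cell is not a date, e.g. a balance summary
        yield {
            "date": booked.isoformat(),
            "description": (row[1] or "").strip(),
            "amount": float(row[2].replace(",", "")),
        }
```

Searching for the header row keeps the parser working when a statement adds or removes lines above the table, which fixed row offsets would not tolerate.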
Cloud Deployment
PDF files need to be on the server to sync. Options:
- Copy files to the server over SSH: scp invoices/*.pdf user@server:/srv/dango/project/data/invoices/
- Cloud storage: store in S3/DO Spaces and read from there in your source
- File watcher: can auto-trigger sync when new PDFs are added locally
Prerequisites
- pdfplumber is not included with Dango; install it separately: pip install pdfplumber
- For scanned PDFs (images, not text-based), you'll need OCR (e.g., pytesseract + Tesseract), which is significantly more complex than text-based parsing
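One way to tell whether a file needs OCR is to check how much text can be extracted from it. The sketch below is a heuristic, not a definitive test: looks_scanned and page_texts_from_pdf are hypothetical helper names, and the 20-character threshold is an arbitrary assumption. It uses pdfplumber's page.extract_text(), imported inside the function so the heuristic itself can be tried without the library installed.

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True when no page has at least min_chars of extractable text.

    page_texts is a list with one entry per page: a string, or None when
    nothing could be extracted. A PDF with no pages also returns True.
    """
    return all(len((text or "").strip()) < min_chars for text in page_texts)


def page_texts_from_pdf(pdf_path):
    """Collect the extracted text of every page using pdfplumber."""
    import pdfplumber  # imported here so looks_scanned works without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

For example, looks_scanned(page_texts_from_pdf("data/invoices/example.pdf")) (the file name is a placeholder) would flag a file that likely needs the OCR route before table extraction can work.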
Next Steps
- Custom Sources — full guide for building dlt_native sources
- Transformations — build dbt models on top of your parsed data
- Creating Dashboards — visualize supplier spend