GitHub¶
Connect GitHub repositories as a data source using a Personal Access Token.
Overview¶
| Feature | Details |
|---|---|
| Auth | API Key (Personal Access Token) |
| Incremental | No (full refresh) |
| Category | Development |
Not OAuth
GitHub uses a Personal Access Token (PAT) for authentication, not an OAuth browser flow. No browser redirect is needed — you paste your token directly during setup.
The GitHub source loads repository data into DuckDB, including issues and pull requests (with embedded reactions and comments).
Managing this source in the Web UI
After setup, manage this source from the Sources page in the Web UI (http://localhost:8800/sources). Trigger syncs, view history, and monitor status without using the CLI. See Web UI — Sources.
Prerequisites¶
Before adding GitHub as a source, you need:
- GitHub account with access to the target repository
- Personal Access Token (classic) — not fine-grained
Generate a Personal Access Token¶
- Go to GitHub Settings > Developer settings > Personal access tokens > Tokens (classic)
- Click Generate new token (classic)
- Set a descriptive name (e.g., "Dango data sync")
- Select scopes:
  - repo - full repository access (required for private repos)
  - read:org - read org membership
  - read:user - read user profile
- Click Generate token
- Copy the token (starts with ghp_). You won't see it again
Classic tokens only
Use classic personal access tokens, not fine-grained tokens. Fine-grained tokens are not fully supported by the dlt GitHub source.
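One way to catch the wrong token type before running the wizard is to look at the token's prefix. The sketch below assumes GitHub's published prefixes (`ghp_` for classic tokens, `github_pat_` for fine-grained tokens). The function name `classify_token` is hypothetical and is not part of Dango or dlt.

```python
def classify_token(token: str) -> str:
    """Guess the PAT type from its prefix (illustrative helper, not a Dango API)."""
    token = token.strip()
    if token.startswith("github_pat_"):
        return "fine-grained"  # not fully supported by the dlt GitHub source
    if token.startswith("ghp_"):
        return "classic"  # the token type this guide expects
    return "unknown"


# Placeholder value; substitute the token you just generated.
kind = classify_token("ghp_" + "x" * 36)
print(kind)
```

If the result is anything other than `classic`, generate a new classic token before continuing.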
Setup¶
Step 1: Add Source¶
Step 2: Configure¶
The wizard will prompt for:
? GitHub Personal Access Token: ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
? Repository owner (e.g., getdango): myorg
? Repository name (e.g., dango): my-repo
The token is saved to .env as GITHUB_ACCESS_TOKEN.
Step 3: Sync¶
Configuration¶
sources.yml¶
version: '1.0'
sources:
  - name: my_github
    type: github
    enabled: true
    description: GitHub issues and PRs from main repo
    github:
      owner: "myorg"
      name: "my-repo"
      access_token_env: "GITHUB_ACCESS_TOKEN"
.env¶
Never commit secrets
.env is gitignored by default. Never add it to version control.
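To illustrate how `access_token_env` in sources.yml relates to .env, here is a minimal sketch that parses `KEY=value` lines and looks up the variable named in the config. How Dango itself reads .env is not described on this page, so `parse_env` and `resolve_token` are hypothetical helpers and the parsing rules (comments, optional quotes) are assumptions.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines; skip blanks and # comments (assumed format)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_token(env: dict, access_token_env: str) -> str:
    """Return the token stored under the variable named by access_token_env."""
    token = env.get(access_token_env, "")
    if not token:
        raise KeyError(f"{access_token_env} is not set in .env")
    return token


# Placeholder token value; a real .env would hold your actual PAT.
sample = "# GitHub source\nGITHUB_ACCESS_TOKEN=ghp_xxxxxxxxxxxx\n"
token = resolve_token(parse_env(sample), "GITHUB_ACCESS_TOKEN")
```

The point of the indirection is that sources.yml only ever contains the variable name, so it can be committed while the secret stays in .env.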
Tables Loaded¶
GitHub data loads into the raw_{source_name} schema using dlt's github_reactions source function. Tables include:
| Table | Description |
|---|---|
| issues | All issues (open and closed) with reactions and comments |
| pull_requests | All pull requests with reactions and comments |
-- Example: query open issues
SELECT * FROM raw_my_github.issues
WHERE state = 'open'
ORDER BY created_at DESC
LIMIT 10;
Sync Behavior¶
- Full refresh: all issues and pull requests are reloaded on every sync (write disposition: replace)
- Each sync loads all historical data for the repository
- The github_reactions source function fetches issues and PRs with their embedded reactions and comments
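The effect of the replace write disposition can be shown with a toy model. This is plain Python, not dlt code: `sync` and the in-memory `warehouse` dict are hypothetical stand-ins for the real loader and the DuckDB tables.

```python
def sync(warehouse: dict, table: str, fetched_rows: list,
         write_disposition: str = "replace") -> None:
    """Toy model of a sync: 'replace' discards old rows, 'append' keeps them."""
    if write_disposition == "replace":
        warehouse[table] = list(fetched_rows)
    elif write_disposition == "append":
        warehouse.setdefault(table, []).extend(fetched_rows)
    else:
        raise ValueError(f"unsupported write disposition: {write_disposition}")


warehouse = {}
first_fetch = [{"number": 1, "state": "open"}]
second_fetch = [{"number": 1, "state": "closed"}, {"number": 2, "state": "open"}]

sync(warehouse, "issues", first_fetch)
sync(warehouse, "issues", second_fetch)
# After two replace syncs the table holds only the rows from the latest fetch.
row_count = len(warehouse["issues"])
```

With replace, an issue that changes state never appears twice, at the cost of refetching the full history on every sync.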
Troubleshooting¶
401 Unauthorized¶
Problem: 401 Bad credentials
Solutions:
- Verify your PAT is still valid at GitHub Settings > Tokens
- Check that the token hasn't expired (if you set an expiration date)
- Regenerate the token and update .env
403 Forbidden¶
Problem: 403 Forbidden on certain endpoints
Solutions:
- Verify the token has the required scopes: repo, read:org, read:user
- For private repos, the repo scope is required (not just public_repo)
- Check that your GitHub account has access to the target repository
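For classic tokens, GitHub REST API responses include an `X-OAuth-Scopes` header listing the granted scopes. The sketch below compares such a header value against the three scopes this guide asks for. `missing_scopes` is a hypothetical helper, and it does a literal comparison only: it does not know that a broader scope (for example `admin:org`) also covers a narrower one such as `read:org`.

```python
REQUIRED_SCOPES = {"repo", "read:org", "read:user"}


def missing_scopes(x_oauth_scopes: str, required=REQUIRED_SCOPES) -> set:
    """Return the required scopes absent from an X-OAuth-Scopes header value."""
    granted = {s.strip() for s in x_oauth_scopes.split(",") if s.strip()}
    return set(required) - granted


# Header value as it might look for a token created with only public_repo.
print(sorted(missing_scopes("public_repo, read:user")))
```

An empty result means all three scopes are listed explicitly on the token.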
Rate Limits¶
Problem: 403 API rate limit exceeded
Solution: Authenticated requests are limited to 5,000 per hour. The dlt source handles rate limiting automatically with retries. If you sync large repos frequently:
- Increase the interval between syncs
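GitHub reports the current quota in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers (the latter is a UTC epoch timestamp). As a rough sketch of how a caller could decide how long to wait, assuming the headers are available as a dict with exactly these key spellings, see below. `seconds_until_reset` is a hypothetical helper and does not describe how dlt implements its retries.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str],
                        now: Optional[float] = None) -> float:
    """Return 0 if quota remains, else seconds until the rate-limit window resets."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", "0"))
    current = time.time() if now is None else now
    # Never return a negative wait if the reset time has already passed.
    return max(0.0, reset_at - current)


# Made-up header values: quota exhausted, reset 90 seconds after `now`.
wait = seconds_until_reset(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000090"},
    now=1700000000,
)
```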
Private Repository Access¶
Problem: 404 Not Found on a private repo
Solution: Ensure your PAT has the repo scope (full access to private repos). The public_repo scope is insufficient for private repositories.
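The troubleshooting cases above can be condensed into a small lookup. This is only a sketch that maps an HTTP status code (and, for 403, the response message) to the hints in this section. `diagnose` is a hypothetical helper, not a Dango command.

```python
def diagnose(status: int, message: str = "") -> str:
    """Map a GitHub API error to the matching hint from this guide."""
    if status == 401:
        return "Bad credentials: check that the PAT is valid and not expired, then update .env."
    if status == 403 and "rate limit" in message.lower():
        return "Rate limit exceeded: increase the interval between syncs."
    if status == 403:
        return "Forbidden: verify the repo, read:org and read:user scopes and your repository access."
    if status == 404:
        return "Not found: for private repos, make sure the PAT has the repo scope."
    return "No specific hint for this status."


print(diagnose(403, "API rate limit exceeded"))
```

The rate-limit check runs before the generic 403 branch because both problems share the same status code and differ only in the message.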
Next Steps¶
- Source Catalog - See all available sources
- Sync Modes - Understand incremental loading
- Adding Sources - General source setup guide
- Credentials - Token storage and security