REST API Data Extraction
Starlake can extract data from any REST API that returns JSON or XML. You define endpoints, authentication, pagination, and response structure in a YAML configuration file. Starlake handles the HTTP requests, pagination, and rate limiting, then writes the results as CSV files ready for ingestion.
This guide covers both extract-rest-schema (which infers table definitions) and extract-rest-data (which fetches the actual data).
Quick Start
1. Create the extraction config
version: 1
extract:
  restAPI:
    baseUrl: "https://api.example.com/v2"
    auth:
      type: bearer
      token: "{{API_TOKEN}}"
    rateLimit:
      requestsPerSecond: 10
    defaults:
      pagination:
        type: offset
        limitParam: "limit"
        offsetParam: "offset"
        pageSize: 100
    endpoints:
      - path: "/customers"
        as: "customer"
        domain: "crm"
        responsePath: "$.data"
        incrementalField: "updated_at"
2. Extract schemas (optional)
starlake extract-rest-schema --config my-api
This fetches a sample from each endpoint and generates Starlake YAML table definitions in metadata/load/.
3. Extract data
starlake extract-rest-data --config my-api --outputDir /tmp/api-data
4. Load into warehouse
starlake load
Authentication
Configure authentication in the auth section. All credential values support {{ENV_VAR}} syntax for environment variable substitution.
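To make the {{ENV_VAR}} substitution concrete, here is a minimal Python sketch. The helper name substitute_env and the choice to fail on an unset variable are assumptions made for illustration; this guide does not describe how Starlake behaves when a variable is missing.

```python
import re

# Matches {{NAME}} placeholders; the accepted name syntax is an assumption.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")

def substitute_env(value: str, env: dict) -> str:
    """Replace each {{NAME}} in value with env[NAME] (hypothetical helper)."""
    def repl(match):
        name = match.group(1)
        if name not in env:
            # Assumed behaviour: fail loudly rather than send an empty credential.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]
    return _PLACEHOLDER.sub(repl, value)
```

In practice you would pass os.environ as env, for example substitute_env("{{API_TOKEN}}", os.environ).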
Bearer Token
auth:
  type: bearer
  token: "{{API_TOKEN}}"
API Key
auth:
  type: api_key
  key: "{{API_KEY}}"
  header: "X-API-Key" # Header name (default: X-API-Key)
Basic Auth
auth:
  type: basic
  username: "{{API_USER}}"
  password: "{{API_PASSWORD}}"
OAuth2 Client Credentials
auth:
  type: oauth2_client_credentials
  tokenUrl: "https://auth.example.com/oauth/token"
  clientId: "{{OAUTH_CLIENT_ID}}"
  clientSecret: "{{OAUTH_CLIENT_SECRET}}"
  scope: "read:data" # Optional
Starlake automatically fetches tokens, caches them, and refreshes on expiry or 401 responses.
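The fetch, cache, and refresh cycle can be pictured with a small Python sketch. TokenCache and its fetch_token callback are hypothetical names; the sketch only illustrates the general pattern and is not Starlake's implementation.

```python
import time

class TokenCache:
    """Caches an access token until it expires or is invalidated (illustrative)."""

    def __init__(self, fetch_token, clock=time.monotonic):
        # fetch_token() is assumed to return (token, expires_in_seconds),
        # e.g. by POSTing client credentials to tokenUrl.
        self._fetch = fetch_token
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return the cached token, fetching a new one if missing or expired."""
        if self._token is None or self._clock() >= self._expires_at:
            self._token, ttl = self._fetch()
            self._expires_at = self._clock() + ttl
        return self._token

    def invalidate(self):
        """Call after a 401 response so the next get() fetches a fresh token."""
        self._token = None
```

A caller would use cache.get() for every request and call cache.invalidate() before retrying a request that came back with 401.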
Pagination
Configure pagination per endpoint or set a default for all endpoints.
Offset Pagination
For APIs using ?limit=100&offset=200:
pagination:
  type: offset
  limitParam: "limit" # Query param for page size
  offsetParam: "offset" # Query param for offset
  pageSize: 100
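As a rough illustration of what an offset-pagination loop does with these settings, consider the Python sketch below. The fetch_page callback and the stop condition (a page shorter than pageSize) are assumptions; this guide does not state how Starlake detects the last page.

```python
def fetch_all_offset(fetch_page, page_size=100):
    """Collect records page by page using limit/offset query parameters.

    fetch_page(params) is a hypothetical callback that performs the HTTP
    request and returns the list of records for that page.
    """
    records, offset = [], 0
    while True:
        page = fetch_page({"limit": page_size, "offset": offset})
        records.extend(page)
        if len(page) < page_size:  # assumed end-of-data signal
            return records
        offset += page_size
```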
Cursor Pagination
For APIs returning a cursor in the response body:
pagination:
  type: cursor
  cursorParam: "after" # Query param to pass cursor value
  cursorPath: "$.meta.next_cursor" # JSONPath to extract cursor from response
  pageSize: 50
  limitParam: "per_page" # Optional: query param for page size
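A cursor loop differs from an offset loop in that each response tells the client where to continue. The Python sketch below assumes a response shaped like {"data": [...], "meta": {"next_cursor": ...}} to match the cursorPath above, and assumes a missing or empty cursor means the last page; fetch_page is a hypothetical callback.

```python
def fetch_all_cursor(fetch_page, page_size=50):
    """Follow next_cursor values until the API stops returning one."""
    records, cursor = [], None
    while True:
        params = {"per_page": page_size}
        if cursor is not None:
            params["after"] = cursor  # cursorParam from the config above
        body = fetch_page(params)
        records.extend(body.get("data", []))
        # Equivalent to evaluating $.meta.next_cursor on the response body.
        cursor = (body.get("meta") or {}).get("next_cursor")
        if not cursor:
            return records
```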
Link Header Pagination
For APIs using Link headers with rel="next" (RFC 8288, which obsoletes RFC 5988):
pagination:
  type: link_header
  pageSize: 100
  limitParam: "per_page" # Optional
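For readers unfamiliar with Link headers, the following Python sketch shows one way to pull the rel="next" URL out of a header value. The function name next_link is hypothetical and the parsing is deliberately simplified; it is not Starlake's parser.

```python
import re

def next_link(link_header):
    """Return the URL whose rel includes "next", or None if there is none."""
    # Each entry looks like: <https://host/path?page=2>; rel="next"
    for url, params in re.findall(r'<([^>]*)>([^<]*)', link_header or ""):
        m = re.search(r'\brel\s*=\s*(?:"([^"]*)"|([^\s;,]+))', params)
        rel = (m.group(1) or m.group(2) or "") if m else ""
        if "next" in rel.lower().split():
            return url
    return None
```

Extraction would keep requesting the returned URL until next_link yields None.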
Page Number Pagination
For APIs using ?page=3:
pagination:
  type: page_number
  pageParam: "page"
  pageSize: 25
  limitParam: "per_page" # Optional
Endpoint Configuration
Basic Endpoint
endpoints:
  - path: "/customers" # API path (required)
    as: "customer" # Table name (default: derived from path)
    domain: "crm" # Domain grouping (default: "default")
    responsePath: "$.data" # JSONPath to data array in response
POST Endpoint with Request Body
endpoints:
  - path: "/search"
    method: "POST"
    as: "search_results"
    domain: "catalog"
    requestBody: '{"query": "active", "filters": {"status": "published"}}'
    responsePath: "$.hits"
Custom Headers and Query Parameters
endpoints:
  - path: "/reports"
    as: "report"
    domain: "analytics"
    headers:
      X-Custom-Header: "value"
    queryParams:
      format: "detailed"
      status: "active"
Field Exclusion
endpoints:
  - path: "/users"
    as: "user"
    domain: "iam"
    excludeFields:
      - "password_hash"
      - "internal_.*" # Regex patterns supported
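The effect of excludeFields on a single record can be sketched in Python as follows. Matching each pattern against the whole field name is an assumption; this guide does not say whether Starlake uses full or partial matching, and drop_excluded is a hypothetical helper.

```python
import re

def drop_excluded(record, patterns):
    """Return a copy of record without fields whose names match any pattern."""
    compiled = [re.compile(p) for p in patterns]
    return {
        key: value
        for key, value in record.items()
        # Assumed semantics: the pattern must match the entire field name.
        if not any(c.fullmatch(key) for c in compiled)
    }
```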
Parent-Child Endpoints
Use {parent.fieldName} placeholders to fetch related data for each parent record:
endpoints:
  - path: "/orders"
    as: "order"
    domain: "sales"
    responsePath: "$.data"
    children:
      - path: "/orders/{parent.id}/items"
        as: "order_item"
        domain: "sales"
        responsePath: "$.items"
      - path: "/orders/{parent.id}/payments"
        as: "order_payment"
        domain: "sales"
For each order returned by /orders, Starlake calls /orders/{id}/items and /orders/{id}/payments with the parent order's id field substituted.
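The placeholder substitution can be illustrated with a short Python sketch. The helper expand_child_path is hypothetical, and URL-encoding the substituted value is an assumption added here for safety rather than documented Starlake behaviour.

```python
import re
from urllib.parse import quote

def expand_child_path(template, parent):
    """Fill {parent.field} placeholders in a child path from a parent record."""
    def repl(match):
        value = parent[match.group(1)]
        return quote(str(value), safe="")  # assumed: values are URL-encoded
    return re.sub(r"\{parent\.([A-Za-z_][A-Za-z0-9_]*)\}", repl, template)
```

For a parent record {"id": 42}, the template "/orders/{parent.id}/items" expands to "/orders/42/items".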
Incremental Extraction
Track changes between runs using incrementalField:
endpoints:
  - path: "/customers"
    as: "customer"
    domain: "crm"
    incrementalField: "updated_at" # Field to track
Run with the --incremental flag:
starlake extract-rest-data --config my-api --outputDir /tmp/api-data --incremental
How it works:
- First run: Extracts all data. Saves the max value of updated_at to a state file at {outputDir}/.state/crm/customer.json.
- Next run: Reads the last value from the state file and passes it as a query parameter (?updated_at=2024-01-15), so the API only returns newer records.
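The two runs described above can be sketched in Python. The state file location follows the path given in this section, but the JSON key lastValue and all function names are hypothetical; the real state file layout is not documented here.

```python
import json
from pathlib import Path

def state_path(output_dir, domain, table):
    """Location of the state file: {outputDir}/.state/{domain}/{table}.json."""
    return Path(output_dir) / ".state" / domain / f"{table}.json"

def load_last_value(output_dir, domain, table):
    """Return the saved high-water mark, or None on the first run."""
    path = state_path(output_dir, domain, table)
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("lastValue")  # hypothetical key

def save_last_value(output_dir, domain, table, records, field):
    """Persist the max value of the incremental field seen in this run."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return
    path = state_path(output_dir, domain, table)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"lastValue": max(values)}))

def incremental_params(field, last_value):
    """Query parameters to add on the next run, e.g. {"updated_at": "2024-01-15"}."""
    return {field: last_value} if last_value is not None else {}
```

Taking max over timestamp strings only works when they share one sortable format such as ISO 8601.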
Rate Limiting and Retries
Rate Limiting
rateLimit:
  requestsPerSecond: 10 # Max requests per second
Retry Configuration
retry:
  maxRetries: 5 # Default: 3
  initialBackoffMs: 2000 # Default: 1000 (doubles on each retry)
  maxBackoffMs: 60000 # Default: 30000
Starlake automatically retries on:
- HTTP 429 (Too Many Requests) -- with exponential backoff
- HTTP 5xx (Server errors) -- up to maxRetries with backoff
- Connection failures -- up to maxRetries
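To see what the retry settings imply for wait times, the Python sketch below computes a backoff schedule that doubles on each retry and is capped at maxBackoffMs. The function name is hypothetical, and whether Starlake adds jitter or honours Retry-After headers is not covered in this guide.

```python
def backoff_schedule(max_retries=3, initial_backoff_ms=1000, max_backoff_ms=30000):
    """Delay in milliseconds before each retry: doubling, capped at the maximum."""
    return [
        min(initial_backoff_ms * 2 ** attempt, max_backoff_ms)
        for attempt in range(max_retries)
    ]

# With the example configuration above (5 retries, 2000 ms initial, 60000 ms cap):
# backoff_schedule(5, 2000, 60000) -> [2000, 4000, 8000, 16000, 32000]
```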
Timeout Configuration
timeout:
  connectTimeoutMs: 15000 # Default: 30000 (30s)
  readTimeoutMs: 120000 # Default: 60000 (60s)
Proxy Support
For APIs behind corporate proxies:
proxy:
  host: "proxy.corp.example.com"
  port: 8080
  username: "{{PROXY_USER}}" # Optional
  password: "{{PROXY_PASS}}" # Optional
TLS / mTLS Configuration
For APIs requiring custom CA certificates or mutual TLS (client certificates):
tls:
  trustStorePath: "/path/to/truststore.jks"
  trustStorePassword: "{{TRUST_STORE_PASS}}"
  keyStorePath: "/path/to/keystore.jks" # For mTLS client cert
  keyStorePassword: "{{KEY_STORE_PASS}}"
For development/testing only (not recommended for production):
tls:
  insecure: true # Trust all certificates
Response Validation
Some APIs return errors inside HTTP 200 responses. Use errorPath to detect these:
endpoints:
  - path: "/data"
    as: "data"
    domain: "api"
    errorPath: "$.error" # If $.error is non-null, treat as error
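The check that errorPath enables can be sketched in Python. The resolver below handles only simple dotted paths such as $.error and is a stand-in for a real JSONPath engine; ApiError, resolve_path, and check_response are hypothetical names.

```python
class ApiError(Exception):
    """Raised when a 200 response carries an error payload (illustrative)."""

def resolve_path(body, path):
    """Resolve a simple dotted JSONPath like "$.meta.code"; None if absent."""
    node = body
    for key in [p for p in path.lstrip("$").split(".") if p]:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def check_response(body, error_path):
    """Treat the response as failed when the value at error_path is non-null."""
    error = resolve_path(body, error_path)
    if error is not None:
        raise ApiError(f"API reported an error: {error!r}")
```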
Resume on Failure
If extraction fails mid-way (e.g., network error on page 50 of 100), resume from where it stopped:
starlake extract-rest-data --config my-api --outputDir /tmp/api-data --resume
Starlake tracks the number of pages extracted per endpoint in the state file. On --resume, it skips already-extracted pages and continues from the next one.
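The resume behaviour can be pictured with the Python sketch below, which records the number of completed pages and restarts from the following one. The state key pagesExtracted, the fetch_page callback, and the use of 1-based page numbers are all assumptions made for illustration.

```python
def extract_pages(fetch_page, state, resume=False):
    """Fetch pages in order, recording progress in state after each page.

    fetch_page(n) is a hypothetical callback returning the records of
    1-based page n, or an empty list past the last page.
    """
    page = (state.get("pagesExtracted", 0) + 1) if resume else 1
    records = []
    while True:
        batch = fetch_page(page)
        if not batch:
            return records
        records.extend(batch)
        state["pagesExtracted"] = page  # hypothetical state key
        page += 1
```

If a run fails on page 3, state holds pagesExtracted = 2, so a resumed run starts at page 3 and does not refetch pages 1 and 2.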
Output Formats
CSV (default)
starlake extract-rest-data --config my-api --outputDir /tmp/api-data
Nested JSON objects are flattened using dot notation (e.g., address.city).
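A minimal Python sketch of dot-notation flattening follows. It flattens nested objects at every depth and serializes arrays as JSON strings; any depth limit Starlake may apply is not specified here, so treat the details as assumptions.

```python
import json

def flatten(record, prefix=""):
    """Flatten nested dicts into dot-notation keys suitable for CSV columns."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)  # arrays kept as JSON strings
        else:
            flat[name] = value
    return flat
```

A record such as {"id": 1, "address": {"city": "Paris"}} yields the columns id and address.city.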
JSON Lines
starlake extract-rest-data --config my-api --outputDir /tmp/api-data --outputFormat jsonl
Writes one JSON object per line, preserving the full nested structure. Better for complex/array data.
Defaults
Set default pagination, headers, and query params for all endpoints:
defaults:
  pagination:
    type: offset
    limitParam: "limit"
    offsetParam: "offset"
    pageSize: 100
  headers:
    X-API-Version: "2"
  queryParams:
    format: "json"
Individual endpoints can override any default.
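One plausible reading of the override rule is sketched below in Python: headers and query parameters are merged key by key with the endpoint winning, while an endpoint-level pagination block replaces the default one entirely. That merge granularity is an assumption, as is the function name effective_settings.

```python
def effective_settings(defaults, endpoint):
    """Combine defaults with endpoint-level settings; the endpoint wins."""
    merged = {}
    for key in ("headers", "queryParams"):
        # Assumed: per-key merge, endpoint values override default values.
        merged[key] = {**defaults.get(key, {}), **endpoint.get(key, {})}
    # Assumed: pagination is replaced as a whole when the endpoint defines it.
    merged["pagination"] = endpoint.get("pagination", defaults.get("pagination"))
    return merged
```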
XML Response Support
REST APIs returning Content-Type: application/xml are automatically parsed and converted to JSON for processing. XML elements become JSON fields, repeated elements become arrays, and attributes are prefixed with @.
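The mapping rules above can be illustrated with a short Python sketch built on the standard library's ElementTree. It is a simplified stand-in rather than Starlake's converter; in particular, the #text key used for mixed text content is an assumption.

```python
import xml.etree.ElementTree as ET

def element_to_json(elem):
    """Convert an XML element to a JSON-like structure.

    Attributes get an "@" prefix, repeated child elements become lists,
    and leaf elements become their text content.
    """
    node = {f"@{name}": value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_json(child)
        if child.tag in node:
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:
        return text
    if text:
        node["#text"] = text  # hypothetical key for text next to children
    return node
```

For example, element_to_json(ET.fromstring(xml_string)) converts a parsed document starting from its root element.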
Full Configuration Reference
| Field | Location | Description |
|---|---|---|
| baseUrl | restAPI | Base URL of the API (required) |
| auth | restAPI | Authentication config |
| auth.type | restAPI.auth | bearer, api_key, basic, oauth2_client_credentials |
| headers | restAPI | Global HTTP headers |
| rateLimit.requestsPerSecond | restAPI.rateLimit | Max requests/second (default: 10) |
| retry.maxRetries | restAPI.retry | Max retry attempts (default: 3) |
| retry.initialBackoffMs | restAPI.retry | Initial backoff ms (default: 1000) |
| retry.maxBackoffMs | restAPI.retry | Max backoff ms (default: 30000) |
| timeout.connectTimeoutMs | restAPI.timeout | Connection timeout ms (default: 30000) |
| timeout.readTimeoutMs | restAPI.timeout | Read timeout ms (default: 60000) |
| proxy.host | restAPI.proxy | Proxy hostname |
| proxy.port | restAPI.proxy | Proxy port |
| tls.trustStorePath | restAPI.tls | Path to trust store (JKS) |
| tls.keyStorePath | restAPI.tls | Path to key store for mTLS |
| tls.insecure | restAPI.tls | Trust all certs (dev only) |
| defaults.pagination | restAPI.defaults | Default pagination for all endpoints |
| defaults.headers | restAPI.defaults | Default headers for all endpoints |
| defaults.queryParams | restAPI.defaults | Default query params for all endpoints |
| endpoints[].path | restAPI.endpoints | API endpoint path (required) |
| endpoints[].method | restAPI.endpoints | GET (default) or POST |
| endpoints[].as | restAPI.endpoints | Table name override |
| endpoints[].domain | restAPI.endpoints | Domain name (default: default) |
| endpoints[].headers | restAPI.endpoints | Endpoint-specific headers |
| endpoints[].queryParams | restAPI.endpoints | Endpoint-specific query params |
| endpoints[].requestBody | restAPI.endpoints | JSON body for POST requests |
| endpoints[].pagination | restAPI.endpoints | Endpoint-specific pagination |
| endpoints[].responsePath | restAPI.endpoints | JSONPath to data array (e.g. $.data) |
| endpoints[].incrementalField | restAPI.endpoints | Field for incremental tracking |
| endpoints[].children | restAPI.endpoints | Child endpoints with {parent.field} placeholders |
| endpoints[].excludeFields | restAPI.endpoints | Regex patterns to exclude fields |
| endpoints[].errorPath | restAPI.endpoints | JSONPath to error indicator in 200 responses |
CLI Commands
| Command | Description |
|---|---|
| extract-rest-schema | Infer table schemas from API sample responses |
| extract-rest-data | Extract data to CSV (default) or JSON Lines files |
Frequently Asked Questions
What REST APIs does Starlake support?
Any REST API returning JSON or XML. You configure the base URL, authentication, pagination, and response structure in YAML.
What authentication methods are supported?
Bearer tokens, API keys (in custom headers), HTTP Basic, and OAuth2 client credentials with automatic token refresh.
What pagination strategies are available?
Offset (limit/offset), cursor (cursor from response body), Link header (RFC 8288, formerly RFC 5988), and page number.
Can I extract data incrementally?
Yes. Set incrementalField on the endpoint and run with --incremental. State is tracked in a JSON file between runs.
How are nested JSON objects handled?
Objects are flattened to dot notation in CSV (e.g., address.city). Arrays and deeply nested objects are serialized as JSON strings.
How do parent-child endpoints work?
Child endpoints use {parent.fieldName} in their path. For each parent record, the child endpoint is called with the field value substituted.