Skip to main content

REST API Data Extraction

Starlake can extract data from any REST API that returns JSON or XML. You define endpoints, authentication, pagination, and response structure in a YAML configuration file. Starlake handles the HTTP requests, pagination, rate limiting, and writes the results as CSV files ready for ingestion.

This covers both extract-rest-schema (infer table definitions) and extract-rest-data (fetch actual data).

Quick Start

1. Create the extraction config

metadata/extract/my-api.sl.yml
version: 1
extract:
restAPI:
baseUrl: "https://api.example.com/v2"
auth:
type: bearer
token: "{{API_TOKEN}}"
rateLimit:
requestsPerSecond: 10
defaults:
pagination:
type: offset
limitParam: "limit"
offsetParam: "offset"
pageSize: 100
endpoints:
- path: "/customers"
as: "customer"
domain: "crm"
responsePath: "$.data"
incrementalField: "updated_at"

2. Extract schemas (optional)

starlake extract-rest-schema --config my-api

This fetches a sample from each endpoint and generates Starlake YAML table definitions in metadata/load/.

3. Extract data

starlake extract-rest-data --config my-api --outputDir /tmp/api-data

4. Load into warehouse

starlake load

Authentication

Configure authentication in the auth section. All credential values support {{ENV_VAR}} syntax for environment variable substitution.

Bearer Token

auth:
type: bearer
token: "{{API_TOKEN}}"

API Key

auth:
type: api_key
key: "{{API_KEY}}"
header: "X-API-Key" # Header name (default: X-API-Key)

Basic Auth

auth:
type: basic
username: "{{API_USER}}"
password: "{{API_PASSWORD}}"

OAuth2 Client Credentials

auth:
type: oauth2_client_credentials
tokenUrl: "https://auth.example.com/oauth/token"
clientId: "{{OAUTH_CLIENT_ID}}"
clientSecret: "{{OAUTH_CLIENT_SECRET}}"
scope: "read:data" # Optional

Starlake automatically fetches tokens, caches them, and refreshes on expiry or 401 responses.

Pagination

Configure pagination per endpoint or set a default for all endpoints.

Offset Pagination

For APIs using ?limit=100&offset=200:

pagination:
type: offset
limitParam: "limit" # Query param for page size
offsetParam: "offset" # Query param for offset
pageSize: 100

Cursor Pagination

For APIs returning a cursor in the response body:

pagination:
type: cursor
cursorParam: "after" # Query param to pass cursor value
cursorPath: "$.meta.next_cursor" # JSONPath to extract cursor from response
pageSize: 50
limitParam: "per_page" # Optional: query param for page size

For APIs using RFC 5988 Link headers with rel="next":

pagination:
type: link_header
pageSize: 100
limitParam: "per_page" # Optional

Page Number Pagination

For APIs using ?page=3:

pagination:
type: page_number
pageParam: "page"
pageSize: 25
limitParam: "per_page" # Optional

Endpoint Configuration

Basic Endpoint

endpoints:
- path: "/customers" # API path (required)
as: "customer" # Table name (default: derived from path)
domain: "crm" # Domain grouping (default: "default")
responsePath: "$.data" # JSONPath to data array in response

POST Endpoint with Request Body

endpoints:
- path: "/search"
method: "POST"
as: "search_results"
domain: "catalog"
requestBody: '{"query": "active", "filters": {"status": "published"}}'
responsePath: "$.hits"

Custom Headers and Query Parameters

endpoints:
- path: "/reports"
as: "report"
domain: "analytics"
headers:
X-Custom-Header: "value"
queryParams:
format: "detailed"
status: "active"

Field Exclusion

endpoints:
- path: "/users"
as: "user"
domain: "iam"
excludeFields:
- "password_hash"
- "internal_.*" # Regex patterns supported

Parent-Child Endpoints

Use {parent.fieldName} placeholders to fetch related data for each parent record:

endpoints:
- path: "/orders"
as: "order"
domain: "sales"
responsePath: "$.data"
children:
- path: "/orders/{parent.id}/items"
as: "order_item"
domain: "sales"
responsePath: "$.items"
- path: "/orders/{parent.id}/payments"
as: "order_payment"
domain: "sales"

For each order returned by /orders, Starlake calls /orders/{id}/items and /orders/{id}/payments with the parent order's id field substituted.

Incremental Extraction

Track changes between runs using incrementalField:

endpoints:
- path: "/customers"
as: "customer"
domain: "crm"
incrementalField: "updated_at" # Field to track

Run with the --incremental flag:

starlake extract-rest-data --config my-api --outputDir /tmp/api-data --incremental

How it works:

  1. First run: Extracts all data. Saves the max value of updated_at to a state file at {outputDir}/.state/crm/customer.json.
  2. Next run: Reads the last value from the state file and passes it as a query parameter (?updated_at=2024-01-15), so the API only returns newer records.

Rate Limiting and Retries

Rate Limiting

rateLimit:
requestsPerSecond: 10 # Max requests per second

Automatic Retries

Starlake automatically retries on:

  • HTTP 429 (Too Many Requests) -- with exponential backoff
  • HTTP 5xx (Server errors) -- up to 3 retries with backoff
  • Connection failures -- up to 3 retries

Defaults

Set default pagination, headers, and query params for all endpoints:

defaults:
pagination:
type: offset
limitParam: "limit"
offsetParam: "offset"
pageSize: 100
headers:
X-API-Version: "2"
queryParams:
format: "json"

Individual endpoints can override any default.

Output Format

Data is written as CSV files:

/tmp/api-data/
crm/
customer-20240115103000.csv
order-20240115103000.csv
order_item-20240115103000.csv
catalog/
product-20240115103000.csv

Nested JSON objects are flattened using dot notation:

"id","name","address.city","address.country"
"1","Alice","Paris","France"

XML Response Support

REST APIs returning Content-Type: application/xml are automatically parsed and converted to JSON for processing. XML elements become JSON fields, repeated elements become arrays, and attributes are prefixed with @.

Full Configuration Reference

FieldLocationDescription
baseUrlrestAPIBase URL of the API (required)
authrestAPIAuthentication config
auth.typerestAPI.authbearer, api_key, basic, oauth2_client_credentials
headersrestAPIGlobal HTTP headers
rateLimit.requestsPerSecondrestAPI.rateLimitMax requests/second (default: 10)
defaults.paginationrestAPI.defaultsDefault pagination for all endpoints
defaults.headersrestAPI.defaultsDefault headers for all endpoints
defaults.queryParamsrestAPI.defaultsDefault query params for all endpoints
endpoints[].pathrestAPI.endpointsAPI endpoint path (required)
endpoints[].methodrestAPI.endpointsGET (default) or POST
endpoints[].asrestAPI.endpointsTable name override
endpoints[].domainrestAPI.endpointsDomain name (default: default)
endpoints[].headersrestAPI.endpointsEndpoint-specific headers
endpoints[].queryParamsrestAPI.endpointsEndpoint-specific query params
endpoints[].requestBodyrestAPI.endpointsJSON body for POST requests
endpoints[].paginationrestAPI.endpointsEndpoint-specific pagination
endpoints[].responsePathrestAPI.endpointsJSONPath to data array (e.g. $.data)
endpoints[].incrementalFieldrestAPI.endpointsField for incremental tracking
endpoints[].childrenrestAPI.endpointsChild endpoints with {parent.field} placeholders
endpoints[].excludeFieldsrestAPI.endpointsRegex patterns to exclude fields

CLI Commands

CommandDescription
extract-rest-schemaInfer table schemas from API sample responses
extract-rest-dataExtract data to CSV files

Frequently Asked Questions

What REST APIs does Starlake support?

Any REST API returning JSON or XML. You configure the base URL, authentication, pagination, and response structure in YAML.

What authentication methods are supported?

Bearer tokens, API keys (in custom headers), HTTP Basic, and OAuth2 client credentials with automatic token refresh.

What pagination strategies are available?

Offset (limit/offset), cursor (cursor from response body), Link header (RFC 5988), and page number.

Can I extract data incrementally?

Yes. Set incrementalField on the endpoint and run with --incremental. State is tracked in a JSON file between runs.

How are nested JSON objects handled?

Objects are flattened to dot notation in CSV (e.g., address.city). Arrays and deeply nested objects are serialized as JSON strings.

How do parent-child endpoints work?

Child endpoints use {parent.fieldName} in their path. For each parent record, the child endpoint is called with the field value substituted.