What is the output format of REST API extraction?

Data is extracted into CSV files organized by domain: {outputDir}/{domain}/{tableName}-{timestamp}.csv. Nested JSON objects are flattened using dot notation (e.g., address.city). The files can be loaded using starlake load.

REST API Data Extraction

Starlake can extract data from any REST API that returns JSON or XML. You define endpoints, authentication, pagination, and response structure in a YAML configuration file. Starlake handles the HTTP requests, pagination, and rate limiting, then writes the results as CSV files (the default format) ready for ingestion.

This page covers both extract-rest-schema (infer table definitions) and extract-rest-data (fetch actual data).

Quick Start

1. Create the extraction config

metadata/extract/my-api.sl.yml
version: 1
extract:
  restAPI:
    baseUrl: "https://api.example.com/v2"
    auth:
      type: bearer
      token: "{{API_TOKEN}}"
    rateLimit:
      requestsPerSecond: 10
    defaults:
      pagination:
        type: offset
        limitParam: "limit"
        offsetParam: "offset"
        pageSize: 100
    endpoints:
      - path: "/customers"
        as: "customer"
        domain: "crm"
        responsePath: "$.data"
        incrementalField: "updated_at"

2. Extract schemas (optional)

starlake extract-rest-schema --config my-api

This fetches a sample from each endpoint and generates Starlake YAML table definitions in metadata/load/.

3. Extract data

starlake extract-rest-data --config my-api --outputDir /tmp/api-data

4. Load into warehouse

starlake load

Authentication

Configure authentication in the auth section. All credential values support {{ENV_VAR}} syntax for environment variable substitution.
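
To make the {{ENV_VAR}} substitution concrete, here is a minimal Python sketch of how such placeholders could be resolved. It is not Starlake's implementation: the function name substitute_env, the accepted placeholder grammar, and the choice to raise on a missing variable are all assumptions made for illustration.

```python
import os
import re

# Matches {{NAME}} placeholders. The exact grammar Starlake accepts is an assumption.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def substitute_env(value, env=None):
    """Replace each {{NAME}} in value with env[NAME] (hypothetical helper)."""
    env = os.environ if env is None else env

    def _lookup(match):
        name = match.group(1)
        if name not in env:
            # Failing loudly on a missing variable is an assumption of this sketch.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_lookup, value)
```

For example, substitute_env("{{API_TOKEN}}", {"API_TOKEN": "abc"}) returns "abc".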

Bearer Token

auth:
  type: bearer
  token: "{{API_TOKEN}}"

API Key

auth:
  type: api_key
  key: "{{API_KEY}}"
  header: "X-API-Key"           # Header name (default: X-API-Key)

Basic Auth

auth:
  type: basic
  username: "{{API_USER}}"
  password: "{{API_PASSWORD}}"

OAuth2 Client Credentials

auth:
  type: oauth2_client_credentials
  tokenUrl: "https://auth.example.com/oauth/token"
  clientId: "{{OAUTH_CLIENT_ID}}"
  clientSecret: "{{OAUTH_CLIENT_SECRET}}"
  scope: "read:data"            # Optional

Starlake automatically fetches tokens, caches them, and refreshes on expiry or 401 responses.
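
The fetch, cache, and refresh cycle described above can be sketched in Python as follows. This is an illustrative model only: the TokenCache class, the fetch_token callable (assumed to return an access token and its lifetime in seconds), and the 30-second safety margin are hypothetical and not taken from Starlake.

```python
import time


class TokenCache:
    """Caches an access token and refetches it shortly before it expires."""

    def __init__(self, fetch_token, clock=time.monotonic, skew_seconds=30):
        # fetch_token is a hypothetical callable returning (token, expires_in_seconds).
        self._fetch_token = fetch_token
        self._clock = clock
        self._skew = skew_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return the cached token, fetching a new one if missing or near expiry."""
        if self._token is None or self._clock() >= self._expires_at - self._skew:
            self._token, expires_in = self._fetch_token()
            self._expires_at = self._clock() + expires_in
        return self._token

    def invalidate(self):
        """Drop the cached token, e.g. after a 401, so the next get() refetches."""
        self._token = None
```

A caller would use get() before each request and call invalidate() when the API answers 401, then retry with a fresh token.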

Pagination

Configure pagination per endpoint or set a default for all endpoints.

Offset Pagination

For APIs using ?limit=100&offset=200:

pagination:
  type: offset
  limitParam: "limit"           # Query param for page size
  offsetParam: "offset"         # Query param for offset
  pageSize: 100
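
As a rough illustration of what an offset pagination loop does with these settings, consider the Python sketch below. The fetch_page callable stands in for the HTTP request and is hypothetical, and the stop condition (a page shorter than pageSize) is an assumption; Starlake's actual termination rule is not documented here.

```python
def paginate_offset(fetch_page, page_size=100):
    """Yield records page by page until a short or empty page is returned.

    fetch_page(limit=..., offset=...) is a hypothetical callable that performs
    the request and returns the list of records for that page.
    """
    offset = 0
    while True:
        records = fetch_page(limit=page_size, offset=offset)
        yield from records
        if len(records) < page_size:
            # Assumed stop condition: fewer records than requested means last page.
            break
        offset += page_size
```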

Cursor Pagination

For APIs returning a cursor in the response body:

pagination:
  type: cursor
  cursorParam: "after"          # Query param to pass cursor value
  cursorPath: "$.meta.next_cursor"  # JSONPath to extract cursor from response
  pageSize: 50
  limitParam: "per_page"        # Optional: query param for page size
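
The cursor loop implied by this configuration can be sketched in Python as below. The fetch_page callable and the "data" key holding the records are hypothetical; only the cursor location ($.meta.next_cursor) and the parameter names mirror the example config. Stopping when the cursor is missing or empty is an assumption.

```python
def paginate_cursor(fetch_page, page_size=50):
    """Follow the cursor found at $.meta.next_cursor until none is returned.

    fetch_page(after=..., per_page=...) is a hypothetical callable returning
    the parsed JSON body of one page.
    """
    cursor = None
    while True:
        body = fetch_page(after=cursor, per_page=page_size)
        yield from body.get("data", [])  # "data" is an assumed record location
        cursor = (body.get("meta") or {}).get("next_cursor")
        if not cursor:
            break
```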

Link Header Pagination

For APIs using RFC 5988 (superseded by RFC 8288) Link headers with rel="next":

pagination:
  type: link_header
  pageSize: 100
  limitParam: "per_page"        # Optional
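
For readers unfamiliar with Link headers, the Python sketch below shows one simple way to pull the rel="next" URL out of such a header. It is a simplified parser written for illustration (the function name next_link is hypothetical) and does not cover every corner of the Web Linking grammar; it is not Starlake's parser.

```python
import re

# One link-value: <url> followed by its parameters up to the next "<".
_LINK_RE = re.compile(r"<([^>]+)>([^<]*)")
# The rel parameter, quoted or unquoted.
_REL_RE = re.compile(r'rel\s*=\s*"?([^";,]+)"?', re.IGNORECASE)


def next_link(header):
    """Return the URL whose rel contains "next", or None if there is none."""
    if not header:
        return None
    for url, params in _LINK_RE.findall(header):
        match = _REL_RE.search(params)
        # rel may hold several space-separated relation types.
        if match and "next" in match.group(1).lower().split():
            return url
    return None
```

Extraction would continue by requesting the returned URL and stop once next_link yields None.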

Page Number Pagination

For APIs using ?page=3:

pagination:
  type: page_number
  pageParam: "page"
  pageSize: 25
  limitParam: "per_page"        # Optional

Endpoint Configuration

Basic Endpoint

endpoints:
  - path: "/customers"          # API path (required)
    as: "customer"              # Table name (default: derived from path)
    domain: "crm"               # Domain grouping (default: "default")
    responsePath: "$.data"      # JSONPath to data array in response

POST Endpoint with Request Body

endpoints:
  - path: "/search"
    method: "POST"
    as: "search_results"
    domain: "catalog"
    requestBody: '{"query": "active", "filters": {"status": "published"}}'
    responsePath: "$.hits"

Custom Headers and Query Parameters

endpoints:
  - path: "/reports"
    as: "report"
    domain: "analytics"
    headers:
      X-Custom-Header: "value"
    queryParams:
      format: "detailed"
      status: "active"

Field Exclusion

endpoints:
  - path: "/users"
    as: "user"
    domain: "iam"
    excludeFields:
      - "password_hash"
      - "internal_.*"           # Regex patterns supported
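
The effect of excludeFields on a single record can be illustrated with the Python sketch below. Whether Starlake matches a pattern against the whole field name or any part of it is not stated above; this sketch assumes a full match, and the helper name drop_excluded is hypothetical.

```python
import re


def drop_excluded(record, patterns):
    """Return a copy of record without fields whose name fully matches a pattern."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return {
        name: value
        for name, value in record.items()
        # Full-match semantics are an assumption of this sketch.
        if not any(regex.fullmatch(name) for regex in compiled)
    }
```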

Parent-Child Endpoints

Use {parent.fieldName} placeholders to fetch related data for each parent record:

endpoints:
  - path: "/orders"
    as: "order"
    domain: "sales"
    responsePath: "$.data"
    children:
      - path: "/orders/{parent.id}/items"
        as: "order_item"
        domain: "sales"
        responsePath: "$.items"
      - path: "/orders/{parent.id}/payments"
        as: "order_payment"
        domain: "sales"

For each order returned by /orders, Starlake calls /orders/{id}/items and /orders/{id}/payments with the parent order's id field substituted.
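
The placeholder substitution can be sketched in Python as follows. The helper name child_paths is hypothetical, and URL-encoding the substituted value is an assumption of this sketch rather than documented Starlake behavior.

```python
import re
from urllib.parse import quote

# Matches {parent.fieldName} placeholders in a child path.
_PARENT = re.compile(r"\{parent\.([A-Za-z0-9_]+)\}")


def child_paths(template, parents):
    """Build one child path per parent record by filling in {parent.field}."""
    paths = []
    for parent in parents:
        path = _PARENT.sub(
            # Percent-encoding the value is an assumption made for safety.
            lambda match: quote(str(parent[match.group(1)]), safe=""),
            template,
        )
        paths.append(path)
    return paths
```

With two parent orders whose ids are 7 and 8, child_paths("/orders/{parent.id}/items", ...) yields "/orders/7/items" and "/orders/8/items".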

Incremental Extraction

Track changes between runs using incrementalField:

endpoints:
  - path: "/customers"
    as: "customer"
    domain: "crm"
    incrementalField: "updated_at"  # Field to track

Run with the --incremental flag:

starlake extract-rest-data --config my-api --outputDir /tmp/api-data --incremental

How it works:

First run: Extracts all data. Saves the max value of updated_at to a state file at {outputDir}/.state/crm/customer.json.
Next run: Reads the last value from the state file and passes it as a query parameter (?updated_at=2024-01-15). Provided the API filters on that parameter, it returns only newer records.
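
The two steps above can be modeled with a short Python sketch. The state file location follows the path given above, but the JSON layout (a single "lastValue" key) and the helper names are hypothetical; the real state file format is not documented here. Values are compared as strings, which is only correct for consistently formatted ISO timestamps.

```python
import json
from pathlib import Path


def state_path(output_dir, domain, table):
    """Location of the state file: {outputDir}/.state/{domain}/{table}.json."""
    return Path(output_dir) / ".state" / domain / f"{table}.json"


def load_last_value(path):
    """Return the value saved by the previous run, or None on the first run."""
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("lastValue")  # assumed key name


def save_max_value(path, records, field, previous=None):
    """Persist the max of field across records (and the previous value)."""
    values = [r[field] for r in records if r.get(field) is not None]
    if previous is not None:
        values.append(previous)
    if not values:
        return None
    newest = max(values)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"lastValue": newest}))
    return newest
```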

Rate Limiting and Retries

Rate Limiting

rateLimit:
  requestsPerSecond: 10         # Max requests per second
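
One common way to enforce a requests-per-second cap is to space requests at a minimum interval. The Python sketch below shows that technique; it is not necessarily the algorithm Starlake uses, and the RateLimiter class with its injectable clock and sleep functions is hypothetical.

```python
import time


class RateLimiter:
    """Spaces calls at least 1/requests_per_second seconds apart."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def acquire(self):
        """Block until the next request is allowed, then reserve the next slot."""
        now = self._clock()
        wait = self._next_allowed - now
        if wait > 0:
            self._sleep(wait)
            now += wait
        self._next_allowed = now + self._interval
```

With requestsPerSecond: 10, consecutive calls to acquire() are spaced roughly 100 ms apart.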

Retry Configuration

retry:
  maxRetries: 5                  # Default: 3
  initialBackoffMs: 2000         # Default: 1000 (doubles on each retry)
  maxBackoffMs: 60000            # Default: 30000

Starlake automatically retries on:

HTTP 429 (Too Many Requests) -- with exponential backoff
HTTP 5xx (Server errors) -- up to maxRetries with backoff
Connection failures -- up to maxRetries
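
To see what the retry settings mean in practice, the Python sketch below computes the wait before each retry, assuming the delay starts at initialBackoffMs, doubles on each retry, and is capped at maxBackoffMs. The function name is hypothetical, and any jitter Starlake may add is not modeled.

```python
def backoff_delays_ms(max_retries=3, initial_backoff_ms=1000, max_backoff_ms=30000):
    """Return the delay before each retry: doubling, capped at max_backoff_ms."""
    return [
        min(initial_backoff_ms * 2 ** attempt, max_backoff_ms)
        for attempt in range(max_retries)
    ]
```

Under these assumptions the defaults give waits of 1000, 2000, and 4000 ms, and the example configuration above (maxRetries: 5, initialBackoffMs: 2000, maxBackoffMs: 60000) gives 2000, 4000, 8000, 16000, and 32000 ms.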

Timeout Configuration

timeout:
  connectTimeoutMs: 15000        # Default: 30000 (30s)
  readTimeoutMs: 120000          # Default: 60000 (60s)

Proxy Support

For APIs behind corporate proxies:

proxy:
  host: "proxy.corp.example.com"
  port: 8080
  username: "{{PROXY_USER}}"     # Optional
  password: "{{PROXY_PASS}}"     # Optional

TLS / mTLS Configuration

For APIs requiring custom CA certificates or mutual TLS (client certificates):

tls:
  trustStorePath: "/path/to/truststore.jks"
  trustStorePassword: "{{TRUST_STORE_PASS}}"
  keyStorePath: "/path/to/keystore.jks"       # For mTLS client cert
  keyStorePassword: "{{KEY_STORE_PASS}}"

For development/testing only (not recommended for production):

tls:
  insecure: true                 # Trust all certificates

Response Validation

Some APIs return errors inside HTTP 200 responses. Use errorPath to detect these:

endpoints:
  - path: "/data"
    as: "data"
    domain: "api"
    errorPath: "$.error"          # If $.error is non-null, treat as error
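
The check that errorPath performs can be approximated in Python as shown below. The resolver handles only simple dotted paths such as $.error or $.meta.status, which is a deliberate simplification of JSONPath, and both helper names are hypothetical.

```python
def resolve_path(document, json_path):
    """Resolve a simple dotted JSONPath like $.a.b; return None if absent."""
    if json_path == "$":
        return document
    if not json_path.startswith("$."):
        raise ValueError(f"unsupported path in this sketch: {json_path}")
    current = document
    for key in json_path[2:].split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def check_response(body, error_path):
    """Raise if the value at error_path is non-null, otherwise return body."""
    error = resolve_path(body, error_path)
    if error is not None:
        raise RuntimeError(f"API reported an error in a 200 response: {error}")
    return body
```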

Resume on Failure

If extraction fails mid-way (e.g., network error on page 50 of 100), resume from where it stopped:

starlake extract-rest-data --config my-api --outputDir /tmp/api-data --resume

Starlake tracks the number of pages extracted per endpoint in the state file. On --resume, it skips already-extracted pages and continues from the next one.

Output Formats

CSV (default)

starlake extract-rest-data --config my-api --outputDir /tmp/api-data

Nested JSON objects are flattened using dot notation (e.g., address.city).
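
Dot-notation flattening can be sketched in a few lines of Python. The helper name flatten is hypothetical. Following the FAQ on this page, arrays are serialized as JSON strings; the sketch recurses through every nested object and does not model any depth limit Starlake may apply.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested objects into dot-notation keys; arrays become JSON strings."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat
```

For instance, {"id": 1, "address": {"city": "Paris"}} becomes {"id": 1, "address.city": "Paris"}.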

JSON Lines

starlake extract-rest-data --config my-api --outputDir /tmp/api-data --outputFormat jsonl

Writes one JSON object per line, preserving the full nested structure. This format suits complex or array-heavy data better than CSV.

Defaults

Set default pagination, headers, and query params for all endpoints:

defaults:
  pagination:
    type: offset
    limitParam: "limit"
    offsetParam: "offset"
    pageSize: 100
  headers:
    X-API-Version: "2"
  queryParams:
    format: "json"

Individual endpoints can override any default.
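
One plausible reading of the override rule is sketched below in Python. It assumes that headers and queryParams are merged key by key, with the endpoint winning on conflicts, and that an endpoint-level pagination block replaces the default as a whole. Neither detail is confirmed above, and the helper name effective_endpoint is hypothetical.

```python
def effective_endpoint(defaults, endpoint):
    """Combine defaults with an endpoint config; endpoint values win."""
    merged = dict(endpoint)
    for key in ("headers", "queryParams"):
        # Assumed: map-valued settings merge per key rather than replace wholesale.
        merged[key] = {**defaults.get(key, {}), **endpoint.get(key, {})}
    if "pagination" not in endpoint and "pagination" in defaults:
        # Assumed: pagination is taken as a whole block, not merged field by field.
        merged["pagination"] = defaults["pagination"]
    return merged
```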

XML Response Support

REST APIs returning Content-Type: application/xml are automatically parsed and converted to JSON for processing. XML elements become JSON fields, repeated elements become arrays, and attributes are prefixed with @.
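
The conversion rules above can be illustrated with Python's standard xml.etree.ElementTree module. This is a sketch of the described mapping, not Starlake's converter: the function name element_to_json is hypothetical, and storing mixed text under a "#text" key is an assumption.

```python
import xml.etree.ElementTree as ET


def element_to_json(element):
    """Convert an XML element to JSON-like data.

    Attributes get an "@" prefix, child elements become fields, and repeated
    child elements are collected into a list.
    """
    node = {f"@{name}": value for name, value in element.attrib.items()}
    for child in element:
        value = element_to_json(child)
        if child.tag in node:
            existing = node[child.tag]
            if isinstance(existing, list):
                existing.append(value)
            else:
                node[child.tag] = [existing, value]
        else:
            node[child.tag] = value
    text = (element.text or "").strip()
    if not node:
        return text  # leaf element: just its text content
    if text:
        node["#text"] = text  # assumed key for text next to attributes/children
    return node
```

Parsing a document with ET.fromstring(...) and passing the root to element_to_json gives a structure that can then be processed like a JSON response.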

Full Configuration Reference

Field	Location	Description
`baseUrl`	`restAPI`	Base URL of the API (required)
`auth`	`restAPI`	Authentication config
`auth.type`	`restAPI.auth`	`bearer`, `api_key`, `basic`, `oauth2_client_credentials`
`headers`	`restAPI`	Global HTTP headers
`rateLimit.requestsPerSecond`	`restAPI.rateLimit`	Max requests/second (default: 10)
`retry.maxRetries`	`restAPI.retry`	Max retry attempts (default: 3)
`retry.initialBackoffMs`	`restAPI.retry`	Initial backoff ms (default: 1000)
`retry.maxBackoffMs`	`restAPI.retry`	Max backoff ms (default: 30000)
`timeout.connectTimeoutMs`	`restAPI.timeout`	Connection timeout ms (default: 30000)
`timeout.readTimeoutMs`	`restAPI.timeout`	Read timeout ms (default: 60000)
`proxy.host`	`restAPI.proxy`	Proxy hostname
`proxy.port`	`restAPI.proxy`	Proxy port
`tls.trustStorePath`	`restAPI.tls`	Path to trust store (JKS)
`tls.keyStorePath`	`restAPI.tls`	Path to key store for mTLS
`tls.insecure`	`restAPI.tls`	Trust all certs (dev only)
`defaults.pagination`	`restAPI.defaults`	Default pagination for all endpoints
`defaults.headers`	`restAPI.defaults`	Default headers for all endpoints
`defaults.queryParams`	`restAPI.defaults`	Default query params for all endpoints
`endpoints[].path`	`restAPI.endpoints`	API endpoint path (required)
`endpoints[].method`	`restAPI.endpoints`	`GET` (default) or `POST`
`endpoints[].as`	`restAPI.endpoints`	Table name override
`endpoints[].domain`	`restAPI.endpoints`	Domain name (default: `default`)
`endpoints[].headers`	`restAPI.endpoints`	Endpoint-specific headers
`endpoints[].queryParams`	`restAPI.endpoints`	Endpoint-specific query params
`endpoints[].requestBody`	`restAPI.endpoints`	JSON body for POST requests
`endpoints[].pagination`	`restAPI.endpoints`	Endpoint-specific pagination
`endpoints[].responsePath`	`restAPI.endpoints`	JSONPath to data array (e.g. `$.data`)
`endpoints[].incrementalField`	`restAPI.endpoints`	Field for incremental tracking
`endpoints[].children`	`restAPI.endpoints`	Child endpoints with `{parent.field}` placeholders
`endpoints[].excludeFields`	`restAPI.endpoints`	Regex patterns to exclude fields
`endpoints[].errorPath`	`restAPI.endpoints`	JSONPath to error indicator in 200 responses

CLI Commands

Command	Description
`extract-rest-schema`	Infer table schemas from API sample responses
`extract-rest-data`	Extract data to CSV (default) or JSON Lines files

Frequently Asked Questions

What REST APIs does Starlake support?

Any REST API returning JSON or XML. You configure the base URL, authentication, pagination, and response structure in YAML.

What authentication methods are supported?

Bearer tokens, API keys (in custom headers), HTTP Basic, and OAuth2 client credentials with automatic token refresh.

What pagination strategies are available?

Offset (limit/offset), cursor (cursor from response body), Link header (RFC 5988), and page number.

Can I extract data incrementally?

Yes. Set incrementalField on the endpoint and run with --incremental. State is tracked in a JSON file between runs.

How are nested JSON objects handled?

Objects are flattened to dot notation in CSV (e.g., address.city). Arrays and deeply nested objects are serialized as JSON strings.

How do parent-child endpoints work?

Child endpoints use {parent.fieldName} in their path. For each parent record, the child endpoint is called with the field value substituted.
