Ingestion & Loading

9 skills covering every aspect of data ingestion — from auto-inferring schemas to loading into BigQuery, Snowflake, Elasticsearch, and Kafka.

Skills

autoload

Auto-infer schemas and load data in a single step. Scans incoming files, infers column types, and creates load configurations automatically.

You: /autoload Infer schema from my CSV files in the incoming/customers/ directory

Key features:

  • Automatic type inference from file samples
  • Support for CSV, JSON, XML, and Parquet
  • Generates both domain and table YAML configurations
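As a sketch of what autoload produces, the inferred table configuration has the same shape as the hand-written ones in this page. For a customers CSV it might look like the following (the column names and types here are illustrative, not real inference output):

```yaml
# Illustrative only: the shape of a table config autoload could infer
# from sampled CSV files. Actual names and types come from your data.
table:
  name: customers
  pattern: "customers-.*.csv"
  metadata:
    format: DSV
    withHeader: true
    separator: ","
  attributes:
    - name: customer_id
      type: long
    - name: email
      type: string
```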

load

Load data from the pending area into your target warehouse. The primary ingestion skill with comprehensive write strategy support.

You: /load Configure loading JSON files into BigQuery with UPSERT_BY_KEY strategy

Write strategies:

Strategy                       Description
APPEND                         Add new records without deduplication
OVERWRITE                      Replace all existing data
UPSERT_BY_KEY                  Update by primary key, insert new
UPSERT_BY_KEY_AND_TIMESTAMP    Update by key + timestamp for SCD
SCD2                           Slowly changing dimension type 2
DELETE_THEN_INSERT             Delete matching records, then insert

Supported sinks: BigQuery, Snowflake, DuckDB, PostgreSQL, Redshift, Databricks, Elasticsearch, Kafka
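The /load example above translates into a table-level write strategy. A minimal sketch, assuming a JSON source; the table, column, and connection names ("my-bigquery") are placeholders, not values the skill requires:

```yaml
# Sketch only: loading JSON files into BigQuery with UPSERT_BY_KEY.
# Names below are illustrative placeholders.
table:
  name: orders
  pattern: "orders-.*.json"        # files in the pending area matching this regex
  metadata:
    format: JSON
    writeStrategy:
      type: UPSERT_BY_KEY          # update rows matching the key, insert the rest
      key: [order_id]
    sink:
      connectionRef: my-bigquery
  attributes:
    - name: order_id
      type: long
      required: true
    - name: amount
      type: double
```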


cnxload

Load files directly into JDBC tables. Bypasses the standard staging pipeline for direct file-to-database loading.

You: /cnxload Load a CSV file directly into my PostgreSQL customers table
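cnxload resolves its target database through a named JDBC connection. A minimal sketch of such a connection, assuming a connections block in your Starlake configuration; the connection name, URL, and credentials are placeholders:

```yaml
# Sketch: a JDBC connection cnxload could load into.
# URL and user are placeholders; the password is read from an
# environment variable rather than stored in the file.
connections:
  my-postgres:
    type: jdbc
    options:
      url: "jdbc:postgresql://localhost:5432/mydb"
      user: "ingest_user"
      password: "${POSTGRES_PASSWORD}"
```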

esload

Load data into Elasticsearch indices. Configure index mappings, document IDs, and bulk loading parameters.

You: /esload Configure loading product data into an Elasticsearch index

index

Alias for esload. The starlake index command is identical to starlake esload — all options and behavior are the same. See esload for full details.


ingest

Generic data ingestion skill. Covers the overall ingestion pipeline from landing to accepted/rejected areas.

You: /ingest Walk me through the full ingestion pipeline for XML files

kafkaload

Load data to/from Apache Kafka topics. Configure producers, consumers, serialization, and topic management.

You: /kafkaload Set up Kafka ingestion for real-time event data

Capabilities:

  • Topic-to-table loading
  • Table-to-topic offloading
  • Avro/JSON serialization
  • Consumer group management

preload

Check the landing area for incoming files before processing. Validates file presence, naming conventions, and readiness.

You: /preload Configure landing area checks for my daily CSV deliveries

stage

Move files from landing to pending area. Handles file organization, renaming, and preparation for the load step.

You: /stage Set up staging rules for files arriving in GCS

Common Configuration Example

A typical ingestion setup, split across a domain configuration file and a table configuration file:

# metadata/load/customers/_config.sl.yml
load:
  metadata:
    mode: FILE
    format: DSV
    withHeader: true
    separator: ","
    encoding: UTF-8
    multiline: false
    writeStrategy:
      type: UPSERT_BY_KEY_AND_TIMESTAMP
      key: [customer_id]
      timestamp: updated_at
    sink:
      connectionRef: my-bigquery
      partition:
        - ingestion_date

# metadata/load/customers/customers.sl.yml
table:
  name: customers
  pattern: "customers-.*.csv"
  attributes:
    - name: customer_id
      type: long
      required: true
      privacy: NONE
    - name: email
      type: string
      required: true
      privacy: SHA256
    - name: name
      type: string
    - name: created_at
      type: timestamp
    - name: updated_at
      type: timestamp