
Create a New Starlake Project with Bootstrap

The starlake bootstrap command creates a new Starlake project with a standard directory structure. It generates the metadata/ folder containing application.sl.yml (project configuration), environment override files, type mappings, and sample data in datasets/incoming/. The sample data includes JSON, CSV, and XML files so you can test loading different formats.

Run the Bootstrap Command

Create an empty directory and run starlake bootstrap; the command generates the full project structure there. By default, Starlake uses the current working directory as the project root. To use a different location, set the SL_ROOT environment variable.

mkdir $HOME/userguide
cd $HOME/userguide
starlake bootstrap

To bootstrap in a custom directory:

SL_ROOT=/my/other/location starlake bootstrap

Default Starlake Project Directory Structure

The bootstrap command creates the following hierarchy:

.
├── metadata
│   ├── application.sl.yml   # project configuration
│   ├── env.sl.yml           # variables used in the project with their default values
│   ├── env.BQ.sl.yml        # variables overridden for a BigQuery connection
│   ├── env.DUCKDB.sl.yml    # variables overridden for a DuckDB connection
│   ├── expectations
│   │   └── default.sl.yml   # expectations macros
│   ├── extract
│   ├── load
│   ├── transform
│   └── types
│       └── default.sl.yml   # types mapping
└── datasets                 # sample incoming data for this user guide
    └── incoming
        └── starbake
            ├── order_202403011414.json
            ├── order_line_202403011415.csv
            └── product.xml

Key directories:

  • metadata/: Contains all configuration files for extract, load, and transform pipelines
  • metadata/expectations/: Contains data validation rules applied during load and transform
  • datasets/incoming/: Contains files to be loaded into your data warehouse. The sample data uses the Starbake project (a bakery management demo)

Configure Your Data Warehouse Connection in application.sl.yml

The metadata/application.sl.yml file is the main project configuration. It defines:

  • The list of database connections (DuckDB, BigQuery, Snowflake, etc.)
  • The active connection reference (connectionRef)
  • Audit sink configuration
  • Environment-specific overrides

Here is the default configuration:

metadata/application.sl.yml
application:
  connectionRef: "{{activeConnection}}"

  audit:
    sink:
      connectionRef: "{{activeConnection}}"

  connections:
    sparkLocal:
      type: "fs"    # Connection to local file system (delta files)
    duckdb:
      type: "jdbc"  # Connection to DuckDB
      options:
        url: "jdbc:duckdb:{{SL_ROOT}}/datasets/duckdb.db"  # Location of the DuckDB database
        driver: "org.duckdb.DuckDBDriver"
    bigquery:
      type: "bigquery"
      options:
        location: europe-west1
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform"
        writeMethod: "direct"

The connectionRef uses a variable ({{activeConnection}}) that is resolved from the environment files. Each environment file sets this variable to point to the appropriate connection.
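The substitution behaves like simple template interpolation. Here is a minimal Python sketch of how a `{{activeConnection}}` placeholder resolves against an environment file's variables; this is an illustration of the mechanism, not Starlake's actual implementation:

```python
import re

def resolve(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders with values from the environment file."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: variables[m.group(1)], template)

# An environment file such as env.DUCKDB.sl.yml would set
# activeConnection to "duckdb" (illustrative values).
print(resolve('connectionRef: "{{activeConnection}}"', {"activeConnection": "duckdb"}))
```

With `activeConnection` set to `duckdb`, the resolved line becomes `connectionRef: "duckdb"`, which points at the `duckdb` entry under `connections`.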

Switch Between DuckDB and BigQuery Environments

The files env.DUCKDB.sl.yml and env.BQ.sl.yml override default variable values for their respective connections. Set the SL_ENV environment variable to select the active environment:

SL_ENV=DUCKDB starlake <command>

Next Steps: Load, Transform, and Orchestrate Data

With the project created, follow these guides:

  1. Load data into your warehouse
  2. Transform data for analysis
  3. Run transformations from CLI and Airflow
  4. Generate project documentation

The tutorials use the Starbake sample project. Starbake is a bakery management demo that ships with the bootstrap command and demonstrates Starlake's load, transform, and orchestration features.

Starbake Architecture

Frequently Asked Questions

What does starlake bootstrap do?

The starlake bootstrap command creates a new Starlake project with a default directory structure. It generates the metadata/ folder with configuration files, type mappings, and sample data in datasets/incoming/.

What is the default Starlake project structure?

A bootstrapped project contains: metadata/application.sl.yml (project configuration), metadata/env.sl.yml (default variables), environment-specific overrides (env.BQ.sl.yml, env.DUCKDB.sl.yml), and directories for extract, load, transform configurations. Sample data is placed in datasets/incoming/.

How do I switch between DuckDB and BigQuery in Starlake?

Set the SL_ENV environment variable. Use SL_ENV=DUCKDB starlake <command> for DuckDB or SL_ENV=BQ starlake <command> for BigQuery. Each environment has its own override file (env.DUCKDB.sl.yml, env.BQ.sl.yml).

What is the application.sl.yml file?

It is the main project configuration file in Starlake. It defines the list of database connections, the active connection reference, audit configuration, and environment-specific overrides.

Can I bootstrap a Starlake project in a custom directory?

Yes. Set the SL_ROOT environment variable to the desired path: SL_ROOT=/my/other/location starlake bootstrap.

How do I run Starlake bootstrap with Docker?

Create a directory, then run: docker run -v $(pwd):/app/userguide -e SL_ROOT=/app/userguide -it starlakeai/starlake:VERSION bootstrap.