
Create a New Starlake Project with Bootstrap

The starlake bootstrap command creates a new Starlake project with a standard directory structure. It generates the metadata/ folder containing application.sl.yml (project configuration), environment override files, type mappings, and sample data in datasets/incoming/. The sample data includes JSON, CSV, and XML files so you can test loading different formats.

Run the Bootstrap Command

Create an empty directory and run starlake bootstrap; the command generates the full project structure there. By default, Starlake uses the current working directory as the project root. To use a different location, set the SL_ROOT environment variable.

mkdir $HOME/userguide
cd $HOME/userguide
starlake bootstrap

To bootstrap in a custom directory:

SL_ROOT=/my/other/location starlake bootstrap

Default Starlake Project Directory Structure

The bootstrap command creates the following hierarchy:

.
├── metadata
│   ├── application.sl.yml   # project configuration
│   ├── env.sl.yml           # variables used in the project with their default values
│   ├── env.BQ.sl.yml        # variables overridden for a BigQuery connection
│   ├── env.DUCKDB.sl.yml    # variables overridden for a DuckDB connection
│   ├── expectations
│   │   └── default.sl.yml   # expectations macros
│   ├── extract
│   ├── load
│   ├── transform
│   └── types
│       └── default.sl.yml   # types mapping
└── datasets                 # sample incoming data for this user guide
    └── incoming
        └── starbake
            ├── order_202403011414.json
            ├── order_line_202403011415.csv
            └── product.xml

Key directories:

  • metadata/: Contains all configuration files for extract, load, and transform pipelines
  • metadata/expectations/: Contains data validation rules applied during load and transform
  • datasets/incoming/: Contains files to be loaded into your data warehouse. The sample data uses the Starbake project (a bakery management demo)

Configure Your Data Warehouse Connection in application.sl.yml

The metadata/application.sl.yml file is the main project configuration. It defines:

  • The list of database connections (DuckDB, BigQuery, Snowflake, etc.)
  • The active connection reference (connectionRef)
  • Audit sink configuration
  • Environment-specific overrides

Here is the default configuration:

metadata/application.sl.yml
application:
  connectionRef: "{{activeConnection}}"

  audit:
    sink:
      connectionRef: "{{activeConnection}}"

  connections:
    sparkLocal:
      type: "fs"    # Connection to local file system (delta files)
    duckdb:
      type: "jdbc"  # Connection to DuckDB
      options:
        url: "jdbc:duckdb:{{SL_ROOT}}/datasets/duckdb.db"  # Location of the DuckDB database
        driver: "org.duckdb.DuckDBDriver"
    bigquery:
      type: "bigquery"
      options:
        location: europe-west1
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform"
        writeMethod: "direct"

The connectionRef uses a variable ({{activeConnection}}) that is resolved from the environment files. Each environment file sets this variable to point to the appropriate connection.
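The substitution behaves like simple template interpolation. Here is a minimal Python sketch of how a `{{activeConnection}}` placeholder resolves against an environment file's variables; this is an illustration of the mechanism, not Starlake's actual implementation:

```python
import re

def resolve(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders with values from the environment file."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: variables[m.group(1)], template)

# An environment file such as env.DUCKDB.sl.yml would set
# activeConnection to "duckdb" (illustrative values).
print(resolve('connectionRef: "{{activeConnection}}"', {"activeConnection": "duckdb"}))
```

With `activeConnection` set to `duckdb`, the resolved line becomes `connectionRef: "duckdb"`, which points at the `duckdb` entry under `connections`.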

Switch Between DuckDB and BigQuery Environments

The files env.DUCKDB.sl.yml and env.BQ.sl.yml override default variable values for their respective connections. Set the SL_ENV environment variable to select the active environment:

SL_ENV=DUCKDB starlake <command>

Next Steps: Load, Transform, and Orchestrate Data

With the project created, follow these guides:

  1. Load data into your warehouse
  2. Transform data for analysis
  3. Run transformations from CLI and Airflow
  4. Generate project documentation

The tutorials use the Starbake sample project. Starbake is a bakery management demo that ships with the bootstrap command and demonstrates Starlake's load, transform, and orchestration features.

Starbake Architecture

Frequently Asked Questions

What does starlake bootstrap do?

The starlake bootstrap command creates a new Starlake project with a default directory structure. It generates the metadata/ folder with configuration files, type mappings, and sample data in datasets/incoming/.

What is the default Starlake project structure?

A bootstrapped project contains: metadata/application.sl.yml (project configuration), metadata/env.sl.yml (default variables), environment-specific overrides (env.BQ.sl.yml, env.DUCKDB.sl.yml), and directories for extract, load, transform configurations. Sample data is placed in datasets/incoming/.

How do I switch between DuckDB and BigQuery in Starlake?

Set the SL_ENV environment variable. Use SL_ENV=DUCKDB starlake <command> for DuckDB or SL_ENV=BQ starlake <command> for BigQuery. Each environment has its own override file (env.DUCKDB.sl.yml, env.BQ.sl.yml).

What is the application.sl.yml file?

It is the main project configuration file in Starlake. It defines the list of database connections, the active connection reference, audit configuration, and environment-specific overrides.

Can I bootstrap a Starlake project in a custom directory?

Yes. Set the SL_ROOT environment variable to the desired path: SL_ROOT=/my/other/location starlake bootstrap.

How do I run Starlake bootstrap with Docker?

Create a directory, then run: docker run -v $(pwd):/app/userguide -e SL_ROOT=/app/userguide -it starlakeai/starlake:VERSION bootstrap.