Create a project

Select a project template

To create a new project, first create an empty folder and run the starlake bootstrap CLI command:

$ mkdir $HOME/userguide
$ cd $HOME/userguide
$ docker run -v `pwd`:/app/userguide -e SL_ROOT=/app/userguide -it starlakeai/starlake:VERSION bootstrap
note

By default, the project is created in the current working directory (mounted as SL_ROOT in the command above). To bootstrap the project in a different folder, point the SL_ROOT environment variable at that folder.

Project Structure

Starlake will create a default project hierarchy that enables you to start extracting, loading, transforming and orchestrating your data pipelines:

.
├── metadata
│   ├── application.sl.yml      # project configuration
│   ├── env.sl.yml              # variables used in the project with their default values
│   ├── env.BQ.sl.yml           # variables overridden for a BigQuery connection
│   ├── env.DUCKDB.sl.yml       # variables overridden for a DuckDB connection
│   ├── expectations
│   │   └── default.sl.yml      # expectation macros
│   ├── extract
│   ├── load
│   ├── transform
│   └── types
│       └── default.sl.yml      # type mappings
└── datasets                    # sample incoming data for this user guide
    └── incoming
        └── starbake
            ├── order_202403011414.json
            ├── order_line_202403011415.csv
            └── product.xml

Key directories:

  • incoming: Contains files to be loaded into your warehouse
  • metadata: Contains extract, load and transform configuration files
  • expectations: Contains data validation rules for loaded/transformed data
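
To give a sense of what goes under metadata/load, a table configuration typically declares the file pattern and attributes of an incoming file. The snippet below is only a hedged sketch based on the sample order_line CSV file, with a hypothetical path and hypothetical attribute names; the actual schema keys and the domain/table layout are covered in the load step of this guide.

metadata/load/starbake/order_line.sl.yml (illustrative sketch only)
table:
  name: "order_line"
  pattern: "order_line_.*.csv"   # incoming files matching this pattern feed the table
  metadata:
    format: "DSV"                # delimiter-separated values
    separator: ","
    withHeader: true
  attributes:                    # hypothetical columns, for illustration only
    - name: "order_id"
      type: "long"
    - name: "product_id"
      type: "long"
    - name: "quantity"
      type: "int"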

Configure Your Data Warehouse Connection

The project configuration is stored in metadata/application.sl.yml. This file contains:

  • Project version
  • List of connections to different data sinks
  • Active connection reference
  • Environment-specific configuration overrides

Here's an example configuration:

metadata/application.sl.yml
application:
  connectionRef: "{{activeConnection}}"

  audit:
    sink:
      connectionRef: "{{activeConnection}}"

  connections:
    sparkLocal:
      type: "fs" # Connection to local file system (delta files)
    duckdb:
      type: "jdbc" # Connection to DuckDB
      options:
        url: "jdbc:duckdb:{{SL_ROOT}}/datasets/duckdb.db" # Location of the DuckDB database
        driver: "org.duckdb.DuckDBDriver"
    bigquery:
      type: "bigquery"
      options:
        location: europe-west1
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform"
        writeMethod: "direct"

The files env.DUCKDB.sl.yml and env.BQ.sl.yml override default values for DuckDB and BigQuery connections. Set the SL_ENV environment variable to switch between environments:

$ docker run -v `pwd`:/app/userguide \
-e SL_ROOT=/app/userguide \
-e SL_ENV=DUCKDB \
-it starlakeai/starlake:VERSION <command>
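
For reference, an environment file simply assigns values to the variables referenced in the project configuration. The content below is a hedged sketch assuming env.DUCKDB.sl.yml only needs to set the activeConnection variable used in application.sl.yml; the file bootstrapped in your project may define additional variables.

metadata/env.DUCKDB.sl.yml (illustrative sketch only)
env:
  activeConnection: duckdb   # makes connectionRef resolve to the duckdb connection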

Next Steps

You're now ready to start working with Starlake! The next steps are:

  1. Load data into your warehouse
  2. Transform data for analysis
  3. Run transformations from CLI and Airflow
  4. Generate project documentation
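
Each of these steps reuses the docker invocation shown above, with <command> replaced by the corresponding Starlake command. As a rough example, loading the sample files against the DuckDB environment would look like this (the exact commands are introduced in the following sections):

$ docker run -v `pwd`:/app/userguide \
-e SL_ROOT=/app/userguide \
-e SL_ENV=DUCKDB \
-it starlakeai/starlake:VERSION load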

We'll use the Starbake sample project to demonstrate these capabilities:

[Starbake Architecture diagram]