Create a project
Select a project template
To create a new project, first create an empty folder and run the starlake bootstrap CLI command:
- Docker
- Linux/MacOS
- Windows
$ mkdir $HOME/userguide
$ cd $HOME/userguide
$ docker run -v `pwd`:/app/userguide -e SL_ROOT=/app/userguide -it starlakeai/starlake:VERSION bootstrap
$ mkdir $HOME/userguide
$ cd $HOME/userguide
$ starlake bootstrap
c:\> mkdir c:\userguide
c:\> cd c:\userguide
c:\> starlake bootstrap
By default, the project will be created in the current working directory. To bootstrap the project in a different folder, set the SL_ROOT environment variable:
- Docker
- Linux/MacOS
- Windows
$ mkdir $HOME/userguide
$ cd $HOME/userguide
$ docker run -v `pwd`:/app/userguide -e SL_ROOT=/app/userguide -it starlakeai/starlake:VERSION bootstrap
$ SL_ROOT=/my/other/location starlake bootstrap
c:\> mkdir c:\my\other\location
c:\> SET SL_ROOT=c:\my\other\location
c:\> starlake bootstrap
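Once the command completes, the folder should contain the default hierarchy described in the next section. A quick way to check (a minimal sketch, assuming a Linux/MacOS shell and the default layout):
$ cd $HOME/userguide
$ ls metadata                      # application.sl.yml, env*.sl.yml, expectations, extract, load, transform, types
$ ls datasets/incoming/starbake    # sample order, order_line and product files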
Project Structure
Starlake will create a default project hierarchy that enables you to start extracting, loading, transforming and orchestrating your data pipelines:
.
├── metadata
│   ├── application.sl.yml        # project configuration
│   ├── env.sl.yml                # variables used in the project with their default values
│   ├── env.BQ.sl.yml             # variables overridden for a BigQuery connection
│   ├── env.DUCKDB.sl.yml         # variables overridden for a DuckDB connection
│   ├── expectations
│   │   └── default.sl.yml        # expectation macros
│   ├── extract
│   ├── load
│   ├── transform
│   └── types
│       └── default.sl.yml        # type mappings
└── datasets                      # sample incoming data for this user guide
    └── incoming
        └── starbake
            ├── order_202403011414.json
            ├── order_line_202403011415.csv
            └── product.xml
Key directories:
incoming
: Contains the files to be loaded into your warehouse

metadata
: Contains the extract, load and transform configuration files

expectations
: Contains the data validation rules applied to loaded and transformed data
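For instance, staging a new file for loading is just a matter of copying it into the incoming area (a minimal sketch; my_orders.csv is a hypothetical file, and the starbake subfolder comes from the sample layout above):
$ cp /tmp/my_orders.csv $HOME/userguide/datasets/incoming/starbake/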
Configure Your Data Warehouse Connection
The project configuration is stored in metadata/application.sl.yml. This file contains:
- Project version
- List of connections to different data sinks
- Active connection reference
- Environment-specific configuration overrides
Here's an example configuration:
application:
  connectionRef: "{{activeConnection}}"
  audit:
    sink:
      connectionRef: "{{activeConnection}}"
  connections:
    sparkLocal:
      type: "fs" # Connection to the local file system (delta files)
    duckdb:
      type: "jdbc" # Connection to DuckDB
      options:
        url: "jdbc:duckdb:{{SL_ROOT}}/datasets/duckdb.db" # Location of the DuckDB database
        driver: "org.duckdb.DuckDBDriver"
    bigquery:
      type: "bigquery"
      options:
        location: europe-west1
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform"
        writeMethod: "direct"
The files env.DUCKDB.sl.yml and env.BQ.sl.yml override the default variable values for the DuckDB and BigQuery connections. Set the SL_ENV environment variable to switch between environments (a sketch of such an override file follows the commands below):
- Docker
- Linux/MacOS
- Windows
$ docker run -v `pwd`:/app/userguide \
-e SL_ROOT=/app/userguide \
-e SL_ENV=DUCKDB \
-it starlakeai/starlake:VERSION <command>
$ SL_ENV=DUCKDB starlake <command>
c:\> SET SL_ENV=DUCKDB
c:\> starlake <command>
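As an illustration, an override file such as metadata/env.DUCKDB.sl.yml typically just binds the variables referenced in application.sl.yml, such as activeConnection, to environment-specific values (a minimal sketch assuming the env: map format; refer to the generated file for the exact contents):
env:
  activeConnection: duckdb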
Next Steps
You're now ready to start working with Starlake! The next steps are:
- Load data into your warehouse
- Transform data for analysis
- Run transformations from CLI and Airflow
- Generate project documentation
We'll use the Starbake sample project to demonstrate these capabilities: