Connections
Connections are defined in the connections section under the root application attribute.
The following types of connections are supported:
Local File System
The local file system connection is used to read and write files to the local file system.
application:
  connections:
    local:
      type: local
Files will be stored in the area directory under the datasets directory.
Default values for area and datasets can be set in the application section.
application:
  datasets: "{{root}}/datasets" # or set it through the SL_DATASETS environment variable.
  area:
    pending: "pending" # Location where files of pending loads are stored. May be overridden by the ${SL_AREA_PENDING} environment variable.
    unresolved: "unresolved" # Location where files that do not match any pattern are moved. May be overridden by the ${SL_AREA_UNRESOLVED} environment variable.
    archive: "archive" # Location where files are moved after they have been processed. May be overridden by the ${SL_AREA_ARCHIVE} environment variable.
    ingesting: "ingesting" # Location where files are moved while they are being processed. May be overridden by the ${SL_AREA_INGESTING} environment variable.
    accepted: "accepted" # Location where files are moved after they have been processed and accepted. May be overridden by the ${SL_AREA_ACCEPTED} environment variable.
    rejected: "rejected" # Location where files are moved after they have been processed and rejected. May be overridden by the ${SL_AREA_REJECTED} environment variable.
    business: "business" # Location where transform tasks store their results. May be overridden by the ${SL_AREA_BUSINESS} environment variable.
    replay: "replay" # Location where rejected records are stored in their original format. May be overridden by the ${SL_AREA_REPLAY} environment variable.
    hiveDatabase: "${domain}_${area}" # Hive database name. May be overridden by the ${SL_AREA_HIVE_DATABASE} environment variable.
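Once a connection is defined, it is referenced by name. As a minimal sketch, assuming you want loads and transforms to default to the local connection above, you may set connectionRef in the application section (the same pattern is used in the Any Database example later on this page):
application:
  connectionRef: "local" # default connection for this project, assumed here to point at the local connection defined above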
Google BigQuery
Starlake supports both native and Spark BigQuery connections.
application:
  connections:
    bigquery:
      type: "bigquery"
      # Uncomment the line below to use the spark bigquery connector instead of the native one.
      # sparkFormat: "bigquery"
      options:
        location: "us-central1" # EU or US or ...
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform" # comma-separated list of scopes
        # writeMethod: "direct" # Only when 'sparkFormat' above is set. "direct" or "indirect" (indirect is required for certain features, see https://github.com/GoogleCloudDataproc/spark-bigquery-connector)
        # temporaryGcsBucket: "bucket_name" # Only when 'sparkFormat' above is set. Bucket name for temporary files, without the "gs://" prefix.
        #authType: SERVICE_ACCOUNT_JSON_KEYFILE
        #jsonKeyfile: "/Users/me/.gcloud/keys/starlake-me.json"
        #authType: "ACCESS_TOKEN"
        #gcpAccessToken: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  accessPolicies: # Required when applying access policies to table columns (Column Level Security)
    apply: true
    location: EU
    taxonomy: RGPD
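If you run on top of Spark or Databricks, the same connection can be switched to the spark-bigquery connector. The following is a minimal sketch based on the commented options above; the bucket name is a placeholder:
application:
  connections:
    bigquery:
      type: "bigquery"
      sparkFormat: "bigquery" # use the spark-bigquery connector instead of the native one
      options:
        location: "us-central1"
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform"
        writeMethod: "indirect" # "direct" or "indirect"
        temporaryGcsBucket: "my_temp_bucket" # placeholder bucket name, without the "gs://" prefix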
Snowflake
User / Password Auth
You may use the native Snowflake JDBC driver or the Spark Snowflake connector. The native driver is recommended for most use cases. The Spark Snowflake connector is only required when exporting data from Snowflake to Excel or CSV / Parquet / ... files, or when loading data into Snowflake using the Starlake embedded Spark library.
application:
  connections:
    snowflake:
      type: jdbc
      # Uncomment the line below to use the Snowflake Spark embedded library instead of the Snowflake JDBC connector.
      # sparkFormat: snowflake
      options:
        url: "jdbc:snowflake://{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com"
        driver: "net.snowflake.client.jdbc.SnowflakeDriver"
        user: {{SNOWFLAKE_USER}}
        password: {{SNOWFLAKE_PASSWORD}}
        warehouse: {{SNOWFLAKE_WAREHOUSE}}
        db: {{SNOWFLAKE_DB}}
        keep_column_case: "off"
        preActions: "alter session set TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_LTZ';ALTER SESSION SET QUOTED_IDENTIFIERS_IGNORE_CASE = true"
        sfUrl: "{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com" # make sure you do not prefix it with jdbc:snowflake://. This is done by the Snowflake driver.
        #sfUrl: "jdbc:snowflake://{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com" # make sure the key is prefixed with 'sf' if 'sparkFormat' is set
        # sfUser: {{SNOWFLAKE_USER}} # make sure the key is prefixed with 'sf' if 'sparkFormat' is set
        # sfPassword: {{SNOWFLAKE_PASSWORD}} # make sure the key is prefixed with 'sf' if 'sparkFormat' is set
        # sfWarehouse: {{SNOWFLAKE_WAREHOUSE}} # make sure the key is prefixed with 'sf' if 'sparkFormat' is set
        # sfDatabase: {{SNOWFLAKE_DB}} # make sure the key is prefixed with 'sf' if 'sparkFormat' is set
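When sparkFormat is set, the Snowflake options must use the sf-prefixed keys. A minimal sketch derived from the commented lines above:
application:
  connections:
    snowflake:
      type: jdbc
      sparkFormat: snowflake # use the Snowflake Spark connector
      options:
        sfUrl: "{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com" # no jdbc:snowflake:// prefix
        sfUser: "{{SNOWFLAKE_USER}}"
        sfPassword: "{{SNOWFLAKE_PASSWORD}}"
        sfWarehouse: "{{SNOWFLAKE_WAREHOUSE}}"
        sfDatabase: "{{SNOWFLAKE_DB}}"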
Single Sign-On (OAuth)
Single Sign-On with Snowflake OAuth
Single Sign-On (SSO) is supported via the oauth authentication type, which is only compatible with the native Snowflake JDBC driver.
You can continue using the same configuration as before. Once the user is authenticated through Snowflake, Starlake will automatically use the oauth authentication type to connect. The user and password fields will be ignored and replaced by the user token generated by Snowflake.
Configuring Snowflake OAuth
To use Snowflake OAuth, you need to configure the following settings in your Snowflake account:
- Log in to Snowflake using the ACCOUNTADMIN role or another role with the global CREATE INTEGRATION privilege.
- Execute the following SQL command to create a new OAuth integration:
-- Create a new OAuth integration
CREATE OR REPLACE SECURITY INTEGRATION STARLAKE
TYPE = OAUTH
ENABLED = TRUE
OAUTH_CLIENT = CUSTOM
OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'
OAUTH_REDIRECT_URI = '<REDIRECT_URI>'
OAUTH_ISSUE_REFRESH_TOKENS = TRUE
OAUTH_REFRESH_TOKEN_VALIDITY = 7776000 -- Valid for 90 days
OAUTH_USE_SECONDARY_ROLES = 'IMPLICIT';
OAuth Redirect URI
REDIRECT_URI refers to the URL that Snowflake will redirect to after authentication:
- If you're running Starlake on your local machine, use: http://localhost:8080/api/v1/auth/snowflake/callback
- If you're running Starlake as a native app on Snowflake, the redirect URI will be: https://<account>.snowflakecomputing.com/api/v1/auth/snowflake/callback
Token Validity
The access token provided by Snowflake is valid for 10 minutes. However, the refresh token—valid for 90 days—will automatically renew the access token as needed. After 90 days, the user must reauthenticate.
Retrieving Integration Credentials
To extract your account name, client ID, and client secret from the integration you just created, run the following SQL query:
WITH SECURITY_INTEGRATION AS (
  SELECT PARSE_JSON(SYSTEM$SHOW_OAUTH_CLIENT_SECRETS('STARLAKE')) AS OAUTH_CLIENT_SECRETS
)
SELECT
  CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME() AS account,
  OAUTH_CLIENT_SECRETS:"OAUTH_CLIENT_ID"::string AS client_id,
  OAUTH_CLIENT_SECRETS:"OAUTH_CLIENT_SECRET"::string AS client_secret
FROM
  SECURITY_INTEGRATION;
In the Starlake UI, log in as an admin and navigate to the Admin page.
Then, click on the Snowflake SSO tab and fill in the Account, Client ID, and Client Secret fields with the values you extracted from Snowflake.
Apache Spark / Databricks
Spark connections are used to read and write data using Spark. The following storage formats are supported:
- Spark Parquet
- Spark Delta
- Spark Iceberg
Spark Parquet
application:
  defaultWriteFormat: parquet
  connections:
    spark:
      type: "spark"
      options:
        # any spark configuration can be set here
Spark Delta
application:
  defaultWriteFormat: delta
  connections:
    spark:
      type: "spark"
      options:
        # any spark configuration can be set here
  spark:
    sql:
      extensions: "io.delta.sql.DeltaSparkSessionExtension"
      catalog:
        spark_catalog: "org.apache.spark.sql.delta.catalog.DeltaCatalog"
Spark Iceberg
application:
  defaultWriteFormat: iceberg
  connections:
    spark:
      type: "spark"
      options:
        # any spark configuration can be set here
  spark:
    sql.extensions: "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
    sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkSessionCatalog
    sql.catalog.spark_catalog.type: hadoop
    sql.catalog.local: org.apache.iceberg.spark.SparkCatalog
    sql.catalog.local.type: hadoop
    sql.catalog.spark_catalog.warehouse: "{{SL_ROOT}}/warehouse"
    sql.catalog.local.warehouse: "{{SL_ROOT}}/warehouse"
    sql.defaultCatalog: local
Amazon Redshift
- Redshift JDBC
application:
  connections:
    redshift:
      type: jdbc
      # Uncomment the line below if you are running on top of Spark or Databricks
      # sparkFormat: "io.github.spark_redshift_community.spark.redshift" # set to "redshift" if running on top of Databricks
      options:
        url: "jdbc:redshift://account.region.redshift.amazonaws.com:5439/database"
        driver: com.amazon.redshift.Driver
        password: "{{REDSHIFT_PASSWORD}}"
        tempdir: "s3a://bucketName/data"
        tempdir_region: "eu-central-1" # required only if running from outside AWS (your laptop ...)
        aws_iam_role: "arn:aws:iam::aws_account_id:role/role_name"
DuckDB
DuckDB is a lightweight, in-process SQL OLAP database management system. It is designed to support analytical queries on large datasets.
application:
  connections:
    duckdb:
      type: jdbc
      options:
        url: "jdbc:duckdb:{{DUCKDB_PATH}}"
        driver: "org.duckdb.DuckDBDriver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"
        # Uncomment and customize the line below if using DuckLake
        # preActions: "ATTACH IF NOT EXISTS 'ducklake:metadata.ducklake' AS my_ducklake (DATA_PATH 'file_path/'); USE my_ducklake;"
Any Database
The syntax below applies to any JDBC-compliant database; the following example uses PostgreSQL.
application:
  connectionRef: "postgresql"
  connections:
    postgresql:
      type: jdbc
      # Uncomment the line below to use the spark postgres connector instead of the native one.
      # sparkFormat: jdbc
      options:
        url: "jdbc:postgresql://{{POSTGRES_HOST}}:{{POSTGRES_PORT}}/{{POSTGRES_DATABASE}}"
        driver: "org.postgresql.Driver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"
        quoteIdentifiers: false