Connections

Connections are defined in the connections section under the root application attribute.

The following types of connections are supported:

Local File System

The local file system connection is used to read and write files to the local file system.

application:
  connections:
    local:
      type: local

Files are stored in the area directories under the datasets directory. Default values for area and datasets can be set in the application section.

application:
  datasets: "{{root}}/datasets" # or set it through the SL_DATASETS environment variable.
  area:
    pending: "pending" # Location where files pending load are stored. May be overridden by the ${SL_AREA_PENDING} environment variable.
    unresolved: "unresolved" # Location where files that do not match any pattern are moved. May be overridden by the ${SL_AREA_UNRESOLVED} environment variable.
    archive: "archive" # Location where files are moved after they have been processed. May be overridden by the ${SL_AREA_ARCHIVE} environment variable.
    ingesting: "ingesting" # Location where files are moved while they are being processed. May be overridden by the ${SL_AREA_INGESTING} environment variable.
    accepted: "accepted" # Location where files are moved after they have been processed and accepted. May be overridden by the ${SL_AREA_ACCEPTED} environment variable.
    rejected: "rejected" # Location where files are moved after they have been processed and rejected. May be overridden by the ${SL_AREA_REJECTED} environment variable.
    business: "business" # Location where transform tasks store their results. May be overridden by the ${SL_AREA_BUSINESS} environment variable.
    replay: "replay" # Location where rejected records are stored in their original format. May be overridden by the ${SL_AREA_REPLAY} environment variable.
    hiveDatabase: "${domain}_${area}" # Hive database name. May be overridden by the ${SL_AREA_HIVE_DATABASE} environment variable.
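
For illustration, assuming the default values above and a hypothetical project root of /my/project, the layout under the datasets directory looks roughly like this:

/my/project/datasets/
├── pending/      # incoming files waiting to be loaded
├── unresolved/   # files that matched no load pattern
├── ingesting/    # files currently being processed
├── archive/      # files kept after processing
├── accepted/     # files processed and accepted
├── rejected/     # files processed and rejected
├── business/     # results of transform tasks
└── replay/       # rejected records kept in their original format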

Google BigQuery

Starlake supports both native and Spark BigQuery connections.

application:
  connections:
    bigquery:
      type: "bigquery"
      # Uncomment the line below to use the Spark BigQuery connector instead of the native one.
      # sparkFormat: "bigquery"
      options:
        location: "us-central1" # EU or US or ...
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform" # comma-separated list of scopes
        # writeMethod: "direct" # Only when 'sparkFormat' above is set. "direct" or "indirect" (indirect is required for certain features, see https://github.com/GoogleCloudDataproc/spark-bigquery-connector)
        # temporaryGcsBucket: "bucket_name" # Only when 'sparkFormat' above is set. Bucket name for temporary files, without the "gs://" prefix.
        #authType: SERVICE_ACCOUNT_JSON_KEYFILE
        #jsonKeyfile: "/Users/me/.gcloud/keys/starlake-me.json"
        #authType: "ACCESS_TOKEN"
        #gcpAccessToken: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  accessPolicies: # Required when applying access policies to table columns (Column Level Security)
    apply: true
    location: EU
    taxonomy: RGPD
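
For reference, here is a minimal sketch of the same connection going through the Spark BigQuery connector with an indirect write; the bucket name is a hypothetical placeholder:

application:
  connections:
    bigquery:
      type: "bigquery"
      sparkFormat: "bigquery" # use the Spark BigQuery connector instead of the native one
      options:
        location: "us-central1"
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform"
        writeMethod: "indirect" # "direct" or "indirect"; indirect writes go through a temporary bucket
        temporaryGcsBucket: "my_temp_bucket" # hypothetical bucket name, without the "gs://" prefix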

Snowflake

User / Password Auth

You may use the native Snowflake JDBC driver or the Spark Snowflake connector. The native driver is recommended for most use cases. The Spark Snowflake connector is only required when exporting data from Snowflake to Excel or CSV / Parquet / ... files, or when loading data into Snowflake using the Starlake embedded Spark library.


application:
  connections:
    snowflake:
      type: jdbc
      # Uncomment the line below to use the Spark Snowflake connector instead of the Snowflake JDBC driver.
      # sparkFormat: snowflake
      options:
        url: "jdbc:snowflake://{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com"
        driver: "net.snowflake.client.jdbc.SnowflakeDriver"
        user: {{SNOWFLAKE_USER}}
        password: {{SNOWFLAKE_PASSWORD}}
        warehouse: {{SNOWFLAKE_WAREHOUSE}}
        db: {{SNOWFLAKE_DB}}
        keep_column_case: "off"
        preActions: "alter session set TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_LTZ';ALTER SESSION SET QUOTED_IDENTIFIERS_IGNORE_CASE = true"
        # The sf-prefixed keys below are only used when 'sparkFormat' above is set.
        # sfUrl: "{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com" # do not prefix it with jdbc:snowflake://; the Snowflake driver adds it
        # sfUser: {{SNOWFLAKE_USER}}
        # sfPassword: {{SNOWFLAKE_PASSWORD}}
        # sfWarehouse: {{SNOWFLAKE_WAREHOUSE}}
        # sfDatabase: {{SNOWFLAKE_DB}}
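
As a sketch, the Spark variant of the same connection sets sparkFormat and uses the sf-prefixed keys shown above:

application:
  connections:
    snowflake:
      type: jdbc
      sparkFormat: snowflake # use the Spark Snowflake connector
      options:
        sfUrl: "{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com" # no jdbc:snowflake:// prefix here
        sfUser: {{SNOWFLAKE_USER}}
        sfPassword: {{SNOWFLAKE_PASSWORD}}
        sfWarehouse: {{SNOWFLAKE_WAREHOUSE}}
        sfDatabase: {{SNOWFLAKE_DB}}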


Single Sign-On (OAuth)

Single Sign-On with Snowflake OAuth

Single Sign-On (SSO) is supported via the oauth authentication type, which is only compatible with the native Snowflake JDBC driver.

You can continue using the same configuration as before. Once the user is authenticated through Snowflake, Starlake will automatically use the oauth authentication type to connect. The user and password fields will be ignored and replaced by the user token generated by Snowflake.

Configuring Snowflake OAuth

To use Snowflake OAuth, you need to configure the following settings in your Snowflake account:

  1. Log in to Snowflake using the ACCOUNTADMIN role or another role with the global CREATE INTEGRATION privilege.
  2. Execute the following SQL command to create a new OAuth integration:
-- Create a new OAuth integration
CREATE OR REPLACE SECURITY INTEGRATION STARLAKE
  TYPE = OAUTH
  ENABLED = TRUE
  OAUTH_CLIENT = CUSTOM
  OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'
  OAUTH_REDIRECT_URI = '<REDIRECT_URI>'
  OAUTH_ISSUE_REFRESH_TOKENS = TRUE
  OAUTH_REFRESH_TOKEN_VALIDITY = 7776000 -- Valid for 90 days
  OAUTH_USE_SECONDARY_ROLES = 'IMPLICIT';

OAuth Redirect URI

REDIRECT_URI refers to the URL that Snowflake will redirect to after authentication:

  • If you’re running Starlake on your local machine, use: http://localhost:8080/api/v1/auth/snowflake/callback (a filled-in example follows this list)

  • If you’re running Starlake as a native app on Snowflake, the redirect URI will be: https://<account>.snowflakecomputing.com/api/v1/auth/snowflake/callback
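
For example, a sketch of the integration for a Starlake instance running on your local machine plugs the localhost redirect URI into the statement shown earlier:

-- Same integration as above, with the local redirect URI filled in
CREATE OR REPLACE SECURITY INTEGRATION STARLAKE
  TYPE = OAUTH
  ENABLED = TRUE
  OAUTH_CLIENT = CUSTOM
  OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'
  OAUTH_REDIRECT_URI = 'http://localhost:8080/api/v1/auth/snowflake/callback'
  OAUTH_ISSUE_REFRESH_TOKENS = TRUE
  OAUTH_REFRESH_TOKEN_VALIDITY = 7776000 -- Valid for 90 days
  OAUTH_USE_SECONDARY_ROLES = 'IMPLICIT';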

Token Validity

The access token issued by Snowflake is valid for 10 minutes. The refresh token, which is valid for 90 days, is used to renew the access token automatically as needed. After 90 days, the user must reauthenticate.

Retrieving Integration Credentials

To extract your account name, client ID, and client secret from the integration you just created, run the following SQL query:

WITH SECURITY_INTEGRATION AS (
  SELECT PARSE_JSON(SYSTEM$SHOW_OAUTH_CLIENT_SECRETS('STARLAKE')) AS OAUTH_CLIENT_SECRETS
)
SELECT
  CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME() AS account,
  OAUTH_CLIENT_SECRETS:"OAUTH_CLIENT_ID"::string AS client_id,
  OAUTH_CLIENT_SECRETS:"OAUTH_CLIENT_SECRET"::string AS client_secret
FROM SECURITY_INTEGRATION;

In the Starlake UI, log in as an admin and navigate to the Admin page, then click the Snowflake SSO tab. Fill in the Account, Client ID, and Client Secret fields with the values you extracted from Snowflake.

Apache Spark / Databricks

Spark connections are used to read and write data through Apache Spark.

application:
  defaultWriteFormat: parquet
  connections:
    spark:
      type: "spark"
      options:
        # any Spark configuration can be set here
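
For instance, a minimal sketch (assuming the options map is passed straight through to the Spark session, with illustrative values) could tune standard Spark settings:

application:
  connections:
    spark:
      type: "spark"
      options:
        spark.sql.shuffle.partitions: "200" # illustrative value
        spark.sql.session.timeZone: "UTC"   # illustrative value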

Amazon Redshift


application:
  connections:
    redshift:
      type: jdbc
      # Uncomment the line below if you are running on top of Spark or Databricks
      # sparkFormat: "io.github.spark_redshift_community.spark.redshift" # set to "redshift" if running on top of Databricks
      options:
        url: "jdbc:redshift://account.region.redshift.amazonaws.com:5439/database"
        driver: "com.amazon.redshift.Driver"
        user: "{{REDSHIFT_USER}}"
        password: "{{REDSHIFT_PASSWORD}}"
        tempdir: "s3a://bucketName/data"
        tempdir_region: "eu-central-1" # required only if running from outside AWS (your laptop ...)
        aws_iam_role: "arn:aws:iam::aws_account_id:role/role_name"

DuckDB

DuckDB is a lightweight, in-process SQL OLAP database management system. It is designed to support analytical queries on large datasets.

application:
  connections:
    duckdb:
      type: jdbc
      options:
        url: "jdbc:duckdb:{{DUCKDB_PATH}}"
        driver: "org.duckdb.DuckDBDriver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"
        ## uncomment and customize the line below if using DuckLake
        # preActions: "ATTACH IF NOT EXISTS 'ducklake:metadata.ducklake' AS my_ducklake (DATA_PATH 'file_path/');USE my_ducklake;"

Any Database

The syntax below applies to any JDBC database; this example uses PostgreSQL.

application:
  connectionRef: "postgresql"
  connections:
    postgresql:
      type: jdbc
      # Uncomment the line below to use the Spark JDBC connector instead of the native one.
      # sparkFormat: jdbc
      options:
        url: "jdbc:postgresql://{{POSTGRES_HOST}}:{{POSTGRES_PORT}}/{{POSTGRES_DATABASE}}"
        driver: "org.postgresql.Driver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"
        quoteIdentifiers: false
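
Following the same pattern, a sketch for another JDBC database, MySQL in this case, only changes the URL and the driver; the host, port, and database names below are placeholders:

application:
  connectionRef: "mysql"
  connections:
    mysql:
      type: jdbc
      options:
        url: "jdbc:mysql://{{MYSQL_HOST}}:{{MYSQL_PORT}}/{{MYSQL_DATABASE}}" # placeholder host, port and database
        driver: "com.mysql.cj.jdbc.Driver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"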