Connections
Connections are defined in the connections section under the root application attribute.
The following types of connections are supported:
Local File System
The local file system connection is used to read and write files to the local file system.
application:
  connections:
    local:
      type: local
Files will be stored in the area directory under the datasets directory.
Default values for area and datasets can be set in the application section.
application:
  datasets: "{{root}}/datasets" # or set it through the SL_DATASETS environment variable.
  area:
    pending: "pending" # Location where files of pending loads are stored. May be overridden by the ${SL_AREA_PENDING} environment variable.
    unresolved: "unresolved" # Location where files that do not match any pattern are moved. May be overridden by the ${SL_AREA_UNRESOLVED} environment variable.
    archive: "archive" # Location where files are moved after they have been processed. May be overridden by the ${SL_AREA_ARCHIVE} environment variable.
    ingesting: "ingesting" # Location where files are moved while they are being processed. May be overridden by the ${SL_AREA_INGESTING} environment variable.
    accepted: "accepted" # Location where files are moved after they have been processed and accepted. May be overridden by the ${SL_AREA_ACCEPTED} environment variable.
    rejected: "rejected" # Location where files are moved after they have been processed and rejected. May be overridden by the ${SL_AREA_REJECTED} environment variable.
    business: "business" # Location where transform tasks store their results. May be overridden by the ${SL_AREA_BUSINESS} environment variable.
    replay: "replay" # Location where rejected records are stored in their original format. May be overridden by the ${SL_AREA_REPLAY} environment variable.
    hiveDatabase: "${domain}_${area}" # Hive database name. May be overridden by the ${SL_AREA_HIVE_DATABASE} environment variable.
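Once a connection is defined, it is referenced by name. As a minimal sketch, assuming you want loads and transforms to default to the local connection above, you may set connectionRef in the application section (the same pattern is used in the Any Database example later on this page):
application:
  connectionRef: "local" # default connection for this project, assumed here to point at the local connection defined above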
Google BigQuery
Starlake supports both native and Spark BigQuery connections.
application:
  connections:
    bigquery:
      type: "bigquery"
      # Uncomment the line below to use the spark bigquery connector instead of the native one.
      # sparkFormat: "bigquery"
      options:
        location: "us-central1" # EU or US or ...
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform" # comma-separated list of scopes
        # writeMethod: "direct" # Only when 'sparkFormat' above is set. "direct" or "indirect" (indirect is required for certain features, see https://github.com/GoogleCloudDataproc/spark-bigquery-connector)
        # temporaryGcsBucket: "bucket_name" # Only when 'sparkFormat' above is set. Bucket name for temporary files, without the "gs://" prefix.
        #authType: SERVICE_ACCOUNT_JSON_KEYFILE
        #jsonKeyfile: "/Users/me/.gcloud/keys/starlake-me.json"
        #authType: "ACCESS_TOKEN"
        #gcpAccessToken: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  accessPolicies: # Required when applying access policies to table columns (Column Level Security)
    apply: true
    location: EU
    taxonomy: RGPD
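If you run on top of Spark or Databricks, the same connection can be switched to the spark-bigquery connector. The following is a minimal sketch based on the commented options above; the bucket name is a placeholder:
application:
  connections:
    bigquery:
      type: "bigquery"
      sparkFormat: "bigquery" # use the spark-bigquery connector instead of the native one
      options:
        location: "us-central1"
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform"
        writeMethod: "indirect" # "direct" or "indirect"
        temporaryGcsBucket: "my_temp_bucket" # placeholder bucket name, without the "gs://" prefix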
Snowflake
User / Password Auth
You may use the native Snowflake JDBC driver or the Spark Snowflake connector. The native driver is recommended for most use cases. The Spark Snowflake connector is only required when exporting data from Snowflake to Excel or CSV / Parquet / ... files, or when loading data into Snowflake using the Starlake embedded Spark library.
application:
  connections:
    snowflake:
      type: jdbc
      # Uncomment the line below to use the Snowflake Spark embedded library instead of the Snowflake JDBC connector.
      # sparkFormat: snowflake
      options:
        url: "jdbc:snowflake://{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com"
        driver: "net.snowflake.client.jdbc.SnowflakeDriver"
        user: {{SNOWFLAKE_USER}}
        password: {{SNOWFLAKE_PASSWORD}}
        warehouse: {{SNOWFLAKE_WAREHOUSE}}
        db: {{SNOWFLAKE_DB}}
        keep_column_case: "off"
        preActions: "alter session set TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_LTZ';ALTER SESSION SET QUOTED_IDENTIFIERS_IGNORE_CASE = true"
        sfUrl: "{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com" # make sure you do not prefix it with jdbc:snowflake://. This is done by the Snowflake driver.
        #sfUrl: "jdbc:snowflake://{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com" # make sure the key is prefixed with 'sf' if 'sparkFormat' is set
        # sfUser: {{SNOWFLAKE_USER}} # make sure the key is prefixed with 'sf' if 'sparkFormat' is set
        # sfPassword: {{SNOWFLAKE_PASSWORD}} # make sure the key is prefixed with 'sf' if 'sparkFormat' is set
        # sfWarehouse: {{SNOWFLAKE_WAREHOUSE}} # make sure the key is prefixed with 'sf' if 'sparkFormat' is set
        # sfDatabase: {{SNOWFLAKE_DB}} # make sure the key is prefixed with 'sf' if 'sparkFormat' is set
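When sparkFormat is set, the Snowflake options must use the sf-prefixed keys. A minimal sketch derived from the commented lines above:
application:
  connections:
    snowflake:
      type: jdbc
      sparkFormat: snowflake # use the Snowflake Spark connector
      options:
        sfUrl: "{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com" # no jdbc:snowflake:// prefix
        sfUser: "{{SNOWFLAKE_USER}}"
        sfPassword: "{{SNOWFLAKE_PASSWORD}}"
        sfWarehouse: "{{SNOWFLAKE_WAREHOUSE}}"
        sfDatabase: "{{SNOWFLAKE_DB}}"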
Single Sign-On (OAuth)
Single Sign-On with Snowflake OAuth
Single Sign-On (SSO) is supported via the oauth authentication type, which is only compatible with the native Snowflake JDBC driver.
You can continue using the same configuration as before. Once the user is authenticated through Snowflake, Starlake will automatically use the oauth authentication type to connect. The user and password fields will be ignored and replaced by the user token generated by Snowflake.
Configuring Snowflake OAuth
To use Snowflake OAuth, you need to configure the following settings in your Snowflake account:
- Log in to Snowflake using the ACCOUNTADMIN role or another role with the global CREATE INTEGRATION privilege.
- Execute the following SQL command to create a new OAuth integration:
-- Create a new OAuth integration
CREATE OR REPLACE SECURITY INTEGRATION STARLAKE
TYPE = OAUTH
ENABLED = TRUE
OAUTH_CLIENT = CUSTOM
OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'
OAUTH_REDIRECT_URI = '<REDIRECT_URI>'
OAUTH_ISSUE_REFRESH_TOKENS = TRUE
OAUTH_REFRESH_TOKEN_VALIDITY = 7776000 -- Valid for 90 days
OAUTH_USE_SECONDARY_ROLES = 'IMPLICIT';
OAuth Redirect URI
REDIRECT_URI refers to the URL that Snowflake will redirect to after authentication:
- If you're running Starlake on your local machine, use: http://localhost:8080/api/v1/auth/snowflake/callback
- If you're running Starlake as a native app on Snowflake, the redirect URI will be: https://<account>.snowflakecomputing.com/api/v1/auth/snowflake/callback
Token Validity
The access token provided by Snowflake is valid for 10 minutes. However, the refresh token—valid for 90 days—will automatically renew the access token as needed. After 90 days, the user must reauthenticate.
Retrieving Integration Credentials
To extract your account name, client ID, and client secret from the integration you just created, run the following SQL query:
WITH SECURITY_INTEGRATION AS (
  SELECT PARSE_JSON(SYSTEM$SHOW_OAUTH_CLIENT_SECRETS('STARLAKE')) AS OAUTH_CLIENT_SECRETS
)
SELECT
  CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME() AS account,
  OAUTH_CLIENT_SECRETS:"OAUTH_CLIENT_ID"::string AS client_id,
  OAUTH_CLIENT_SECRETS:"OAUTH_CLIENT_SECRET"::string AS client_secret
FROM
  SECURITY_INTEGRATION;
In the Starlake UI, log in as an admin and navigate to the Admin page.
Then, click on the Snowflake SSO tab and fill in the Account, Client ID, and Client Secret fields with the values you extracted from Snowflake.
Apache Spark / Databricks
Spark connections are used to read and write data using Spark. The following storage formats are supported:
- Spark Parquet
- Spark Delta
- Spark Iceberg
Spark Parquet
application:
  defaultWriteFormat: parquet
  connections:
    spark:
      type: "spark"
      options:
        # any spark configuration can be set here
Spark Delta
application:
  defaultWriteFormat: delta
  connections:
    spark:
      type: "spark"
      options:
        # any spark configuration can be set here
  spark:
    sql:
      extensions: "io.delta.sql.DeltaSparkSessionExtension"
      catalog:
        spark_catalog: "org.apache.spark.sql.delta.catalog.DeltaCatalog"
Spark Iceberg
application:
  defaultWriteFormat: iceberg
  connections:
    spark:
      type: "spark"
      options:
        # any spark configuration can be set here
  spark:
    sql.extensions: "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
    sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkSessionCatalog
    sql.catalog.spark_catalog.type: hadoop
    sql.catalog.local: org.apache.iceberg.spark.SparkCatalog
    sql.catalog.local.type: hadoop
    sql.catalog.spark_catalog.warehouse: "{{SL_ROOT}}/warehouse"
    sql.catalog.local.warehouse: "{{SL_ROOT}}/warehouse"
    sql.defaultCatalog: local
Amazon Redshift
- Redshift JDBC
application:
  connections:
    redshift:
      type: jdbc
      # Uncomment the line below if you are running on top of Spark or Databricks
      # sparkFormat: "io.github.spark_redshift_community.spark.redshift" # set to "redshift" if running on top of Databricks
      options:
        url: "jdbc:redshift://account.region.redshift.amazonaws.com:5439/database"
        driver: com.amazon.redshift.Driver
        password: "{{REDSHIFT_PASSWORD}}"
        tempdir: "s3a://bucketName/data"
        tempdir_region: "eu-central-1" # required only if running from outside AWS (your laptop ...)
        aws_iam_role: "arn:aws:iam::aws_account_id:role/role_name"
DuckDB
DuckDB is a lightweight, in-process SQL OLAP database management system. It is designed to support analytical queries on large datasets.
application:
  connections:
    duckdb:
      type: jdbc
      options:
        url: "jdbc:duckdb:{{DUCKDB_PATH}}"
        driver: "org.duckdb.DuckDBDriver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"
        # Uncomment and customize the line below if using DuckLake
        # preActions: "ATTACH IF NOT EXISTS 'ducklake:metadata.ducklake' AS my_ducklake (DATA_PATH 'file_path/'); USE my_ducklake;"
Any Database
The syntax below applies to any JDBC-compliant database; the following example uses PostgreSQL.
application:
  connectionRef: "postgresql"
  connections:
    postgresql:
      type: jdbc
      # Uncomment the line below to use the spark postgres connector instead of the native one.
      # sparkFormat: jdbc
      options:
        url: "jdbc:postgresql://{{POSTGRES_HOST}}:{{POSTGRES_PORT}}/{{POSTGRES_DATABASE}}"
        driver: "org.postgresql.Driver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"
        quoteIdentifiers: false