Autoload
Datawarehouses are organized around schemas where tables are grouped.
Depending on the database, a database schema
can be called schema
or catalog
or dataset
.
In starlake, we use the term domain
to designate a schema
, catalog
or dataset
.
The conventions used by the autoload command are:
- files are stored under the incoming directory and named after the domain and table they are loaded into:
$SL_ROOT/incoming/<domain>/<table><suffix>.<extension>
- The target schema is named after the domain names
- The target tables are named after the file name without the extension or suffix.
- The suffix is used to differentiate between files that are loaded in full
overwrite
mode and those loaded inappend
mode. For exampleorder_20240228.json
andorder_line_20240228.csv
are respectively loaded into the same tableorder
andorder_line
.
incoming
└── starbake
├── product.xml
├── order_20240228.csv
└── order_line_20240228.csv
After running the autoload
command:
starlake autoload
Since we did not specify the target datawarehouse but the local filesystem as the target, the directory structure will be updated as follows.
In real life, the target datawarehouse will be a database and the tables will be created in the target database.
metadata/load
└── starbake
├── _config.sl.yml
├── product.sl.yml
└── order.sl.yml
datasets/
├── archive # files are moved to the archive directory after being loaded
│ └── starbake
├── ingesting # files are moved to the ingesting directory before being loaded
│ └── starbake
├── replay # invalid records that need to be replayed are moved
│ └── starbake # in the replay directory
├── stage # files are moved to the stage directory before being loaded
│ └── starbake
├── unresolved # filenames that do not match a valid pattern are moved to the unresolved directory.
└──incoming
└── starbake
├── product.xml
├── order_20240228.csv
└── order_line_20240228.csv
-
The
_config.sl.yml
file is a configuration file that contains the domain configuration and describe how the domain will be created in the target database. -
The
<table>.sl.yml
file contains the table configuration and describes how files are parsed, validated and loaded into the target database.
How autoload detects the format of the files
- It expects the file to be named after the table it will be loaded into.
- It uses the file extension to detect the format of the file. The supported formats are
csv
,psv
,dsv
,json
,xml
- It uses the first line of the file to detect the separator and the number of columns.
- In the case of CSV files, it uses the first line of the file to detect the column names.
- It uses the whole file content to detect the column types.
- In the case of JSON file, it is a considered a JSON_ARRAY file if the first character is
[
and a JSON file if the first character is{
. - In the case of XML file, it uses the first line of the file to detect the root element and the number of columns.
XML files are expected to have the following structure:
<?xml> # optional xml declaration
<myroot>
<myrecord attr1="value1" >
<column1>value1</column1>
<column2>value2</column2>
...
</myrecord>
<myroot>
Spark specifics
When running against Spark Locally, the tables are created in the datasets
directory.
When running against a real datawarehouse, the tables are created in the target database.
datasets/
├── starbake.db # files are loaded as tables into the starbake schema
│ ├── order
│ ├── order_line
│ └── product
├── audit.db # audit tables are created in the audit schema
│ ├── audit
│ └── rejected