Load Strategies: Control File Processing Order
Starlake load strategies control the order in which files are processed during starlake load or starlake autoload. Two built-in strategies are available: IngestionTimeStrategy (chronological order, the default) and IngestionNameStrategy (alphabetical order). You can also implement a custom strategy in Scala.
Load strategies vs. write strategies: This page covers load strategies, which control the order in which files are processed. Write strategies (APPEND, OVERWRITE, UPSERT_BY_KEY, SCD2) are a separate concept: they control how data is written to target tables.
How Load Strategies Work
During a starlake load or starlake autoload run, Starlake scans each domain's incoming directory for files matching the table.pattern defined in metadata/load/<domain>/<table>.sl.yml. The load strategy determines the order in which matching files are processed.
The strategy's list method receives two filtering parameters:
- since (LocalDateTime): Only return files modified after this timestamp. The built-in IngestionTimeStrategy uses this to skip already-processed files.
- extension: Only return files matching this extension.
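The effect of the two filters can be sketched in plain Scala. This is a minimal illustration, not Starlake's implementation: FileEntry is a hypothetical stand-in for Starlake's FileInfo, filterFiles is an illustrative helper, and the sketch assumes that extension is matched as a file-name suffix and that since is exclusive.

```scala
import java.time.LocalDateTime

// Hypothetical stand-in for Starlake's FileInfo; not the real class.
final case class FileEntry(name: String, modified: LocalDateTime)

object FilterSketch {
  // Keep files modified strictly after `since` whose name ends with `extension`.
  // An empty extension matches every file (assumption of this sketch).
  def filterFiles(
    files: List[FileEntry],
    extension: String = "",
    since: LocalDateTime = LocalDateTime.MIN
  ): List[FileEntry] =
    files.filter(f => f.modified.isAfter(since) && f.name.endsWith(extension))

  def main(args: Array[String]): Unit = {
    val files = List(
      FileEntry("order_20240101.csv", LocalDateTime.of(2024, 1, 15, 0, 0)),
      FileEntry("order_20240201.csv", LocalDateTime.of(2024, 2, 10, 0, 0)),
      FileEntry("notes.txt", LocalDateTime.of(2024, 3, 1, 0, 0))
    )
    val lastRun = LocalDateTime.of(2024, 2, 1, 0, 0)
    // Prints only order_20240201.csv
    filterFiles(files, extension = ".csv", since = lastRun).foreach(f => println(f.name))
  }
}
```

With a last run on Feb 1, the January file is skipped as already processed and notes.txt is dropped by the extension filter.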
Built-in Strategies: IngestionTimeStrategy and IngestionNameStrategy
| Strategy | Class | Description |
|---|---|---|
| Time-based (default) | ai.starlake.job.load.IngestionTimeStrategy | Processes files in chronological order based on file last modification time. Older files are loaded first. |
| Name-based | ai.starlake.job.load.IngestionNameStrategy | Processes files in lexicographical (alphabetical) order based on file name. |
Example: Time-Based Ordering (Default)
With IngestionTimeStrategy, given three files:
- order_20240101.csv (modified Jan 15)
- order_20240201.csv (modified Feb 10)
- order_20240301.csv (modified Mar 5)
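Files are loaded oldest first: order_20240101.csv, order_20240201.csv, order_20240301.csv. In this example the file names happen to sort the same way, so alphabetical ordering would give the same result.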
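The chronological ordering described here can be reproduced with a short sketch. It does not use Starlake's classes: FileEntry is a hypothetical stand-in for FileInfo, byModificationTime is an illustrative helper, and the year 2024 is assumed for the modification dates.

```scala
import java.time.LocalDateTime

// Hypothetical stand-in for Starlake's FileInfo; not the real class.
final case class FileEntry(name: String, modified: LocalDateTime)

object TimeOrderSketch {
  // Oldest modification time first.
  def byModificationTime(files: List[FileEntry]): List[FileEntry] =
    files.sortWith((a, b) => a.modified.isBefore(b.modified))

  def main(args: Array[String]): Unit = {
    // Deliberately listed out of order to show the effect of the sort.
    val files = List(
      FileEntry("order_20240301.csv", LocalDateTime.of(2024, 3, 5, 0, 0)),
      FileEntry("order_20240101.csv", LocalDateTime.of(2024, 1, 15, 0, 0)),
      FileEntry("order_20240201.csv", LocalDateTime.of(2024, 2, 10, 0, 0))
    )
    // Prints order_20240101.csv, order_20240201.csv, order_20240301.csv
    byModificationTime(files).foreach(f => println(f.name))
  }
}
```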
Switching to Name-Based Ordering
Set the loadStrategyClass property in metadata/application.sl.yml:
application:
  loadStrategyClass: ai.starlake.job.load.IngestionNameStrategy
With IngestionNameStrategy, files are sorted alphabetically by file name, regardless of modification time.
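The two orderings only differ when file names and modification times disagree, for example when an older file is delivered late. The sketch below illustrates that case. FileEntry is a hypothetical stand-in for Starlake's FileInfo, and the sort uses Scala's default String ordering, which is case-sensitive; whether Starlake compares names the same way is not confirmed here.

```scala
import java.time.LocalDateTime

// Hypothetical stand-in for Starlake's FileInfo; not the real class.
final case class FileEntry(name: String, modified: LocalDateTime)

object NameOrderSketch {
  // Lexicographical order on the file name, ignoring modification time.
  def byName(files: List[FileEntry]): List[FileEntry] =
    files.sortBy(_.name)

  // Oldest modification time first, shown for comparison.
  def byModificationTime(files: List[FileEntry]): List[FileEntry] =
    files.sortWith((a, b) => a.modified.isBefore(b.modified))

  def main(args: Array[String]): Unit = {
    val files = List(
      // The January file arrived late, so it has the most recent modification time.
      FileEntry("order_20240101.csv", LocalDateTime.of(2024, 3, 5, 0, 0)),
      FileEntry("order_20240201.csv", LocalDateTime.of(2024, 2, 10, 0, 0))
    )
    // Prints order_20240101.csv, order_20240201.csv
    println(byName(files).map(_.name).mkString(", "))
    // Prints order_20240201.csv, order_20240101.csv
    println(byModificationTime(files).map(_.name).mkString(", "))
  }
}
```

Name-based ordering is therefore a reasonable choice when file names encode the intended sequence and arrival times cannot be trusted.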
Implement a Custom Load Strategy in Scala
For advanced use cases, implement a custom strategy by extending the LoadStrategy interface.
Step 1: Implement the Interface
Create a Scala class that implements ai.starlake.job.load.LoadStrategy:
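import java.time.LocalDateTime
import com.typesafe.scalalogging.StrictLogging
import org.apache.hadoop.fs.Path
import ai.starlake.job.load.LoadStrategy
// Assumed package for StorageHandler and FileInfo; verify against your Starlake version.
import ai.starlake.schema.handlers.{FileInfo, StorageHandler}

object CustomLoadStrategy extends LoadStrategy with StrictLogging {
  // Return the files under `path` in the order they should be processed.
  def list(
    storageHandler: StorageHandler,
    path: Path,
    extension: String = "",
    since: LocalDateTime = LocalDateTime.MIN,
    recursive: Boolean
  ): List[FileInfo] = ??? // TODO: list, filter, and sort the files
}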
The list method must return a List[FileInfo] in the desired processing order. Use the since parameter to filter files by modification time. Use the extension parameter to filter by file type.
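As an illustration of the kind of logic a custom list implementation might contain, the sketch below orders files by an 8-digit date embedded in the file name and places files without a date last. It runs without Starlake on the classpath, so everything in it is an assumption made for the example: FileEntry stands in for FileInfo, the list signature is simplified to take an in-memory file list, and the date-in-name convention is hypothetical.

```scala
import java.time.LocalDateTime

// Hypothetical stand-in for Starlake's FileInfo; not the real class.
final case class FileEntry(name: String, modified: LocalDateTime)

object DateInNameSketch {
  // First run of 8 digits in the name, e.g. 20240101 in orders_20240101.csv.
  private val DatePattern = """\d{8}""".r

  def dateKey(name: String): Option[String] =
    DatePattern.findFirstIn(name)

  // Simplified signature: the real LoadStrategy.list receives a storage handler and a path.
  def list(
    files: List[FileEntry],
    extension: String = "",
    since: LocalDateTime = LocalDateTime.MIN
  ): List[FileEntry] =
    files
      .filter(f => f.name.endsWith(extension) && f.modified.isAfter(since))
      // Dated files first (false sorts before true), then by date, then by name.
      .sortBy(f => (dateKey(f.name).isEmpty, dateKey(f.name).getOrElse(""), f.name))

  def main(args: Array[String]): Unit = {
    val files = List(
      FileEntry("customers_20240301.csv", LocalDateTime.of(2024, 1, 5, 0, 0)),
      FileEntry("readme.csv", LocalDateTime.of(2024, 2, 1, 0, 0)),
      FileEntry("orders_20240101.csv", LocalDateTime.of(2024, 3, 9, 0, 0))
    )
    // Prints orders_20240101.csv, customers_20240301.csv, readme.csv
    list(files, extension = ".csv").foreach(f => println(f.name))
  }
}
```

In a real strategy, the same filter and sort would be applied to the FileInfo values obtained through the storageHandler argument.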
Step 2: Package and Deploy
- Compile and package your class into a JAR file.
- Place the JAR in Starlake's classpath (e.g., in the lib/ directory or via your build tool).
Step 3: Configure the Strategy
Reference your custom class in metadata/application.sl.yml:
application:
  loadStrategyClass: my.own.CustomLoadStrategy
Frequently Asked Questions
What is a load strategy in Starlake?
A load strategy controls the order in which files are processed during a starlake load or starlake autoload run. It is different from a write strategy (APPEND, OVERWRITE, UPSERT_BY_KEY, SCD2), which controls how data is written to tables.
What load strategies does Starlake provide out of the box?
Starlake provides two built-in strategies: IngestionTimeStrategy (loads files by last modification time, the default) and IngestionNameStrategy (loads files in alphabetical order by name).
How do I switch from time-based to name-based file ordering?
Set loadStrategyClass: ai.starlake.job.load.IngestionNameStrategy in metadata/application.sl.yml.
Can I implement a custom load strategy?
Yes. Implement the ai.starlake.job.load.LoadStrategy interface in Scala. Your class must define a list method that returns files in the desired order.
What is the default load strategy?
The default is IngestionTimeStrategy, which processes files in chronological order based on their last modification timestamp.
Does the load strategy affect which files are loaded or only their order?
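Both. The strategy's list method returns the files in processing order, and its two filtering parameters decide which files are returned at all: since (LocalDateTime) keeps only files modified after a given timestamp, and extension keeps only files with a matching extension.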
Where do I configure the load strategy?
In metadata/application.sl.yml, under the application.loadStrategyClass property.
Related
- Load Tutorial -- end-to-end walkthrough for loading data into your warehouse
- Write Strategies -- control how data is written to tables (APPEND, OVERWRITE, UPSERT_BY_KEY, SCD2)
- Configure Database Connections -- set up the target warehouse connection