
Load Strategies: Control File Processing Order

Starlake load strategies control the order in which files are processed during starlake load or starlake autoload. Two built-in strategies are available: IngestionTimeStrategy (chronological order, the default) and IngestionNameStrategy (alphabetical order). You can also implement a custom strategy in Scala.

Note

Load strategies vs. write strategies: This page covers load strategies, which control the order in which files are processed. Write strategies (APPEND, OVERWRITE, UPSERT_BY_KEY, SCD2) are a separate concept: they control how data is written to target tables.

How Load Strategies Work

During a starlake load or starlake autoload run, Starlake scans each domain's incoming directory for files matching the table.pattern defined in metadata/load/<domain>/<table>.sl.yml. The load strategy determines the order in which matching files are processed.

The strategy's list method receives two filtering parameters:

  • since (LocalDateTime): Only return files modified after this timestamp. The built-in IngestionTimeStrategy uses this to skip already-processed files.
  • extension: Only return files matching this extension.
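The effect of these two parameters can be sketched in a few lines. This is only an illustration under stated assumptions, not Starlake's implementation: ListedFile and ListFilterSketch are hypothetical stand-ins (Starlake's own FileInfo type is not used here), and treating extension as a file-name suffix such as ".csv" is an assumption.

```scala
import java.time.LocalDateTime

// Hypothetical stand-in for a listed file; this is NOT Starlake's FileInfo.
final case class ListedFile(name: String, modifiedAt: LocalDateTime)

object ListFilterSketch {
  // Keep files modified strictly after `since` whose name ends with `extension`.
  // An empty extension matches every file name (assumed semantics).
  def filter(
    files: List[ListedFile],
    extension: String = "",
    since: LocalDateTime = LocalDateTime.MIN
  ): List[ListedFile] =
    files.filter { f =>
      f.modifiedAt.isAfter(since) && f.name.endsWith(extension)
    }
}
```

With the defaults (an empty extension and LocalDateTime.MIN), every file passes the filter, which mirrors the default parameter values shown in the list signature later on this page.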

Built-in Strategies: IngestionTimeStrategy and IngestionNameStrategy

Strategy | Class | Description
Time-based (default) | ai.starlake.job.load.IngestionTimeStrategy | Processes files in chronological order based on file last modification time. Older files are loaded first.
Name-based | ai.starlake.job.load.IngestionNameStrategy | Processes files in lexicographical (alphabetical) order based on file name.

Example: Time-Based Ordering (Default)

With IngestionTimeStrategy, given three files:

  • order_20240101.csv (modified Jan 15)
  • order_20240201.csv (modified Feb 10)
  • order_20240301.csv (modified Mar 5)

Files are loaded in this order: order_20240101.csv, order_20240201.csv, order_20240301.csv.

Switching to Name-Based Ordering

Set the loadStrategyClass property in metadata/application.sl.yml:

metadata/application.sl.yml
application:
  loadStrategyClass: ai.starlake.job.load.IngestionNameStrategy

With IngestionNameStrategy, files are sorted alphabetically by file name, regardless of modification time.
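The two orderings only differ when file names and modification times disagree. The sketch below illustrates that case; it is a simplified model, not the code of either built-in strategy. IncomingFile, OrderingSketch, and the sample data are hypothetical and exist only for this example.

```scala
import java.time.LocalDateTime

// Hypothetical stand-in for a listed file; this is NOT Starlake's FileInfo.
final case class IncomingFile(name: String, modifiedAt: LocalDateTime)

object OrderingSketch {
  // Time-based: oldest modification time first, compared explicitly with isBefore.
  def byTime(files: List[IncomingFile]): List[IncomingFile] =
    files.sortWith((a, b) => a.modifiedAt.isBefore(b.modifiedAt))

  // Name-based: lexicographical order of the file name.
  def byName(files: List[IncomingFile]): List[IncomingFile] =
    files.sortBy(_.name)

  // Sample where the newest-looking name was actually written first.
  val sample: List[IncomingFile] = List(
    IncomingFile("order_20240301.csv", LocalDateTime.of(2024, 1, 15, 9, 0)),
    IncomingFile("order_20240101.csv", LocalDateTime.of(2024, 3, 5, 9, 0)),
    IncomingFile("order_20240201.csv", LocalDateTime.of(2024, 2, 10, 9, 0))
  )
}
```

For this sample, OrderingSketch.byTime yields order_20240301.csv, order_20240201.csv, order_20240101.csv, while OrderingSketch.byName yields order_20240101.csv, order_20240201.csv, order_20240301.csv. Name-based ordering is therefore a reasonable choice when the file name carries the business date and files may arrive late or be re-delivered.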

Implement a Custom Load Strategy in Scala

For advanced use cases, implement a custom strategy by extending the LoadStrategy interface.

Step 1: Implement the Interface

Create a Scala object that implements ai.starlake.job.load.LoadStrategy:

src/main/scala/my/own/CustomLoadStrategy.scala
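package my.own

import java.time.LocalDateTime
import ai.starlake.job.load.LoadStrategy
// Packages of FileInfo, StorageHandler, and Path are assumed; verify against your Starlake version.
import ai.starlake.schema.handlers.{FileInfo, StorageHandler}
import com.typesafe.scalalogging.StrictLogging
import org.apache.hadoop.fs.Path

object CustomLoadStrategy extends LoadStrategy with StrictLogging {

  // Return the files found under `path`, in the order they should be processed.
  def list(
    storageHandler: StorageHandler,
    path: Path,
    extension: String = "",
    since: LocalDateTime = LocalDateTime.MIN,
    recursive: Boolean
  ): List[FileInfo] = ???
}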

The list method must return a List[FileInfo] in the desired processing order. Use the since parameter to filter files by modification time. Use the extension parameter to filter by file type.
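As one possible ordering rule for a custom strategy, the sketch below sorts files by a yyyyMMdd date embedded in the file name (as in order_20240101.csv). It is a standalone helper working on plain file names, under the assumption that a list implementation would extract names from the files it lists and delegate the ordering to it. FileNameDateOrdering and its methods are hypothetical names, not part of Starlake.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper; not part of Starlake.
object FileNameDateOrdering {
  private val DateToken = """(\d{8})""".r

  // Extract the first 8-digit token from the file name and parse it as yyyyMMdd.
  // Returns None when there is no token or the token is not a valid date.
  def embeddedDate(fileName: String): Option[LocalDate] =
    DateToken
      .findFirstIn(fileName)
      .flatMap(token => Try(LocalDate.parse(token, DateTimeFormatter.BASIC_ISO_DATE)).toOption)

  // Dated files come first, oldest date first; undated files follow, sorted by name.
  def sort(fileNames: List[String]): List[String] =
    fileNames.sortBy { name =>
      embeddedDate(name) match {
        case Some(date) => (0, date.toEpochDay, name)
        case None       => (1, 0L, name)
      }
    }
}
```

Unlike plain alphabetical ordering, this rule still works when files from different feeds use different prefixes, because only the embedded date drives the order.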

Step 2: Package and Deploy

  1. Compile and package your class into a JAR file.
  2. Place the JAR in Starlake's classpath (e.g., in the lib/ directory or via your build tool).

Step 3: Configure the Strategy

Reference your custom class in metadata/application.sl.yml:

metadata/application.sl.yml
application:
  loadStrategyClass: my.own.CustomLoadStrategy

Frequently Asked Questions

What is a load strategy in Starlake?

A load strategy controls the order in which files are processed during a starlake load or starlake autoload run. This is different from write strategies (APPEND, OVERWRITE, UPSERT_BY_KEY, SCD2), which control how data is written to target tables.

What load strategies does Starlake provide out of the box?

Starlake provides two built-in strategies: IngestionTimeStrategy (loads files by last modification time, the default) and IngestionNameStrategy (loads files in alphabetical order by name).

How do I switch from time-based to name-based file ordering?

Set loadStrategyClass: ai.starlake.job.load.IngestionNameStrategy in metadata/application.sl.yml.

Can I implement a custom load strategy?

Yes. Implement the ai.starlake.job.load.LoadStrategy interface in Scala. Your class must define a list method that returns files in the desired order.

What is the default load strategy?

The default is IngestionTimeStrategy, which processes files in chronological order based on their last modification timestamp.

Does the load strategy affect which files are loaded or only their order?

Both. Ordering is its main role, but the list method also receives a since parameter (LocalDateTime), used to return only files modified after a given timestamp, and an extension parameter, used to return only files with a matching extension.

Where do I configure the load strategy?

In metadata/application.sl.yml, with the loadStrategyClass property under the application key.