Extracting Schema from OpenAPI
This document describes how to extract Starlake schemas from OpenAPI definitions using a YAML configuration, and explains how to map OpenAPI routes and schemas to Starlake.
Overview
The schema extraction process uses a YAML configuration to define how Starlake's domains and tables relate to OpenAPI definitions. This configuration supports advanced features such as excluding routes or fields and choosing an explosion strategy for complex data structures. Table names are deduced from paths and normalized. Attribute names are normalized as well so that they are compatible with databases.
Generally speaking, what matters in an OpenAPI definition are the GET operations and their schemas. Starlake filters out schemas that are not objects or arrays of objects. Furthermore, the extraction flattens a root array and considers the output to match JSON Lines.
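The filtering rule above can be illustrated with a minimal Python sketch. It is not Starlake's implementation: the function names are hypothetical, and it assumes an OpenAPI 3 document already loaded as a dictionary, with `application/json` responses and only local `$ref` references.

```python
from typing import Any, Dict, List


def resolve(spec: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Follow a local '#/components/...' reference if the schema is one."""
    ref = schema.get("$ref")
    if ref is None:
        return schema
    node: Any = spec
    for part in ref.lstrip("#/").split("/"):
        node = node[part]
    return node


def is_object_like(spec: Dict[str, Any], schema: Dict[str, Any]) -> bool:
    """True for an object schema or an array whose items are objects."""
    schema = resolve(spec, schema)
    if schema.get("type") == "object" or "properties" in schema:
        return True
    if schema.get("type") == "array":
        items = resolve(spec, schema.get("items", {}))
        return items.get("type") == "object" or "properties" in items
    return False


def eligible_get_routes(spec: Dict[str, Any]) -> List[str]:
    """List paths whose GET operation returns an object or an array of objects."""
    routes = []
    for path, item in spec.get("paths", {}).items():
        get_op = item.get("get")
        if get_op is None:
            continue  # only GET operations are considered in this sketch
        for response in get_op.get("responses", {}).values():
            schema = response.get("content", {}).get("application/json", {}).get("schema")
            if schema is not None and is_object_like(spec, schema):
                routes.append(path)
                break
    return routes
```

A route returning a plain string or number would be dropped by this check, while a route returning a list of client objects would be kept.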
Extract step-by-step
1. Define OpenAPI-Starlake Mapping
In metadata/extract/, define a YAML config file, here named my_openapi_extract_config.sl.yml.
version: 1
extract:
  connectionRef: "my_open_api"
  openAPI:
    basePath: /api/v2
    domains:
      - name: api
        schemas:
          exclude:
            - Model\.Common\.Id
        routes:
          - paths:
              - ^/api/v2/clients/\{id}/details$
            explode:
              on: ARRAY
              rename:
                postal_adresses: adresses
The example above doesn't show every available option. Let's break the configuration down and see how you may define yours:
- sanitizeAttributeName: one of:
  - ON_EXTRACT: the attribute name is sanitized and stored as the field name. Default.
  - ON_LOAD: the attribute name is sanitized and stored as the rename property when the attribute's name differs from the sanitized name.
- connectionRef: references an FS connection indicating where the OpenAPI file is located.
- openAPI:
  - basePath: paths are used for table names. This base path is removed before the table name is produced.
  - formatTypeMapping: a map of String to String that maps a custom format to a Starlake type. By default, if the format is not standard, the type is resolved as String.
  - domains:
    - basePath: same as the root one, but overrides it for the domain. This is useful if the root base path is /api/v2 and you want to create a referential domain whose APIs are all located under /api/v2/referentials/. Otherwise, table names would look like referentials_products instead of products.
    - name: Starlake domain name. Groups multiple routes under one specific domain.
    - schemas: applies the include list, then the exclude list. OpenAPI has named schemas. If you are facing an OpenAPI definition from which you want to discard specific schemas, just add them to the exclude list. This may be the case for an API that requires you to list all ids before getting the details of each of them. Conversely, if you are only interested in some schemas, use the include list.
    - routes:
      - paths: all paths to include in the process. If a single regex would be too complex, you may add multiple entries. By default, all paths are taken.
      - as: forces the table name. Make sure that the eligible paths output only one schema. Otherwise, an error is thrown during the process because of the collision.
      - operations: GET or POST. By default, it is GET.
      - exclude: paths to exclude. Paths matching paths are detected first, then the excluded ones are removed.
      - excludeFields: in the response schemas, the API may define some deprecated attributes that you may want to drop. Just add their patterns.
      - explode:
        - on: may be one of:
          - ALL: keeps properties of type object or array.
          - ARRAY: keeps properties of type object. Doesn't dive into array types.
          - OBJECT: keeps properties of type array. If an object is encountered, dives deeper.
        - exclude: list of properties to ignore. Sub-properties are concatenated to their parent with _. For example, postal.address.number becomes postal_address_number.
        - rename: each exploded schema is saved by default to a table named {api_path}_{property_path}. If you don't want that, you can use rename to match a property path to a table name. It's a map of table_name -> property pattern. If you force the renaming to "", then the output table will be api_path. It won't be suffixed with a dangling _.
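Some of the rules described above (selecting paths, then excluding some, removing the base path, and joining sub-properties with `_`) can be sketched in Python. This is only an illustration under assumptions: the function names are hypothetical, patterns are applied with `re.search`, and Starlake's actual matching semantics may differ.

```python
import re
from typing import Iterable, List, Optional


def select_paths(all_paths: Iterable[str],
                 include: Optional[List[str]] = None,
                 exclude: Optional[List[str]] = None) -> List[str]:
    """Keep paths matching any include pattern (all paths if none is given),
    then drop those matching any exclude pattern."""
    include = include or []
    exclude = exclude or []
    selected = [p for p in all_paths
                if not include or any(re.search(rx, p) for rx in include)]
    return [p for p in selected
            if not any(re.search(rx, p) for rx in exclude)]


def strip_base_path(path: str, base_path: str) -> str:
    """Remove the base path prefix before a table name is derived."""
    return path[len(base_path):] if path.startswith(base_path) else path


def flatten_property_path(property_path: str) -> str:
    """Concatenate sub-properties to their parent with an underscore."""
    return property_path.replace(".", "_")


paths = [
    "/api/v2/clients",
    "/api/v2/clients/{id}/details",
    "/api/v2/referentials/products",
]
# Same pattern as in the example configuration.
print(select_paths(paths, include=[r"^/api/v2/clients/\{id}/details$"]))
# A domain-level base path shortens the part used for the table name.
print(strip_base_path("/api/v2/referentials/products", "/api/v2/referentials"))
print(flatten_property_path("postal.address.number"))
```

With the root base path /api/v2, the last route would keep its referentials prefix, which is why a domain-level basePath can be preferable.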
2. Define file connections to the OpenAPI definition file
version: 1
application:
  connections:
    my_open_api:
      type: "fs"
      options:
        path: my_open_api_file.json
path defines the location of the OpenAPI definition. By convention, these files are stored in metadata/extract/openapi, but you are free to specify an absolute location or use a substituted variable.
3. Drop your OpenAPI definition
According to the defined connection, you can drop your OpenAPI definition into metadata/extract/openapi/my_open_api_file.json.
4. Extract
Now you just have to launch the schema extraction. By default it writes to the load folder, but you can specify another one with outputDir.
If a domain or table already exists, the extracted definitions are merged with it.
$ starlake extract-schema --config my_openapi_extract_config
Name normalization
You may need to know how Starlake normalizes names in order to apply the same transformations in your code and get a perfect match.
Table name
Table names are deduced from the path. This is how they are normalized:
- remove path parameters
- remove all accents
- replace non-alphanumeric characters with an underscore
- replace consecutive/trailing underscores
- add an underscore before capitals preceded by a lowercase letter
- add an underscore after a group of capitals
- lowercase everything
Attribute name
Attribute names are normalized as follows:
- remove all accents
- replace non-alphanumeric characters with an underscore
- replace consecutive/trailing underscores
- add an underscore before capitals preceded by a lowercase letter
- add an underscore after a group of capitals
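If you need to reproduce these rules in your own code, the following Python sketch applies the listed steps in order for both table and attribute names. It is an approximation rather than Starlake's implementation, and it makes some assumptions: the function names are hypothetical, consecutive underscores are collapsed into one, leading as well as trailing underscores are stripped, and "a group of capitals" is read as an acronym followed by a capitalized word.

```python
import re
import unicodedata


def _common_steps(name: str) -> str:
    # Remove accents: decompose characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    name = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace non-alphanumeric characters with an underscore.
    name = re.sub(r"[^A-Za-z0-9]", "_", name)
    # Assumption: collapse repeated underscores and strip them at both ends.
    name = re.sub(r"_+", "_", name).strip("_")
    # Add an underscore before a capital preceded by a lowercase letter.
    name = re.sub(r"([a-z])([A-Z])", r"\1_\2", name)
    # Assumption: add an underscore after a group of capitals (acronym + word).
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    return name


def normalize_attribute_name(name: str) -> str:
    """Attribute names keep their case in this sketch."""
    return _common_steps(name)


def normalize_table_name(path: str) -> str:
    """Table names also drop path parameters and are lowercased."""
    without_params = re.sub(r"\{[^}]*\}", "", path)
    return _common_steps(without_params).lower()


print(normalize_table_name("/clients/{id}/postalAddresses"))  # clients_postal_addresses
print(normalize_attribute_name("cr\u00e9ditHTTPCode"))         # credit_HTTP_Code
```

The path passed to `normalize_table_name` is expected to have its base path already removed, as described for basePath earlier.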