This page describes the Schema resolution algorithm, which is standard for all Iglu clients. Currently only the Iglu Scala client fully follows this algorithm; other clients may miss some parts, but we’re working on making their behaviour consistent.
Before going further, it is important to understand the basic Iglu client configuration and essential concepts such as Resolver, Registry (or Repository) and Schema. Here is a quick overview of these concepts; if you’re familiar with them, you may want to skip this section.
Iglu clients are configured via a JSON object described in a dedicated Schema called resolver-config. Here we’ll be using the JSON resolver configuration, which is platform-independent and the most widespread.
Resolver is the primary object of the Iglu Client library. It contains all the logic necessary to fetch a requested Schema from the appropriate registry (repository) and cache it properly. Resolver has two main properties: cache size (cacheSize) and a list of registries.
NOTE: the term repository is deprecated. Registry is the default term to use when referring to Schema storage. We have not yet renamed all occurrences, so for now the two terms can be used interchangeably.
Each registry in the resolver configuration has several values common to all registry types, such as priority. Each registry also has a type, which is defined inside the connection property. The only important thing about the registry type is that each type has its own priority hardcoded inside the client library. Below we’ll refer to this hardcoded priority as classPriority and to the user-defined priority as instancePriority. Usually, the “safer” the registry, the higher its classPriority, so local repositories are preferred over remote ones.
All Iglu clients use an internal cache to store registry responses. Thanks to this cache, it is safe to launch Hadoop/Spark jobs with an embedded Iglu client, as it will not generate an enormous number of IO calls.
The cache stores not just plain Schemas, but also information about the responses from each registry. This allows us to make different decisions depending on what exactly went wrong with a particular request. Once a Schema has been successfully fetched, it is stored until it is evicted by the LRU cache algorithm. This eviction in turn happens only if the cache map has reached its limit (defined in cacheSize) and the particular Schema has gone unrequested for longer than all the others.
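The eviction rule described above can be sketched as a small LRU map. This is a minimal illustration, not the client's actual cache: the class name LruCache, its methods, and the com.acme schema keys are hypothetical, and only cacheSize comes from the text.

```python
from collections import OrderedDict


class LruCache:
    """Minimal LRU map: evicts the least recently requested key when full."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self._entries = OrderedDict()

    def get(self, schema_key):
        if schema_key not in self._entries:
            return None
        # A lookup counts as a request, so mark the key as most recently used.
        self._entries.move_to_end(schema_key)
        return self._entries[schema_key]

    def put(self, schema_key, value):
        self._entries[schema_key] = value
        self._entries.move_to_end(schema_key)
        # Evict only once the configured limit is exceeded.
        while len(self._entries) > self.cache_size:
            self._entries.popitem(last=False)


# Hypothetical schema keys, used only for illustration.
CLICK = "iglu:com.acme/click/jsonschema/1-0-0"
VIEW = "iglu:com.acme/view/jsonschema/1-0-0"
PURCHASE = "iglu:com.acme/purchase/jsonschema/1-0-0"

cache = LruCache(cache_size=2)
cache.put(CLICK, {"type": "object"})
cache.put(VIEW, {"type": "object"})
cache.get(CLICK)                         # CLICK is now the most recently used
cache.put(PURCHASE, {"type": "object"})  # limit exceeded: VIEW is evicted
```

Here VIEW is dropped because it has gone unrequested for longer than the other two entries, while CLICK survives thanks to the intervening lookup.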
Since version 0.5.0, Iglu Scala Client supports the cacheTtl property. It is especially useful for real-time pipelines, as they can otherwise store a “failure” for a very long time; the TTL is a mechanism to ensure that a day's worth of data won’t go to the bad stream. Note, however, that the client also tries to re-resolve successfully fetched schemas. This allows operators to patch (re-upload) schemas without bringing the pipeline down (although this is not recommended).
cacheTtl is available since version 1-0-2 of the resolver config.
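To make the properties described so far more concrete, here is a sketch of a resolver configuration written as a Python dictionary and serialized to JSON, together with a hypothetical helper showing how an entry's age might be compared against cacheTtl. Several details are assumptions not stated in the text: the field names repositories, name and vendorPrefixes, the shapes of the http and embedded connection objects, the com.acme vendor and paths, and the interpretation of cacheTtl as a number of seconds. Check them against the resolver-config Schema before relying on them.

```python
import json

# Illustrative resolver configuration. Only cacheSize, cacheTtl, priority and
# connection are named in the text; the remaining keys are assumptions.
resolver_config = {
    "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-2",
    "data": {
        "cacheSize": 500,
        "cacheTtl": 600,  # assumed to be seconds
        "repositories": [
            {
                "name": "Iglu Central",
                "priority": 0,
                "vendorPrefixes": ["com.snowplowanalytics"],
                "connection": {"http": {"uri": "http://iglucentral.com"}},
            },
            {
                "name": "Embedded schemas",
                "priority": 5,
                "vendorPrefixes": ["com.acme"],
                "connection": {"embedded": {"path": "/iglu-schemas"}},
            },
        ],
    },
}

config_json = json.dumps(resolver_config, indent=2)


def is_expired(stored_at, now, cache_ttl):
    """Hypothetical check: True when a cached entry is older than cache_ttl.

    A cache_ttl of None models a configuration without cacheTtl, in which
    entries leave the cache only through LRU eviction.
    """
    if cache_ttl is None:
        return False
    return now - stored_at > cache_ttl
```

With a TTL in place, both cached failures and cached Schemas older than the limit would be looked up again on the next request.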
Overall, the Schema resolution algorithm can be described by the following flowchart:
A few important things to note:
- If a registry responds with a “NotFound” error, a “missing” value is cached and this registry won’t be queried again until the “missing” value is evicted by the LRU algorithm.
- If a registry responds with an error other than “NotFound” (for example “TimeoutError”, “NetworkError” or “ServerFault”), a “needToRetry” value is cached and the Resolver gives this registry three more chances. After three failed lookups, a “missing” value is cached.
- These “missing” and “needToRetry” values in the cache are per-registry, not per-schema. This means that if registryA responded with “NotFound” for a Schema and registryB responded with TimeoutError, the resolver will immediately abandon registryA and keep trying to query registryB three more times.
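The failure handling in the notes above can be sketched as a small state table keyed by registry. This is one possible reading of the rules, not the client's implementation: the function names, the tuple representation of a cached state, and the choice to keep the table for a single Schema lookup are all assumptions.

```python
MAX_RETRIES = 3  # the "three more chances" granted after the first error


def record_error(states, registry, error):
    """Record a failed lookup for one registry and return its cached state.

    `states` maps a registry name to a (kind, retries_left) tuple, where
    kind is either "missing" or "needToRetry".
    """
    if error == "NotFound":
        states[registry] = ("missing", 0)
    elif registry not in states:
        # First non-NotFound failure: allow three more lookups.
        states[registry] = ("needToRetry", MAX_RETRIES)
    else:
        kind, retries_left = states[registry]
        if kind == "needToRetry" and retries_left > 1:
            states[registry] = ("needToRetry", retries_left - 1)
        else:
            # Retries exhausted (or already missing): stop querying.
            states[registry] = ("missing", 0)
    return states[registry]


def should_query(states, registry):
    """A registry is skipped only once it is cached as "missing"."""
    kind, _ = states.get(registry, ("unknown", 0))
    return kind != "missing"


states = {}
record_error(states, "registryA", "NotFound")      # abandoned immediately
record_error(states, "registryB", "TimeoutError")  # three retries remain
```

After these two calls registryA is no longer queried, while registryB stays eligible until three further lookups have failed.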
For each particular Schema lookup, registries are prioritized. In other words, they are sorted according to the following input parameters (ordered by significance):
- vendorPrefix: the Resolver always looks first into those registries whose vendorPrefix matches the SchemaKey’s vendor. This does not mean that registries with an unmatched vendorPrefix are skipped; it means they are queried last.
- classPriority: a value hardcoded in the client library for each type of registry. Whatever priority was set up in the configuration for a particular registry, even a high one (a low integer value), it is overridden by classPriority, so an embedded repository is always checked before an HTTP one (unless the order is influenced by vendorPrefix).
- instancePriority: a user-defined value. It influences only the order of repositories within the same classPriority.
One important thing to note is that both priorities (classPriority and instancePriority) sort registries in ascending order, so a lower number means a higher priority. Think of it as an ascending list of numbers: [1,2,3,4], where the smaller number always comes first.
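The ordering rules above amount to sorting on a three-part key. The sketch below illustrates this under stated assumptions: the Registry class, the registry names, the concrete classPriority numbers and the use of a string prefix test for vendor matching are all hypothetical and chosen only to show the ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registry:
    name: str
    vendor_prefixes: tuple
    class_priority: int     # hardcoded per registry type (values assumed)
    instance_priority: int  # user-defined "priority" from the configuration


def prioritize(registries, vendor):
    """Sort registries for one lookup; all keys are ascending (lower first)."""

    def sort_key(registry):
        matches = any(vendor.startswith(p) for p in registry.vendor_prefixes)
        # Matching vendor first, then classPriority, then instancePriority.
        return (0 if matches else 1,
                registry.class_priority,
                registry.instance_priority)

    return sorted(registries, key=sort_key)


registries = [
    Registry("remote-central", ("com.snowplowanalytics",), 100, 1),
    Registry("embedded", ("com.snowplowanalytics",), 0, 5),
    Registry("remote-mirror", ("com.snowplowanalytics",), 100, 0),
    Registry("remote-acme", ("com.acme",), 100, 0),
]

ordered = [r.name for r in prioritize(registries, "com.acme")]
```

For a com.acme Schema, remote-acme comes first because its prefix matches. The embedded registry follows because of its lower classPriority, despite its instancePriority of 5. The two remaining HTTP registries are then ordered by instancePriority.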