
Roadmap: Performance enhancements #804

Closed
mpvl opened this issue Mar 2, 2021 · 1 comment
Labels
roadmap High level umbrella issue laying out important roadmap goals

Comments

@mpvl
Contributor

mpvl commented Mar 2, 2021

This is an umbrella issue for all performance-related issues. Below is a small selection of the types of performance improvements being considered.

Background

Graph unification at its core approaches O(n) (in the same way a hash table lookup approaches O(1)). CUE, however, introduces a few concepts that make evaluation more expensive: comprehensions and disjunctions. There is little one can do about comprehensions (especially list comprehensions), but at least they stand out visually and clearly signal complexity.
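For illustration, a typical list comprehension (field names hypothetical); the cost here is linear in the input, but nested comprehensions multiply:

items: [1, 2, 3, 4]
big: [for x in items if x > 1 {x * 2}]  // [4, 6, 8]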

The other source of cost is disjunctions. Naive evaluation of disjunctions can lead to exponential evaluation time. We observed, however, that for most applications, including Protobuf oneofs and disjunctions of Kubernetes types, evaluation is still linear. We discuss possible optimizations below.

Note that CUE also allows references, which makes it, like YAML and XML, susceptible to the “billion laughs” attack. However, although this is not implemented yet in CUE, graph unification allows for an algorithmic trick called structure sharing, which would make it possible to evaluate such configurations in O(n) time again. The current algorithm of CUE is compatible with structure sharing, so implementing this should not be a stretch.
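For illustration, a hypothetical CUE analogue of the attack, where each level references the previous one twice:

x0: {v: "lol"}
x1: {a: x0, b: x0}
x2: {a: x1, b: x1}
x3: {a: x2, b: x2}
// and so on

Thirty such levels expand to over a billion copies of x0 if every reference is recomputed; with structure sharing, each level is evaluated once and its arcs are reused, keeping evaluation O(n).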

Disjunctions

The following optimizations are possible for disjunctions:

Early termination upon ambiguity

v0.2 deployed a strategy of first evaluating defaults and then non-defaults. This allowed for early termination:

  • as soon as all default disjuncts were processed, evaluation could stop if at least one valid default remained;
  • otherwise, evaluation could stop as soon as two valid answers were obtained, as this already implies ambiguity.

The v0.2 implementation had other issues that were addressed in v0.3, and this optimization was dropped in the process. It remains valid for v0.3, however.
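To illustrate the second rule with a hypothetical example:

a: *1 | 2 | 3 | 4 | 5
b: a & >1

The default 1 is eliminated by the constraint. As soon as 2 and 3 are both found to be valid, the result is known to be ambiguous for any consumer that requires a single concrete value, so the disjuncts 4 and 5 need not be evaluated at all.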

Postpone disjuncts with exclusive fields

For disjunctions of the form

#closed: {
    {a: int} | {b: int}
    {c: int} | {d: int}
}

where each top-level field in a disjunct is unique, it suffices to postpone evaluation until all other conjuncts are merged. As the fields are mutually exclusive per disjunction, it suffices to first determine which of these fields occur in the other conjuncts and then select the appropriate disjuncts accordingly. This would completely bypass disjunction expansion.
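Continuing the example above (hypothetical usage), the concrete fields of the other conjunct select one disjunct per disjunction directly:

x: #closed & {a: 1, d: 2}

Here a selects {a: int} from the first disjunction and d selects {d: int} from the second, so the 2×2 cross product of disjuncts never needs to be expanded.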

Mutually exclusive disjuncts / discriminator fields

Disjunctions whose disjuncts do not mutually unify can be handled more efficiently in the general case.

A simple example is the following:

a: 1 | 2 | 3
b: a
b: 1

Evaluating b is akin to a map lookup. Essentially, for each disjunct it takes constant effort to determine whether there is a match.

The same trick can be applied to structs that are mutually exclusive. Consider the following:

{kind: "Service", name: string} | {kind: "Deployment", name: string}

As long as such a disjunction is matched against a struct that specifies a concrete kind field, only one disjunct can match. In this case, we call kind a discriminator field. Just as with the trick for exclusive fields, the non-disjunction conjuncts can be unified beforehand to determine the presence of the discriminator field, after which disjuncts can be eliminated before fully unifying them. These mutually exclusive fields can be discovered during compilation or once when evaluating an arc.
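A hypothetical example of such a lookup:

w: {kind: "Service", name: string} | {kind: "Deployment", name: string}
w: {kind: "Service", name: "frontend"}

Since kind is concrete in the second conjunct, the "Deployment" disjunct can be discarded without unifying it fully, much like a map lookup on the value of kind.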

A more general optimization along these lines is called the QuickCheck (used in LinGo’s graph unification, for instance), where it can lead to considerable speedups. (Even though these implementations of graph unification are O(n), they are often used in a setting where many combinations of graphs are attempted, not unlike what CUE accomplishes with disjunctions within a single unification.) This encompasses detecting “hot” fields within a struct that are likely to fail unification. The indication of a required field (the foo!: bar annotation proposed elsewhere could be used for that), or fields that are mutually exclusive as determined by anti-unification of disjuncts, can be used for this purpose.
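For illustration only, using the proposed (not yet implemented) required-field annotation:

#Service:    {kind!: "Service", name: string}
#Deployment: {kind!: "Deployment", name: string}

Marking kind as required flags it as a “hot” field: a QuickCheck-style test on kind alone can cheaply eliminate a disjunct before full unification is attempted.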

References:
https://profs.info.uaic.ro/~ciortuz/PAPERS/ALL/blue.pdf
https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/pet-a-platform-for-experimentation-with-efficient-hpsg-processing-techniques/676B9162641E3DD8ABB846C0F5B105CF

Structure sharing

Currently, CUE injects the original conjuncts into a new context, triggering recomputation of these values. When these values are not unified with any new conjuncts, however, there is no need to trigger this recomputation: the arcs could simply be shared between the two nodes. The result is a potentially significant reduction in recomputation.
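A hypothetical example of when sharing applies:

base: {a: 1, b: {c: 2}}
x: base           // no new conjuncts: x could share base's arcs directly
y: base & {d: 3}  // a new conjunct: y must be computed as a separate node

Only y triggers recomputation; x could simply point at the arcs of base.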

This technique can make CUE immune to the “billion laughs” attack and guarantee linear evaluation (disregarding comprehensions and complex disjunctions).

One requirement of this technique is the absence of upward references. The current CUE implementation does have upward references, but they are used only in cases that do not impair the use of structure sharing:

  • to retrieve closedness constraints from the parent
  • to keep track of the path descended into with the API

Only the first directly affects evaluation. This is not an issue, however, as not unifying with new nodes also means no new closedness information, meaning that the arc has already passed a closedness test.

The second case should be fairly straightforward to track. The API could detect when an arc is shared and copy the arc structure, keeping a back reference to its original. That would only have to be done one node at a time. Obviously, users of the API still need to avoid blindly traversing all “virtual” nodes of a structure to benefit from structure sharing. The asymmetry in up references can be used to flag a node as “borrowed”.

Structure sharing can also help make the API concurrent.

References:
https://www.researchgate.net/publication/2415772_Quasi-Destructive_Graph_Unification
https://www.aclweb.org/anthology/P00-1045.pdf

General evaluation

Keep sorted lists of arcs

Currently, arcs are maintained as unsorted lists, and matching on labels is done by looping over them. This can be slow for structs with large sets of arcs. Keeping arcs in sorted lists, on the other hand, allows structs to be merged in O(n) time, which is clearly more scalable and is necessary to approximate an overall O(n) time complexity (note that merging sorted lists is O(n)).

Reuse of computation

Right now, updating specific values in a configuration, for instance with Fill, triggers a reevaluation of the entire tree. The plan is to make incremental updates possible. One possibility is to keep track of dependencies, mark nodes as “dirty” upon changes, and recompute the tree while reusing the values of nodes that remain unmarked.
Research in the fields of spreadsheet computation and propagator networks can be helpful in this regard.
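A hypothetical illustration of the dependency tracking involved:

a: 1
b: a + 1
c: b * 2
d: "unrelated"

Updating a, for instance via Fill, marks only b and c as dirty through the dependency chain; d remains unmarked and its value can be reused.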

Delayed error computation

Generating an error can be quite expensive in CUE, as it compiles a lot of information into the message. This is especially an issue in the case of disjunctions, where errors are used to detect which disjuncts to eliminate. The v0.3 data structures allow for a different error model in which an error would consist of just a few pointers to the point of origin, delaying this computation until the error is actually printed or analysed by the user.
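For example (hypothetical):

v: (int | string) & "hello"

The int disjunct fails, but its error serves only to eliminate that disjunct and is never shown to the user. Eagerly building a detailed message for it is wasted work; a few pointers to the point of origin suffice until the error is actually reported.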

Aside from improving performance, this delayed error computation also allows for better error reporting, as it allows for more expensive computations, such as more advanced error filtering and showing more detailed context.

Warren Abstract Machine

This is an optimization that only makes sense once the above optimizations have been addressed.

A Warren Abstract Machine is a concept originating from Prolog, designed to make the evaluation of Prolog more efficient. In the context of CUE it could be especially useful for reducing memory allocations. Right now, CUE performs a considerable amount of allocation. v0.3 already greatly reduces this by keeping reusable per-node buffers, but there is still considerable room for further reducing allocations.

Other improvements

  • special-case list processing
  • reduce size of adt.Vertex
  • optimize boolean computation (an adt.Bool is currently always created, though often not needed)
  • special-case arithmetic of small numbers
  • less memory allocation
@cueckoo

cueckoo commented Jul 3, 2021

This issue has been migrated to cue-lang/cue#804.

For more details about CUE's migration to a new home, please see cue-lang/cue#1078.

@cueckoo cueckoo closed this as completed Jul 3, 2021