Use *inline* caches #263
Replies: 17 comments 19 replies
-
This reminds me of Brunthaler's design -- IIRC he had 64-bit instructions and all the cache data was in that one 64-bit word. Would things get easier if we dropped the requirement that the quickened code and the original code must have the same format? What if there was no longer a simple mapping between addresses in either? (We could keep the original opcode args in the quickened code so we'd never have to refer to the original code except in the debugger, for which purpose we could have a slow but precise mapping back.)
-
If the quickened code and the original code had different formats, we would need additional logic to determine which form we were executing, and a lot of duplication of code in the interpreter.
-
In fact, if we stick to 16 bit instructions, it would make this a lot simpler to implement, especially if we combine it with lazily creating the quickened code. For example, this Python function: `def f(x): return x.attr` compiles to a short sequence around a `LOAD_ATTR` (stripping line numbers and offsets); with the caches inline, the cache entries would sit in the instruction stream alongside the `LOAD_ATTR`. Accessing the cache involves less indirection, has better cache locality, and doesn't require us to keep a separate cache array.
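Since the two disassembly listings did not survive above, here is a stand-in that can be reproduced locally. It uses the real `dis` module on CPython 3.11 or later, where inline caches were eventually adopted; it only illustrates the layout being discussed and is not the listing from the original comment.

```python
# Illustration only: requires CPython 3.11+, where inline caches landed.
import dis

def f(x):
    return x.attr

# show_caches=True displays the inline CACHE entries that follow LOAD_ATTR.
dis.dis(f, show_caches=True)
```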
-
**Memory usage**

For run-once code (classes and modules), the code object is short-lived, so we don't care. For code that is quickened, this will save memory, as we don't need to duplicate the instructions and the cache is smaller. (Table: bytes per instruction, quickened vs. inline layouts.)

So, we might actually save memory if 60% or more of functions are quickened.
-
At 50%, we'd only see a ~10% increase in the size of the code, which is probably only ~5% of the total size of the code objects.
-
One possible downside is that we will need more …
-
I implemented it for …
-
Another downside of performing the specialization in place is that we will need to copy the entire bytecode when adding instrumentation. The copy is necessary, as we have no way to get the original opcode if the quickened instruction does not map back to a unique original instruction. We will need to implement instrumentation before making specialization in place; we can specialize inline before adding instrumentation, though. The initial implementation of instrumentation can be as dumb as instrumenting every instruction.
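To make the non-uniqueness point concrete, here is a toy sketch with made-up opcode names (they are not CPython's actual instruction set): if two different original opcodes can quicken to the same specialized form, a simple reverse table cannot recover the original, which is why a copy of the unquickened bytecode is needed when instrumenting.

```python
# Hypothetical opcode names, for illustration only.
QUICKEN = {
    "BINARY_ADD": "BINARY_ADD_INT",
    "INPLACE_ADD": "BINARY_ADD_INT",   # quickens to the same specialized form
    "LOAD_ATTR": "LOAD_ATTR_INSTANCE",
}

# Reversing the table loses information: BINARY_ADD_INT has two possible originals.
DEOPT = {}
for original, specialized in QUICKEN.items():
    DEOPT.setdefault(specialized, []).append(original)

print(DEOPT["BINARY_ADD_INT"])   # ['BINARY_ADD', 'INPLACE_ADD'] -> ambiguous
```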
-
A few more design comments/clarifications:
-
A few more comments about how to handle pointers.
Going through this case by case for those instructions that need a cached object:
-
Hi! I had experimented with instruction formats quite a bit and can provide some perspective here. TL;DR: My thesis has a dedicated chapter on instruction formats and the easiest and most efficient thing was to use two separate interpreters with two different instruction sets. Here's a quick overview of some of my implementation details:
Why do any of these things make sense?
So much for my rationale; now some thoughts on how I implemented caching. My implementation used a layer of indirection, where the upper half of an optimized instruction pointed into an array that held cached items, e.g., the result of a preceding lookup. Doing a lot of performance experiments, on 2009 hardware, static inline caches, i.e., instructions with direct calls hardwired in, were always considerably faster. Given that 16 bits for opcodes allows for way more instructions, I would easily add ten or more specialized instructions for whatever is reasonably frequent.

Given that you have the possibility to also change the language, an option may also be to require programmers to indicate somewhere that they're overloading Python system procedures. Although I personally do not like such directives, it would probably help to keep a clean, simple, and correct implementation.

Some of the additional details are in my thesis and the multi-level quickening paper that is on arXiv, but was never accepted for publication. I'm happy to do another video conference, and have dug up some source code, so I'd be able to show some of it as well. Hope this helps somehow & all the best,
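The exact encoding is not spelled out above, so the following is a minimal sketch of such an indirection scheme under assumptions of my own: a 32-bit instruction word with the opcode in the low 16 bits and an index into a per-code-object cache array in the upper 16 bits. The actual layout described in the thesis may differ.

```python
# Sketch of caching via indirection: the upper half of an instruction word
# indexes into an array of cached items (assumed encoding, not the thesis').
CACHE = []  # per-code-object array of cached items (e.g. a resolved call target)

def encode(opcode: int, cached_item) -> int:
    """Pack an opcode and a cache-array index into one 32-bit word."""
    CACHE.append(cached_item)
    index = len(CACHE) - 1
    return (index << 16) | (opcode & 0xFFFF)

def decode(word: int):
    """Recover the opcode and the cached item at dispatch time."""
    opcode = word & 0xFFFF
    item = CACHE[word >> 16]
    return opcode, item

word = encode(0x53, "resolved call target")  # 0x53 is an arbitrary opcode number
print(decode(word))
```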
-
Word on the street (e.g. JavaScript, .NET) is indeed that having (at least :-) two separate interpreters makes sense, one that starts quickly and a second one for hot code only. We're not likely to change the language to let programmers indicate optimization opportunities, since most code is legacy code -- we want to speed that up too.
-
I assume "word on the street" is shorthand for "unattributed claim with no empirical evidence" 😉 Swapping contexts from one interpreter to another is not cheap and requires an additional "super interpreter" to determine which interpreter to run at any given point. |
Beta Was this translation helpful? Give feedback.
-
I wouldn't go that far. E.g. here's a write-up from Firefox: https://hacks.mozilla.org/2019/08/the-baseline-interpreter-a-faster-js-interpreter-in-firefox-70/. This seems to go into some detail about data structures shared between the two.
-
Having two interpreters is actually not that difficult, if you can make some simplifying assumptions. Mine were:
If these assumptions hold, or can be agreed upon, then the implementation is relatively easy (glossing over a lot of Python code details, it has been a couple of years, sorry!):
Since we're already getting close to implementation details: I did keep the original bytecodes around in the code object, too, since they did not use a lot of space, and added a new field that holds the instructions in the optimized encoding. The translation from one to the other is straightforward and also described in the thesis. One could also always go back to the default interpreter by resetting the flag and/or the invocation count, of course, but I never needed this. (It could be necessary for tracing/profiling.) PS: I didn't look at the Mozilla post, maybe they describe the exact same thing. Apologies in advance!
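Here is a rough sketch of that hand-off between the two interpreters. The field names, the hotness threshold, and the dictionary-based "code object" are assumptions made for illustration; they are neither the thesis' design nor CPython's API.

```python
# Assumed scheme: count invocations, lazily translate hot code to the optimized
# encoding, keep the original bytecodes, and dispatch to the matching interpreter.
HOT_THRESHOLD = 8

def translate(code):
    """Derive the optimized instruction encoding from the original bytecodes (stub)."""
    return ("optimized",) + tuple(code["instructions"])

def run_default(code, args):
    return f"default interpreter ran {code['name']}{args}"

def run_optimized(code, args):
    return f"optimized interpreter ran {code['name']}{args}"

def call(code, *args):
    code["invocations"] += 1
    if code.get("optimized") is None and code["invocations"] >= HOT_THRESHOLD:
        code["optimized"] = translate(code)   # original bytecodes are kept as well
    if code.get("optimized") is not None:
        return run_optimized(code, args)
    return run_default(code, args)

f = {"name": "f", "instructions": ["LOAD_FAST", "RETURN_VALUE"], "invocations": 0}
for _ in range(10):
    print(call(f, 1))
```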
-
You seem to be assuming that all Python calls map to C calls. That is no longer the case: https://github.com/python/cpython/blob/main/Python/ceval.c#L4628
-
Thanks for the hint. I had indeed assumed that all calls entail an invocation of the C interpreter dispatch routine (which was still true, AFAIR, in Python 3.6, the last version I wanted to forward-port my optimizations to). Glad to see that the current implementation stays in the dispatch loop! There is, however, an easy fix w.r.t. my routine: …
-
We currently use caches which are laid out before the quickened instructions.
Although the mechanism we use to find the cache is reasonably efficient, it isn't as fast as an inline cache.
As always, there are many possible designs, but any mechanism that puts the caches inline has one important advantage and a number of disadvantages.
The advantage:

* … `NOP`s and `RESUME`s.

The disadvantages:

* We could no longer use `frame->f_lasti` directly, but would need to store the instruction pointer, and compute `f_lasti` when needed.

**Outline of a possible design**

32 bit instructions and cache entries. (Currently, instructions are 16 bits and cache entries are 64 bits.)

The overhead of using 32 bit instructions should be lessened by:

* Removing `NOP` and `RESUME` instructions: ~5% reduction.

Pointers can no longer be stored in one cache entry. We can deal with this in three ways.
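The three approaches from the original post are not restated above, so the sketch below shows only one obvious option as an illustration, not necessarily what was proposed or adopted: splitting a 64-bit pointer across two consecutive 32-bit cache entries.

```python
# Assumed layout: a cache is a flat array of 32-bit entries; a pointer occupies
# two adjacent entries (low word first, high word second).
def store_pointer(cache: list, index: int, ptr: int) -> None:
    cache[index] = ptr & 0xFFFFFFFF               # low 32 bits
    cache[index + 1] = (ptr >> 32) & 0xFFFFFFFF   # high 32 bits

def load_pointer(cache: list, index: int) -> int:
    return cache[index] | (cache[index + 1] << 32)

cache = [0] * 4
store_pointer(cache, 0, 0x00007F3A12345678)   # a made-up 64-bit address
assert load_pointer(cache, 0) == 0x00007F3A12345678
```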