Use inline caches #263

markshannon · 2022-02-09T16:37:15Z

markshannon
Feb 9, 2022
Collaborator

We currently use caches which are laid out before the quickened instructions.
Although, the mechanism we use to find the cache is reasonably efficient, it isn't as fast as an inline cache.

As always, there are many possible designs, but any mechanism that puts the caches inline has one important advantage and a number of disadvantages.

The advantage:

Speed. The cache can be accessed with no extra address computation and is in the most cache-friendly location possible.
Since we are breaking the one-to-one relation with the original instructions we can skip NOPs and RESUMEs.

The disadvantages:

Complexity of tracing, profiling. The mapping from the basic instruction pointer to the quickened one is more complex.
Quickening needs to track jump offsets, this might slow quickening a bit.
Cache entries and instructions need to be the same size. This is most likely 32 bits as discussed below, which mean pointers cannot be stored directly in the cache (for 64 bit machines).
We would no longer be able to store frame->f_lasti directly, but would need to store the instruction pointer, and compute f_lasti when needed.

Outline of a possible design

32 bit instructions and cache entries. Currently, instructions are 16 bits and cache entries are 64 bits.

The overhead of using 32 bit instructions should be lessened by:

Superinstructions take 32 bits now, so will be the same size. ~15% reduction.
Can skip NOP and RESUME instructions. ~5% reduction.
Some data that is stored in the cache can be moved to the cache, e.g. the original instruction's oparg.
Any cache that only need 32 or 96bits of data will need less space (about 10% reduction?).

Pointers can no longer be stored in one cache entry. We can deal with this in three ways.

Store the pointer in two 32 bit words. This is a bit clunky and may not be portable.
4Gb is a lot of space. Almost all the pointers we are interested in can fit in a 32 bit address space. May not be portable.
Use a table. Simple and portable, but needs an extra memory read.

gvanrossum · 2022-02-09T17:01:43Z

gvanrossum
Feb 9, 2022
Maintainer

This reminds me of Brunthaler's design -- IIRC he had 64-bit instructions and all the cache data was in that one 64-bit word.

Would things get easier if we dropped the requirement that the quickened code and the original code must have the same format? What if there was no longer a simple mapping between addresses in either? (We could keep the original opcode args in the quickened code so we'd never have to refer to the original code except in the debugger, for which purpose we could have a slow but precise mapping back.)

0 replies

markshannon · 2022-02-17T17:08:35Z

markshannon
Feb 17, 2022
Collaborator Author

If the quickened code and original code has a different format, we would need additional logic to determine which form we were executing and a lot of duplication of code in the interpreter.

0 replies

markshannon · 2022-02-17T17:13:20Z

markshannon
Feb 17, 2022
Collaborator Author

In fact if we stick to 16 bit instructions, it would make this a lot simpler to implement.
Most CPUs can do unaligned loads fast, and even if they can't a bitwise or is very cheap, so spreading 32bit values across two 16bit values should be cheap.
An inline 32 bit loads looks like this: int16_t *ip; int32_t data = (ip[0] | (ip[1] << 16); ip += 2;
Which compiles to a 32 bit load on x86-64: https://godbolt.org/z/8bhz5q4Ko

If we combine this with lazily creating with the co_code attribute, we can generate the quickenable code in the (bytecode) compiler, reducing the runtime cost of quickening

For example, this Python function:

def f(x):
    return x.attr

compiles to (stripping line numbers and offsets):

RESUME                   0
LOAD_FAST                0 (x)
LOAD_ATTR                0 (attr)
RETURN_VALUE

With the caches inline, it would compile to:

RESUME                   0
LOAD_FAST                0 (x)
LOAD_ATTR                0 (attr)   # Includes implicit JUMP_FORWARD 6
CACHE_ENTRY              0
CACHE_ENTRY              1
CACHE_ENTRY              2
CACHE_ENTRY              3
CACHE_ENTRY              4
CACHE_ENTRY              5
CACHE_ENTRY              6
RETURN_VALUE

Accessing the cache involves less indirection, has better cache locality and doesn't require us to keep first_instr in a register.

13 replies

markshannon Mar 23, 2022
Collaborator Author

It would great if you were to provide a PR (assuming @brandtbucher isn't already working on it).
See #310

I'd suggest a keyword-only parameter quickened, e.g. dis.dis(foo, quickened=True)

markshannon Mar 23, 2022
Collaborator Author

There is a _co_code_adaptive attribute on code objects that provides a read-only memory view of the quickened code.

penguin-wwy Mar 24, 2022

I provide a PR (python/cpython#32099).
It will take some time to fix all the tests.
Another question is whether I should use _co_code_adaptive instead of all co_code in dis.py ?

brandtbucher Mar 24, 2022
Maintainer

You should use _co_code_adaptive if quickened=True, and co_code otherwise.

(Also, I prefer adaptive over quickened for the kwarg name. _co_code_adaptive isn't really guaranteed to be quickened.

brandtbucher Mar 24, 2022
Maintainer

Also, you shouldn't need to modify any C code (or existing tests) for this. The failing tests are likely because you changed how co_code is constructed.

markshannon · 2022-02-17T17:22:22Z

markshannon
Feb 17, 2022
Collaborator Author

Memory usage

For run-once code (classes, and modules), the code object is short lived, so we don't care.

For code that is quickened, this will save memory as we don't need to duplicate the instructions and the cache is smaller.
For code that is never quickened, this will consume memory.

Quickened/inline bytes per instructions per instruction

Code	Quickened	Inline cache
Quickened	~7 (2 for copied instructions plus 16 per cache entry)	~4 (~12 per cache entry)
Not quickened	0	~4 (~12 per cache entry)

So, we might actually save memory if 60% or more of functions are quickened.

1 reply

gvanrossum Feb 17, 2022
Maintainer

That would be an easy statistic to gather for the benchmarks, right? (The BM code itself is probably all quickened since BM authors tend to not leave unused code around, but the packages they depend on, such as Django or the logging module, might be a different story. So I am not optimistic that we'll hit 60% in real code.)

markshannon · 2022-02-17T17:43:57Z

markshannon
Feb 17, 2022
Collaborator Author

At 50%, we'd only see a ~10% increase in the size of the code, which is probably only ~5% of the total size of the code objects.

0 replies

markshannon · 2022-02-17T18:35:11Z

markshannon
Feb 17, 2022
Collaborator Author

One possible downside is that we will need more EXTENDED_ARGs before jumps and branches.
It shouldn't be too much of a problem, as we can easily add 32 bit branch/jump instructions with 24 bit operands.

0 replies

markshannon · 2022-02-18T14:20:08Z

markshannon
Feb 18, 2022
Collaborator Author

I implemented it for LOAD_GLOBAL and it seems to work.
All the tests pass apart from the inevitable test_dis failure, although it could do with some tidying up.
https://github.com/python/cpython/compare/main...faster-cpython:inline-cache-load-globals?expand=1

5 replies

arhadthedev Feb 18, 2022

Why MAGIC_NUMBER jumped seven positions up? 3479 seems never used in main.

brandtbucher Feb 18, 2022
Maintainer

Perhaps a cleaner approach could be to add cache entries just before (or maybe during) assembly? That would keep all of this logic together in one place, especially as we move more and more instructions over to the new cache scheme. It also reduces complications during the optimization pass.

(Also, I just realized that this scheme will allow us to quicken instructions that have EXTENDED_ARGs. Nice.)

markshannon Feb 18, 2022
Collaborator Author

Yes, that makes sense. We can treat the inline cache as part of the instruction, much like we do with EXTENDED_ARG.

brandtbucher Feb 18, 2022
Maintainer

Cool. Should we move forward with this? (I have some bandwidth to help out, if needed.)

markshannon Feb 18, 2022
Collaborator Author

Let's finish the work on specializing calls, BINARY_OP, and LOAD_METHOD first.
I don't think we want to be changing the specializations and doing this at the same time.

markshannon · 2022-02-21T18:04:56Z

markshannon
Feb 21, 2022
Collaborator Author

Another downside of performing the specialization in place is that we will need to copy the entire bytecode when adding instrumentation for sys.settrace and PEP 669.
Even if PEP 669 isn't accepted, we still need a way to support tracing and profiling.

The copy is necessary, as we have no way to get the original opcode if the quickened instruction does not map back to a unique original instruction. E.g. a DO_TRACING instruction could map to any instruction, so we need a copy of the bytecode to get the original instruction. Likewise for a BREAKPOINT instruction.

We will need to implement instrumentation before making specialization in place. We can specialize inline before adding instrumentation, though. The initial implementation of instrumentation can be as dumb as making every instruction DO_TRACING.

0 replies

markshannon · 2022-02-24T12:23:56Z

markshannon
Feb 24, 2022
Collaborator Author

A few more design comments/clarifications:

We should quicken in place, it saves a lot of copying and allocation.
When instrumenting, we will need to create an array of the original instructions. It adds 50% to the size of the code but it is either uncommon (breakpoints) or we don't care overly much about performance (coverage).
We should change pointer caches to indexes into an array. We can allocate the array when quickening, this saves space when not quickening without using much extra space when quickened. It is also a lot simpler than trying to break up pointers into 16 bit chunks.

0 replies

markshannon · 2022-03-01T12:21:14Z

markshannon
Mar 1, 2022
Collaborator Author

A few more comments about how to handle pointers.

We can either store the pointer at the point of use, or
Store the pointer on an object that is hot at the point of use.

Going through this case by case for those instructions that need a cached object:

`BINARY_SUBSCR_GET_ITEM`

Add a getitem_cache to PyHeapTypeObect, no need to local cache or index.
This also applies to possible future specializations for STORE_SUBSCR, LOAD_ATTR, etc for overridden special methods.

`LOAD_METHOD` (various specializations)

These need to be stored at the point of use.

`CALL` (specializations for `isinstance`, `len`, `list.append`)

We can cache these objects on the interpreter state or, better still, make them static.

`CALL` Python class.

We need to cache the __init__ method. We can add this to the PyHeapTypeObect as well.

Since we only need a locally cached object for LOAD_METHOD, then it probably makes sense to store the pointer inline.

0 replies

sbrunthaler · 2022-03-15T20:07:17Z

sbrunthaler
Mar 15, 2022

Hi!

I had experimented with instruction formats quite a bit and can provide some perspective here. TL;DR: My thesis has a dedicated chapter on instruction formats and the easiest and most efficient thing was to use two separate interpreters with two different instruction sets.

Here's a quick overview of some of my implementation details:

Using two interpreters, an optimizing interpreter (Io) and the default interpreter (Id), with the optimizing interpreter doing the profiling and calling the default interpreter for infrequently called functions. The additional call does not matter for cold code.
The optimizing interpreter using a word-sized instruction format, with the lower 32 bits divided into two 16 bit halves for opcode and oparg, and the upper 32 bits used for caching purposes (and some other uses). The PhD thesis has only oparg and opcode in two 32 bit halves, but the MLQ interpreter uses the upper 32 bits for something useful.
The Python code object holding an additional pointer to the optimized instruction format in addition to the standard code format.

Why does any of these things make sense?

Having two interpreters means that we only have to spend additional memory on where we actually need it. Since hot code is much less frequent and most real world programs use a lot more memory for data and not for code, additional memory requirements are low in practice. Additional memory cost everybody pays then is for the fixed part added to the Python executable (AFAIR about 100kb).
Having the default interpreter around is handy for all kinds of reasons, e.g., for tracing or profiling.
Having the optimized instruction format being completely transparent gives additional degrees of freedom and buys flexibility down the road.

So much for my rationale, now some thoughts on how I implemented caching.

My implementation used a layer of indirection, where the upper half of an optimized instruction pointed into an array that held cached items, e.g., results of an LOAD_GLOBAL instruction. Similar things also worked for BINARY_SUBSCR. A key advantage of having this layer of indirection is that it allows for simple cache invalidation: A STORE_GLOBAL just erases all cached data. If some of these data are stored directly inline, then invalidating becomes expensive. To avoid this overhead, and guarantee correctness, this level of indirection was tremendously helpful.

Doing a lot of performance experiments, on 2009 hardware, static inline caches by having instructions with direct calls hardwired in were always considerably faster. Given that 16 bits for opcodes allows for way more instructions, I would easily add ten or more specialized instructions for whatever is reasonably frequent. For the CALL instruction derivatives (which is what I called optimized instructions derived from the generic ones) outlined above, I did not do something along these lines for the same reason I had the layer of indirection: If somewhere a Python program re-defines a function somewhere (mostly identifier overloading), it maybe either hard to invalidate, or require expensive sanity checks in the interpreter instruction execution.

Given that you have the possibility to also change the language, an option may be to also to require programmers to indicate somewhere that they're overloading Python system procedures. Although I personally do not like such directives, it would probably help to keep a clean, simple, and correct implementation.

Some of the additional details are in my thesis and the multi-level quickening paper that is on arxiv, but was never accepted for publication. I'm happy to do another video conference, and have dug up some source code, so I'd be able to show some source code as well.

Hope this helps somehow & all the best,
--stefan

0 replies

gvanrossum · 2022-03-15T21:10:29Z

gvanrossum
Mar 15, 2022
Maintainer

Word on the street (e.g. JavaScript, .NET) is indeed that having (at least :-) two separate interpreters makes sense, one that starts quickly and a second one for hot code only.

We're not likely to change the language to let programmers indicate optimization opportunities, since most code is legacy code -- we want to speed that up too.

0 replies

markshannon · 2022-03-16T00:10:41Z

markshannon
Mar 16, 2022
Collaborator Author

I assume "word on the street" is shorthand for "unattributed claim with no empirical evidence" 😉

Swapping contexts from one interpreter to another is not cheap and requires an additional "super interpreter" to determine which interpreter to run at any given point.

0 replies

gvanrossum · 2022-03-16T00:48:38Z

gvanrossum
Mar 16, 2022
Maintainer

I wouldn't go that far. E.g. here's a write-up from Firefox: https://hacks.mozilla.org/2019/08/the-baseline-interpreter-a-faster-js-interpreter-in-firefox-70/. This seems to go into some detail about data structures shared between the two.

0 replies

sbrunthaler · 2022-03-16T19:41:15Z

sbrunthaler
Mar 16, 2022

Having two interpreters is actually not that difficult, if you can make some simplifying assumptions. Mine were:

Once you move to the optimized interpreter, never go back to the un-optimized, default interpreter.
Switching from un-optimized (Id) to optimized (Io) is decided either at function call boundary (i.e, when entering PyEval_EvalFrameEx), or when executing a hot loop in the default interpreter. (Though the latter one was a pain in the neck and I'm not really sure it's needed for idiomatic Python source code, so I would only implement this if it were absolutely needed.)

If these assumptions hold, or can be agreed upon, then the implementation is relatively easy (glossing over a lot of Python code details, it has been a couple of years, sorry!):

PyObject *PyEval_EvalFrameEx(PyFrameObject *frame) { // this is the optimized interpreter Io

  PyCodeObject *co= frame->co_code;
  if (co->optimized_flag == 0) {
    if (co->invocation_count++ < 10'000) {
       return PyEval_EvalFrameEx_SysDefault(frame); // the SysDefault is the original default interpreter Id
    } // if
    co->optimized_flag= 1;
    optimize_instrs(co);
  } // if
  ...
  // here you can already use the optimized instruction format...
}

Since we're already getting close to implementation details. I did keep the original bytecodes in the code object around, too, since they did not use a lot of space, and added a new field that holds the instructions in the optimized instruction encoding. The translation from one to the other is straightforward and also described in the thesis. One could, also always go back to the default interpreter, by resetting the flag and/or the invocation count, of course, but I never needed this. (Could be necessary/required for tracing/profiling.)

PS: I didn't look at the Mozilla post, maybe they describe the exact same thing. Apologies in advance!

0 replies

markshannon · 2022-03-18T17:25:35Z

markshannon
Mar 18, 2022
Collaborator Author

You seem to be assuming that all Python calls map to C calls. That is no longer the case: https://github.com/python/cpython/blob/main/Python/ceval.c#L4628

0 replies

sbrunthaler · 2022-03-18T18:34:58Z

sbrunthaler
Mar 18, 2022

Thanks for the hint, I had indeed assumed that all calls entail an invocation of the C interpreter dispatch routine (which was still true, AFAIR, in Python 3.6, which was the last version I wanted to forward port my optimizations to). Glad to see that the current implementation stays in the dispatch loop!

There is, however, an easy fix w.r.t. to my routine: In the CALL function one would be able to back out to the system default interpreter, if profiling indicates it isn't hot code (the additional function call wouldn't impact performance noticeably, I think). Conversely, if the code in question becomes hot in the CALL function, one could easily fold in the code I showed above. (Assuming, of course, that I don't miss any other important information.)

0 replies

Use *inline* caches #263

markshannon Feb 9, 2022 Collaborator

Outline of a possible design

Replies: 17 comments · 19 replies

gvanrossum Feb 9, 2022 Maintainer

markshannon Feb 17, 2022 Collaborator Author

markshannon Feb 17, 2022 Collaborator Author

markshannon Mar 23, 2022 Collaborator Author

markshannon Mar 23, 2022 Collaborator Author

penguin-wwy Mar 24, 2022

brandtbucher Mar 24, 2022 Maintainer

brandtbucher Mar 24, 2022 Maintainer

markshannon Feb 17, 2022 Collaborator Author

Memory usage

Quickened/inline bytes per instructions per instruction

gvanrossum Feb 17, 2022 Maintainer

markshannon Feb 17, 2022 Collaborator Author

markshannon Feb 17, 2022 Collaborator Author

markshannon Feb 18, 2022 Collaborator Author

arhadthedev Feb 18, 2022

brandtbucher Feb 18, 2022 Maintainer

markshannon Feb 18, 2022 Collaborator Author

brandtbucher Feb 18, 2022 Maintainer

markshannon Feb 18, 2022 Collaborator Author

markshannon Feb 21, 2022 Collaborator Author

markshannon Feb 24, 2022 Collaborator Author

markshannon Mar 1, 2022 Collaborator Author

BINARY_SUBSCR_GET_ITEM

LOAD_METHOD (various specializations)

CALL (specializations for isinstance, len, list.append)

CALL Python class.

sbrunthaler Mar 15, 2022

gvanrossum Mar 15, 2022 Maintainer

markshannon Mar 16, 2022 Collaborator Author

gvanrossum Mar 16, 2022 Maintainer

sbrunthaler Mar 16, 2022

markshannon Mar 18, 2022 Collaborator Author

sbrunthaler Mar 18, 2022

Use inline caches #263

markshannon
Feb 9, 2022
Collaborator

Replies: 17 comments 19 replies

gvanrossum
Feb 9, 2022
Maintainer

markshannon
Feb 17, 2022
Collaborator Author

markshannon
Feb 17, 2022
Collaborator Author

markshannon Mar 23, 2022
Collaborator Author

markshannon Mar 23, 2022
Collaborator Author

brandtbucher Mar 24, 2022
Maintainer

brandtbucher Mar 24, 2022
Maintainer

markshannon
Feb 17, 2022
Collaborator Author

gvanrossum Feb 17, 2022
Maintainer

markshannon
Feb 17, 2022
Collaborator Author

markshannon
Feb 17, 2022
Collaborator Author

markshannon
Feb 18, 2022
Collaborator Author

brandtbucher Feb 18, 2022
Maintainer

markshannon Feb 18, 2022
Collaborator Author

brandtbucher Feb 18, 2022
Maintainer

markshannon Feb 18, 2022
Collaborator Author

markshannon
Feb 21, 2022
Collaborator Author

markshannon
Feb 24, 2022
Collaborator Author

markshannon
Mar 1, 2022
Collaborator Author

`BINARY_SUBSCR_GET_ITEM`

`LOAD_METHOD` (various specializations)

`CALL` (specializations for `isinstance`, `len`, `list.append`)

`CALL` Python class.

sbrunthaler
Mar 15, 2022

gvanrossum
Mar 15, 2022
Maintainer

markshannon
Mar 16, 2022
Collaborator Author

gvanrossum
Mar 16, 2022
Maintainer

sbrunthaler
Mar 16, 2022

markshannon
Mar 18, 2022
Collaborator Author

sbrunthaler
Mar 18, 2022