bpo-31650: PEP 552 (Deterministic pycs) implementation #4575

Merged
merged 14 commits into from
Dec 9, 2017
6 changes: 6 additions & 0 deletions Doc/glossary.rst
Original file line number Diff line number Diff line change
@@ -458,6 +458,12 @@ Glossary
is believed that overcoming this performance issue would make the
implementation much more complicated and therefore costlier to maintain.


hash-based pyc
A bytecode cache file that uses the hash rather than the last-modified
time of the corresponding source file to determine its validity. See
:ref:`pyc-invalidation`.

hashable
An object is *hashable* if it has a hash value which never changes during
its lifetime (it needs a :meth:`__hash__` method), and can be compared to
36 changes: 33 additions & 3 deletions Doc/library/compileall.rst
@@ -83,6 +83,16 @@ compile Python sources.
If ``0`` is used, then the result of :func:`os.cpu_count()`
will be used.

.. cmdoption:: --invalidation-mode [timestamp|checked-hash|unchecked-hash]

Control how the generated pycs will be invalidated at runtime. The default
setting, ``timestamp``, means that ``.pyc`` files with the source timestamp
and size embedded will be generated. The ``checked-hash`` and
``unchecked-hash`` values cause hash-based pycs to be generated. Hash-based
pycs embed a hash of the source file contents rather than a timestamp. See
:ref:`pyc-invalidation` for more information on how Python validates bytecode
cache files at runtime.
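
As an illustration of the option described above, the following sketch invokes the ``compileall`` command line through :mod:`subprocess` and checks that a ``.pyc`` file was produced. The module name ``mod.py`` and the temporary directory are assumptions made for the example only.

```python
import importlib.util
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mod.py")  # hypothetical module name
    with open(src, "w") as f:
        f.write("VALUE = 42\n")

    # Equivalent to running:
    #   python -m compileall -q --invalidation-mode unchecked-hash <tmp>
    proc = subprocess.run(
        [sys.executable, "-m", "compileall", "-q",
         "--invalidation-mode", "unchecked-hash", tmp]
    )
    returncode = proc.returncode

    # compileall writes to the PEP 3147 cache location for this interpreter.
    pyc_path = importlib.util.cache_from_source(src)
    pyc_exists = os.path.exists(pyc_path)

print(returncode, pyc_exists)
```

Passing ``checked-hash`` or ``timestamp`` instead selects the other two modes.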

.. versionchanged:: 3.2
Added the ``-i``, ``-b`` and ``-h`` options.

@@ -91,6 +101,9 @@ compile Python sources.
was changed to a multilevel value. ``-b`` will always produce a
byte-code file ending in ``.pyc``, never ``.pyo``.

.. versionchanged:: 3.7
Added the ``--invalidation-mode`` parameter.


There is no command-line option to control the optimization level used by the
:func:`compile` function, because the Python interpreter itself already
@@ -99,7 +112,7 @@ provides the option: :program:`python -O -m compileall`.
Public functions
----------------

.. function:: compile_dir(dir, maxlevels=10, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, workers=1)
.. function:: compile_dir(dir, maxlevels=10, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, workers=1, invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)

Recursively descend the directory tree named by *dir*, compiling all :file:`.py`
files along the way. Return a true value if all the files compiled successfully,
@@ -140,6 +153,10 @@ Public functions
then sequential compilation will be used as a fallback. If *workers* is
lower than ``0``, a :exc:`ValueError` will be raised.

*invalidation_mode* should be a member of the
:class:`py_compile.PycInvalidationMode` enum and controls how the generated
pycs are invalidated at runtime.

.. versionchanged:: 3.2
Added the *legacy* and *optimize* parameter.

@@ -156,7 +173,10 @@ Public functions
.. versionchanged:: 3.6
Accepts a :term:`path-like object`.

.. function:: compile_file(fullname, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1)
.. versionchanged:: 3.7
The *invalidation_mode* parameter was added.

.. function:: compile_file(fullname, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)

Compile the file with path *fullname*. Return a true value if the file
compiled successfully, and a false value otherwise.
@@ -184,6 +204,10 @@ Public functions
*optimize* specifies the optimization level for the compiler. It is passed to
the built-in :func:`compile` function.

*invalidation_mode* should be a member of the
:class:`py_compile.PycInvalidationMode` enum and controls how the generated
pycs are invalidated at runtime.

.. versionadded:: 3.2

.. versionchanged:: 3.5
@@ -193,7 +217,10 @@ Public functions
The *legacy* parameter only writes out ``.pyc`` files, not ``.pyo`` files
no matter what the value of *optimize* is.

.. function:: compile_path(skip_curdir=True, maxlevels=0, force=False, quiet=0, legacy=False, optimize=-1)
.. versionchanged:: 3.7
The *invalidation_mode* parameter was added.

.. function:: compile_path(skip_curdir=True, maxlevels=0, force=False, quiet=0, legacy=False, optimize=-1, invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP)

Byte-compile all the :file:`.py` files found along ``sys.path``. Return a
true value if all the files compiled successfully, and a false value otherwise.
@@ -213,6 +240,9 @@ Public functions
The *legacy* parameter only writes out ``.pyc`` files, not ``.pyo`` files
no matter what the value of *optimize* is.

.. versionchanged:: 3.7
The *invalidation_mode* parameter was added.
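
A minimal sketch of passing *invalidation_mode* to :func:`compile_dir`, assuming a throwaway directory with two hypothetical source files (``a.py`` and ``b.py``):

```python
import compileall
import importlib.util
import os
import py_compile
import tempfile

names = ("a.py", "b.py")  # hypothetical file names for the example

with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        with open(os.path.join(tmp, name), "w") as f:
            f.write("VALUE = 1\n")

    # Generate checked hash-based pycs for every .py file under tmp.
    ok = compileall.compile_dir(
        tmp,
        quiet=1,
        invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
    )

    # Record which sources now have a cache file next to them.
    compiled = sorted(
        name for name in names
        if os.path.exists(
            importlib.util.cache_from_source(os.path.join(tmp, name))
        )
    )

print(bool(ok), compiled)
```

:func:`compile_file` and :func:`compile_path` accept the same keyword argument.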

To force a recompile of all the :file:`.py` files in the :file:`Lib/`
subdirectory and all its subdirectories::

11 changes: 11 additions & 0 deletions Doc/library/importlib.rst
@@ -67,6 +67,9 @@ generically as an :term:`importer`) to participate in the import process.
:pep:`489`
Multi-phase extension module initialization

:pep:`552`
Deterministic pycs

:pep:`3120`
Using UTF-8 as the Default Source Encoding

@@ -1327,6 +1330,14 @@ an :term:`importer`.
.. versionchanged:: 3.6
Accepts a :term:`path-like object`.

.. function:: source_hash(source_bytes)

Return the hash of *source_bytes* as bytes. A hash-based ``.pyc`` file embeds
the :func:`source_hash` of the corresponding source file's contents in its
header.

.. versionadded:: 3.7
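
For illustration, a short sketch of calling :func:`source_hash` on an in-memory source string (the source text is an arbitrary example). The result is a fixed-length bytes object that depends only on the input bytes and the interpreter's bytecode magic number:

```python
import importlib.util

source = b"print('hello')\n"  # arbitrary example source
digest = importlib.util.source_hash(source)

# The same bytes always produce the same digest within one interpreter.
same_again = importlib.util.source_hash(source)
different = importlib.util.source_hash(b"print('goodbye')\n")

print(len(digest), digest == same_again, digest == different)
```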

.. class:: LazyLoader(loader)

A class which postpones the execution of the loader of a module until the
41 changes: 40 additions & 1 deletion Doc/library/py_compile.rst
@@ -27,7 +27,7 @@ byte-code cache files in the directory containing the source code.
Exception raised when an error occurs while attempting to compile the file.


.. function:: compile(file, cfile=None, dfile=None, doraise=False, optimize=-1)
.. function:: compile(file, cfile=None, dfile=None, doraise=False, optimize=-1, invalidation_mode=PycInvalidationMode.TIMESTAMP)

Compile a source file to byte-code and write out the byte-code cache file.
The source code is loaded from the file named *file*. The byte-code is
@@ -53,6 +53,10 @@ byte-code cache files in the directory containing the source code.
:func:`compile` function. The default of ``-1`` selects the optimization
level of the current interpreter.

*invalidation_mode* should be a member of the :class:`PycInvalidationMode`
enum and controls how the generated ``.pyc`` files are invalidated at
runtime.

.. versionchanged:: 3.2
Changed default value of *cfile* to be :PEP:`3147`-compliant. Previous
default was *file* + ``'c'`` (``'o'`` if optimization was enabled).
@@ -65,6 +69,41 @@ byte-code cache files in the directory containing the source code.
caveat that :exc:`FileExistsError` is raised if *cfile* is a symlink or
non-regular file.

.. versionchanged:: 3.7
The *invalidation_mode* parameter was added as specified in :pep:`552`.


.. class:: PycInvalidationMode

An enumeration of possible methods the interpreter can use to determine
whether a bytecode file is up to date with a source file. The ``.pyc`` file
indicates the desired invalidation mode in its header. See
:ref:`pyc-invalidation` for more information on how Python invalidates
``.pyc`` files at runtime.

.. versionadded:: 3.7

.. attribute:: TIMESTAMP

The ``.pyc`` file includes the timestamp and size of the source file,
which Python will compare against the metadata of the source file at
runtime to determine if the ``.pyc`` file needs to be regenerated.

.. attribute:: CHECKED_HASH

The ``.pyc`` file includes a hash of the source file content, which Python
will compare against the source at runtime to determine if the ``.pyc``
file needs to be regenerated.

.. attribute:: UNCHECKED_HASH

Like :attr:`CHECKED_HASH`, the ``.pyc`` file includes a hash of the source
file content. However, Python will at runtime assume the ``.pyc`` file is
up to date and not validate the ``.pyc`` against the source file at all.

This option is useful when the ``.pyc`` files are kept up to date by some
system external to Python, such as a build system.
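
The three modes can be told apart by the flags word in the ``.pyc`` header. The sketch below compiles a tiny module once per mode and reads that word back. It assumes the :pep:`552` header layout (4-byte magic number followed by a 4-byte little-endian flags field); the helper name ``header_flags`` is hypothetical.

```python
import os
import py_compile
import struct
import tempfile


def header_flags(mode):
    """Compile a tiny module with *mode* and return the pyc flags field."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "example.py")
        with open(src, "w") as f:
            f.write("x = 1\n")
        pyc = py_compile.compile(
            src, cfile=src + "c", doraise=True, invalidation_mode=mode
        )
        with open(pyc, "rb") as f:
            header = f.read(16)
    # Bytes 4..8 hold a little-endian 32-bit flags word (PEP 552):
    # bit 0 marks a hash-based pyc, bit 1 marks it as checked.
    (flags,) = struct.unpack("<I", header[4:8])
    return flags


flags_by_mode = {
    mode.name: header_flags(mode) for mode in py_compile.PycInvalidationMode
}
print(flags_by_mode)
```

A flags value of ``0`` indicates a timestamp-based file, ``0b01`` an unchecked hash-based file, and ``0b11`` a checked hash-based file.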


.. function:: main(args=None)

27 changes: 27 additions & 0 deletions Doc/reference/import.rst
@@ -675,6 +675,33 @@ Here are the exact rules used:
:meth:`~importlib.abc.Loader.module_repr` method, if defined, before
trying either approach described above. However, the method is deprecated.

.. _pyc-invalidation:

Cached bytecode invalidation
----------------------------

Before Python loads cached bytecode from a ``.pyc`` file, it checks whether the
cache is up-to-date with the source ``.py`` file. By default, Python does this
by storing the source's last-modified timestamp and size in the cache file when
writing it. At runtime, the import system then validates the cache file by
checking the stored metadata in the cache file against the source's
metadata.

Python also supports "hash-based" cache files, which store a hash of the source
file's contents rather than its metadata. There are two variants of hash-based
``.pyc`` files: checked and unchecked. For checked hash-based ``.pyc`` files,
Python validates the cache file by hashing the source file and comparing the
resulting hash with the hash in the cache file. If a checked hash-based cache
file is found to be invalid, Python regenerates it and writes a new checked
hash-based cache file. For unchecked hash-based ``.pyc`` files, Python simply
assumes the cache file is valid if it exists. The validation behavior of
hash-based ``.pyc`` files may be overridden with the
:option:`--check-hash-based-pycs` flag.

.. versionchanged:: 3.7
Added hash-based ``.pyc`` files. Previously, Python only supported
timestamp-based invalidation of bytecode caches.
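
The check performed for a checked hash-based ``.pyc`` file can be approximated in a few lines. This is a simplified sketch, not the import system's actual code: it assumes the :pep:`552` header layout (flags in bytes 4 to 8, source hash in bytes 8 to 16), and the helper name ``hash_pyc_is_current`` is hypothetical.

```python
import importlib.util
import os
import py_compile
import tempfile


def hash_pyc_is_current(source_path, pyc_path):
    """Return True if the hash stored in a hash-based pyc matches the source."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    flags = int.from_bytes(header[4:8], "little")
    if not flags & 0b1:
        raise ValueError("not a hash-based pyc")
    with open(source_path, "rb") as f:
        source_bytes = f.read()
    # Bytes 8..16 of the header hold the hash of the source contents.
    return header[8:16] == importlib.util.source_hash(source_bytes)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")
    with open(src, "w") as f:
        f.write("x = 1\n")
    pyc = py_compile.compile(
        src,
        cfile=src + "c",
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
    )
    fresh = hash_pyc_is_current(src, pyc)

    # Changing the source contents makes the stored hash stale.
    with open(src, "w") as f:
        f.write("x = 2\n")
    stale = not hash_pyc_is_current(src, pyc)

print(fresh, stale)
```

Unlike a timestamp comparison, this check depends only on the file contents, so touching the source without changing it leaves the cache valid.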


The Path Based Finder
=====================
14 changes: 14 additions & 0 deletions Doc/using/cmdline.rst
@@ -210,6 +210,20 @@ Miscellaneous options
import of source modules. See also :envvar:`PYTHONDONTWRITEBYTECODE`.


.. cmdoption:: --check-hash-based-pycs default|always|never

Control the validation behavior of hash-based ``.pyc`` files. See
:ref:`pyc-invalidation`. When set to ``default``, checked and unchecked
hash-based bytecode cache files are validated according to their default
semantics. When set to ``always``, all hash-based ``.pyc`` files, whether
checked or unchecked, are validated against their corresponding source
file. When set to ``never``, hash-based ``.pyc`` files are not validated
against their corresponding source files.

The semantics of timestamp-based ``.pyc`` files are unaffected by this
option.
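
The difference between ``default`` and ``always`` can be observed with an unchecked hash-based ``.pyc`` whose source has since changed. The following is a sketch under stated assumptions: the module name ``pep552_demo_mod`` and the helper ``run_import`` are hypothetical, and child interpreters are started with the temporary directory as their working directory so that the module is importable.

```python
import os
import py_compile
import subprocess
import sys
import tempfile


def run_import(cwd, *interp_args):
    """Import the demo module in a child interpreter and return its VALUE."""
    proc = subprocess.run(
        [sys.executable, *interp_args, "-c",
         "import pep552_demo_mod; print(pep552_demo_mod.VALUE)"],
        cwd=cwd, stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return proc.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "pep552_demo_mod.py")
    with open(src, "w") as f:
        f.write("VALUE = 1\n")
    py_compile.compile(
        src, doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
    )

    # Change the source without regenerating the pyc.
    with open(src, "w") as f:
        f.write("VALUE = 2\n")

    # Default semantics: the unchecked pyc is trusted, so the old value is seen.
    default_result = run_import(tmp)
    # "always" forces validation, so the changed source is recompiled.
    always_result = run_import(tmp, "--check-hash-based-pycs", "always")

print(default_result, always_result)
```

The ``default`` run is performed first on purpose: the ``always`` run detects the stale cache and may rewrite it, after which later imports would see the new value under any setting.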


.. cmdoption:: -d

Turn on parser debugging output (for expert only, depending on compilation
27 changes: 27 additions & 0 deletions Doc/whatsnew/3.7.rst
@@ -197,6 +197,33 @@ variable is not set in practice.

See :option:`-X` ``dev`` for the details.

Hash-based pycs
---------------

Python has traditionally checked the up-to-dateness of bytecode cache files
(i.e., ``.pyc`` files) by comparing the source metadata (last-modified timestamp
and size) with source metadata saved in the cache file header when it was
generated. While effective, this invalidation method has its drawbacks. When
filesystem timestamps are too coarse, Python can miss source updates, leading to
user confusion. Additionally, having a timestamp in the cache file is
problematic for `build reproducibility <https://reproducible-builds.org/>`_ and
content-based build systems.

:pep:`552` extends the pyc format to allow the hash of the source file to be
used for invalidation instead of the source timestamp. Such ``.pyc`` files are
called "hash-based". By default, Python still uses timestamp-based invalidation
and does not generate hash-based ``.pyc`` files at runtime. Hash-based ``.pyc``
files may be generated with :mod:`py_compile` or :mod:`compileall`.

Hash-based ``.pyc`` files come in two variants: checked and unchecked. Python
validates checked hash-based ``.pyc`` files against the corresponding source
files at runtime but doesn't do so for unchecked hash-based pycs. Unchecked
hash-based ``.pyc`` files are a useful performance optimization for environments
where a system external to Python (e.g., the build system) is responsible for
keeping ``.pyc`` files up-to-date.

See :ref:`pyc-invalidation` for more information.


Other Language Changes
======================
6 changes: 6 additions & 0 deletions Include/internal/hash.h
@@ -0,0 +1,6 @@
#ifndef Py_INTERNAL_HASH_H
#define Py_INTERNAL_HASH_H

uint64_t _Py_KeyedHash(uint64_t, const char *, Py_ssize_t);

#endif
6 changes: 6 additions & 0 deletions Include/internal/import.h
@@ -0,0 +1,6 @@
#ifndef Py_INTERNAL_IMPORT_H
#define Py_INTERNAL_IMPORT_H

extern const char *_Py_CheckHashBasedPycsMode;

#endif
9 changes: 8 additions & 1 deletion Include/pygetopt.h
@@ -12,7 +12,14 @@ PyAPI_DATA(wchar_t *) _PyOS_optarg;

PyAPI_FUNC(void) _PyOS_ResetGetOpt(void);

PyAPI_FUNC(int) _PyOS_GetOpt(int argc, wchar_t **argv, wchar_t *optstring);
typedef struct {
const wchar_t *name;
int has_arg;
int val;
} _PyOS_LongOption;

PyAPI_FUNC(int) _PyOS_GetOpt(int argc, wchar_t **argv, wchar_t *optstring,
const _PyOS_LongOption *longopts, int *longindex);
#endif /* !Py_LIMITED_API */

#ifdef __cplusplus