gh-120771: Refactor xml.etree.ElementTree._namespaces #120772

danifus · 2024-06-20T02:45:48Z

The refactor also makes _namespaces slightly faster in most scenarios except when more than ~85% of tags are unique and qualified with a namespace. Above 85% it performs similarly to the current implementation. See below for the benchmark code.

import timeit
import xml.etree.ElementTree as ET

# Element = ET._Element_Py
Element = ET.Element

def print_res(name, results):
    results = ", ".join([f"{t:0.6f}" for t in results])
    print(f"{name} [{results}]")


def low_el_cnt_low_cardinality():
    e = Element("root")
    for i in range(10):
        e.append(Element(f"tagtag{i//2}"))

    def run_namespaces():
        ET._namespaces(e)
    print_res("no ns, low el cnt, unique low", timeit.repeat(run_namespaces, number=10000))


def high_el_cnt_high_cardinality():
    e = Element("root")
    for i in range(1000):
        e1 = Element(f"tagtag{i}")
        e.append(e1)

    def run_namespaces():
        ET._namespaces(e)
    print_res("no ns, high el cnt, unique high", timeit.repeat(run_namespaces, number=10000))


def ns_high_el_cnt_cardinality(unique_ratio, el_cnt=1000):
    e = Element("{one}root")
    for i in range(el_cnt):
        if i < el_cnt * (1 - unique_ratio):
            e1 = Element(f"{{nsns1}}tagtag1")
        else:
            e1 = Element(f"{{nsns1{i}}}tagtag{i}")
        e.append(e1)

    def run_namespaces():
        ET._namespaces(e)
    print_res(f"ns unique: {unique_ratio:0.2f}", timeit.repeat(run_namespaces, number=1000))


def ns_deep_cardinality(unique_ratio, el_cnt=1000):
    root = Element("{one}root")
    e = root
    for i in range(el_cnt):
        if i < el_cnt * (1 - unique_ratio):
            e1 = Element(f"{{nsns1}}tagtag1")
        else:
            e1 = Element(f"{{nsns1{i}}}tagtag{i}")
        e.append(e1)
        e = e1

    def run_namespaces():
        ET._namespaces(root)
    print_res(f"deep unique: {unique_ratio:0.2f}", timeit.repeat(run_namespaces, number=1000))


if __name__ == "__main__":
    # no namespaces
    low_el_cnt_low_cardinality()
    high_el_cnt_high_cardinality()

    # namespaces
    ns_high_el_cnt_cardinality(0.05)
    ns_high_el_cnt_cardinality(0.10)
    ns_high_el_cnt_cardinality(1/6)
    ns_high_el_cnt_cardinality(1/3)
    ns_high_el_cnt_cardinality(0.5)
    ns_high_el_cnt_cardinality(2/3)
    ns_high_el_cnt_cardinality(5/6)
    ns_high_el_cnt_cardinality(1)

    # deeply nested
    ns_deep_cardinality(.5)
    ns_deep_cardinality(5/6)
    ns_deep_cardinality(1)

and some results using this PR:

no ns, low el cnt, unique low [0.025025, 0.022563, 0.022072, 0.022136, 0.022028]
no ns, high el cnt, unique high [1.910636, 1.912451, 1.908962, 1.913008, 1.915974]
ns unique: 0.05 [0.132233, 0.131049, 0.130947, 0.131039, 0.131328]
ns unique: 0.10 [0.153051, 0.152679, 0.152685, 0.152955, 0.155044]
ns unique: 0.17 [0.180587, 0.179584, 0.180579, 0.181862, 0.181909]
ns unique: 0.33 [0.258851, 0.260982, 0.255448, 0.259111, 0.260048]
ns unique: 0.50 [0.324508, 0.326558, 0.322348, 0.323117, 0.323377]
ns unique: 0.67 [0.395765, 0.395696, 0.396616, 0.393595, 0.395383]
ns unique: 0.83 [0.470373, 0.468449, 0.465425, 0.465609, 0.466742]
ns unique: 1.00 [0.542408, 0.540645, 0.535599, 0.536222, 0.536216]
deep unique: 0.50 [0.328341, 0.330177, 0.330926, 0.331375, 0.331178]
deep unique: 0.83 [0.477198, 0.473172, 0.477037, 0.477402, 0.476627]
deep unique: 1.00 [0.546563, 0.546853, 0.545765, 0.547028, 0.547592]

main:

no ns, low el cnt, unique low [0.033431, 0.025402, 0.023029, 0.022654, 0.022640]
no ns, high el cnt, unique high [2.134641, 2.137859, 2.134739, 2.137339, 2.137709]
ns unique: 0.05 [0.156102, 0.155261, 0.156591, 0.155523, 0.155656]
ns unique: 0.10 [0.175433, 0.175223, 0.175160, 0.175437, 0.175383]
ns unique: 0.17 [0.202930, 0.202039, 0.202463, 0.202432, 0.202182]
ns unique: 0.33 [0.269995, 0.269267, 0.269315, 0.271844, 0.269823]
ns unique: 0.50 [0.334945, 0.334450, 0.335319, 0.334948, 0.335212]
ns unique: 0.67 [0.406310, 0.408849, 0.407256, 0.405006, 0.407780]
ns unique: 0.83 [0.476466, 0.473063, 0.478798, 0.480725, 0.480755]
ns unique: 1.00 [0.547285, 0.544387, 0.538366, 0.544754, 0.545708]
deep unique: 0.50 [0.340939, 0.339714, 0.337369, 0.341901, 0.340457]
deep unique: 0.83 [0.475656, 0.478056, 0.482441, 0.478939, 0.484476]
deep unique: 1.00 [0.543982, 0.547592, 0.547508, 0.544792, 0.545608]

Issue: Refactor xml.etree.ElementTree._namespaces #120771

Lib/xml/etree/ElementTree.py

danifus · 2024-09-10T00:56:44Z

@serhiy-storchaka are you able to review this please?

For some more context, this refactor is to replace the add_qname closure in xml.etree.ElementTree._namespaces with a generator that iterates over the qnames. I hope to simplify and deduplicate the logic around processing the qnames so that fixing the bugs / adding the features listed in #120771 is clearer (I have a branch in my fork that fixes all the mentioned issues and plan to create PRs sequentially as they build on eachother).

I did a bit more profiling / experimentation and found that the key variable affecting performance was the number of namespaces used. The more unique namespaces, the worse the performance (beyond 85% unique namespaces this patch is slightly slower than main). The uniqueness of tag names didn't have much of an impact. This means that this patch should be faster in almost all scenarios - I think it would be rare to have 85% of qnames with a unique namespace!

pythongh-120771 Refactor xml.etree.ElementTree._namespaces

5708509

bedevere-app bot mentioned this pull request Jun 20, 2024

Refactor xml.etree.ElementTree._namespaces #120771

Open

bedevere-app bot added the awaiting review label Jun 20, 2024

Use .values to get the prefixes from _namespace_map

862ec38

nineteendo reviewed Jun 20, 2024

View reviewed changes

Lib/xml/etree/ElementTree.py Outdated Show resolved Hide resolved

Remove whitespace change

61722bb

danifus mentioned this pull request Jun 22, 2024

gh-118416: Fix possible etree issues when registering default namespaces #118417

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

gh-120771: Refactor xml.etree.ElementTree._namespaces #120772

gh-120771: Refactor xml.etree.ElementTree._namespaces #120772

Uh oh!

danifus commented Jun 20, 2024 •

edited by bedevere-app bot

Loading

Uh oh!

Uh oh!

danifus commented Sep 10, 2024

Uh oh!

Uh oh!

Uh oh!

gh-120771: Refactor xml.etree.ElementTree._namespaces #120772

Are you sure you want to change the base?

gh-120771: Refactor xml.etree.ElementTree._namespaces #120772

Uh oh!

Conversation

danifus commented Jun 20, 2024 • edited by bedevere-app bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

danifus commented Sep 10, 2024

Uh oh!

Uh oh!

danifus commented Jun 20, 2024 •

edited by bedevere-app bot

Loading