Skip to content

gh-120771: Refactor xml.etree.ElementTree._namespaces #120772

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

danifus
Copy link
Contributor

@danifus danifus commented Jun 20, 2024

The refactor also makes _namespaces slightly faster in most scenarios except when more than ~85% of tags are unique and qualified with a namespace. Above 85% it performs similarly to the current implementation. See below for the benchmark code.

import timeit
import xml.etree.ElementTree as ET

# Element = ET._Element_Py
Element = ET.Element

def print_res(name, results):
    results = ", ".join([f"{t:0.6f}" for t in results])
    print(f"{name} [{results}]")


def low_el_cnt_low_cardinality():
    e = Element("root")
    for i in range(10):
        e.append(Element(f"tagtag{i//2}"))

    def run_namespaces():
        ET._namespaces(e)
    print_res("no ns, low el cnt, unique low", timeit.repeat(run_namespaces, number=10000))


def high_el_cnt_high_cardinality():
    e = Element("root")
    for i in range(1000):
        e1 = Element(f"tagtag{i}")
        e.append(e1)

    def run_namespaces():
        ET._namespaces(e)
    print_res("no ns, high el cnt, unique high", timeit.repeat(run_namespaces, number=10000))


def ns_high_el_cnt_cardinality(unique_ratio, el_cnt=1000):
    e = Element("{one}root")
    for i in range(el_cnt):
        if i < el_cnt * (1 - unique_ratio):
            e1 = Element(f"{{nsns1}}tagtag1")
        else:
            e1 = Element(f"{{nsns1{i}}}tagtag{i}")
        e.append(e1)

    def run_namespaces():
        ET._namespaces(e)
    print_res(f"ns unique: {unique_ratio:0.2f}", timeit.repeat(run_namespaces, number=1000))


def ns_deep_cardinality(unique_ratio, el_cnt=1000):
    root = Element("{one}root")
    e = root
    for i in range(el_cnt):
        if i < el_cnt * (1 - unique_ratio):
            e1 = Element(f"{{nsns1}}tagtag1")
        else:
            e1 = Element(f"{{nsns1{i}}}tagtag{i}")
        e.append(e1)
        e = e1

    def run_namespaces():
        ET._namespaces(root)
    print_res(f"deep unique: {unique_ratio:0.2f}", timeit.repeat(run_namespaces, number=1000))


if __name__ == "__main__":
    # no namespaces
    low_el_cnt_low_cardinality()
    high_el_cnt_high_cardinality()

    # namespaces
    ns_high_el_cnt_cardinality(0.05)
    ns_high_el_cnt_cardinality(0.10)
    ns_high_el_cnt_cardinality(1/6)
    ns_high_el_cnt_cardinality(1/3)
    ns_high_el_cnt_cardinality(0.5)
    ns_high_el_cnt_cardinality(2/3)
    ns_high_el_cnt_cardinality(5/6)
    ns_high_el_cnt_cardinality(1)

    # deeply nested
    ns_deep_cardinality(.5)
    ns_deep_cardinality(5/6)
    ns_deep_cardinality(1)

and some results using this PR:

no ns, low el cnt, unique low [0.025025, 0.022563, 0.022072, 0.022136, 0.022028]
no ns, high el cnt, unique high [1.910636, 1.912451, 1.908962, 1.913008, 1.915974]
ns unique: 0.05 [0.132233, 0.131049, 0.130947, 0.131039, 0.131328]
ns unique: 0.10 [0.153051, 0.152679, 0.152685, 0.152955, 0.155044]
ns unique: 0.17 [0.180587, 0.179584, 0.180579, 0.181862, 0.181909]
ns unique: 0.33 [0.258851, 0.260982, 0.255448, 0.259111, 0.260048]
ns unique: 0.50 [0.324508, 0.326558, 0.322348, 0.323117, 0.323377]
ns unique: 0.67 [0.395765, 0.395696, 0.396616, 0.393595, 0.395383]
ns unique: 0.83 [0.470373, 0.468449, 0.465425, 0.465609, 0.466742]
ns unique: 1.00 [0.542408, 0.540645, 0.535599, 0.536222, 0.536216]
deep unique: 0.50 [0.328341, 0.330177, 0.330926, 0.331375, 0.331178]
deep unique: 0.83 [0.477198, 0.473172, 0.477037, 0.477402, 0.476627]
deep unique: 1.00 [0.546563, 0.546853, 0.545765, 0.547028, 0.547592]

main:

no ns, low el cnt, unique low [0.033431, 0.025402, 0.023029, 0.022654, 0.022640]
no ns, high el cnt, unique high [2.134641, 2.137859, 2.134739, 2.137339, 2.137709]
ns unique: 0.05 [0.156102, 0.155261, 0.156591, 0.155523, 0.155656]
ns unique: 0.10 [0.175433, 0.175223, 0.175160, 0.175437, 0.175383]
ns unique: 0.17 [0.202930, 0.202039, 0.202463, 0.202432, 0.202182]
ns unique: 0.33 [0.269995, 0.269267, 0.269315, 0.271844, 0.269823]
ns unique: 0.50 [0.334945, 0.334450, 0.335319, 0.334948, 0.335212]
ns unique: 0.67 [0.406310, 0.408849, 0.407256, 0.405006, 0.407780]
ns unique: 0.83 [0.476466, 0.473063, 0.478798, 0.480725, 0.480755]
ns unique: 1.00 [0.547285, 0.544387, 0.538366, 0.544754, 0.545708]
deep unique: 0.50 [0.340939, 0.339714, 0.337369, 0.341901, 0.340457]
deep unique: 0.83 [0.475656, 0.478056, 0.482441, 0.478939, 0.484476]
deep unique: 1.00 [0.543982, 0.547592, 0.547508, 0.544792, 0.545608]

@danifus
Copy link
Contributor Author

danifus commented Sep 10, 2024

@serhiy-storchaka are you able to review this please?

For some more context, this refactor is to replace the add_qname closure in xml.etree.ElementTree._namespaces with a generator that iterates over the qnames. I hope to simplify and deduplicate the logic around processing the qnames so that fixing the bugs / adding the features listed in #120771 is clearer (I have a branch in my fork that fixes all the mentioned issues and plan to create PRs sequentially as they build on eachother).

I did a bit more profiling / experimentation and found that the key variable affecting performance was the number of namespaces used. The more unique namespaces, the worse the performance (beyond 85% unique namespaces this patch is slightly slower than main). The uniqueness of tag names didn't have much of an impact. This means that this patch should be faster in almost all scenarios - I think it would be rare to have 85% of qnames with a unique namespace!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants