Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COS-3126: overlay.d: replace 25rhcos-azure-udev #1776

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

marmijo
Copy link
Contributor

@marmijo marmijo commented Mar 19, 2025

The current 25rhcos-azure-udev overlay has been obsolete since we moved to EL9. Let's drop it and replace it with another overlay with the same name.

The new 25rhcos-azure-udev overlay includes udev rules that support Azure Managed NVMe disks. These udev rules are shipped as part of the azure-vm-utils package[1], but that package wont be included until EL10[2]. We can drop this overlay when we move to EL10, provided the package is included at that time.

[1] https://github.com/Azure/azure-vm-utils/blob/9c596916b6774f24420dac0ee7a72a6c9ddb5060/udev/80-azure-disk.rules
[2] https://issues.redhat.com/browse/RHEL-73904

Since el9, the WALinuxAgent-udev package installs udev rules
directly into the initramfs. As a result, this override is
now obsolete and can be removed.
@openshift-ci openshift-ci bot requested review from aaradhak and yasminvalim March 19, 2025 20:17
Copy link
Contributor

openshift-ci bot commented Mar 19, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marmijo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 19, 2025
@marmijo
Copy link
Contributor Author

marmijo commented Mar 20, 2025

/retest

@@ -8,6 +8,9 @@ remove-from-packages:
- - zram-generator
- "/usr/lib/systemd/zram-generator.conf"

ostree-layers:
- overlay/25rhcos-azure-udev
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the same comment from below here would be nice:

Suggested change
- overlay/25rhcos-azure-udev
# The azure-vm-utils package provides these udev rules[1], but it wont be added
# until EL10 [1]. This can be dropped when moving to EL10, provided the
# package is included at that time.
- overlay/25rhcos-azure-udev

ATTRS{nsid}=="?*", ENV{AZURE_DISK_TYPE}="data", PROGRAM="/bin/sh -ec 'echo $$((%s{nsid}-2))'", ENV{AZURE_DISK_LUN}="$result"

LABEL="azure_disk_nvme_id"
ATTRS{nsid}=="?*", IMPORT{program}="/usr/sbin/azure-nvme-id --udev"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a diff from upstream (fedora RPM) here:

--- /usr/lib/udev/rules.d/80-azure-disk.rules   1970-01-01 00:00:00.000000000 +0000
+++ /var/mnt/workdir/80-azure-disk.rules        2025-03-20 16:12:18.020106392 +0000
@@ -32,7 +32,7 @@
 ATTRS{nsid}=="?*", ENV{AZURE_DISK_TYPE}="data", PROGRAM="/bin/sh -ec 'echo $$((%s{nsid}-2))'", ENV{AZURE_DISK_LUN}="$result"
 
 LABEL="azure_disk_nvme_id"
-ATTRS{nsid}=="?*", IMPORT{program}="/usr/bin/azure-nvme-id --udev"
+ATTRS{nsid}=="?*", IMPORT{program}="/usr/sbin/azure-nvme-id --udev"
 
 LABEL="azure_disk_symlink"
 # systemd v254 ships an updated 60-persistent-storage.rules that would allow

Though.. this really highlights a problem here that I didn't understand before. Previously the udev rules we provided were just rules and didn't require external software.

Here these rules are running /usr/sbin/azure-nvme-id, but we aren't shipping that here (because we aren't shipping an RPM).

I think this may be a blocker for us for el9.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting because these rules are directly copied from the azure-vm-utils github repo, so there's a diff between that file and the udev rules shipped in the RPM.

as for the azure-nvme-id binary: The tool's description says:

azure-nvme-id provides identification metadata in the response to Identify Namespace command
for some models of NVMe devices. This is found in vendor-specific (vs) field which contains various
identification details with a comma-separated, key=value format.

I'm not sure what, within Azure, would rely on this vendor specific data, but there could be unexpected behavior if the binary is missing. However, the original bug was confirmed to be resolved by adding these udev rules without installing the whole RPM, and using them in an RHCOS Azure instance seems to work as expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting because these rules are directly copied from the azure-vm-utils github repo, so there's a diff between that file and the udev rules shipped in the RPM.

yeah. In rawhide there was a usr/bin usr/sbin migration thing and that probably made its way into f41/f42 RPMs too so that probably is what this is about. It probably shouldn't matter too much I don't guess.

I'm not sure what, within Azure, would rely on this vendor specific data, but there could be unexpected behavior if the binary is missing. However, the original bug was confirmed to be resolved by adding these udev rules without installing the whole RPM, and using them in an RHCOS Azure instance seems to work as expected.

So I think you're telling me the rules we need aren't this one that runs this binary?

I guess it's OK then, but I'd almost rather remove the lines that rely on the binary (or comment them out and add a comment why).

Copy link
Contributor Author

@marmijo marmijo Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the other rules still work and set everything up correctly even if the binary is missing.

Suggested change
ATTRS{nsid}=="?*", IMPORT{program}="/usr/sbin/azure-nvme-id --udev"
#ATTRS{nsid}=="?*", IMPORT{program}="/usr/sbin/azure-nvme-id --udev"

I commented out the line, but left LABEL="azure_disk_nvme_id" intact so the other rules with GOTO="azure_disk_nvme_id" continue to function properly. The behavior was the same with or without that line commented out:

  • AZURE_DISK_LUN is set to a value on NVMe data disks
  • data and os NVMe disks show AZURE_DISK_TYPE=data and AZURE_DISK_TYPE=os respectively
  • NVMe disks and their partitions show up in /dev/disk/azure and /dev/disk/azure/data/by-lun as expected

This means the udev rule that calls the binary must be failing silently and isn't necessary for this specific issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to have it commented out here in the code with a comment that says why we commented it out and then we should be able to move forward I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

Add an overlay to install udev rules that support certain VM types that
present managed disks as NVMe devices instead of traditional SCSI.

These rules are provded by the azure-vm-utils package which will be
included in EL10[1], this overlay can be dropped at that time.

[1] https://issues.redhat.com/browse/RHEL-73904
Copy link
Contributor

openshift-ci bot commented Mar 22, 2025

@marmijo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws 5226f5f link false /test e2e-aws
ci/prow/okd-scos-e2e-aws-ovn 5226f5f link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@travier
Copy link
Member

travier commented Mar 24, 2025

That looks correct, but that seems risky, as we don't know what would need those attributes and how important they are.

@dustymabe
Copy link
Member

That looks correct, but that seems risky, as we don't know what would need those attributes and how important they are.

Yeah. I agree it seems risky. But really this is our only option if we want this now.

We should probably name it as a limitation and then move forward based on the feedback.

@travier travier changed the title [COS-3126] overlay.d: replace 25rhcos-azure-udev COS-3126: overlay.d: replace 25rhcos-azure-udev Mar 26, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 26, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Mar 26, 2025

@marmijo: This pull request references COS-3126 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

The current 25rhcos-azure-udev overlay has been obsolete since we moved to EL9. Let's drop it and replace it with another overlay with the same name.

The new 25rhcos-azure-udev overlay includes udev rules that support Azure Managed NVMe disks. These udev rules are shipped as part of the azure-vm-utils package[1], but that package wont be included until EL10[2]. We can drop this overlay when we move to EL10, provided the package is included at that time.

[1] https://github.com/Azure/azure-vm-utils/blob/9c596916b6774f24420dac0ee7a72a6c9ddb5060/udev/80-azure-disk.rules
[2] https://issues.redhat.com/browse/RHEL-73904

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants