Thomas Goirand
2024-11-28 21:40:01 UTC
Package: release.debian.org
Severity: normal
Tags: bookworm
User: ***@packages.debian.org
Usertags: pu
X-Debbugs-Cc: live-***@packages.debian.org
Control: affects -1 + src:live-boot
Dear stable release team,
[ Reason ]
I'd like to update live-boot to fix its PXE-booting process.
Currently in Bookworm, when live-boot reaches the phase where it
should do DHCP and fetch its squashfs over the network, it only
attempts DHCP on the first NIC that it detects with a link up.
[ Impact ]
tl;dr: my patch makes live-boot try DHCP on each and every NIC
that has a link. Before the patch, live-boot only tries the first
NIC with a link. So if 2 NICs are connected, but only one has DHCP,
it fails if the one without DHCP is discovered first.
In more detail, here is our production use case where live-boot
currently fails in Bookworm.
We use a live Debian distribution (custom live image) that is booted
over PXE to install Debian. DHCP / PXE is only available on the
first NIC (for us, a 1Gbits/s NIC, most of the time). The other 2
NICs (they are 25Gbits/s) are to be used in production. We do things
this way for security, to segregate boot and production networking.
Unfortunately, if live-boot "sees" the 2x 25Gbits/s NICs first, before
the 1Gbits/s one, it will attempt to do DHCP on them (well, in fact, on
one of them), as they are connected (i.e. they have a "link up"). Since
they get no DHCP offer, live-boot fails. It never attempts a DHCP
request on the 1Gbits/s NIC.
We already use the patch I'm proposing in production. With it,
live-boot attempts DHCP on each and every NIC that has a link, one by
one. So in the above scenario, it tries the 2x 25Gbits/s NICs (since
they are discovered first), and when DHCP fails on them, live-boot
continues and tries the 1Gbits/s NIC as well.
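To illustrate the behaviour described above, here is a minimal sketch
of the "try every linked NIC in turn" logic. It is not the attached
patch: the SYSFS_NET and DHCP_CMD variables and both function names are
made up for this example, and klibc's ipconfig with a 15 second timeout
is only assumed as the DHCP client.

```shell
#!/bin/sh
# Illustration only, not the code from the attached debdiff.
# SYSFS_NET and DHCP_CMD are hypothetical knobs for this sketch.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"
DHCP_CMD="${DHCP_CMD:-ipconfig -t 15}"

# Print the name of every interface that reports a link ("carrier" is 1),
# in the order the shell glob returns them. The loopback device is skipped.
linked_nics() {
    for dev in "$SYSFS_NET"/*; do
        [ -e "$dev/carrier" ] || continue
        name=${dev##*/}
        [ "$name" = "lo" ] && continue
        # Reading carrier can fail on an interface that is down.
        carrier=$(cat "$dev/carrier" 2>/dev/null) || continue
        if [ "$carrier" = "1" ]; then
            echo "$name"
        fi
    done
}

# Run the DHCP client on each linked NIC, one by one. Print the first
# NIC that obtains a lease and return 0; return 1 if none of them does.
try_dhcp_all() {
    for nic in $(linked_nics); do
        echo "Trying DHCP on $nic" >&2
        if $DHCP_CMD "$nic" >/dev/null 2>&1; then
            echo "$nic"
            return 0
        fi
    done
    return 1
}
```

Calling try_dhcp_all on the servers described above would first try
the two 25Gbits/s NICs, get no lease, and then move on to the 1Gbits/s
NIC instead of giving up.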
[ Tests ]
We've been patching our live system's ramdisk with the modified
components/9990-select-eth-device.sh script (uncompress the ramdisk,
replace the script with the new version, and recompress). With this
change, our servers boot up, with live-boot trying all NICs.
Doing this is painful, and has to be done manually after live-boot
builds its initrd. One cannot simply use a modified live-boot package,
as live-boot is downloaded by live-build (from the URL provided in
the config file) when the live image is created. So it would be very
useful to have this fixed in stable, rather than telling our users
how to work around it.
[ Risks ]
The code is trivial and easy to understand (a simple bash script).
[ Checklist ]
[x] *all* changes are documented in the d/changelog
[x] I reviewed all changes and I approve them
[x] attach debdiff against the package in (old)stable
[x] the issue is verified as fixed in unstable
[ Changes ]
See attached diff file.