Skip to content

deps(cilium): upgrade Cilium to v1.19.2#2148

Open
nddq wants to merge 7 commits intomainfrom
cilium/v1.19-rebase
Open

deps(cilium): upgrade Cilium to v1.19.2#2148
nddq wants to merge 7 commits intomainfrom
cilium/v1.19-rebase

Conversation

@nddq
Copy link
Copy Markdown
Member

@nddq nddq commented Mar 30, 2026

Description

Cilium v1.19.2 pins cilium/ebpf at a post-v0.20.0 commit that removed CollectionSpec.RewriteConstants(). Inspektor-gadget v0.27.0 calls this method in every package Retina imports, causing a fatal compile error. Upgrading IG is not an option — v0.42.0 removed the built-in trace/dns and trace/tcpretrans gadgets entirely (inspektor-gadget/inspektor-gadget#4573), and the earliest IG version compatible with cilium/ebpf v0.20.x is v0.47.0+ where these packages are long gone. The DNS and tcpretrans plugins are rewritten using native cilium/ebpf.

Key dependency changes:

  • github.com/cilium/cilium v1.18.0-pre.1 → v1.19.2
  • github.com/cilium/ebpf v0.18.0 → v0.20.1
  • k8s.io/* v0.32.x → v0.35.3
  • sigs.k8s.io/controller-runtime v0.20.4 → v0.23.3
  • go 1.24.11 → 1.25.0
  • golangci-lint v1 → v2

Changes by area:

  • Logging: Add zap-backed slog.Handler and logr.Logger bridges (pkg/log/zap.go). Migrate all DI-injected loggers from logrus.FieldLogger to *slog.Logger across agent, operator, and Hubble components
  • DNS plugin: Rewrite with native cilium/ebpf (bpf2go). Socket filter captures DNS queries/responses with qtype extracted in BPF, Go-side parses names/addresses with gopacket. Uses bpf_skb_load_bytes for zero-conversion IP extraction. Same metrics as before (dns_request_count, dns_response_count)
  • tcpretrans plugin: Rewrite with native cilium/ebpf (bpf2go). Uses tracepoint/tcp/tcp_retransmit_skb (stable kernel API) with BPF_CORE_READ for verifier-safe field access. TCP flags read from tcp_skb_cb. Supports both IPv4 and IPv6
  • Hubble control plane: Adapt to v1.19 Hive cell changes — hubblecell.Corehubblecell.Cell, force metricscell.Server instantiation. DI cell broken into sub-cells (infrastructureCell, ipcacheCell, watcherCell, stubsCell) for fault isolation. Uses Cilium's pkg/datapath/fake/types and pkg/testutils/identity instead of hand-rolled stubs
  • Operator: Migrate to slog, adapt CRD helper APIs, replace local resource constructors with imports from cilium/operator/k8s (behind _linux build tag), add in-memory kvstore client for identity GC

Related Issue

Fixes #1788

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...).
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Testing Completed

E2E run: https://github.com/microsoft/retina/actions/runs/23731276655

Additionally verified on manually provisioned AKS clusters:

  • Standard mode (advanced metrics): adv_dns_request_count / adv_dns_response_count with full pod-level labels (ip, namespace, podname, query, query_type, workload_kind, workload_name), forward_count/forward_bytes, tcp_state, drop_count/drop_bytes, adv_node_apiserver_latency, adv_tcpretrans_count
  • Hubble mode: hubble_dns_queries_total, hubble_flows_processed_total, hubble_drop_total on port 9965
  • TCP retransmissions: Verified with forced SYN retransmissions to non-routable IP — adv_tcpretrans_count correctly attributed to source pod
  • Multi-arch: Agent running on both amd64 and arm64 nodes

Additional Notes

  • The inspektor-gadget dependency is fully removed
  • DNS and tcpretrans BPF programs are pre-compiled via bpf2go and embedded in the binary — no runtime compilation needed. Compile() and Generate() are no-ops retained for plugin manager lifecycle compatibility

@nddq nddq requested a review from a team as a code owner March 30, 2026 00:43
@nddq nddq requested review from jimassa and mainred March 30, 2026 00:43
@nddq nddq force-pushed the cilium/v1.19-rebase branch 26 times, most recently from 5aa10cd to 0df8cbd Compare March 30, 2026 22:35
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
nddq added 6 commits March 30, 2026 23:47
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
@nddq nddq force-pushed the cilium/v1.19-rebase branch from 0df8cbd to f1bcf41 Compare March 30, 2026 23:48
Copy link
Copy Markdown
Member

@SRodi SRodi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for driving this @nddq I agree the IG incompatibility forces us to move DNS/tcpretrans off inspektor-gadget as part of getting to Cilium v1.19.2.

One suggestion to make this easier to review/debug (without overcomplicating rollout): could we split this into two PRs, but keep the plugin names the same and do a straight replacement (no dual-plugin naming / feature flags)?

Proposed split:

  1. PR: Replace the existing dns + tcpretrans plugins with the ebpf-native implementations and remove the inspektor-gadget dependency (keep external behavior/plugin names the same).
  2. PR: Upgrade github.com/cilium/cilium to v1.19.2 (and any required Go/tooling + Hubble/operator adaptation changes).

Rationale: this isolates the “plugin implementation change” from the broader Cilium/control-plane dependency churn, making review and failure attribution simpler. Rollback is still straightforward at the Retina release level if needed (we can do a release straight after the plugins change), so I don’t think we need to introduce extra flags or temporary plugin naming.

@nddq
Copy link
Copy Markdown
Member Author

nddq commented Mar 31, 2026

@SRodi Sounds good! I'll make that change

@nddq
Copy link
Copy Markdown
Member Author

nddq commented Apr 1, 2026

@SRodi let's try and get this in first #2049 to give us the ebpf testing framework for the other two plugins

github-merge-queue bot pushed a commit that referenced this pull request Apr 9, 2026
# Description

**PR 2 of 2** for removing inspektor-gadget from retina plugins (split
from #2148).

Replaces the IG-based DNS tracer with a native eBPF socket filter and
adds comprehensive kernel-level and Go unit tests.

- **BPF program** (`dns.c`): Socket filter on AF_PACKET captures DNS
queries/responses (UDP + TCP, IPv4 + IPv6). Uses `bpf_skb_load_bytes`
helpers compatible with `BPF_PROG_TEST_RUN`. Query type extracted in BPF
via QNAME walking.
- **Per-interface dedup filter**: Socket binds to all interfaces
(`ifindex=0`) with a BPF-side filter that only captures `PACKET_HOST` on
non-default interfaces (veths) while capturing both directions on the
default interface (eth0). This avoids double-counting while preserving
visibility into same-node pod-to-CoreDNS and NodeLocal DNS traffic.
- **Go plugin** (`dns_linux.go`): Perf buffer event loop with gopacket
DNS parsing. Populates `dns_config` BPF map with default interface index
at Init. No runtime compilation — pre-compiled via `bpf2go@v0.18.0`.
- **Test infrastructure** (`ebpftest/packet.go`):
`TCPPacketOpts.Payload`, IPv6 `BuildUDPPacket`, DNS packet builders
(`BuildDNSQueryPacket`, `BuildDNSResponsePacket`,
`BuildDNSTCPQueryPacket`).
- **eBPF + unit tests**: 28 tests covering IPv4/IPv6, TCP/UDP, mDNS,
edge cases, counter deltas, mock enricher, and `parseDNSPayload`.

## Related Issue

Partial fix for #1788
Split from #2148

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/Contributing/overview).
- [x] I signed and signed-off the commits (`git commit -S -s ...`).
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [ ] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Testing Completed

All 28 tests pass (14 eBPF + 14 unit). Validated on AKS cluster
(`retinaTest-v119`, 3-node, k8s v1.33.6):

| Scenario | Result |
|---|---|
| Pod → CoreDNS same node | Captured, count=1 per query |
| Pod → CoreDNS different node | Captured, count=1 per query |
| NodeLocal DNS (169.254.20.10) | Both hops captured (pod→cache,
cache→CoreDNS) — expected |
| Multi-NIC | Not tested (single-NIC AKS nodes) |
| Double-counting | None detected |

Hubble and standard mode DNS metrics flowing with correct pod-level
labels.

## Additional Notes

- After both plugin PRs merge, inspektor-gadget can be fully removed in
the Cilium v1.19 upgrade PR
- The old IG tracer used `ifindex=0` with no dedup, which caused
double-counting on multi-interface paths. The new BPF dedup filter
resolves this while maintaining same-node and NodeLocal DNS visibility.

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more
information on how to contribute to this project.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Upgrade cilium to stable 1.19

2 participants