Update ceph-with-rook.md #11120

Open · suse-coder wants to merge 1 commit into main

Conversation

suse-coder (Contributor)

Add the nbd module that one has to enable, and single-node setup instructions

Pull Request

What? (description)

Why? (reasoning)

Acceptance

Please use the following checklist:

  • you linked an issue (if applicable)
  • you included tests (if applicable)
  • you ran conformance (make conformance)
  • you formatted your code (make fmt)
  • you linted your code (make lint)
  • you generated documentation (make docs)
  • you ran unit-tests (make unit-tests)

See make help for a description of the available targets.

add nbd module one has to enable and single node setup

Signed-off-by: suse-coder <[email protected]>
Comment on lines +121 to +135
### 1. Enable the Ceph Kernel Module

Talos includes the `nbd` kernel module, but it needs to be explicitly enabled.

**Create a patch file** (`patch.values.yaml`):

```yaml
machine:
  kernel:
    modules:
      - name: nbd
```

**Apply the kernel module patch**:

```shell
talosctl -n 192.168.178.79 patch mc --patch @./terraform/talos/patch/patch.yaml
```
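
After the patch is applied, one quick way to confirm the module is actually loaded is to read `/proc/modules` through `talosctl`; a sketch, reusing the example node IP from above:

```shell
# Check that the nbd module shows up among the node's loaded kernel modules
# (node IP taken from the example above).
talosctl -n 192.168.178.79 read /proc/modules | grep nbd
```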


The krbd kernel module works by default. Is there a reason to switch to rbd-nbd, and does the Rook Ceph cluster choose the nbd module over krbd if it's available?

suse-coder (Contributor, Author)

When I installed Ceph it reported an error that nbd was not enabled, so I enabled it. nbd is more modern and offers a much better way to snapshot volumes without ending up with inconsistent journals: https://engineering.salesforce.com/mapping-kubernetes-ceph-volumes-the-rbd-nbd-way-21f7c4161f04/
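
On the reviewer's question of whether Rook actually picks nbd: as far as I understand ceph-csi, krbd is used unless the RBD StorageClass explicitly opts in via the `mounter` parameter. A rough sketch (the StorageClass name and pool are made-up examples; the secret names are the usual Rook defaults):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-nbd                    # hypothetical name
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool                     # assumption: replace with your RBD pool
  mounter: rbd-nbd                      # opt in to rbd-nbd instead of the default krbd mapping
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
```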


That post seems fairly old, since it references Ceph Jewel and CentOS 7. Although rbd-nbd could provide earlier access to newer features, I believe there might be a performance impact versus using the krbd module. I've provisioned multiple Talos servers (1.9.x - 1.10.x) with Rook Ceph and did not need any additional kernel modules loaded. What versions of Talos and Rook are you using? Maybe I can spin up a quick test to confirm this.

suse-coder (Contributor, Author)

Talos: v1.10.3
rook-ceph (operator): 1.17.2
rook-ceph-cluster: 1.17.2

suse-coder (Contributor, Author) · May 30, 2025

values.yaml:

storage:
  useAllNodes: false
  useAllDevices: true
  config:
    allowMultiplePerNode: true
  nodes:
    - name: talos-mec-lba


placement:
  all:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - talos-mec-lba
    tolerations:
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"

cephClusterSpec:
  mon:
    count: 1
    allowMultiplePerNode: true
  mgr:
    count: 1
    allowMultiplePerNode: true
  mds:
    count: 0
    allowMultiplePerNode: true
  rgw:
    count: 0
    allowMultiplePerNode: true
  crashCollector:
    disable: true
  dashboard:
    enabled: true
  pool:
    replicated:
      size: 1
      minSize: 1

cephCSI:
  csiCephFS:
    provisionerReplicas: 1
    pluginReplicas: 1
    placement:
      podAntiAffinity: null
  csiRBD:
    provisionerReplicas: 1
    pluginReplicas: 1
    placement:
      podAntiAffinity: null
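
For context, values like these are normally passed to the rook-ceph-cluster Helm chart; a sketch of how that install might look, assuming the standard Rook chart repository (not something stated in this thread):

```shell
# Assumed standard Rook Helm repo: operator chart first, then the cluster chart
# with the single-node values shown above.
helm repo add rook-release https://charts.rook.io/release
helm upgrade --install rook-ceph rook-release/rook-ceph \
  --namespace rook-ceph --create-namespace
helm upgrade --install rook-ceph-cluster rook-release/rook-ceph-cluster \
  --namespace rook-ceph -f values.yaml
```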


snapshotclass.yaml:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-ceph-rbd-snapclass
  annotations:
    k10.kasten.io/is-snapshot-class: "true"
driver: rook-ceph.rbd.csi.ceph.com
deletionPolicy: Delete
parameters:
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
  clusterID: rook-ceph
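
Once that class exists, a snapshot request referencing it would look roughly like this (the PVC and snapshot names are hypothetical):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-pvc-snapshot                 # hypothetical name
  namespace: default
spec:
  volumeSnapshotClassName: csi-ceph-rbd-snapclass
  source:
    persistentVolumeClaimName: my-pvc   # hypothetical PVC provisioned by the RBD StorageClass
```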

suse-coder (Contributor, Author)

Where I am stuck is this:

2025-06-02 21:55:00.685887 E | clusterdisruption-controller: failed to get OSD status: failed to get osd metadata: exit status 1
2025-06-02 21:55:13.396848 I | op-mon: mons running: [a b]
2025-06-02 21:55:15.791891 E | clusterdisruption-controller: failed to get OSD status: failed to get osd metadata: exit status 1

I have a /dev/sdb attached to one worker node.

Do I need to wipe that, or do something else?

# ------------------------------------------------------------------------------
cephClusterSpec:
  mon:
    count: 1
    allowMultiplePerNode: true
  dashboard:
    enabled: true
    ssl: false                    # easier for a lab; switch to true in prod

  # ---------- Storage (OSDs) ----------
  storage:
    useAllNodes: false            # we will list the single node explicitly
    useAllDevices: false          # don’t blindly grab every block dev
    nodes:
      - name: talos-mec-lba       # MUST match `kubectl get nodes -o wide`
        devices:
          - name: /dev/sdb


Yes, you need to wipe the disks. You can see that the disk probably still belongs to a previous ceph cluster by looking at the osd-prepare logs.

Follow these instructions to wipe them: https://rook.io/docs/rook/latest-release/Getting-Started/ceph-teardown/#zapping-devices
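
For reference, the zapping steps on that page boil down to roughly the following; on Talos (no SSH) they would have to be run from a privileged pod or debug container on the node, and the device is the /dev/sdb mentioned above:

```shell
DISK="/dev/sdb"
# Wipe the GPT/MBR partition tables left over from the previous Ceph cluster.
sgdisk --zap-all "$DISK"
# Zero the start of the disk so leftover Ceph/LVM metadata is no longer detected.
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
# Ask the kernel to re-read the (now empty) partition table.
partprobe "$DISK"
```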

suse-coder (Contributor, Author)

I am stuck even before that phase, in "configuring MONs":

2552906598] state: up:replay) since 675.073
debug 2025-06-03T14:10:53.096+0000 7f1c2880d640  0 cephx server client.admin:  unexpected key: req.key=8973fbb8653eb28e expected_key=1c5ab51d2f84e6bc
debug 2025-06-03T14:10:53.356+0000 7f1c2880d640  0 cephx server client.admin:  unexpected key: req.key=4a4cd93568452069 expected_key=93f905b60d3c2bbf
debug 2025-06-03T14:10:54.764+0000 7f1c2a9cf640  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
debug 2025-06-03T14:10:54.768+0000 7f1c2a9cf640  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
audit 2025-06-03T14:10:54.771317+0000 mon.a (mon.0) 144 : audit [DBG] from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
audit 2025-06-03T14:10:54.771543+0000 mon.a (mon.0) 145 : audit [DBG] from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished


Please read the documentation link provided above (scroll to the top); you need to clean up the /var/lib/rook directory before creating a new cluster.

suse-coder (Contributor, Author)

Thanks. This was it:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: disk-clean
spec:
  restartPolicy: Never
  nodeName: talos-mec-lba
  volumes:
    - name: rook-data-dir
      hostPath:
        path: /var/lib/rook
  containers:
    - name: disk-clean
      image: busybox
      securityContext:
        privileged: true
      volumeMounts:
        - name: rook-data-dir
          mountPath: /node/rook-data
      command: ["/bin/sh", "-c", "rm -rf /node/rook-data/*"]
EOF

Maybe the docs should mention more explicitly that when no new OSDs are created and the cluster is stuck in the "configuring mons" phase, one needs to do this cleanup.

@@ -100,6 +100,188 @@ ceph-bucket rook-ceph.ceph.rook.io/bucket Delete Immediate
ceph-filesystem rook-ceph.cephfs.csi.ceph.com Delete Immediate true 77m
```

## 🔧 Single Node Setup Instructions
Member

We don't usually support/modify old Talos documentation; you should probably take this to website/content/v1.11.

Member

Also, this is a very niche use case (a single-node setup) which shouldn't be used in general. I would probably move it to a separate document, likely in the advanced/ folder, and link to it from here.

suse-coder (Contributor, Author)

What would also be great: when one copies a section from the docs, it shouldn't copy the CLI output along with the commands (as it currently does). Or don't include the output in the copyable block at all.
