Update ceph-with-rook.md #11120
Conversation
Add the `nbd` kernel module that has to be enabled, plus single-node setup instructions. Signed-off-by: suse-coder <[email protected]>
### 1. Enable the Ceph Kernel Module

Talos includes the `nbd` kernel module, but it needs to be explicitly enabled.

**Create a patch file** (`patch.values.yaml`):

```yaml
machine:
  kernel:
    modules:
      - name: nbd
```

**Apply the kernel module patch**:

```shell
talosctl -n 192.168.178.79 patch mc --patch @./terraform/talos/patch/patch.yaml
```
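(Editorial sketch, not part of the original PR:) after applying the patch, one way to confirm the module was actually loaded is to read the node's module list with `talosctl`; the node IP is the example one from above.

```shell
# verify that the nbd module is loaded on the node (example node IP from above)
talosctl -n 192.168.178.79 read /proc/modules | grep nbd
```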
The krbd kernel module works by default. Is there a reason to change to rbd-nbd, and does the Rook Ceph cluster choose the nbd module over krbd if it's available?
When I installed Ceph, it reported an error that nbd was not enabled, so I enabled it. nbd is more modern and offers a much better way to snapshot volumes without ending up with inconsistent journals: https://engineering.salesforce.com/mapping-kubernetes-ceph-volumes-the-rbd-nbd-way-21f7c4161f04/
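(Editorial sketch, not from the thread:) if the goal is to actually have the CSI driver map images via rbd-nbd instead of krbd, ceph-csi exposes this through the `mounter` parameter of the RBD StorageClass. The class name and pool below are assumptions for illustration, not values from this PR.

```yaml
# sketch: tell the Ceph CSI RBD driver to map images via rbd-nbd instead of krbd
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-nbd                  # hypothetical name
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: ceph-blockpool                  # assumed pool name
  imageFormat: "2"
  imageFeatures: layering
  mounter: rbd-nbd                      # krbd is used when this is omitted
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
allowVolumeExpansion: true
```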
That post seems fairly old, since it references Ceph Jewel and CentOS 7. Although rbd-nbd could provide earlier access to newer features, I believe there may be a performance impact compared to the krbd module. I've provisioned multiple Talos servers (1.9.x - 1.10.x) with Rook Ceph and did not need to load any additional kernel modules. What versions of Talos and Rook are you using? Maybe I can spin up a quick test to confirm this.
Talos: v1.10.3
ceph-rook (operator): 1.17.2
rook-ceph-cluster: 1.17.2
values.yaml:

```yaml
storage:
  useAllNodes: false
  useAllDevices: true
  config:
    allowMultiplePerNode: true
  nodes:
    - name: talos-mec-lba
placement:
  all:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - talos-mec-lba
    tolerations:
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"
cephClusterSpec:
  mon:
    count: 1
    allowMultiplePerNode: true
  mgr:
    count: 1
    allowMultiplePerNode: true
  mds:
    count: 0
    allowMultiplePerNode: true
  rgw:
    count: 0
    allowMultiplePerNode: true
  crashCollector:
    disable: true
  dashboard:
    enabled: true
pool:
  replicated:
    size: 1
    minSize: 1
cephCSI:
  csiCephFS:
    provisionerReplicas: 1
    pluginReplicas: 1
    placement:
      podAntiAffinity: null
  csiRBD:
    provisionerReplicas: 1
    pluginReplicas: 1
    placement:
      podAntiAffinity: null
```
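(Editorial sketch, not from the thread:) these values would presumably be applied to the `rook-ceph-cluster` Helm chart along these lines.

```shell
# install/upgrade the cluster chart with the values shown above
helm repo add rook-release https://charts.rook.io/release
helm upgrade --install rook-ceph-cluster rook-release/rook-ceph-cluster \
  --namespace rook-ceph -f values.yaml
```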
snapshotclass.yaml:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-ceph-rbd-snapclass
  annotations:
    k10.kasten.io/is-snapshot-class: "true"
driver: rook-ceph.rbd.csi.ceph.com
deletionPolicy: Delete
parameters:
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: rook-ceph
  clusterID: rook-ceph
```
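(Editorial sketch:) a snapshot taken with this class would look roughly like the following; the PVC name `my-data-pvc` and namespace are hypothetical.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-data-snap                         # hypothetical name
  namespace: default
spec:
  volumeSnapshotClassName: csi-ceph-rbd-snapclass
  source:
    persistentVolumeClaimName: my-data-pvc   # hypothetical PVC backed by the RBD StorageClass
```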
Where I am stuck is:

```
2025-06-02 21:55:00.685887 E | clusterdisruption-controller: failed to get OSD status: failed to get osd metadata: exit status 1
2025-06-02 21:55:13.396848 I | op-mon: mons running: [a b]
2025-06-02 21:55:15.791891 E | clusterdisruption-controller: failed to get OSD status: failed to get osd metadata: exit status 1
```

I have a /dev/sdb attached to one worker node.
Do I need to wipe it, or do something else?
```yaml
# ------------------------------------------------------------------------------
cephClusterSpec:
  mon:
    count: 1
    allowMultiplePerNode: true
  dashboard:
    enabled: true
    ssl: false               # easier for a lab; switch to true in prod
  # ---------- Storage (OSDs) ----------
  storage:
    useAllNodes: false       # we will list the single node explicitly
    useAllDevices: false     # don't blindly grab every block dev
    nodes:
      - name: talos-mec-lba  # MUST match `kubectl get nodes -o wide`
        devices:
          - name: /dev/sdb
```
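(Editorial sketch:) to see why no OSDs are being created, the osd-prepare job logs are usually the first place to look; this assumes the default `rook-ceph` namespace.

```shell
# list the osd-prepare jobs and read their logs for the node in question
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare --tail=100
```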
Yes, you need to wipe the disks. You can see that the disk probably still belongs to a previous Ceph cluster by looking at the osd-prepare logs.
Follow these instructions to wipe them: https://rook.io/docs/rook/latest-release/Getting-Started/ceph-teardown/#zapping-devices
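(Editorial sketch of the zapping steps from the linked guide; on Talos there is no host shell, so these commands would have to run from a privileged pod, or the disk wiped as part of a node reset.)

```shell
# wipe a disk that previously belonged to a Ceph cluster (destroys all data on it!)
DISK="/dev/sdb"                      # the device from the node spec above
sgdisk --zap-all "$DISK"             # clear GPT/MBR structures
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync   # zero out leftover Ceph metadata
blkdiscard "$DISK" || true           # optional: discard blocks on SSD/NVMe
partprobe "$DISK"                    # have the kernel re-read the partition table
```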
I am stuck even before that phase, in `configuring MONs`:

```
2552906598] state: up:replay) since 675.073
debug 2025-06-03T14:10:53.096+0000 7f1c2880d640 0 cephx server client.admin: unexpected key: req.key=8973fbb8653eb28e expected_key=1c5ab51d2f84e6bc
debug 2025-06-03T14:10:53.356+0000 7f1c2880d640 0 cephx server client.admin: unexpected key: req.key=4a4cd93568452069 expected_key=93f905b60d3c2bbf
debug 2025-06-03T14:10:54.764+0000 7f1c2a9cf640 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
debug 2025-06-03T14:10:54.768+0000 7f1c2a9cf640 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
audit 2025-06-03T14:10:54.771317+0000 mon.a (mon.0) 144 : audit [DBG] from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
audit 2025-06-03T14:10:54.771543+0000 mon.a (mon.0) 145 : audit [DBG] from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
```
Please read the documentation link provided above (scroll to the top); you need to clean up the /var/lib/rook directory before creating a new cluster.
Thanks. This was it:

```shell
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: disk-clean
spec:
  restartPolicy: Never
  nodeName: talos-mec-lba
  volumes:
    - name: rook-data-dir
      hostPath:
        path: /var/lib/rook
  containers:
    - name: disk-clean
      image: busybox
      securityContext:
        privileged: true
      volumeMounts:
        - name: rook-data-dir
          mountPath: /node/rook-data
      command: ["/bin/sh", "-c", "rm -rf /node/rook-data/*"]
EOF
```
Maybe the docs should state more explicitly that when no new OSDs are created and the cluster is stuck in the "configuring mons" phase, this cleanup is needed.
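(Editorial sketch:) after the cleanup pod has run, one way to confirm the new cluster converges is to watch the CephCluster status and the mon/osd pods, assuming the default `rook-ceph` namespace.

```shell
# watch the cluster phase and the mon/osd pods come up after the cleanup
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph get pods -w
```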
@@ -100,6 +100,188 @@ ceph-bucket rook-ceph.ceph.rook.io/bucket Delete Immediate
ceph-filesystem rook-ceph.cephfs.csi.ceph.com Delete Immediate true 77m
```

## 🔧 Single Node Setup Instructions
We don't usually support/modify old Talos documentation; you should probably take this to website/content/v1.11.
Also, this is a very niche use case which shouldn't be used in general (single-node setup). I would probably move it to another document, perhaps in the advanced/ folder, and link it from here.
What would also be great: when copying a section, the CLI output (currently included in the docs) should not get copied along with the command. Or don't include the output in the copyable block at all.
Add the `nbd` kernel module that has to be enabled, plus single-node setup instructions.
Pull Request
What? (description)
Why? (reasoning)
Acceptance
Please use the following checklist:

- `make conformance`
- `make fmt`
- `make lint`
- `make docs`
- `make unit-tests`