Libceph connect error 101 - Fixing errors and finding optimal solutions to problems

After reboot host, FailedMount PersistentVolumeClaim (CSI) #165

Comments

dllozhkin commented Feb 5, 2019

After rebooting the host, my pod doesn’t mount the PVC.

I see this in the logs:
rpc error: code = DeadlineExceeded desc = context deadline exceeded

and in the docker logs:
kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Running service docker restart helps.


rootfs commented Feb 5, 2019

can you post the rbd (or cephfs) plugin logs?

dllozhkin commented Feb 5, 2019

rootfs commented Feb 5, 2019

ceph still thinks the old client is connecting to the rbd image after the reboot.

Per ceph doc, there is a timeout.

@leseb any idea what to do to remove the watcher?

rootfs commented Feb 5, 2019

cc @dillaman, will the watcher go away, eventually?

lelvisl commented Mar 6, 2019

Madhu-1 commented Mar 11, 2019

after rebooting node (used minikube for testing) not able to ssh to the machine getting below logs

Madhu-1 commented Mar 11, 2019

after few minutes of waiting all pods came into running state

note: used csi v1.0.0 images

yanchicago commented Jan 22, 2020

humblec commented Apr 1, 2020

after few minutes of waiting all pods came into running state

note: used csi v1.0.0 images

Closing this one as this is no longer an issue. Please feel free to reopen if the issue persists.
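For anyone hitting the same symptom: the stale client that rootfs describes appears in the rbd plugin logs as a line under "Watchers:". The following is a minimal Python sketch for pulling those entries out of captured log or rbd status text. The function name parse_watchers and the regular expression are illustrative assumptions based only on the line format seen in the logs posted for this issue.

```python
import re

# Assumed line format, taken from the plugin logs posted in this issue:
#   watcher=10.16.184.2:0/3356087603 client.335848 cookie=2
WATCHER_RE = re.compile(
    r"watcher=(?P<addr>\S+)\s+(?P<client>client\.\d+)\s+cookie=(?P<cookie>\d+)"
)


def parse_watchers(text):
    """Return (address, client, cookie) tuples for every watcher line in text."""
    return [
        (m.group("addr"), m.group("client"), int(m.group("cookie")))
        for m in WATCHER_RE.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Watchers:\n\twatcher=10.16.184.2:0/3356087603 client.335848 cookie=2\n"
    for addr, client, cookie in parse_watchers(sample):
        print(addr, client, cookie)
```

An empty result after the node has been back up for a while would suggest the old watcher has timed out; a non-empty one matches the "is still being used" error.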

Source

Ceph Common Issues

Many of these problem cases are hard to summarize in a short phrase that adequately describes the problem. Each problem starts with a bulleted list of symptoms. Keep in mind that not all symptoms may apply, depending on the configuration of Rook. If the majority of the symptoms are seen, there is a fair chance you are experiencing that problem.

If the problem is not resolved after trying the suggestions on this page, the Rook team is very happy to help you troubleshoot the issues in their Slack channel. Once you have registered for the Rook Slack, proceed to the #ceph channel to ask for assistance.

Troubleshooting Techniques¶

There are two main categories of information you will need to investigate issues in the cluster:

Kubernetes status and logs documented here
Ceph cluster status (see upcoming Ceph tools section)

Ceph Tools¶

After you verify the basic health of the running pods, next you will want to run Ceph tools for status of the storage components. There are two ways to run the Ceph tools, either in the Rook toolbox or inside other Rook pods that are already running.

Logs on a specific node to find why a PVC is failing to mount
See the log collection topic for a script that will help you gather the logs
Other artifacts:
The monitors that are expected to be in quorum: kubectl -n <cluster-namespace> get configmap rook-ceph-mon-endpoints -o yaml | grep data
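The data value printed by that command can be split into individual monitor endpoints. The sketch below assumes the value is a comma-separated list of name=host:port pairs (for example a=10.0.0.1:6789). That format, the sample addresses, and the function name parse_mon_endpoints are assumptions, so check them against your own ConfigMap output.

```python
def parse_mon_endpoints(data):
    """Parse an assumed "name=host:port,name=host:port" string into a dict."""
    endpoints = {}
    for item in data.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, addr = item.partition("=")
        # rpartition keeps a bracketed IPv6 host intact and splits off the port.
        host, _, port = addr.rpartition(":")
        endpoints[name] = (host, int(port))
    return endpoints


if __name__ == "__main__":
    # Hypothetical value, not taken from a real cluster.
    print(parse_mon_endpoints("a=10.0.0.1:6789,b=10.0.0.2:6789"))
```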

Tools in the Rook Toolbox¶

The rook-ceph-tools pod provides a simple environment to run Ceph tools. Once the pod is up and running, connect to the pod to execute Ceph commands to evaluate the current state of the cluster.

Ceph Commands¶

Here are some common commands to troubleshoot a Ceph cluster:

ceph status
ceph osd status
ceph osd df
ceph osd utilization
ceph osd pool stats
ceph osd tree
ceph pg stat

The first two status commands provide the overall cluster health. The normal state for cluster operations is HEALTH_OK, but the cluster will still function in the HEALTH_WARN state. A HEALTH_WARN state means the cluster is in a condition from which it may enter the HEALTH_ERR state, at which point disk I/O operations may be halted. If a HEALTH_WARN state is observed, take action to prevent the cluster from reaching HEALTH_ERR.
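As a rough illustration of acting on the health state, the sketch below maps the reported status to a coarse next step. It assumes the JSON form of the status (ceph status --format json) exposes the state under health.status; the function name health_action and the returned labels are made up for this example.

```python
import json


def health_action(status_json):
    """Map the cluster health status to a coarse next step."""
    status = json.loads(status_json)["health"]["status"]
    if status == "HEALTH_OK":
        return "none"
    if status == "HEALTH_WARN":
        # Still serving I/O, but worth looking into before it degrades.
        return "investigate"
    # Any other value is treated as an error state.
    return "urgent"


if __name__ == "__main__":
    # Hypothetical, heavily trimmed status document.
    print(health_action('{"health": {"status": "HEALTH_WARN"}}'))
```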

There are many Ceph sub-commands to look at and manipulate Ceph objects, well beyond the scope of this document. See the Ceph documentation for more details on gathering information about the health of the cluster. In addition, there are other helpful hints and some best practices located in the Advanced Configuration section. Of particular note, there are scripts for collecting logs and gathering OSD information there.

Cluster failing to service requests¶

Symptoms¶

Execution of the ceph command hangs
PersistentVolumes are not being created
A large number of slow requests are blocking
A large number of stuck requests are blocking
One or more MONs are restarting periodically

Investigation¶

Create a rook-ceph-tools pod to investigate the current state of Ceph. Here is an example of what one might see. In this case the ceph status command would just hang so a CTRL-C needed to be sent.

Another indication is when one or more of the MON pods restart frequently. Note the ‘mon107’ that has only been up for 16 minutes in the following output.

Solution¶

What is happening here is that the MON pods are restarting and one or more of the Ceph daemons are not getting configured with the proper cluster information. This is commonly the result of not specifying a value for dataDirHostPath in your Cluster CRD.

The dataDirHostPath setting specifies a path on the local host for the Ceph daemons to store configuration and data. Setting this to a path like /var/lib/rook, reapplying your Cluster CRD, and restarting all the Ceph daemons (MON, MGR, OSD, RGW) should solve this problem. After the Ceph daemons have been restarted, it is advisable to restart the rook-tools pod.

Monitors are the only pods running¶

Symptoms¶

Rook operator is running
Either a single mon starts or the mons start very slowly (at least several minutes apart)
The crash-collector pods are crashing
No mgr, osd, or other daemons are created except the CSI driver

Investigation¶

When the operator is starting a cluster, the operator will start one mon at a time and check that they are healthy before continuing to bring up all three mons. If the first mon is not detected healthy, the operator will continue to check until it is healthy. If the first mon fails to start, a second and then a third mon may attempt to start. However, they will never form quorum and the orchestration will be blocked from proceeding.

The crash-collector pods will be blocked from starting until the mons have formed quorum the first time.

There are several common causes for the mons failing to form quorum:

The operator pod does not have network connectivity to the mon pod(s). The network may be configured incorrectly.
One or more mon pods are in running state, but the operator log shows they are not able to form quorum
A mon is using configuration from a previous installation. See the cleanup guide for cleaning the previous cluster.
A firewall may be blocking the ports required for the Ceph mons to form quorum. Ensure ports 6789 and 3300 are enabled. See the Ceph networking guide for more details.
There may be MTU mismatch between different networking components. Some networks may be more susceptible to mismatch than others. If Kubernetes CNI or hosts enable jumbo frames (MTU 9000), Ceph will use large packets to maximize network bandwidth. If other parts of the networking chain don’t support jumbo frames, this could result in lost or rejected packets unexpectedly.

Operator fails to connect to the mon¶

First look at the logs of the operator to confirm if it is able to connect to the mons.

Likely you will see an error similar to the following, showing that the operator is timing out when connecting to the mon. The last command is ceph mon_status, followed by a timeout message five minutes later.

The error would appear to be an authentication error, but it is misleading. The real issue is a timeout.

Solution¶

If you see the timeout in the operator log, verify if the mon pod is running (see the next section). If the mon pod is running, check the network connectivity between the operator pod and the mon pod. A common issue is that the CNI is not configured correctly.

To verify the network connectivity:

Get the endpoint for a mon
Curl the mon from the operator pod

For example, this command will curl the first mon from the operator:

If "ceph v2" is printed to the console, the connection was successful. If the command does not respond or otherwise fails, the network connection cannot be established.
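The same connectivity check can be approximated without curl. The sketch below opens a TCP connection to a mon endpoint, reads the first bytes the mon sends, and compares them to "ceph v2". The function name mon_speaks_msgr2 is hypothetical, and the default port 3300 is the one listed in the firewall note earlier on this page; pass whatever host and port your mon endpoint actually uses.

```python
import socket

EXPECTED = b"ceph v2"


def mon_speaks_msgr2(host, port=3300, timeout=5.0):
    """Return True if the endpoint's greeting starts with b"ceph v2"."""
    banner = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # recv may return fewer bytes than requested, so read in a loop.
            while len(banner) < len(EXPECTED):
                chunk = conn.recv(len(EXPECTED) - len(banner))
                if not chunk:
                    break
                banner += chunk
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
    return banner == EXPECTED
```

A False result only says the greeting was not received from where the script runs, so run it from a pod on the same network as the operator to get a comparable answer.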

Failing mon pod¶

Second we need to verify if the mon pod started successfully.

If the mon pod is failing as in this example, you will need to look at the mon pod status or logs to determine the cause. If the pod is in a crash loop backoff state, you should see the reason by describing the pod.

See the solution in the next section regarding cleaning up the dataDirHostPath on the nodes.

Solution¶

This is a common problem when reinitializing the Rook cluster if the local directory used for persistence has not been purged. This directory is the dataDirHostPath setting in the cluster CRD and is typically set to /var/lib/rook. To fix the issue you will need to delete all components of Rook and then delete the contents of /var/lib/rook (or the directory specified by dataDirHostPath) on each of the hosts in the cluster. Then when the cluster CRD is applied to start a new cluster, the rook-operator should start all the pods as expected.

Deleting the dataDirHostPath folder is destructive to the storage. Only delete the folder if you are trying to permanently purge the Rook cluster.

See the Cleanup Guide for more details.

PVCs stay in pending state¶

Symptoms¶

When you create a PVC based on a rook storage class, it stays pending indefinitely

For the WordPress example, you might see two PVCs in pending state.

Investigation¶

There are two common causes for the PVCs staying in pending state:

There are no OSDs in the cluster
The CSI provisioner pod is not running or is not responding to the request to provision the storage

Confirm if there are OSDs¶

To confirm if you have OSDs in your cluster, connect to the Rook Toolbox and run the ceph status command. You should see that you have at least one OSD up and in. The minimum number of OSDs required depends on the replicated.size setting in the pool created for the storage class. In a "test" cluster, only one OSD is required (see storageclass-test.yaml). In the production storage class example (storageclass.yaml), three OSDs would be required.

OSD Prepare Logs¶

If you don’t see the expected number of OSDs, let’s investigate why they weren’t created. On each node where Rook looks for OSDs to configure, you will see an "osd prepare" pod.

See the section on why OSDs are not getting created to investigate the logs.

CSI Driver¶

The CSI driver may not be responding to the requests. Look in the logs of the CSI provisioner pod to see if there are any errors during the provisioning.

There are two provisioner pods:

Get the logs of each of the pods. One of them should be the "leader" and be responding to requests.

Operator unresponsiveness¶

Lastly, if you have OSDs up and in, the next step is to confirm the operator is responding to the requests. Look in the Operator pod logs around the time when the PVC was created to confirm if the request is being raised. If the operator does not show requests to provision the block image, the operator may be stuck on some other operation. In this case, restart the operator pod to get things going again.

Solution¶

If the "osd prepare" logs didn’t give you enough clues about why the OSDs were not being created, please review your cluster.yaml configuration. The common misconfigurations include:

If useAllDevices: true, Rook expects to find local devices attached to the nodes. If no devices are found, no OSDs will be created.
If useAllDevices: false, OSDs will only be created if deviceFilter is specified.
Only local devices attached to the nodes will be configurable by Rook. In other words, the devices must show up under /dev.
The devices must not have any partitions or filesystems on them. Rook will only configure raw devices. Partitions are not yet supported.

OSD pods are failing to start¶

Symptoms¶

OSD pods are failing to start
You have started a cluster after tearing down another cluster

Investigation¶

When an OSD starts, the device or directory will be configured for consumption. If there is an error with the configuration, the pod will crash and you will see the CrashLoopBackoff status for the pod. Look in the osd pod logs for an indication of the failure.

One common case for failure is that you have re-deployed a test cluster and some state remains from a previous deployment. If your cluster is larger than a few nodes, you may get lucky enough that the monitors were able to start and form quorum. However, the OSD pods may then fail to start due to the old state. Looking at the OSD pod logs you will see an error about the file already existing.

Solution¶

If the error is about a file that already exists, this is a common problem when reinitializing the Rook cluster if the local directory used for persistence has not been purged. This directory is the dataDirHostPath setting in the cluster CRD and is typically set to /var/lib/rook. To fix the issue you will need to delete all components of Rook and then delete the contents of /var/lib/rook (or the directory specified by dataDirHostPath) on each of the hosts in the cluster. Then when the cluster CRD is applied to start a new cluster, the rook-operator should start all the pods as expected.

OSD pods are not created on my devices¶

Symptoms¶

No OSD pods are started in the cluster
Devices are not configured with OSDs even though specified in the Cluster CRD
One OSD pod is started on each node instead of multiple pods for each device

Investigation¶

First, ensure that you have specified the devices correctly in the CRD. The Cluster CRD has several ways to specify the devices that are to be consumed by the Rook storage:

useAllDevices: true: Rook will consume all devices it determines to be available
deviceFilter: Consume all devices that match this regular expression
devices: Explicit list of device names on each node to consume
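Before applying a deviceFilter, it can help to preview which device names it would select. The sketch below is only an approximation: it uses Python's re module with search semantics, while Rook evaluates the expression with its own regular expression engine, so results may differ for less common syntax. The function name devices_matching and the sample device names are hypothetical.

```python
import re


def devices_matching(device_filter, device_names):
    """Return the device names that the given regular expression matches."""
    pattern = re.compile(device_filter)
    return [name for name in device_names if pattern.search(name)]


if __name__ == "__main__":
    names = ["sda", "sdb", "sdc", "nvme0n1"]
    # Anchoring with ^ avoids accidental matches in the middle of a name.
    print(devices_matching("^sd[b-d]", names))
```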

Second, if Rook determines that a device is not available (it has existing partitions or a formatted filesystem), Rook will skip consuming the device. If Rook is not starting OSDs on the devices you expect, Rook may have skipped them for this reason. To see if a device was skipped, view the OSD preparation log on the node where the device was skipped. Note that it is completely normal and expected for the OSD prepare pod to be in the completed state. After the job is complete, Rook leaves the pod around in case the logs need to be investigated.

Here are some key lines to look for in the log:

Solution¶

Either update the CR with the correct settings, or clean the partitions or filesystem from your devices. To clean devices from a previous install see the cleanup guide.

After the settings are updated or the devices are cleaned, trigger the operator to analyze the devices again by restarting the operator. Each time the operator starts, it will ensure all the desired devices are configured. The operator does automatically deploy OSDs in most scenarios, but an operator restart will cover any scenarios that the operator doesn’t detect automatically.

Node hangs after reboot¶

This issue is fixed in Rook v1.3 or later.

Symptoms¶

After issuing a reboot command, node never returned online
Only a power cycle helps

Investigation¶

On a node running a pod with a Ceph persistent volume

When the reboot command is issued, network interfaces are terminated before disks are unmounted. This results in the node hanging as repeated attempts to unmount Ceph persistent volumes fail with the following error:

Solution¶

The node needs to be drained before reboot. After the successful drain, the node can be rebooted as usual.

Because the kubectl drain command automatically marks the node as unschedulable (the same effect as kubectl cordon), the node needs to be uncordoned once it’s back online.

Uncordon the node:
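The drain and uncordon steps can be sketched as kubectl argument lists built from Python. This is an illustration under assumptions: the helper names are hypothetical, --ignore-daemonsets is included because DaemonSet-managed pods otherwise block a drain, and your cluster may need further drain flags that are not shown here. The run helper defaults to a dry run that only prints the command.

```python
import subprocess


def drain_command(node):
    """kubectl invocation that evicts pods and cordons the node."""
    return ["kubectl", "drain", node, "--ignore-daemonsets"]


def uncordon_command(node):
    """kubectl invocation that makes the node schedulable again."""
    return ["kubectl", "uncordon", node]


def run(command, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(command))
    if dry_run:
        return 0
    return subprocess.run(command, check=True).returncode


if __name__ == "__main__":
    node = "node-1"  # hypothetical node name
    run(drain_command(node))
    # Reboot the node here and wait for it to come back online.
    run(uncordon_command(node))
```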

Using multiple shared filesystem (CephFS) is attempted on a kernel version older than 4.7¶

Symptoms¶

More than one shared filesystem (CephFS) has been created in the cluster
A pod attempts to mount any other shared filesystem besides the first one that was created
The pod incorrectly gets the first filesystem mounted instead of the intended filesystem

Solution¶

The only solution to this problem is to upgrade your kernel to 4.7 or higher. This is due to a mount flag added in kernel version 4.7 that allows choosing the filesystem by name.

For additional info on the kernel version requirement for multiple shared filesystems (CephFS), see Filesystem — Kernel version requirement.
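To check whether a node meets the 4.7 requirement, the kernel release string can be compared numerically. The sketch below is a minimal example; the function name kernel_at_least is hypothetical, and it only looks at the leading major.minor part of the release string.

```python
import platform
import re


def kernel_at_least(release, minimum=(4, 7)):
    """Return True if a release string such as "4.4.0-142-generic" is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError("unrecognized kernel release: %r" % release)
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    # platform.release() reports the kernel of the machine running the script.
    print(platform.release(), kernel_at_least(platform.release()))
```

Comparing tuples of integers avoids the string-comparison trap where "4.10" would sort before "4.7".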

Set debug log level for all Ceph daemons¶

You can set a given log level and apply it to all the Ceph daemons at the same time. For this, make sure the toolbox pod is running, then determine the level you want (between 0 and 20). You can find the list of all subsystems and their default values in Ceph logging and debug official guide. Be careful when increasing the level as it will produce very verbose logs.

Assuming you want a log level of 1, you will run:

Once you are done debugging, you can revert all the debug flags to their default values by running the following:

Activate log to file for a particular Ceph daemon¶

There are cases where looking at Kubernetes logs is not enough, for diverse reasons. To name a few:

not everyone is familiar with Kubernetes logging, and some expect to find logs in traditional directories
logs get eaten (buffer limit from the log engine) and thus cannot be requested from Kubernetes

So for each daemon, dataDirHostPath is used to store logs, if logging is activated. Rook will bind mount dataDirHostPath for every pod. Let’s say you want to enable logging for mon.a, but only for this daemon. Using the toolbox or from inside the operator run:

This will activate logging on the filesystem. You will be able to find logs in dataDirHostPath/$NAMESPACE/log, so typically this would mean /var/lib/rook/rook-ceph/log. You don’t need to restart the pod; the effect will be immediate.

To disable logging to file, simply set log_to_file to false.
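As a small illustration of the two pieces involved, the sketch below builds the ceph config set argument list that toggles log_to_file for one daemon and computes the host directory where the logs are expected. The helper names are hypothetical, and the path layout simply follows the dataDirHostPath/$NAMESPACE/log pattern described above.

```python
import posixpath


def log_to_file_command(daemon, enabled):
    """Argument list for toggling log_to_file on a single daemon, e.g. mon.a."""
    value = "true" if enabled else "false"
    return ["ceph", "config", "set", daemon, "log_to_file", value]


def host_log_dir(data_dir_host_path, namespace):
    """Directory on the host where the daemon logs are expected to appear."""
    return posixpath.join(data_dir_host_path, namespace, "log")


if __name__ == "__main__":
    print(" ".join(log_to_file_command("mon.a", True)))
    print(host_log_dir("/var/lib/rook", "rook-ceph"))
```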

A worker node using RBD devices hangs up¶

Symptoms¶

There is no progress on I/O from/to one of the RBD devices (/dev/rbd* or /dev/nbd*).
After that, the whole worker node hangs up.

Investigation¶

This happens when the following conditions are satisfied.

The problematic RBD device and the corresponding OSDs are co-located.
There is an XFS filesystem on top of this device.

In addition, when this problem happens, you can see the following messages in dmesg.

This is the so-called hung_task problem, which means that there is a deadlock in the kernel. For more detail, please refer to the corresponding issue comment.

Solution¶

This problem will be solved by the following two fixes.

Linux kernel: A minor feature that is introduced by this commit. It will be included in Linux v5.6.
Ceph: A fix that uses the above-mentioned kernel’s feature. The Ceph community will probably discuss this fix after releasing Linux v5.6.

You can bypass this problem by using ext4 or another filesystem instead of XFS. The filesystem type can be specified with csi.storage.k8s.io/fstype in the StorageClass resource.

Too few PGs per OSD warning is shown¶

Symptoms¶

ceph status shows a "too few PGs per OSD" warning as follows.

Solution¶

The meaning of this warning is written in the document. However, in many cases it is benign. For more information, please see the blog entry. Please refer to Configuring Pools if you want to know the proper pg_num of pools and change these values.

LVM metadata can be corrupted with OSD on LV-backed PVC¶

Symptoms¶

There is a critical flaw in OSD on LV-backed PVC. LVM metadata can be corrupted if both the host and the OSD container modify it simultaneously. For example, the administrator might modify it on the host while the OSD initialization process in a container modifies it too. In addition, if lvmetad is running, the problem is more likely to occur, because a change to LVM metadata made in the OSD container is not reflected in the host’s LVM metadata cache for a while.

If you still decide to configure an OSD on LVM, please keep the following in mind to reduce the probability of this issue.

Solution¶

Disable lvmetad.
Avoid configuration of LVs from the host. In addition, don’t touch the VGs and physical volumes that back these LVs.
Avoid incrementing the count field of storageClassDeviceSets and creating a new LV that backs an OSD at the same time.

You can check whether the above-mentioned tag exists with the command: sudo lvs -o lv_name,lv_tags. If the lv_tags field is empty for an LV corresponding to an OSD, this OSD encountered the problem. In this case, please retire this OSD or replace it with a new OSD before restarting.
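Scanning the lvs output by eye gets tedious with many LVs. The sketch below lists the LVs whose tags column is empty, assuming each data row is an LV name optionally followed by one whitespace-free, comma-separated tags field. The function name lvs_without_tags and the sample rows are hypothetical, and it does not know which LVs back OSDs, so filter the result to your OSD LVs yourself.

```python
def lvs_without_tags(lvs_output):
    """Return names of LVs whose lv_tags column is empty."""
    untagged = []
    for line in lvs_output.splitlines():
        fields = line.split()
        # A row with only the LV name has no tags; a header row has more fields.
        if len(fields) == 1:
            untagged.append(fields[0])
    return untagged


if __name__ == "__main__":
    # Made-up rows in the assumed two-column layout.
    sample = (
        "  LV             LV Tags\n"
        "  osd-block-aaa  ceph.osd_id=0,ceph.type=block\n"
        "  osd-block-bbb  \n"
    )
    print(lvs_without_tags(sample))
```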

This problem doesn’t happen in newly created LV-backed PVCs because the OSD container doesn’t modify LVM metadata anymore. The existing lvm mode OSDs continue to work even after you upgrade Rook. However, using raw mode OSDs is recommended because of the above-mentioned problem. You can replace the existing OSDs with raw mode OSDs by retiring them and adding new OSDs one by one. See the documents Remove an OSD and Add an OSD on a PVC.

OSD prepare job fails due to low aio-max-nr setting¶

If the Kernel is configured with a low aio-max-nr setting, the OSD prepare job might fail with the following error:

To overcome this, you need to increase the value of fs.aio-max-nr of your sysctl configuration (typically /etc/sysctl.conf ). You can do this with your favorite configuration management system.

Alternatively, you can have a DaemonSet to apply the configuration for you on all your nodes.
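To see where a node currently stands, the live value can be read from /proc/sys/fs/aio-max-nr, which is the file backing the fs.aio-max-nr sysctl. The sketch below compares it with a target; the default target of 1048576 and the function names are assumptions for illustration, so substitute the value appropriate for your OSD count.

```python
def read_aio_max_nr(path="/proc/sys/fs/aio-max-nr"):
    """Read the current fs.aio-max-nr value from procfs."""
    with open(path) as handle:
        return int(handle.read().strip())


def aio_max_nr_sufficient(current, required=1048576):
    """Return True if the current limit meets the (assumed) required value."""
    return current >= required


if __name__ == "__main__":
    current = read_aio_max_nr()
    print(current, aio_max_nr_sufficient(current))
```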

Unexpected partitions created¶

Symptoms¶

Users running Rook versions v1.6.0-v1.6.7 may observe unwanted OSDs on partitions that appear unexpectedly and seemingly randomly, which can corrupt existing OSDs.

Unexpected partitions are created on host disks that are used by Ceph OSDs. This happens more often on SSDs than HDDs and usually only on disks that are 875GB or larger. Many tools like lsblk, blkid, udevadm, and parted will not show a partition table type for the partition. Newer versions of blkid are generally able to recognize the type as "atari".

The underlying issue causing this is Atari partition (sometimes identified as AHDI) support in the Linux kernel. Atari partitions have very relaxed specifications compared to other partition types, and it is relatively easy for random data written to a disk to appear as an Atari partition to the Linux kernel. Ceph’s Bluestore OSDs have an anecdotally high probability of writing data on to disks that can appear to the kernel as an Atari partition.

Below is an example of lsblk output from a node where phantom Atari partitions are present. Note that sdX1 is never present for the phantom partitions, and sdX2 is 48G on all disks. sdX3 is a variable size and may not always be present. It is possible for sdX4 to appear, though it is an anecdotally rare event.
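The pattern described here (no sdX1, but sdX2 and possibly sdX3 or sdX4 present) can be turned into a rough screening check over the device names reported by lsblk. The sketch below is a heuristic only: the function name suspect_disks is hypothetical, it considers only sd-style names, and a disk that legitimately lacks a first partition would also be flagged.

```python
import re
from collections import defaultdict

# Matches partition names such as "sdb2" and captures disk and partition number.
PARTITION_RE = re.compile(r"^(sd[a-z]+)(\d+)$")


def suspect_disks(device_names):
    """Return disks that have partitions but no partition number 1."""
    partitions = defaultdict(set)
    for name in device_names:
        match = PARTITION_RE.match(name)
        if match:
            partitions[match.group(1)].add(int(match.group(2)))
    return sorted(disk for disk, numbers in partitions.items() if 1 not in numbers)


if __name__ == "__main__":
    # Hypothetical device list, not real lsblk output.
    print(suspect_disks(["sda", "sda1", "sda2", "sdb", "sdb2", "sdb3", "sdc"]))
```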

Solution¶

Recover from corruption (v1.6.0-v1.6.7)¶

If you are using Rook v1.6, you must first update to v1.6.8 or higher to avoid further incidents of OSD corruption caused by these Atari partitions.

An old workaround suggested using deviceFilter: ^sd[a-z]+$, but this still results in unexpected partitions. Rook will merely stop creating new OSDs on the partitions. It does not fix a related issue: ceph-volume is unaware of the Atari partition problem. Users who used this workaround are still at risk for OSD failures in the future.

To resolve the issue, immediately update to v1.6.8 or higher. After the update, no corruption should occur on OSDs created in the future. Next, to get back to a healthy Ceph cluster state, focus on one corrupted disk at a time and remove all OSDs on each corrupted disk one disk at a time.

As an example, you may have /dev/sdb with two unexpected partitions (/dev/sdb2 and /dev/sdb3) as well as a second corrupted disk /dev/sde with one unexpected partition (/dev/sde2).

1. First, remove the OSDs associated with /dev/sdb, /dev/sdb2, and /dev/sdb3. There might be only one, or up to 3 OSDs depending on how your system was affected. Again see the OSD management doc.
2. Use dd to wipe the first sectors of the partitions followed by the disk itself. E.g.,
   - dd if=/dev/zero of=/dev/sdb2 bs=1M
   - dd if=/dev/zero of=/dev/sdb3 bs=1M
   - dd if=/dev/zero of=/dev/sdb bs=1M
3. Then wipe clean /dev/sdb to prepare it for a new OSD. See the teardown document for details.
4. After this, scale up the Rook operator to deploy a new OSD to /dev/sdb. This will allow Ceph to use /dev/sdb for data recovery and replication while the next OSDs are removed.
5. Now repeat steps 1-4 for /dev/sde and /dev/sde2, and continue for any other corrupted disks.

If your Rook cluster does not have any critical data stored in it, it may be simpler to uninstall Rook completely and redeploy with v1.6.8 or higher.

Operator environment variables are ignored¶

Symptoms¶

Configuration settings passed as environment variables do not take effect as expected. For example, the discover daemonset is not created, even though ROOK_ENABLE_DISCOVERY_DAEMON="true" is set.

Investigation¶

Inspect the rook-ceph-operator-config ConfigMap for conflicting settings. The ConfigMap takes precedence over the environment. The ConfigMap must exist, even if all actual configuration is supplied through the environment.
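The precedence rule can be pictured with a small lookup function. This is a simplified model for reasoning about conflicts, not the operator's actual code; the function name effective_setting and the returned source labels are made up for this example.

```python
def effective_setting(name, configmap_data, environ):
    """Resolve a setting, letting the ConfigMap win over the environment."""
    if name in configmap_data:
        return configmap_data[name], "configmap"
    if name in environ:
        return environ[name], "env"
    return None, "unset"


if __name__ == "__main__":
    configmap = {"ROOK_ENABLE_DISCOVERY_DAEMON": "false"}
    env = {"ROOK_ENABLE_DISCOVERY_DAEMON": "true"}
    # The ConfigMap value is the one that applies, so discovery stays off here.
    print(effective_setting("ROOK_ENABLE_DISCOVERY_DAEMON", configmap, env))
```

In this model, a stale "false" left in the ConfigMap silently overrides a "true" set in the environment, which matches the symptom described above.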

Look for lines with the op-k8sutil prefix in the operator logs. These lines detail the final values, and source, of the different configuration variables.

Verify that both of the following messages are present in the operator logs:

Solution¶

If it does not exist, create an empty ConfigMap:

Source

-c driver-registrar

I0205 11:54:06.183084    3621 main.go:135] Attempting to open a gRPC connection with: "/var/lib/kubelet/plugins/csi-rbdplugin/csi.sock"
I0205 11:54:06.183094    3621 connection.go:68] Connecting to /var/lib/kubelet/plugins/csi-rbdplugin/csi.sock
I0205 11:54:06.211846    3621 connection.go:95] Still trying, connection is CONNECTING
I0205 11:54:06.211978    3621 connection.go:95] Still trying, connection is TRANSIENT_FAILURE
I0205 11:54:07.219463    3621 connection.go:95] Still trying, connection is TRANSIENT_FAILURE
I0205 11:54:08.268501    3621 connection.go:92] Connected
I0205 11:54:08.268523    3621 main.go:143] Calling CSI driver to discover driver name.
I0205 11:54:08.268555    3621 connection.go:136] GRPC call: /csi.v0.Identity/GetPluginInfo
I0205 11:54:08.268561    3621 connection.go:137] GRPC request:
I0205 11:54:08.288980    3621 connection.go:139] GRPC response: name:"csi-rbdplugin" vendor_version:"0.3.0"
I0205 11:54:08.289033    3621 connection.go:140] GRPC error: <nil>
I0205 11:54:08.289040    3621 main.go:151] CSI driver name: "csi-rbdplugin"
I0205 11:54:08.289046    3621 main.go:154] Calling CSI driver to discover node ID.
I0205 11:54:08.289058    3621 connection.go:136] GRPC call: /csi.v0.Node/NodeGetId
I0205 11:54:08.289062    3621 connection.go:137] GRPC request:
I0205 11:54:08.289451    3621 connection.go:139] GRPC response: node_id:"master"
I0205 11:54:08.289486    3621 connection.go:140] GRPC error: <nil>
I0205 11:54:08.289492    3621 main.go:162] CSI driver node ID: "master"
I0205 11:54:08.289536    3621 main.go:185] Starting Registration Server at: /registration/csi-rbdplugin-reg.sock
I0205 11:54:08.289578    3621 main.go:192] Registration Server started at: /registration/csi-rbdplugin-reg.sock
I0205 11:54:08.290922    3621 main.go:93] Received GetInfo call: &InfoRequest{}
I0205 11:54:08.593231    3621 main.go:103] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

-c csi-rbdplugin

Watchers:
	watcher=10.16.184.2:0/3356087603 client.335848 cookie=2
I0205 12:10:38.160174    4024 rbd_util.go:170] rbd: status pvc-9eecb201289e11e9 using mon 192.168.1.1:6789,192.168.1.2:6789,192.168.1.3:6789, pool rbd id admin key *
I0205 12:10:38.240097    4024 rbd_util.go:189] rbd: watchers on pvc-9eecb201289e11e9: 2019-02-05 12:10:38.174194 7fcf99b71d40 -1 did not load config file, using default settings.
2019-02-05 12:10:38.179275 7fcf99b71d40 -1 Errors while parsing config file!
2019-02-05 12:10:38.179284 7fcf99b71d40 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:38.179285 7fcf99b71d40 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:38.179286 7fcf99b71d40 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2019-02-05 12:10:38.180714 7fcf99b71d40 -1 Errors while parsing config file!
2019-02-05 12:10:38.180724 7fcf99b71d40 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:38.180725 7fcf99b71d40 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:38.180726 7fcf99b71d40 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2019-02-05 12:10:38.206468 7fcf99b71d40 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
Watchers:
	watcher=10.16.184.2:0/3356087603 client.335848 cookie=1
I0205 12:10:38.240129    4024 keymutex.go:59] UnlockKey(...) called for id "rbd/pvc-9eecb201289e11e9"
I0205 12:10:38.240147    4024 keymutex.go:66] UnlockKey(...) for id. Mutex found, trying to unlock it. "rbd/pvc-9eecb201289e11e9"
I0205 12:10:38.240154    4024 keymutex.go:69] UnlockKey(...) for id "rbd/pvc-9eecb201289e11e9" completed.
I0205 12:10:38.240160    4024 keymutex.go:59] UnlockKey(...) called for id "/var/lib/kubelet/pods/d0a45204-2923-11e9-8dbe-0050562b002b/volumes/kubernetes.io~csi/pvc-9eecb201289e11e9/mount"
I0205 12:10:38.240171    4024 keymutex.go:66] UnlockKey(...) for id. Mutex found, trying to unlock it. "/var/lib/kubelet/pods/d0a45204-2923-11e9-8dbe-0050562b002b/volumes/kubernetes.io~csi/pvc-9eecb201289e11e9/mount"
I0205 12:10:38.240178    4024 keymutex.go:69] UnlockKey(...) for id "/var/lib/kubelet/pods/d0a45204-2923-11e9-8dbe-0050562b002b/volumes/kubernetes.io~csi/pvc-9eecb201289e11e9/mount" completed.
E0205 12:10:38.240187    4024 utils.go:100] GRPC error: rbd image rbd/pvc-9eecb201289e11e9 is still being used
I0205 12:10:42.915961    4024 rbd_util.go:170] rbd: status pvc-11589d68286911e9 using mon 192.168.1.1:6789,192.168.1.2:6789,192.168.1.3:6789, pool rbd id admin key *
I0205 12:10:42.986794    4024 rbd_util.go:189] rbd: watchers on pvc-11589d68286911e9: 2019-02-05 12:10:42.930796 7f6cf58b2d40 -1 did not load config file, using default settings.
2019-02-05 12:10:42.935513 7f6cf58b2d40 -1 Errors while parsing config file!
2019-02-05 12:10:42.935522 7f6cf58b2d40 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:42.935523 7f6cf58b2d40 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:42.935524 7f6cf58b2d40 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2019-02-05 12:10:42.937417 7f6cf58b2d40 -1 Errors while parsing config file!
2019-02-05 12:10:42.937425 7f6cf58b2d40 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:42.937425 7f6cf58b2d40 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:42.937426 7f6cf58b2d40 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2019-02-05 12:10:42.962895 7f6cf58b2d40 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
Watchers:
	watcher=10.16.184.2:0/3356087603 client.335848 cookie=2
I0205 12:10:57.744845    4024 rbd_util.go:170] rbd: status pvc-11589d68286911e9 using mon 192.168.1.1:6789,192.168.1.2:6789,192.168.1.3:6789, pool rbd id admin key *
I0205 12:10:57.815428    4024 rbd_util.go:189] rbd: watchers on pvc-11589d68286911e9: 2019-02-05 12:10:57.758475 7f3817cd3d40 -1 did not load config file, using default settings.
2019-02-05 12:10:57.763387 7f3817cd3d40 -1 Errors while parsing config file!
2019-02-05 12:10:57.763396 7f3817cd3d40 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:57.763397 7f3817cd3d40 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:57.763398 7f3817cd3d40 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2019-02-05 12:10:57.764948 7f3817cd3d40 -1 Errors while parsing config file!
2019-02-05 12:10:57.764956 7f3817cd3d40 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:57.764956 7f3817cd3d40 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
2019-02-05 12:10:57.764957 7f3817cd3d40 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2019-02-05 12:10:57.791878 7f3817cd3d40 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
Watchers:
	watcher=10.16.184.2:0/3356087603 client.335848 cookie=2
I0205 12:10:57.815449    4024 keymutex.go:59] UnlockKey(...) called for id "rbd/pvc-11589d68286911e9"
I0205 12:10:57.815474    4024 keymutex.go:66] UnlockKey(...) for id. Mutex found, trying to unlock it. "rbd/pvc-11589d68286911e9"
I0205 12:10:57.815480    4024 keymutex.go:69] UnlockKey(...) for id "rbd/pvc-11589d68286911e9" completed.
I0205 12:10:57.815486    4024 keymutex.go:59] UnlockKey(...) called for id "/var/lib/kubelet/pods/656fdb80-293e-11e9-b9bd-0050562b002b/volumes/kubernetes.io~csi/pvc-11589d68286911e9/mount"
I0205 12:10:57.815497    4024 keymutex.go:66] UnlockKey(...) for id. Mutex found, trying to unlock it. "/var/lib/kubelet/pods/656fdb80-293e-11e9-b9bd-0050562b002b/volumes/kubernetes.io~csi/pvc-11589d68286911e9/mount"
I0205 12:10:57.815505    4024 keymutex.go:69] UnlockKey(...) for id "/var/lib/kubelet/pods/656fdb80-293e-11e9-b9bd-0050562b002b/volumes/kubernetes.io~csi/pvc-11589d68286911e9/mount" completed.
E0205 12:10:57.815514    4024 utils.go:100] GRPC error: rbd image rbd/pvc-11589d68286911e9 is still being used
I0205 12:10:57.815547    4024 keymutex.go:53] LockKey(...) for id "/var/lib/kubelet/pods/656fdb80-293e-11e9-b9bd-0050562b002b/volumes/kubernetes.io~csi/pvc-11589d68286911e9/mount" completed.
I0205 12:10:57.815578    4024 nsenter_mount.go:166] findmnt: directory /var/lib/kubelet/pods/656fdb80-293e-11e9-b9bd-0050562b002b/volumes/kubernetes.io~csi/pvc-11589d68286911e9/mount does not exist
I0205 12:10:57.815666    4024 keymutex.go:50] LockKey(...) called for id "rbd/pvc-11589d68286911e9"
I0205 12:10:57.815674    4024 keymutex.go:53] LockKey(...) for id "rbd/pvc-11589d68286911e9" completed.
I0205 12:10:57.817341    4024 rbd_util.go:170] rbd: status pvc-11589d68286911e9 using mon 192.168.1.1:6789,192.168.1.2:6789,192.168.1.3:6789, pool rbd id admin key *
I0205 12:10:57.897885    4024 rbd_util.go:189] rbd: watchers on pvc-11589d68286911e9: 2019-02-05 12:10:57.841254 7fe75cb23d40 -1 did not load config file, using default settings.
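
The "Watchers:" entries in the log above are what the plugin sees each time it runs rbd status on the image. As a minimal sketch (the helper name list_watchers is hypothetical and not part of ceph-csi), the watcher addresses can be extracted from that output like this:

```shell
# Print the address of every watcher found in `rbd status` output (read from stdin).
# list_watchers is an illustrative name, not an existing tool.
list_watchers() {
  sed -n 's/^[[:space:]]*watcher=\([^[:space:]]*\).*/\1/p'
}
```

Assuming the rbd CLI can reach the cluster, usage would look like `rbd status rbd/pvc-11589d68286911e9 | list_watchers`. If the only address printed belongs to the node that was rebooted, the watcher is most likely a stale one left over from before the reboot.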

Source

Hi,

I have a 3 node cluster with ceph. On this cluster I have some nodes that use this same ceph cluster (ceph-csi on nomad). This all worked fine.

Two days ago I did an update from Proxmox 6.4 to 7.2. I also did the Ceph upgrade from Octopus to Pacific. All seemed to work fine. But then I restarted a VM that also uses the ceph cluster. The initial mapping of my rbd volume is not an issue. Ceph-csi uses the rbd kernel module for this. But after a while I get spammed with:

```
[ 984.270858] libceph: connect (1)10.25.70.9:6789 error -101
[ 984.272609] libceph: mon2 (1)10.25.70.9:6789 connect error
[ 988.685044] libceph: connect (1)10.25.70.7:6811 error -101
[ 988.686704] libceph: osd5 (1)10.25.70.7:6811 connect error
[ 992.268982] libceph: connect (1)10.25.70.9:6789 error -101
[ 992.271040] libceph: mon2 (1)10.25.70.9:6789 connect error
[ 1004.556631] libceph: connect (1)10.25.70.7:6811 error -101
[ 1004.558273] libceph: osd5 (1)10.25.70.7:6811 connect error
[ 1006.604652] libceph: connect (1)10.25.70.7:6789 error -101
[ 1006.606391] libceph: mon0 (1)10.25.70.7:6789 connect error
[ 1006.860603] libceph: connect (1)10.25.70.7:6789 error -101
[ 1006.862242] libceph: mon0 (1)10.25.70.7:6789 connect error
[ 1007.364579] libceph: connect (1)10.25.70.7:6789 error -101
[ 1007.366211] libceph: mon0 (1)10.25.70.7:6789 connect error
[ 1008.269474] libceph: connect (1)10.25.70.7:6789 error -101
[ 1008.271358] libceph: mon0 (1)10.25.70.7:6789 connect error
[ 1010.252523] libceph: connect (1)10.25.70.7:6789 error -101
[ 1010.254167] libceph: mon0 (1)10.25.70.7:6789 connect error
[ 1014.284420] libceph: connect (1)10.25.70.7:6789 error -101
[ 1014.286057] libceph: mon0 (1)10.25.70.7:6789 connect error
[ 1020.428274] libceph: connect (1)10.25.70.7:6811 error -101
[ 1020.429976] libceph: osd5 (1)10.25.70.7:6811 connect error
[ 1022.476178] libceph: connect (1)10.25.70.7:6789 error -101
[ 1022.478113] libceph: mon0 (1)10.25.70.7:6789 connect error
```

And my volume goes stale. All my processes crap out and I have no idea why. I checked all the network connections and I can connect to all the addresses mentioned above without any problem. I suspect that as soon as I set up a new session with a monitor, I get this issue. The old sessions (pre-update) were working fine.

Anyone have any idea? Am I missing something?
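
For context on the messages above: on Linux, errno 101 is ENETUNREACH ("Network is unreachable"), so error -101 means the kernel had no usable route to the peer when it tried to connect. That is different from a refused or timed-out connection. A small sketch for seeing which peers are affected (the function name is hypothetical; it only parses dmesg text):

```shell
# Count "libceph: connect ... error -101" lines per peer address.
# Input: dmesg text on stdin. Output: "<count> <address>" sorted by address.
count_unreachable_peers() {
  awk '/libceph: connect / && /error -101/ {
         addr = $(NF - 2)               # the field right before "error -101"
         sub(/^\([0-9]+\)/, "", addr)   # drop the "(1)" prefix some kernels print
         count[addr]++
       }
       END { for (a in count) print count[a], a }' | sort -k2
}
```

Usage would be `dmesg | count_unreachable_peers`. Running `ip route get 10.25.70.9` at the moment the errors appear may then show whether the routing table inside the VM still has a path to that address.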

Source

client reboot stuck if the ceph node is not reachable or shutdown

Description

1) mount cephfs on the client
2) shut down the osd+mon node or make it unreachable
3) while the client is accessing the mount (a simple ls on a directory), reboot the client
The reboot hangs until the client can reach the ceph nodes again or a hard reset is done

Expected behavior:
reboot should work when ceph nodes are not reachable

[ubuntu@mira101 ~]$ ceph -v
ceph version 10.2.0-910-gab42bc5 (ab42bc5925cc1aaaa837522c4cbcf60afb2ac764)

[ubuntu@mira101 ~]$ cat /etc/redhat-release 
CentOS Linux release 7.2.1511 (Core) 
[ubuntu@mira101 ~]$ uname -a
Linux mira101 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

[vakulkar@vakulkar ~]$ ipmitool -H mira101.ipmi.sepia.ceph.com -U inktank -I lanplus sol activate
[SOL Session operational.  Use ~? for help]
[ 9116.919063] INFO: task systemd:32250 blocked for more than 120 seconds.
[ 9116.925739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9116.933721] systemd         D 000000000000001b     0 32250      0 0x00000080
[ 9116.941004]  ffff88042784fe48 0000000000000082 ffff880426a35080 ffff88042784ffd8
[ 9116.948789]  ffff88042784ffd8 ffff88042784ffd8 ffff880426a35080 ffff8803ac8f5308
[ 9116.956497]  ffff8803ac8f5630 0000000000000000 ffff8804272a1000 000000000000001b
[ 9116.964199] Call Trace:
[ 9116.966721]  [<ffffffff8163a909>] schedule+0x29/0x70
[ 9116.971844]  [<ffffffffa04529c3>] ceph_mdsc_sync+0x3a3/0x600 [ceph]
[ 9116.978265]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[ 9116.984258]  [<ffffffff8120f9b0>] ? do_fsync+0xa0/0xa0
[ 9116.989560]  [<ffffffffa0430652>] ceph_sync_fs+0x62/0xd0 [ceph]
[ 9116.995636]  [<ffffffff8120f9d0>] sync_fs_one_sb+0x20/0x30
[ 9117.001219]  [<ffffffff811e20d2>] iterate_supers+0xb2/0x110
[ 9117.006872]  [<ffffffff8120fae4>] sys_sync+0x64/0xb0
[ 9117.011941]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[ 9218.991528] libceph: mds0 172.21.5.138:6812 socket closed (con state OPEN)
[ 9220.038994] libceph: connect 172.21.5.138:6812 error -101
[ 9220.044413] libceph: mds0 172.21.5.138:6812 connect error
[ 9221.038566] libceph: connect 172.21.5.138:6812 error -101
[ 9221.043982] libceph: mds0 172.21.5.138:6812 connect error
[ 9223.037427] libceph: connect 172.21.5.138:6812 error -101
[ 9223.042839] libceph: mds0 172.21.5.138:6812 connect error
[ 9227.035265] libceph: connect 172.21.5.138:6812 error -101
[ 9227.040678] libceph: mds0 172.21.5.138:6812 connect error
[ 9231.784910] libceph: mon0 172.21.5.138:6789 socket closed (con state OPEN)
[ 9231.791802] libceph: mon0 172.21.5.138:6789 session lost, hunting for new mon
[ 9231.798965] libceph: connect 172.21.5.138:6789 error -101
[ 9231.804489] libceph: mon0 172.21.5.138:6789 connect error
[ 9235.046992] libceph: connect 172.21.5.138:6812 error -101
[ 9235.052416] libceph: mds0 172.21.5.138:6812 connect error
[ 9236.953992] INFO: task systemd:32250 blocked for more than 120 seconds.
[ 9236.960638] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9236.968467] systemd         D 000000000000001b     0 32250      0 0x00000080
[ 9236.975619]  ffff88042784fe48 0000000000000082 ffff880426a35080 ffff88042784ffd8
[ 9236.983260]  ffff88042784ffd8 ffff88042784ffd8 ffff880426a35080 ffff8803ac8f5308
[ 9236.990856]  ffff8803ac8f5630 0000000000000000 ffff8804272a1000 000000000000001b
[ 9236.998497] Call Trace:
[ 9237.000952]  [<ffffffff8163a909>] schedule+0x29/0x70
[ 9237.005936]  [<ffffffffa04529c3>] ceph_mdsc_sync+0x3a3/0x600 [ceph]
[ 9237.012203]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[ 9237.018042]  [<ffffffff8120f9b0>] ? do_fsync+0xa0/0xa0
[ 9237.023191]  [<ffffffffa0430652>] ceph_sync_fs+0x62/0xd0 [ceph]
[ 9237.029112]  [<ffffffff8120f9d0>] sync_fs_one_sb+0x20/0x30
[ 9237.034613]  [<ffffffff811e20d2>] iterate_supers+0xb2/0x110
[ 9237.040190]  [<ffffffff8120fae4>] sys_sync+0x64/0xb0
[ 9237.045161]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[ 9242.083366] libceph: connect 172.21.5.138:6789 error -101
[ 9242.088780] libceph: mon0 172.21.5.138:6789 connect error
[ 9251.038495] libceph: connect 172.21.5.138:6812 error -101
[ 9251.043912] libceph: mds0 172.21.5.138:6812 connect error
[ 9252.093949] libceph: connect 172.21.5.138:6789 error -101
[ 9252.099365] libceph: mon0 172.21.5.138:6789 connect error
[ 9262.104746] libceph: connect 172.21.5.138:6789 error -101
[ 9262.110161] libceph: mon0 172.21.5.138:6789 connect error

History

#1

Updated by Greg Farnum over 6 years ago

Status changed from New to Won’t Fix

If the cluster is unavailable, we can’t do a clean shutdown. I guess we could try and distinguish between dirty requests and simple information ones, but that would be more difficult than it sounds.

If you do a force unmount I believe it all goes away; that’s part of the reason teuthology includes the -f flag.

#3

Updated by Ilya Dryomov over 6 years ago

Also http://www.spinics.net/lists/ceph-devel/msg27376.html, http://tracker.ceph.com/issues/13189.

In the cephfs case, if you know your cluster is gone, you can do umount -f. In the local-FS-on-rbd case, umount -f on a local FS won’t help, so it’s much worse. That’s the reason I picked on your teuthology-nuke pull request, Vasu.

Blindly aborting outstanding requests is bad, but, at least in the rbd case, if the init system wasn’t set up properly and shut the network down before umounting, or if the cluster is just gone, we are past the point of return and might as well abort. It’s on my TODO list.

#5

Updated by Ilya Dryomov almost 4 years ago

Assignee set to Ilya Dryomov

For kcephfs, I believe umount -f is now more aggressive and aborts OSD requests in addition to MDS requests, but that code is definitely not in RHEL 7.2. Just as before, you need to do umount -f before you do reboot though. Otherwise the kernel client wouldn’t know that you wanted your dirty state discarded and would still hang on to it.

There aren’t really any workarounds beyond "if you know that your cluster is inaccessible or gone, umount -f before rebooting", but it would be good if you could describe your scenario in more detail.
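
The advice in this thread amounts to force-unmounting CephFS before rebooting when the cluster is known to be gone. A minimal sketch of finding the affected mount points, assuming kernel CephFS mounts (which appear with filesystem type ceph in /proc/mounts); the helper name is hypothetical:

```shell
# Print mount points whose filesystem type is "ceph".
# Input: text in /proc/mounts format on stdin.
ceph_mountpoints() {
  awk '$3 == "ceph" { print $2 }'
}
```

A possible use is `ceph_mountpoints < /proc/mounts | while read -r mp; do umount -f "$mp"; done`, run before the reboot. As the comments above note, this discards dirty state, so it only makes sense when the cluster really is unreachable.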


Source

Troubleshooting Techniques¶

There are two main categories of information you will need to investigate issues in the cluster:

Kubernetes status and logs documented here
Ceph cluster status (see upcoming Ceph tools section)

Ceph Tools¶

Logs on a specific node to find why a PVC is failing to mount
See the log collection topic for a script that will help you gather the logs
Other artifacts:
The monitors that are expected to be in quorum: kubectl -n <cluster-namespace> get configmap rook-ceph-mon-endpoints -o yaml | grep data

Tools in the Rook Toolbox¶

The rook-ceph-tools pod provides a simple environment to run Ceph tools. Once the pod is up and running, connect to the pod to execute Ceph commands and evaluate the current state of the cluster.

kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[*].metadata.name}') -- bash

Ceph Commands¶

Here are some common commands to troubleshoot a Ceph cluster:

ceph status
ceph osd status
ceph osd df
ceph osd utilization
ceph osd pool stats
ceph osd tree
ceph pg stat
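
To avoid opening an interactive shell for each of these commands, they can be sent through the toolbox deployment. This is a minimal sketch; the function name toolbox and the ROOK_NAMESPACE variable are assumptions for illustration, not part of Rook:

```shell
# Run any command inside the rook-ceph-tools deployment.
# ROOK_NAMESPACE is a hypothetical override; it defaults to rook-ceph.
toolbox() {
  kubectl -n "${ROOK_NAMESPACE:-rook-ceph}" exec deploy/rook-ceph-tools -- "$@"
}
```

With this defined, `toolbox ceph status` or `toolbox ceph osd tree` runs the corresponding command from the list above.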

Cluster failing to service requests¶

Symptoms¶

Execution of the ceph command hangs
PersistentVolumes are not being created
A large number of slow requests are blocking
A large number of stuck requests are blocking
One or more MONs are restarting periodically

Investigation¶

Create a rook-ceph-tools pod to investigate the current state of Ceph. Here is an example of what one might see. In this case the ceph status command would just hang so a CTRL-C needed to be sent.

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

ceph status
^CCluster connection interrupted or timed out

Another indication is when one or more of the MON pods restart frequently. Note the ‘mon107’ that has only been up for 16 minutes in the following output.

$ kubectl -n rook-ceph get all -o wide --show-all
NAME                                 READY     STATUS    RESTARTS   AGE       IP               NODE
po/rook-ceph-mgr0-2487684371-gzlbq   1/1       Running   0          17h       192.168.224.46   k8-host-0402
po/rook-ceph-mon107-p74rj            1/1       Running   0          16m       192.168.224.28   k8-host-0402
rook-ceph-mon1-56fgm                 1/1       Running   0          2d        192.168.91.135   k8-host-0404
rook-ceph-mon2-rlxcd                 1/1       Running   0          2d        192.168.123.33   k8-host-0403
rook-ceph-osd-bg2vj                  1/1       Running   0          2d        192.168.91.177   k8-host-0404
rook-ceph-osd-mwxdm                  1/1       Running   0          2d        192.168.123.31   k8-host-0403

Solution¶

The dataDirHostPath setting specifies a path on the local host for the Ceph daemons to store configuration and data. Setting this to a path like /var/lib/rook, reapplying your Cluster CRD and restarting all the Ceph daemons (MON, MGR, OSD, RGW) should solve this problem. After the Ceph daemons have been restarted, it is advisable to restart the rook-tools pod.

Monitors are the only pods running¶

Symptoms¶

Rook operator is running
Either a single mon starts or the mons start very slowly (at least several minutes apart)
The crash-collector pods are crashing
No mgr, osd, or other daemons are created except the CSI driver

Investigation¶

The crash-collector pods will be blocked from starting until the mons have formed quorum the first time.

There are several common causes for the mons failing to form quorum:

The operator pod does not have network connectivity to the mon pod(s). The network may be configured incorrectly.
One or more mon pods are in running state, but the operator log shows they are not able to form quorum
A mon is using configuration from a previous installation. See the cleanup guide for cleaning the previous cluster.
A firewall may be blocking the ports required for the Ceph mons to form quorum. Ensure ports 6789 and 3300 are enabled. See the Ceph networking guide for more details.
There may be an MTU mismatch between different networking components. Some networks may be more susceptible to mismatch than others. If Kubernetes CNI or hosts enable jumbo frames (MTU 9000), Ceph will use large packets to maximize network bandwidth. If other parts of the networking chain don’t support jumbo frames, packets may be unexpectedly lost or rejected.
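
One way to probe for the MTU mismatch described in the last item is to send a ping that exactly fills a packet and is not allowed to fragment. The sketch below only computes the payload size for IPv4; the function name is hypothetical:

```shell
# ICMP payload size that exactly fills one IPv4 packet of the given MTU:
# the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}
```

On Linux, `ping -M do -c 3 -s "$(icmp_payload_for_mtu 9000)" <mon-ip>` sends such a packet with fragmentation prohibited. If it fails while the same command with `icmp_payload_for_mtu 1500` succeeds, some hop on the path probably does not pass jumbo frames.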

Operator fails to connect to the mon¶

First look at the logs of the operator to confirm if it is able to connect to the mons.

kubectl -n rook-ceph logs -l app=rook-ceph-operator

Likely you will see an error similar to the following that the operator is timing out when connecting to the mon. The last command is ceph mon_status, followed by a timeout message five minutes later.

2018-01-21 21:47:32.375833 I | exec: Running command: ceph mon_status --cluster=rook --conf=/var/lib/rook/rook-ceph/rook.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/442263890
2018-01-21 21:52:35.370533 I | exec: 2018-01-21 21:52:35.071462 7f96a3b82700  0 monclient(hunting): authenticate timed out after 300
2018-01-21 21:52:35.071462 7f96a3b82700  0 monclient(hunting): authenticate timed out after 300
2018-01-21 21:52:35.071524 7f96a3b82700  0 librados: client.admin authentication error (110) Connection timed out
2018-01-21 21:52:35.071524 7f96a3b82700  0 librados: client.admin authentication error (110) Connection timed out
[errno 110] error connecting to the cluster

The error would appear to be an authentication error, but it is misleading. The real issue is a timeout.
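
Because the authentication wording is misleading, it can help to filter the operator log for the timeout markers shown above. A minimal sketch (mon_timeouts is a hypothetical helper name):

```shell
# Print only the log lines that indicate a mon connection timeout.
# Input: operator log text on stdin.
mon_timeouts() {
  grep -E 'authenticate timed out|\(110\) Connection timed out'
}
```

For example: `kubectl -n rook-ceph logs -l app=rook-ceph-operator --tail=-1 | mon_timeouts`. The `--tail=-1` is there because kubectl returns only the last few lines per pod when a label selector is used.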

Solution¶

To verify the network connectivity:

Get the endpoint for a mon
Curl the mon from the operator pod

For example, this command will curl the first mon from the operator:

$ kubectl -n rook-ceph exec deploy/rook-ceph-operator -- curl $(kubectl -n rook-ceph get svc -l app=rook-ceph-mon -o jsonpath='{.items[0].spec.clusterIP}'):3300 2>/dev/null
ceph v2

If "ceph v2" is printed to the console, the connection was successful. If the command does not respond or otherwise fails, the network connection cannot be established.
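
When checking several mons in a row, the banner test can be scripted. This sketch only classifies the reply text; the function name is an assumption for illustration:

```shell
# Succeeds when the reply starts with the "ceph v2" banner.
is_msgr2_banner() {
  case "$1" in
    "ceph v2"*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Possible use, with the reply captured from the curl command shown earlier
# (adding curl's --max-time 5 keeps an unreachable mon from hanging the check):
#   is_msgr2_banner "$reply" && echo "mon reachable" || echo "mon NOT reachable"
```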

Failing mon pod¶

Second, we need to verify whether the mon pod started successfully.

$ kubectl -n rook-ceph get pod -l app=rook-ceph-mon
NAME                                READY     STATUS               RESTARTS   AGE
rook-ceph-mon-a-69fb9c78cd-58szd    1/1       CrashLoopBackOff     2          47s

# The pod shows a termination status that the keyring does not match the existing keyring
$ kubectl -n rook-ceph describe pod -l mon=rook-ceph-mon0
...
    Last State:    Terminated
      Reason:    Error
      Message:    The keyring does not match the existing keyring in /var/lib/rook/rook-ceph-mon0/data/keyring.
                    You may need to delete the contents of dataDirHostPath on the host from a previous deployment.
...

See the solution in the next section regarding cleaning up the dataDirHostPath on the nodes.

Solution¶

This is a common problem when reinitializing the Rook cluster if the local directory used for persistence has not been purged. This directory is the dataDirHostPath setting in the cluster CRD and is typically set to /var/lib/rook. To fix the issue you will need to delete all components of Rook and then delete the contents of /var/lib/rook (or the directory specified by dataDirHostPath) on each of the hosts in the cluster. Then when the cluster CRD is applied to start a new cluster, the rook-operator should start all the pods as expected.

Caution

Deleting the dataDirHostPath folder is destructive to the storage. Only delete the folder if you are trying to permanently purge the Rook cluster.

See the Cleanup Guide for more details.

PVCs stay in pending state¶

Symptoms¶

When you create a PVC based on a rook storage class, it stays pending indefinitely

For the WordPress example, you might see two PVCs in pending state.

$ kubectl get pvc
NAME             STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS      AGE
mysql-pv-claim   Pending                                      rook-ceph-block   8s
wp-pv-claim      Pending                                      rook-ceph-block   16s

Investigation¶

There are two common causes for the PVCs staying in pending state:

There are no OSDs in the cluster
The CSI provisioner pod is not running or is not responding to the request to provision the storage

Confirm if there are OSDs¶

To confirm if you have OSDs in your cluster, connect to the Rook Toolbox and run the ceph status command. You should see that you have at least one OSD up and in. The minimum number of OSDs required depends on the replicated.size setting in the pool created for the storage class. In a "test" cluster, only one OSD is required (see storageclass-test.yaml). In the production storage class example (storageclass.yaml), three OSDs would be required.

$ ceph status
  cluster:
    id:     a0452c76-30d9-4c1a-a948-5d8405f19a7c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 11m)
    mgr: a(active, since 10m)
    osd: 1 osds: 1 up (since 46s), 1 in (since 109m)
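
For scripted checks, the "up" count can be read from the osd: line of that output. A minimal sketch that parses the human-readable text (the helper name is hypothetical):

```shell
# Print the number of OSDs reported as "up".
# Input: `ceph status` text on stdin.
osds_up() {
  awk '$1 == "osd:" {
         for (i = 2; i <= NF; i++) {
           word = $i
           sub(/,$/, "", word)     # tolerate output that prints "up," with a comma
           if (word == "up") { print $(i - 1); exit }
         }
       }'
}
```

Piping the ceph status output into osds_up and comparing the result with the pool's replicated.size gives a quick indication of whether provisioning can succeed.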

OSD Prepare Logs¶

If you don’t see the expected number of OSDs, let’s investigate why they weren’t created. On each node where Rook looks for OSDs to configure, you will see an "osd prepare" pod.

$ kubectl -n rook-ceph get pod -l app=rook-ceph-osd-prepare
NAME                                 ...  READY   STATUS      RESTARTS   AGE
rook-ceph-osd-prepare-minikube-9twvk   0/2     Completed   0          30m

See the section on why OSDs are not getting created to investigate the logs.

CSI Driver¶

The CSI driver may not be responding to the requests. Look in the logs of the CSI provisioner pod to see if there are any errors during the provisioning.

There are two provisioner pods:

kubectl -n rook-ceph get pod -l app=csi-rbdplugin-provisioner

Get the logs of each of the pods. One of them should be the "leader" and be responding to requests.

kubectl -n rook-ceph logs csi-cephfsplugin-provisioner-d77bb49c6-q9hwq csi-provisioner

Operator unresponsiveness¶

Lastly, if you have OSDs up and in, the next step is to confirm the operator is responding to the requests. Look in the Operator pod logs around the time when the PVC was created to confirm if the request is being raised. If the operator does not show requests to provision the block image, the operator may be stuck on some other operation. In this case, restart the operator pod to get things going again.

Solution¶

If the "osd prepare" logs didn’t give you enough clues about why the OSDs were not being created, please review your cluster.yaml configuration. The common misconfigurations include:

If useAllDevices: true, Rook expects to find local devices attached to the nodes. If no devices are found, no OSDs will be created.
If useAllDevices: false, OSDs will only be created if deviceFilter is specified.
Only local devices attached to the nodes will be configurable by Rook. In other words, the devices must show up under /dev.
The devices must not have any partitions or filesystems on them. Rook will only configure raw devices. Partitions are not yet supported.
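
The last two points can be checked by hand on a node before involving Rook. The sketch below is only an approximation of what Rook itself does: it reads lsblk output and lists whole disks that have neither partitions nor a filesystem signature. The function name is hypothetical.

```shell
# Print whole disks that have no partitions and no filesystem signature.
# Input: output of `lsblk -rno NAME,TYPE,FSTYPE` on stdin
# (children are listed directly after their parent disk).
raw_disks() {
  awk '$2 == "disk" { disk = $1; order[++n] = disk; usable[disk] = ($3 == "") }
       $2 == "part" { usable[disk] = 0 }
       END { for (i = 1; i <= n; i++) if (usable[order[i]]) print order[i] }'
}
```

Usage on a node would be `lsblk -rno NAME,TYPE,FSTYPE | raw_disks`. A disk missing from the output is a candidate for being skipped during OSD preparation.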

OSD pods are failing to start¶

Symptoms¶

OSD pods are failing to start
You have started a cluster after tearing down another cluster

Investigation¶

$ kubectl -n rook-ceph logs rook-ceph-osd-fl8fs
...
2017-10-31 20:13:11.187106 I | mkfs-osd0: 2017-10-31 20:13:11.186992 7f0059d62e00 -1 bluestore(/var/lib/rook/osd0) _read_fsid unparsable uuid
2017-10-31 20:13:11.187208 I | mkfs-osd0: 2017-10-31 20:13:11.187026 7f0059d62e00 -1 bluestore(/var/lib/rook/osd0) _setup_block_symlink_or_file failed to create block symlink to /dev/disk/by-partuuid/651153ba-2dfc-4231-ba06-94759e5ba273: (17) File exists
2017-10-31 20:13:11.187233 I | mkfs-osd0: 2017-10-31 20:13:11.187038 7f0059d62e00 -1 bluestore(/var/lib/rook/osd0) mkfs failed, (17) File exists
2017-10-31 20:13:11.187254 I | mkfs-osd0: 2017-10-31 20:13:11.187042 7f0059d62e00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (17) File exists
2017-10-31 20:13:11.187275 I | mkfs-osd0: 2017-10-31 20:13:11.187121 7f0059d62e00 -1  ** ERROR: error creating empty object store in /var/lib/rook/osd0: (17) File exists

Solution¶

If the error is caused by a file that already exists, this is a common problem when reinitializing the Rook cluster if the local directory used for persistence has not been purged. This directory is the dataDirHostPath setting in the cluster CRD and is typically set to /var/lib/rook. To fix the issue you will need to delete all components of Rook and then delete the contents of /var/lib/rook (or the directory specified by dataDirHostPath) on each of the hosts in the cluster. Then when the cluster CRD is applied to start a new cluster, the rook-operator should start all the pods as expected.

OSD pods are not created on my devices¶

Symptoms¶

No OSD pods are started in the cluster
Devices are not configured with OSDs even though specified in the Cluster CRD
One OSD pod is started on each node instead of multiple pods for each device

Investigation¶

First, ensure that you have specified the devices correctly in the CRD. The Cluster CRD has several ways to specify the devices that are to be consumed by the Rook storage:

useAllDevices: true: Rook will consume all devices it determines to be available
deviceFilter: Consume all devices that match this regular expression
devices: Explicit list of device names on each node to consume

Second, if Rook determines that a device is not available (it has existing partitions or a formatted filesystem), Rook will skip that device. If Rook is not starting OSDs on the devices you expect, Rook may have skipped them for this reason. To see if a device was skipped, view the OSD preparation log on the node where the device was skipped. Note that it is completely normal and expected for the OSD prepare pod to be in the completed state. After the job is complete, Rook leaves the pod around in case the logs need to be investigated.

# Get the prepare pods in the cluster
$ kubectl -n rook-ceph get pod -l app=rook-ceph-osd-prepare
NAME                                   READY     STATUS      RESTARTS   AGE
rook-ceph-osd-prepare-node1-fvmrp      0/1       Completed   0          18m
rook-ceph-osd-prepare-node2-w9xv9      0/1       Completed   0          22m
rook-ceph-osd-prepare-node3-7rgnv      0/1       Completed   0          22m

# view the logs for the node of interest in the "provision" container
$ kubectl -n rook-ceph logs rook-ceph-osd-prepare-node1-fvmrp provision
[...]

Here are some key lines to look for in the log:

# A device will be skipped if Rook sees it has partitions or a filesystem
2019-05-30 19:02:57.353171 W | cephosd: skipping device sda that is in use
2019-05-30 19:02:57.452168 W | skipping device "sdb5": ["Used by ceph-disk"]

# Other messages about a disk being unusable by ceph include:
Insufficient space (<5GB) on vgs
Insufficient space (<5GB)
LVM detected
Has BlueStore device label
locked
read-only

# A device is going to be configured
2019-05-30 19:02:57.535598 I | cephosd: device sdc to be configured by ceph-volume

# For each device configured you will see a report printed to the log
2019-05-30 19:02:59.844642 I |   Type            Path                                                    LV Size         % of device
2019-05-30 19:02:59.844651 I | ----------------------------------------------------------------------------------------------------
2019-05-30 19:02:59.844677 I |   [data]          /dev/sdc                                                7.00 GB         100%
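
To pull just the skipped device names out of a long prepare log, the two "skipping device" formats shown above can be matched with one expression. A minimal sketch (the helper name is hypothetical):

```shell
# Print the name of every device the prepare log reports as skipped.
# Input: OSD prepare log text on stdin. Handles both quoted and unquoted names.
skipped_devices() {
  sed -n 's/.*skipping device "\{0,1\}\([^" :]*\).*/\1/p'
}
```

For example: `kubectl -n rook-ceph logs rook-ceph-osd-prepare-node1-fvmrp provision | skipped_devices`.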

Solution¶

Either update the CR with the correct settings, or clean the partitions or filesystem from your devices. To clean devices from a previous install see the cleanup guide.

# Restart the operator to ensure devices are configured. A new pod will automatically be started when the current operator pod is deleted.
$ kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
[...]

Node hangs after reboot¶

This issue is fixed in Rook v1.3 or later.

Symptoms¶

After issuing a reboot command, the node never comes back online
Only a power cycle helps

Investigation¶

On a node running a pod with a Ceph persistent volume

mount | grep rbd
# _netdev mount option is absent, also occurs for cephfs
# OS is not aware PV is mounted over network
/dev/rbdx on ... (rw,relatime, ..., noquota)

libceph: connect [monitor-ip]:6789 error -101
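
The manual check above can be turned into a filter that lists RBD mounts lacking the _netdev option. This is a sketch only; the function name is hypothetical, and it assumes mount points without spaces:

```shell
# Print mount points of /dev/rbd* devices whose options lack _netdev.
# Input: the output of `mount` on stdin.
rbd_mounts_without_netdev() {
  awk '$1 ~ /^\/dev\/rbd/ && $NF !~ /_netdev/ { print $3 }'
}
```

Running `mount | rbd_mounts_without_netdev` on a node prints the volumes that the OS does not treat as network mounts at shutdown.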

Solution¶

The node needs to be drained before reboot. After the successful drain, the node can be rebooted as usual.

Because kubectl drain command automatically marks the node as unschedulable (kubectl cordon effect), the node needs to be uncordoned once it’s back online.

Drain the node:

kubectl drain <node-name> --ignore-daemonsets --delete-local-data

Uncordon the node:

kubectl uncordon <node-name>
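
The drain step can be wrapped so the same call works across kubectl versions. Newer kubectl releases deprecate --delete-local-data in favor of --delete-emptydir-data, so this sketch makes the flag overridable. The function name and the DRAIN_DATA_FLAG variable are assumptions for illustration:

```shell
# Drain a node before a planned reboot.
# DRAIN_DATA_FLAG is a hypothetical override for kubectl versions that
# expect --delete-emptydir-data instead of --delete-local-data.
drain_node() {
  data_flag=${DRAIN_DATA_FLAG:-"--delete-local-data"}
  kubectl drain "$1" --ignore-daemonsets "$data_flag"
}
```

A typical sequence would be `drain_node <node-name>`, then the reboot, then `kubectl uncordon <node-name>` once the node reports Ready again.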

Using multiple shared filesystem (CephFS) is attempted on a kernel version older than 4.7¶

Symptoms¶

More than one shared filesystem (CephFS) has been created in the cluster
A pod attempts to mount any other shared filesystem besides the first one that was created
The pod incorrectly gets the first filesystem mounted instead of the intended filesystem

Solution¶

The only solution to this problem is to upgrade your kernel to 4.7 or higher. This is due to a mount flag added in kernel version 4.7 that allows choosing the filesystem by name.

For additional info on the kernel version requirement for multiple shared filesystems (CephFS), see Filesystem — Kernel version requirement.
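
Whether a node meets the 4.7 requirement can be checked from its uname -r string. A minimal sketch that compares only the major and minor numbers (the function name is hypothetical):

```shell
# Succeeds when the kernel release string is at least <major>.<minor>.
# Usage: kernel_at_least 4 7 "$(uname -r)"
kernel_at_least() {
  want_major=$1
  want_minor=$2
  release=$3
  major=${release%%.*}          # text before the first dot
  rest=${release#*.}            # text after the first dot
  minor=${rest%%[!0-9]*}        # leading digits of the remainder
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}
```

For example, `kernel_at_least 4 7 "$(uname -r)" || echo "kernel too old for multiple CephFS filesystems"`. Distribution kernels sometimes backport features, so the version number is only a first indication.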

Set debug log level for all Ceph daemons¶

Assuming you want a log level of 1, you will run:

$ kubectl -n rook-ceph exec deploy/rook-ceph-tools -- set-ceph-debug-level 1
ceph config set global debug_context 1
ceph config set global debug_lockdep 1
[...]

Once you are done debugging, you can revert all the debug flags to their default values by running the following:

kubectl -n rook-ceph exec deploy/rook-ceph-tools -- set-ceph-debug-level default

Activate log to file for a particular Ceph daemon¶

There are cases where looking at Kubernetes logs is not enough, for various reasons. To name a few:

not everyone is familiar with Kubernetes logging, and some expect to find logs in traditional directories
logs get eaten (buffer limit from the log engine) and thus cannot be requested from Kubernetes

So for each daemon, dataDirHostPath is used to store logs if logging is activated. Rook will bind-mount dataDirHostPath for every pod. Let’s say you want to enable logging for mon.a, but only for this daemon. Using the toolbox or from inside the operator, run:

ceph config set mon.a log_to_file true

This will activate logging on the filesystem. You will be able to find logs in dataDirHostPath/$NAMESPACE/log, which typically means /var/lib/rook/rook-ceph/log. You don’t need to restart the pod; the effect is immediate.

To disable the logging on file, simply set log_to_file to false.

A worker node using RBD devices hangs up¶

Symptoms¶

There is no progress on I/O from/to one of RBD devices (/dev/rbd* or /dev/nbd*).
After that, the whole worker node hangs up.

Investigation¶

This happens when the following conditions are satisfied.

The problematic RBD device and the corresponding OSDs are co-located.
There is an XFS filesystem on top of this device.

In addition, when this problem happens, you can see the following messages in dmesg.

$ dmesg
...
[51717.039319] INFO: task kworker/2:1:5938 blocked for more than 120 seconds.
[51717.039361]       Not tainted 4.15.0-72-generic #81-Ubuntu
[51717.039388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...

This is the so-called hung_task problem and means that there is a deadlock in the kernel. For more detail, please refer to the corresponding issue comment.

Solution¶

This problem will be solved by the following two fixes.

Linux kernel: A minor feature that is introduced by this commit. It will be included in Linux v5.6.
Ceph: A fix that uses the above-mentioned kernel’s feature. The Ceph community will probably discuss this fix after releasing Linux v5.6.

You can bypass this problem by using ext4 or any other filesystem instead of XFS. The filesystem type can be specified with csi.storage.k8s.io/fstype in the StorageClass resource.

Too few PGs per OSD warning is shown¶

Symptoms¶

ceph status shows a "too few PGs per OSD" warning as follows.

$ ceph status
  cluster:
    id:     fd06d7c3-5c5c-45ca-bdea-1cf26b783065
    health: HEALTH_WARN
            too few PGs per OSD (16 < min 30)
[...]
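The arithmetic behind a number like the 16 above can be sketched as follows. This assumes that the warning compares the average number of PG replicas per OSD with the minimum shown in the message (30 here), and that the cluster has a single replicated pool. The function name and the example values (pg_num 16, replica size 3, 3 OSDs) are illustrative and not taken from the cluster in the output.

```shell
# Average number of PG replicas per OSD for one replicated pool:
#   pg_num * replica_size / osd_count   (integer arithmetic)
# pgs_per_osd is a hypothetical helper; the formula is an assumption
# about how the reported per-OSD figure is derived.
pgs_per_osd() {
    pg_num=$1
    replica_size=$2
    osd_count=$3
    echo $(( pg_num * replica_size / osd_count ))
}

pgs_per_osd 16 3 3   # 16, below a minimum of 30
pgs_per_osd 64 3 3   # 64, above a minimum of 30
```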

Solution¶

LVM metadata can be corrupted with OSD on LV-backed PVC¶

Symptoms¶

There is a critical flaw in OSDs on LV-backed PVCs. LVM metadata can be corrupted if both the host and the OSD container modify it simultaneously. For example, the administrator might modify it on the host while the OSD initialization process in a container modifies it too. In addition, the problem is more likely to occur if lvmetad is running. In this case, a change to the LVM metadata in the OSD container is not reflected in the LVM metadata cache on the host for a while.

If you still decide to configure an OSD on LVM, please keep the following in mind to reduce the probability of this issue.

Solution¶

Disable lvmetad.
Avoid configuration of LVs from the host. In addition, don’t touch the VGs and physical volumes that back these LVs.
Avoid incrementing the count field of storageClassDeviceSets and creating a new LV that backs an OSD at the same time.

You can check whether the above-mentioned tag exists with the command: sudo lvs -o lv_name,lv_tags. If the lv_tags field is empty for an LV that corresponds to an OSD, this OSD encountered the problem. In this case, please retire this OSD or replace it with a new OSD before restarting.
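The check above can be narrowed down to the LVs that have no tags at all, as in the following sketch. The function name lvs_without_tags is hypothetical, and the sample LV names and tag values are made up. Note that the filter lists every LV with an empty lv_tags field, including LVs that do not belong to an OSD, so the result still has to be matched against the OSD LVs by hand.

```shell
# Read "lv_name lv_tags" rows on stdin and print the LVs whose tag
# column is empty (only one field present on the line).
# lvs_without_tags is a hypothetical helper for illustration only.
lvs_without_tags() {
    awk 'NF == 1 { print $1 }'
}

# On a node this would be used as:
#   sudo lvs --noheadings -o lv_name,lv_tags | lvs_without_tags
# Demonstration with made-up sample data:
printf '  osd-block-a ceph.osd_id=0,ceph.type=block\n  osd-block-b \n' | lvs_without_tags
```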

OSD prepare job fails due to low aio-max-nr setting¶

If the kernel is configured with a low aio-max-nr setting, the OSD prepare job might fail with the following error:

exec: stderr: 2020-09-17T00:30:12.145+0000 7f0c17632f40 -1 bdev(0x56212de88700 /var/lib/ceph/osd/ceph-0//block) _aio_start io_setup(2) failed with EAGAIN; try increasing /proc/sys/fs/aio-max-nr

To overcome this, you need to increase the value of fs.aio-max-nr in your sysctl configuration (typically /etc/sysctl.conf). You can do this with your favorite configuration management system.

Alternatively, you can have a DaemonSet to apply the configuration for you on all your nodes.
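Whichever mechanism applies the setting, the check itself is a simple comparison, sketched below. The function name and the target value 1048576 are assumptions chosen for illustration; pick a value that suits the number of OSDs on your nodes.

```shell
# Succeed (exit status 0) when the current fs.aio-max-nr value is below
# the wanted value.
# aio_max_nr_too_low is a hypothetical helper; 1048576 is only an example.
aio_max_nr_too_low() {
    current=$1
    wanted=$2
    [ "$current" -lt "$wanted" ]
}

wanted=1048576
current=65536   # sample value; on a node: current=$(cat /proc/sys/fs/aio-max-nr)

if aio_max_nr_too_low "$current" "$wanted"; then
    # Line to add to the sysctl configuration (typically /etc/sysctl.conf).
    # To apply it immediately as root: sysctl -w fs.aio-max-nr=$wanted
    echo "fs.aio-max-nr = $wanted"
fi
```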

Unexpected partitions created¶

Symptoms¶

Users running Rook versions v1.6.0-v1.6.7 may observe unwanted OSDs on partitions that appear unexpectedly and seemingly randomly, which can corrupt existing OSDs.

Unexpected partitions are created on host disks that are used by Ceph OSDs. This happens more often on SSDs than HDDs and usually only on disks that are 875GB or larger. Many tools like lsblk, blkid, udevadm, and parted will not show a partition table type for the partition. Newer versions of blkid are generally able to recognize the type as "atari".

Below is an example of lsblk output from a node where phantom Atari partitions are present. Note that sdX1 is never present for the phantom partitions, and sdX2 is 48G on all disks. sdX3 is a variable size and may not always be present. It is possible for sdX4 to appear, though it is an anecdotally rare event.

# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdb      8:16   0     3T  0 disk
├─sdb2   8:18   0    48G  0 part
└─sdb3   8:19   0   6.1M  0 part
sdc      8:32   0     3T  0 disk
├─sdc2   8:34   0    48G  0 part
└─sdc3   8:35   0   6.2M  0 part
sdd      8:48   0     3T  0 disk
├─sdd2   8:50   0    48G  0 part
└─sdd3   8:51   0   6.3M  0 part

See GitHub rook/rook issue 7940, "unexpected partition on disks >= 1TB (atari partitions)", for more detailed information and discussion.
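The pattern described above (sdX2 present while sdX1 is missing) can be searched for mechanically, as in the sketch below. The function name suspect_atari_disks is hypothetical, and the heuristic only flags candidates for manual inspection: a disk with a legitimate partition 2 but no partition 1 would be flagged as well.

```shell
# Read "NAME TYPE" rows on stdin and print the sdX disks that have a
# partition 2 but no partition 1, which matches the phantom layout above.
# suspect_atari_disks is a hypothetical helper; treat its output as a hint.
suspect_atari_disks() {
    awk '
    $2 == "part" && $1 ~ /^sd[a-z]+[0-9]+$/ {
        disk = $1; sub(/[0-9]+$/, "", disk)    # e.g. sdb2 -> sdb
        num = $1;  sub(/^sd[a-z]+/, "", num)   # e.g. sdb2 -> 2
        seen[disk, num] = 1
        disks[disk] = 1
    }
    END {
        for (d in disks)
            if (!((d, "1") in seen) && ((d, "2") in seen))
                print d
    }' | sort
}

# On a node this would be used as:  lsblk -rno NAME,TYPE | suspect_atari_disks
# Demonstration with made-up sample data:
printf 'sda disk\nsda1 part\nsda2 part\nsdb disk\nsdb2 part\nsdb3 part\n' | suspect_atari_disks
```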

Solution¶

Recover from corruption (v1.6.0-v1.6.7)¶

If you are using Rook v1.6, you must first update to v1.6.8 or higher to avoid further incidents of OSD corruption caused by these Atari partitions.

An old workaround suggested using deviceFilter: ^sd[a-z]+$, but this still results in unexpected partitions. Rook will merely stop creating new OSDs on the partitions. It does not fix a related issue: ceph-volume is unaware of the Atari partition problem. Users who used this workaround are still at risk of OSD failures in the future.

As an example, you may have /dev/sdb with two unexpected partitions (/dev/sdb2 and /dev/sdb3) as well as a second corrupted disk /dev/sde with one unexpected partition (/dev/sde2).

1. First, remove the OSDs associated with /dev/sdb, /dev/sdb2, and /dev/sdb3. There might be only one, or up to 3 OSDs depending on how your system was affected. See the OSD management doc.
2. Use dd to wipe the first sectors of the partitions followed by the disk itself. E.g.,
   - dd if=/dev/zero of=/dev/sdb2 bs=1M
   - dd if=/dev/zero of=/dev/sdb3 bs=1M
   - dd if=/dev/zero of=/dev/sdb bs=1M
3. Then wipe clean /dev/sdb to prepare it for a new OSD. See the teardown document for details.
4. After this, scale up the Rook operator to deploy a new OSD to /dev/sdb. This will allow Ceph to use /dev/sdb for data recovery and replication while the next OSDs are removed.
5. Now repeat steps 1-4 for /dev/sde and /dev/sde2, and continue for any other corrupted disks.
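When several disks are affected, it can help to print the dd commands for review before running anything. The sketch below only prints the commands from the wipe step, partitions first and the disk last; the function name wipe_plan is hypothetical. Keep in mind that, as written in the steps above, dd has no count= argument and therefore keeps writing until the target device is full.

```shell
# Print (do not run) the dd commands for one disk: each listed partition
# first, then the disk itself.
# wipe_plan is a hypothetical helper; it performs a dry run only.
wipe_plan() {
    disk=$1
    shift
    for part in "$@"; do
        echo "dd if=/dev/zero of=/dev/${disk}${part} bs=1M"
    done
    echo "dd if=/dev/zero of=/dev/${disk} bs=1M"
}

wipe_plan sdb 2 3   # /dev/sdb with unexpected partitions 2 and 3
wipe_plan sde 2     # /dev/sde with unexpected partition 2
```

Review the printed commands carefully and run them by hand only after the corresponding OSDs have been removed.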

If your Rook cluster does not have any critical data stored in it, it may be simpler to uninstall Rook completely and redeploy with v1.6.8 or higher.

Operator environment variables are ignored¶

Symptoms¶

Configuration settings passed as environment variables do not take effect as expected. For example, the discover daemonset is not created, even though ROOK_ENABLE_DISCOVERY_DAEMON="true" is set.

Investigation¶

Inspect the rook-ceph-operator-config ConfigMap for conflicting settings. The ConfigMap takes precedence over the environment. The ConfigMap must exist, even if all actual configuration is supplied through the environment.

Look for lines with the op-k8sutil prefix in the operator logs. These lines detail the final values, and source, of the different configuration variables.

Verify that both of the following messages are present in the operator logs:

rook-ceph-operator-config-controller successfully started
rook-ceph-operator-config-controller done reconciling
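The check for both messages can be scripted as sketched below. The function name config_controller_ok is hypothetical, and the usage line assumes the operator runs as a deployment named rook-ceph-operator in the rook-ceph namespace, which may differ in your cluster.

```shell
# Read operator log text on stdin and succeed only if both messages
# listed above are present.
# config_controller_ok is a hypothetical helper for illustration only.
config_controller_ok() {
    log=$(cat)
    case $log in
        *'rook-ceph-operator-config-controller successfully started'*) ;;
        *) return 1 ;;
    esac
    case $log in
        *'rook-ceph-operator-config-controller done reconciling'*) ;;
        *) return 1 ;;
    esac
}

# Possible usage (deployment name and namespace are assumptions):
#   kubectl -n rook-ceph logs deploy/rook-ceph-operator | config_controller_ok && echo ok
#   kubectl -n rook-ceph logs deploy/rook-ceph-operator | grep op-k8sutil
```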

Solution¶

If it does not exist, create an empty ConfigMap:

kind: ConfigMap
apiVersion: v1
metadata:
  name: rook-ceph-operator-config
  namespace: rook-ceph # namespace:operator
data: {}

If the ConfigMap exists, remove any keys that you wish to configure through the environment.
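The "create it only if it is missing" logic can be sketched as follows. The function name ensure_operator_configmap is hypothetical, the default namespace rook-ceph is taken from the manifest above, and the sketch assumes that kubectl create configmap without any data source produces an empty ConfigMap equivalent to that manifest.

```shell
# Create an empty rook-ceph-operator-config ConfigMap when none exists.
# ensure_operator_configmap is a hypothetical helper; the namespace
# defaults to rook-ceph as in the manifest above.
ensure_operator_configmap() {
    ns="${1:-rook-ceph}"
    if kubectl -n "$ns" get configmap rook-ceph-operator-config >/dev/null 2>&1; then
        echo "exists"
    else
        kubectl -n "$ns" create configmap rook-ceph-operator-config >/dev/null &&
            echo "created"
    fi
}

# Possible usage:
#   ensure_operator_configmap            # operator in the rook-ceph namespace
#   ensure_operator_configmap my-rook    # operator in another namespace
```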


On Fri, Nov 20, 2015 at 3:19 AM, Wukongming <wu.kongm...@h3c.com> wrote:
> Hi Sage,
>
> I created a rbd image, and mapped to a local which means I can find 
> /dev/rbd0, at this time I reboot the system, in last step of shutting down, 
> it blocked with an error
>
> [235618.0202207] libceph: connect 172.16.57.252:6789 error -101.
>
> My Works’ Env:
>
> Ubuntu kernel 3.19.0
> Ceph 0.94.5
> A cluster of 2 Servers with iscsitgt and open-iscsi, both as server and 
> client. Multipath process is on but not affect this issue. I’ve tried 
> stopping multipath, but the issue still there.
> I map a rbd image to a local, why show me a connect error?
>
> I saw your reply on 
> http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/13077, but just 
> apart. Is this issue resolved and how?

Yeah, this has been a long standing problem with libceph/rbd.  The
issue is that you *have* to umount (and ideally also unmap, but unmap
isn't strictly necessary) before you reboot.  Otherwise (and I assume
by mapped to a local you mean you've got MONs and OSDs on the same node
as you do rbd map), when you issue a reboot, daemons get killed and the
kernel client ends up waiting for them to come back, because of
outstanding writes issued by umount called by systemd (or whatever).
There are other variations of this, but it all comes down to you having
to cold reboot.

The right fix is to have all init systems sequence the killing of ceph
daemons after the umount/unmap.  I also toyed with adding a reboot
notifier for libceph to save a cold reboot, but the problem with that
in the general case is data integrity.  However, in cases like the one
I described above, there is no going back so we might as well kill
libceph through a notifier.  I have an incomplete patch somewhere, but
it really shouldn't be necessary...

Thanks,

                Ilya
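On Linux, error -101 is ENETUNREACH (network is unreachable). The ordering described in the reply above, umount first and then unmap, both before the Ceph daemons are stopped, could be prepared with a sketch like the following. The function name rbd_shutdown_plan is hypothetical; it only prints the commands, and it assumes whole-device mounts such as /dev/rbd0 listed in /proc/mounts format.

```shell
# Read mount table lines (/proc/mounts format) on stdin and print the
# umount and rbd unmap commands to run before rebooting.
# rbd_shutdown_plan is a hypothetical helper; it performs a dry run only.
rbd_shutdown_plan() {
    awk '$1 ~ /^\/dev\/rbd[0-9]+$/ {
        print "umount " $2
        print "rbd unmap " $1
    }'
}

# On a client this would be used as:  rbd_shutdown_plan < /proc/mounts
# Demonstration with made-up sample data:
printf '/dev/sda1 / ext4 rw 0 0\n/dev/rbd0 /mnt/data xfs rw 0 0\n' | rbd_shutdown_plan
```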


