Auto failover features for Morpheus #3854
base: release/8.0
* Allow auto-failover for ephemeral buckets without replicas: feature doc (DOC-12191)
* Support auto-failover for exceptionally slow/hanging disks (DOC-12073)

Manually ported over the changes from the prior branch, because attempts to merge resulted in a huge number of conflicts, potentially with Supritha's changes to the underlying docs.
…ng that somehow there was a conflict based on changes in the remote branch, **WHICH DID NOT CHANGE**, at least as far as GitHub showed me. Reverted my changes to this doc AGAIN and AGAIN manually applied them.
https://jira.issues.couchbase.com/browse/MB-34155[MB-34155] Support Auto-failover for exceptionally slow/hanging disks::
You can now configure Couchbase Server to trigger an auto-failover on a node if its data disk is slow to respond or is hanging.
Before version 8.0, you could only configure Couchbase Server to auto-failover a node if the data disk returned errors for a set period of time.
The new `failoverOnDataDiskNonResponsiveness` setting and the corresponding setting on the Couchbase Web Console *Settings* page set the number of seconds allowed for read or write operations to complete.
Typo "oprtions" -- operations
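A minimal Python sketch of a request body that turns on the new disk non-responsiveness check through `POST /settings/autoFailover`. The bracketed names `failoverOnDataDiskNonResponsiveness[enabled]` and `failoverOnDataDiskNonResponsiveness[timePeriod]` are an assumption modelled on the existing `failoverOnDataDiskIssues[...]` parameters; check the final REST reference before relying on them. The code only builds the form body and sends nothing.

```python
from urllib.parse import parse_qs, urlencode


def build_autofailover_body(timeout_s: int, disk_time_period_s: int) -> str:
    """Form-encode auto-failover settings that also enable failover on a
    slow or hanging data disk.

    ASSUMPTION: the failoverOnDataDiskNonResponsiveness[...] parameter names
    mirror failoverOnDataDiskIssues[...]; they are not confirmed by this PR.
    """
    params = {
        "enabled": "true",          # auto-failover itself must be enabled
        "timeout": str(timeout_s),  # seconds before an unresponsive node is failed over
        "failoverOnDataDiskNonResponsiveness[enabled]": "true",
        # seconds a data-disk read or write may take before the disk counts as hung
        "failoverOnDataDiskNonResponsiveness[timePeriod]": str(disk_time_period_s),
    }
    return urlencode(params)


body = build_autofailover_body(timeout_s=120, disk_time_period_s=60)
print(parse_qs(body)["failoverOnDataDiskNonResponsiveness[timePeriod]"])  # ['60']
```

`urlencode` percent-encodes the square brackets, which is what a form-encoded POST expects.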
* In no circumstances where data loss might result: for example, when a bucket has no replicas.
Therefore, even a single event may not trigger a response, and an administrator-specified maximum number of failed nodes may not be reached.
* By default, Couchbase Server does not allow an auto-failover if it may result in data loss.
For example, with default settings, Couchbase Server does not allow the auto-failover of a node that contains a bucket with no replicas.
Line 85: Typo -- "does not allows"
Line 73: The sentence "Running a rebalance will reset the count value back to 0." is repeated (seen twice).
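The default rule in the bullets above can be modelled as a small eligibility check. This is an illustrative sketch, not Couchbase Server's implementation: `Bucket`, `Node`, `may_auto_failover`, and the `max_count` argument (standing in for the administrator-specified maximum number of auto-failed-over nodes) are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Bucket:
    name: str
    replicas: int  # number of replica copies configured for the bucket


@dataclass
class Node:
    name: str
    # buckets that have vBuckets on this node (hypothetical model)
    buckets: List[Bucket] = field(default_factory=list)


def may_auto_failover(node: Node, count: int, max_count: int) -> Tuple[bool, str]:
    """Return (allowed, reason) for auto-failing over `node` with default settings."""
    if count >= max_count:
        # the administrator-specified maximum has already been reached
        return False, "maximum number of auto-failed-over nodes reached"
    for bucket in node.buckets:
        if bucket.replicas == 0:
            # failing over this node could lose the bucket's data
            return False, f"bucket {bucket.name!r} has no replicas"
    return True, "ok"


node = Node("node2", [Bucket("travel", 1), Bucket("cache", 0)])
print(may_auto_failover(node, count=0, max_count=1))
```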
+
If the unreplicated ephemeral bucket is indexed, Couchbase Server rebuilds the index after it auto-fails over the node, even if the index is not on the failed node.
After this type of failover, the index must be rebuilt because it indexes data lost in the failed node's vBuckets.
Typo -- "the index must be rebuild"
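The index-rebuild rule above depends on where the lost data lived, not on where the index lives. A hedged Python sketch of that selection, using a hypothetical in-memory model (`Index`, `BucketInfo`, `indexes_to_rebuild`) rather than any Couchbase API:

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Index(NamedTuple):
    name: str
    bucket: str
    host: str  # node hosting the index; intentionally not consulted below


class BucketInfo(NamedTuple):
    kind: str                      # "ephemeral" or "couchbase"
    replicas: int
    vbucket_nodes: FrozenSet[str]  # nodes holding this bucket's vBuckets


def indexes_to_rebuild(indexes: List[Index], buckets: Dict[str, BucketInfo],
                       failed_node: str) -> List[str]:
    """Names of indexes that must be rebuilt after `failed_node` is auto-failed over."""
    rebuild = []
    for idx in indexes:
        b = buckets[idx.bucket]
        # data is lost only for an unreplicated ephemeral bucket with vBuckets on the failed node
        lost_data = (b.kind == "ephemeral" and b.replicas == 0
                     and failed_node in b.vbucket_nodes)
        if lost_data:
            rebuild.append(idx.name)
    return rebuild


buckets = {
    "cache": BucketInfo("ephemeral", 0, frozenset({"n1", "n2"})),
    "orders": BucketInfo("couchbase", 1, frozenset({"n1", "n2"})),
}
indexes = [Index("idx_cache", "cache", "n3"), Index("idx_orders", "orders", "n1")]
print(indexes_to_rebuild(indexes, buckets, failed_node="n1"))  # ['idx_cache']
```

Note that `idx_cache` lives on `n3` yet is still selected, which is the point the doc text makes.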
Had a few typo comments.
Otherwise, LGTM.
Typos "correspinding" - corresponding, nuber - number
+
These parameters are _optional_ and are supported only by Couchbase Server Enterprise Edition.
These parameters and their values are ignored if `enabled` is set to `false`.
If you supply a value for this parameter while `failoverOnDataDiskIssues[enabled]` is `false`, Couchbase Server ignores the setting.
If a value is supplied for this parameter?
Typo: igtnores
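The "ignored when disabled" behaviour can be expressed as a small normalisation step. The sketch below is a client-side model, assuming the submitted form is a plain dict of strings; `effective_disk_issue_settings` is a hypothetical helper, not part of Couchbase Server.

```python
from typing import Dict, Optional


def effective_disk_issue_settings(form: Dict[str, str]) -> Dict[str, Optional[object]]:
    """Return the disk-issue settings that take effect for a submitted form.

    Models the documented rule: when failoverOnDataDiskIssues[enabled] is
    not "true", any failoverOnDataDiskIssues[timePeriod] value is ignored.
    """
    enabled = form.get("failoverOnDataDiskIssues[enabled]", "false") == "true"
    if not enabled:
        return {"enabled": False, "timePeriod": None}  # supplied value dropped
    raw = form.get("failoverOnDataDiskIssues[timePeriod]")
    return {"enabled": True, "timePeriod": int(raw) if raw is not None else None}


print(effective_disk_issue_settings({
    "failoverOnDataDiskIssues[enabled]": "false",
    "failoverOnDataDiskIssues[timePeriod]": "30",
}))
```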
* If the value of `timeout` is incorrectly specified, `400 Bad Request` is returned with the message `The value of "timeout" must be a positive integer in a range from 5 to 3600`.
You must have one of the following roles to make changes to the auto-failover settings:
to make changes?
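The `timeout` check behind the `400 Bad Request` response can be sketched as below, using the 5 to 3600 range and the message quoted above. `validate_timeout` is a hypothetical helper; it simply treats any value that does not parse as an integer, or falls outside the range, as invalid.

```python
from typing import Tuple, Union

TIMEOUT_MIN, TIMEOUT_MAX = 5, 3600
TIMEOUT_ERROR = (
    'The value of "timeout" must be a positive integer '
    f"in a range from {TIMEOUT_MIN} to {TIMEOUT_MAX}"
)


def validate_timeout(raw: str) -> Tuple[int, Union[int, str]]:
    """Return (200, seconds) for a valid timeout, else (400, error message)."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return 400, TIMEOUT_ERROR  # not an integer at all
    if not TIMEOUT_MIN <= value <= TIMEOUT_MAX:
        return 400, TIMEOUT_ERROR  # an integer, but outside the allowed range
    return 200, value


print(validate_timeout("120"))  # (200, 120)
print(validate_timeout("2"))
```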
@@ -16,63 +16,56 @@ GET /settings/autoFailover
The `GET /settings/autoFailover` HTTP method and URI retrieve auto-failover settings for the cluster.
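A hedged sketch of issuing the `GET /settings/autoFailover` call with Python's standard library. It assumes the cluster's REST API is reachable on the usual administration port 8091 over plain HTTP with Basic authentication; the host and credentials are placeholders. The function only builds the request, so it runs without a cluster.

```python
import base64
import urllib.request


def autofailover_get_request(host: str, user: str, password: str,
                             port: int = 8091) -> urllib.request.Request:
    """Build (but do not send) a GET /settings/autoFailover request."""
    url = f"http://{host}:{port}/settings/autoFailover"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Basic {token}")  # HTTP Basic auth
    return req


# Placeholder credentials; the user needs a role allowed to read these settings.
req = autofailover_get_request("localhost", "Administrator", "password")
print(req.full_url)  # http://localhost:8091/settings/autoFailover
# Against a live cluster: json.load(urllib.request.urlopen(req))
```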
Auto-failover settings are global and apply to all nodes in the cluster.
To read auto-failover settings, one of the following roles is required: Full Admin, Cluster Admin, Read-Only Admin, Backup Full Admin, Eventing Full Admin, Local User Security Admin, External User Security Admin.
Auto-failover settings are global, applying to all nodes in the cluster.
global, applying
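The read-permission rule above reduces to a set-membership test. In this sketch the role names are the display names copied from the text; the identifiers used by Couchbase's RBAC REST API may differ, so treat the set as illustrative only.

```python
from typing import Iterable

# Display names as listed in the docs text above (illustrative, not RBAC role IDs).
READ_AUTOFAILOVER_ROLES = frozenset({
    "Full Admin",
    "Cluster Admin",
    "Read-Only Admin",
    "Backup Full Admin",
    "Eventing Full Admin",
    "Local User Security Admin",
    "External User Security Admin",
})


def can_read_autofailover(user_roles: Iterable[str]) -> bool:
    """True if the user holds at least one role that may read auto-failover settings."""
    return not READ_AUTOFAILOVER_ROLES.isdisjoint(user_roles)


print(can_read_autofailover(["Data Reader", "Read-Only Admin"]))  # True
```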
* `count`.
The number of nodes that Couchbase Server has auto-failed over.
Couchbase Server resets this value to zero either when the cluster rebalances to remove or rejoin the failed nodes, or when an administrator manually resets the count (see xref:rest-api:rest-cluster-autofailover-reset.adoc[]).
Typo: Coucbase
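The lifecycle of `count` can be pictured as a counter that auto-failover increments and that a rebalance or a manual reset returns to zero. This is an in-memory model only: `AutoFailoverCount` and its `max_count` limit (standing in for the administrator-specified maximum) are hypothetical, and the real reset is done through the REST call referenced above.

```python
class AutoFailoverCount:
    """In-memory model of the auto-failover `count` field (hypothetical)."""

    def __init__(self, max_count: int) -> None:
        self.max_count = max_count  # administrator-specified maximum
        self.count = 0              # nodes auto-failed over so far

    def try_auto_failover(self) -> bool:
        """Record an auto-failover if the maximum has not been reached."""
        if self.count >= self.max_count:
            return False
        self.count += 1
        return True

    def on_rebalance(self) -> None:
        """A rebalance removed or rejoined the failed nodes."""
        self.count = 0

    def manual_reset(self) -> None:
        """An administrator reset the count."""
        self.count = 0


c = AutoFailoverCount(max_count=1)
print(c.try_auto_failover(), c.try_auto_failover())  # True False
c.manual_reset()
print(c.count)  # 0
```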
+
--
** `enabled` indicates whether Couchbase Server initiates an auto-failover on a node when its data disk has failed to complete an operation in the period set by `timePeriod`.
This value can be `true`, which enables the auto-failover, or the default `false`, which does not trigger a failover due to disk unresponsiveness.
Typo: teh
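To make the `enabled`/`timePeriod` pair concrete, the sketch below flags a data disk as non-responsive when any outstanding read or write has been pending longer than `timePeriod` seconds. It is an assumed model of the check, not Couchbase Server's detector; the function name and the list of operation start times are hypothetical.

```python
import time
from typing import Iterable, Optional


def disk_non_responsive(pending_op_starts: Iterable[float], time_period_s: float,
                        enabled: bool, now: Optional[float] = None) -> bool:
    """Return True if the disk should count as hung under the described settings.

    pending_op_starts: monotonic timestamps at which still-unfinished
    data-disk reads or writes were issued (hypothetical input).
    """
    if not enabled:
        return False  # default `false`: disk slowness never triggers failover
    now = time.monotonic() if now is None else now
    return any(now - start > time_period_s for start in pending_op_starts)


print(disk_non_responsive([100.0], time_period_s=30, enabled=True, now=140.0))  # True
```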
* `canAbortRebalance`.
Whether or not auto-failover can be triggered if a _rebalance_ is in progress.
Sets whether Couchbase Server can perform an auto-failover while a rebalance is taking place.
can while a rebalance > while a rebalance
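The effect of `canAbortRebalance` can be summarised as a decision table. The sketch assumes that, when the setting is `false`, auto-failover is held back while the rebalance is still running; the action names are illustrative, not Couchbase Server states.

```python
def auto_failover_action(node_failed: bool, rebalance_running: bool,
                         can_abort_rebalance: bool) -> str:
    """Return an illustrative action for the auto-failover logic."""
    if not node_failed:
        return "none"
    if not rebalance_running:
        return "failover"
    if can_abort_rebalance:
        # the rebalance is stopped so that the failover can proceed
        return "abort_rebalance_then_failover"
    return "defer_until_rebalance_ends"  # ASSUMPTION: behaviour when false


for flag in (True, False):
    print(flag, auto_failover_action(True, True, flag))
```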
If an ephemeral bucket lacks replicas, it loses the data in vBuckets on any node that fails and restarts.
To prevent this data loss, by default Couchbase Server does not allow auto-failover of a node that contains vBuckets for an unreplicated ephemeral bucket.
In this case, you must manually fail over the node if it is unresponsive.
However, all of the ephemeral bucket's data on the node is lost.
all of the ephemeral bucket's data
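The ephemeral-bucket rule above comes down to which vBuckets live only on the failed node. A minimal sketch, assuming a vBucket map given as a dict from vBucket ID to the node holding its only copy; the `allow_unreplicated_failover` flag is a hypothetical stand-in for the opt-in setting this PR documents (DOC-12191), whose real name is not shown here.

```python
from typing import Dict, List, Tuple


def ephemeral_failover_check(vbucket_map: Dict[int, str], failed_node: str,
                             replicas: int,
                             allow_unreplicated_failover: bool = False
                             ) -> Tuple[bool, List[int]]:
    """Return (auto_failover_allowed, vBuckets whose data would be lost)
    for one ephemeral bucket."""
    if replicas > 0:
        # simplification: assume a replica exists elsewhere and can be promoted
        return True, []
    lost = sorted(vb for vb, node in vbucket_map.items() if node == failed_node)
    # Default: refuse auto-failover whenever it would lose data.
    allowed = not lost or allow_unreplicated_failover
    return allowed, lost


vmap = {0: "n1", 1: "n2", 2: "n1", 3: "n3"}
print(ephemeral_failover_check(vmap, "n1", replicas=0))  # (False, [0, 2])
```

With the hypothetical opt-in set to `True`, the same call returns `(True, [0, 2])`: failover proceeds, and the listed vBuckets' data is lost.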
This PR includes docs for the following doc tickets:
Preview URL:
https://preview.docs-test.couchbase.com/docs-server-DOC-12191_allow_auto-failover_ephemeral_buckets_w-o_replica_reconcile
You will need the Docs Team credentials on Confluence.
Primary changes (with links to preview):