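Why part 2? Because I already wrote one post about Couchbase pitfalls, and I have run into plenty of other messy ones since. If I find the energy to write them up, there will be a part 3 and a part 4.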

Background

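A while ago we found that with Couchbase Java SDK 2.2.6, once one machine in the test environment went down, the whole SDK stopped working. Couchbase is a cluster with strong fault tolerance, so one failed node should not make the entire cluster unusable. I therefore suspected the client and instinctively upgraded the SDK to 2.4.5, which resolved the issue. After about a month of normal operation, a colleague on the private-cloud team reported that a machine in a customer's environment had gone down and the Couchbase client in our service could no longer come up. The log looked like this: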

2017-06-21 16:09:56.459 [RxComputationScheduler-6] WARN [null][KeyValueEndpoint]: Socket connect took longer than specified timeout.
2017-06-21 16:09:56.460 [RxComputationScheduler-6] WARN Error during reconnect:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: Connect callback did not return, hit safeguarding timeout.
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:346) ~[dev-api-v1.jar:na]
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:339) ~[dev-api-v1.jar:na]
at rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onError(SingleOperatorOnErrorResumeNext.java:69) [dev-api-v1.jar:na]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber$OtherSubscriber.onError(SingleTimeout.java:133) [dev-api-v1.jar:na]
at rx.Single$1.call(Single.java:460) [dev-api-v1.jar:na]
at rx.Single$1.call(Single.java:456) [dev-api-v1.jar:na]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber.call(SingleTimeout.java:110) [dev-api-v1.jar:na]
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189) [dev-api-v1.jar:na]
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55) [dev-api-v1.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_121]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
2017-06-21 16:09:56.460 [RxComputationScheduler-6] WARN [null][KeyValueEndpoint]: Could not connect to endpoint, retrying with delay 4096 MILLISECONDS:
com.couchbase.client.deps.io.netty.channel.ConnectTimeoutException: Connect callback did not return, hit safeguarding timeout.
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:346) ~[dev-api-v1.jar:na]
at com.couchbase.client.core.endpoint.AbstractEndpoint$3.call(AbstractEndpoint.java:339) ~[dev-api-v1.jar:na]
at rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onError(SingleOperatorOnErrorResumeNext.java:69) [dev-api-v1.jar:na]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber$OtherSubscriber.onError(SingleTimeout.java:133) [dev-api-v1.jar:na]
at rx.Single$1.call(Single.java:460) [dev-api-v1.jar:na]
at rx.Single$1.call(Single.java:456) [dev-api-v1.jar:na]
at rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber.call(SingleTimeout.java:110) [dev-api-v1.jar:na]
at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$2.call(EventLoopsScheduler.java:189) [dev-api-v1.jar:na]
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55) [dev-api-v1.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_121]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
2017-06-21 16:09:57.290 [cb-io-1-4] WARN [null][ConfigEndpoint]: Socket connect took longer than specified timeout.
2017-06-21 16:09:57.733 [cb-io-1-5] INFO [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.
2017-06-21 16:09:57.734 [cb-io-1-6] INFO [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.
2017-06-21 16:09:57.736 [cb-io-1-7] INFO [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.
2017-06-21 16:09:57.738 [cb-io-1-8] INFO [null][KeyValueEndpoint]: Got notified from Channel as inactive, attempting reconnect.

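Because it was urgent, we tried rolling back to 2.2.6 to see whether that would fix it, and after the rollback the service ran normally. Having been bitten by this kind of thing too many times, I decided to invest the time to find out why.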

Filling in the pit

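I asked ops for a three-node Couchbase cluster running version 4.6.2. With the service running on SDK 2.4.5, I manually failed over one of the nodes and everything kept working; after restoring the node I killed its process instead, and everything was still fine. My first guess was that the problem depended on the Couchbase Server version. I happened to be travelling to the customer site to deal with the issue, so while there I used tcpdump in the customer's environment to capture the traffic of both the failing SDK version and the working one. The customer's Couchbase Server was 4.1.1, but their cluster had only two nodes, so losing one left a single node. Once that occurred to me, I killed nodes in the test cluster until only one remained, and sure enough the problem appeared there too.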

Packet capture results on Couchbase 4.1

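Couchbase Java SDK 2.2.6 works normally. The capture shows a HELLO response coming back immediately after the HELLO request, followed by further steps such as fetching the cluster configuration.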

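Couchbase Java SDK 2.4.5 fails to initialize and cannot work. The capture shows that the server never sends a response to the HELLO request, so the client resends HELLO and gives up after a certain number of retries. None of the subsequent initialization steps run, so the client naturally cannot work.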

Comparing the two HELLO packets carefully, everything is identical except the body; yet with a single node left, the 2.4.5 HELLO request simply gets no response.
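To make the "only the body differs" comparison concrete, here is a minimal sketch of a HELLO request in the memcached binary protocol: a 24-byte header with opcode 0x1f, the client identifier as the key, and a list of 2-byte feature codes as the value. The class name, the client identifier strings and the feature lists are all hypothetical; they are not the bytes from the captures above, and the sketch does not claim to reproduce what either SDK version sends.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for building and comparing HELLO requests.
final class HelloPacketSketch {
    static final byte MAGIC_REQUEST = (byte) 0x80;
    static final byte OPCODE_HELLO = 0x1f;
    static final int HEADER_LENGTH = 24;

    /** Builds a HELLO request: header, key = client identifier, value = 2-byte feature codes. */
    static byte[] buildHello(String clientId, int... features) {
        byte[] key = clientId.getBytes(StandardCharsets.UTF_8);
        int bodyLength = key.length + 2 * features.length;
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LENGTH + bodyLength); // big-endian by default
        buf.put(MAGIC_REQUEST);
        buf.put(OPCODE_HELLO);
        buf.putShort((short) key.length); // key length
        buf.put((byte) 0);                // extras length
        buf.put((byte) 0);                // data type
        buf.putShort((short) 0);          // vbucket id (unused for HELLO)
        buf.putInt(bodyLength);           // total body length = extras + key + value
        buf.putInt(0);                    // opaque
        buf.putLong(0L);                  // CAS
        buf.put(key);
        for (int feature : features) {
            buf.putShort((short) feature);
        }
        return buf.array();
    }

    /** Extracts the feature codes from the value part of a HELLO request. */
    static Set<Integer> features(byte[] packet) {
        ByteBuffer buf = ByteBuffer.wrap(packet);
        int keyLength = buf.getShort(2) & 0xffff;
        int extrasLength = buf.get(4) & 0xff;
        int end = HEADER_LENGTH + buf.getInt(8);
        buf.position(HEADER_LENGTH + extrasLength + keyLength);
        Set<Integer> result = new LinkedHashSet<>();
        while (buf.position() + 2 <= end) {
            result.add(buf.getShort() & 0xffff);
        }
        return result;
    }

    /** Feature codes requested by the second packet but not by the first. */
    static Set<Integer> onlyInSecond(byte[] first, byte[] second) {
        Set<Integer> diff = new LinkedHashSet<>(features(second));
        diff.removeAll(features(first));
        return diff;
    }

    public static void main(String[] args) {
        // Placeholder identifiers and feature codes, NOT taken from the real captures.
        byte[] older = buildHello("client-a", 0x0001, 0x0003);
        byte[] newer = buildHello("client-b", 0x0001, 0x0003, 0x0007);
        System.out.println("Extra feature codes in second HELLO: " + onlyInSecond(older, newer));
    }
}
```

Feeding the payloads exported from the two tcpdump captures into `onlyInSecond` would list exactly which feature codes the newer client asks for and the older one does not, which is a reasonable place to start looking for what the server refuses to answer.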

Packet capture results on Couchbase 4.6

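Couchbase Java SDK 2.2.6 fails to initialize. The HELLO request is answered immediately with a HELLO response, and the SDK then goes on to fetch the cluster configuration and so on. However, the configuration it pulls still lists 192.168.248.43, where Couchbase has already been killed, so the SDK's later attempts to connect to that machine fail and initialization never completes.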

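Couchbase Java SDK 2.4.5 fails to initialize in the same way. The HELLO request is answered immediately with a HELLO response, and the SDK then goes on to fetch the cluster configuration and so on. However, the configuration it pulls still lists 192.168.248.43, where Couchbase has already been killed, so the SDK's later attempts to connect to that machine fail and initialization never completes.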
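In both 4.6 captures the trouble is a node that is still listed in the configuration but no longer accepts connections. A quick way to confirm that from the application host, without the SDK in the way, is a plain TCP probe of the key-value port on every listed node. The sketch below uses only the JDK; the class name, the method names and the 2-second timeout are my own choices, and 11210 is the default Couchbase data port, which is an assumption about how the cluster is configured.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic helper, independent of the Couchbase SDK.
final class NodeProbeSketch {
    // Default Couchbase key-value port; adjust if the cluster uses a custom one.
    static final int KV_PORT = 11210;

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: treat all as unreachable.
            return false;
        }
    }

    /** Returns the subset of hosts that do not accept connections on the given port. */
    static List<String> unreachable(List<String> hosts, int port, int timeoutMillis) {
        List<String> dead = new ArrayList<>();
        for (String host : hosts) {
            if (!isReachable(host, port, timeoutMillis)) {
                dead.add(host);
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        // Pass the node addresses from the cluster configuration as arguments.
        List<String> nodes = Arrays.asList(args);
        System.out.println("Unreachable nodes: " + unreachable(nodes, KV_PORT, 2000));
    }
}
```

Running it with the node list from the cluster configuration (for example `java NodeProbeSketch 192.168.248.43` plus the other node addresses) shows which listed nodes the client has no chance of connecting to.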

Summary

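When a Couchbase cluster is down to a single node, these problems show up almost every time, so keep the cluster at three nodes or more whenever possible.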