zeroc ice是一个rpc框架,广泛应用于我们后台服务,在uwsgi+django使用icepy(ice python是在ice c++的基础上封装了一层供python使用,代码是v3.6.3)的时候遇到了下面这个问题

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Gracefully killing worker 20 (pid: 36213)...
!! 02/14/17 09:55:41.823 error: 2 communicators not destroyed during global destruction.!!! uWSGI process 36213 got Segmentation Fault !!!
*** backtrace of 36213 ***
uwsgi(uwsgi_backtrace+0x2e) [0x469e0e]
uwsgi(uwsgi_segfault+0x21) [0x46a1a1]
/usr/lib64/libc.so.6(+0x35250) [0x7f9c52192250]
/usr/lib64/python2.7/site-packages/IcePy.so(_ZN3Ice6upCastEPNS_6LoggerE+0x8) [0x7f9c45430458]
/usr/lib64/python2.7/site-packages/IcePy.so(_ZN3Ice16setProcessLoggerERKN11IceInternal6HandleINS_6LoggerEEE+0xc2) [0x7f9c453bd9f2]
/usr/lib64/python2.7/site-packages/IcePy.so(_ZN5IcePy13cleanupLoggerEv+0x29) [0x7f9c452ed669]
/usr/lib64/python2.7/site-packages/IcePy.so(IcePy_cleanup+0x9) [0x7f9c452ec039]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x730a) [0x7f9c5283a00a]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x7ed) [0x7f9c5283be3d]
/usr/lib64/libpython2.7.so.1.0(+0x70798) [0x7f9c527c5798]
/usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f9c527a08e3]
/usr/lib64/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x47) [0x7f9c528326f7]
/usr/lib64/libpython2.7.so.1.0(Py_Finalize+0xb7) [0x7f9c52856f87]
uwsgi(uwsgi_plugins_atexit+0x71) [0x467161]
/usr/lib64/libc.so.6(+0x38a49) [0x7f9c52195a49]
/usr/lib64/libc.so.6(+0x38a95) [0x7f9c52195a95]
uwsgi() [0x421b2f]

这个是在uwsgi reload的时候引起的,先说下error: communicator not destroyed during global destruction。这个是细究起来也是有点长的,不过这文章的重点不是这个,所以简单说明下,这个是在程序调用exit()结束的时候,执行~Init()函数打印的日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
~Init()
{
{
IceUtilInternal::MutexPtrLock<IceUtil::Mutex> sync(staticMutex);
int notDestroyedCount = 0;
for(std::list<IceInternal::Instance*>::const_iterator p = instanceList->begin();
p != instanceList->end(); ++p)
{
if(!(*p)->destroyed())
{
notDestroyedCount++;
}
}
if(notDestroyedCount > 0)
{
cerr << "!! " << IceUtil::Time::now().toDateTime() << " error: ";
if(notDestroyedCount == 1)
{
cerr << "communicator ";
}
else
{
cerr << notDestroyedCount << " communicators ";
}
cerr << "not destroyed during global destruction.";
}
delete instanceList;
instanceList = 0;
}
delete staticMutex;
staticMutex = 0;
}

如果调用这函数但是communicator没有被destroy就会出现这个log。

现在开始讲重点了,看懂上面那个错误堆栈需要使用c++filt这个工具

1
2
3
4
5
6
$ c++filt _ZN3Ice6upCastEPNS_6LoggerE
Ice::upCast(Ice::Logger*)
$ c++filt _ZN3Ice16setProcessLoggerERKN11IceInternal6HandleINS_6LoggerEEE
Ice::setProcessLogger(IceInternal::Handle<Ice::Logger> const&)
$ c++filt _ZN5IcePy13cleanupLoggerEv
IcePy::cleanupLogger()

这里就把函数签名转换成代码里面具体的方法名了,以下是这个堆栈的代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
typedef ::IceInternal::Handle< ::Ice::Logger> LoggerPtr;
Ice::LoggerPtr processLogger;
void IcePy::cleanupLogger()
{
Ice::setProcessLogger(0);
}
void Ice::setProcessLogger(const LoggerPtr& logger)
{
IceUtilInternal::MutexPtrLock<IceUtil::Mutex> lock(globalMutex);
processLogger = logger;
}
ICE_API ::Ice::LocalObject* Ice::upCast(::Ice::Logger* p){ return p; }

Handle是ice封装的一个智能指针类,重载了operator= 以下是相关代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
~Handle()
{
if(this->_ptr)
{
upCast(this->_ptr)->__decRef();
}
}
Handle& operator=(T* p)
{
if(this->_ptr != p)
{
if(p)
{
upCast(p)->__incRef();
}
T* ptr = this->_ptr;
this->_ptr = p;
if(ptr)
{
upCast(ptr)->__decRef();
}
}
return *this;
}
void
IceUtil::Shared::__decRef()
{
assert(_ref > 0);
if(--_ref == 0 && !(_flags & NoDelete))
{
delete this;
}
}

看代码的话,我们调用了setProcessLogger(0),所以operator=函数里面调用了upcast()的情况只可能是ptr不为0的时候,但是光是空想并不知道为啥出了问题,于是用了gdb debug,gdb脾气不是特别好,我前几次使用他都无缘无故卡住了,或者直接没进到断点。所以搁置一天,然而还是不死心,于是继续尝试用gdb debug,估计前几次是我运气太背,后来几次debug都挺顺畅的,小概率出现进程卡死的情况。

所以在Ice::setProcessLogger这函数打了断点 gdb debug几次之后均显示ptr这个地址不为0,苦思无果,于是断断续续利用零散的时间debug,直到把断点打到Ice::upCast(::Ice::Logger* p)这个函数,发现upcast已经在~Handle()执行过一次了,于是突然想到会不会是这个析构后把ptr这东西给释放了之类的,导致这个地址的引用失败了,于是跳进~Handle()的__decRef();

gdb

执行的57行也就是上面__decRef函数的 delete this;这句话了,所以估计就是因为被析构后引用了个无效的地址了。

关于解决方式

其实reload后,进程都被杀掉了,不管也是可以的,如果不想看到这个错误堆栈的话 注释掉Ice.py下面这两行代码就可以了

1
2
import atexit
atexit.register(IcePy.cleanup)

关于processLogger的生命周期与atexit

uwsgi注册了若干atexit函数

1
2
3
4
5
6
7
8
9
10
// fallback config
atexit(uwsgi_fallback_config);
// manage/flush logs
atexit(uwsgi_flush_logs);
// clear sockets, pidfiles...
atexit(vacuum);
// call user scripts
atexit(uwsgi_exec_atexit);
// call plugin specific exit hooks
atexit(uwsgi_plugins_atexit);

调用cleanup的代码正是被uwsgi_plugins_atexit这个函数调用的,看代码的话就有个疑问了,processLogger是个全局变量,按道理应该是在atexit执行之后才被析构,但是实际上是析构在前,于是推测是不是因为uwsgi 调用python的关系,所以在启动uwsgi并且成功请求一次之前和之后各打印processLogger(如果不请求的话,uwsgi是不会加载相关python代码的)

1
2
3
4
5
6
7
(gdb) p 'Initialize.cpp'::processLogger
No symbol "Initialize.cpp" in current context.
(gdb) p 'Initialize.cpp'::processLogger
$5 = {<IceUtil::HandleBase<Ice::Logger>> = {
_ptr = 0x29e8f60}, <No data fields>}

事实说明processLogger要在加载python相关代码后才初始化,于是在Handle.h的构造函数打下断点,在uwsgi 启动后第一次请求触发断点,得到以下堆栈

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
Handle (p=0x0, this=0x7f2a2ff7bc00 <(anonymous namespace)::processLogger>)
at src/ice/cpp/include/Ice/Handle.h:42
42 src/ice/cpp/include/Ice/Handle.h: 没有那个文件或目录.
(gdb) bt
#0 Handle (p=0x0, this=0x7f2a2ff7bc00 <(anonymous namespace)::processLogger>)
at src/ice/cpp/include/Ice/Handle.h:42
#1 __static_initialization_and_destruction_0 (__initialize_p=1,
__priority=65535) at src/ice/cpp/src/Ice/Initialize.cpp:45
#2 _GLOBAL__sub_I_Initialize.cpp(void) ()
at src/ice/cpp/src/Ice/Initialize.cpp:538
#3 0x00007f2a3e22e1e3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#4 0x00007f2a3e2328f6 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#5 0x00007f2a3e22dff4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6 0x00007f2a3e231feb in _dl_open () from /lib64/ld-linux-x86-64.so.2
#7 0x00007f2a3dafdfbb in dlopen_doit () from /usr/lib64/libdl.so.2
#8 0x00007f2a3e22dff4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#9 0x00007f2a3dafe5bd in _dlerror_run () from /usr/lib64/libdl.so.2
#10 0x00007f2a3dafe051 in dlopen@@GLIBC_2.2.5 () from /usr/lib64/libdl.so.2
#11 0x00007f2a3cb7122f in _PyImport_GetDynLoadFunc ()
from /usr/lib64/libpython2.7.so.1.0
#12 0x00007f2a3cb59b3e in _PyImport_LoadDynamicModule ()
from /usr/lib64/libpython2.7.so.1.0
#13 0x00007f2a3cb57c41 in import_submodule ()
from /usr/lib64/libpython2.7.so.1.0
#14 0x00007f2a3cb57e8d in load_next () from /usr/lib64/libpython2.7.so.1.0
#15 0x00007f2a3cb5886e in PyImport_ImportModuleLevel ()
from /usr/lib64/libpython2.7.so.1.0
---Type <return> to continue, or q <return> to quit---
#16 0x00007f2a3cb3bb1f in builtin___import__ ()
from /usr/lib64/libpython2.7.so.1.0
#17 0x00007f2a3caab8e3 in PyObject_Call () from /usr/lib64/libpython2.7.so.1.0
#18 0x00007f2a3cb3d6f7 in PyEval_CallObjectWithKeywords ()
from /usr/lib64/libpython2.7.so.1.0
#19 0x00007f2a3cb423b5 in PyEval_EvalFrameEx ()
from /usr/lib64/libpython2.7.so.1.0
#20 0x00007f2a3cb46e3d in PyEval_EvalCodeEx ()
from /usr/lib64/libpython2.7.so.1.0
#21 0x00007f2a3cb46f42 in PyEval_EvalCode ()
from /usr/lib64/libpython2.7.so.1.0
#22 0x00007f2a3cb56d2c in PyImport_ExecCodeModuleEx ()
from /usr/lib64/libpython2.7.so.1.0
#23 0x00007f2a3cb56fa8 in load_source_module ()
from /usr/lib64/libpython2.7.so.1.0
#24 0x00007f2a3cb57c41 in import_submodule ()
from /usr/lib64/libpython2.7.so.1.0
#25 0x00007f2a3cb57f26 in load_next () from /usr/lib64/libpython2.7.so.1.0
#26 0x00007f2a3cb5886e in PyImport_ImportModuleLevel ()
from /usr/lib64/libpython2.7.so.1.0
#27 0x00007f2a3cb3bb1f in builtin___import__ ()
from /usr/lib64/libpython2.7.so.1.0
#28 0x00007f2a3caab8e3 in PyObject_Call () from /usr/lib64/libpython2.7.so.1.0
---Type <return> to continue, or q <return> to quit---
#29 0x00007f2a3cb3d6f7 in PyEval_CallObjectWithKeywords ()
from /usr/lib64/libpython2.7.so.1.0
#30 0x00007f2a3cb423b5 in PyEval_EvalFrameEx ()
from /usr/lib64/libpython2.7.so.1.0
#31 0x00007f2a3cb46e3d in PyEval_EvalCodeEx ()
from /usr/lib64/libpython2.7.so.1.0
#32 0x00007f2a3cb46f42 in PyEval_EvalCode ()
from /usr/lib64/libpython2.7.so.1.0
#33 0x00007f2a3cb56d2c in PyImport_ExecCodeModuleEx ()
from /usr/lib64/libpython2.7.so.1.0
#34 0x00007f2a3cb56fa8 in load_source_module ()
from /usr/lib64/libpython2.7.so.1.0
#35 0x00007f2a3cb57c41 in import_submodule ()

所以processLogger的生命周期从加载uwsgi python相关动态库开始,在atexit之前结束,所以估计是和uwsgi python加载的动态库的生命周期是相关的
但是debug的时候没发现调用了dlclose,所以为啥会在atexit之前析构暂时没弄懂