Web server environment (installed with the lnmp one-click installer):

Linux 2.6.32-220.13.1.el6.x86_64
nginx/1.2.7
PHP 5.3.9
mysql Ver 14.14 Distrib 5.1.60

The forum software is Discuz! X3.2

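I took over this forum server partway through its life, so whenever Discuz! returned a 502 in the past I simply ran /root/lnmp restart and the problem went away.
I suspected a CGI issue, but I never analyzed the logs to look into it properly.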

Today the 502 came back, so I logged in to the server and looked at the logs:

$ tail nginx_error.log 
2015/03/07 18:15:24 [crit] 28036#0: *57 connect() to unix:/tmp/php-cgi.sock failed (2: No such file or directory) while connecting to upstream, client: 115.25.81.74, server: hnubbs.com, request: "GET /thread-898971-10-1.html HTTP/1.1", upstream: "fastcgi://unix:/tmp/php-cgi.sock:", host: "www.hnubbs.com"
2015/03/07 18:15:24 [crit] 28036#0: *59 connect() to unix:/tmp/php-cgi.sock failed (2: No such file or directory) while connecting to upstream, client: 115.25.81.70, server: hnubbs.com, request: "GET /thread-939979-1-1.html HTTP/1.1", upstream: "fastcgi://unix:/tmp/php-cgi.sock:", host: "www.hnubbs.com"

$ tail php-fpm.log 
[07-Mar-2015 18:09:59] WARNING: [pool www] child 27532, script '/home/wwwroot/hnubbs.com/member.php' (request: "POST /member.php") execution timed out (119.985649 sec), terminating
[07-Mar-2015 18:09:59] WARNING: [pool www] child 27531, script '/home/wwwroot/hnubbs.com/member.php' (request: "POST /member.php") execution timed out (119.988660 sec), terminating
[07-Mar-2015 18:09:59] WARNING: [pool www] child 27530, script '/home/wwwroot/hnubbs.com/member.php' (request: "POST /member.php") execution timed out (119.991652 sec), terminating
[07-Mar-2015 18:09:59] WARNING: [pool www] child 27528, script '/home/wwwroot/hnubbs.com/member.php' (request: "POST /member.php") execution timed out (119.452557 sec), terminating
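
To see which scripts are behind the timeouts, the warnings can be grouped by script path. This is a minimal sketch that assumes the php-fpm.log line format shown above; the function name timeouts_by_script is made up for illustration.

```shell
#!/bin/sh
# Count php-fpm "execution timed out" warnings per script.
# Reads log lines on stdin and prints "<count> <script path>", busiest first.
timeouts_by_script() {
    awk '/execution timed out/ {
        # The script path is the text between the first pair of single quotes.
        # \047 is the octal escape for a single quote inside an awk string.
        n = split($0, parts, "\047")
        if (n >= 3) count[parts[2]]++
    }
    END { for (s in count) print count[s], s }' | sort -rn
}
```

It can be run as: timeouts_by_script < php-fpm.log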

After some searching, this looked like a php-fpm problem, mainly caused by too few worker processes or by scripts running too long, so I added two lines to /usr/local/php/etc/php-fpm.conf:

request_terminate_timeout = 250s  ; terminate any script still running after 250 seconds
pm.max_children = 128  ; raise the maximum number of child processes to 128

Update

Some time later the forum started reporting Table './bbs/forum_threadaddviews' is marked as crashed and should be repaired. At first I assumed mysql itself was at fault, so I checked its log:

150730  0:03:17 [ERROR] Got error 134 when reading table './bbs/forum_forumrecommend'
150730  0:05:29 [ERROR] Got error 134 when reading table './bbs/forum_forumrecommend'

Regarding error 134, the official documentation says:

Even though the MyISAM table format is very reliable (all changes to a
table made by an SQL statement are written before the statement
returns), you can still get corrupted tables if any of the following
events occur:

The mysqld process is killed in the middle of a write.
An unexpected computer shutdown occurs (for example, the computer is turned off).
Hardware failures.
You are using an external program (such as myisamchk) to modify a table that is being modified by the server at the same time.
A software bug in the MySQL or MyISAM code.

Typical symptoms of a corrupt table are:

You get the following error while selecting data from the table: Incorrect key file for table: '...'. Try to repair it
Queries don't find rows in the table or return incomplete results.

Source: http://dev.mysql.com/doc/refman/5.0/en/corrupted-myisam-tables.html

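But even though I repaired the crashed tables almost every day with myisamchk -c -r and repair, the errors kept coming back. Then I found in the system messages log that the kernel had run out of memory and killed mysqld: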

messages-20150705:Jun 29 04:21:21 localhost kernel: Out of memory: Kill process 14957 (mysqld) score 118 or sacrifice child
messages-20150705:Jun 29 04:29:02 localhost kernel: Out of memory: Kill process 4169 (mysqld) score 35 or sacrifice child
messages-20150705:Jun 29 21:17:52 localhost kernel: Out of memory: Kill process 5172 (mysqld) score 78 or sacrifice child
messages-20150705:Jun 29 21:21:36 localhost kernel: Out of memory: Kill process 14353 (mysqld) score 32 or sacrifice child
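
A quick way to confirm which process the OOM killer keeps picking is to count the kill lines per process name. This is a rough sketch that assumes the kernel message format shown above; oom_victims is a hypothetical helper name.

```shell
#!/bin/sh
# Summarize kernel OOM kills per process name.
# Reads syslog lines on stdin and prints "<process name> <kill count>".
oom_victims() {
    awk '/Out of memory: Kill process/ {
        # The process name is the first parenthesised field, right after the PID.
        if (match($0, /\([^)]+\)/)) {
            name = substr($0, RSTART + 1, RLENGTH - 2)
            count[name]++
        }
    }
    END { for (n in count) print n, count[n] }'
}
```

It can be fed the rotated logs in one go, for example: cat /var/log/messages* | oom_victims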

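At that point Mr. Tan suggested that php-fpm might be the cause. ps showed each php-fpm process using 20 MB or more of memory, and with the pm.max_children = 128 I had set earlier, the pool could grow to roughly 128 × 20 MB ≈ 2.5 GB in the worst case, which is most likely where the problem came from (the server only has 2 GB of RAM). After lowering pm.max_children to 50, the mysql errors did indeed stop.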
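
The arithmetic behind that change can be sketched as below. The 20 MB per child and the 2 GB of RAM come from the numbers above; the 1024 MB budget (about half of RAM left for php-fpm, the rest for mysqld, nginx and the OS) is only an assumption for illustration, and both function names are made up.

```shell
#!/bin/sh
# Worst-case resident memory (MB) of a php-fpm pool: children * MB per child.
pool_worst_case_mb() {
    echo $(( $1 * $2 ))
}

# Largest child count that fits a memory budget (MB), rounded down.
max_children_for_budget() {
    echo $(( $1 / $2 ))
}

pool_worst_case_mb 128 20        # prints 2560, more than the 2048 MB of RAM
pool_worst_case_mb 50 20         # prints 1000
max_children_for_budget 1024 20  # prints 51 (assumed budget: half of RAM)
```

The per-child figure is best measured on the server itself, for example from the RSS column of ps -C php-fpm -o rss= (reported in KB), since it varies with the PHP extensions loaded and the scripts being run.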
