問題描述
支付清結(jié)算系統(tǒng)"外部對(duì)賬"部分任務(wù)在早上6:00~7:00時(shí)間段內(nèi)對(duì)賬速度緩慢,且經(jīng)常造成對(duì)賬統(tǒng)計(jì)結(jié)果不準(zhǔn)確問題,導(dǎo)致結(jié)轉(zhuǎn)記賬不準(zhǔn)確。
初步分析及處理過程
由于外部對(duì)賬是經(jīng)過性能優(yōu)化的,且在壓測(cè)環(huán)境表現(xiàn)良好,所以最開始初步認(rèn)為是由于數(shù)據(jù)庫性能瓶頸導(dǎo)致,初步處理方式為將線上6:00~7:00之前的慢SQL進(jìn)行了一番排查,并對(duì)涉及外部系統(tǒng)的慢SQL進(jìn)行了優(yōu)化,這個(gè)步驟理論上是可以解決此類問題,但RDS(阿里云Mysql可能存在慢SQL統(tǒng)計(jì)不準(zhǔn)確問題)。
特點(diǎn)時(shí)間段內(nèi)線程棧分析
特定時(shí)間段線程棧分析,由于SQL優(yōu)化部分并沒有解決此類問題,為了查看在早上6:00~7:00這個(gè)時(shí)間段內(nèi)程序JVM內(nèi)部到底發(fā)生了什么,所以采取了定時(shí)抓取線程棧日志的手段(具體方式為在Linux服務(wù)器添加cron任務(wù),并累計(jì)輸出日志的方式,腳本如??)。
clearjstack.sh
#!/bin/sh
DATE='/bin/date'
timestamp() {
$DATE +'%Y-%m-%d %H:%M:%S'
}
ts=`timestamp`
echo "start log $ts" >> /tmp/clearjstack.log
/usr/java/jdk1.7.0_67/bin/jstack -l `/usr/java/jdk1.7.0_67/bin/jps -lvm | grep tomcat-clear | awk '{print $1}'` >> /tmp/clearjstack.log
ts=`timestamp`
echo "end log $ts" >> /tmp/clearjstack.log
Linux下添加定時(shí)任務(wù)的方式:
//添加計(jì)劃任務(wù)
[root@123.57.28.241/10.172.233.8 tmp]cd /var/spool/cron/
[root@123.57.28.241/10.172.233.8 cron]pwd
/var/spool/cron/具體用戶名(如為root用戶)
[root@123.57.28.241/10.172.233.8 cron]vim root
*/5 * * * * /bin/sh /tmp/clearjstack.sh > /dev/null 2>&1
//查看計(jì)劃任務(wù)
[root@123.57.28.241/10.172.233.8 cron]crontab -l
0 0 * * * /usr/local/script/Cutlog.sh >/dev/null 2>&1
*/5 * * * * /bin/sh /tmp/clearjstack.sh > /dev/null 2>&1
對(duì)線程棧日志分析(1)
#clearjstack.log
start log 2016-11-24 06:20:01
2016-11-24 06:20:02
Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode):
"Thread-70" prio=10 tid=0x00007f21f4022800 nid=0x12da runnable [0x00007f216cace000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at com.mysql.jdbc.util.ReadAheadInputStream.fill(ReadAheadInputStream.java:114)
at com.mysql.jdbc.util.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:161)
at com.mysql.jdbc.util.ReadAheadInputStream.read(ReadAheadInputStream.java:189)
- locked <0x0000000786facbc8> (a com.mysql.jdbc.util.ReadAheadInputStream)
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3014)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3467)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3456)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3997)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2468)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2629)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2719)
- locked <0x0000000786f9a650> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155)
- locked <0x0000000786f9a650> (a com.mysql.jdbc.JDBC4Connection)
at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1379)
- locked <0x0000000786f9a650> (a com.mysql.jdbc.JDBC4Connection)
at com.alibaba.druid.pool.DruidPooledPreparedStatement.execute(DruidPooledPreparedStatement.java:493)
at sun.reflect.GeneratedMethodAccessor169.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.ibatis.logging.jdbc.PreparedStatementLogger.invoke(PreparedStatementLogger.java:55)
at com.sun.proxy.$Proxy84.execute(Unknown Source)
at org.apache.ibatis.executor.statement.PreparedStatementHandler.query(PreparedStatementHandler.java:56)
at org.apache.ibatis.executor.statement.RoutingStatementHandler.query(RoutingStatementHandler.java:70)
at org.apache.ibatis.executor.ReuseExecutor.doQuery(ReuseExecutor.java:54)
at org.apache.ibatis.executor.BaseExecutor.queryFromDatabase(BaseExecutor.java:259)
at org.apache.ibatis.executor.BaseExecutor.query(BaseExecutor.java:132)
at org.apache.ibatis.executor.BaseExecutor.query(BaseExecutor.java:115)
at sun.reflect.GeneratedMethodAccessor165.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.ibatis.plugin.Invocation.proceed(Invocation.java:46)
at com.github.miemiedev.mybatis.paginator.OffsetLimitInterceptor.intercept(OffsetLimitInterceptor.java:66)
at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:57)
at com.sun.proxy.$Proxy82.query(Unknown Source)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:104)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:98)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectOne(DefaultSqlSession.java:62)
at sun.reflect.GeneratedMethodAccessor203.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:358)
at com.sun.proxy.$Proxy29.selectOne(Unknown Source)
at org.mybatis.spring.SqlSessionTemplate.selectOne(SqlSessionTemplate.java:163)
at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:63)
at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:43)
at com.sun.proxy.$Proxy87.countOrgSettleFlow(Unknown Source)
at com.ninefbank.smallpay.clear.service.impl.OrgSettleFlowServiceImpl.isExistOrgSettleFlow(OrgSettleFlowServiceImpl.java:45)
at com.ninefbank.smallpay.clear.service.impl.OrgSettleFlowServiceImpl$$FastClassBySpringCGLIB$$c921c9ca.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:700)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:96)
at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:260)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:94)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:633)
at com.ninefbank.smallpay.clear.service.impl.OrgSettleFlowServiceImpl$$EnhancerBySpringCGLIB$$6b2e2fc1.isExistOrgSettleFlow(<generated>)
at com.ninefbank.smallpay.clear.util.FeeCountUtil.createOrgSettleFlow(FeeCountUtil.java:235)
- locked <0x000000078c795ed8> (a java.lang.Class for com.ninefbank.smallpay.clear.util.FeeCountUtil)
at com.ninefbank.smallpay.clear.writer.OuterReconWriter.saveTaskErrorInfo(OuterReconWriter.java:126)
- locked <0x0000000788061c30> (a com.ninefbank.smallpay.clear.writer.OuterReconWriter)
at com.ninefbank.smallpay.clear.tasklet.task.OuterReconWriterParallTask.call(OuterReconWriterParallTask.java:70)
at com.ninefbank.smallpay.clear.tasklet.task.OuterReconWriterParallTask.call(OuterReconWriterParallTask.java:35)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000078deebbf0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"Thread-69" prio=10 tid=0x00007f21f401c000 nid=0x12d9 waiting for monitor entry [0x00007f216ccd1000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.ninefbank.smallpay.clear.writer.OuterReconWriter.saveReconData(OuterReconWriter.java:169)
- waiting to lock <0x0000000788061c30> (a com.ninefbank.smallpay.clear.writer.OuterReconWriter)
at com.ninefbank.smallpay.clear.tasklet.task.OuterReconWriterParallTask.call(OuterReconWriterParallTask.java:68)
at com.ninefbank.smallpay.clear.tasklet.task.OuterReconWriterParallTask.call(OuterReconWriterParallTask.java:35)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000078defb298> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"Thread-68" prio=10 tid=0x00007f21f401b000 nid=0x12d8 waiting for monitor entry [0x00007f216c7cc000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.ninefbank.smallpay.clear.writer.OuterReconWriter.saveReconData(OuterReconWriter.java:169)
- waiting to lock <0x0000000788061c30> (a com.ninefbank.smallpay.clear.writer.OuterReconWriter)
at com.ninefbank.smallpay.clear.tasklet.task.OuterReconWriterParallTask.call(OuterReconWriterParallTask.java:68)
at com.ninefbank.smallpay.clear.tasklet.task.OuterReconWriterParallTask.call(OuterReconWriterParallTask.java:35)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x000000078deec880> (a java.util.concurrent.ThreadPoolExecutor$Worker)
從上述日志分析線程Thread-68、Thread-69都為阻塞狀態(tài),且阻塞狀態(tài)
smallpay.clear.writer.OuterReconWriter.saveReconData(OuterReconWriter.java:169)
所對(duì)應(yīng)的代碼主要體現(xiàn)在對(duì)賬結(jié)果寫入部分,另外從日志
- locked <0x0000000786f9a650> (a com.mysql.jdbc.JDBC4Connection)
所以此時(shí)很容易覺得問題出現(xiàn)在連接池與MySQL數(shù)據(jù)庫本身的問題上。按照這個(gè)思路(JVM日志很豐富,去偽存真,往往需要嘗試與不斷試錯(cuò)),開始了對(duì)連接池本身及MySQL本身問題的排查(Google),以下為幾種Google結(jié)果鏈接及原因概述。
1)、網(wǎng)絡(luò)原因&MySQL設(shè)置&換連接池
在以上鏈接所述的原因中,重點(diǎn)是說了MySQL及網(wǎng)絡(luò)方面的原因。針對(duì)這種敘述,與運(yùn)維大哥溝通后調(diào)取了阿里云RDS MySQL實(shí)例的參數(shù)設(shè)置,如下:
show global variables like '%timeout%'
最后一個(gè)參數(shù)wait_timeout表明Mysql服務(wù)器與客戶端連接空閑自動(dòng)斷開的超時(shí)時(shí)間為24小時(shí),也就是說只有連接在與MySQL服務(wù)器連接后24小時(shí)內(nèi)無任何操作,MySQL服務(wù)器才會(huì)主動(dòng)斷開與客戶端(這里為應(yīng)用連接池中與MySQL數(shù)據(jù)庫建立的連接),然而對(duì)于應(yīng)用來說,這種可能性基本不存在,況且大多數(shù)連接池配置都有對(duì)連接是否可用以及自動(dòng)探測(cè)連接的配置。
查看應(yīng)用連接池配置:
<property name="timeBetweenEvictionRunsMillis" value="3000" />
<property name="minEvictableIdleTimeMillis" value="300000" />
<property name="validationQuery" value="SELECT 'x'" />
<property name="testWhileIdle" value="true" />
<property name="testOnBorrow" value="true" />
<property name="testOnReturn" value="true" />
<property name="poolPreparedStatements" value="false" />
<property name="maxPoolPreparedStatementPerConnectionSize" value="-1" />
按照上述配置基本上不會(huì)出現(xiàn)連接長(zhǎng)時(shí)間與MySQL服務(wù)器無任何交互的可能性,除非連接池本身的處理存在問題(druid連接池為阿里高性能數(shù)據(jù)庫連接池),但是出于懷疑的態(tài)度,按照鏈接中的說法druid在處理長(zhǎng)時(shí)間連接等待方面可能不如c3p0好,也不排除druid本身存在問題。但出于謹(jǐn)慎角度,并沒有立刻采用對(duì)連接池進(jìn)行替換的措施,而是將焦點(diǎn)轉(zhuǎn)移到應(yīng)用服務(wù)器與MySQL服務(wù)器網(wǎng)絡(luò)連接上(也確實(shí)有人是因?yàn)榫W(wǎng)絡(luò)防火墻設(shè)置導(dǎo)致出現(xiàn)此類問題的)。
2)、網(wǎng)絡(luò)連接參數(shù)調(diào)整
由于阿里云底層網(wǎng)絡(luò)結(jié)構(gòu)并不對(duì)用戶透明,且遇到問題的應(yīng)用存在批量大數(shù)據(jù)插入的使用場(chǎng)景,確實(shí)是存在TCP緩沖區(qū)不夠的可能性,通過對(duì)MySQL官方文檔相關(guān)章節(jié)的閱讀,MySQL連接串參數(shù)中確實(shí)提供了tcpRcvBuf這樣的參數(shù),其對(duì)該參數(shù)的說明是:“connecting using TCP/IP, should the driver set SO_RCV_BUF to the given value? The default value of '0', means use the platform default value for this property)”。
于是覺得這種可能性會(huì)相對(duì)大些,所以在線上調(diào)整了連接參數(shù)而并沒有更換連接池(如若不行,可再嘗試替換連接池),參數(shù)調(diào)整如下:
jdbc:mysql://xxx.xxx.xx.xx:3306/xx?useUnicode=true&characterEncoding=UTF-8&tcpRcvBuf=1024000&autoReconnect=true&failOverReadOnly=false&connectTimeout=0
連接TCP緩沖區(qū)的大小被調(diào)整為1M。
對(duì)線程棧日志分析(2)
在4中通過對(duì)網(wǎng)絡(luò)連接參數(shù)的調(diào)整后,并沒有起到作用,于是對(duì)線程棧日志再次進(jìn)行了調(diào)取,并進(jìn)行了地毯式閱讀,終于發(fā)現(xiàn)了問題的癥結(jié):
at com.ninefbank.smallpay.clear.service.impl.OrgSettleFlowServiceImpl$$EnhancerBySpringCGLIB$$5775a83.isExistOrgSettleFlow(<generated>)
at com.ninefbank.smallpay.clear.util.FeeCountUtil.createOrgSettleFlow(FeeCountUtil.java:235)
- locked <0x000000078b4d8e98> (a java.lang.Class for com.ninefbank.smallpay.clear.util.FeeCountUtil)
at com.ninefbank.smallpay.clear.writer.OuterReconWriter.saveTaskErrorInfo(OuterReconWriter.java:126)
- locked <0x00000007880399f0> (a com.ninefbank.smallpay.clear.writer.OuterReconWriter)
at com.ninefbank.smallpay.clear.tasklet.task.OuterReconWriterParallTask.call(OuterReconWriterParallTask.java:70)
at com.ninefbank.smallpay.clear.tasklet.task.OuterReconWriterParallTask.call(OuterReconWriterPara
通過對(duì)比多日問題時(shí)段的線程棧日志,發(fā)現(xiàn)都存在兩處代碼都指向同一個(gè)方法,即isExistOrgSettleFlow,于是找到代碼所對(duì)應(yīng)的方法,發(fā)現(xiàn)是由于該方法高頻訪問了一個(gè)未加相關(guān)索引的大數(shù)據(jù)表,導(dǎo)致方法本身的執(zhí)行存在阻塞及鎖住的情況,占用了大量連接資源,從而導(dǎo)致4中所看到的其他方法拿不到數(shù)據(jù)庫連接出現(xiàn)長(zhǎng)時(shí)間等待,甚至超時(shí)的情況。加上索引后,隔日觀察問題消失。
總結(jié)
對(duì)陌生問題的分析,需要通過各類分析手段從而找到問題所在并進(jìn)行解決(這其中需要運(yùn)用各類分析工具Linux命令、JDK命令、網(wǎng)絡(luò)命令等等);
要心存思辨之心,對(duì)傳統(tǒng)的權(quán)威的東西在敬畏的同時(shí)也要大膽的懷疑并進(jìn)行分析、嘗試,切不可畏懼不前;
在編碼方面,對(duì)代碼類、方法的命名一定要合理、規(guī)范,不然日后出現(xiàn)疑難問題在排查時(shí),會(huì)給人造成錯(cuò)覺從而延長(zhǎng)解決時(shí)間;