Symptom

Monitoring shows that very few of Tomcat's threads are busy and the pool utilization is very low, yet the number of threads in the pool is large.

Does Tomcat's thread pool have no shrink-back mechanism?

[arthas@22]$ mbean | grep -i thread
Catalina:type=ThreadPool,name="http-nio-22441"
java.lang:type=Threading
Catalina:type=ThreadPool,name="http-nio-22441",subType=SocketProperties
[arthas@22]$ mbean Catalina:type=ThreadPool,name=*
OBJECT_NAME Catalina:type=ThreadPool,name="http-nio-22441"
----------------------------------------------------------------------------------
NAME VALUE
----------------------------------------------------------------------------------
currentThreadsBusy 2
sslImplementationName null
paused false
selectorTimeout 1000
modelerType org.apache.tomcat.util.net.NioEndpoint
connectionCount 46
acceptCount 2000
threadPriority 5
executorTerminationTimeoutMillis 5000
running true
currentThreadCount 916
sSLEnabled false
sniParseLimit 65536
maxThreads 2000
sslImplementation null
connectionTimeout 2000
tcpNoDelay true
maxConnections 20000
connectionLinger -1
keepAliveCount 1
keepAliveTimeout 5000
maxKeepAliveRequests 2000
localPort 22441
deferAccept false
useSendfile true
acceptorThreadCount 1
pollerThreadCount 2
daemon true
minSpareThreads 25
useInheritedChannel false
alpnSupported false
acceptorThreadPriority 5
bindOnInit true
pollerThreadPriority 5
port 22441
domain Catalina
name http-nio-22441
defaultSSLHostConfigName _default_

A few key values:

  • currentThreadsBusy 2
  • currentThreadCount 916
  • maxThreads 2000
  • minSpareThreads 25

Only 2 threads are doing work, yet the pool holds 916 threads. Why?

Repeated observations showed the same picture.

Cause

Where the mbean data comes from

First, figure out where the mbean values come from.

// org.apache.tomcat.util.net.AbstractEndpoint#init
// Register endpoint (as ThreadPool - historical name)
oname = new ObjectName(domain + ":type=ThreadPool,name=\"" + getName() + "\"");
Registry.getRegistry(null, null).registerComponent(this, oname, null);

  • currentThreadsBusy: the number of threads currently executing a task

    // org.apache.tomcat.util.net.AbstractEndpoint#getCurrentThreadsBusy
    public int getCurrentThreadsBusy() {
        Executor executor = this.executor;
        if (executor != null) {
            if (executor instanceof ThreadPoolExecutor) {
                return ((ThreadPoolExecutor) executor).getActiveCount();
            } else if (executor instanceof ResizableExecutor) {
                return ((ResizableExecutor) executor).getActiveCount();
            } else {
                return -1;
            }
        } else {
            return -2;
        }
    }

  • currentThreadCount: the current number of threads in the pool

    // org.apache.tomcat.util.net.AbstractEndpoint#getCurrentThreadCount
    public int getCurrentThreadCount() {
        Executor executor = this.executor;
        if (executor != null) {
            if (executor instanceof ThreadPoolExecutor) {
                return ((ThreadPoolExecutor) executor).getPoolSize();
            } else if (executor instanceof ResizableExecutor) {
                return ((ResizableExecutor) executor).getPoolSize();
            } else {
                return -1;
            }
        } else {
            return -2;
        }
    }

  • maxThreads: the maximum number of threads

    // org.apache.tomcat.util.net.AbstractEndpoint#getMaxThreads
    public int getMaxThreads() {
        if (internalExecutor) {
            return maxThreads;
        } else {
            return -1;
        }
    }

  • minSpareThreads: the core pool size

    // org.apache.tomcat.util.net.AbstractEndpoint#getMinSpareThreads
    public int getMinSpareThreads() {
        return Math.min(getMinSpareThreadsInternal(), getMaxThreads());
    }

    private int getMinSpareThreadsInternal() {
        if (internalExecutor) {
            return minSpareThreads;
        } else {
            return -1;
        }
    }

Initialization logic of the default thread pool:

// org.apache.tomcat.util.net.AbstractEndpoint#createExecutor
public void createExecutor() {
    // use the internal thread pool
    internalExecutor = true;
    TaskQueue taskqueue = new TaskQueue();
    TaskThreadFactory tf = new TaskThreadFactory(getName() + "-exec-", daemon, getThreadPriority());
    // note: this ThreadPoolExecutor is Tomcat's own modified version, not java.util.concurrent's
    executor = new ThreadPoolExecutor(getMinSpareThreads(), getMaxThreads(), 60, TimeUnit.SECONDS, taskqueue, tf);
    taskqueue.setParent((ThreadPoolExecutor) executor);
}

Looking at this initialization, it turns out that minSpareThreads is simply corePoolSize. The keepAliveTime is hard-coded to 60 seconds, and the task queue is unbounded.

The thread pool's keepAliveTime

First, the comment in the JDK:

@param keepAliveTime when the number of threads is greater than
the core, this is the maximum time that excess idle threads
will wait for new tasks before terminating.

In short, a thread beyond the core count that waits keepAliveTime without receiving a task gets terminated.
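
To see this culling behavior in isolation, here is a minimal sketch using a plain JDK ThreadPoolExecutor (the class name, pool sizes, and the 2-second keepAliveTime are arbitrary demo values, not anything from Tomcat):

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class KeepAliveDemo {
    public static void main(String[] args) throws InterruptedException {
        // core = 1, max = 4, keepAliveTime = 2s
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 4, 2, TimeUnit.SECONDS, new SynchronousQueue<>());

        // a burst of 4 concurrent tasks forces the pool to its maximum
        for (int i = 0; i < 4; i++) {
            pool.execute(() -> {
                try { Thread.sleep(500); } catch (InterruptedException ignored) { }
            });
        }
        System.out.println("after burst: poolSize = " + pool.getPoolSize()); // 4

        // stay idle for longer than keepAliveTime: the 3 excess threads exit
        Thread.sleep(3_000);
        System.out.println("after idle:  poolSize = " + pool.getPoolSize()); // 1 (core)
        pool.shutdown();
    }
}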

A look at the implementation:

// java.util.concurrent.ThreadPoolExecutor#runWorker
try {
    // note: once getTask() returns null, this loop ends and the thread exits
    while (task != null || (task = getTask()) != null) {
        // ... omitted ...
        task.run();
    }
    completedAbruptly = false;
} finally {
    // cleanup work for the exiting thread
    processWorkerExit(w, completedAbruptly);
}

// the task-fetching logic
// java.util.concurrent.ThreadPoolExecutor#getTask
for (;;) {
    // Are workers subject to culling?
    boolean timed = allowCoreThreadTimeOut || wc > corePoolSize;

    // if timeouts apply and one occurred, return null here; the worker
    // loop above then ends and the thread exits
    if ((wc > maximumPoolSize || (timed && timedOut))
        && (wc > 1 || workQueue.isEmpty())) {
        if (compareAndDecrementWorkerCount(c))
            return null;
        continue;
    }

    try {
        // if the thread may time out (allowCoreThreadTimeOut, or worker count >
        // corePoolSize), poll with keepAliveTime as the timeout;
        // otherwise use the blocking take
        Runnable r = timed ?
            workQueue.poll(keepAliveTime, TimeUnit.NANOSECONDS) :
            workQueue.take();
        // got a task from poll or take: return it
        if (r != null)
            return r;
        // reaching this point means the poll timed out
        timedOut = true;
    } catch (InterruptedException retry) {
        timedOut = false;
    }
}

From the source, keepAliveTime itself behaves exactly as documented.

ReentrantLock

Could it be that the task queue's poll spreads the tasks evenly across all the threads?

When you have eliminated the impossible, whatever remains, however improbable, must be the truth.

Tomcat uses TaskQueue as the work queue, which extends LinkedBlockingQueue, but the core poll logic is still LinkedBlockingQueue's:

// org.apache.tomcat.util.threads.TaskQueue#poll
@Override
public Runnable poll(long timeout, TimeUnit unit)
        throws InterruptedException {
    Runnable runnable = super.poll(timeout, unit);
    if (runnable == null && parent != null) {
        // the poll timed out, it gives an opportunity to stop the current
        // thread if needed to avoid memory leaks.
        parent.stopCurrentThreadIfNeeded();
    }
    return runnable;
}

// java.util.concurrent.LinkedBlockingQueue#poll(long, java.util.concurrent.TimeUnit)
public E poll(long timeout, TimeUnit unit) throws InterruptedException {
    E x = null;
    int c = -1;
    long nanos = unit.toNanos(timeout);
    final AtomicInteger count = this.count;
    final ReentrantLock takeLock = this.takeLock;
    // lock scope begins
    takeLock.lockInterruptibly();
    try {
        while (count.get() == 0) {
            // no time left (no timeout set, or the timeout expired): return null
            if (nanos <= 0)
                return null;
            // otherwise park in the ReentrantLock's condition queue for up to the remaining time
            nanos = notEmpty.awaitNanos(nanos);
        }
        // count > 0 here: dequeue one element
        x = dequeue();
        // decrement the count
        c = count.getAndDecrement();
        // if elements remain, signal the next thread waiting in the condition queue
        if (c > 1)
            notEmpty.signal();
    } finally {
        // lock scope ends
        takeLock.unlock();
    }
    // the poll removed one element, so a previously full queue now has room: signalNotFull
    if (c == capacity)
        signalNotFull();
    return x;
}

The key lies in takeLock and notEmpty: takeLock is a ReentrantLock (non-fair by default), and notEmpty is a Condition created from takeLock.

// java.util.concurrent.LinkedBlockingQueue

/** Lock held by take, poll, etc */
private final ReentrantLock takeLock = new ReentrantLock();

/** Wait queue for waiting takes */
private final Condition notEmpty = takeLock.newCondition();

ReentrantLock is non-fair by default and is built on AQS. Fair and non-fair differ only in the very first attempt to grab the lock; a thread that fails that first attempt joins the queue, and the lock is then handed over in queue order.

// java.util.concurrent.locks.ReentrantLock.Sync#nonfairTryAcquire
/**
 * Performs non-fair tryLock. tryAcquire is implemented in
 * subclasses, but both need nonfair try for trylock method.
 */
@ReservedStackAccess
final boolean nonfairTryAcquire(int acquires) {
    final Thread current = Thread.currentThread();
    int c = getState();
    if (c == 0) {
        // non-fair: barge in and try to grab the lock once
        if (compareAndSetState(0, acquires)) {
            setExclusiveOwnerThread(current);
            return true;
        }
    }
    // if that failed, check whether this thread already holds the lock (reentrancy)
    else if (current == getExclusiveOwnerThread()) {
        int nextc = c + acquires;
        if (nextc < 0) // overflow
            throw new Error("Maximum lock count exceeded");
        setState(nextc);
        return true;
    }
    // acquisition failed: return false
    return false;
}

At low QPS, lock contention is light. Most threads acquire the lock only to find no task, and end up waiting in the condition queue.

// java.util.concurrent.locks.AbstractQueuedSynchronizer.ConditionObject#signal
/**
 * Moves the longest-waiting thread, if one exists, from the
 * wait queue for this condition to the wait queue for the
 * owning lock.
 *
 * @throws IllegalMonitorStateException if {@link #isHeldExclusively}
 *         returns {@code false}
 */
public final void signal() {
    if (!isHeldExclusively())
        throw new IllegalMonitorStateException();
    Node first = firstWaiter;
    if (first != null)
        doSignal(first);
}

The condition queue notifies in arrival order (the longest-waiting thread first): the waiting node is transferred from the condition queue to the lock's wait queue, where it competes for the lock again.

At that point there are very few competitors, essentially the busy threads plus the thread that was just signalled, so the woken thread almost certainly gets the task.
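
The FIFO handoff is easy to observe directly. A minimal sketch (the class name and counts are mine): five threads await on a Condition in a known order, and each signal wakes exactly the longest-waiting one.

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class FifoSignalDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();  // non-fair, like takeLock
        Condition notEmpty = lock.newCondition();

        // start 5 waiters one at a time, so their order in the
        // condition queue is deterministic: waiter-0 first, etc.
        for (int i = 0; i < 5; i++) {
            final int id = i;
            new Thread(() -> {
                lock.lock();
                try {
                    notEmpty.await();  // park in the condition queue
                                       // (demo only: real code should await in a predicate loop)
                    System.out.println("woke up: waiter-" + id);
                } catch (InterruptedException ignored) {
                } finally {
                    lock.unlock();
                }
            }).start();
            Thread.sleep(100);         // ensure enqueue order
        }

        // signal 5 times; the wake order matches the enqueue order: 0, 1, 2, 3, 4
        for (int i = 0; i < 5; i++) {
            lock.lock();
            try {
                notEmpty.signal();
            } finally {
                lock.unlock();
            }
            Thread.sleep(100);
        }
    }
}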

Experiment

The root of the problem: when tasks are scarce, the idle workers all line up in the notEmpty condition queue, and incoming tasks are handed out in that queue order. If the QPS and keepAliveTime line up, every worker thread gets at least one task within each keepAliveTime window, and therefore never gets reclaimed.
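
The same effect reproduces with a plain JDK ThreadPoolExecutor, no Tomcat involved. A sketch under assumed demo numbers (10 workers, a deliberately short 2-second keepAliveTime, and an unbounded FIFO queue, mirroring the shape of Tomcat's pool; the class name is mine):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class NoShrinkDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                10, 10, 2, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true);  // make every worker reclaimable
        pool.prestartAllCoreThreads();      // the "spike": all 10 workers exist

        // above the threshold maxThreads / keepAliveTime = 10 / 2s = 5 tasks/s:
        // FIFO handoff gives each worker a task within its 2s window
        trickle(pool, 150, 45);
        System.out.println("fast trickle: poolSize = " + pool.getPoolSize()); // still 10

        // well below the threshold: workers start timing out one by one
        trickle(pool, 500, 14);
        System.out.println("slow trickle: poolSize = " + pool.getPoolSize()); // < 10
        pool.shutdown();
    }

    private static void trickle(ThreadPoolExecutor pool, long periodMs, int n)
            throws InterruptedException {
        for (int i = 0; i < n; i++) {
            pool.execute(() -> { /* trivially short task */ });
            Thread.sleep(periodMs);
        }
    }
}

With the fast trickle (one task every 150 ms, above the 5 tasks/s threshold) the pool stays at 10; with the slow trickle it shrinks.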

Queuing in order

Set maxThreads to 10 and log the name of the thread handling each request. Test code:

@Override
protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    LOGGER.error("thread is " + Thread.currentThread().getName());
    try {
        Thread.sleep(1_000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    resp.getWriter().write("Hello World! " + Thread.currentThread().getName());
}

curl 10 times, serially:

for i in `seq 1 10`; do curl "http://localhost:8087/web_war_exploded/hello" && echo -e '\n'; done;

Output:

➜  conf  for i in `seq 1 10`; do curl "http://localhost:8087/web_war_exploded/hello" && echo -e '\n'; done;
Hello World! http-nio-8087-exec-8

Hello World! http-nio-8087-exec-9

Hello World! http-nio-8087-exec-1

Hello World! http-nio-8087-exec-2

Hello World! http-nio-8087-exec-3

Hello World! http-nio-8087-exec-4

Hello World! http-nio-8087-exec-5

Hello World! http-nio-8087-exec-7

Hello World! http-nio-8087-exec-9

Hello World! http-nio-8087-exec-10

The requests are indeed handed out in a roughly round-robin fashion.

Threads do shrink back

Tomcat's default pool has a keepAliveTime of 60 s. Set maxThreads to 10 and minSpareThreads to 3.

mbean output right after startup:

[arthas@98537]$ mbean Catalina:type=ThreadPool,name=*
OBJECT_NAME Catalina:type=ThreadPool,name="http-nio-8087"
--------------------------------------------------------------------------------------------------------------------------------
NAME VALUE
--------------------------------------------------------------------------------------------------------------------------------
currentThreadsBusy 0
running true
currentThreadCount 3
maxThreads 10
minSpareThreads 3

This matches the settings. Now send a burst of requests to create all 10 workers (maxThreads):

for i in `seq 1 10`; do curl -s "http://localhost:8087/web_war_exploded/hello" & done;

mbean output at this point:

[arthas@98537]$ mbean Catalina:type=ThreadPool,name=*
OBJECT_NAME Catalina:type=ThreadPool,name="http-nio-8087"
--------------------------------------------------------------------------------------------------------------------------------
NAME VALUE
--------------------------------------------------------------------------------------------------------------------------------
currentThreadsBusy 0
running true
currentThreadCount 10
maxThreads 10
minSpareThreads 3

currentThreadCount is now 10. Wait one minute, then look again:

[arthas@98537]$ mbean Catalina:type=ThreadPool,name=*
OBJECT_NAME Catalina:type=ThreadPool,name="http-nio-8087"
--------------------------------------------------------------------------------------------------------------------------------
NAME VALUE
--------------------------------------------------------------------------------------------------------------------------------
currentThreadsBusy 0
running true
currentThreadCount 3
maxThreads 10
minSpareThreads 3

currentThreadCount has dropped back to 3 (minSpareThreads).

Threads don't shrink back

To keep the threads from shrinking back, each thread just needs one task per minute. With maxThreads at 10, that is 10 requests per minute.

First, spike the load:

for i in `seq 1 10`; do curl -s "http://localhost:8087/web_war_exploded/hello" & done;

Then hold 10 requests per minute:

for i in `seq 1 100000`; do curl -s "http://localhost:8087/web_war_exploded/hello" && echo "-n" && sleep 5; done;

The servlet sleeps 1 s and curl sleeps 5 s, so one request takes about 6 s, i.e. 10 requests per minute. The mbean output now:

[arthas@98537]$ mbean Catalina:type=ThreadPool,name=*
OBJECT_NAME Catalina:type=ThreadPool,name="http-nio-8087"
--------------------------------------------------------------------------------------------------------------------------------
NAME VALUE
--------------------------------------------------------------------------------------------------------------------------------
currentThreadsBusy 0
running true
currentThreadCount 10
maxThreads 10
minSpareThreads 3

It stays at 10, just like production: the no-shrink behavior is reproduced.

Increase the sleep to lower the request rate and see whether the pool partially shrinks:

for i in `seq 1 100000`; do curl -s "http://localhost:8087/web_war_exploded/hello" && echo "-n" && sleep 7; done;

It gradually dropped back to 8 threads:

[arthas@98537]$ mbean Catalina:type=ThreadPool,name=* | grep -i currentThreadCount
currentThreadCount 8

Solution

The critical QPS is maxThreads / keepAliveTime; factoring in request processing time, the real threshold may be slightly higher. Above it no threads get reclaimed; below it the pool gradually shrinks.
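
Plugging in the numbers from the production endpoint at the top: maxThreads = 2000 and keepAliveTime = 60 s give a threshold of roughly 2000 / 60 ≈ 33 QPS. Conversely, sustaining the observed 916 threads only takes about 916 / 60 ≈ 15 QPS, which would explain the steady state if the service was receiving traffic in that range: even a fairly quiet service can keep hundreds of threads alive indefinitely.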

  • Adjust keepAliveTime

With Tomcat's default internal pool, keepAliveTime cannot be changed, but you can switch to a named Executor, which supports maxIdleTime (its equivalent of keepAliveTime):

<!--The connectors can use a shared executor, you can define one or more named thread pools-->
<Executor name="tomcatThreadPool" namePrefix="catalina-exec-"
          maxThreads="10" minSpareThreads="3" maxIdleTime="10000"/>
<Connector executor="tomcatThreadPool"
           port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" />

After setting it to 10 s and holding the load at 10 requests per minute, the pool shrank back quickly:

[arthas@54257]$ mbean Catalina:type=ThreadPool,name=* | grep -i currentThreadCount
currentThreadCount 3

Summary

  • Tomcat's pool dispatches requests through TaskQueue, whose poll logic is the same as the parent LinkedBlockingQueue's
  • Inside LinkedBlockingQueue, when there are no tasks, every polling thread waits, in order, in the notEmpty Condition queue of the ReentrantLock
  • When a task arrives, signal wakes threads in queue order, first in first out
  • If qps > maxThreads / keepAliveTime, every thread gets a task within each keepAliveTime window and thus escapes reclamation
  • Tomcat's default pool does not support setting keepAliveTime; a custom (named) Executor solves this
  • JDK thread pools have the same issue, so mind the keepAliveTime setting (see the sketch after this list)
  • The resulting thread rotation means more context switches, which likely costs some performance as well
  • Online services usually have liveness/health probes, which is another reason threads never shrink back
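
For plain JDK pools, the corresponding knobs sit on ThreadPoolExecutor itself. A minimal sketch (the class name and numbers are mine):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TunedPool {
    public static ThreadPoolExecutor create() {
        // core 3, max 10, and a 10s keepAliveTime instead of Tomcat's hard-coded 60s
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                3, 10, 10, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        // keepAliveTime can also be changed on a live pool:
        pool.setKeepAliveTime(10, TimeUnit.SECONDS);
        // optionally let even core threads be reclaimed when fully idle:
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }
}

Note that with an unbounded queue a stock JDK pool never grows beyond corePoolSize in the first place, which is exactly why Tomcat ships its own TaskQueue / ThreadPoolExecutor variant.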

References