5497 lock contention on arcs_mtx

Review Request #151 - Created Dec. 30, 2014 and submitted

Information
Matthew Ahrens
illumos-gate
5497
b8e9ad5...
Reviewers
csiden, prakashsurya
5497 lock contention on arcs_mtx
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>

Original author: Prakash Surya


This patch attempts to reduce lock contention on the current arc_state_t
mutexes. These mutexes are used liberally to protect the LRU lists that
make up each ARC state (e.g. ARC_mru, ARC_mfu, etc). The granularity at
which these locks are acquired has been shown to greatly affect the
performance of highly concurrent, cached workloads.

To illustrate this performance impact, I used fio to run a benchmark
on a system with 32GB of RAM, varying the number of CPUs on the system
from 1 to 32 and testing each power-of-two CPU count. The workload
remained constant for each test and consisted of 32 threads, each
randomly reading from its own unique file; the files held 16GB of data
in total (each file was 512MB in size). The ZFS filesystem was
configured with a recordsize of 8K, and the fio threads were configured
to use a block size of 8K.

Without this patch, when running with 32 CPUs, fio reported 388.6 thousand
IOPS during a 60 second run of the workload, which equates to about
3036.2 MB/s of aggregate read bandwidth. With this patch applied (again,
using 32 CPUs), fio reported 1266.8 thousand IOPS during a 60 second run,
which equates to about 9896.6 MB/s of aggregate read bandwidth. That's a
225% increase in aggregate read bandwidth when using 32 CPUs.

As I mentioned above, I also ran the same workload while varying the
number of CPUs on the system, ranging from 1 CPU to 32 CPUs. The table
below shows the aggregate bandwidth results from those tests when running
on master, and the results when running with this patch applied:

+--------+----------------------+----------------------+
| # CPUS | Bandwidth (master)   | Bandwidth (patched)  |
+--------+----------------------+----------------------+
|      1 | 840893KB/s           | 833163KB/s           |
|      2 | 1559.5MB/s           | 1520.2MB/s           |
|      4 | 3022.1MB/s           | 2985.3MB/s           |
|      8 | 5104.9MB/s           | 5913.2MB/s           |
|     16 | 3854.3MB/s           | 8938.9MB/s           |
|     32 | 3036.2MB/s           | 9896.6MB/s           |
+--------+----------------------+----------------------+

As you can see, the aggregate bandwidth scales much better with additional
CPUs when this patch is applied than it does without it.

To achieve this, the single LRU and mutex used for each of the arc
states (e.g. ARC_mru, ARC_mfu, etc) has been split up into multiple
LRU lists, each with its own mutex. So, for the above test, 32
"sublists" were used for each arc state; one sublist per CPU on the
system. The multilist_t object was introduced to make this transition
from one LRU to many LRUs.
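
To make the structure concrete, here is a minimal user-space sketch of
the multilist idea, written with pthreads and <sys/queue.h> rather than
the kernel primitives. Apart from the multilist_t name mentioned above,
everything here (sublist_t, elem_t, ml_create, ml_insert) is
illustrative only and is not the actual multilist.c API:

/*
 * Minimal user-space sketch: a single LRU is split into several
 * sublists, each guarded by its own mutex, so concurrent inserts into
 * different sublists never contend with each other.
 */
#include <pthread.h>
#include <stdlib.h>
#include <sys/queue.h>

typedef struct elem {
        TAILQ_ENTRY(elem) link;
        unsigned long key;              /* e.g. a hash of the buffer's identity */
} elem_t;

TAILQ_HEAD(elem_list, elem);            /* named list-head type */

typedef struct sublist {
        pthread_mutex_t lock;           /* per-sublist mutex */
        struct elem_list list;          /* per-sublist LRU */
} sublist_t;

typedef struct multilist {
        unsigned num_sublists;
        sublist_t *sublists;
} multilist_t;

static multilist_t *
ml_create(unsigned num_sublists)
{
        multilist_t *ml = malloc(sizeof (*ml));

        ml->num_sublists = num_sublists;
        ml->sublists = calloc(num_sublists, sizeof (sublist_t));
        for (unsigned i = 0; i < num_sublists; i++) {
                pthread_mutex_init(&ml->sublists[i].lock, NULL);
                TAILQ_INIT(&ml->sublists[i].list);
        }
        return (ml);
}

/*
 * Insert at the head (MRU end) of the sublist chosen by the element's
 * key; only that one sublist's lock is taken.
 */
static void
ml_insert(multilist_t *ml, elem_t *e)
{
        sublist_t *sl = &ml->sublists[e->key % ml->num_sublists];

        pthread_mutex_lock(&sl->lock);
        TAILQ_INSERT_HEAD(&sl->list, e, link);
        pthread_mutex_unlock(&sl->lock);
}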

In addition, the logic performing reclamation from the ARC was reworked.
Now, only the arc_reclaim_thread will call the arc_evict() and
arc_evict_ghost() functions, whereas arc_evict() could previously be
called from arc_get_data_buf() (although it can still be called from
dsl_pool_close() and "zinject -a"). This was done to prevent multiple
threads from contending on the arc state locks by all trying to reclaim
simultaneously. A secondary reason for this change is that it
subjectively makes the code easier to understand, as it removes some
edge cases (e.g. the "recycle" case, and the mru vs. mfu decision in
arc_get_data_buf()).
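
Roughly, that hand-off has the following shape. This is a user-space
pthreads sketch with hypothetical names (get_data_buf_wait,
reclaim_thread, evict_some), not the actual arc.c code, and it ignores
details such as growth targets and ghost lists:

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t reclaim_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t reclaim_cv = PTHREAD_COND_INITIALIZER;    /* wakes the reclaim thread */
static pthread_cond_t waiters_cv = PTHREAD_COND_INITIALIZER;    /* wakes blocked consumers */
static uint64_t cache_size, cache_target;
static uint64_t overflow_limit;         /* e.g. something like cache_target >> shift */

/* Placeholder for the real eviction work done by the reclaim thread. */
static void
evict_some(void)
{
        pthread_mutex_lock(&reclaim_lock);
        cache_size = cache_target;      /* pretend we evicted back to the target */
        pthread_mutex_unlock(&reclaim_lock);
}

/* Consumer side (the arc_get_data_buf() analogue): never evicts directly. */
static void
get_data_buf_wait(uint64_t size)
{
        pthread_mutex_lock(&reclaim_lock);
        pthread_cond_signal(&reclaim_cv);       /* kick the reclaim thread */
        while (cache_size > cache_target + overflow_limit)
                pthread_cond_wait(&waiters_cv, &reclaim_lock);  /* wait for space */
        cache_size += size;
        pthread_mutex_unlock(&reclaim_lock);
}

/* The single thread that performs eviction. */
static void *
reclaim_thread(void *arg)
{
        for (;;) {
                pthread_mutex_lock(&reclaim_lock);
                while (cache_size <= cache_target + overflow_limit)
                        pthread_cond_wait(&reclaim_cv, &reclaim_lock);
                pthread_mutex_unlock(&reclaim_lock);

                evict_some();

                pthread_mutex_lock(&reclaim_lock);
                pthread_cond_broadcast(&waiters_cv);    /* let consumers re-check */
                pthread_mutex_unlock(&reclaim_lock);
        }
        return (arg);
}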

One concern that this change introduces is "fairness" in the reclaim
logic. Now that each arc state is composed of multiple independent but
related LRUs, it's more difficult to determine which object is the
"oldest" across all of the LRUs composing each arc state. Thus, it is
possible for a buffer that isn't strictly the "oldest" in the set of
LRUs to be evicted out of order (e.g. before the "oldest" buffer). I've
reasoned this concern away by assuming (and measuring during testing)
that buffers will be inserted into each arc state's LRU in an even
distribution across all sublists; thus, simply evicting the "oldest"
buffer(s) in one randomly selected sublist should be "good enough".
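
Continuing the hypothetical multilist sketch above (same types and
headers), eviction under this scheme can be as simple as picking one
sublist at random and removing from its tail, i.e. the least recently
inserted end:

/*
 * Evict a single "oldest" element from one randomly chosen sublist;
 * with an even insert distribution this approximates LRU across the
 * whole state while only taking one sublist lock.
 */
static elem_t *
ml_evict_one(multilist_t *ml)
{
        sublist_t *sl = &ml->sublists[rand() % ml->num_sublists];
        elem_t *e;

        pthread_mutex_lock(&sl->lock);
        e = TAILQ_LAST(&sl->list, elem_list);   /* tail = least recently inserted */
        if (e != NULL)
                TAILQ_REMOVE(&sl->list, e, link);
        pthread_mutex_unlock(&sl->lock);
        return (e);
}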

To work around a potential 3-way deadlock involving the multilist sublist
locks and the l2ad_mtx, the l2ad_mtx locking had to be slightly reworked
(we need to grab the sublist lock prior to the l2ad_mtx) in
l2arc_write_buffers. To accommodate these locking changes, the reclaim
logic surrounding l2arc buffers needed to change as well, in order to fix
a couple of regressions the locking changes would otherwise introduce.
The locking changes are straightforward, so I'll refrain from discussing
them here, but the changes to the reclaim logic are more noteworthy:

1. We need to be careful not to evict a header to the arc_l2c_only
   state while it's concurrently being written to the l2arc device.
   Since the buffer isn't written out with the hash lock held, the
   reclaim logic must explicitly account for this case. It's not a
   common case, but it is possible; when it happens, the buffer is
   simply skipped during eviction.
 
2. Now that the l2ad_mtx isn't held throughout the whole execution of
   l2arc_write_buffers, we must be careful to place a header that is
   realloc'ed to the arc_l2c_only state back into the l2ad's buflist in
   the same location. Previously, the new header would be inserted at
   the head of the buflist, which would allow the header to be re-written
   out by l2arc_write_buffers (even though it no longer has a b_l1hdr).
 
3. And finally, if we encounter a hash lock collision in
   l2arc_write_done(), we wait for the lock to become uncontended and
   then retry (this pattern is sketched just after this list). Unlike
   the previous behavior, we can't simply skip the header, leaving its
   ARC_FLAG_L2_WRITING bit set, since that would prevent the buffer
   from being evicted from the ghost list until it was evicted from
   the L2ARC device.
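
Here is the retry pattern from (3), sketched in user-space pthreads
with hypothetical names (list_lock, walk_entries, entry_lock). The real
code walks the l2ad buflist under l2ad_mtx and try-locks the headers'
hash locks, but the shape of the try-lock/drop/wait/retry dance is the
same:

#include <pthread.h>

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;   /* l2ad_mtx analogue */

/*
 * Walk every entry, taking each entry's own lock (hash lock analogue)
 * via trylock only.  process() should be idempotent, since a collision
 * restarts the walk from the beginning in this simplified sketch.
 */
static void
walk_entries(void *entries[], int n,
    pthread_mutex_t *(*entry_lock)(void *), void (*process)(void *))
{
top:
        pthread_mutex_lock(&list_lock);
        for (int i = 0; i < n; i++) {
                pthread_mutex_t *el = entry_lock(entries[i]);

                if (pthread_mutex_trylock(el) != 0) {
                        /*
                         * Blocking here while still holding list_lock
                         * would recreate the lock-order inversion, so
                         * drop list_lock, wait for the entry lock to
                         * become uncontended, then retry the walk.
                         */
                        pthread_mutex_unlock(&list_lock);
                        pthread_mutex_lock(el);
                        pthread_mutex_unlock(el);
                        goto top;
                }
                process(entries[i]);
                pthread_mutex_unlock(el);
        }
        pthread_mutex_unlock(&list_lock);
}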

The last big change is that the call to arc_do_user_evicts() was
removed from arc_reclaim_thread() and split out into a completely
separate thread. This was due to a deadlock introduced as a result
of the change to arc_get_data_buf(). When arc_get_data_buf() sleeps
waiting for the eviction thread to evict buffers, it does so holding the
hash lock. The eviction function is carefully crafted so as not to sleep
on any arc header's hash lock (e.g. by using mutex_tryenter). Unfortunately,
the callbacks in arc_do_user_evicts() are not so careful, so we must
execute them in such a way that they cannot block the reclaim thread
while it waits to acquire a header's hash lock.
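
A sketch of that split, again in user-space pthreads with hypothetical
names (defer_callback, callback_thread): the reclaim path only queues
work, and a dedicated thread runs the potentially blocking callbacks,
so the reclaim thread itself can never be blocked by them:

#include <pthread.h>
#include <stdlib.h>

typedef struct cb {
        struct cb *next;
        void (*func)(void *);
        void *arg;
} cb_t;

static pthread_mutex_t cb_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cb_cv = PTHREAD_COND_INITIALIZER;
static cb_t *cb_head;

/* Called from the reclaim path: queue the callback and return immediately. */
static void
defer_callback(void (*func)(void *), void *arg)
{
        cb_t *c = malloc(sizeof (*c));

        c->func = func;
        c->arg = arg;
        pthread_mutex_lock(&cb_lock);
        c->next = cb_head;
        cb_head = c;
        pthread_cond_signal(&cb_cv);
        pthread_mutex_unlock(&cb_lock);
}

/* Body of the dedicated callback thread. */
static void *
callback_thread(void *unused)
{
        for (;;) {
                pthread_mutex_lock(&cb_lock);
                while (cb_head == NULL)
                        pthread_cond_wait(&cb_cv, &cb_lock);
                cb_t *c = cb_head;
                cb_head = c->next;
                pthread_mutex_unlock(&cb_lock);

                c->func(c->arg);        /* may block; the reclaim thread is unaffected */
                free(c);
        }
        return (unused);
}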

A few miscellaneous notes:

 * A couple of new static dtrace probes were introduced to help measure
the evenness of a given multilist structure: 'multilist-insert' and
'multilist-remove'.

 * The tunable zfs_arc_num_sublists_per_state was added to allow a user
to configure the number of sublists to use for each arc state.

 * The tunable zfs_arc_overflow_shift was added to configure how far
past arc_c we're allowed to grow before arc_get_data_buf() will
sleep and wait for the reclaim thread to free up some space (a rough
sketch of this check is given just after this list).

 * The tunable arc_evict_iterations has been removed and replaced with
zfs_arc_evict_batch_limit, which defines the number of headers to
evict from each sublist before moving on to another sublist.

 * A couple of new arcstats were added to track the new L2ARC reclaim
logic. These stats are "evict_l2_skip" and "l2_writes_lock_retry".

 * A new kstat was added, "evict_not_enough", to track the number of
times arc_evict_state() is not able to meet its target number of
bytes to evict from a given arc state.
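
For reference, the zfs_arc_overflow_shift check described in the list
above boils down to something like the following sketch. It is
hypothetical and simplified; the floor constant, the default shift
value, and the exact clamping used in arc.c may differ:

#include <stdint.h>
#include <stdbool.h>

#define MIN_OVERFLOW    (128ULL * 1024) /* illustrative floor (a maximum block size) */

static uint64_t arc_size;               /* current ARC size */
static uint64_t arc_c;                  /* ARC target size */
static int zfs_arc_overflow_shift = 8;  /* illustrative default */

/*
 * Returns true once the ARC has grown more than arc_c >> shift (but at
 * least MIN_OVERFLOW) past its target; this is the condition under
 * which arc_get_data_buf() would block and wait for the reclaim thread.
 */
static bool
arc_is_overflowing_sketch(void)
{
        uint64_t overflow = arc_c >> zfs_arc_overflow_shift;

        if (overflow < MIN_OVERFLOW)
                overflow = MIN_OVERFLOW;

        return (arc_size >= arc_c + overflow);
}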

zfs test suite
ztest

internal link: http://jenkins/job/zfs-precommit/1477/

In addition to the normal regression tests using git-zfs-precommit, I ran
this patch through numerous hours of "zloop" and "zone" using a couple of
basic VMs.

Both "zloop" and "zone" were run for well over 24 hours.

I also ran the zfs-test suite manually in an infinite loop in another
two VMs for well over 24 hours. I didn't verify all of the logs
to see whether the failed tests are all "known" failures, but I didn't hit
any panics, assertions, or deadlocks either (which is what this testing
was primarily intended to catch).

========================================================================
This patch was manually tested using a couple of FIO workloads. First,
the workload used to generate the performance numbers quoted in the
description is given here:

[global]
group_reporting=1
fallocate=none
ioengine=psync
numjobs=32
iodepth=1
bs=8k
filesize=512m
size=512m
randrepeat=0
use_os_rand=1

[32threads]
rw=randread
time_based
runtime=1m
directory=/tank/fish

========================================================================
Additionally, a branch with this patch applied was stress tested using
the following FIO workload to try and catch any remaining issues that may
have been missed. This workload was run for well over 24 hours while also
periodically running "zinject -a". It was also run on a pool with and
without an L2ARC cache device.

[global]
group_reporting=1
fallocate=none
ioengine=psync
iodepth=1
bs=8k
randrepeat=0
use_os_rand=1
norandommap
loops=1048576
runtime=24h
time_based
directory=/tank/fish
numjobs=32
filesize=512m
size=512m

[write]
rw=write

[randwrite]
rw=randwrite

[read]
rw=read

[randread]
rw=randread

[rw,readwrite]
rw=readwrite

[randrw]
rw=randrw

========================================================================
Another long running stress test that I ran over a weekend was a slight
variation on the fio test I posted just above. The difference was that
instead of running a single FIO instance which spawned 192 threads, I used
6 independent FIO tests, each of which used 32 threads. Each FIO test ran one
of the unique "rw" types (e.g. one 32 thread test ran "write", another ran
"randwrite", another ran "read", etc.) for 30 minutes. In addition, each
test ran in its own independent pool, and after the 30 minute FIO test the
pool would be exported, reimported, and the fio test would run again. So,
the flow of the test was:

while true; do
        zpool export $POOL
        zpool import $POOL
        fio $FILE
done

This test ran for about 18 hours (can't remember the exact time I started)
and no deadlocks, ASSERTs, or other malformed behavior was noticed. Also,
I intentionally crashed the system with an NMI and ran '::findleaks' after
the ~18 hours of run time; nothing of interest came up:

::findleaks
CACHE LEAKED BUFCTL CALLER
ffffff090042b888 1 ffffff096c9a6e68 au_pathdup+0x62
ffffff090042c448 1 ffffff0a528bfb00 au_pathdup+0x62
ffffff0900430008 2 ffffff0967354820 impl_acc_hdl_alloc+0x3d
ffffff090042b008 2 ffffff0966338c30 impl_acc_hdl_alloc+0x52
ffffff090042e008 2 ffffff0967355ca8 impl_acc_hdl_alloc+0x6b

           Total 8 buffers, 1080 bytes

Review request changed

Status: Closed (submitted)

Jorgen Lundman

usr/src/uts/common/fs/zfs/arc.c (Diff revision 3)

O3X build complains about leaked memory from "multilist_create(arc_l2c_only->arcs_list" and it appears a call to multilist_destroy for them does not exist. Please see O3X commit:
https://github.com/openzfsonosx/zfs/commit/843c9e6ecb8914710ae2299ffd24e03913ef8bc4

  1. Yup, good catch. Now I need to think about why we didn't catch this in testing. It's a little different for us, since we don't actually ever call the destroy function when using ZFS on root, but I'm wondering if there's a good way to test for this using ztest or something.
