3525 Persistent L2ARC

Review Request #267 - Created Oct. 26, 2015 and updated

Information
Saso Kiselkov
illumos-gate
Reviewers
openzfs
csiden

This is an upstream port of the Persistent L2ARC feature from Nexenta's repo.

Been running in Nexenta's QA process for the past 2+ months.

Issues

  • 33
  • 9
  • 2
  • 44
Description From Last Updated
This is ~50GB by my calculation: 3 * 16MB * 1023 Matthew Ahrens Matthew Ahrens
I hit this one (NULL deref in l2arc_spa_rebuild_start) after a crash wiped the gpt partition table on a USB3 thumbnail ... Sean Doran Sean Doran
given that you're padding the other structures in the hope of being able to reconstruct the L2ARC even after future ... Matthew Ahrens Matthew Ahrens
why aren't we using the l2arc_log_blkptr_t? Matthew Ahrens Matthew Ahrens
this still needs updating, I think Matthew Ahrens Matthew Ahrens
consider using a static inline func Matthew Ahrens Matthew Ahrens
cstyle: parens around return value Matthew Ahrens Matthew Ahrens
shouldn't this be P2ROUNDUP(), in case sizeof(...) is e.g. 520? Or better yet, vdev_psize_to_asize()? Matthew Ahrens Matthew Ahrens
could we rename this member "l2ad_rebuild_needed"? or _wanted? Matthew Ahrens Matthew Ahrens
why only in kernel? can this not be exercised by ztest? Matthew Ahrens Matthew Ahrens
why record the did rather than the kthread_t? Matthew Ahrens Matthew Ahrens
I think we can make this part of spa_proc Matthew Ahrens Matthew Ahrens
I think that the daddr should already be ashift-aligned, so this should be "daddr + vdev_psize_to_asize(PSIZE())" Matthew Ahrens Matthew Ahrens
It would be nice to somehow make it more explicit that for log blocks, the concept of ASIZE (how much ... Matthew Ahrens Matthew Ahrens
maybe first you could ASSERT3U() that sizeof (lb_ptrs) is the same as sizeof (dh_start_lbps)? that way if we change one ... Matthew Ahrens Matthew Ahrens
how about also (or instead), zfs_dbgmsg()? Matthew Ahrens Matthew Ahrens
I'm confused by the use of PSIZE() here. Why would we care what the compressed size is at this point, ... Matthew Ahrens Matthew Ahrens
'hdr' meaning dev->l2ad_dev_hdr? Matthew Ahrens Matthew Ahrens
not very clean to be releasing a lock that the caller acqired. And what about the other return paths that ... Matthew Ahrens Matthew Ahrens
logically, it would make sense to check the magic, then the checksum, then the guid. That way we could differentiate ... Matthew Ahrens Matthew Ahrens
we never have a partially-full log block? Matthew Ahrens Matthew Ahrens
if it's ADDR_UNSET, how can the rest of this work? Matthew Ahrens Matthew Ahrens
I don't understand "connect the l2hdr to the hdr" Matthew Ahrens Matthew Ahrens
these are not prefetch i/os (see ARC_FLAG_PREFETCH), so please use a different name here (and when describing this in comments) Matthew Ahrens Matthew Ahrens
not sure why we need a wrapper function for this? Matthew Ahrens Matthew Ahrens
why not set the BP's byteorder bit? (and use that instead of checking the magic) Matthew Ahrens Matthew Ahrens
can this be moved into the macro? Matthew Ahrens Matthew Ahrens
oh, the PSIZE is the asize? so we don't need to do vdev_psize_to_asize() when reading it? Matthew Ahrens Matthew Ahrens
how about DMU_OTN_UINT64_METADATA? Matthew Ahrens Matthew Ahrens
how about &hdr->dh_spa_guid ? Matthew Ahrens Matthew Ahrens
PSIZE here is not ASIZE? seems we should be consistent. Matthew Ahrens Matthew Ahrens
That is not a valid dmu_object_type_t. I'd suggest either using MDU_OTN_UINT8_{META}DATA, or explicitly document this to be the arc_buf_contents_t (and ... Matthew Ahrens Matthew Ahrens
I'm always suspicious of the config, because it can be provided by unreliable sources (userland, zpool.cache, outdated labels). Do we ... Matthew Ahrens Matthew Ahrens
Saso Kiselkov
Matthew Ahrens
Jorgen Lundman

   
usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 
This is probably the first time thread_join() is needed from a functional point of view, and OSX does not have it (nor ZOL and FBSD). We could certainly implement a thread_join in the SPL layer for each platform, if the use of thread_join is to become "the norm" ?
Sean Doran

Hi Sašo,

I just hit an interesting corner case on in the l2_rebuild_abort_lowmem logic.
  • Import several pools; rebuild goes fine.
  • Import pool that at system shutdown was dealing with a huge set (half a TiB) of deduped-dataset snapshot destroys.
  • This takes about 15-20 min waiting on the spa_namespace_lock mutex (in spa_import), the l2arc_feed_thr_lock mutex, and then the spa_namespace_lock mutex again (in spa_async_thread).
  • During this time, other activity makes the arc grow big
  • The rebuild sees arc_reclaim_needed(), and so the several GB of DDT (among other things) on the L2 are effectively discarded.

I think that the abort-in-lowmem response is a bit aggressive. In this particular example, this is the first time arc_reclaim_needed() was true since the system was started.

Maybe backing off and trying again once or twice before aborting is a reasonable thing to do in the generic case?

Also, as lundman noted, so far in openzfs nobody else expects thread_join() to work in the kernel; my approach was to extend struct l2arc_dev to hold a mutex and a condvar,

        kmutex_t                l2ad_rebuild_mutex;
        kcondvar_t              l2ad_rebuild_cv;

and to do this in l2ar_remove_vdev():

        if (remdev->l2ad_rebuild_did != 0) {
                /*
                 * N.B. it should be safe to thread_join with the rebuild
                 * thread while holding l2arc_dev_mtx because it is not
                 * accessed from anywhere in the l2arc rebuild code below
                 * (except for l2arc_spa_rebuild_start, which is ok).
                 */
          mutex_enter(&remdev->l2ad_rebuild_mutex);
          do {
            static int iter = 0;
            dprintf("ZFS: %s: waiting on condvar for rebuild thread %p, iter %d\n",
                   __func__, remdev->l2ad_rebuild_did, ++iter);
            cv_timedwait(&remdev->l2ad_rebuild_cv, &remdev->l2ad_rebuild_mutex, ddi_get_lbolt()+(hz*5));
          } while(!remdev->l2ad_rebuild_thread_exiting);
          mutex_exit(&remdev->l2ad_rebuild_mutex);
          dprintf("ZFS: %s: thread done %p, destroying cv and mutex, setting did to 0\n",
                 __func__, remdev->l2ad_rebuild_did);
          cv_destroy(&remdev->l2ad_rebuild_cv);
          mutex_destroy(&remdev->l2ad_rebuild_mutex);
          remdev->l2ad_rebuild_did = 0;
                //thread_join(remdev->l2ad_rebuild_did);

and this in l2arc_spa_rebuild_start():

                if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
                        VERIFY3U(dev->l2ad_rebuild_did, ==, 0);
#ifdef  _KERNEL
                        mutex_init(&dev->l2ad_rebuild_mutex, NULL, MUTEX_DEFAULT, NULL);
                        cv_init(&dev->l2ad_rebuild_cv, NULL, CV_DEFAULT, NULL);
                        mutex_enter(&dev->l2ad_rebuild_mutex);
                        dev->l2ad_rebuild_thread_exiting = FALSE;
                        mutex_exit(&dev->l2ad_rebuild_mutex);
                        dev->l2ad_rebuild_did = thread_create(NULL, 0,
                                                              (void *)l2arc_dev_rebuild_start,
                                                              dev, 0, &p0, TS_RUN,
                                                              minclsyspri);
#endif
              }

and this in l2arc_dev_rebuild_start()

  mutex_enter(&dev->l2ad_rebuild_mutex);
  dev->l2ad_rebuild_thread_exiting = FALSE;
        if (!dev->l2ad_rebuild_cancel) {
          mutex_exit(&dev->l2ad_rebuild_mutex);
          VERIFY(dev->l2ad_rebuild);
          (void) l2arc_rebuild(dev);
          mutex_enter(&dev->l2ad_rebuild_mutex);
          dev->l2ad_rebuild = B_FALSE;
          mutex_exit(&dev->l2ad_rebuild_mutex);
        } else {
          mutex_exit(&dev->l2ad_rebuild_mutex);
        }
        mutex_enter(&dev->l2ad_rebuild_mutex);
        dev->l2ad_rebuild_thread_exiting = TRUE;
        mutex_exit(&dev->l2ad_rebuild_mutex);
        cv_broadcast(&dev->l2ad_rebuild_cv);
        kpreempt(KPREEMPT_SYNC);
        if(!dev->l2ad_rebuild_cancel && dev->l2ad_rebuild_did == thr) {
          // we didn't get cleaned up in l2arc_remove_vdev
          cv_destroy(&dev->l2ad_rebuild_cv);
          mutex_destroy(&dev->l2ad_rebuild_mutex);
          dev->l2ad_rebuild_did = 0;
        } else if(dev->l2ad_rebuild_did != thr) {
          // zero: we probably got cleaned up in l2arc_remove_vdev, nonzero: impossible?
                 __func__, thr, dev->l2ad_rebuild_did);
        }
        dprintf("ZFS: %s calling thread_exit(), thread %p/%p, l2ad_rebuild_cancel == %d\n",
               __func__, thr, dev->l2ad_rebuild_did, dev->l2ad_rebuild_cancel);
        thread_exit();

Finally, I agree with you in your discussion with Matt Ahrens that the ~51 GB lower limit on rebuildable L2 vdevs is a bit strong.

Consider the case where a smaller L2ARC is actually desirable because the average block size is very small and turnover is fairly high; the extra l2hdrs can be significant pollution, and will accumulate on a larger L2 device. However, there is still presumably value in holding on to some smaller number of small blocks.

Sean.
  1. In a variety of tests with the across-system-shutdown behaviour of slow pools destroying huge deduped snapshots and datasets, modifying the low memory test

                    /*
                     * Our memory pressure valve. If the system is running low
                     * on memory, rather than swamping memory with new ARC buf
                     * hdrs, we opt not to rebuild the L2ARC. At this point,
                     * however, we have already set up our L2ARC dev to chain in
                     * new metadata log blk, so the user may choose to re-add the
                     * L2ARC dev at a later time to reconstruct it (when there's
                     * less memory pressure).
                     */
    
                    if (arc_reclaim_needed()) {
    

    to

                    if(arc_reclaim_needed() && arc_meta_used >= arc_meta_limit) {
    

    seems to work reasonably for pools that are late to the party (the arc has warmed up) but have a bunch of IOPS-saving rebuildable data.

    On the other hand, the way arc_reclaim_needed() is subtly different across various openzfs implementations, so this might not be a good change for the generic case. Perhaps arc_meta_used should be compared against some fraction of arc_meta_limit instead.

  2. Thanks for the great work on figuring this out. I'll build in your suggestions and update the review.

Sean Doran

   
usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

I hit this one (NULL deref in l2arc_spa_rebuild_start) after a crash wiped the gpt partition table on a USB3 thumbnail drive that during testing had two partitions in use as L2ARC for a single pool.

Any attempt to import, including just "zpool import -c [pool cachefile]" or "zpool import" would trip the NULL deref.

The workaround was to do
if(!dev) {
printfmessage;
continue;
}
where the ASSERT is, which let me import the pool (with the cache vdevs unavailable, of course).

Matthew Ahrens

Looking forward to seeing this in illumos! I'd like George Wilson to also review this after you respond to my feedback.

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

given that you're padding the other structures in the hope of being able to reconstruct the L2ARC even after future on-disk format changes -- it would probably make sense to add a couple words of padding here too (and would bring the size of this up to a power of 2, though maybe that doesn't matter)

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 
 
 
 
 
 
 
 
 
 
 

why aren't we using the l2arc_log_blkptr_t?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

this still needs updating, I think

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

consider using a static inline func

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 
 

cstyle: parens around return value

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 
 

shouldn't this be P2ROUNDUP(), in case sizeof(...) is e.g. 520? Or better yet, vdev_psize_to_asize()?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 
could we rename this member "l2ad_rebuild_needed"? or _wanted?
usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

why only in kernel? can this not be exercised by ztest?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

why record the did rather than the kthread_t?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

I think we can make this part of spa_proc

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

I think that the daddr should already be ashift-aligned, so this should be "daddr + vdev_psize_to_asize(PSIZE())"

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

It would be nice to somehow make it more explicit that for log blocks, the concept of ASIZE (how much space is allocated for this block) is equal to vdev_psize_to_asize(PSIZE). Maybe even add a macro for LBP_GET_ASIZE() that does this (you'd have to pass in the dev as well)?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

maybe first you could ASSERT3U() that sizeof (lb_ptrs) is the same as sizeof (dh_start_lbps)? that way if we change one we will find where to change the other.

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 
 

how about also (or instead), zfs_dbgmsg()?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 
 
I'm confused by the use of PSIZE() here.  Why would we care what the compressed size is at this point, given that we are using a l2arc_log_blk_phys_t*, which is uncompressed.
usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

'hdr' meaning dev->l2ad_dev_hdr?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 
 

not very clean to be releasing a lock that the caller acqired. And what about the other return paths that don't drop the lock? Can this be restructured such that the caller drops its lock if we return an error?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

logically, it would make sense to check the magic, then the checksum, then the guid. That way we could differentiate between disk corruption and it being from a different pool.

That said, isn't there a zfs label on the cache device that ensures we wouldn't open it unless it's part of this pool?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

we never have a partially-full log block?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

if it's ADDR_UNSET, how can the rest of this work?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

I don't understand "connect the l2hdr to the hdr"

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

these are not prefetch i/os (see ARC_FLAG_PREFETCH), so please use a different name here (and when describing this in comments)

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 
 
 
 
 
 
 
 
 
 

not sure why we need a wrapper function for this?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

why not set the BP's byteorder bit? (and use that instead of checking the magic)

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

can this be moved into the macro?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

oh, the PSIZE is the asize? so we don't need to do vdev_psize_to_asize() when reading it?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

how about DMU_OTN_UINT64_METADATA?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 
 

how about &hdr->dh_spa_guid ?

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 

PSIZE here is not ASIZE? seems we should be consistent.

usr/src/uts/common/fs/zfs/arc.c (Diff revision 4)
 
 
That is not a valid dmu_object_type_t.  I'd suggest either using MDU_OTN_UINT8_{META}DATA, or explicitly document this to be the arc_buf_contents_t (and put a comment by arc_buf_contents_t explaining that it's used on disk and therefore can't be changed!)
usr/src/uts/common/fs/zfs/spa.c (Diff revision 4)
 
 
 
 
 

I'm always suspicious of the config, because it can be provided by unreliable sources (userland, zpool.cache, outdated labels). Do we know that this is from the config in the MOS (which is reliable)? If not, I guess if do_rebuild is wrong, it isn't the end of the world, but I predict bugs of the form "I installed persistent l2arc bits but it wasn't persisted across reboot [when something weird happened with the config]".

Loading...