Wednesday, May 4, 2016

NFS Duplicate request hash

This post is about a rather technical point - the performance of the duplicate request cache in the NFSv3 implementation in the Illumos OS. The duplicate request cache was invented by Chet Juszczak in 1989, and pretty much any NFSv3 implementation uses some form of it. A short overview says:

The typical NFS version 3 protocol failure recovery model uses client time-out and retry to handle server crashes, network partitions, and lost server replies. A retried request is called a duplicate of the original.

When used in a file server context, the term idempotent can be used to distinguish between operation types. An idempotent request is one that a server can perform more than once with equivalent results (though it may in fact change, as a side effect, the access time on a file, say for READ). Some NFS operations are obviously non-idempotent. They cannot be reprocessed without special attention simply because they may fail if tried a second time. The CREATE request, for example, can be used to create a file for which the owner does not have write permission. A duplicate of this request cannot succeed if the original succeeded. Likewise, a file can be removed only once.

The side effects caused by performing a duplicate non-idempotent request can be destructive (for example, a truncate operation causing lost writes). The combination of a stateless design with the common choice of an unreliable network transport (UDP) implies the possibility of destructive replays of non-idempotent requests. Though to be more accurate, it is the inherent stateless design of the NFS version 3 protocol on top of an unreliable RPC mechanism that yields the possibility of destructive replays of non-idempotent requests, since even in an implementation of the NFS version 3 protocol over a reliable connection-oriented transport, a connection break with automatic reestablishment requires duplicate request processing (the client will retransmit the request, and the server needs to deal with a potential duplicate non-idempotent request).
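To make the idempotency distinction concrete, here is a hypothetical Python sketch (the class and method names are invented for illustration, not taken from any NFS implementation) showing why a retried READ is harmless while a retried REMOVE spuriously fails without a duplicate request cache:

```python
# Hypothetical sketch: a trivial "server" processing possibly-duplicated
# requests. READ is idempotent; REMOVE is not.

class TinyServer:
    def __init__(self):
        self.files = {"a.txt": b"hello"}

    def read(self, name):
        # Idempotent: repeating the call yields the same result.
        return self.files[name]

    def remove(self, name):
        # Non-idempotent: a second call fails even though the
        # first one succeeded.
        if name not in self.files:
            raise FileNotFoundError(name)
        del self.files[name]
        return "ok"

srv = TinyServer()
assert srv.read("a.txt") == srv.read("a.txt")   # duplicate READ is harmless

assert srv.remove("a.txt") == "ok"
try:
    srv.remove("a.txt")          # the retransmitted duplicate...
    duplicate_failed = False
except FileNotFoundError:
    duplicate_failed = True      # ...fails without a dup cache
assert duplicate_failed
```

A dup cache avoids the failure by replaying the cached "ok" reply for the duplicate instead of re-executing the operation.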

The original Solaris/Illumos implementation is reasonable, but it has several problems under today's substantially larger workloads:

  • The cache size is static and must be set up-front when the system boots. There is a documented tuneable, rpcmod:maxdupreqs, that can be tweaked. Its default value is 1024, which is far too small for modern applications. There is no clear guidance on the optimal setting; values of 4096 or 8192 are suggested.
  • When the request rate is high enough to cause cache entries to be dropped, performance suffers significantly.
  • The implementation doesn't scale well over the many CPUs typical of modern systems: it uses a single global lock to protect the hash table.

We can actually improve the implementation using some newer ideas and features available in Illumos - mostly the kmem caching facility of the Illumos kernel memory allocator. The idea is to increase the maximum cache size significantly and to keep entries in the cache for a certain time when possible. There are two main parameters:

  • cotsmaxdupreqs is the maximum number of cached items. It should be much larger than the number of NFS daemon threads. The default value is 32768.
  • cots_dupreq_lifetime is the time a request lives in the dup cache - it should be long enough to cover normal RPC timeouts.

The implementation tries to keep cached entries for cots_dupreq_lifetime, but it also keeps the total number of allocated entries within cotsmaxdupreqs.
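As a rough illustration of that policy, here is a simplified, single-threaded Python sketch (the DupCache class and its method names are invented here; the real kernel code below is hashed, locked per bucket, and allocated from a kmem cache):

```python
import time
from collections import OrderedDict

class DupCache:
    """Sketch of a TTL-plus-size-bounded duplicate request cache."""
    def __init__(self, max_entries=32768, lifetime=30.0):
        self.max_entries = max_entries
        self.lifetime = lifetime
        self.entries = OrderedDict()   # xid -> (created, response)

    def _expire(self, now):
        # Drop entries older than the lifetime from the front
        # (entries are kept oldest-first).
        while self.entries:
            xid, (created, _) = next(iter(self.entries.items()))
            if created + self.lifetime >= now:
                break
            del self.entries[xid]

    def lookup(self, xid, now=None):
        now = time.monotonic() if now is None else now
        self._expire(now)
        hit = self.entries.get(xid)
        if hit is not None:
            # A retried request is likely to be retried again:
            # refresh its timestamp and move it to the back of the line.
            self.entries[xid] = (now, hit[1])
            self.entries.move_to_end(xid)
            return hit[1]
        return None

    def store(self, xid, response, now=None):
        now = time.monotonic() if now is None else now
        if len(self.entries) >= self.max_entries:
            # Cache full: drop the oldest entry to make room.
            self.entries.popitem(last=False)
        self.entries[xid] = (now, response)

c = DupCache(max_entries=2, lifetime=10.0)
c.store(1, "r1", now=0.0)
c.store(2, "r2", now=1.0)
c.store(3, "r3", now=2.0)              # evicts xid 1 (size bound)
assert c.lookup(1, now=2.0) is None
assert c.lookup(2, now=2.0) == "r2"
assert c.lookup(2, now=20.0) is None   # expired (TTL bound)
```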

Here is the description of the algorithm:


  • When a new request comes in, it is always placed in the cache (even if the total number of entries exceeds cotsmaxdupreqs) and is marked as DUP_INPROGRESS. Once the request is processed by the underlying file system, it is either marked as DUP_DONE (meaning its results can be cached) or DUP_DROP (meaning they can't be). If the request is marked DUP_DONE and the total number of cached requests doesn't exceed cotsmaxdupreqs, we keep it in the cache. If the maximum cache size has been reached, we drop the response.
  • We combine the walk of a cache bucket chain (searching for a matching entry) with freeing expired entries found along the way. There is also a separate GC process that periodically cleans up expired entries.
  • The dup cache has an interesting property: most lookups are misses and hits are very rare, so the cache miss is the fast path.
Here is the actual code snippet:


/* 
 * svc_cots_kdup searches the request cache and returns 0 if the 
 * request is not found in the cache.  If it is found, then it 
 * returns the state of the request (in progress or done) and 
 * the status or attributes that were part of the original reply. 
 *
 * If the state is DUP_DONE (there is a duplicate), svc_cots_kdup copies
 * over the value of the cached response. In that case it also returns in
 * *dupcachedp whether a response free routine is cached in the dupreq -
 * in which case the caller should not free the response, because it will
 * be freed later when the dupreq is reclaimed.
 *
 */
static int
svc_cots_kdup(struct svc_req *req, caddr_t res, int size, struct dupreq **drpp,
   bool_t *dupcachedp)
{
   struct rpc_cots_server *stats = CLONE2STATS(req->rq_xprt);
   dupreq_t *dr = NULL;
   dupreq_t *dr_next = NULL;
   uint32_t xid = REQTOXID(req);
   int status;
   clock_t now = ddi_get_lbolt();
   dupcache_bucket_t *bucket = &cots_duphash[XIDHASH(xid)];
   list_t *bucket_list = &bucket->dc_entries;
   kmutex_t *bucket_lock = &bucket->dc_lock;

   RSSTAT_INCR(stats, rsdupchecks);
   stats->rsentries.value.ui64 = cotsndupreqs;
   stats->rsoverflow.value.ui64 = cots_overflows;
   stats->rspending.value.ui64 = cots_inprogress;
   stats->rsclifetime.value.ui64 = cots_cache_lifetime;

   /*
    * Allocate a new entry outside the lock.
    * We only need the new entry if there isn't one already present, but
    * usually this is the case.
    */
   dupreq_t *dr_new = dr_alloc(req, size);
   if (dr_new == NULL) {
      RSSTAT_INCR(stats, rsnomem);
      return (DUP_ERROR);
   }

   dr_new->dr_created = now;

   /*
    * Check to see whether an entry already exists in the cache.
    */
   mutex_enter(bucket_lock);
   dr = list_head(bucket_list);
   while (dr != NULL) {
      dr_next = list_next(bucket_list, dr);
      status = dr->dr_status;
      /*
       * Remove expired DUP_DONE entries
       */
      if ((status == DUP_DONE) &&
          (dr->dr_created + cots_dupreq_lifetime < now)) {
         atomic_add_64(&cots_cache_lifetime,
            now - dr->dr_created);
         list_remove(bucket_list, dr);
         dr_free(dr);
      } else if (cots_compare(req, dr)) {
         if (status == DUP_DONE) {
            bcopy(dr->dr_resp.buf, res, size);
            if (dupcachedp != NULL &&
                dr->dr_resfree != NULL) {
               *dupcachedp = B_TRUE;
            }
         } else {
            *drpp = dr;
            RSSTAT_INCR(stats, rsinprogress);
         }
         /*
          * Client retried this request so it is a good chance
          * that it will retry again, so keep this entry in the
          * cache a bit longer = move it to the end of line.
          */
         atomic_add_64(&cots_cache_lifetime,
            now - dr->dr_created);
         dr->dr_created = now;
         list_remove(bucket_list, dr);
         list_insert_tail(bucket_list, dr);
         mutex_exit(bucket_lock);
         RSSTAT_INCR(stats, rsdupreqs);
         dr_free(dr_new);
         return (status);
      }
      dr = dr_next;
   }

   /*
    * Entry not found in the cache so it is a new request.
    */
   dr = dr_new;
   list_insert_tail(bucket_list, dr);
   mutex_exit(bucket_lock);

   atomic_inc_32(&cots_inprogress);
   *drpp = dr;
   return (DUP_NEW);
}

/*
 * svc_cots_kdupdone marks the request done (DUP_DONE)
 * and stores the response.
 */
static void
svc_cots_kdupdone(struct dupreq *dr, caddr_t res, void (*dis_resfree)(),
   int size, int status)
{
   uint32_t drhash = (uint32_t)DRHASH(dr);
   dupcache_bucket_t *bucket = &cots_duphash[drhash];
   list_t *bucket_list = &bucket->dc_entries;
   kmutex_t *bucket_lock = &bucket->dc_lock;


   ASSERT(dr->dr_resfree == NULL);
   ASSERT(dr->dr_status == DUP_INPROGRESS);
   atomic_dec_32(&cots_inprogress);
   mutex_enter(bucket_lock);
   if (status != DUP_DONE) {
      dr->dr_status = status;
   } else {
      dr->dr_resfree = dis_resfree;
      bcopy(res, dr->dr_resp.buf, size);
      if (cotsndupreqs < cotsmaxdupreqs) {
         dr->dr_status = DUP_DONE;
         mutex_exit(bucket_lock);
         return;
      }
      /*
       * Cache is full. Try replacing the oldest entry in the bucket
       * If this isn't possible, don't bother and just purge the
       * current entry
       */
      atomic_inc_32(&cots_overflows);
      dupreq_t *head = list_head(bucket_list);
      if (head != NULL && head->dr_status == DUP_DONE) {
         /* Cut the head off */
         atomic_add_64(&cots_cache_lifetime,
            ddi_get_lbolt() - head->dr_created);
         list_remove_head(bucket_list);
         /*
          * dr stays in the cache
          */
         dr->dr_status = DUP_DONE;
         mutex_exit(bucket_lock);
         dr_free(head);
         return;
      }

   }
   atomic_add_64(&cots_cache_lifetime,
      ddi_get_lbolt() - dr->dr_created);
   list_remove(bucket_list, dr);
   mutex_exit(bucket_lock);
   dr_free(dr);
}

/*
 * Allocate a new dupreq structure which has enough space to hold the request
 *  and response data
 */
static dupreq_t *
dr_alloc(struct svc_req *req, int resp_size)
{
   dupreq_t *dr = kmem_cache_alloc(dupreq_cache, KM_NOSLEEP);
   if (dr == NULL) {
      return (NULL);
   }

   if (dr->dr_addr.buf == NULL ||
       dr->dr_addr.maxlen < req->rq_xprt->xp_rtaddr.len) {
      if (dr->dr_addr.buf != NULL) {
         kmem_free(dr->dr_addr.buf, dr->dr_addr.maxlen);
         dr->dr_addr.buf = NULL;
      }
      dr->dr_addr.maxlen = req->rq_xprt->xp_rtaddr.len;
      dr->dr_addr.buf = kmem_alloc(dr->dr_addr.maxlen, KM_NOSLEEP);
      if (dr->dr_addr.buf == NULL) {
         dr->dr_addr.maxlen = 0;
         kmem_cache_free(dupreq_cache, dr);
         return (NULL);
      }
   }

   if (dr->dr_resp.buf == NULL ||
       dr->dr_resp.maxlen < resp_size) {
      if (dr->dr_resp.buf != NULL) {
         kmem_free(dr->dr_resp.buf, dr->dr_resp.maxlen);
         dr->dr_resp.buf = NULL;
      }
      dr->dr_resp.maxlen = (unsigned int)resp_size;
      dr->dr_resp.buf = kmem_alloc(resp_size, KM_NOSLEEP);
      if (dr->dr_resp.buf == NULL) {
         dr->dr_resp.maxlen = 0;
         kmem_cache_free(dupreq_cache, dr);
         return (NULL);
      }
   }

   dr->dr_xid = REQTOXID(req);
   dr->dr_prog = req->rq_prog;
   dr->dr_vers = req->rq_vers;
   dr->dr_proc = req->rq_proc;
   dr->dr_addr.len = req->rq_xprt->xp_rtaddr.len;
   bcopy(req->rq_xprt->xp_rtaddr.buf, dr->dr_addr.buf,
      dr->dr_addr.len);
   dr->dr_status = DUP_INPROGRESS;

   atomic_inc_32(&cotsndupreqs);
   return (dr);
}

/*
 * Cache garbage collection. Called to cleanup cache entries periodically.
 * Note that entries are also cleaned up in svc_cots_kdup().
 */
static void 
dupreq_gc(void *arg)
{
   int i;
   clock_t now = ddi_get_lbolt();

   /*
    * Walk through every bucket and free all expired entries
    */
   for (i = 0; i < DRHASHSZ; i++) {
      dupcache_bucket_t *bucket = &cots_duphash[i];
      list_t *bucket_list = &bucket->dc_entries;
      kmutex_t *bucket_lock = &bucket->dc_lock;
      dupreq_t *dr;

      mutex_enter(bucket_lock);
      dr = list_head(bucket_list);
      while ((dr != NULL) &&
         (dr->dr_created + cots_dupreq_lifetime < now)) {
         dupreq_t *dr_next = list_next(bucket_list, dr);
         int status = dr->dr_status;
         if (status == DUP_DONE) {
            atomic_add_64(&cots_cache_lifetime,
               now - dr->dr_created);
            list_remove(bucket_list, dr);
            dr_free(dr);
            atomic_inc_32(&cots_gc);
         }
         dr = dr_next;
      }
      mutex_exit(bucket_lock);
   }
   (void) timeout(dupreq_gc, NULL, cots_dupreq_lifetime);
}

static void
dr_free(dupreq_t *dr)
{
   atomic_dec_32(&cotsndupreqs);
   /* If there is a custom free routine for the response data, call it */
   if (dr->dr_resfree != NULL) {
      (*dr->dr_resfree)(dr->dr_resp.buf);
      dr->dr_resfree = NULL;
   }
   dr->dr_xid = -1;
   dr->dr_proc = 0;
   dr->dr_vers = 0;
   dr->dr_prog = 0;
   dr->dr_status = DUP_INPROGRESS;
   kmem_cache_free(dupreq_cache, dr);
}

/*ARGSUSED*/
static int
dupreq_constructor(void *buf, void *cdrarg, int kmflags)
{
   dupreq_t *dr = buf;
   bzero(dr, sizeof (dupreq_t));
   dr->dr_xid = -1;
   dr->dr_status = DUP_INPROGRESS;
   return (0);
}

/*
 * Free cached address and response buffers 
 */
static void
dupreq_destructor(void *buf, void *cdrarg)
{
   dupreq_t *dr = buf;
   if (dr->dr_addr.buf != NULL)
      kmem_free(dr->dr_addr.buf, dr->dr_addr.maxlen);
   dr->dr_addr.buf = NULL;
   if (dr->dr_resp.buf != NULL)
      kmem_free(dr->dr_resp.buf, dr->dr_resp.maxlen);
   dr->dr_resp.buf = NULL;
}

Wednesday, April 20, 2016

More on Cinder volume creation

After my complaints in the previous post, I'd like to demystify the act of volume creation in Cinder-land.

The Volume object passed to create_volume() has several attributes that are useful:

  • size is the expected size of the new volume;
  • id is the UUID of the new volume. It is already set on the call to create_volume();
  • name is the volume name as assigned by Cinder.
So the job of create_volume() is:

  • Do whatever is needed on the backend to obtain an NFS share that can host a file of the size specified in the volume 'size' attribute.
  • Create the actual file of the specified size. The file name should be exactly the value of the volume 'name' property, and the file should be created on the NFS share that you found or created.
  • Set the file permissions to either 666 or 660, depending on the security settings provided in the options.
  • Return a dictionary of volume attributes that you would like to set. It should at least include 'provider_location', which is a string in 'host:/path/to/share' format (note that it does not include the actual file name!). It is used by other components (mostly Nova) to mount the share and use the file as a backing store for the volume. It is very important that the share be mountable from both the Cinder node (which needs to mount it for some operations) and the Nova node (which always mounts it when the volume is attached to a VM instance).
  • You can also set the 'provider_id' attribute in the returned dictionary - a string that has meaning only to the driver. In my case, for example, I use the UUID of the newly created share. It can be used later to associate the volume with something that has meaning on your backend.
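Putting the steps above together, here is a minimal standalone sketch of such a create_volume() (the volume object and the backend call are stand-ins I invented for illustration, not real Cinder or appliance APIs):

```python
# Standalone sketch of the create_volume() contract described above.

class FakeVolume:
    def __init__(self, name, size):
        self.name = name       # file name assigned by Cinder
        self.size = size       # size in GB

def create_share_on_backend(name, size_gb):
    # Stand-in for the appliance API call that creates a share
    # big enough to host a file of size_gb gigabytes.
    share_uuid = "share-uuid-for-" + name
    return "filer:/export/%s" % name, share_uuid

def create_volume(volume):
    share_path, share_uuid = create_share_on_backend(
        volume.name, volume.size)
    # ... a real driver would now mount the share, create the backing
    # file of volume.size GB (e.g. with truncate/fallocate), and chmod
    # it to 666 or 660 depending on the security settings ...
    return {
        # host:/path/to/share -- no file name component!
        'provider_location': share_path,
        # opaque identifier meaningful only to the driver
        'provider_id': share_uuid,
    }

ret = create_volume(FakeVolume("volume-1234", 10))
assert ret['provider_location'] == "filer:/export/volume-1234"
```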

Tuesday, April 19, 2016

Exploring OpenStack Cinder

Recently I was asked to write an OpenStack Cinder driver for a proprietary storage appliance. This gave me a chance to look at the implementation side of OpenStack.

Getting Around

Browsing OpenStack documentation we can see that the driver must support the following set of features:
  • Volume Create/Delete
  • Volume Attach/Detach
  • Snapshot Create/Delete
  • Create Volume from Snapshot
  • Get Volume Stats
  • Copy Image to Volume
  • Copy Volume to Image
  • Clone Volume
So it is quite natural to start with volume creation. Unfortunately, the list above doesn't say much about the semantics of these operations. A bit more digging around points to the cinder.volume.driver documentation, which says:

create_volume(volume) creates a volume. Can optionally return a Dictionary of changes to the volume object to be persisted.

Well, this is rather helpful. Armed with this knowledge, writing an actual implementation is a breeze. A curious reader may wonder what the 'volume' passed to a driver actually is. Some more digging around produces this clear description:

...
fields = {'migration_status': String(default=<class 'oslo_versionedobjects.fields.UnspecifiedDefault'>,nullable=True), 'provider_id': UUID(default=<class 'oslo_versionedobjects.fields.UnspecifiedDefault'>,nullable=True), 'availability_zone': String(default=<class 'oslo_versionedobjects.fields.UnspecifiedDefault'>,nullable=True), 'terminated_at': DateTime(default=<class 'oslo_versionedobjects.fields.UnspecifiedDefault'>,nullable=True)
...

You get the idea. It turns out that Volumes are stored in a database so there is also a matching database schema in models.py which is about as useful.

So forget about documentation, let's dive in the source tree...

Back to the source

Since my goal was to implement an NFS-based volume driver, I examined the existing NfsDriver, which can be used by itself or as a base class for many other drivers. It is based on RemoteFsDriver, which provides common code for all NFS drivers. Hopefully this would provide enough support for the new driver - I just need to add a few API calls to communicate with the actual appliance...

The first question I wanted to answer from the source was the semantics of the create_volume() call. The RemoteFsDriver provides some hints: the call returns a dictionary

volume['provider_location'] = self._find_share(volume['size'])
self._do_create_volume(volume)
return {'provider_location': volume['provider_location']}

This provider_location turns out to be a string of the form host:/path/to/remote/share that is used by the mount command to mount the NFS share.
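For example, splitting a provider_location into the pieces a mount helper needs is straightforward (plain Python; the helper name is made up for this sketch):

```python
def split_provider_location(loc):
    """Split 'host:/path/to/share' into (host, export_path).

    Split on the first ':' only; the export path itself starts
    right after it."""
    host, _, path = loc.partition(':')
    return host, path

host, path = split_provider_location("filer01:/export/cinder/share1")
assert host == "filer01"
assert path == "/export/cinder/share1"
# The command Cinder/Nova ultimately runs is roughly:
#   mount -t nfs filer01:/export/cinder/share1 /mnt/<mountpoint>
```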

A few NFS drivers that I looked at behaved in the following way:

  • Configuration provides location of a file that lists available shares;
  • Drivers provide some code that selects a share suitable for the new volume and sticks its NFS path into the provider_location attribute;
  • The share path contains big files that represent volumes;
  • All shares are always kept mounted on the Cinder node.
What I wanted to do was somewhat different - I wanted to keep a 1:1 relationship between a volume and a share. This means there is no file describing the shares - shares are created on demand as volumes are created. Also, since we may have a lot of volumes, I didn't want to keep them all mounted all the time, only mounting them as needed. The benefit is that snapshots and clones become very easy to manage, since they are first-class citizens on the actual appliance.

It turned out that, despite all the existing generic code around NFS drivers, none of it was usable in my situation because RemoteFsDriver assumed the wrong model. So I had to do everything from scratch. The only thing I was able to reuse was the RemoteFsClient from remotefs_brick, which wasn't particularly useful either, but I had to use it for reasons I'll explain in another post. The only service it provides is the ability to run the mount command to mount an NFS share.

Conclusions

I was actually quite surprised to see such dismal quality in the developer documentation and the actual implementation of something as hyped as a core part of OpenStack. Compare it, for example, with the Docker Volume Plugin documentation (and implementations) and you'll see a huge difference: volume plugins are small, simple, clearly described, and can be implemented in any language.