Wandering Thoughts archives

2019-02-27

How to see and flush the Linux kernel NFS server's group membership cache

One of the long-standing limits with NFS v3 is that the protocol only carries up to 16 groups per request. In order to get around this and properly support people in more than 16 groups, various Unixes have various fixes. Linux has supported this for many years (since at least 2011) if you run rpc.mountd with -g aka --manage-gids. If you do use this option, well, I'll just quote the rpc.mountd manpage:

Accept requests from the kernel to map user id numbers into lists of group id numbers for use in access control. [...] If you use the -g flag, then the list of group ids received from the client will be replaced by a list of group ids determined by an appropriate lookup on the server. Note that the 'primary' group id is not affected so a newgrp command on the client will still be effective. [...]

As this mentions, the 'appropriate lookup' is performed by rpc.mountd when the kernel asks it to do one. As you'd expect, rpc.mountd uses whatever normal group membership lookup methods are configured on the NFS server in nsswitch.conf (it just calls getpwuid(3) and getgrouplist(3) in mountd/cache.c).
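If you want to check what that lookup will produce for a particular UID on a server, a rough shell approximation is the following (it goes through the same nsswitch.conf configuration, but it's only an approximation for checking things, not rpc.mountd's actual code path; UID 915 is just an example):

# Map a UID to a username and then to its group list, using the server's
# normal nsswitch.conf sources. This only approximates what rpc.mountd
# does through getpwuid() and getgrouplist().
user="$(getent passwd 915 | cut -d: -f1)"
id -G "$user"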

As you might expect, the kernel maintains a cache of this group membership information so that it doesn't have to flood rpc.mountd with lookups of the same information (and slow down handling NFS requests as it waits for answers), much like it maintains a client authentication cache. The group membership cache is handled with the same general mechanisms as the client authentication cache, which are sort of covered in the nfsd(7) manpage.

The group cache's various control files are found in /proc/net/rpc/auth.unix.gid, and they work the same as auth.unix.ip. There is a content file that lets you see the currently cached data, which comes in the form:

#uid cnt: gids...
915 11: 125 832 930 1010 1615 30062 30069 30151 30216 31061 31091
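You can look at this directly just by reading the pseudo-file, or pick out a single UID's cached entry (915 here is the sample UID from above):

# Dump the kernel's cached UID -> group list mappings.
cat /proc/net/rpc/auth.unix.gid/content
# Or look at just one UID's entry.
grep '^915 ' /proc/net/rpc/auth.unix.gid/content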

Occasionally you may see an entry like '123 0:'. I believe that this is generally an NFS request from a UID that wasn't known on the NFS fileserver; since it wasn't known, it has no local groups and so rpc.mountd reported to the kernel that it's in no groups.

All entries have a TTL, which is unfortunately not reported in the content pseudo-file; rpc.mountd uses its standard TTL of 30 minutes when adding entries and then they count down from there, with the practical effect that anything you see will expire at some unpredictable time within the next 30 minutes. You can flush all entries by writing a future time in Unix seconds to the flush file. For example:

date -d tomorrow +%s >/proc/net/rpc/auth.unix.gid/flush

This may be useful if you have added someone to a group, propagated the group update to your Linux NFS servers, and want them to immediately have NFS client access to files that are group-restricted to that group.

On sufficiently modern kernels, this behavior has been loosened (for all cache flush files) so that writing any number at all to flush will flush the entire cache. This change was introduced in early 2018 by Neil Brown, in this commit. Based on its position in the history of the kernel tree, I believe that this was first present in 4.17.0 (which unfortunately means that it's a bit too late to be in our Ubuntu 18.04 NFS fileservers).
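On those 4.17+ kernels a flush can thus be as simple as the following (this relies on the 'any number' behavior just described, so on older kernels you still need a suitably future time as before):

# On 4.17 and later, writing any number at all flushes the whole cache.
echo 1 >/proc/net/rpc/auth.unix.gid/flush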

Presumably there is a size limit on how large the kernel's group cache can be, but I don't know what it is. At the moment, there are just over 550 entries in content on our most broadly used Linux NFS fileserver (it holds /var/mail, so a lot of people access things from it).
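(One way to get such a rough count is something like the following; the grep pattern just skips the '#uid cnt: ...' header line, since entry lines start with a numeric UID:)

# Count cached entries; the header line starts with '#', entries with a UID.
grep -c '^[0-9]' /proc/net/rpc/auth.unix.gid/content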

linux/NFSFlushingServerGroupCache written at 23:02:33

Using Prometheus subqueries to do calculations over time ranges

Subqueries are a new feature in Prometheus 2.7. Their usual use is to nest time range queries, such as a max_over_time of a rate, as covered in, for example, Brian Brazil's How much of the time is my network usage over a certain amount?. However, they can be used in another, perhaps less obvious way. Put simply, subqueries let you use time-based aggregation on expressions.
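As a concrete illustration of that usual nested use (using node_exporter's network receive counter as a stand-in metric, which is just my assumption about what you might be collecting), the maximum five-minute receive rate seen over the past hour would be:

max_over_time( rate(node_network_receive_bytes_total[5m])[1h:] )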

Suppose, for example, that you are collecting basic NTP information from your NTP servers, including their current time and the time at which they last set their clock. As an instant query, the current amount of time since a server set its clock is:

sntp_time_seconds - sntp_clockset_seconds

You can graph this instant query over time to get a nice picture of how frequently the server resets its time. However, now suppose we want to know the maximum amount of time that a server has gone between clock updates over the past week. If we had a single metric for this, this would be straightforward:

max_over_time( sntp_clock_age_seconds [1w] )

However, we don't. Before subqueries, working this out in a single query was impossible; you couldn't put an expression inside max_over_time, and the best we could do was graph our instant query and eyeball where the top of the graph fell. But with subqueries, we can now do calculations inside max_over_time:

max_over_time ( (sntp_time_seconds - sntp_clockset_seconds) [1w:] )

(You have to put the ':' into the time range to mark it as a subquery; it's required by the syntax. I find this a little bit annoying since it can't be anything but a subquery here.)
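You can also put an explicit resolution after the ':' to control how often the inner expression is evaluated; if you leave it out, as above, Prometheus uses the global evaluation interval. With an explicit five-minute step, this would be:

max_over_time( (sntp_time_seconds - sntp_clockset_seconds)[1w:5m] )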

And so when I wrote yesterday's entry about ntpdate's surprising restriction on what it will sync to, I could confidently talk about how our three different NTP daemons seem to have three different types of behavior (which was something that wasn't clear at all from the graphs).

(The mention of subqueries in Querying basics sort of implies this, by talking about starting from an 'instant query'.)

PS: Somewhat to my surprise, Prometheus lets you do an instant query where the result is a range vector, eg 'metric[10m]'. For a simple metric range vector, the results you get back are the values at the various timestamps where the metric was scraped. This is actually useful because the timestamps themselves (and how many results you get for a given time range) give you the true scrape frequency for the metric, which is not otherwise available. If you ask for a '[15m]' of a metric that is only scraped once every five minutes, you only get three time points in the answer; if it's scraped every minute, you get fifteen.

(This works both in the web interface and in the underlying HTTP API. In the web interface you get both values and timestamps displayed in the console tab, but you unsurprisingly can't graph the result. In the API, you get a JSON values array instead of the usual single value.)
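For what it's worth, a quick sketch of what this looks like against the HTTP API, assuming a Prometheus server on localhost:9090 and the sntp_time_seconds metric from before:

# An instant query whose result is a range vector; the JSON reply has a
# 'values' array of [timestamp, value] pairs for each time series.
curl -sG 'http://localhost:9090/api/v1/query' \
     --data-urlencode 'query=sntp_time_seconds[15m]'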

sysadmin/PrometheusSubqueriesMathOverTime written at 00:14:41

