Wandering Thoughts

2024-03-18

Sorting out PIDs, Tgids, and tasks on Linux

In the beginning, Unix only had processes and processes had process IDs (PIDs), and life was simple. Then people added (kernel-supported) threads, so processes could be multi-threaded. When you add threads, you need to give them some user-visible identifier. There are many options for what this identifier is and how it works (and how threads themselves work inside the kernel). The choice Linux made was that threads were just processes (that shared more than usual with other processes), and so their identifier was a process ID, allocated from the same global space of process IDs as regular independent processes. This has created some ambiguity in what programs and other tools mean by 'process ID' (including for me).

The true name for what used to be a 'process ID', which is to say the PID of the overall entity that is 'a process with all its threads', is a TGID (Thread or Task Group ID). The TGID of a process is the PID of the main thread; a single-threaded program will have a TGID that is the same as its PID. You can see this in the 'Tgid:' and 'Pid:' fields of /proc/<PID>/status. Although some places will talk about 'pids' as separate from 'tids' (eg some parts of proc(5)), the two types are both allocated from the same range of numbers because they're both 'PIDs'. If I just give you a 'PID' with no further detail, there's no way to know if it's a process's PID or a task's PID.

In every /proc/<PID> directory, there is a 'task' subdirectory; this contains the PIDs of all tasks (threads) that are part of the thread group (ie, that have the same TGID). All PIDs have a /proc/<PID> directory, but for convenience things like 'ls /proc' only list the PIDs of processes (which you can think of as TGIDs). The /proc/<PID> directories for other tasks aren't returned by the kernel when you ask for the directory contents of /proc, although you can use them if you access them directly (and you can also discover them through /proc/<PID>/task). I'm not sure which information in the /proc/<PID> directories for tasks is specific to the task itself and which is aggregated across all tasks in the TGID. The proc(5) manual page sometimes talks about processes and sometimes about tasks, but I'm not sure its coverage is comprehensive.
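
To make the Tgid/Pid relationship concrete, here is a minimal sketch in C (assuming a Linux /proc and skipping most error handling) that prints the 'Pid:' and 'Tgid:' lines from /proc/<PID>/status and then lists the 'task' subdirectory:

#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(int argc, char **argv)
{
    const char *pid = (argc > 1) ? argv[1] : "self";
    char path[256], line[256];

    /* Print the Pid: and Tgid: fields from the status file. */
    snprintf(path, sizeof(path), "/proc/%s/status", pid);
    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Pid:", 4) == 0 || strncmp(line, "Tgid:", 5) == 0)
            fputs(line, stdout);
    }
    fclose(f);

    /* List all tasks (threads) in the thread group. */
    snprintf(path, sizeof(path), "/proc/%s/task", pid);
    DIR *d = opendir(path);
    if (!d) {
        perror(path);
        return 1;
    }
    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] != '.')
            printf("task: %s\n", de->d_name);
    }
    closedir(d);
    return 0;
}

Run it against the PID (TGID) of a multi-threaded process and the task listing will show one entry per thread, all of them sharing the same 'Tgid:'.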

(Much of the time when you're looking at what is actually a TGID, you want the total information across all threads in the TGID. If /proc/<PID> always gave you only task information even for the 'process' PID/TGID, multi-threaded programs could report confusingly low numbers for things like CPU usage unless you went out of your way to sum /proc/<PID>/task/* information yourself.)

Various tools will normally return the PID (TGID) of the overall process, not the PID of a random task in a multi-threaded process; for example, 'pidof <thing>' behaves this way. Depending on how the specific program works, the thread with this PID may or may not be the 'main thread' of the program (some multi-threaded programs more or less park their initial thread and do their main work on another one created later), and the program may not even have such a thing (I believe Go programs mostly don't, as they multiplex goroutines onto actual threads as needed).

If a tool or system offers you the choice to work on or with a 'PID' or a 'TGID', you are being given the choice to work with a single thread (task) or the overall process. Which one you want depends on what you're doing, but if you're doing things like asking for task delay information, using the TGID may better correspond to what you expect (since it will be the overall information for the entire process, not information for a specific thread). If a program only talks about PIDs, it's probably going to operate on or give you information about the entire process by default, although if you give it the PID of a task within the process (instead of the PID that is the TGID), you may get things specific to that task.

In a kernel context such as eBPF programs, I think you'll almost always want to track things by PID, not TGID. It is PIDs that do things like experience run queue scheduling latency, make system calls, and incur block IO delays, not TGIDs. However, if you're selecting what to report on, monitor, and so on, you'll most likely want to match on the TGID, not the PID, so that you report on all of the tasks in a multi-threaded program, not just one of them (unless you're specifically looking at tasks/threads, not 'a process').
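
To make this concrete, eBPF code usually gets both identifiers at once from bpf_get_current_pid_tgid(), which returns the TGID in the upper 32 bits and the current task's PID in the lower 32 bits. Here's a minimal libbpf-style sketch; the fsync tracepoint and the target_tgid filter variable are purely illustrative, not taken from any particular tool:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Set from user space; 0 means 'report everything'. */
const volatile __u32 target_tgid = 0;

SEC("tracepoint/syscalls/sys_enter_fsync")
int handle_fsync(void *ctx)
{
    __u64 id = bpf_get_current_pid_tgid();
    __u32 tgid = id >> 32;       /* the process, ie the thread group */
    __u32 pid = (__u32)id;       /* this particular task (thread) */

    /* Match on the TGID so we see all threads of the process ... */
    if (target_tgid && tgid != target_tgid)
        return 0;
    /* ... but report the PID, since that's the task that actually did the work. */
    bpf_printk("fsync: tgid %d task %d", tgid, pid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";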

(I'm writing this down partly to get it clear in my head, since I had some confusion recently when working with eBPF programs.)

linux/PidsTgidsAndTasks written at 21:59:58; Add Comment

2024-03-17

Disk write buffering and its interactions with write flushes

Pretty much every modern system defaults to having data you write to filesystems be buffered by the operating system and only written out asynchronously or when you specifically request that it be flushed to disk, which gives you general questions about how much write buffering you want. Now suppose, not hypothetically, that you're doing write IO that is pretty much always going to be specifically flushed to disk (with fsync() or the equivalent) before the programs doing it consider this write IO 'done'. You might be in this situation because you're writing and rewriting mail folders, or because the dominant write source is updating a write ahead log.

In this situation, where the data being written is almost always going to be flushed to disk, I believe the tradeoffs are a bit different than in the general write case. Broadly, you can never actually write at a rate faster than the write rate of the underlying storage, since in the end you have to wait for your write data to actually get to disk before you can proceed. I think this means that you want the OS to start writing out data to disk almost as soon as your process writes it; delaying the writeout will only take more time in the long run, unless for some reason the OS can write data faster when you ask for the flush than it could before then. In theory and in isolation, you may want these writes to be asynchronous (up until the process asks for the disk flush, at which point you have to synchronously wait for them), because the process may be able to generate data faster if it's not stalling waiting for individual writes to make it to disk.

(In OS tuning jargon, we'd say that you want writeback to start almost immediately.)

However, journaling filesystems and concurrency add some extra complications. Many journaling filesystems have the journal as a central synchronization point, where only one disk flush can be in progress at once; if several processes ask for disk flushes at more or less the same time, they can't proceed independently. If you have multiple processes all doing write IO that they will eventually flush and you want to minimize the latency that processes experience, you have a potential problem if different processes write different amounts of IO. A process that asynchronously writes a lot of IO and then flushes it to disk will obviously have a potentially long flush, and this flush will delay the flushes done by other processes writing less data, because everything is running through the chokepoint that is the filesystem's journal.

In this situation I think you want the process that's writing a lot of data to be forced to delay, to turn its potentially asynchronous writes into more synchronous ones that are restricted to the true disk write data rate. This avoids having a large overhang of pending writes when it finally flushes, which hopefully avoids other processes getting stuck with a big delay as they try to flush. Although it might be ideal if processes with less write volume could write asynchronously, I think it's probably okay if all of them are forced down to relatively synchronous writes with all processes getting an equal fair share of the disk write bandwidth. Even in this situation the processes with less data to write and flush will finish faster, lowering their latency.

To translate this to typical system settings, I believe that you want to aggressively trigger disk writeback and perhaps deliberately restrict the total amount of buffered writes that the system can have. Rather than allowing multiple gigabytes of outstanding buffered writes and deferring writeback until a gigabyte or more has accumulated, you'd set things to trigger writebacks almost immediately and then force processes doing write IO to wait for disk writes to complete once you have more than a relatively small volume of outstanding writes.

(This is in contrast to typical operating system settings, which will often allow you to use a relatively large amount of system RAM for asynchronous writes and not aggressively start writeback. This especially would make a difference on systems with a lot of RAM.)
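
On Linux, the main knobs for this are the vm.dirty_* sysctls. As a purely illustrative sketch (the specific numbers here are assumptions, not tested recommendations, and the right values depend on your storage), you might set something like:

# Start background writeback once a relatively small amount of dirty data accumulates.
vm.dirty_background_bytes = 67108864
# Make writers wait once this much dirty data is outstanding.
vm.dirty_bytes = 268435456

Setting the '_bytes' versions takes over from the corresponding vm.dirty_background_ratio and vm.dirty_ratio percentage settings, which are the defaults and which scale with RAM (hence the multi-gigabyte write buffering you get on large machines).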

tech/WriteBufferingAndSyncs written at 21:59:25; Add Comment

2024-03-16

Some more notes on Linux's ionice and kernel IO priorities

In the long ago past, Linux gained some support for block IO priorities, with some limitations that I noticed the first time I looked into this. These days the Linux kernel has support for more ways to schedule and limit IO, for example in cgroups v2 and its IO controller. However, ionice is still there and now I want to note some more things about it, since I just looked at ionice again (for reasons outside the scope of this entry).

First, ionice and the IO priorities it sets are specifically only for read IO and synchronous write IO, per ioprio_set(2) (this is the underlying system call that ionice uses to set priorities). This is reasonable, since IO priorities are attached to processes and asynchronous write IO is generally issued later by entirely different kernel tasks, in situations where the urgency of doing the write is unrelated to the IO priority of the process that originally did the write. It's still a somewhat unfortunate limitation, since it's often write IO that is the slowest thing and the source of the largest impacts on overall performance.
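
As a concrete illustration of the mechanics, here is a sketch of roughly what 'ionice -c2 -n7' does to the calling process itself; the constants are copied from the kernel's linux/ioprio.h, and since glibc has no wrapper for this system call it goes through syscall():

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* These mirror the definitions in the kernel's linux/ioprio.h. */
#define IOPRIO_CLASS_BE      2
#define IOPRIO_CLASS_SHIFT   13
#define IOPRIO_WHO_PROCESS   1
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
    /* Best-effort class, lowest priority level (7), for ourselves. */
    int prio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7);
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio) == -1) {
        perror("ioprio_set");
        return 1;
    }
    /* Read IO and synchronous write IO we do from here on carries this priority. */
    return 0;
}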

IO priorities are only effective with some Linux kernel IO schedulers, such as BFQ. For obvious reasons they aren't effective with the 'none' scheduler, which is also the default scheduler for NVMe drives. I'm (still) unable to tell if IO priorities work if you're using software RAID instead of sitting your (supported) filesystem directly on top of a SATA, SAS, or NVMe disk. I believe that IO priorities are unlikely to work with ZFS, partly because ZFS often issues read IOs through its own kernel threads instead of directly from your process and those kernel threads probably aren't trying to copy around IO priorities.

Even if they pass through software RAID, IO priorities apply at the level of disk devices (of course). This means that each side of a software RAID mirror will apply IO priorities only 'locally', for IO issued to it, and I don't believe there will be any global priorities for read IO to the overall software RAID mirror. I don't know if this will matter in practice. Since IO priorities only apply to disks, they obviously don't apply (on the NFS client) to NFS read IO. Similarly, IO priorities don't apply to data read from the kernel's buffer/page caches, since this data is already in RAM and doesn't need to be read from disk. This can give you an ionice'd program that is still 'reading' lots of data (and that data will be less likely to be evicted from kernel caches).

Since we mostly use some combination of software RAID, ZFS, and NFS, I don't think ionice and IO priorities are likely to be of much use for us. If we want to limit the impact a program's IO has on the rest of the system, we need different measures.

linux/IoniceNotesII written at 23:03:23; Add Comment

2024-03-15

The problem of using basic Prometheus to monitor DNS query results

Suppose that you want to make sure that your DNS servers are working correctly, for both your own zones and for outside DNS names that are important to you. If you have your own zones, you may also care that outside people can properly resolve them, perhaps both people within the organization and genuine outsiders using public DNS servers. The traditional answer to this is the Blackbox exporter, which can send the DNS queries of your choice to the DNS servers of your choice and validate the result. Well, more or less.

What you specifically do with the Blackbox exporter is configure some modules and then provide those modules with targets to check (through your Prometheus configuration). When you're probing DNS, the module's configuration specifies all of the parameters of the DNS query and its validation. This means that if you are checking N different DNS names to see if they give you a SOA record (or an A record or an MX record), you need N different modules. Quite reasonably, the metrics Blackbox generates when you check a target don't (currently) include the actual DNS name or query type that you're asking about. Why this matters is that it makes it difficult to write a generic alert that produces a specific message along the lines of 'asking for the X type of record for host Y failed'.
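
As an illustration of why this multiplies modules, a Blackbox DNS module looks roughly like the following (this is a from-memory sketch, so check the Blackbox exporter documentation for the exact field names, and 'example.org' is a stand-in):

modules:
  example_org_soa:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.org"
      query_type: "SOA"
      valid_rcodes:
        - NOERROR

Because query_name and query_type live in the module, every additional name or record type you want to check is another module; the target you give Prometheus for the probe is only the DNS server to send the query to.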

You can somewhat get around this by encoding this information into the names of your Blackbox modules and then doing various creative things in your Prometheus configuration. However, you still have to write all of the modules out, even though many of them may be basically cut and paste versions of each other with only the DNS names changed. This has a number of issues, including that it's a disincentive to doing relatively comprehensive cross checks. (I speak from experience with our Prometheus setup.)

There is a third party dns_exporter that can be set up in a more flexible way where all parts of the DNS check can be provided by Prometheus (although it exposes some metrics that risk label cardinality explosions). However this still leaves you to list in your Prometheus configuration a cross-matrix of every DNS name you want to query and every DNS server you want to query against. What you'll avoid is needing to configure a bunch of Blackbox modules (although what you lose is the ability to verify that the queries returned specific results).

To do better, I think we'd need to write a custom program (perhaps run through the script exporter) that contained at least some of this knowledge, such as what DNS servers to check. Then our Prometheus configuration could just say 'check this DNS name against the usual servers' and the script would know the rest. Unfortunately you probably can't reuse any of the current Blackbox code for this, even if you wrote the core of this script in Go.

(You could make such a program relatively generic by having it take the list of DNS servers to query from a configuration file. You might want to make it support multiple lists of DNS servers, each of them named, and perhaps set various flags on each server, and you can get quite elaborate here if you want to.)

(This elaborates on a Fediverse post of mine.)

sysadmin/PrometheusDNSMonitoringProblem written at 22:37:29; Add Comment

2024-03-14

You might want to think about whether your system serial numbers are sensitive

Recently, a commentator on my entry about what's lost when running the Prometheus host agent as a non-root user on Linux pointed out that if you do this, one of the things omitted (that I hadn't noticed) is part of the system DMI information. Specifically, you lose various serial numbers and the 'product UUID', which is potentially another unique identifier for the system, because Linux makes the /sys/class/dmi/id files with these readable only by root (this appears to have been the case since support for these was added to /sys in 2007). This got me thinking about whether serial numbers are something we should consider sensitive in general.

My tentative conclusion is that for us, serial numbers probably aren't sensitive enough to do anything special about. I don't think any of our system or component serial numbers can be used to issue one time license keys or the like, and while people could probably do some mischief with some of them, this is likely a low risk thing in our academic environment.

(Broadly we don't consider any metrics to be deeply sensitive, or to put it another way we wouldn't want to collect any metrics that are because in our environment it would take a lot of work to protect them. And we do collect DMI information and put it into our metrics system.)

This doesn't mean that serial numbers have no sensitivity even for us; I definitely do consider them something that I generally wouldn't (and don't) put in entries here, for example. Depending on the vendor, revealing serial numbers to the public may let the public do things like see your exact system configuration, when it was delivered, and other potentially somewhat sensitive information. There's also more of a risk that bored Internet people will engage in even minor mischief.

However, your situation is not necessarily like ours. There are probably plenty of environments where serial numbers are potentially more sensitive or more dangerous if exposed (especially if exposed widely). And in some environments, people run semi-hostile software that would love to get its hands on a permanent and unique identifier for the machine. Before you gather or expose serial number information (for systems or for things like disks), you might want to think about this.

At the same time, having relatively detailed hardware configuration information can be important, as in the war story that inspired me to start collecting this information in our metrics system. And serial numbers are a great way to disambiguate exactly which piece of hardware was being used for what, when. We deliberately collect disk drive serial number information from SMART, for example, and put it into our metrics system (sometimes with amusing results).

sysadmin/SerialNumbersMaybeSensitive written at 23:03:24; Add Comment

2024-03-13

Restarting systemd-networkd normally clears your 'ip rules' routing policies

Here's something that I learned recently: if systemd-networkd restarts, for example because of a package update for it that includes an automatic daemon restart, it will clear your 'ip rules' routing policies (and also, I think, your routing table, although you may not notice that as much). If you've set up policy based routing of your own (or some program has done that as part of its operation), this may produce unpleasant surprises.

Systemd-networkd does this fundamentally because you can set ip routing policies in .network files. When networkd is restarted, one of the things it does is re-establish whatever routing policies you specified; if you didn't specify any, it clears them. This is a reasonably sensible decision, both to deal with changes from previously specified routing policies and to give people a way to clean out their experiments and reset to a known good base state. Similar logic applies to routes.

This can be controlled through networkd.conf and its drop-in files, by setting ManageForeignRoutingPolicyRules=no and perhaps ManageForeignRoutes=no. Without testing it through a networkd restart, I believe that the settings I want are:

[Network]
ManageForeignRoutingPolicyRules=no
ManageForeignRoutes=no

The minor downside of this for me is that certain sorts of route updates will have to be done by hand, instead of by updating .network files and then restarting networkd.

While having an option to do this sort of clearing is sensible, I am dubious about the current default. In practice, coherently specifying routing policies through .network files is so much of a pain that I suspect few people do it that way; instead, I suspect that most people either script the necessary 'ip rule' commands (as I do) or use software that does it for them (and I know that such software exists). It would be great if networkd could create and manage high level policies for you (such as isolated interfaces), but the current approach is both verbose and limited in what you can do with it.
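
To illustrate the verbosity, something as simple as 'ip rule add from 192.168.10.0/24 lookup 100' turns into a section in the .network file for the relevant interface, roughly like this (a sketch; the subnet, table number, and priority are made up, and you'd also need matching [Route] sections for table 100):

[RoutingPolicyRule]
From=192.168.10.0/24
Table=100
Priority=100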

(As far as I know, networkd can't express rules for networks that can be brought up and torn down, because it's not an event-based system where you can have it react to the appearance of an interface or a configured network. It's possible I'm wrong, but if so it doesn't feel well documented.)

All of this is especially unfortunate on Ubuntu servers, which normally configure their networking through netplan. Netplan will more or less silently use networkd as the backend to actually implement what you wrote in your Netplan configuration, leaving you exposed to this, and on top of that Netplan itself has limitations on what routing policies you can express (pushing you even more towards running 'ip rule' yourself).

linux/SystemdNetworkdResetsIpRules written at 22:18:11; Add Comment

2024-03-12

What do we count as 'manual' management of TLS certificates

Recently I casually wrote about how even big websites may still be manually managing TLS certificates. Given that we're talking about big websites, this raises a somewhat interesting question of what we mean by 'manual' and 'automatic' TLS certificate management.

A modern big website probably has a bunch of front end load balancers or web servers that terminate TLS, and regardless of what else is involved in their TLS certificate management it's very unlikely that system administrators are logging in to each one of them to roll over its TLS certificate to a new one (any more than they manually log in to those servers to deploy other changes). At the same time, if the only bit of automation involved in TLS certificate management is deploying a TLS certificate across the fleet (once you have it) I think most people would be comfortable still calling that (more or less) 'manual' TLS certificate management.

As a system administrator who used to deal with TLS certificates (back then I called them SSL certificates) the fully manual way, I see three broad parts to fully automated management of TLS certificates:

  • automated deployment, where once you have the new TLS certificate you don't have to copy files around on a particular server, restart the web server, and so on. Put the TLS certificate in the right place and maybe push a button and you're done.

  • automated issuance of TLS certificates, where you don't have to generate keys, prepare a CSR, go to a web site, perhaps put in your credit card information or some other 'cost you money' stuff, perhaps wait for some manual verification or challenge by email, and finally download your signed certificate. Instead you run a program and you have a new TLS certificate.

  • automated renewal of TLS certificates, where you don't have to remember to do anything by hand when your TLS certificates are getting close enough to their expiry time. (A lesser form of automated renewal is automated reminders that you need to manually renew.)

As a casual thing, if you don't have fully automated management of TLS certificates I would say you have 'manual management' of them, because a human had to do something to make the whole process go. If I were trying to be precise and you had automated deployment but not the other two, I might describe you as having 'mostly manual management' of your TLS certificates. If you had automated issuance (and deployment) but no automated renewals, I might say you had 'partially automated' or 'partially manual' TLS certificate management.

(You can have automated issuance but not automated deployment or automated renewal and at that point I'd probably still say you had 'manual' management, because people still have to be significantly involved even if you don't have to wrestle with a TLS Certificate Authority's website and processes.)

I believe that at least some TLS Certificate Authorities support automated issuance of year-long certificates, but I'm not sure. Now that I've looked into this, I'm going to have to stop assuming that a website using a year-long TLS certificate is a reliable sign that it's not using automated issuance.

web/TLSCertsWhatIsManual written at 22:29:15; Add Comment

2024-03-11

Why we should care about usage data for our internal services

I recently wrote about some practically-focused thoughts on usage data for your services. But there's a broader issue about usage data for services and whether or not you have it. My sense is that for a lot of sysadmins, building things to collect usage data feels like accounting work and likely to lead to unpleasant and damaging things, like internal chargebacks (which have created various problems of their own). However, I think we should strongly consider routinely gathering this data anyway, for fundamentally the same reasons that you should collect information on what TLS protocols and ciphers are being used by your people and software.

We periodically face decisions both obvious and subtle about what to do about services and the things they run on. Do we spend the money to buy new hardware, do we spend the time to upgrade the operating system or the version of the third party software, do we need to closely monitor this system or service, does it need to be optimized or be given better hardware, and so on. Conversely, maybe this is now a little-used service that can be scaled down, dropped, or simplified. In general, the big question is do we need to care about this service, and if so how much. High level usage data is what gives you most of the real answers.

(In some environments one fate for narrowly used services is to be made the responsibility of the people or groups who are the service's big users, instead of something that is provided on a larger and higher level.)

Your system and application metrics can provide you some basic information, like whether your systems are using CPU and memory and disk space, and perhaps how that usage is changing over a relatively long time base (if you keep metrics data long enough). But they can't really tell you why that is happening or not happening, or who is using your services, and deriving usage information from things like CPU utilization requires either knowing things about how your systems perform or assuming them (eg, assuming you can estimate service usage from CPU usage because you're sure it uses a visible amount of CPU time). Deliberately collecting actual usage gives you direct answers.

Knowing who is using your services and who is not also gives you the opportunity to talk to both groups about what they like about your current services, what they'd like you to add, what pieces of your service they care about, what they need, and perhaps what's keeping them from using some of your services. If you don't have usage data and don't actually ask people, you're flying relatively blind on all of these questions.

Of course collecting usage data has its traps. One of them is that what usage data you collect is often driven by what sort of usage you think matters, and in turn this can be driven by how you expect people to use your services and what you think they care about. Or to put it another way, you're measuring what you assume matters and you're assuming what you don't measure doesn't matter. You may be wrong about that, which is one reason why talking to people periodically is useful.

PS: In theory, gathering usage data is separate from the question of whether you should pay attention to it, where the answer may well be that you should ignore that shiny new data. In practice, well, people are bad at staying away from shiny things. Perhaps it's not a bad thing to have your usage data require some effort to assemble.

(This is partly written to persuade myself of this, because maybe we want to routinely collect and track more usage data than we currently do.)

sysadmin/UsageDataWhyCare written at 22:47:02; Add Comment

2024-03-10

Scheduling latency, IO latency, and their role in Linux responsiveness

One of the things that I do on my desktops and our servers is collect metrics that I hope will let me assess how responsive our systems are when people are trying to do things on them. For a long time I've been collecting disk IO latency histograms, and recently I've been collecting runqueue latency histograms (using the eBPF exporter and a modified version of libbpf/tools/runqlat.bpf.c). This has caused me to think about the various sorts of latency that affect responsiveness and how I can measure them.

Run queue latency is the latency between when a task becomes able to run (or when it got preempted in the middle of running) and when it does run. This latency is effectively the minimum (lack of) response from the system and is primarily affected by CPU contention, since the major reason tasks have to wait to run is other tasks using the CPU. For obvious reasons, high(er) run queue latency is related to CPU pressure stalls, but a histogram can show you more information than an aggregate number. I expect run queue latency to be what matters most for a lot of programs, ones that mostly talk to things over some network (including talking to other programs on the same machine) and perhaps spend some of their time burning CPU furiously. If your web browser can't get its rendering process running promptly after the HTML comes in, or if it gets preempted while running all of that Javascript, this will show up in run queue latency. The same is true for your window manager, which is probably not doing much IO.

Disk IO latency is the lowest level indicator of things having to wait on IO; it sets a lower bound on how little latency processes doing IO can have (assuming that they do actual disk IO). However, direct disk IO is only one level of the Linux IO system, and the Linux IO system sits underneath filesystems. What actually matters for responsiveness and latency is generally how long user-level filesystem operations take. In an environment with sophisticated, multi-level filesystems that have complex internal behavior (such as ZFS and its ZIL), the actual disk IO time may only be a small portion of the user-level timing, especially for things like fsync().

(Some user-level operations may also not do any disk IO at all before they return from the kernel (for example). A read() might be satisfied from the kernel's caches, and a write() might simply copy the data into the kernel and schedule disk IO later. This is where histograms and related measurements become much more useful than averages.)

Measuring user level filesystem latency can be done through eBPF, to at least some degree; libbpf-tools/vfsstat.bpf.c hooks a number of kernel vfs_* functions in order to just count them, and you could convert this into some sort of histogram. Doing this on a 'per filesystem mount' basis is probably going to be rather harder. On the positive side for us, hooking the vfs_* functions does cover the activity an NFS server does for NFS clients as well as truly local user level activity. Because there are a number of systems where we really do care about the latency that people experience and want to monitor it, I'll probably build some kind of vfs operation latency histogram eBPF exporter program, although most likely only for selected VFS operations (since there are a lot of them).
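
As a sketch of what the core of such a program might look like (libbpf style, hooking only vfs_fsync() as an example; the map names and layout are my own invention, not taken from any existing exporter):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);    /* task PID */
    __type(value, u64);  /* entry timestamp in ns */
} start SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 32);
    __type(key, u32);
    __type(value, u64);  /* counts per log2(usec) bucket */
} hist SEC(".maps");

SEC("kprobe/vfs_fsync")
int BPF_KPROBE(vfs_fsync_enter)
{
    u32 pid = (u32)bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();

    bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
    return 0;
}

SEC("kretprobe/vfs_fsync")
int BPF_KRETPROBE(vfs_fsync_exit)
{
    u32 pid = (u32)bpf_get_current_pid_tgid();
    u64 *tsp = bpf_map_lookup_elem(&start, &pid);
    if (!tsp)
        return 0;

    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    bpf_map_delete_elem(&start, &pid);

    /* Crude log2 bucketing; libbpf-tools has nicer helpers for this. */
    u32 slot = 0;
    while (delta_us > 1 && slot < 31) {
        delta_us >>= 1;
        slot++;
    }
    u64 *count = bpf_map_lookup_elem(&hist, &slot);
    if (count)
        __sync_fetch_and_add(count, 1);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

Something on the user space side (or the eBPF exporter's configuration) then has to read the hist map and turn it into a Prometheus histogram, and covering more vfs_* operations means more kprobe/kretprobe pairs along these lines.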

I think that the straightforward way of measuring user level IO latency (by tracking the time between entering and exiting a top level vfs_* function) will wind up including run queue latency as well. You will get, basically, the time it takes to prepare and submit the IO inside the kernel, the time spent waiting for it, and then after the IO completes the time the task spends waiting inside the kernel before it's able to run.

Because of how Linux defines iowait, the higher your iowait numbers are, the lower the run queue latency portion of the total time will be, because iowait only happens on idle CPUs and idle CPUs are immediately available to run tasks when their IO completes. You may want to look at io pressure stall information for a more accurate picture of when things are blocked on IO.

A complication of measuring user level IO latency is that not all user visible IO happens through read() and write(). Some of it happens through accessing mmap()'d objects, and under memory pressure some of it will be in the kernel paging things back in from wherever they wound up. I don't know if there's any particularly easy way to hook into this activity.

linux/SystemResponseLatencyMetrics written at 23:31:46; Add Comment

2024-03-09

Some thoughts on usage data for your systems and services

Some day, you may be called on by decision makers (including yourself) to provide some sort of usage information for things you operate so that you can make decisions about them. I'm not talking about system metrics such as how much CPU is being used (although for some systems that may be part of higher level usage information, for example for our SLURM cluster); this is more on the level of how much things are being used, by who, and perhaps for what. In the very old days we might have called this 'accounting data' (and perhaps disdained collecting it unless we were forced to by things like chargeback policies).

In an ideal world, you will already be generating and retaining the sort of usage information that can be used to make decisions about services. But internal services aren't necessarily automatically instrumented the way revenue generating things are, so you may not have this sort of thing built in from the start. In this case, you'll generally wind up hunting around for creative ways to generate higher level usage information from the low level metrics and logs that you do have. When you do this, my first suggestion is to write down how you generated your usage information. This probably won't be the last time you need to generate usage information, and also if decision makers (including you in the future) have questions about exactly what your numbers mean, you can go back and look at exactly how you generated them to provide answers.

(Of course, your systems may have changed around by the next time you need to generate usage information, so your old ways don't work or aren't applicable. But at least you'll have something.)

My second suggestion is to look around today to see if there's data you can easily collect and retain now that will let you provide better usage information in the future. This is obviously related to keeping your logs longer, but it also includes making sure that things make it to your logs (or at least to your retained logs, which may mean setting things to send their log data to syslog instead of keeping their own log files). At this point I will sing the praises of things like 'end of session' summary log records that put all of the information about a session in a single place instead of forcing you to put the information together from multiple log lines.

(When you've just been through the exercise of generating usage data is an especially good time to do this, because you'll be familiar with all of the bits that were troublesome or where you could only provide limited data.)

Of course there are privacy implications of retaining lots of logs and usage data. This may be a good time to ask around to get advance agreement on what sort of usage information you want to be able to provide and what sort you definitely don't want to have available for people to ask for. This is also another use for arranging to log your own 'end of session' summary records, because if you're doing it yourself you can arrange to include only the usage information you've decided is okay.

sysadmin/UsageDataSomeBits written at 22:10:39; Add Comment
