Wandering Thoughts archives

2016-01-26

Low level issues can have quite odd high level symptoms (again)

Let's start with my tweet from yesterday:

So the recently released Fedora 22 libreswan update appears to have broken IPSec tunnels for me. Good going. Debugging this will be hell.

This was quite consistent: if I installed the latest Fedora 22 update to libreswan, my IPSec based point to point tunnel stopped working. More specifically, my home end (running Fedora 22) could not do an IKE negotiation with my office machine. If I reverted to the older libreswan version, everything worked. This is exactly the sort of thing that is hell to debug and hell to get as a bug report.

(Fedora's libreswan update jumped several point versions, from 3.13 to 3.16. There could be a lot of changes in there.)

Today I put some time into trying to narrow down the symptoms and what the new libreswan was doing. It was an odd problem, because tcpdump was claiming that the initial ISAKMP packets were going out from my home machine, but I didn't see them on my office machine or even on our exterior firewall. Given prior experiences I suspected that the new version of libreswan was setting up IPSec security associations that were blocking traffic and making tcpdump mislead me about whether packets were really getting out. But I couldn't see any sign of errant SPD entries and using tcpdump at the PPPoE level suggested very strongly that my ISAKMP packets really were being transmitted. But at the same time I could flip back and forth between libreswan versions, with one working and the other not. So in the end I did the obvious thing: I grabbed tcpdump output from a working session and a non-working session and started staring at them to see if anything looked different.

Reading the packet dumps, my eyes settled on this (non-working first, then working):

PPPoE [ses 0xdf7] IP (tos 0x0, ttl 64, id 10253, offset 0, flags [DF], proto UDP (17), length 1464)
   X.X.X.X.isakmp > Y.Y.Y.Y.isakmp: isakmp 2.0 msgid 00000000: parent_sa ikev2_init[I]:
   [...]

PPPoE [ses 0xdf7] IP (tos 0x0, ttl 64, id 32119, offset 0, flags [DF], proto UDP (17), length 1168)
   X.X.X.X.isakmp > Y.Y.Y.Y.isakmp: isakmp 2.0 msgid 00000000: parent_sa ikev2_init[I]:
   [...]

I noticed that the packet length was different. The working packet was significantly shorter and the non-working one was not too far from the 1492 byte MTU of the PPP link itself. A little light turned on in my head, and some quick tests with ping later I had my answer: my PPPoE PPP MTU was too high, and as a result something in the path between me and the outside world was dropping any too-big packets that my machine generated.

(It's probably the DSL modem and DSL hop, based on some tests with traceroute.)
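
For reference, the kind of ping test involved is simple. Here's a minimal sketch on Linux (iputils ping), where Y.Y.Y.Y stands in for the far end of the tunnel and the two sizes come from the packet dumps above; '-M do' sets DF and forbids local fragmentation, and '-s' is the ICMP payload size, so the resulting IP packet is 28 bytes bigger:

ping -c 3 -M do -s 1436 Y.Y.Y.Y    # probe at 1464 bytes total, the size of the dropped ISAKMP packet
ping -c 3 -M do -s 1140 Y.Y.Y.Y    # probe at 1168 bytes total, the size that was getting through

Walking the '-s' value up and down between the two brackets the largest packet that actually makes it out.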

The reason things broke with the newer libreswan was that the newer version added several more cipher choices, which pushed the size of the initial ISAKMP packet over the actual working MTU. With the DF bit set in the UDP packet, there was basically no chance of the packet getting fragmented when it hit wherever the block was; instead it was just summarily dropped.

(I think I never saw issues with TCP connections because I'd long ago set a PPPoE option to clamp the MSS to 1412 bytes. So only UDP traffic would be affected, and of course I don't really do anything that generates large UDP packets. On the other hand, maybe this was a factor in an earlier mysterious network problem, which I eventually made go away by disabling SPDY in Firefox.)
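
(As an aside, here is a hedged sketch of how such clamping is commonly done on Linux; this is not necessarily the exact option I set. rp-pppoe's pppoe.conf has a CLAMPMSS=1412 setting, and the often-seen iptables version for forwarded connections looks something like this, with the interface name being illustrative:

iptables -t mangle -A FORWARD -o ppp0 -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --set-mss 1412

This rewrites the MSS in TCP SYNs forwarded out the PPP link so that the resulting TCP segments fit inside the smaller real MTU.)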

What this illustrates for me, once again, is that I simply can't predict what the high level symptoms are going to be for a low level network problem. Or, more usefully, given a high level problem I can't even be sure if it's actually due to some low level network issue or if it has a high level cause of its own (like 'code changes between 3.13 and 3.16').

Sidebar: what happens with my office machine's ISAKMP packets

My office machine is running libreswan 3.16 too, so I immediately wondered if its initial ISAKMP packets were also getting dropped because of this (which would mean that my IPSec tunnel would only come up when my home machine initiated it). Looking into this revealed something weird: while my office machine is sending out large UDP ISAKMP packets with the DF bit set, something is stripping DF off and then fragmenting those UDP packets before they get to my home machine. Based on some experimentation, the largest inbound UDP packet I can receive un-fragmented is 1436 bytes. The DF bit gets stripped regardless of the packet size.

(I suspect that my ISP's DSL PPP touchdown points are doing this. It's an obvious thing to do, really. Interestingly, the 1436 byte size restriction is smaller than the outbound MTU I can use.)

IKEAndMTUIssue written at 01:16:22; Add Comment

2016-01-18

Today I learned that a syslog server can be very silent on the network

Let's start with my tweet:

TIL that a UDP based syslog server can be so quiet that it falls out of switch MAC tables, causing syslog msgs to it to flood everywhere.

We have a central syslog server that people running 'sandbox' machines point their servers at to aggregate all of their syslog data at a single point (and off individual servers). Mostly because we're old fashioned, it uses the plain old UDP based method of syslog forwarding where client machines simply fire UDP packets in its direction.
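
(For concreteness, the client side of this is typically a single line in syslog.conf or rsyslog.conf; this is a hedged sketch and the server name is a placeholder, not our real one:

*.*    @syslog.example.com

A single '@' means UDP forwarding to port 514; in rsyslog, '@@' would switch it to TCP.)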

Today, due to various recent events and questions, I was running tcpdump on one of the machines here that's on the same network as the syslog server, partly to see what kind of crud I would discover swirling around on the network (there is always some). To my surprise I saw a whole burst of syslog traffic that was destined for the syslog server (and it wasn't broadcast, either); the traffic was coming from a bunch of machines behind one firewall. I scratched my head for a bit until the penny dropped that the syslog server's MAC had fallen out of the switches' MAC to port tables.

The direct reason this could happen is that a UDP based syslog server doesn't naturally send out any packets. Unlike TCP streams, where it would at least be sending out ACKs and so refreshing the switch MAC tables, receiving UDP streams is entirely passive. The machine does do some things that generate packets periodically (such as NTP), but apparently it was doing them so infrequently that at least some switches timed its MAC entry out and started flooding traffic far enough to reach my observer machine. At the same time, the gateway for the sandbox network that was sending the syslog traffic didn't time out its ARP entry, so it never re-ARPed and thus never provoked the syslog server into generating packets that would re-prime the switches.

(Or perhaps the outgoing packets it did generate didn't flow over enough of the switches involved.)

There are two lessons I draw from this. First, MAC table timeouts may vary significantly across different machines. They can vary not only in how long they are but also in what keeps a MAC table entry active. Either the switches timed out their MAC entries much faster than the gateway did, or the gateway's entries stayed alive when they were used for outgoing traffic while the switches' entries didn't.

(I can come up with at least a justification for why a switch should be fairly aggressive about aging out MACs that it hasn't seen traffic from. Incorrect switch MAC tables can do significant damage, so better safe than sorry if something is silent.)

The second is that it may take only a single switch losing a MAC entry to cause significant flooding. If a machine doesn't generate broadcast traffic, many switches may not have MAC entries for it in the first place (if traffic to or from it never transits them). If a top level switch loses the MAC entry, it will flood the traffic to all of its ports and thus to many of those switches, which then flood it down to all of their ports and so on. The narrower the normal traffic flow is (for example, if it's mostly between one gateway and the machine), the fewer switches there are that have the MAC association in the first place and thus are in a position to stop such a flood.

(There are probably all sorts of interesting dynamics in this situation in terms of where outgoing traffic from the 'mostly silent' machine goes, what switches it passes through, and thus whether or not it will cause all of the relevant switches to pick up MAC entries again. The moral here is that nothing beats forcing the machine to generate some broadcast traffic in some way. There's direct traffic generation with eg arping, or there's just pinging a nonexistent IP every few minutes to force an ARP broadcast.)
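
As a concrete sketch of that last idea (the interface, gateway address, and interval here are all made up), a cron entry in /etc/cron.d format on the syslog server could be as simple as:

*/5 * * * *  root  arping -c 1 -I eth0 192.168.10.1 >/dev/null 2>&1

The outgoing ARP request is broadcast, so it floods through the network and lets every switch along the way re-learn the syslog server's MAC.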

SyslogAndSilence written at 22:43:38; Add Comment

A limitation of tcpdump is that you can't tell in from out

Suppose that you are running tcpdump on a network that's experiencing problems, on a machine which you know is supposed to be sending out broadcast ARP requests. When you do something that provokes an ARP request, you see two or three ARP broadcast packets from the machine in close succession (but not back to back; timestamps say there's a little bit of time between each). That sounds okay, doesn't it? Or at least it's not too crazy. There's a plausible case for rapid repeated ARPs where the first request didn't get an immediate reply, and Unixes probably behave differently here, so experience on, say, Linux doesn't necessarily tell you what to normally expect on OpenBSD or Illumos.

Except that there's a problem here. As far as I know, there is no way to tell from tcpdump output whether you're seeing these packets because the system is transmitting them or because it's receiving them. Of course normally you shouldn't (re-)receive packets that your system initially transmitted, but, well, network loops can happen.

Some versions of tcpdump have a switch to control whether it listens to input packets, output packets, or both. On OpenBSD this is -D, on different versions of Linux it is either -P (Ubuntu, CentOS) or -Q (Fedora); FreeBSD doesn't seem to have an option for this. Of course to use this option (if it's available) you have to remember that some sort of echo-back situation might be happening, but at least you can check for it.
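
For instance, on a Linux tcpdump that takes -Q, checking each direction separately looks like this (the interface name is illustrative):

tcpdump -i eth0 -Q in  arp     # only ARP packets this machine receives
tcpdump -i eth0 -Q out arp     # only ARP packets this machine transmits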

This is definitely something that I'm going to have to try to remember for future network troubleshooting. Sadly it's not as simple as always using 'in' initially, because often you want to see both what the machine is sending and what it's getting back; you just would like to be able to tell them apart immediately.

(I believe that this is a limitation of the underlying kernel interfaces that tcpdump uses, in that most or all implementations simply don't tag packets with whether they're 'out' or 'in' packets.)

(This is just one of a number of ways that I've found to be misled by not looking sufficiently closely at what tcpdump seems to be telling me. Eg tcpdump -p versus not, various firewall and IPSec settings causing received packets to be dropped or even IPSec re-materializing packets, and not looking at MAC addresses (also).)

TcpdumpInOutLimitation written at 00:46:47; Add Comment

2016-01-17

My theory on how a network loop caused the problem we observed

Yesterday I described how a network loop on a wiring closet leaf switch in our port isolated network caused replies to the gateway's ARP requests to usually (although not always) disappear. This is a fairly weird and mysterious symptom, but as it happens I have a wild theory about why (or how) it happened.

The big characteristic of a port isolated network is that in fact most unicast traffic is supposed to disappear if you try to send it out. Hosts inside the port isolated network are only allowed to talk to hosts outside, while traffic to other hosts inside is dropped. Mechanically this is implemented inside the switches by rules on what ports are allowed to talk to each other with unicast traffic; non-uplink ports can only talk to the uplink port, while the uplink port can talk to anything.

(Broadcast traffic is flooded through the entire network, as is traffic for unknown MACs.)

This creates a very simple way to cause unicast traffic to be dropped on a port isolated network: do something to cause the network to believe that the destination MAC is an 'inside' host. So how do you do that? Well, switches learn MAC associations based on what port they see inbound traffic from a MAC on. So suppose you have a network loop at the bottom of your network hierarchy, and an 'outside' port sends out a broadcast packet. The packet will cascade down your tree, with each switch learning that the MAC is found on the uplink port, but then it bottoms out at the loop and gets re-injected into a leaf switch. As we've seen, this causes the leaf switch to change the MAC to port association; it then passes the broadcast back out its uplink port, and the packet re-floods through the entire network again with all switches flipping their MAC association to 'oh, it's coming from an internal port'. If an internal host sends out a unicast packet to that MAC shortly afterwards (say as an almost-instant reply to an ARP request), the switches will see this as an 'inside to inside' packet and drop it due to port isolation. The next packet from the outside host will start resetting the MAC port associations in switches back to recognizing it as an outside host, although it probably won't reach all of them.

This is a nice theory, but what I don't have an explanation for is why the network didn't blow up with endlessly repeated broadcast packets (such as broadcast ARP requests or DHCP queries). Looking back we saw some things that might have been signs of repeated packets, but certainly there was no flood of traffic; that would have been a glaringly damning sign right off the bat.

(If this explanation is correct, it also suggests a monitoring measure. If you can monitor MAC to port associations on your top level port isolated switch, just alarm on the MAC of any 'outside' machine switching to an 'inside' port, or more generally any MAC association flipping back and forth between the uplink port and a non-uplink port. Sadly I suspect that we can't do this on our current switches.)
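
If a switch does expose the standard bridge MIB over SNMP, a minimal sketch of such a check could start from something like this (the switch name and community string are placeholders):

snmpwalk -v2c -c public switch-top BRIDGE-MIB::dot1dTpFdbPort

That dumps the switch's MAC-to-port forwarding table; a monitoring script would then verify that the entries for the gateway and other outside machines still point at the uplink port, and alarm if any of them don't.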

NetworkLoopWhyVanishingARP written at 02:02:07; Add Comment

2016-01-15

Network loops can have weird effects (at least sometimes)

Today we had a weird network problem on one of our most crucial networks, our port-isolated user machine network; this is the wired network used to connect laptops, most machines in people's offices, and so on. The only failure we could really see was that when the gateway firewall sent out a (broadcast) ARP request for a given IP, it would not see the (unicast) ARP reply from your machine. If your machine did something that caused the gateway to pick up its MAC, everything worked. Manually delete the ARP entry on the gateway, and the problem would be back. And rarely (often taking many minutes) an ARP reply would make it to the gateway and poof, everything would work again for your machine for a while until your ARP entry fell out of the gateway's ARP table.

There were several oddities about this. The biggest is that only ARP replies were affected; you could, for example, ping back and forth between your machine and elsewhere as long as the gateway had you in its ARP table. Nor did we see any unusual network traffic during this. We've seen our networks melt down on occasion (including this one), with things like traffic floods, rogue DHCP servers, and packet echoes, but nothing odd showed up in tcpdump from multiple vantage points. If anything maybe there was less extraneous broadcast babbling than usual.

Given 'some packets are vanishing', we initially suspected malfunctioning switches; we've seen various downright weird things when this happens. So we swapped in spares for the core top level switches (they were basically the only common point in the switch fabric between all of the machines that were seeing problems) and of course nothing happened. It wasn't the gateway, because we could reproduce the problem with a number of other machines in the same top level network position (such as the DHCP server). We scratched our heads a lot, or at least I did, and eventually resorted to brute force instead of trying to come up with theories about what had broken how: as I mentioned on Twitter, we started systematically disconnecting bits of the network from the top down to see what had to be connected to make things go wrong.

As you already know from the title of this entry, the problem turned out to be a network loop. At the very periphery of the network (in one of the department's office areas), someone had plugged a little 5-port switch into two network drops at once, thereby creating a loop between two ports on one of our wiring closet leaf switches. This simple single-switch cross-connect was the root cause of all of our network problems.

Looking back at it after the fact, I can construct a theory about how this cross-connect caused the observed problems (although I have no idea if it's correct). But at the time I wouldn't have at all expected to see these symptoms from a network loop. So my moral for today is that the symptoms of network loops can be quite weird and not what I expect at all.

(For reasons beyond the scope of this entry, we do not have STP enabled on our switches. Under normal circumstances it's unnecessary, as all of our networks are strict (acyclic) trees.)

NetworkLoopsAreWeird written at 23:36:18; Add Comment

2016-01-13

What I want out of backups for my home machine (in the abstract)

Over time I've come to realize that part of my problem with home backups is that I have somewhat contradictory desires for them. Today I feel like writing those desires down, partly to get everything straight in my own head.

What I want is all of:

  • multiple backups going back in time, not just one backup that is 'the current state'. Being able to go back in time is often handy (and it's reassuring).

    (This can be done just by having several independent disks, each with a separate backup, and in some ways that's the best option. But I don't want to buy that many disks and disk enclosures for my home backups.)

  • that these multiple backups be independent, so that damage to a single backup still leaves me able to recover all the files from another one (assuming the files were present in both).

    (This implies that I don't want deduplication or a full plus incrementals approach, because either means that a single point of damage affects a whole series of backups.)

  • these backups should be stored in some reasonably compressed form so they use up relatively little space. This is important in order to get multiple independent copies (which imply that my (raw) backups will take much more space than the live data).

  • the backups should normally be offline. Online backups are only one accident or slip-up away from disappearing.

  • making backups should not kill the performance of my machine, because otherwise that's a disincentive to actually make them.

  • the process of creating the current backup should be something that can be done incrementally, so that I can start it, decide that it's having too much of an impact or is taking too long, and stop it again without throwing away the progress to date.

  • backups should be done in a way that captures even awkward, hard to capture things like holes in files, special ACLs, and so on. I consider traditional dump to be ideal for this, although dump is less than ideal if I'm backing up a live filesystem.

If I ignore the last issue and 'backups should be aggressively compressed', it sounds like what I want is rsync to separate directory trees for each backup run. Rsync is about the best technology for being able to interrupt and resume a backup that I can think of, although perhaps modern Linux backup tools can do it too (I haven't looked at them). I can get some compression by choosing eg ZFS as the filesystem on the backup target (that would also get me integrity checks, which I'd like).
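
A single backup run in that scheme might look something like this hedged sketch, where the source, the destination path on the external disk, and the bandwidth limit are all made up:

rsync -aHAX --partial --bwlimit=20000 /home/ /backup/2016-01-13/home/

Each run gets its own dated directory tree; -H, -A, and -X preserve hard links, ACLs, and extended attributes on top of -a, --partial keeps partially transferred files so an interrupted run can simply be re-run, and --bwlimit throttles the copy to limit its impact.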

If I ignore being able to interrupt and resume backups, dump doing level 0 full backups to files that are compressed with the standalone compressor of my choice is not a bad choice (and it's my default one today). I think it has somewhat more load impact than other options, though.
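
Concretely, that looks something like this (a hedged sketch; the device and file names are illustrative):

dump -0 -f - /dev/sda3 | xz -2 >/backup/home-20160113.dump.xz
xz -dc /backup/home-20160113.dump.xz | restore -t -f -

The second line is a later check that lists the backup's contents; an actual interactive restore would use 'restore -i' instead of 'restore -t'.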

The actual performance impact of making backups depends partly on the backup method and partly on how I connect the backup target to my machine (there are (bad) options that have outsized impacts). And in general Linux just doesn't seem to do very well here for me, although perhaps rsync could be made to deliberately run slowly to lower its impact.

(For my home machine, backing up to external disk is probably the only solution that I'm happy with. Over the Internet backups have various issues, including my upstream bandwidth.)

HomeBackupWants written at 23:23:52; Add Comment

Your system's performance is generally built up in layers

There are many facets and many approaches to troubleshooting performance issues, but there are also some basic principles that can really help to guide your efforts. One of them, one so fundamental that it often doesn't get mentioned, is that your system and its performance are built up in layers, and thus to troubleshoot system performance you want to test and measure each layer, working upwards from the base layers (whatever they are).

(A similar 'working upwards' process can be used to estimate the best performance possible in any particular environment. This too can be useful, for example to assess how close to it you are or if the best possible performance can possibly meet your needs.)

To make this more concrete, suppose that you have an iSCSI based fileserver environment and the filesystems on your fileservers are performing badly. There are a lot of moving parts here; you have the physical disks on the iSCSI targets, the network link(s) between the fileservers and the iSCSI targets, the iSCSI software stack on both sides, and then the filesystem that's using the disks on the fileserver (and perhaps a RAID implementation on the iSCSI targets). Each of these layers in the stack is a chance for a performance problem to sneak in, so you want to test them systematically:

  • how fast is a single raw disk on the iSCSI targets, measured locally on a target?
  • how fast are several raw disks on the iSCSI targets when they're all operating at once?
  • if the iSCSI targets are doing their own RAID, how fast can that go compared to the raw disk performance?

  • how fast is the network between the fileserver and the iSCSI targets?

  • how fast is the iSCSI stack on the initiator and targets? Some iSCSI target software supports 'dummy' targets that don't do any actual IO, so you can test raw iSCSI speed. Otherwise, perhaps you can swap in a very fast SSD or the like for testing purposes.

  • how fast can the fileserver talk to a single raw disk over iSCSI? To several of them at once? To an iSCSI target's RAID array, if you're using that?
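
As a concrete illustration of the first couple of layers, crude versions of these tests might look like this (a hedged sketch; device names and host names are illustrative, and serious testing wants better tools and more care):

dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct   # raw read speed of one disk, run locally on a target
iperf3 -s                    # network speed: run this on the iSCSI target...
iperf3 -c target1 -t 30      # ...and this on the fileserver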

By working through the layers of the stack like this, you have a much better chance of identifying where your performance is leaking out. Not all performance problems are neatly isolated to a single layer of the stack (there can be all sorts of perverse interactions across multiple layers), but many are and it's definitely worth checking out first. If nothing else you'll rule out obvious and easily identified problems, like 'our network is only running at a third of the speed we really ought to be getting'.

Perhaps you think that this layering approach should be obvious, but let me assure you that I've seen people skip it. I've probably skipped it myself on occasion, when I felt I was in too much of a hurry to really analyze the problem systematically.

PS: when assessing each layer, you probably want to look at something like Brendan Gregg's USE Method in addition to measuring the performance you can get in test situations.

PerformanceInLayers written at 01:49:17; Add Comment

2016-01-04

How I do per-address blocklists with Exim

Yesterday I wrote about the power of per-address blocklists. This is all well and good, but part of the challenge is actually implementing them. We use Exim, so my implementation is for that (and it's not entirely original with me; I believe I copied much of this approach from an ex-co-worker).

Since we want to reject at SMTP time, we need to do this in one of Exim's SMTP ACLs. Since we want to reject on a per-address basis, this has to go in the RCPT TO ACL. Exim has good general support for blocklists based on IP origin, MAIL FROM, and so on, so the important trick is to figure out how to easily support per-address ones. We use the simplest brute force approach: a directory hierarchy.

# The top level directory for per-local-address blocks.
BLOCKSDIR       = CFGDIR/blocks
# The per-user directory. Using $local_part is safe
# because we restrict this to valid addresses.
UBLOCKDIR       = BLOCKSDIR/${lc:$local_part}

Then in the actual RCPT TO ACL, we need some rules to match against this. Assuming that you have already rejected any non-local, non-valid addresses:

# Actual per-address block files for hosts and senders are in UBLOCKDIR:
#   UBLOCKDIR/hosts
#   UBLOCKDIR/senders
deny
  hosts = ${if exists{UBLOCKDIR/hosts} {+ignore_unknown : +ignore_defer : UBLOCKDIR/hosts}}
  message = mail from host $sender_host_address not accepted by <$local_part@$domain>.
  log_message = blocked by personal hosts blacklist.

deny
  senders = ${if exists {UBLOCKDIR/senders} {UBLOCKDIR/senders}}
  message = mail from <$sender_address> not accepted by <$local_part@$domain>.
  log_message = blocked by personal senders blacklist.

As I discovered the hard way, the +ignore_unknown and +ignore_defer are very important for making the host-based blocklist work the way you expect it to.

You could use another naming scheme for how to find the per-address hosts and senders blocklist files, but this one gives you a simple two-level namespace; you wind up with, say, /var/local/mail/blocks/cks/hosts. In theory it is easy enough to make subdirectories for particular users and then give them to the users to maintain on their own. Even without that, it keeps things nicely organized and lets you see at a glance (or with an ls) which local addresses even theoretically have some filtering.

Now, I have a confession: I don't know if this is actually secure enough to allow arbitrary users to directly manipulate their hosts and senders files (and our current setup doesn't directly expose these directories to users). Exim host and address list file expansion allows people to insert anything that can normally occur in host and address lists, and this is quite powerful. Probably you don't want to allow users to directly edit these files but instead force them to go through some interface or preprocessing step that limits what they can do. At the same time, giving people as much power here as is safe is nice, because you can do a lot of handy things with wildcards and even regular expressions.

(Locally, we have one persistent spammer that hits one of our administrative addresses using changing domains with a pattern that we wound up writing a regular expression to match. I was very happy to discover that we could actually do this, even in a host list read from a file; it was handy to make the spammer go away.)
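
To give a flavour of what these files can hold, here's a hypothetical blocks/cks/senders file (the addresses and domains are made up); Exim address lists accept plain addresses, *@domain wildcards, and ^-anchored regular expressions:

spammer@annoying.example.com
*@mass-mailer.example.net
^advert[0-9]+@spam-domain-[0-9]+\.com$

A hosts file can similarly mix IP addresses, CIDR ranges like 192.0.2.0/24, and hostname patterns like *.bulk-mailer.example.org, which is exactly why the +ignore_unknown and +ignore_defer items earlier matter.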

PS: If you put this in the RCPT TO ACL before you verify that the local part is a valid username, you'll want to pre-filter things so that you block local parts with 'dangerous in filename' characters like .. and /.

EximPerUserBlocklists written at 01:24:45; Add Comment

2015-12-31

How I've wound up taking my notes

If you're going to take and keep notes, you need some way to actually do this. I'm not going to claim that my system is at all universal; it's simply what has worked so far for me. I'm a brute force Unix kind of person, so my system is built on simple basic Unix things.

My basic form of note taking is plain ASCII in Unix files. I don't version control them (although maybe I should), but instead I treat them as logs where I only append new information to the end rather than rewriting existing sections (although sometimes I'll add an update note directly in place next to some information I later found out was incorrect). The honest reason why I take this append only approach is that it's easier to write, but I can justify it as creating a useful record of my thinking at the time.

(The exception to this is when I'm writing and testing things like build instructions or migration checklists. If my output is going to be documentation for other people, it obviously has to get rewritten in place so the end result is coherent.)

Some but not all of the time I will date new additions to files (in simple '2015-12-30' form). This helps me keep track of when I did something and also how long it's been since I worked on something. Although I often used to just summarize the commands I was using and the output I was getting, I have tried lately to literally copy and paste both commands and output in. I've found that this is handy for being lazy when repeating things; I can just copy-and-paste commands from the file into a terminal window.

I use file names that make sense to me, although not necessarily to anyone else. Typical file names are things like bsdtcp-restores and cs8-oldmail-weirdness. Often I'll put a summary of the project or issue at the top of a file, so that if (or when) I look at it much later I can remember what it was about. For lab notebook stuff I tend to put the date of the initial incident in the filename, but I'm kind of inconsistent in this.

I've found it useful to segregate my notes files into more or less three directories. One is for (active) projects, one is for general notes on various things, and one is for lab notebook stuff done during (semi-)crisis situations. In all of those directories I have subdirectories for files that are complete, or over, or now obsolete for various reasons. All of these old files remain valuable so I keep them, but I try to keep the top level directories only having current things (especially for the projects directory). I sometimes rename files when I move them into subdirectories because I realize that my initial file name is not a good one for future reference (often it turns out to be too generic).

I don't currently keep any of these files under version control. Maybe I should, but at the moment it feels like overkill given that I never want to delete things and I don't really have a situation where I want 'revert to (or look at) previous version of a file'. Many things are implicitly versioned just by me having multiple files and starting new ones for new situations, even if I copy things from an older file.

(For example, the test plan for upgrading our mail server from Ubuntu 10.04 to 12.04 is in a different file than the test plan for upgrading it from 12.04 to 14.04, even though I created the latter from the former.)

As to where these files all live: their master location is in my home directory on our fileservers. On the rare occasion that I need to refer to or work on one of them when our fileservers or our Linux servers are going to be down, I rsync the relevant file to my office workstation and work on it there, then rsync it back afterwards. These are fortunately not the sort of notes that I'd be looking at if our entire infrastructure fell down. Putting them in my fileserver home directory means that they're automatically available on all of our Unix servers and they get backed up via our backup system and so on.

As for the editor I use, well, vi is my sysadmin editor. But the choice of editor doesn't really matter here (and sometimes I use others).

PS: I'm lucky enough that none of my notes files need to be kept so secret that they need to be encrypted. I don't know what I'd do if I needed that for some of my notes, and given that encryption is generally a pain I hope that I never have to find out.

HowITakeNotes written at 01:25:19; Add Comment

2015-12-29

Take notes when (and as) you do things and keep them

One of the lessons that I have been learning over and over again over time and in different contexts is that I should take notes about what I'm doing and then keep them. As you can tell I've written before about this in various specific contexts, but I keep not entirely learning and writing down the general lesson, which is that this is a good idea basically all the time. Really, there are very few situations where taking good notes and then keeping them is not a good idea.

So that is my big piece of advice:

Take notes as you do things and then keep them after you're done, even if you don't think you're going to want them later.

(Like all general pieces of advice there are all sorts of specific exceptions.)

Fortunately I mostly haven't learned this the hard way. I've tended to write things down as I was doing them just to keep track of where I was (as I get interrupted by moths) and I'm naturally a packrat with files, so I've wound up keeping all of these notes files. Where I have learned the hard way is in how much detail I've often (not) put in many of those notes. Unless I thought I was writing them for my future self (which I knew I was in a few cases), I tended to only put down what I needed at the time to jog my current memory. This is of course far less than what I wound up wanting much later when I was trying to remember what exactly I'd done and precisely what the results were.

(Having stubbed my toes on this, I now try to include the specific command lines, exact output, and so on instead of just writing general notes. My co-workers periodically asking me for specifics has also helped a bunch; there is nothing like other people for showing you your own blind spots.)

There are some things I do that are too small for notes, of course, but certainly anything that takes me more than an hour or two should have notes, regardless of what it is. Whether it was looking into some issue, working out where information was, testing something, or making some change, sooner or later I'm probably going to want to do something like it again or at least look back at what I did and what I saw.

(And even if I don't think I'm ever going to need what I'm doing again, well, writing things down is relatively cheap and not writing them down can be very annoying, as I keep reminding myself when I write some entries here. It's better to err on the side of writing too much down and then having to search through it later.)

TakeAndKeepNotes written at 02:08:54; Add Comment

