Wandering Thoughts archives

2016-09-30

Some git repository manipulations that I don't know how to do well

For a while now I've been tracking upstream git repositories, and for some of them carrying local changes that I rebase on top. The rebasing has been pretty easy and a clear win. But recently I've run into a number of more complicated cases where I'm not sure that I'm using git in the best way, so here is a collection of problems that I'm semi-solving.

I keep local tracking copies of a few upstream repos that are rebased periodically, such as this one. I have no local changes in such tracking repositories (i.e. my local master is always the same as origin/master). When the upstream rebases, a plain git pull will fail, either telling me that it can't fast-forward or demanding that I merge things. In the past I've just deleted the entire repo and re-cloned as the simple way out. Recently I've constructed the manual fix:

git pull --ff-only
[fails, but it's fetched stuff]
git checkout -f -B master origin/master

It would be nice to have the repo set up so that a plain 'git pull' would do this, ideally only when it's safe. I could script this, but there are only some repos that I want this behavior for; for the others, either I have local changes or an upstream rebase should never happen.

(The git pull manpage has some interesting wording that makes it sound like asking for a rebase here might do the right thing, but I only just found that and contriving a test case is not trivial. Or maybe I want 'git pull -f'. And now that I've done some web searches, apparently what I want is 'git reset --hard origin/master'. Git is like Perl; sometimes it has so many different paths to the same result.)
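Since I mentioned scripting it, here's a minimal sketch of what that script might look like; it assumes master is checked out and that I never have local commits in these repos (which is the whole premise):

git fetch origin
if git merge-base --is-ancestor master origin/master; then
    # upstream merely moved forward; a plain fast-forward is safe
    git merge --ff-only origin/master
else
    # upstream rebased; discard our master and match origin exactly
    git reset --hard origin/master
fi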

Next, I'm tracking the repo for my favorite RAW development software and have some little local fixes added on top. Normally I build from the latest git tip, but that's sort of in flux right now so I want to switch to the release branch but still add my changes on top. I'm doing this like so:

git checkout darktable-2.0.x
git cherry-pick origin/master..master

I think that using git cherry-pick here instead of some form of git rebase is probably the correct approach and this is the one case where I'm doing things right.
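(One check I like before the cherry-pick is listing exactly what 'origin/master..master' covers, so I know what's about to be replayed:

git log --oneline origin/master..master

If that shows anything besides my own little fixes, something is off.)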

I'm tracking the main repository for my shell and applying my changes on top of it. However, there is a repository of interesting changes that I want to try out; of course I still want my local changes on top of this. What I did, I think, was 'git pull --rebase /the/other/repo', but I believe that probably wasn't the right approach. I suspect that what I really wanted to do was add the second repo as an additional remote, switch to it, and either cherry-pick or rebase my own changes on top.

Except, well, looking back at my notes about working with Github PRs, it strikes me that this is basically the same situation (except that this repo isn't an explicit PR). I should probably follow that process instead of hacking my way around things the way I did this time.
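If I redo it the proper way next time, I expect it looks something like this sketch, where 'interesting' and 'trying-changes' are remote and branch names I've made up for illustration:

git remote add interesting /the/other/repo
git fetch interesting
# start a branch at the tip of the repo of interesting changes
git checkout -B trying-changes interesting/master
# and replay my local changes on top of it
git cherry-pick origin/master..master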

Finally, we combine the two situations: I'm building on top of the repo of interesting changes and it rebases itself. Now I want to replace the old version with the new version but reapply my changes on top. I'm not going to try to write down the process I used this time, because I'm convinced it's not optimal; I basically combined the reset-origin and cherry-pick processes, using explicit commit hashes that I recorded beforehand for the cherry-picking. Based on Aristotle Pagaltzis's comment on my Github PR entry, I think I want something from this Stackoverflow Q&A, but I need to read the answers carefully and try it out before I'm sure I understand it all. It does look relatively simple.

(This writeup of git rebase --onto is helpful too, but again I need to digest it. Probably I need to redo this whole procedure the right way the next time this comes up, starting from a properly constructed repo.)
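As far as I currently understand the rebase --onto approach, it would go roughly like this (using the made-up 'interesting' remote and 'trying-changes' branch from before; I haven't verified this end to end):

# remember the old upstream tip that my changes are based on
oldtip=$(git rev-parse interesting/master)
# pick up the rebased upstream
git fetch interesting
# transplant only my commits (those after $oldtip) onto the new tip
git rebase --onto interesting/master $oldtip trying-changes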

Sidebar: How I test this stuff out

I always, always test anything like this on a copy of my repo, not on the main version. I make these copies with rsync, because making repo copies with 'git clone' changes how git sees things like upstreams (for obvious reasons). I suspect that this is the correct approach and there is no magic 'git clone' option that does what I want here.
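(Concretely the copy is just a plain rsync, something like:

rsync -a repo/ repo-scratch/

with made-up directory names; the trailing slashes matter to rsync, and -a carries over .git and everything else verbatim.)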

programming/GitTreeUncertainShuffles written at 22:09:24

In search of modest scale structured syslog analysis

Every general issue should start from a motivating use case, so here's ours: we want to be able to find users who haven't logged in with SSH or used IMAP in the past N months (this should perhaps include Samba authentication as well). As a university department that deals with graduate students, postdocs, visiting researchers, and various other sorts of ongoing or sporadic collaborations, we have a user population that's essentially impossible to keep track of centrally (and sometimes at all). So we want to be able to tell people things like 'this account that you sponsor doesn't seem to have been used for a year'.

As far as I can tell from Internet searches and so on, there's an assorted bunch of log aggregation, analysis, and querying tools out there. Logstash is the big one that many people have heard of, but there's also Graylog and fluentd and no doubt others. In theory any of these ought to be the solution to our issue. In practice, there seem to be two main drawbacks:

  • They all seem to be designed for large to very large environments. We have what I tend to call a midsized environment; what's relevant here is that we only have on the order of 20 to 30 servers. Systems designed for large environments seem to be both complicated and heavyweight, requiring things like JVMs and multiple servers and so on.

  • None of them appear to come with or have a comprehensive set of parsers to turn syslog messages from various common programs into the sort of structured information that these systems seem designed to work with. You can write your own parsers (usually with regular expressions), but doing that well requires a relatively deep knowledge of just what messages the programs can produce.

(In general all of these systems feel as if they're primarily focused on application level logging of structured information, where you have your website or backend processing system or whatever emit structured messages into the logging backend. Or perhaps I don't understand how you're supposed to use these systems.)

We can undoubtedly make these systems solve our problem. We can set up the required collection of servers and services and get them talking to each other (and our central syslog server), and we can write a bunch of grok patterns to crack apart sshd and Dovecot and Samba messages. But all of this feels as if we are using the back of a large and very sharp axe to hammer in a small nail. It works, awkwardly, but it's probably not the right way.
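To illustrate the sort of parsing involved, here's a sketch of pulling successful SSH logins out of a traditional syslog file with awk. It assumes the common OpenSSH 'Accepted <method> for <user> from <address> ...' message format and a Linux-ish /var/log/auth.log, both of which vary across systems (which is exactly the problem):

# print the user and source address of each successful sshd login
awk '/sshd\[[0-9]+\]: Accepted / {
    for (i = 1; i <= NF; i++)
        if ($i == "for") print $(i+1), $(i+3)
}' /var/log/auth.log

Multiply this by every message type from every daemon we care about and you can see why we'd rather not write it all ourselves.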

It certainly feels as if structured capturing and analysis of syslog messages from common programs like sshd, Dovecot, and so on in a moderately sized environment ought to be a well-solved problem. We can't be the first people to want to do this, so this particular wheel must have been reinvented repeatedly by now. But I can't find even a collection of syslog parsing patterns for common Unix daemons, much less a full system for this.

(If people know of systems or resources for doing this, we would of course be quite interested. There are some SaaS services that do log analysis for you, but as a university department we're not in a position to pay for this (staff time is free, as always).)

sysadmin/ModestScaleSyslogAnalysis written at 01:30:15

