2016-09-30
Some git repository manipulations that I don't know how to do well
For a while now, I've been both tracking upstream git repositories and carrying and rebasing local changes on top of some of them in a straightforward way. The rebasing has been pretty easy and a clear win. But recently I've run into a number of more complicated cases where I'm not sure that I'm using git in the best way. So here is a collection of problems that I'm semi-solving.
I keep local tracking copies of a few upstream repos that are rebased
periodically, such as this one.
I have no local changes in such tracking repositories (i.e. my local
master is always the same as origin/master). When the upstream
rebases, my plain 'git pull' will fail, either telling me that it
can't be fast-forwarded or demanding that I merge things. In the
past I've just deleted the entire repo and re-cloned as the simple
way out. Recently I've constructed the manual fix:
git pull --ff-only        [fails, but it's fetched stuff]
git checkout -f -B master origin/master
It would be nice to have the repo set up so that a plain 'git pull'
would do this, ideally only if it's safe. I could script this but
there are only some repos that I want to do this for; for others,
either I have local changes or this should never happen.

(The git pull manpage has some interesting wording that makes it
sound like asking for a rebase here will maybe do the right thing.
But I just found that and contriving a test case is not trivial.
Or maybe I want 'git pull -f'. And now that I've done some web
searches, apparently I want 'git reset --hard origin/master'.
Git is like Perl; sometimes it has so many different paths to the
same result.)
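If I do end up scripting it, a minimal sketch of the 'only if it's
safe' version might look like this, assuming the upstream remote is
the usual origin and that 'safe' means my master has no commits of
its own:

git fetch origin
# only reset if master has no commits that origin/master lacks
if [ -z "$(git rev-list origin/master..master)" ]; then
    git reset --hard origin/master
else
    echo "local commits on master; not resetting" 1>&2
fi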
Next, I'm tracking the repo for my favorite RAW development software and have some little local fixes added on top. Normally I build from the latest git tip, but that's sort of in flux right now so I want to switch to the release branch but still add my changes on top. I'm doing this like so:
git checkout darktable-2.0.x
git cherry-pick origin/master..master
I think that using 'git cherry-pick' here instead of some form of
'git rebase' is probably the correct approach and this is the one
case where I'm doing things right.
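(For contrast, the rebase version would be something like the sketch
below. If I have this right, the important difference is that it
rewrites master itself to sit on top of the release branch, instead
of leaving master tracking origin/master and just applying my fixes
to a checkout of the release branch.)

# replays origin/master..master onto darktable-2.0.x and moves
# master to the result, which isn't what I want here
git rebase --onto darktable-2.0.x origin/master master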
I'm tracking the main repository for my shell and applying my changes
on top of it. However, there is a repository of interesting changes
that I want to try out; of course I still want my local changes on
top of this. When I did this I think what I did was
'git pull --rebase /the/other/repo', but I believe that probably
wasn't the right approach. I suspect that what I really wanted to do
was add the second repo as an alternate upstream, switch to it, and
either cherry-pick or rebase my own changes on top.
Except, well, looking back at my notes about working with Github PRs it strikes me that this is basically the same situation (except this repo isn't an explicit PR). I should probably follow the same process instead of hacking my way around the way I did this time.
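Roughly, I think the proper version is something like the following
sketch; the 'interesting' remote name and the scratch branch are my
own inventions here, not anything the repo actually uses.

git remote add interesting /the/other/repo
git fetch interesting
# a scratch branch so my real master stays tracking origin/master
git checkout -B trying-it interesting/master
# then my local changes on top
git cherry-pick origin/master..master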
Finally, we combine the two situations: I'm building on top of the repo of interesting changes and it rebases itself. Now I want to replace the old version with the new version but reapply my changes on top. I'm not going to try to write down the process I used this time, because I'm convinced it's not optimal; I basically combined the reset origin plus cherry-pick process, using explicit commit hashes for the cherry-picking and recording them beforehand. Based on Aristotle Pagaltzis's comment on my Github PR entry, I think I want something from this Stackoverflow Q&A but I need to read the answers carefully and try it out before I'm sure I understand it all. It does look relatively simple.
(This writeup of git rebase --onto is helpful too, but again I need
to digest it. Probably I need to redo this whole procedure the right
way the next time this comes up, starting from a properly constructed
repo.)
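If I'm reading that Q&A correctly, the core of it for my case is
roughly the following sketch, where 'interesting' is a remote for the
rebased repo and OLDTIP is the commit my changes used to sit on top
of (which I'd have to record before fetching the rebased version):

git fetch interesting
# replay just my own commits (OLDTIP..master) onto the new upstream tip
git rebase --onto interesting/master OLDTIP master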
Sidebar: How I test this stuff out
I always, always test anything like this on a copy of my repo, not
on the main version. I make these copies with rsync, because making
repo copies with 'git clone' changes how git sees things like
upstreams (for obvious reasons). I suspect that this is the correct
approach and there is no magic 'git clone' option that does what I
want here.
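(The copy itself is nothing fancy; something like the following,
with made-up paths. The trailing slashes matter to rsync, and a
straight copy preserves .git/config and thus all the remotes exactly
as they are in the real repo.)

rsync -a ~/src/some-repo/ /tmp/some-repo-test/
cd /tmp/some-repo-test && git status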
In search of modest scale structured syslog analysis
Every general issue should start from a motivating usage case, so here's ours: we want to be able to find users who haven't logged in with SSH or used IMAP in the past N months (this perhaps should include Samba authentication as well). As a university department that deals with graduate students, postdocs, visiting researchers, and various other sorts of ongoing or sporadic collaborations, we have a user population that's essentially impossible to keep track of centrally (and sometimes at all). So we want to be able to tell people things like 'this account that you sponsor doesn't seem to have been used for a year'.
As far as I can tell from Internet searches and so on, there is an assorted bunch of log aggregation, analysis, and querying tools. Logstash is the big one that many people have heard of, but then there's Graylog and fluentd and no doubt others. In theory any of these ought to be the solution to our issue. In practice, there seem to be two main drawbacks:
- They all seem to be designed for large to very large environments.
We have what I tend to call a midsized environment;
what's relevant here is that we only have on the order of 20 to
30 servers. Systems designed for large environments seem to be
both complicated and heavyweight, requiring things like JVMs and
multiple servers and so on.
- None of them appear to come with or have a comprehensive set of parsers to turn syslog messages from various common programs into the sort of structured information that these systems seem designed to work with. You can write your own parsers (usually with regular expressions), but doing that well requires a relatively deep knowledge of just what messages the programs can produce.
(In general all of these systems feel as if they're primarily focused on application level logging of structured information, where you have your website or backend processing system or whatever emit structured messages into the logging backend. Or perhaps I don't understand how you're supposed to use these systems.)
We can undoubtedly make these systems solve our problem. We can set
up the required collection of servers and services and get them
talking to each other (and our central syslog server),
and we can write a bunch of grok patterns
to crack apart sshd
and Dovecot and Samba messages. But all of
this feels as if we are using the back of a large and very sharp
axe to hammer in a small nail. It works, awkwardly, but it's probably
not the right way.
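(To illustrate the sort of pattern-writing involved, here's a rough
sketch of pulling usernames out of sshd 'Accepted' lines on our
central syslog server. The log file path is invented, and a real
version would also have to cope with Dovecot, Samba, and all of
sshd's other message forms.)

# lines look like: sshd[1234]: Accepted publickey for someuser from 1.2.3.4 port 5678 ssh2
grep -h 'sshd\[[0-9]*\]: Accepted ' /var/log/authlog |
  awk '{ for (i = 1; i < NF; i++) if ($i == "for") { print $(i+1); break } }' |
  sort -u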
It certainly feels as if structured capturing and analysis of syslog
messages from common programs like sshd, Dovecot, and so on in a
moderate sized environment ought to be a well solved problem. We
can't be the first people to want to do this, so this particular
wheel must have been reinvented repeatedly by now. But I can't find
even a collection of syslog parsing patterns for common Unix daemons,
much less a full system for this.
(If people know of systems or resources for doing this, we would of course be quite interested. There are some SaaS services that do log analysis for you, but as a university department we're not in a position to pay for this (staff time is free, as always).)