Wandering Thoughts archives

2019-11-17: It's good to make sure you have notifications of things
The operational differences between notifications and logs
2019-11-10: Putting a footer on automated email that says what generated it
The problems with piping curl to a shell are system management ones
2019-11-04: Many of our 'worklog' messages currently assume a lot of context
2019-10-11: A YAML syntax surprise and trick in Prometheus Alertmanager configuration
2019-10-08: How we implement reboot notifications when our machines reboot in Prometheus
2019-10-07: Why we generate alert notifications about our machines having rebooted
2019-10-06: Automating our 'bookable' compute servers with SLURM has created generic 'cattle' machines
2019-10-04: Vim, its defaults, and the problem this presents sysadmins
2019-10-02: It's useful to record changes that you tried and failed to do
2019-09-30: Using alerts as tests that guard against future errors
2019-09-27: A file permissions and general deployment annoyance with Certbot
2019-09-17: Finding metrics that are missing labels in Prometheus (for alert metrics)
2019-09-13: Bidirectional NAT and split horizon DNS in our networking setup
2019-09-04: Using Wireshark's Statistics menu to get per-host traffic volume
2019-09-02: Another way to do easy configuration for lots of Prometheus Blackbox checks
2019-08-26: A lesson of (alert) scale we learned from a power failure
2019-08-09: Turning off DNSSEC in my Unbound instances
2019-08-01: How not to set up your DNS (part 24)
2019-07-21: Why we're going to be using Certbot as our new Let's Encrypt client
2019-07-18: Switching Let's Encrypt clients is currently quite disruptive
2019-07-14: We're going to be separating our redundant resolving DNS servers
2019-07-13: Our switches can wind up in weird states after a power failure
2019-07-12: Reflections on almost entirely stopping using my (work) Yubikey
2019-07-05: My plan for two-stage usage of Certbot when installing web server hosts
2019-06-28: Using Prometheus's statsd exporter to let scripts make metrics updates
2019-06-21: One of the things a metrics system does is handle state for you
2019-06-19: A Let's Encrypt client feature I always want for easy standard deployment
2019-06-17: Sometimes, the problem is in a system's BIOS
2019-06-14: Intel's MDS issues have now made some old servers almost completely useless to us
2019-06-10: Keeping your past checklists doesn't help unless you can find them again
2019-06-08: Our current approach for updating things like build instructions
2019-06-02: Exploring the start time of Prometheus alerts via ALERTS_FOR_STATE
2019-05-24: The problem of paying too much attention to our dashboards
2019-05-20: Understanding how to pull in labels from other metrics in Prometheus
2019-05-17: My new favorite tool for looking at TLS things is certigo
2019-05-12: What we'll want in a new Let's Encrypt client
2019-05-03: Some implications of using offset instead of delta() in Prometheus
2019-04-28: A gotcha with stale metrics and *_over_time() in Prometheus
2019-04-26: Brief notes on making Prometheus instant queries with curl
2019-04-21: My view on upgrading Prometheus (and Grafana) on an ongoing basis
2019-04-18: A pattern for dealing with missing metrics in Prometheus in simple cases
2019-04-13: Remembering that Prometheus expressions act as filters
2019-04-07: Why selecting times is still useful even for dashboards that are about right now
2019-04-05: It's always DNS (a story of our circular dependency)
2019-03-31: Our likely ZFS fileserver upgrade plans (as of March 2019)
2019-03-29: Our current approach for significantly upgrading or modifying servers
2019-03-24: Prometheus's delta() function can be inferior to subtraction with offset
2019-03-22: Sometimes the simplest version of a graph is a text table
2019-03-18: Prometheus subqueries pick time points in a surprising way
2019-03-12: An easy optimization for restricted multi-metric queries in Prometheus
2019-03-11: Testing Prometheus alert conditions through subqueries
2019-03-10: What the default query step is for Prometheus subqueries
Turning something into a script encourages improving it
2019-03-06: Using Prometheus subqueries to look for spikes in rates
2019-03-01: What you get when you do a DNS A record lookup for a CNAME'd name
2019-02-27: Using Prometheus subqueries to do calculations over time ranges
2019-02-25: ntpdate has a surprising restriction on what it will sync to
2019-02-17: Some notes on heatmaps and histograms in Prometheus and Grafana
2019-02-08: Making more use of keyboard control over window position and size
2019-01-30: How having a metrics system centralized information and got me to check it
2019-01-23: A little surprise with Prometheus scrape intervals, timeouts, and alerts
2019-01-09: On right and wrong ways to harvest system-level performance stats
2019-01-04: Planning ahead in documentation worked out for us
