watchcat timing fix openwrt
i recently got a small change merged in the openwrt package tree.
at first i thought it was a very small thing, basically a timer detail in watchcat.
but while working on it, and especially during review, i realised the behaviour was more interesting than i first thought.
the pr is here:
https://github.com/openwrt/packages/pull/29326
the basic idea
watchcat is an openwrt package that can keep checking if something is
reachable by pinging one or more hosts.
if the pings keep failing for long enough, it can take some recovery action.
depending on the config, it can:
- restart an interface
- restart networking
- run a script
- reboot the device
the part i was looking at was mostly around these two modes:
restart_iface
run_script
in my head, watchdog behaviour was always simple:
thing goes down
watchdog notices it
wait for failure period
run recovery action
keep checking
simple enough.
but then i started thinking about one detail:
what happens to the failure timer after the recovery action runs?
the timing problem
say the config is something like this:
failure_period = 60 seconds
recovery action takes 15 seconds
then the timeline looks like this:
t=0 outage starts
t=60 recovery action starts
t=75 recovery action finishes
now the question is, what is the next baseline?
should the next recovery action happen around t=120?
t=0 outage starts
t=60 restart #1 starts
t=75 restart #1 finishes
t=120 restart #2
or should it happen around t=135, because the first restart finished at
t=75, and only after that we start a fresh 60 second failure window?
t=0 outage starts
t=60 restart #1 starts
t=75 restart #1 finishes
t=135 restart #2 would be the earliest next retry
my first thought was the second one.
after the recovery action finishes, reset the timer and give the interface or script some room. otherwise watchcat can keep firing recovery actions too close to each other.
for a normal interface, that feels noisy. it may not help anything, and it can make the logs harder to understand.
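the two candidate timelines above differ only in which moment becomes the baseline for the next failure window. here is a minimal sketch of that arithmetic in python. it is a toy model, not watchcat's actual shell code, and the function and parameter names are made up for illustration.

```python
def next_retry_start(action_start, action_duration, failure_period,
                     fresh_window_after_action):
    """Earliest start of the next recovery action if the outage continues.

    Toy model only: names are hypothetical, not taken from watchcat.
    """
    if fresh_window_after_action:
        # a fresh failure window starts when the action finishes
        baseline = action_start + action_duration
    else:
        # the window keeps running from the moment the action started
        baseline = action_start
    return baseline + failure_period


# numbers from the example: 60 second failure period, 15 second action
print(next_retry_start(60, 15, 60, fresh_window_after_action=False))  # 120
print(next_retry_start(60, 15, 60, fresh_window_after_action=True))   # 135
```

the gap between the two answers is exactly the duration of the recovery action, so the longer the action takes, the more the two models disagree.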
the part i almost missed
the review made this more interesting.
daniel f. dickinson, who maintains watchcat now, pointed out that the old behaviour is useful for some setups and should not just be changed.
that was the important bit.
imagine a wireguard or openvpn interface.
the upstream internet may come back, but the tunnel itself may still be stuck
until it is kicked again. if watchcat is pinging through that tunnel, it does
not really know that the underlying internet recovered.
it only knows one thing:
the monitored path is still failing
so in that case, continuing to retry during the outage can be useful. maybe even necessary.
so this was not really:
old behaviour bad
new behaviour good
it was more like:
there are two valid timing models
and that changed the patch.
what changed
instead of changing the default behaviour, the merged pr added an opt-in option:
option reset_failure_timer '1'
when this option is enabled, watchcat starts a fresh failure window after the
recovery action finishes.
so the default behaviour stays the same. that keeps the old retry model useful for vpn/tunnel style setups.
and the new behaviour is there for setups where repeated recovery actions too close to each other are not useful.
the merged commit was:
945029322 watchcat: add optional failure timer reset
the pr also cleaned up some log wording, added timing notes in TIMINGS.md,
bumped PKG_RELEASE, and added a small CI version test override.
what this looks like
default behaviour:
t=0 outage starts
t=60 restart #1 starts
t=75 restart #1 finishes
t=120 restart #2
with reset_failure_timer=1:
t=0 outage starts
t=60 restart #1 starts
t=75 restart #1 finishes
t=135 restart #2 would be the earliest next retry
so the option decides whether the time spent inside the recovery action counts toward the next failure window or not. by default it counts. with reset_failure_timer enabled it does not.
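to see how the two models drift apart over a longer outage, here is a small simulation. it is a simplified sketch that assumes the outage starts at t=0, the monitored path stays down the whole time, and every recovery action takes the same number of seconds. it does not reproduce watchcat's real loop. only the name reset_failure_timer comes from the pr, everything else is hypothetical.

```python
def restart_times(outage_end, failure_period=60, action_duration=15,
                  reset_failure_timer=False):
    """Return the start times of recovery actions during one long outage.

    Simplified model: the path is down from t=0 until `outage_end`.
    """
    starts = []
    window_start = 0  # baseline of the current failure window
    while window_start + failure_period < outage_end:
        action_start = window_start + failure_period
        starts.append(action_start)
        if reset_failure_timer:
            # fresh window begins only after the action has finished
            window_start = action_start + action_duration
        else:
            # time spent inside the action counts toward the next window
            window_start = action_start
    return starts


print(restart_times(240))                            # [60, 120, 180]
print(restart_times(240, reset_failure_timer=True))  # [60, 135, 210]
```

in this model the default spacing between restarts is the failure period, while with the option enabled it is the failure period plus the action duration.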