
watchcat timing fix openwrt

i recently got a small change merged in the openwrt package tree.

at first i thought it was a very small thing. basically a timer detail in watchcat.

but while working on it, and especially during review, i realised the behaviour was more interesting than i first thought.

the pr is here:

https://github.com/openwrt/packages/pull/29326

the basic idea

watchcat is an openwrt package that keeps checking whether the network is reachable by pinging one or more hosts.

if the pings keep failing for long enough, it can take some recovery action.

depending on the config, it can:

  • restart an interface
  • restart networking
  • run a script
  • reboot the device

the part i was looking at was mostly around these two modes:

  • restart_iface
  • run_script

in my head, watchdog behaviour was always simple:

thing goes down
watchdog notices it
wait for failure period
run recovery action
keep checking

simple enough.
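that mental model can be sketched as a tiny simulation. this is not watchcat's actual code, just a minimal sketch. the names first_action_time, ping_ok and check_interval are made up for illustration, and it assumes one ping check every check_interval seconds starting at t=0.

```python
def first_action_time(ping_ok, failure_period, check_interval):
    """Return the simulated time at which the recovery action first fires.

    ping_ok is a list of booleans, one per check (hypothetical input).
    Returns None if the failure window never fills up.
    """
    last_ok = 0  # start of the current failure window
    for i, ok in enumerate(ping_ok):
        now = i * check_interval
        if ok:
            # a successful ping restarts the failure window
            last_ok = now
        elif now - last_ok >= failure_period:
            # pings have been failing for a full failure period
            return now
    return None


# outage from t=0, checks every 10 seconds, 60 second failure period
print(first_action_time([False] * 7, 60, 10))  # 60
```

a single good ping in the middle pushes the action out, because the window starts again from that ping.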

but then i started thinking about one detail:

what happens to the failure timer after the recovery action runs?

the timing problem

say the config is something like this:

failure_period = 60 seconds
recovery action takes 15 seconds

then the timeline looks like this:

t=0    outage starts
t=60   recovery action starts
t=75   recovery action finishes

now the question is, what is the baseline for the next failure window?

should the next recovery action happen around t=120?

t=0    outage starts
t=60   restart #1 starts
t=75   restart #1 finishes
t=120  restart #2

or should it happen around t=135, because the first restart finished at t=75, and only after that we start a fresh 60 second failure window?

t=0    outage starts
t=60   restart #1 starts
t=75   restart #1 finishes
t=135  restart #2 would be the earliest next retry
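the two timelines differ only in which moment is used as the baseline. a small sketch of that arithmetic, using the numbers from the example above (next_retry is a made-up helper, not something from watchcat):

```python
def next_retry(baseline, failure_period):
    # the next recovery action can fire once a full failure window
    # has passed since the chosen baseline
    return baseline + failure_period


failure_period = 60
action_start = 60
action_finish = action_start + 15  # the action takes 15 seconds

print(next_retry(action_start, failure_period))   # 120
print(next_retry(action_finish, failure_period))  # 135
```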

my first thought was the second one.

after the recovery action finishes, reset the timer and give the interface or script some room. otherwise watchcat can keep firing recovery actions too close together.

for a normal interface, that feels noisy. it may not help anything, and it can make the logs harder to understand.

the part i almost missed

the review made this more interesting.

daniel f. dickinson, who maintains watchcat now, pointed out that the old behaviour is useful for some setups and should not just be changed.

that was the important bit.

imagine a wireguard or openvpn interface.

the upstream internet may come back, but the tunnel itself may still be stuck until it is kicked again. if watchcat is pinging through that tunnel, it does not really know that the underlying internet recovered.

it only knows one thing:

the monitored path is still failing

so in that case, continuing to retry during the outage can be useful. maybe even necessary.
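a rough way to picture the tunnel case. this sketch assumes a restart can only revive the tunnel if it happens after the upstream link is back. the upstream_back time of 100 seconds is invented for the example, and tunnel_recovers_at is a hypothetical helper.

```python
def tunnel_recovers_at(restart_times, upstream_back):
    """Return the first restart that can actually revive the tunnel.

    Assumption for this sketch: a restart while upstream is still
    down achieves nothing, so only restarts at or after
    upstream_back count.
    """
    for t in sorted(restart_times):
        if t >= upstream_back:
            return t
    return None


# one restart at t=60, upstream returns at t=100: tunnel stays stuck
print(tunnel_recovers_at([60], upstream_back=100))       # None

# a second restart at t=120 is the one that brings it back
print(tunnel_recovers_at([60, 120], upstream_back=100))  # 120
```

with only one recovery attempt the tunnel never comes back in this model, which is why retrying during the outage matters.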

so this was not really:

old behaviour bad
new behaviour good

it was more like:

there are two valid timing models

and that changed the patch.

what changed

instead of changing the default behaviour, the merged pr added an opt-in option:

option reset_failure_timer '1'

when this option is enabled, watchcat starts a fresh failure window after the recovery action finishes.

so the default behaviour stays the same. that keeps the old retry model useful for vpn/tunnel style setups.

and the new behaviour is there for setups where repeated recovery actions too close to each other are not useful.

the merged commit was:

945029322 watchcat: add optional failure timer reset

the pr also cleaned up some log wording, added timing notes in TIMINGS.md, bumped PKG_RELEASE, and added a small CI version test override.

what this looks like

default behaviour:

t=0    outage starts
t=60   restart #1 starts
t=75   restart #1 finishes
t=120  restart #2

with reset_failure_timer=1:

t=0    outage starts
t=60   restart #1 starts
t=75   restart #1 finishes
t=135  restart #2 would be the earliest next retry

so the option decides whether the time spent inside the recovery action counts toward the next failure window or not.
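to see how the two models drift apart over a longer outage, here is a small sketch that lists the first few restart times. it is a simplified model built from the timelines above, not watchcat's implementation. restart_starts and its parameters are made-up names, and it assumes the outage starts at t=0 and never clears.

```python
def restart_starts(failure_period, action_duration, reset_failure_timer, count):
    """Start times of the first `count` recovery actions in one long outage."""
    starts = []
    baseline = 0  # outage starts at t=0
    for _ in range(count):
        start = baseline + failure_period
        starts.append(start)
        if reset_failure_timer:
            # fresh failure window begins when the action finishes
            baseline = start + action_duration
        else:
            # action time counts toward the next failure window
            baseline = start
    return starts


print(restart_starts(60, 15, False, 3))  # [60, 120, 180]
print(restart_starts(60, 15, True, 3))   # [60, 135, 210]
```

in this model the default keeps retries one failure period apart, and with the reset enabled each gap grows by the time the action itself takes.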