When I first started meshing, I used olsrd. There was a problem on our platform that caused olsrd packets to stop going out after the board had been up about 5 minutes. The outage lasted about 40 seconds, then came back. It turns out that there is a problem with the times() call that will return -1 for the 4096 jiffies before wrap around. This caused no apparent change in time during that duration, so no packets were sent. Also, for some reason on our platform, times() starts out about 5 minutes before wrap around.
I found a work around elsewhere (and I leave credit in the patch) for some other software that was failing every 400 something days. I applied this to olsrd, and the problem went away.
When I switched to batman, I saw that times() is used in it as well, so I made the patch attached. So, if anyone sees packets fail to come from batman for 40 seconds, they might need this.
On Apr 30, 2009, at 5:43 PM, Nathan Wharton wrote:
When I first started meshing, I used olsrd. There was a problem on our platform that caused olsrd packets to stop going out after the board had been up about 5 minutes. The outage lasted about 40 seconds, then came back. It turns out that there is a problem with the times() call that will return -1 for the 4096 jiffies before wrap
FYI / maybe it helps BATMAN:
that has been fixed in olsrd in a more portable way in the meantime.
around. This caused no apparent change in time during that duration, so no packets were sent. Also, for some reason on our platform, times() starts out about 5 minutes before wrap around.
I found a work around elsewhere (and I leave credit in the patch) for some other software that was failing every 400 something days. I applied this to olsrd, and the problem went away.
yes, we had some looong discussions about this on the lists. It is really not easy to use times() in a OS/HW independant way. One way is to check if there is an overrun in times(). That is how me and Sven-Ola fixed it. Henning provided a more general/cleaner solution which can even survive if the clock goes backwards (sometimes happens on Xen machines! you wont believe it) or jumps badly. This has been tested with turning the time of the clock backwards etc.
Please also feel free to compare against the olsrd code in case that helps / in case you are interested.
the man page for times() says that it is deprecated.
One more thing for Nathan:
your comment in the patch (+ /* + * times(2) really returns an unsigned value ... + * + * We don't check to see if we got back the error value (-1), because + * the only possibility for an error would be if the address of + * dummy_tms_struct was invalid. Since it's a + * compiler-generated address, we assume that errors are impossible. + * And, unfortunately, it is quite possible for the correct return + * from times(2) to be exactly (clock_t)-1. Sigh... + * + */
)
is only partly correct. It really depends on the OS that you compile on. Some OSes actually treat the return value as signed, some as unsigned. You can compare the different manpages for the different OSes. But -1 is the problem after all, agreed.
*Sigh* when does Marek learn how to code proper C ;-)) ^^^ --- just teasing, don't take it seriously.
When I switched to batman, I saw that times() is used in it as well, so I made the patch attached. So, if anyone sees packets fail to come from batman for 40 seconds, they might need this. <001-times.patch>_______________________________________________ B.A.T.M.A.N mailing list B.A.T.M.A.N@open-mesh.net https://lists.open-mesh.net/mm/listinfo/b.a.t.m.a.n
Hi,
When I switched to batman, I saw that times() is used in it as well, so I made the patch attached. So, if anyone sees packets fail to come from batman for 40 seconds, they might need this.
excellent finding! I immediately applied your patch.
For the curious reader: When you develop a daemon that should do certain things in intervals (e.g. sending packets, removing dead nodes from the routing table, etc) you will need a reference clock to know when this interval is over. Using the system time as reference is the obvious solution which theoretically offers what you need.
As often, reality is much harder: time synchronisation approaches (e.g. NTP), manually changed system time, winter / summer time switches, timezones, etc might encourage the daemon to improvised reactions (e.g. sending the messages of the last 3 years because ntp just finished the time sync).
A more stable reference clock can be found by using the times() system call. It returns "the number of clock ticks that have elapsed since an arbitrary point in the past" (often the uptime). The compelling advantage: it always increments although it overflows from time to time.
Batman uses the times() function which has a bug that this patch deals with. I had a short look in the olsr code (0.5.6-r4) to see how this bug is being handled there. main.c line 896 reveals a function called olsr_times()
--- SNIP --> clock_t olsr_times(void) { struct tms tms_buf; return times(&tms_buf); } <-- SNAP ---
which is called from vairous places that don't deal with this bug:
scheduler.c line 117 --- SNIP --> void olsr_scheduler(void) { [..] /* Main scheduler loop */ for (;;) {
/* * Update the global timestamp. We are using a non-wallclock timer here * to avoid any undesired side effects if the system clock changes. */ now_times = olsr_times();
[..]
/* We are done, sleep until the next scheduling interval. */ olsr_scheduler_sleep(olsr_times() - now_times); <-- SNAP ---
Regards, Marek
b.a.t.m.a.n@lists.open-mesh.org