On Sun, Aug 05, 2012 at 02:34:15AM -0300, Gui Iribarren wrote:
On Mon, Jul 23, 2012 at 2:28 PM, Antonio Quartulli ordex@autistici.org wrote:
On Sun, Jul 22, 2012 at 08:20:21AM -0300, Guido Iribarren wrote:
On Sun, Jul 22, 2012 at 7:57 AM, Guido Iribarren guidoiribarren@buenosaireslibre.org wrote:
This time it resolved itself after a short time (a minute), but the symptoms were the same, so I was able to catch some logs: http://pastebin.com/MEENj94i
Sadly, I wasn't fast enough to get a live log from the node involved in the inconsistency, as you suggested, so the report might be pretty useless.
On the node I ran the previous report from (colmena-casa), which was rebooted recently, L3 pings to the whole network had the same issue (no replies for a minute or so), so I had the chance to "recreate" the situation several times. It turns out that a "batctl ll tt ; batctl l" on the nodes mentioned in the inconsistencies gave no output at all, so the previous pastebin report is in fact complete :P It looks like the inconsistency is being resolved locally between neighbours, without the need to contact the far end of the network (which is consistent with what's described in the wiki).
Exactly! If the neighbour has the needed information, the node can get its answer directly without bothering the real destination ;)
In any case, AFAIR previous occurrences of the bug didn't resolve by themselves (in a reasonable amount of time), so what I'm looking at now might be perfectly normal behaviour? (Do TT tables take some time to propagate?)
Well, the log you posted is perfectly correct. You missed some OGMs, therefore the node is asking for the update that it missed.
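To make the exchange above concrete, here is a toy Python model of a node that notices a gap in the table version number carried by OGMs and recovers the table from a neighbour. It is only a sketch under assumptions: the `Node` class, the `on_ogm` method and the returned strings are invented for illustration and are not batman-adv code, and version wrap-around and timeouts are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy view of what one node knows about a single remote originator."""
    name: str
    ttvn: int = 0                               # last table version seen
    clients: set = field(default_factory=set)   # clients announced so far

    def on_ogm(self, ogm_ttvn, changes, neighbours):
        """Handle one OGM; `changes` are the clients added in `ogm_ttvn`."""
        if ogm_ttvn <= self.ttvn:
            return "up to date"
        if ogm_ttvn == self.ttvn + 1:
            # No OGM was missed: apply the incremental change directly.
            self.clients |= changes
            self.ttvn = ogm_ttvn
            return "applied diff"
        # Gap: at least one OGM (and its diff) was missed, so ask for
        # the table. Any neighbour already at that version can answer.
        for nb in neighbours:
            if nb.ttvn == ogm_ttvn:
                self.clients = set(nb.clients)
                self.ttvn = ogm_ttvn
                return f"full table from neighbour {nb.name}"
        return "request forwarded to originator"


# Neighbour B hears every OGM; node A misses version 2.
neighbour = Node("B")
lagging = Node("A")
for version, client in [(1, "aa:aa"), (2, "bb:bb"), (3, "cc:cc")]:
    neighbour.on_ogm(version, {client}, [])
lagging.on_ogm(1, {"aa:aa"}, [neighbour])
result = lagging.on_ogm(3, {"cc:cc"}, [neighbour])
print(result)
```

In this model node A never has to reach the far end of the network, because B already holds version 3 of the table.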
It would be interesting to run "batctl ll tt; batctl l" all the time on the node that usually experiences the "problem". The log should not be very big unless the bug happens.
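One possible way to leave that logging running is a small Python wrapper that timestamps every line and appends it to a file. This is a rough sketch, not a tested tool: it assumes Python and batctl are available on the node, the only batctl invocations used are the two quoted above, and the helper names (`stamp`, `watch`) and the log path are made up for the example.

```python
import subprocess
import time


def stamp(lines, clock=time.time):
    """Prefix each log line with the current time in whole seconds."""
    for line in lines:
        yield f"{clock():.0f} {line.rstrip()}"


def watch(path="/tmp/batman-tt.log"):  # hypothetical log location
    """Enable the TT log level, then append stamped log lines to `path`."""
    subprocess.run(["batctl", "ll", "tt"], check=True)
    proc = subprocess.Popen(["batctl", "l"], stdout=subprocess.PIPE, text=True)
    with open(path, "a") as out:
        for line in stamp(proc.stdout):
            out.write(line + "\n")
            out.flush()  # keep the file current in case of a reboot
```

Calling `watch()` on the affected node would keep appending until the process is stopped; the timestamps make it easier to line the log up with the moment the node became unreachable.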
I admit I haven't left this running as instructed, but on the other hand, so far I haven't come across the original bug again, and a few days ago I asked Nico Echaniz, who confirmed that he's no longer suffering from it as before. He does run into [a few moments | a few minutes] of "crazy nodes" (at first sight) from time to time, but it resolves by itself quickly[*], which indicates normal behaviour: missed OGMs and, consequently, a delay in TT table updates, as you explained.
[*] "quickly" means under 15 minutes at most. Previously, the problem would never resolve by itself; nodes stayed L3-unreachable for hours or days until a manual reboot was done.
In conclusion, so far so good; I think we can close this as fixed for lack of evidence to the contrary, heh. I hope Gioacchino managed to recompile the ninux images and is seeing the same stability as we are :)
Gui
Hello Guido, and thank you for reporting back your results :) However, even if the "behaviour" is good (the table gets recovered and everything starts working again), it is a bit strange that it takes 15 minutes to do so.
If you happen to see the bug again, it would be interesting to get the log of the "non-working" node and see why it is taking so long.
Thank you very much!
Cheers,