Hi,
I'm currently trying out batmand-experimental rev. 972.
I have encountered some strange behaviour. I'm running two WRT54 routers (GL and GS). Routing is working and there is enough memory. I have set up a cron job to call "batmand -c -d [2,7,8,9]" every minute to update the status files on the ramdisk. The web interface then reads the status files, which reduces the CPU load.
The WRT54GS mostly works, but the WRT54GL hangs after a while, as described below:
The call to "batmand -c -d [2,7,8,9]" blocks batmand completely. batmand no longer does any routing or OGM processing. As a result, the router leaves the network. I can still call "batmand -c -r 3" and verify with "batmand -c" that the options were set, but OGMs are not processed. Any call to access the debug information blocks.
After "killall batmand" and a restart, "batmand -c -d x" can be called several times until batmand hangs again. The process list still shows this "batmand -c -d x" process.
I have compiled batmand for whiterussian_rc6 with the following options (the email server has a problem with the assignment character, so I have removed it in this email):
CFLAGS -Wall -Os
LDFLAGS -lpthread
CFLAGS_MIPS -Wall -Os -DREVISION_VERSION $(REVISION_VERSION)
LDFLAGS_MIPS -lpthread
I had to remove the -pg option because the build failed with it. Also, in whiterussian_rc the CFLAGS_MIPS/LDFLAGS_MIPS variables are not used (I think).
Any ideas?
/Stephan
Hi -
Routing is working and there is enough memory. I have set up a cron job to call "batmand -c -d [2,7,8,9]" every minute to update the status files on the ramdisk. The web interface then reads the status files, which reduces the CPU load.
just a stupid question to verify things - you have not forgotten to add -b for batch mode to the command?
cu elektra
Quoting elektra onelektra@gmx.net:
Hi -
Routing is working and there is enough memory. I have set up a cron job to call "batmand -c -d [2,7,8,9]" every minute to update the status files on the ramdisk. The web interface then reads the status files, which reduces the CPU load.
just a stupid question to verify things - you have not forgotten to add -b for batch mode to the command?
Yes, you are right. I just forgot this option in the previous post. The firmware uses the -b option. When I create the firmware I use the whiterussian kit 1.4.5. It generates the same firmware for different routers. So when I flash the WRT54GS and the WRT54GL with the same firmware, batmand hangs on the WRT54GL. There are still 2 MByte of RAM unused. Restarting batmand hardly changes the memory consumption. I use the "top" command to check this.
I have captured a log up to the point where batmand stops. Perhaps it helps you a little. http://www.ddmesh.de/batmand-hanglog.txt
/Stephan
Hi Stephan,
I have several WRT54GL here and I can execute something like "batmand -cbd8" as often as I want. It never hangs. Can you attach the cron file that is executed and that causes the problem?
ciao, axel
On Sunday, 3 February 2008, Freifunk Dresden wrote:
B.A.T.M.A.N mailing list B.A.T.M.A.N@open-mesh.net https://list.open-mesh.net/mm/listinfo/b.a.t.m.a.n
Hi Axel,
I have several WRT54GL here and I can execute something like "batmand -cbd8" as often as I want. It never hangs. Can you attach the cron file that is executed and that causes the problem?
The crontab contains an entry like this:
0-59/1 * * * * /etc/init.d/S53batmand check
This calls the following commands one after another:
batmand -cb -d2 >/tmp/batmand_gateway
batmand -cb -d7 >/tmp/batmand_gateway
batmand -cb -d8 >/tmp/batmand_gateway
batmand -cb -d9 >/tmp/batmand_gateway
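Since a hung "batmand -c -d x" client stays in the process list, a cron job that fires every minute could pile up blocked clients. A minimal sketch of a guard, assuming a plain POSIX shell without a "timeout" utility. run_with_timeout and the file names in the usage comment are hypothetical; they are not part of batmand or of the S53batmand script. It only cleans up the stuck client and does not unblock the daemon itself.

```shell
#!/bin/sh
# Hypothetical helper, not part of batmand: run a command and kill it
# if it is still running after LIMIT seconds.
# Usage: run_with_timeout LIMIT command [args...]
# Returns the command's exit status (non-zero if it had to be killed).
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: after LIMIT seconds, kill the command if it still exists.
    ( sleep "$limit"; kill -9 "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    dog_pid=$!
    status=0
    wait "$cmd_pid" 2>/dev/null || status=$?
    # The command has finished or was killed; stop the watchdog.
    kill "$dog_pid" 2>/dev/null || true
    wait "$dog_pid" 2>/dev/null || true
    return "$status"
}

# Possible use in the check script (file names are assumptions):
#   run_with_timeout 10 batmand -cb -d8 >/tmp/batmand_status.tmp \
#       && mv /tmp/batmand_status.tmp /tmp/batmand_status
```

Writing to a temporary file and renaming it afterwards means the web interface never reads a half-written status file.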
I also put these commands into a loop: while true; do batmand -cb -d2;....;done
I have also disabled this "logging" completely and only let batmand run to build up the network. I cannot say whether accessing the debug output makes batmand block sooner. I have also seen batmand block after a while when it is only running to build the network.
I have put a logfile on my webpage. http://www.ddmesh.de/batmand-hanglog.txt
In one of the previous threads someone had a problem with "batmand going crazy". I'm not sure I remember it correctly, but I think it had to do with a sequence number wrapping around. The logfile ends at the time batmand stops. At the end of this log you will find something like "prevRxSeqno: 0, currRxSeqno-prevRxSeqno 0,"; perhaps it is the same cause.
batmand is currently started with two interfaces, eth1 and tbb. eth1 is the wireless interface and tbb is a tun/tap device that is used by the VPN daemon tincd. tincd has been given invalid hostnames, so it never creates a connection. Perhaps batmand has a problem with this kind of "dead" interface. I have tried removing the tbb interface when starting batmand; batmand then ran for at least two days. But the "dead" interface may also have no influence on this problem. Currently batmand has been running for 10 hours with eth1 and tbb (dead interface).
I never have seen this problem with the WRT54GS, only with GL.
/Stephan
Hello,
On Tuesday, 12 February 2008, Freifunk Dresden wrote:
I have also disabled this "logging" completely and only let batmand run to build up the network. I cannot say whether accessing the debug output makes batmand block sooner. I have also seen batmand block after a while when it is only running to build the network.
Ok, then it does not strictly depend on the logging.
I have put a logfile on my webpage. http://www.ddmesh.de/batmand-hanglog.txt
In one of the previous threads someone had a problem with "batmand going crazy". I'm not sure I remember it correctly, but I think it had to do with a sequence number wrapping around. The logfile ends at the time batmand stops. At the end of this log you will find something like "prevRxSeqno: 0, currRxSeqno-prevRxSeqno 0,"; perhaps it is the same cause.
I checked the log file. The "prevRxSeqno: 0..." line is not a problem. The "0" comes from a bad debug statement. If you search your debug log you'll see many of these lines. The "going crazy..." thing was related to overlapping uptime - that's another story.
batmand is currently started with two interfaces, eth1 and tbb. eth1 is the wireless interface and tbb is a tun/tap device that is used by the VPN daemon tincd. tincd has been given invalid hostnames, so it never creates a connection. Perhaps batmand has a problem with this kind of "dead" interface. I have tried removing the tbb interface when starting batmand; batmand then ran for at least two days. But the "dead" interface may also have no influence on this problem. Currently batmand has been running for 10 hours with eth1 and tbb (dead interface).
Can you verify whether the problem also occurs if batmand is started without any tap devices? Can you check for other syslog messages that might be related to batmand stopping? What does logread say?
The strange thing is that the debug-level-4 output stops in the middle of an action. Can you also check for the number of batmand processes before and after the stopped batmand process?
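One way to make the before/after comparison repeatable is to count the batmand entries in the process list with a small filter. This is only a sketch: count_batmand is a hypothetical helper, and it assumes that "ps" lists all processes (as BusyBox ps does without arguments; other systems may need "ps ax").

```shell
#!/bin/sh
# Hypothetical helper: count batmand entries in "ps" output read from stdin.
# Matches "batmand" as a command name, with or without a leading path,
# so the header line and unrelated processes are not counted.
count_batmand() {
    grep -c -E '(^|[ /])batmand( |$)' || true
}

# Usage: run once while batmand works and once after it has stopped,
# then compare the two numbers:
#   ps | count_batmand
```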
Have you ever tried what happens if you connect the tap interface to a bridge and bind batmand to the bridge device instead?
Last but not least: have you observed (or explicitly not observed) this phenomenon also with previous revisions in the same scenario ?
I never have seen this problem with the WRT54GS, only with GL.
Is the batmand on the WRT54GS also bound to a tinc interface ?
ciao, axel
Hello,
Can you verify whether the problem also occurs if batmand is started without any tap devices?
My last 10-hour run (with the tap device) also hung. I have now been running it without the tap device for a few hours. I would like to let it run longer.
Can you check for other syslog messages that might be related to the stopping batmand? What does logread say ?
I have looked at that, but did not find any strange log entries.
The strange thing is that the debug-level-4 output stops in the middle of an action. Can you also check for the number of batmand processes before and after the stopped batmand process?
The number of tasks is the same. But I have seen that when the -d4 output stops and I keep this batmand running while accessing a different log level from another terminal, the socket connection messages still show up in the -d4 output.
I can also still call "batmand -c" to see the parameters and the current gateway settings, and I can change the gateway settings.
batmand seems to stop processing OGMs altogether.
Have you ever tried what happens if you connect the tap interface to a bridge and bind batmand to the bridge device instead?
I haven't tried it yet, but this also came to my mind. I will check this after finishing the "no-tap-dev-test".
Last but not least: have you observed (or explicitly not observed) this phenomenon also with previous revisions in the same scenario ?
I cannot say, because I introduced tinc and updated the batmand version at the same time.
I never have seen this problem with the WRT54GS, only with GL.
Is the batmand on the WRT54GS also bound to a tinc interface ?
Yes, the GL is running standalone with a stupid tinc setup, and the GS was also running standalone (no network cable) with the same parameters.
Perhaps it is more random and depends on the speed of the router at the moment the event occurs.
bye Stephan
Hi,
The strange thing is that the debug-level-4 output stops in the middle of an action. Can you also check for the number of batmand processes before and after the stopped batmand process?
The number of tasks is the same. But I have seen that when the -d4 output stops and I keep this batmand running while accessing a different log level from another terminal, the socket connection messages still show up in the -d4 output.
I can also still call "batmand -c" to see the parameters and the current gateway settings, and I can change the gateway settings.
batmand seems to stop processing OGMs altogether.
The messages you see are logged from another thread (not the thread which is doing the OGM processing). That's also the reason why some of the dynamically changeable parameters _seem_ to be processed. I guess that, for example, a "batmand -c -a 1.2.3.4/32" won't be processed completely. In this case a simultaneously running "batmand -cd3" _should_ report:
[ 162940] Unix socket: got connection
[ 162946] got request: 10
[ 162947] Unix socket: Requesting adding of HNA 1.2.3.4/32 - put this on todo list...
[ 162951] got request: 10
[ 162952] Unix client closed connection
...
[ 163157] found todo item, adding HNA 1.2.3.4/32 atype 1
I guess everything except the last line will be shown. The last line is generated from the OGM-processing thread which seems to be blocked.
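To check this without reading the whole log by eye, the captured "batmand -cd3" output could be fed through a small filter that compares the number of queued requests with the number of todo items that were picked up. This is only a sketch: check_todo is a hypothetical helper, and it relies solely on the two message texts quoted above ("put this on todo list" and "found todo item").

```shell
#!/bin/sh
# Hypothetical helper: reads "batmand -cd3" output on stdin and compares
# how many HNA requests were queued with how many were taken off the
# todo list by the OGM-processing thread.
# Exit status 0: every queued request was processed; 1: some are pending.
check_todo() {
    awk '
        /put this on todo list/ { queued++ }
        /found todo item/       { processed++ }
        END {
            printf "queued=%d processed=%d\n", queued, processed
            if (processed < queued) exit 1
        }
    '
}

# Usage (the log file name is an assumption):
#   check_todo </tmp/batmand_d3.log
```

If the OGM-processing thread is blocked, the filter should report more queued than processed requests and return a non-zero status.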
If you can find a way to reliably reproduce this kind of problem, it would be much easier to fix. Just an idea: what happens with batmand (bound to the tap interface) when you stop the running tincd like this:
kill -STOP $(pidof tincd)
and later on:
kill -CONT $(pidof tincd)
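For repeated runs, the stop/continue test can be wrapped in a small function so that the pause always has the same length. A sketch only: pause_for is a hypothetical helper and the 60 seconds in the usage comment are an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: suspend the given processes for SECS seconds,
# then let them continue.
# Usage: pause_for SECS pid [pid...]
pause_for() {
    secs=$1
    shift
    # SIGSTOP freezes the processes without terminating them.
    kill -STOP "$@" || return 1
    sleep "$secs"
    # SIGCONT lets them resume where they were stopped.
    kill -CONT "$@"
}

# Usage on the router, while watching the batmand debug output:
#   pause_for 60 $(pidof tincd)
```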
ciao, axel
b.a.t.m.a.n@lists.open-mesh.org