Sorry, looks like I didn't reply all on this one:
On Fri, May 22, 2009 at 11:31 PM, Marek Lindner lindner_marek@yahoo.de wrote:
I am getting set up to do remote debugging. I'll see what I can find doing that.
Yeah, that would be another option.
ok, here is what I found on the debug malloc problem I am having:
In the function debugMalloc: ================================= chunkHeader = (struct chunkHeader *)memory; chunk = memory + sizeof(struct chunkHeader); chunkTrailer = (struct chunkTrailer *)(memory + sizeof(struct chunkHeader) + length);
chunkHeader->length = length; chunkHeader->tag = tag; chunkHeader->magicNumber = MAGIC_NUMBER;
chunkTrailer->magicNumber = MAGIC_NUMBER;
=> pthread_mutex_lock(&chunk_mutex); ================================= I get the following results:
(gdb) p chunkHeader $26 = (struct chunkHeader *) 0x308c8 (gdb) p chunk $27 = (unsigned char *) 0x308d8 "\020\322\n" (gdb) p chunkTrailer $28 = (struct chunkTrailer *) 0x308dd (gdb) p *chunkHeader $29 = {next = 0x40096c34, length = 5, tag = 15, magicNumber = 305419896} (gdb) p *chunkTrailer $30 = {magicNumber = 878082050} (gdb) p length $31 = 5
I think the magic number is not getting into the chunk trailer correctly because it is not aligned. chunkHeader is aligned because it was returned by malloc chunk is aligned because the size of the header is 16 chunkTrailer is not aligned because it is chunk + 5
Hope this helps. If not, and that shouldn't be a problem, I'll see what else I can find.
Looks like this one didn't go to all either. Sorry for the mess.
On Tue, May 26, 2009 at 10:47 AM, Nathan Wharton naw@greptar.com wrote:
On Fri, May 22, 2009 at 11:31 PM, Marek Lindner lindner_marek@yahoo.de wrote:
I am getting set up to do remote debugging. I'll see what I can find doing that.
Yeah, that would be another option.
ok, here is what I found on the debug malloc problem I am having:
In the function debugMalloc:
chunkHeader = (struct chunkHeader *)memory; chunk = memory + sizeof(struct chunkHeader); chunkTrailer = (struct chunkTrailer *)(memory + sizeof(struct chunkHeader) + length);
chunkHeader->length = length; chunkHeader->tag = tag; chunkHeader->magicNumber = MAGIC_NUMBER;
chunkTrailer->magicNumber = MAGIC_NUMBER;
=> pthread_mutex_lock(&chunk_mutex);
I get the following results:
(gdb) p chunkHeader $26 = (struct chunkHeader *) 0x308c8 (gdb) p chunk $27 = (unsigned char *) 0x308d8 "\020\322\n" (gdb) p chunkTrailer $28 = (struct chunkTrailer *) 0x308dd (gdb) p *chunkHeader $29 = {next = 0x40096c34, length = 5, tag = 15, magicNumber = 305419896} (gdb) p *chunkTrailer $30 = {magicNumber = 878082050} (gdb) p length $31 = 5
I think the magic number is not getting into the chunk trailer correctly because it is not aligned. chunkHeader is aligned because it was returned by malloc chunk is aligned because the size of the header is 16 chunkTrailer is not aligned because it is chunk + 5
Hope this helps. If not, and that shouldn't be a problem, I'll see what else I can find.
I added the following patch: ========================================= --- batmand-r1269/batman/allocate.c 2009-05-20 13:54:18.000000000 -0500 +++ batmand-r1269.mod/batman/allocate.c 2009-05-26 12:25:07.000000000 -0500 @@ -206,6 +206,10 @@ struct chunkHeader *chunkHeader; struct chunkTrailer *chunkTrailer; unsigned char *chunk; + uint32_t remainder = length % 4; + + if (remainder > 0) + length += 4 - remainder;
/* printf("sizeof(struct chunkHeader) = %u, sizeof (struct chunkTrailer) = %u\n", sizeof (struct chunkHeader), sizeof (struct chunkTrailer)); */
========================================= And I don't get any debug malloc problems anymore.
I do, however, still get the kernel crashes when using batgat.
On Tuesday 26 May 2009 20:50:56 Nathan Wharton wrote:
I added the following patch:
--- batmand-r1269/batman/allocate.c 2009-05-20 13:54:18.000000000 -0500 +++ batmand-r1269.mod/batman/allocate.c 2009-05-26 12:25:07.000000000 -0500 @@ -206,6 +206,10 @@ struct chunkHeader *chunkHeader; struct chunkTrailer *chunkTrailer; unsigned char *chunk;
uint32_t remainder = length % 4;
if (remainder > 0)
length += 4 - remainder;
/* printf("sizeof(struct chunkHeader) = %u, sizeof (struct chunkTrailer) = %u\n", sizeof (struct chunkHeader), sizeof (struct chunkTrailer)); */
Please don't introduce more magic numbers. If you need an alignment on integer granularity then please use sizeof(int) instead of a simple constant. The rest looks quite good and logical. I assumed that the only architecture without automatic catching of alignment problems is mips with load operators to the fp registers (never had such problems on simple arm machine nor could I produce that problem on them). Is there any other documentation about such problems on xscale or can you proof this with an test program (malloc 6 byte, write uint32_t to first 4 byte, write another uint32_t to byte 2-5, compare bytes with input - don't forget to set it volatile)? Your current proof if concept test only shows that the "overriden" bytes will not get overriden after moving them some bytes higher.
========================================= #include <stdio.h> #include <stdlib.h>
#define x_moved(pos) ((int*)&x[pos])[0] static const int value1 = 0x01234567; static const int value2 = 0x89abcdef;
int main() { volatile char *x = (char*)malloc(sizeof(int) * 2); x_moved(0) = value1; x_moved(2) = value2;
if (x_moved(2) != value2) { printf("ERROR Value: "); } else { printf("Value: "); } printf("%x %x\n", x_moved(0) , x_moved(2));
free(x); return 0; } =========================================
Best regards, Sven
On Tue, May 26, 2009 at 5:13 PM, Sven Eckelmann sven.eckelmann@gmx.de wrote:
On Tuesday 26 May 2009 20:50:56 Nathan Wharton wrote:
I added the following patch:
Please don't introduce more magic numbers. If you need an alignment on integer granularity then please use sizeof(int) instead of a simple constant. The rest looks quite good and logical. I assumed that the only architecture without automatic catching of alignment problems is mips with load operators to the fp registers (never had such problems on simple arm machine nor could I produce that problem on them). Is there any other documentation about such problems on xscale or can you proof this with an test program (malloc 6 byte, write uint32_t to first 4 byte, write another uint32_t to byte 2-5, compare bytes with input - don't forget to set it volatile)? Your current proof if concept test only shows that the "overriden" bytes will not get overriden after moving them some bytes higher.
I didn't really mean for that to be an accepted patch, just a quick test. Maybe the trailing magic number could just be a byte so you don't have to allocate more memory. I'm sure there are other things to do too.
I haven't looked at any documentation for the xscale for this problem. I can do a test program. I have seen this type of problem before, but it would throw a bus error instead of going along with an incorrect result.
========================================= #include <stdio.h> #include <stdlib.h>
#define x_moved(pos) ((int*)&x[pos])[0] static const int value1 = 0x01234567; static const int value2 = 0x89abcdef;
int main() { volatile char *x = (char*)malloc(sizeof(int) * 2); x_moved(0) = value1; x_moved(2) = value2;
if (x_moved(2) != value2) { printf("ERROR Value: "); } else { printf("Value: "); } printf("%x %x\n", x_moved(0) , x_moved(2));
free(x); return 0; } =========================================
Best regards, Sven
Thanks for your help.
On Wednesday 27 May 2009 08:06:16 Nathan Wharton wrote:
I haven't looked at any documentation for the xscale for this problem. I can do a test program. I have seen this type of problem before, but it would throw a bus error instead of going along with an incorrect result.
Yes, I'd expect some kind of interrupt error that the kernel handles. Please try to run the code that Sven sent and let us know about the results.
If this test is negative I would make a debug patch for you that calls checkIntegrity() during the different stages of the startup phase to narrow down the issue.
Regards, Marek
On Tue, May 26, 2009 at 5:13 PM, Sven Eckelmann sven.eckelmann@gmx.de wrote:
========================================= #include <stdio.h> #include <stdlib.h>
#define x_moved(pos) ((int*)&x[pos])[0] static const int value1 = 0x01234567; static const int value2 = 0x89abcdef;
int main() { volatile char *x = (char*)malloc(sizeof(int) * 2); x_moved(0) = value1; x_moved(2) = value2;
if (x_moved(2) != value2) { printf("ERROR Value: "); } else { printf("Value: "); } printf("%x %x\n", x_moved(0) , x_moved(2));
free(x); return 0; } =========================================
This results in: ERROR Value: 89abcdef cdef89ab
On Wed, May 27, 2009 at 9:20 AM, Nathan Wharton naw@greptar.com wrote:
On Tue, May 26, 2009 at 5:13 PM, Sven Eckelmann sven.eckelmann@gmx.de wrote:
========================================= #include <stdio.h> #include <stdlib.h>
#define x_moved(pos) ((int*)&x[pos])[0] static const int value1 = 0x01234567; static const int value2 = 0x89abcdef;
int main() { volatile char *x = (char*)malloc(sizeof(int) * 2); x_moved(0) = value1; x_moved(2) = value2;
if (x_moved(2) != value2) { printf("ERROR Value: "); } else { printf("Value: "); } printf("%x %x\n", x_moved(0) , x_moved(2));
free(x); return 0; } =========================================
This results in: ERROR Value: 89abcdef cdef89ab
Some good info is here: http://lecs.cs.ucla.edu/wiki/index.php/XScale_alignment
On Wednesday 27 May 2009 16:20:16 Nathan Wharton wrote:
On Tue, May 26, 2009 at 5:13 PM, Sven Eckelmann sven.eckelmann@gmx.de
wrote:
========================================= #include <stdio.h> #include <stdlib.h>
#define x_moved(pos) ((int*)&x[pos])[0] static const int value1 = 0x01234567; static const int value2 = 0x89abcdef;
int main() { volatile char *x = (char*)malloc(sizeof(int) * 2); x_moved(0) = value1; x_moved(2) = value2;
if (x_moved(2) != value2) { printf("ERROR Value: "); } else { printf("Value: "); } printf("%x %x\n", x_moved(0) , x_moved(2)); free(x); return 0;
}
This results in: ERROR Value: 89abcdef cdef89ab
Thanks. This is completely unexpecting for a kernel with alignment trap. The expected result is "Value: 12389ab 89abcdef". Can you check the value of /proc/cpu/alignment ? I will look further at it this evening.
Regards, Sven
On Wed, May 27, 2009 at 11:16 AM, Sven Eckelmann sven.eckelmann@gmx.de wrote:
Thanks. This is completely unexpecting for a kernel with alignment trap. The expected result is "Value: 12389ab 89abcdef". Can you check the value of /proc/cpu/alignment ? I will look further at it this evening.
Regards, Sven
User: 0 System: 0 Skipped: 0 Half: 0 Word: 0 DWord: 0 Multi: 0 User faults: 0 (ignored)
On Wed, May 27, 2009 at 11:21 AM, Nathan Wharton naw@greptar.com wrote:
On Wed, May 27, 2009 at 11:16 AM, Sven Eckelmann sven.eckelmann@gmx.de wrote:
Thanks. This is completely unexpecting for a kernel with alignment trap. The expected result is "Value: 12389ab 89abcdef". Can you check the value of /proc/cpu/alignment ? I will look further at it this evening.
Regards, Sven
User: 0 System: 0 Skipped: 0 Half: 0 Word: 0 DWord: 0 Multi: 0 User faults: 0 (ignored)
I do get the correct result after echo 2 > /proc/cpu/alignment
On Wednesday 27 May 2009 18:35:42 Nathan Wharton wrote:
On Wed, May 27, 2009 at 11:21 AM, Nathan Wharton naw@greptar.com wrote:
On Wed, May 27, 2009 at 11:16 AM, Sven Eckelmann sven.eckelmann@gmx.de
wrote:
Thanks. This is completely unexpecting for a kernel with alignment trap. The expected result is "Value: 12389ab 89abcdef". Can you check the value of /proc/cpu/alignment ? I will look further at it this evening.
Regards, Sven
User: 0 System: 0 Skipped: 0 Half: 0 Word: 0 DWord: 0 Multi: 0 User faults: 0 (ignored)
Thanks.
I do get the correct result after echo 2 > /proc/cpu/alignment
Ok, so 100% sure that it is a alignment problem. The kernel fixup is always activated and shouldn't create such problems as the page faults you mentioned in your first post. Maybe some communication between kernelland and userland has gone wrong. Can you test batmand with batgat and activated alignment fixup?
Best regards, Sven
(Thanks for the link - it is a real nice overview).
On Wed, May 27, 2009 at 12:13 PM, Sven Eckelmann sven.eckelmann@gmx.de wrote:
Ok, so 100% sure that it is a alignment problem. The kernel fixup is always activated and shouldn't create such problems as the page faults you mentioned in your first post. Maybe some communication between kernelland and userland has gone wrong. Can you test batmand with batgat and activated alignment fixup?
Best regards, Sven
(Thanks for the link - it is a real nice overview).
Changing the mode from -r 2 to -g 12000 still causes the kernel to crash with alignment fixup.
On the console I am doing a logread when it happens. The last thing printed out before the crash is: May 27 17:12:23 OpenWrt user.debug kernel: batgat: [batgat_ioctl:238] disconnect daemon
b.a.t.m.a.n@lists.open-mesh.org