Hi,
Is anyone familiar with the implications of using kmalloc() vs. kmem_cache_alloc()? Not just allocation speed, but also RAM fragmentation is something I'm currently wondering about.
We have had issues with the large, bulk allocations for the debugfs tables before, where, if I remember correctly, the expression "RAM fragmentation" was mentioned. With the fallback from kmalloc() to vmalloc() in the debugfs internals on more recent kernels, at least that problem is gone for now.
Still, I'm wondering whether a couple of thousand global TT entry allocations (Freifunk Hamburg currently has more than three thousand) could lead to badly fragmented RAM. Maybe too much for a cute wifi router with just 32 MB of RAM.
I then noticed that the bridge uses kmem_cache_alloc() instead of kmalloc() for its fdb (forwarding database) entries. Would it make sense to do the same for global TT entries (and maybe originator structs, too)?
If so, I'd be happy to look into providing a patch.
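Just to sketch what I have in mind (a rough sketch only - the init/alloc/free helpers and the include are my guesses, not actual batman-adv code; only batadv_tt_global_cache and struct batadv_tt_global_entry follow the existing naming), modelled on how the bridge sets up its fdb cache:

#include <linux/cache.h>
#include <linux/slab.h>
#include "types.h" /* assuming this is where struct batadv_tt_global_entry lives */

static struct kmem_cache *batadv_tt_global_cache __read_mostly;

static int batadv_tt_cache_init(void)
{
	/* one dedicated slab cache, sized exactly for a global TT entry */
	batadv_tt_global_cache = kmem_cache_create("batadv_tt_global_cache",
						   sizeof(struct batadv_tt_global_entry),
						   0, SLAB_HWCACHE_ALIGN, NULL);
	return batadv_tt_global_cache ? 0 : -ENOMEM;
}

static struct batadv_tt_global_entry *batadv_tt_global_alloc(void)
{
	/* zeroed allocation from the dedicated cache instead of kzalloc() */
	return kmem_cache_zalloc(batadv_tt_global_cache, GFP_ATOMIC);
}

static void batadv_tt_global_release(struct batadv_tt_global_entry *tt_global)
{
	/* objects from a kmem_cache must go back via kmem_cache_free() */
	kmem_cache_free(batadv_tt_global_cache, tt_global);
}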
Regards, Linus
PS: This is why the thought came up: https://github.com/freifunk-gluon/gluon/issues/753 It could be a memory leak or something else, though. Investigation is still in progress.
PPS: I'd also be interested to hear whether anyone knows of any tools to visualize potential RAM fragmentation.
On Saturday 14 May 2016 16:51:29 Linus Lüssing wrote:
Hi,
Is anyone familiar with the implications of using kmalloc() vs. kmem_cache_alloc()? Not just allocation speed, but also RAM fragmentation is something I'm currently wondering about.
Yes, it should reduce the effects of allocating differently sized objects (which can leave behind small regions of memory that cannot be used anymore). But my guess is that SLAB isn't that bad, because it already has some caches for differently sized memory regions.
I think we should check whether this helps by first testing it with the main TT objects. I've sent an RFC patch [1]. Unfortunately, I am not aware of any nice tools to really check the size of the available, contiguous memory chunks. The only thing of partial interest I know about is /proc/slabinfo, which shows you the state of the available caches (including the slab page caches).
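As a quick illustration of those size-class caches (just a throwaway test function, nothing from a real patch): ksize() reports how much memory the allocator actually reserved for a kmalloc() request, which makes the rounding to the next kmalloc-* class visible.

#include <linux/kernel.h>
#include <linux/slab.h>

static void kmalloc_rounding_demo(void)
{
	/* 144 bytes requested; on a stock amd64 kernel this is served from
	 * the kmalloc-192 size class, so ksize() should report 192 bytes.
	 */
	void *p = kmalloc(144, GFP_KERNEL);

	if (!p)
		return;

	pr_info("requested 144 bytes, got %zu usable bytes\n", ksize(p));
	kfree(p);
}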
Kind regards, Sven
On Sun, May 15, 2016 at 01:27:39PM +0200, Sven Eckelmann wrote:
On Saturday 14 May 2016 16:51:29 Linus Lüssing wrote:
Hi,
Is anyone familiar with the implications of using kmalloc() vs. kmem_cache_alloc()? Not just allocation speed, but also RAM fragmentation is something I'm currently wondering about.
Yes, it should reduce the effects of allocating differently sized objects (which can leave behind small regions of memory that cannot be used anymore). But my guess is that SLAB isn't that bad, because it already has some caches for differently sized memory regions.
Yes, I tested the following patchset [0] yesterday, created a few hundred clients in x86 VMs and observed the output of /proc/slabinfo.
* tt_global_entry uses the kmalloc-node cache (objsize 192), same size for a batadv_tt_global_cache
* tt_orig_list_entry uses a kmalloc-64, same size for a batadv_tt_orig_cache
(sizeof(tt-global) -> 144, sizeof(orig-entry) -> 56, sizeof(tt-common) -> 64, sizeof(tt_local) -> 80)
So indeed it looks like there might not be a difference, fragmentation-wise, between using one of the predefined caches and a custom cache. And the wasted space seems to be the same (if I'm not misinterpreting the output of slabinfo).
On the other hand, it seems common to use custom caches for large numbers of frequently changing objects of the same type. Filesystems seem to use them regularly.
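One extra feature they get out of it is the constructor argument of kmem_cache_create(), which plain kmalloc() does not offer. Roughly like this (simplified from memory and with made-up demo names, so only an approximation of what e.g. ext4 does for its inode cache):

#include <linux/slab.h>

struct demo_inode_info {
	int dummy; /* per-object state */
};

static struct kmem_cache *demo_inode_cachep;

static void demo_init_once(void *obj)
{
	struct demo_inode_info *info = obj;

	/* runs when a slab object is first set up, not on every allocation */
	info->dummy = 0;
}

static int demo_cache_create(void)
{
	demo_inode_cachep = kmem_cache_create("demo_inode_cache",
					      sizeof(struct demo_inode_info),
					      0, SLAB_RECLAIM_ACCOUNT,
					      demo_init_once);
	return demo_inode_cachep ? 0 : -ENOMEM;
}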
I think we should check whether this helps by first testing it with the main TT objects. I've sent an RFC patch [1]. Unfortunately, I am not aware of any nice tools to really check the size of the available, contiguous memory chunks. The only thing of partial interest I know about is /proc/slabinfo, which shows you the state of the available caches (including the slab page caches).
Ok, yes, that's what I had looked at yesterday, too. I'll check whether I can get some guys from Freifunk Rhein-Neckar or Freifunk Hamburg to test these patches and see whether they make a difference for them.
Kind regards, Sven
[0] https://git.open-mesh.org/batman-adv.git/shortlog/refs/heads/linus/kmem-cach...
On Sunday 15 May 2016 14:06:26 Linus Lüssing wrote: [...]
[0] https://git.open-mesh.org/batman-adv.git/shortlog/refs/heads/linus/kmem-cache
This patchset has a bad bug. It still uses kfree for tt_local objects allocated via kmem_cache_(z)alloc.
Kind regards, Sven
On Sunday 15 May 2016 14:15:01 Sven Eckelmann wrote:
On Sunday 15 May 2016 14:06:26 Linus Lüssing wrote: [...]
[0] https://git.open-mesh.org/batman-adv.git/shortlog/refs/heads/linus/kmem-cache
This patchset has a bad bug. It still uses kfree for tt_local objects allocated via kmem_cache_(z)alloc.
Also, kmem_cache_destroy() seems to be broken with regard to the RCU handling.
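To sketch what the correct free/destroy path would have to look like (function and field names here are just illustrative, this is not a patch): objects handed out by kmem_cache_(z)alloc() must go back through kmem_cache_free(), also in the RCU callback, and rcu_barrier() has to complete before the cache itself is destroyed.

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include "types.h" /* assuming struct batadv_tt_local_entry lives here */

static struct kmem_cache *batadv_tt_local_cache;

static void batadv_tt_local_entry_free_rcu(struct rcu_head *rcu)
{
	struct batadv_tt_local_entry *tt_local;

	tt_local = container_of(rcu, struct batadv_tt_local_entry,
				common.rcu);

	/* kmem_cache_free(), not kfree(), for cache-allocated objects */
	kmem_cache_free(batadv_tt_local_cache, tt_local);
}

static void batadv_tt_cache_destroy(void)
{
	/* wait for all pending call_rcu() callbacks; otherwise the cache
	 * could be destroyed while objects are still queued for freeing
	 */
	rcu_barrier();
	kmem_cache_destroy(batadv_tt_local_cache);
}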
Kind regards, Sven
On Sun, May 15, 2016 at 02:15:01PM +0200, Sven Eckelmann wrote:
On Sunday 15 May 2016 14:06:26 Linus Lüssing wrote: [...]
[0] https://git.open-mesh.org/batman-adv.git/shortlog/refs/heads/linus/kmem-cache
This patchset has a bad bug. It still uses kfree for tt_local objects allocated via kmem_cache_(z)alloc.
Where?
On Sunday 15 May 2016 14:37:33 Linus Lüssing wrote:
On Sun, May 15, 2016 at 02:15:01PM +0200, Sven Eckelmann wrote:
On Sunday 15 May 2016 14:06:26 Linus Lüssing wrote: [...]
[0] https://git.open-mesh.org/batman-adv.git/shortlog/refs/heads/linus/kmem-cache
This patchset has a bad bug. It still uses kfree for tt_local objects allocated via kmem_cache_(z)alloc.
Where?
Search for kfree( in my patch [1]
Kind regards, Sven
[1] https://git.open-mesh.org/batman-adv.git/blob/60ebd7bc66140380c5931e8d8d2c36...
On Sun, May 15, 2016 at 02:06:26PM +0200, Linus Lüssing wrote:
Ok, yes, that's what I had looked at yesterday, too.
Btw., these were the results from slabinfo I got yesterday. The first one before applying the patches, the second one after:
http://metameute.de/~tux/batman-adv/slablog/before/ http://metameute.de/~tux/batman-adv/slablog/after/
The first number is the number of lines from "batctl tg", second one the timestamp.
On Sunday 15 May 2016 14:41:38 Linus Lüssing wrote:
On Sun, May 15, 2016 at 02:06:26PM +0200, Linus Lüssing wrote:
Ok, yes, that's what I had looked at yesterday, too.
Btw., these were the results from slabinfo I got yesterday. The first one before applying the patches, the second one after:
http://metameute.de/~tux/batman-adv/slablog/before/ http://metameute.de/~tux/batman-adv/slablog/after/
The first number is the number of lines from "batctl tg", second one the timestamp.
Hm, it looks like the biggest difference is in kmalloc-64. So this would mean that the kmalloc version uses 64-byte entries for tg entries, and the batadv_tt_global_cache version uses 192 bytes (so it has an even larger overhead). The question now is: why?
My first guess was that you are using ar71xx with MIPS_L1_CACHE_SHIFT == 5. This would cause a cache_line_size() of 32. The tg object is 48 bytes on ar71xx. So it looks like you are using a different architecture [1], because otherwise the (cache) alignment would also be 64 bytes. Maybe you have some debug options enabled that cause the extra used bytes?
Extra debug information would also explain why bridge_fdb_cache requires 128 bytes (cache aligned) per net_bridge_fdb_entry. I would have expected it to use no more than 64 bytes and to be merged automatically with something like kmalloc-64 (see __kmem_cache_alias for the code merging different kmem_caches).
Just some thoughts about the kmem_cache approach: we would only benefit from using kmem_cache when we could get an objsize which is smaller than any available slub/slab kmalloc-* cache. Otherwise slub/slab would automatically use a well-fitting internal kmem_cache for everything.
Right now, a tg entry on my systems (ar71xx MIPS, amd64) has a raw size of 48-80 bytes. These would end up at an objsize (cache line aligned) of 64-96 bytes. On OpenWrt (ar71xx) it should be merged with kmalloc-64 and on Debian (amd64) it should be merged with kmalloc-96 (not tested - but maybe it is important to mention that kmalloc-96 has an objsize of 128 on my running system).
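To make that arithmetic explicit (purely illustrative, the numbers are the ones discussed above):

#include <linux/cache.h>
#include <linux/kernel.h>

/* With SLAB_HWCACHE_ALIGN the effective slab object size is (roughly) the
 * raw struct size rounded up to the cache line size, e.g. a 48 byte tg
 * entry with 32 byte cache lines (ar71xx) ends up as a 64 byte object.
 */
static inline size_t hwcache_aligned_objsize(size_t raw_size)
{
	return ALIGN(raw_size, cache_line_size());
}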
Kind regards, Sven
[1] Yes, I saw the kvm and ACPI lines after I wrote this stuff. So you are most likely testing on some x86 system
On Sun, May 15, 2016 at 10:50:20PM +0200, Sven Eckelmann wrote:
Hm, it looks like the biggest difference is in kmalloc-64. So this would mean that the kmalloc version uses 64-byte entries for tg entries, and the batadv_tt_global_cache version uses 192 bytes (so it has an even larger overhead). The question now is: why?
The biggest difference is not only in kmalloc-64 but also in kmalloc-node.
tg entries seem to end up in kmalloc-node (192 objsize), tt orig list entries in kmalloc-64 I think (like I wrote in my previous mails).
My first guess was that you are using ar71xx with MIPS_L1_CACHE_SHIFT == 5. This would cause a cache_line_size() of 32. The tg object is 48 bytes on ar71xx. So it looks like you are using a different architecture [1], because otherwise the (cache) alignment would also be 64 bytes. Maybe you have some debug options enabled that cause the extra used bytes?
Yes, it's not ar71xx like you have, it's x86-64/amd64 in a VM. sizeof() actually tells me 144 bytes for a tg entry and 56 bytes for an orig-list entry (as I wrote before).
Extra debug information would also explain why bridge_fdb_cache requires 128 bytes (cache aligned) per net_bridge_fdb_entry. I would have expected it to use no more than 64 bytes and to be merged automatically with something like kmalloc-64 (see __kmem_cache_alias for the code merging different kmem_caches).
Hm, could be, yes, I have enabled quite a few options in the kernel hacking section.
Just some thoughts about the kmem_cache approach: we would only benefit from using kmem_cache when we could get an objsize which is smaller than any available slub/slab kmalloc-* cache. Otherwise slub/slab would automatically use a well-fitting internal kmem_cache for everything.
Might be. In the /proc/slabinfo output, batadv_tt_global_cache and kmalloc-node, as well as batadv_tt_orig_cache and kmalloc-64, looked similar.
But I don't know whether there are any internal differences for the custom caches. Unfortunately, documentation seems to be rare regarding kmem caches :(.
Right now, a tg entry on my systems (ar71xx MIPS, amd64) has a raw size of 48-80 bytes. These would end up at an objsize (cache line aligned) of 64-96 bytes. On OpenWrt (ar71xx) it should be merged with kmalloc-64 and on Debian (amd64) it should be merged with kmalloc-96 (not tested - but maybe it is important to mention that kmalloc-96 has an objsize of 128 on my running system).
In my VMs too, as can be seen in the provided slabinfo.
Kind regards, Sven
[1] Yes, I saw the kvm and ACPI lines after I wrote this stuff. So you are most likely testing on some x86 system
Indeed :).
On Sunday 15 May 2016 23:26:32 Linus Lüssing wrote: [...]
Yes, it's not ar71xx like you have, it's x86-64/amd64 in a VM. sizeof() actually tells me 144 bytes for a tg entry and 56 bytes for an orig-list entry (as I wrote before).
Ah, sorry. I was doing something else when I received the mail which explained the VM setup - I scrolled over the initial part of the mail and only read the part about your own patchset.
Kind regards, Sven
Just to throw in some credible sources.
https://static.lwn.net/images/pdf/LDD3/ch08.pdf:
"The main differences in passing from scull to scullc are a slight speed improvement and better memory use. Since quanta are allocated from a pool of memory fragments of exactly the right size, their placement in memory is as dense as possible, as opposed to scull quanta, which bring in an unpredictable memory fragmentation."
Also, the paragraph regarding SLAB_HWCACHE_ALIGN sounds interesting. (And maybe it might be worth considering omitting this flag for the sake of our beloved but always memory-deprived embedded systems? Most TT entries aren't usually read anyway.)
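For what it's worth, dropping the flag would be a one-line difference when creating the cache (again just a sketch, with the same illustrative names as in the earlier sketch): objects get packed densely instead of being padded out to cache line boundaries, trading some access speed for RAM.

#include <linux/slab.h>
#include "types.h" /* assuming struct batadv_tt_global_entry lives here */

static struct kmem_cache *batadv_tt_global_cache;

static int batadv_tt_cache_init_packed(void)
{
	/* no SLAB_HWCACHE_ALIGN: densely packed objects, saving RAM at the
	 * cost of entries possibly sharing or straddling cache lines
	 */
	batadv_tt_global_cache = kmem_cache_create("batadv_tt_global_cache",
						   sizeof(struct batadv_tt_global_entry),
						   0, 0, NULL);
	return batadv_tt_global_cache ? 0 : -ENOMEM;
}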
Should we just go for kmem_cache_alloc() or should someone ask on netdev@ first whether that chapter is still valid for current kernels?