On 08/07/2015 06:16 PM, Linus Lüssing wrote:
> Hi Matthias,
> here at the Wireless BattleMesh we finally had the chance to get some initial discussions on your patch going. For a start, a few comprehension questions came up:
> On Wed, Jun 24, 2015 at 08:34:28PM +0200, Matthias Schiffer wrote:
>> - As batman-adv uses single_open, the whole content of the originators/transglobal files must fit into a single buffer; in large batman-adv networks this often fails (as an order-5 allocation or even more would be necessary)
>> - When originators or transglobal aren't just used for debugging, they are first converted to text and then parsed again in userspace by tools like alfred/batadv-vis. Sending MAC address lists from the kernel to userspace as text makes the buffer size issue even worse.
> These two points can be addressed through debugfs too, for instance using sequential debugfs writes, right? (In fact, IIRC you had started with that approach until you got the feedback from GregKH, right?)
I've had a look at the different functions the seq_file API provides, but I didn't write any code.
> Can you elaborate a little more on the "order-5 allocation"? What amount of free RAM did the machines have where we observed Out-of-Memory kernel panics upon debugfs access? Can you give some numbers / calculations for why we ended up with allocations of several megabytes on debugfs access?
An order-5 allocation is 2^5 = 32 pages of memory, i.e. 128K of RAM. As these are allocated by kmalloc, the 32 pages must be one contiguous piece of physical RAM. Because RAM fragments more and more the longer a system runs, 32 contiguous pages can be hard to find even when there are still tens of MB of free RAM.
The fragmentation issue is made a bit worse by the seq_file code: it starts with a single page and loops until the buffer is big enough to fit the whole output, freeing the old buffer in each loop iteration and allocating a new buffer twice the size.
> The debugfs race conditions GregKH and you talked about are on adding/removing debugfs files, right? Are there any known race conditions on simple reads/writes in the absence of removing debugfs files?
The race conditions only occur when files are removed, but that alone is bad enough - I'd really like to avoid enabling debugfs at all on critical systems.
> Since you've had a look at both the netlink and the sequential debugfs approach already, can you give some estimate of the complexity, or a rough number of lines of code to change, for the sequential debugfs approach?
I guess that should be possible in 100~200 lines of code. Most of it would be similar to the code I've implemented for the netlink API: storing counters between the callback runs to keep track of the current position in the data structures. Of course, it will have the same drawbacks: when originator/tt entries are added or removed between the calls, entries may be duplicated in or missing from the output.
The main reason why I didn't consider fixing the debugfs code first was that returning to userspace in the middle of the read makes the race conditions much easier to hit: at the moment, the race can only occur when the file is removed between open() and read(), a window that is usually very short. The time between several read() calls can be much longer, especially when the files are read and parsed line by line.
> One thought that popped up here was whether it'd make sense to first "fix" the debugfs approach to the extent possible with a couple of lines, instead of 800+ lines, to get rid of the issues we frequently observe - and then merge a complete, but bigger, patchset implementing netlink support, with a more thorough review and discussion of what we'd need for its API now and for upcoming features.
> Cheers, Linus