I see some comments about soft lockups during memory pressure. I have struggled with this immensely over the years. I wrote a userspace memory reclaimer daemon and have not had a lockup since: https://gist.github.com/EBADBEEF/f168458028f684a91148f4d3e79... .
The hangs usually happened when I was stressing VFS (the computer was a samba server) along with other workloads. To trigger a hang manually I would read in large files (bigger than available ram) in parallel while running a game. I could get it to hang even with 128GB ram. I tweaked all the vfs settings (swappiness, etc...) to no avail. I tried with and without swap.
In the end it looked like memory was not getting reclaimed fast enough, like linux would wait too long to start reclaiming memory and some critical process would get stuck waiting for some memory. The system would hang for minutes or hours at a time only making the tiniest of progress between reclaims.
If I caught the problem early enough (just as everything started stuttering) I could trigger a reclaim manually by writing to '/sys/fs/cgroup/memory.reclaim' and the system would recover. I wonder if it was specific to btrfs or some specific workload pattern but I was never able to figure it out.
The swap/memory situation in linux has surprised me quite a bit coming from Windows.
Windows remains mostly responsive even when memory is being pushed to the limits and it is swapping gigabytes per second, while on Linux, when I ran a stress test that ate all the memory, I had trouble even terminating the script.
There are two things that cause this. First, Windows has a variable swap file size whereas Linux's is fixed, so Windows can keep growing the page file until the drive fills up instead of running out of swap space. Second, the out-of-memory killer in Linux isn't very aggressive by default: it prefers to over-commit memory rather than kill processes.
As far as I know, Linux still doesn't support a variable-sized swap file, but it is possible to change how aggressively it over-commits memory or kills processes to free memory.
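For reference, those knobs live under vm.* sysctls. A hypothetical fragment (the values are illustrative, not a recommendation):

```
# /etc/sysctl.d/90-overcommit.conf
# 0 = heuristic overcommit (default), 1 = always overcommit, 2 = strict
vm.overcommit_memory = 2
# With mode 2, commit limit = swap size + overcommit_ratio% of RAM
vm.overcommit_ratio = 80
```

Strict mode makes allocations fail up front (malloc returns NULL) instead of succeeding and getting the process OOM-killed later.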
As to why these differences exist, it's more historical than technical. My best guess is that Windows figured it out sooner because it has always lived in an environment where multiple programs are memory hogs, whereas that wasn't common on Linux until the proliferation of web-based everything, with each Chrome tab or Electron instance needing hundreds of megabytes to gigabytes of memory even for something as simple as a news article or chat client.
Windows "figured it out sooner" because it never really had to seriously deal with overcommitting memory: there is no fork(), so the memory usage figures of the processes are accurate. On Linux, however, the un-negotiable existence of fork() really leaves one with no truly good solution (and this has been debated for decades).
Not really. It elegantly solves the problem of "create a process, letting it inherit these settings and resetting those other settings", where "settings" is an ever-changing and expanding list of things you wouldn't want to bake into the API. Thus (omitting error checks and simplifying many details):
int fd[2];
pipe (fd);            // create a pipe to share with the child
if (fork () == 0) {   // child
    close (...);      // close some stuff
    setrlimit (...);  // add a ulimit to the child
    sigaction (...);  // change signal handling
    // also: clean the environment, set cgroups
    execvp (...);     // run the child
}
It's also enormously flexible. I don't know of any other API that, besides all of the above, also lets you change the relationship between parent and child, and create duplicate worker processes.
Comparing it to Windows is hilarious because Linux can create processes vastly more efficiently and quickly than Windows.
Yes, but at what cost? 99% of fork() calls are immediately followed by exec(), yet every kernel object needs to handle being forked, and a great deal of memory-management housekeeping is done only to be discarded right afterward. And it doesn't work at all for AMP systems (which we will have to deal with, sooner or later).
In 1970 it might have been the only way to provide a flexible API, but nowadays we have a great variety of extensible serialization formats better than "struct".
Windows will also prioritise keeping the desktop and the currently focused application running smoothly; the Linux kernel has no idea what's focused or what not to kill, so your desktop shell is right up there on the menu in OOM situations.
There are daemons (not installed by default) that monitor memory usage and can increase swap size or kill processes accordingly (you can ofc also configure OOM killer).
I once was trying to set up a VPN that needed to adjust the TTL to keep its presence transparent, only to discover that I'd have to recompile the kernel to do so. How did packet filtering end up privileged, let alone running inside the kernel?
I recently started using SSHFS, which I can run as an unprivileged user, and suspending with a drive mounted reliably crashes the entire system. Back on the topic of swap space, any user that isn't confined by a cgroup memory limit, which is rarely configured, can also crash the system by allocating a bunch of RAM.
Linux is one of the most advanced operating systems in existence, with new capabilities regularly being added in, but it feels like it's skipped over several basics.
Huh? What does swap area size have to do with responsiveness under load? Linux has a long history of being unusable under memory pressure. systemd-oomd helps a little (killing processes before direct reclaim makes everything seize up), but there's still no general solution. The relevance of history is that Windows got this basically right early on and Linux never did.
Nothing to do with overcommit either. Why would that make a difference? We're talking about interactivity under load; how we got to the loaded state doesn't matter.
In Linux the default swap behaviour is to also page out the memory mapped from the executable file, not just memory allocated by the process. This is a relatively sane approach on servers, but not so much on desktops. I believe both Windows and macOS don't swap out code pages, so applications remain responsive, at the cost of (potentially) lower swap efficiency.
Don’t know about Macs, but on Windows executable code is treated like a readonly memory mapped file that can be loaded and restored as the kernel sees fit. It could also be shared between processes, though that is not happening that much anymore due to ASLR.
Oh yeah. Bug 12309 was reported, what, nearly 20 years ago now? It's fair to say that at this point the arrival of GNU Mach will happen sooner than Linux learning to work properly under memory pressure.
Like Linux / open source often, it depends on what you do with it!
The kernel is very slow to kill stuff. Very very very very slow. It will try and try and try to prevent having to kill anything. It will be absolutely certain it can reclaim nothing more, and it will be at an absolute crawl trying to make every little kilobyte it can free, swapping like mad to try options to free stuff.
But there are a number of daemons you can use if you want to be more proactive! Systemd now has systemd-oomd. It's pretty good! There's others, with other strategies for what to kill first, based on other indicators!
The flexibility is a feature, not a bug. What distro are you on? I'm kind of surprised it didn't ship with one of these enabled.
Yeah I think SSD / NVME makes all the difference here - I certainly remember XP / Vista / Win 7 boxes that became unusable and more-or-less unrecoverable (just like Linux) once a swap storm starts.
The annoying thing I've found with Linux under memory stress (and still haven't found a nice way to solve) is I want it to always always always kill firefox first. Instead it tends to either kill nothing (causing the system to hang) or kill some vital service.
You can bump /proc/$firefox_pid/oom_score_adj to make it likely target. The easiest way is to make wrapper script that bumps the score and then starts firefox. All children will inherit the score.
I'm not sure that I'd want the OS to kill my browser while I'm working within it.
Of course the browser is the largest process in my system, so when I notice that memory is running low I restart it and I gain some 15 GB.
Basically I am the memory manager of my system, and I've been able to run my 32 GB Linux laptop with no swap since 2014. I read that a system with no swap is suboptimal, but the only tradeoff I notice is manual OOM handling vs. fewer writes on my SSD. I'm happy with it.
There are two pillars to managing RAM with virtual memory. The obvious one is writing one program's working set to disk so that another program can use that memory. The other, which isn't prevented by disabling swap, is flushing parts of a program that were loaded from disk and reloading them from disk when next needed.
That second pillar is actually worse for interactivity than swapping the working set, which is why disabling swap entirely isn't considered optimal.
By far the best approach is just to have an absurd amount of RAM - which of course is a much less accessible option now than it was a year ago.
If using systemd-oomd, you can launch Firefox into its own cgroup / systemd scope whose memory pressure settings tell oomd not to kill it: ManagedOOMPreference=avoid.
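As a hypothetical drop-in, assuming Firefox already runs in a user scope unit named app-firefox.scope (the unit name is an assumption; check `systemctl --user status` for yours):

```
# ~/.config/systemd/user/app-firefox.scope.d/oomd.conf
[Scope]
ManagedOOMPreference=avoid
```

After `systemctl --user daemon-reload`, systemd-oomd should prefer other candidates over that scope when pressure limits are hit.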
Yeah, systemd-oomd seems tuned for server workloads; I couldn't get it to stop killing my session instead of whichever app had eaten the memory.
Honestly, on the desktop I'd rather have a popup that let me select which app to kill. But the kernel doesn't even seem able to prioritize the display server's memory.
> Windows remains mostly fully responsive even when memory is being pushed to the limits and swapping gigabytes per second ...
The problem though is all the times when Windows is totally unusable even though it's doing exactly jack shit. An example would be when it's doing all its pointless updates/upgrades.
I don't know what people are smoking in this world when they somehow believe that Windows 11 is an acceptable user experience and a better OS than Linux.
Somehow though it's Linux that's powering billions if not tens of billions of devices worldwide and only about 12% of all new devices sold worldwide are running that piece of turd that Windows is.
OOM killers serve a purpose but - for a desktop OS - they're missing the point.
In a sane world the user would decide which application to shut down, not the OS; the user would click the appropriate application's close gadget, and the user interface would remain responsive enough for that to happen in a matter of seconds rather than minutes.
I understand the many reasons why that's not possible, but it's a huge failing of Linux as a desktop OS, and OOM killers are picking around the edges of the problem, not addressing it head-on.
(Which isn't to say, of course, that OOM killers aren't the right approach in a server context.)
This is effectively what macOS does. It presents any end user GUI programs (or Terminal.app if it's a CLI program) to elect to kill by way of popup with the amount of memory the application is consuming and whether or not it is responsive.
Not a bad system, but macOS has a fundamental memory leak for the past few versions which causes even simple apps like Preview.app, your favorite browser (doesn't matter which one), etc. to 'leak' and bring up the choose-your-own-OOM-kill-target dialog.
> I understand the many reasons why that's not possible
Isn't the only reason that the UI effectively freezes under high memory pressure? If the processing involved in the core parts of the UI could be prioritized, there's no reason for it not to work.
I don't understand why we can't have (?) a way of saying "the pages of these few processes are holy"? (yes, it would still be slow allocating new pages, but it should be possible to write a small core part such that it doesn't need to constantly allocate memory?)
I think zswap is the better option because it's not a fixed RAM storage, it merely compresses pages in RAM up to a variable limit and then writes to swap space when needed, which is more efficient.
It worked very well with my preceding laptop limited to 4GB of RAM.