
Looks like the big challenge is managing a large LRU cache, which tends to be a difficult problem for GC runtimes. I bet the JVM, with its myriad tunable GC algorithms, would perform better, especially Shenandoah and, of course, the Azul C4.

The JVM world tends to solve this problem by using off-heap caches. See Apache Ignite [0] or Ehcache [1].
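To illustrate the general idea (this is a toy sketch, not how Ignite or Ehcache actually implement it): cached bytes live in a direct ByteBuffer in native memory, so the collector only ever scans a small on-heap index, never the cached data itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy off-heap store: values are copied into a direct ByteBuffer (native
// memory), so the GC only sees the small on-heap index of (offset, length)
// entries -- the cached payload never contributes to GC scan work.
public class OffHeapStore {
    private final ByteBuffer arena = ByteBuffer.allocateDirect(1 << 20); // 1 MiB
    private final Map<String, long[]> index = new HashMap<>(); // key -> {offset, length}

    public void put(String key, byte[] value) {
        int offset = arena.position();
        arena.put(value); // copy bytes off-heap
        index.put(key, new long[]{offset, value.length});
    }

    public byte[] get(String key) {
        long[] loc = index.get(key);
        if (loc == null) return null;
        byte[] out = new byte[(int) loc[1]];
        ByteBuffer view = arena.duplicate(); // independent position, shared contents
        view.position((int) loc[0]);
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        OffHeapStore store = new OffHeapStore();
        store.put("greeting", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(store.get("greeting"), StandardCharsets.UTF_8));
    }
}
```

A real off-heap cache adds eviction, free-space management, and serialization on top of this, which is exactly where the fragmentation concerns below come in.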

I can't speak for how their Rust cache manages memory, but the thing to be careful of in non-GC runtimes (especially non-copying GC) is memory fragmentation.

It's worth mentioning that the Dgraph folks wrote a better Go cache [2] once they hit the limits of the usual Go caches.

From a purely architectural perspective, I would try to put cacheable material in something like memcache or redis, or one of the many distributed caches out there. But it might not be an option.

It's worth mentioning that Apache Cassandra itself uses an off-heap cache.

[0]: https://ignite.apache.org/arch/durablememory.html [1]: https://www.ehcache.org/documentation/2.8/get-started/storag... [2]: https://blog.dgraph.io/post/introducing-ristretto-high-perf-...



On the one hand, yes. On the other hand, all of this sounds much more complex and fragile. This seems like an important point to me:

"Remarkably, we had only put very basic thought into optimization as the Rust version was written. Even with just basic optimization, Rust was able to outperform the hyper hand-tuned Go version."


I found similarly when I ported an image resizing algorithm from Swift to Rust: I'm experienced in Swift and so wrote it in an idiomatic way, while I have little Rust experience and so wrote it in a naive way; yet still the Rust version was twice(!) as fast. And Swift doesn't even have a GC slowing things down!


> Swift doesn't have a GC slowing things down

This Apple marketing meme needs to die. Reference counting arguably incurs more cost than GC, or at least the cost is spread throughout processing.


It incurs some cost, but whether it is higher is very debatable. This is very much workload dependent. A smart compiler can elide most reference updates.


No it can't, not in Swift's current implementation.

https://github.com/ixy-languages/ixy-languages


It would seem that the Swift compiler is far from smart[1].

[1]: https://media.ccc.de/v/35c3-9670-safe_and_secure_drivers_in_...


Apple's ARC is not a GC in the classic sense. It doesn't stop the world and mark/sweep all of active memory. It's got "retain" and "release" calls automatically inserted and elided by the compiler to track reference counts at runtime, and when they hit zero, invoke a destructor. That's not even close to what most people think of when they think "gc". Of course it's not free, but it's deterministic.
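As a toy illustration of those mechanics (written in Java purely for demonstration; the class and method names are invented, and in Swift the compiler inserts these calls for you): a per-object counter, and a destructor that runs immediately and deterministically when the count reaches zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of ARC mechanics: retain()/release() calls are what the Swift
// compiler inserts (and elides) automatically. When the count hits zero the
// destructor runs right at the release site -- no separate collector pass.
class RefCounted {
    private final AtomicInteger count = new AtomicInteger(1); // creator holds one reference
    private final Runnable destructor;

    RefCounted(Runnable destructor) { this.destructor = destructor; }

    void retain() { count.incrementAndGet(); }

    void release() {
        if (count.decrementAndGet() == 0) {
            destructor.run(); // deterministic cleanup point
        }
    }
}

public class ArcSketch {
    public static void main(String[] args) {
        RefCounted obj = new RefCounted(() -> System.out.println("destroyed"));
        obj.retain();  // a second owner appears
        obj.release(); // second owner goes away: count back to 1, nothing printed
        obj.release(); // last owner gone: "destroyed" prints right here
    }
}
```

The atomic increment/decrement on every ownership change is also where the cost discussed above comes from.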


I think you are being borderline pedantic here. ARC _is_ GC in the classic sense: https://en.wikipedia.org/wiki/Garbage_collection_(computer_s...

I agree with you that most people tend to associate GC with something more advanced nowadays, like mark and sweep as you said in another comment, but it seems pointless to argue that ARC is not a form of GC.


You're using a hearsay, folklore definition of GC, not a CS one.

Refcounting in the presence of threads is usually non-deterministic too.


It is non-deterministic, but at a much different scale.


Reference counting is a (low-throughput, low-latency) form of garbage collection.


The low-latency part might not even be true. RC means that you don't have CPU consuming heap scans, but if you free the last reference to a large tree of objects, freeing them can take quite a lot of time, causing high latencies.


Yes and no. From a theoretical perspective, I suppose that's true, but "garbage collection" tends to mean a non-deterministic collector that does its own thing, and you don't have to think at all about memory. That does not apply to Swift, as of course, you need to understand the difference between strong and weak references. It's unfairly simplistic to couple the two.


No, RC is GC.

Most people think of Python as GCed language, and it uses mostly RC.

Any runtime that uses mark & sweep today may elect to use RC for some subset of the heap at some point in a future design, if that makes more sense. The mix of marking GC vs refcounting GC shouldn't affect the semantics of the program.


But it actually is just garbage collection.


It is but in the context of this discussion it's very clear that they meant a tracing garbage collector, which has a very different cost than atomic reference counting. Or to put it another way: you're technically correct, the worst kind of correct.


Swift's form of GC is one of the slowest ones; no wonder porting to Rust made it faster, especially given that most tracing GCs outperform Swift's current implementation.

If one goes with reference counting as the GC implementation, then one should take the effort to use hazard pointers and related optimizations.


ARC, used by Swift, has its own cost.


True, but it's generally better than most full GC solutions (for processes running for relatively short times without the benefit of profile-guided optimization), and worse than languages with fully statically analyzable memory usage.

Note: that parenthetical is a very big caveat, because properly profile-optimized JVM executables can often achieve exceptional performance/development cost tradeoffs.

In addition however, ARC admits a substantial amount of memory-usage optimization given bytecode, which is now what developers provide to Apple on iOS. Not to mention potential optimizations by allowing Apple to serve last-minute compiled microarchitecture optimized binaries for each device (family).

To satiate the pedants... ARC is more or less GC where calls into the GC mechanism are compiled in statically, and where there are at worst deterministic bounds on potential "stop the world" conditions.

While this may not be presently optimal, because profile-guided approaches can deliver better performance by tuning allocation pool and collection time parameters, it's arguably a more consistent and statically analyzable approach that, with improvement in compilers, may yield better overall performance. It also provides tight bounds on "stop the world" situations, which also arise far less frequently on mobile platforms than in long-running server applications.

Beyond those theoretical bounds, it's certainly much easier to handle when you have an OS that is loading and unloading applications according to some policy. This is extremely relevant as most sensible apps are not actually long running.


> but it's generally better than most full GC solutions

I doubt that. It implies huge costs without giving any benefits of GC.

A typical GC has compaction, nearly stack-like fast allocation [1], and the ability to allocate a bunch of objects at once (just bump the heap pointer once for a whole batch).

And both Perl and Swift do indeed perform abysmally, usually worse than both GC'd and manually managed languages [2].

> ARC is more of less GC

Well, no. A typical contemporary GC is generational, often concurrent, allowing fast allocation. ARC is just a primitive allocator with ref/deref attached.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.49....

[2] https://github.com/ixy-languages/ixy-languages


It is nowhere near stack-like. Stack is hot in cache. Heap memory in tlab is cold. Bringing the lines into cache is the major cost, not bumping the pointer.


> Stack is hot in cache. Heap memory in tlab is cold.

What? This doesn't make any sense. From the cache's POV the stack and the bump-allocated heap are the same thing. Both are contiguous chunks of memory where the next value is allocated right after the previous one.

The only difference between the stack and the bump-allocated heap is that the former has hardware support for pointer bumping and the latter has not. That's all.


You're missing the fact that the tlab pointer is only ever moved forward, so it always points to recently unused memory. Until the reset happens and it points back to the same memory again, the application managed to allocate several megabytes or sometimes hundreds of megabytes, and most of that new-gen memory does not fit even in L3 cache.

The stack pointer moves both directions and the total range of that back-and-forth movement is typically in kilobytes, so it may fit fully in L1.

Just check with perf what happens when you iterate over an array of 100 MB several times and compare that to iterating over 10 kB several times. Both are contiguous but the performance difference is pretty dramatic.
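That experiment is easy to reproduce even without perf; a rough sketch (array sizes are illustrative, and the absolute numbers will vary by machine, so treat the ratio as the interesting part):

```java
// Rough sketch of the working-set experiment: the same contiguous scan over
// a small array that stays cache-resident vs. a large one that spills past
// L3. Same access pattern, very different per-element cost.
public class WorkingSet {
    static long sum(long[] a) {
        long s = 0;
        for (long v : a) s += v;
        return s;
    }

    // Average nanoseconds per element over `passes` full scans.
    static double nsPerElement(long[] a, int passes) {
        long sink = sum(a); // warm-up pass
        long start = System.nanoTime();
        for (int i = 0; i < passes; i++) sink += sum(a);
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print(""); // keep `sink` observably live
        return (double) elapsed / ((double) passes * a.length);
    }

    public static void main(String[] args) {
        long[] small = new long[10_000];    // ~80 KB: fits in L1/L2
        long[] large = new long[8_000_000]; // ~64 MB: exceeds typical L3
        System.out.printf("small: %.3f ns/elem%n", nsPerElement(small, 2000));
        System.out.printf("large: %.3f ns/elem%n", nsPerElement(large, 3));
    }
}
```

On typical hardware the large array costs several times more per element, even though both scans are perfectly sequential.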

Besides that, there is also an effect that the faster you allocate, the faster you run out of new gen space, and the faster you trigger minor collections. These are not free. The faster you do minor collections, the more likely it is for the objects to survive. And the cost is proportional to survival rate. That's why many Java apps tend to use pretty big new generation size, hoping that before collection happens, most of young objects die.

This is not just theory - I saw this just too many times, when reducing the allocation rate to nearly zero caused significant speedups - by an order of magnitude or more. Reducing memory traffic is also essential to get good multicore scaling. It doesn't matter that each core has a separate tlab when their total allocation rate is so high that they are saturating the LLC - main memory link. It is easy to miss this problem with classic method profiling, because a program with such a problem will manifest as just everything being magically slow, with no obvious bottleneck.


> You're missing the fact that the tlab pointer is only ever moved forward, so it always points to recently unused memory. Until the reset happens and it points back to the same memory again, the application managed to allocate several megabytes or sometimes hundreds of megabytes, and most of that new-gen memory does not fit even in L3 cache.

Yes, you are right about stack locality. It indeed moves back and forth, keeping the effectively used memory region quite small.

> These are not free. The faster you do minor collections, the more likely it is for the objects to survive. And the cost is proportional to survival rate.

Yes, that's true. Immutable languages are doing way better here, having small minor heaps (OCaml has 2MB on amd64) and very small survival rates (with many objects being allocated directly on the older heap if they are known in advance to be long-lasting).

Now I understand your point better and I agree.


A C app will tend to outperform a Java or Golang app by 3x, so it isn't too surprising.


Could you please provide a source for this?

Java is very fast and 3X slower is a pretty wild claim.


Those people have a really good claim to having the most optimized code in each language. They've found Java to be 2 to 3 times slower than C and Rust (with much slower outliers).

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

In the real world, you won't get things as optimized in higher-level languages, because optimized code looks completely unidiomatic. A 3x speedup over Java is a pretty normal claim.


Speaking of, I wish there were an "idiomatic code benchmarks game". Some of us want to compare language speed for common use cases vs trying to squeeze every last piece of performance from it.



D, Nim and Crystal all do very well on all metrics. Author finds Rust pretty close but not as maintainable. Interesting that the top 3 (performance close to C++, but more maintainable) all are niche languages that haven't really broken into the mainstream.


I really wish Intel or MS or someone would fund D so it could give Go and Rust a run for their money. It's as fast (or faster), expressive, powerful and, subjectively, easier to pick up and code in than Rust. It just needs some backers with muscle.


You probably have Swift in the same niche... and more elegant, not completely ignoring the last few decades of language research etc. If you want something more minimalistic there's Zig. D is just "C++ + hindsight", nothing special, only extra fragmentation of dev ecosystem by bringing in another language.

Ofc, Apple is not MS, so Swift developer experience and docs kind of suck (if you're not in the "iOS bubble" where it's nice and friendly), and multiplatform dev experience especially sucks even worse...

And "easier to pick up and code in" might not necessarily be an advantage for a systems language - better start slowly putting the puzzle together in your head, and start banging up code others need to see only after you've internalized the language and its advanced features and when (not) to use them! It helps everyone in the long run. This is one reason why I'd bet on Rust!


"completely ignoring the last few decades"

Well, for fairness, D is quite a bit older than Swift. (It's nearly as much older than Swift as it is younger than C++!) But what do you think pushes Swift out of the "C++ with hindsight" basket?


Maybe some big FAANG company. Start at the beginning of the acronym, I guess. I wonder if anyone could persuade anyone at Facebook to do a little bit of D professionally. I bet if even one serious Facebook engineer made a serious effort to use D, its adoption woes would be over.


Facebook is unlikely to move to D at this point and already has multiple serious projects in Rust (that they’re happy about, AFAIK).


Walter Bright, the author of D, worked at Facebook writing a fast preprocessor for C/C++ in D.


Alexandrescu, too.

They also talked about it pretty frequently.

Maybe ggp's claim that D just needs a heavy hitter backing it might be misplaced.


I don't think it's misplaced. Facebook wasn't really backing D by hiring them. And it doesn't seem like that specific project is active anymore.


Walter too? AFAIK it was just Andrei.


I thought Facebook used to “unofficially” support D? Or am I mixing languages up?

Right now, I think eBay is one of the big companies that uses D.

Edit: Sibling comment beat me to it.


That is what MS is doing with C#/F# and .NET Native/Xamarin AOT/CoreRT, with the experience taken from Midori.

So I doubt they would sponsor D.


C# is an attempt at making Java good, F# is an attempt at making a subset of Haskell popular. .NET Native/Xamarin/CoreRT are UI frameworks. There is nothing there that would compete with C++.

I don't think MS has any interest in improving C++ (look at their compiler). But that's not because of competing activities.


Except the lessons learned from Midori and .NET Native on UWP.

Visual C++ is the best commercial C++ compiler implementation; better not be caught using xlc, aCC, TI, IAR, icc and plenty of other commercial offerings.

If C++ has span, string_view, modules, co-routines, core guidelines, and the lifetime profile static analyser, it is in large part due to work started and heavily contributed by Microsoft at ISO, and their collaboration with Google.

As for competing with C++, it is quite clear, especially when comparing the actual software development landscape with the 90's, that C++ has lost the app development war.

Nowadays across all major consumer OSes it has been relegated to the drivers and low level OS services layer like visual compositor, graphics engine, GPGPU binding libraries.

Ironically, of those mainstream OSes, Microsoft is the only one that still cares to provide two UI frameworks directly callable from C++.

Which most Windows devs end up ignoring in favour of the .NET bindings, as Kenny Kerr mentions in one of his talks/blog posts.

Back to the D issue, Azure IoT makes use of C# and Rust, and there is Verona at MSR as well, so as much I would like to see them spend some effort on D, I don't see it happening.


CoreRT is a UI framework, what?


I strongly believe that people should consider the ecosystem as part of the maintainability risk.

Rust has by far the better ecosystem compared to those 3.


That's a highly subjective claim to make without evidence. All three languages are actively maintained and are growing.

I'll refute it simply: Rust's web development story isn't out-of-the-box clean like Crystal Lang's, which ships with an HTML language out of the box. So it could be categorized as a poor choice in comparison to Crystal.


> ships with an HTML language

Did you mean HTTP server? If so, there are at least 3 good ones in Rust that are only a `cargo add` away. If you've already taken the trouble to set up a toolchain for a new language, surely a single line isn't asking too much.


Rust won't get a maintainability prize for short programs with simple algorithms.

Its Haskell-like features only help on large and complex programs.


They do help library writers, and therefore turn large and complex programs into shorter programs.


— Expressiveness "keep in mind that this is a subjective metric based on the author's experience!"

— Maintenance Complexity "keep in mind that this is a subjective metric based on the author's experience!"


>> I wish there were an "idiomatic code benchmarks game"

Help make one:

— contribute programs that you consider to be "idiomatic"

https://salsa.debian.org/benchmarksgame-team/benchmarksgame/...

— use those measurement and web scripts to publish your own "common use cases" "idiomatic code" benchmarks game

----

>> compare language speed for common use cases vs trying to squeeze every last piece of performance

— please be clear about why you wish to compare the speed of programs written as if speed did not matter

----

??? What do you think is not "idiomatic" about programs such as:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


There is a lot more than the language/compiler that influences the results, but at least these benchmarks are closer to the real world than solving math puzzles in micro-benchmarks.

https://www.techempower.com/benchmarks/


The problem would be to get a definition of idiomatic code.

Would it be fair to accept an optimization in one language and refuse it in another because it is not idiomatic?


VMs with JIT like the JVM are only ever really fast/competitive with C in small numerical micro-benchmarks where the code can be hyper-optimized.

Most code will be considerably slower due to a lot of factors.

Java in particular is a very pointer-heavy language, made up of pointers to pointers to pointers everywhere, which is really bad for our modern systems that often are much more memory latency than CPU constrained.

A factor of 2-4x to languages like C++ or Rust for most code seems plausible (and even low) unless the limiting factor is external, like network or file system IO.


This stuff is really hard to pin down though. I've been reading these sorts of debates forever.

It's true that pointer chasing really hurts in some sorts of program and benchmark. For sure. No argument. That's why Project Valhalla exists.

But it's also my view that modern C++ programming gets away with a lot of slow behaviours that people don't really investigate or talk about because they're smeared over the program and thus don't show up in profilers, whereas actually the JVM fixes them everywhere.

C++ programs tend to rely much more heavily on copying large structures around than pointer-heavy programs. This isn't always or even mostly because "value types are fast". It's usually because C++ doesn't have good memory management, so resource management and memory layout get conflated, e.g. std::vector<BigObject>. You can't measure this because the overheads are spread out over the entire program and inlined everywhere, so don't really show up in profiling. For the same reasons C++ programs rely heavily on over-specialised generics where the specialisation isn't actually a perf win but rather a side effect of the desire for automatic resource management, which leads to notorious problems with code bloat and (especially) compile time bloat.

Another source of normally obscured C++ performance issues is the heap. We know malloc is very slow because people so frequently roll their own allocators that the STL supports this behaviour out of the box. But malloc/new is also completely endemic all over C++ codebases. Custom allocators are rare and restricted to very hot paths in very well optimised programs. On the JVM allocation is always so fast it's nearly free, and if you're not actually saturating every core on the machine 100% of the time, allocation effectively is free because all the work is pushed to the spare cores doing GC.

Yet another source of problems is cases where the C++ programmer doesn't or can't actually ensure all data is laid out in memory together because the needed layouts are dynamically changing. In this case a moving GC like in the JVM can yield big cache hit rate wins because the GC will move objects that refer to each other together, even if they were allocated far apart in time. This effect is measurable in modern JVMs where the GC can be disabled:

https://shipilev.net/jvm/anatomy-quarks/11-moving-gc-localit...

And finally some styles of C++ program involve a lot of virtual methods that aren't always used, because e.g. there is a base class that has multiple implementations but in any given run of the program only one base class is used (unit tests vs prod, selected by command line flag etc). JVM can devirtualise these calls and make them free, but C++ compilers usually don't.
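The devirtualization case looks like this (hypothetical names): several implementations are compiled in, but only one is ever instantiated in a given run, so the JIT's receiver-type profiling can inline the call; a C++ compiler usually has to keep the indirect call unless it can prove the concrete type statically.

```java
// Hypothetical example of the "one implementation per run" pattern. In any
// given process only one Codec is ever instantiated (prod vs. test), so the
// JIT can devirtualize and inline encode() at the hot call site.
interface Codec {
    int encode(int x);
}

class ProdCodec implements Codec {
    public int encode(int x) { return x * 31 + 7; }
}

class TestCodec implements Codec {
    public int encode(int x) { return x; } // identity, for unit tests
}

public class Devirt {
    static int run(Codec c, int n) {
        int acc = 0;
        // A virtual call that is monomorphic in practice: HotSpot profiles
        // the receiver type and can inline the single observed encode().
        for (int i = 0; i < n; i++) acc += c.encode(i);
        return acc;
    }

    public static void main(String[] args) {
        boolean test = args.length > 0 && args[0].equals("--test");
        Codec c = test ? new TestCodec() : new ProdCodec(); // chosen once per run
        System.out.println(run(c, 1000));
    }
}
```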

On the other hand all these things can be obscured by the fact that C++ these days tends only to be used in codebases where performance is considered important, so C++ devs write performance tuned code by default (or what they think is tuned at least). Whereas higher level languages get used for every kind of program, including the common kind where performance isn't that big of a deal.


The costs of languages like C++ also get worse the older a program is.

Without global knowledge of memory lifetimes, maintainers make local decisions to copy rather than share.


> We know malloc is very slow because people so frequently roll their own allocators that the STL supports this behaviour out of the box. But malloc/new is also completely endemic all over C++ codebases. Custom allocators are rare and restricted to very hot paths in very well optimised programs. On the JVM allocation is always so fast it's nearly free, and if you're not actually saturating every core on the machine 100% of the time, allocation effectively is free because all the work is pushed to the spare cores doing GC.

Allocation in a C++ program is going to be about the same speed as in a Java program. Modern mallocs are doing basically the same thing on the hot-path: bumping the index on a local slab allocator.
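The fast path both sides describe really is tiny; here is a toy bump/arena allocator over a ByteBuffer (a sketch of the idea, not a real TLAB or malloc slab):

```java
import java.nio.ByteBuffer;

// Toy bump allocator: the fast path of both a JVM TLAB and a modern
// malloc's thread-local slab is essentially this -- bump an index, and
// fall back to a slow path only when the slab is exhausted.
public class BumpArena {
    private final ByteBuffer slab;
    private int top = 0; // bump pointer

    BumpArena(int size) { slab = ByteBuffer.allocateDirect(size); }

    /** Returns the offset of `size` fresh bytes, or -1 when the slab is full. */
    int allocate(int size) {
        int aligned = (size + 7) & ~7; // 8-byte alignment, like real allocators
        if (top + aligned > slab.capacity()) return -1; // slow path would refill here
        int offset = top;
        top += aligned; // the entire fast path: one compare, one add
        return offset;
    }

    void reset() { top = 0; } // bulk free, like a TLAB being retired

    public static void main(String[] args) {
        BumpArena arena = new BumpArena(1024);
        System.out.println(arena.allocate(5));  // 0
        System.out.println(arena.allocate(16)); // 8 (first block rounded up to 8)
        arena.reset();
        System.out.println(arena.allocate(1));  // 0 again
    }
}
```

What differs between the two worlds is mostly what happens off this fast path: the GC reclaims the whole region in bulk, while malloc must track and free individual blocks.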


3x might be a bit too much today, but it's definitely slower than C. Also to be considered is the VM overhead, not just the executed code.

Here are some benchmarks; I'll leave to the experts out there to confirm or dismiss them.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


> 3x might be a bit too much today, but it's definitely slower than C.

If anything the gap is increasing not shrinking. JVM is terrible at memory access patterns due to the design of the language, and designing for memory is increasingly critical for maximum performance on modern systems. All the clever JIT'ing in the world can't save you from the constant pointer chasing, poor cache locality, and poor prefetching.

The gap won't shrink until Java has value types. Which is on the roadmap, yes, but still doesn't exist just yet.

The problem with those benchmarks is if you look at the Java code you'll see it's highly non-idiomatic. Almost no classes or allocations. They almost all exclusively use primitives and raw arrays. Even then it still doesn't match the performance on average of the C (or similar) versions, but if you add the real-world structure you'd find in any substantial project that performance drops off.
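Concretely, the gap between idiomatic and benchmark-style Java looks something like this (hypothetical example): the class-based version chases a pointer per element, while the benchmark style flattens everything into primitive arrays, which is roughly the layout value types would give idiomatic code for free.

```java
// Two layouts for the same data. `Point[]` is an array of references to
// heap objects (a dereference per element); the structure-of-arrays
// version is two contiguous primitive arrays with no indirection -- the
// style benchmark entries use today.
public class Layouts {
    static class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static double sumPointers(Point[] pts) {
        double s = 0;
        for (Point p : pts) s += p.x + p.y; // pointer chase per element
        return s;
    }

    static double sumFlat(double[] xs, double[] ys) {
        double s = 0;
        for (int i = 0; i < xs.length; i++) s += xs[i] + ys[i]; // linear scans
        return s;
    }

    public static void main(String[] args) {
        int n = 4;
        Point[] pts = new Point[n];
        double[] xs = new double[n], ys = new double[n];
        for (int i = 0; i < n; i++) {
            pts[i] = new Point(i, 2 * i);
            xs[i] = i;
            ys[i] = 2 * i;
        }
        System.out.println(sumPointers(pts) == sumFlat(xs, ys)); // true
    }
}
```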


> Almost no classes or allocations.

Plenty of `inner static classes`. Where's the up-to-date comparison showing that `static` makes a performance difference for those tiny programs?

Plenty of TreeNode allocations.

----

> The problem with …

The problem with saying "in any substantial project that performance drops off" is when we don't show any evidence.


When you say "value types" in this context, you mean "copy-only types"?


I mean "value types" https://www.jesperdj.com/2015/10/04/project-valhalla-value-t... & http://cr.openjdk.java.net/~jrose/values/values-0.html

I don't know what you mean by "copy-only types." I'm not finding any reference to that terminology in the context of language design.


Ah, thanks for the link; I wasn't sure what it meant in the context of Java, since it's possible to get value semantics using a class.

Sorry about the confusion. I meant for the quotes around "copy-only" to indicate that it wasn't really a standard term, but I marked "value types" the same way, so that didn't really work. By "copy-only" I meant something you couldn't have more than one reference to: every name (variable) to which you assign the data would have its own independent copy.


> By "copy-only" I meant something you couldn't have more than one reference to: every name (variable) to which you assign the data would have its own independent copy.

That's not really a requirement of value types, no. C# has value types (structs) and you can have references to them as well (ref & out params).

In general though yes it would be typically copied around, same as an 'int' is. Particularly if Java doesn't add something equivalent to ref/out params.


I know, that's why I asked for clarification originally. :) Anyways, I appreciate the details.


However the memory usage difference is astonishing for some of those benchmarks - using 1000x more memory is only acceptable for some situations.


This may be overly cynical, but don’t you have it backwards? 1000x greater memory usage is acceptable for most applications.

There are only a few applications for which it isn’t acceptable.

Just look at all the massive apps built on Electron. People wouldn’t do that if it wasn’t effective.


Wouldn't you say there's a difference between "effective" and "getting away with it"? If non-technical users see that their daily computing lives are made more complicated (because of lowered performance) by having n Electron apps running at the same time, they may not understand the reasons, but they will certainly choose a different solution that has the features they need, where available.


Wouldn't you say there's a difference between "effective" and "getting away with it"?

I used to think that but now I’m not so sure. :(


— default JVM allocation ~35 MB looks astonishing for tiny programs that don't allocate memory

— memory usage is less different for tiny programs that do allocate memory: reverse-complement, k-nucleotide, binary-trees, mandelbrot, regex-redux


Agreed, and ironically the most widely used Java platform (Android), despite its VM optimizations, is the one which would benefit the most from running only native code.

I mean, those 1 GB RAM, 7-year-old, slow-as-molasses phones gathering dust in drawers or being littered into landfills would scream if they didn't have to run everything behind a VM.


Make no mistake — Android isn't memory hungry because of Java alone. Android 4 used to comfortably fit in 1 GB of RAM, but Android 10 can no longer run on such devices. Both old and new versions use the JVM, but newer Android has a lot of "helpful" system services, such as the "Usage Statistics" service [1] and the "Autofill Service" [2].

Google's spyware is becoming more invasive and thus more memory-hungry.

1: https://developer.android.com/reference/android/app/usage/Us...

2: https://developer.android.com/reference/android/service/auto...


Really depends on the domain. There are some things that are a lot easier to scale up in a language with a great concurrent GC, because that makes writing some lock-free data structures fundamentally easier (no complicated memory reclamation logic, trivial to avoid ABA problems).
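The classic example is a Treiber stack: under GC it is a few lines, because a popped node can't be freed and recycled while a racing reader still holds a reference (the GC keeps it alive), which sidesteps ABA; a C++ or Rust version needs hazard pointers or epoch-based reclamation on top of the same logic.

```java
import java.util.concurrent.atomic.AtomicReference;

// Treiber stack: lock-free in a handful of lines because the GC handles
// reclamation. A popped node cannot be reused for a new allocation while
// another thread still references it, so the classic ABA hazard (head
// recycled between the read and the CAS) cannot bite here.
public class TreiberStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    public void push(T value) {
        Node<T> node;
        do {
            node = new Node<>(value, head.get());
        } while (!head.compareAndSet(node.next, node)); // retry on contention
    }

    public T pop() {
        Node<T> top;
        do {
            top = head.get();
            if (top == null) return null; // empty
        } while (!head.compareAndSet(top, top.next));
        return top.value;
    }

    public static void main(String[] args) {
        TreiberStack<Integer> s = new TreiberStack<>();
        s.push(1);
        s.push(2);
        System.out.println(s.pop()); // 2
        System.out.println(s.pop()); // 1
    }
}
```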


GC makes it easier to write, but not necessarily better. Modern state-of-the-art Java GCs operate on a global heap, so you often pay for what you do not use. In languages like Rust or C++ you can build small, locally GCed heaps, limiting GC overhead to just the few particular structures that need GC, not everything. Therefore other structures like caches don't affect GC pause times.

And the "hardness" of writing lockless structures is strongly offset by libraries, so unless you're doing something very exotic, it is rarely a real problem.


“GC makes it easier to write.” I disagree. GC means I don’t get deterministic destruction which means I can’t use RAII properly.


The post was about writing lockless structures. With lockless structures, RAII is of no use. See the ABA problem. RAII is awesome in other places.


It depends greatly on the problem domain. The difference might be near zero, or you might be able to get ~16x better performance (using, say, AVX-512 intrinsics). Then again, are intrinsics really C? Not really, but you can do it. And if you have to abandon classes in order to get the memory layout you want in Java, are you still using Java?


This statement is definitely wrong in this generic, blanket fashion. Also, I would lay upon you the burden of proof for it :).

Tight, low-level code in Java and Go is roughly as fast as average C code. The Go compiler is known to be less good at optimizing code than e.g. GCC, but in many cases this creates little practical difference, while the Java JIT compilers have become excellent, to the point where they often beat GCC, especially as they can use run-time profiling for code optimization. So they can optimize the code for the actual task at hand.

Where the languages differ in "speed" is their runtime environment. Java and Go are languages with garbage collection, which of course means that some amount of CPU is required to perform GC. But as modern garbage collectors run in parallel with the program, this CPU effort is often not a bottleneck. On the other hand, manual memory management has different performance trade-offs, which in many cases can make it quite slow on its own.


This really depends on what job a program does and how well the C program is implemented.


I call utter bullshit, especially when dealing with threads. I think you'll spend so much time debugging pointers, the stack, and your memory allocations that switching to a more modern language could save you significant debugging time.

But now I sound like a Geico (insurance) commercial. Sorry about that.


> The JVM world tends to solve this problem by using off-heap caches. See Apache Ignite [0] or Ehcache [1].

For those who care: I was interested in how off-heap caching works in Java, so I did some quick searching around the Apache Ignite code.

The meat is here:

- GridUnsafeMemory, an implementation of access to entries allocated off-heap. This appears to implement some common Ignite interface, and invokes calls to a “GridUnsafe” class https://github.com/apache/ignite/blob/53e47e9191d717b3eec495...

- This class is the closest to the JVM’s native memory, and wraps sun.misc.Unsafe: https://github.com/apache/ignite/blob/53e47e9191d717b3eec495...

- And this, sun.misc.Unsafe, is what it’s all about: http://www.docjar.com/docs/api/sun/misc/Unsafe.html

It’s very interesting because I did my fair share of JNI work, and context switches between JVM and native code are typically fairly expensive. My guess is that this class was likely one of the reasons why Sun ended up implementing their (undocumented) JavaCritical* etc. functions and the like.
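For the curious, the core of what Unsafe enables looks roughly like this (a minimal sketch, not Ignite's code; the reflective `theUnsafe` grab works on current HotSpot JDKs, with the usual illegal-access warnings, and these methods are slated for removal in favor of the foreign memory API):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Minimal sketch of raw off-heap access via sun.misc.Unsafe: allocate
// native memory, read/write it directly, free it. No JNI transition --
// the JIT intrinsifies getLong/putLong into plain memory instructions.
public class UnsafeSketch {
    static Unsafe unsafe() throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true); // Unsafe guards its constructor; grab the singleton
        return (Unsafe) f.get(null);
    }

    public static void main(String[] args) throws Exception {
        Unsafe u = unsafe();
        long addr = u.allocateMemory(16); // raw native memory, invisible to the GC
        try {
            u.putLong(addr, 0xCAFEBABEL);
            u.putLong(addr + 8, 42L);
            System.out.println(u.getLong(addr + 8)); // 42
        } finally {
            u.freeMemory(addr); // manual lifetime: forget this and you leak
        }
    }
}
```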


Unsafe lets you manipulate memory without any JNI overhead other than when allocating or de-allocating memory, and that is usually done in larger chunks and pooled to avoid the overhead at steady state. Netty also takes advantage of Unsafe to move a lot of memory operations off the java heap.

Unsafe was one of the cooler aspects of Java, and one that Oracle is actively killing off for, well, no good reason at least.


They aren't killing it. They're steadily designing safe and API stable replacements for its features, with equal performance. That is a very impressive engineering feat!

For instance, fast access to non-GC'd off-heap memory is currently being added via the MemoryLayout class. Once that's here, apps that upgrade won't need Unsafe anymore. MemoryLayout gives equivalent performance but with bounds-checked accesses, so you can't accidentally corrupt the heap and crash the JVM.

They've been at it for a long time now. For instance VarHandle exposes various low level tools like different kinds of memory barriers that are needed to implement low level concurrency constructs. They're working on replacements for some of the anonymous class stuff too.


> Unsafe was one of the cooler aspects to Java that Oracle is actively killing for, well, no good reason at least.

I mean, there's the obvious reason that it breaks the memory safety aspect that Java in general guarantees. The whole point of the feature is to subvert the language & expectations.

I'm not saying they should remove it, but it's pretty hard to argue there's "no good reason" to kill it, either. It is, after all, taking the worst parts of C and ramming them into a language that is otherwise immune to that entire class of problems.


Unsafe is being replaced by less error prone APIs, not killed.

Project Panama is what is driving that effort.


C# seems to have a neat middle ground for this kind of stuff with their Span<T> api.


True, but we had our own version of unsafe for a much longer time. MS was just pragmatic enough to allow it across the ecosystem.

I'm guessing at least some of that was a side effect of wanting to support C++; not having pointers as an option would have killed C++/CLI from the get go.


> context switches between JVM and native code are typically fairly expensive

Aren't these Unsafe memory read and write methods intrinsified by any serious compiler? I don't believe they're using JNI or doing any kind of managed/native transition, except in the interpreter. They turn into the same memory read and write operations in the compiler's intermediate representation as Java field read and writes do.


They are optimized, yes, but from what I recall from reading the JVM code a few years ago, some optimizations don't get applied to those reads/writes. For example, summing two arrays together will be vectorized to use SSE instructions while doing so through Unsafe won't [0].

[0] https://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorizati...


The idea is that that call is still less expensive than going over the wire and MUCH less expensive than having the GC go through that heap now and then.


Yes, sorry, I should have elaborated: those Critical JNI calls avoid locking the GC and are in general much more lightweight. This is available to normal JNI devs as well; it's just not documented. They were primarily intended for some internal things that Sun needed.

I’m now guessing that those Unsafe classes might actually have been an intended use case. It makes total sense, and I can see how that would be very fast.


> The JVM world tends to solve this problem by using off-heap caches. See Apache Ignite [0] or Ehcache [1].

Yeah, but I really don't buy your argument.

When you're reduced to doing manual memory management and fighting your language's GC, maybe you shouldn't be using a language with a GC in the first place.

They are right to use Rust (or C/C++) for that. It's not for nothing that Redis (C) is so successful in the LRU domain.

> It's worth mentioning that Apache Cassandra itself uses an off-heap cache.

And still ScyllaDB (C++) is able to completely destroy Cassandra in terms of average latency [0].

[0]: https://www.scylladb.com/product/benchmarks/


> I can't speak for how their Rust cache manages memory, but the thing to be careful of in non-GC runtimes (especially non-copying GC) is memory fragmentation.

As far as I know, a mark-and-sweep collector like Go's doesn't have any advantage over malloc/free when it comes to memory fragmentation. Am I missing some way in which Go's GC helps with fragmentation?


Go's GC uses a memory allocator originally based on TCMalloc (though it has diverged from it quite a bit). It maintains free lists for a number of fixed size classes, which helps reduce fragmentation; that's what lets Go's GC get away with being non-copying.
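As a toy illustration of the size-class idea (this is a sketch, not Go's actual allocator, which is far more sophisticated): every request is rounded up to a fixed class, and freed slots are recycled within their class, so mixed-size churn doesn't punch arbitrary-sized holes in memory.

```rust
// Toy size-class allocator sketch. Requests round up to a fixed class;
// freed slots are recycled per class, which limits fragmentation.
const CLASSES: [usize; 4] = [16, 64, 256, 1024];

struct SizeClassPool {
    free: Vec<Vec<usize>>, // freed slot offsets, one free list per class
    next: Vec<usize>,      // bump offset per class
}

impl SizeClassPool {
    fn new() -> Self {
        SizeClassPool {
            free: vec![Vec::new(); CLASSES.len()],
            next: vec![0; CLASSES.len()],
        }
    }

    // Smallest class that fits the requested size.
    fn class_for(size: usize) -> Option<usize> {
        CLASSES.iter().position(|&c| size <= c)
    }

    // Returns (class index, byte offset) as a stand-in for a pointer.
    fn alloc(&mut self, size: usize) -> (usize, usize) {
        let c = Self::class_for(size).expect("size too large for this pool");
        if let Some(off) = self.free[c].pop() {
            (c, off) // reuse a freed slot of the same class
        } else {
            let off = self.next[c];
            self.next[c] += CLASSES[c];
            (c, off)
        }
    }

    fn dealloc(&mut self, (c, off): (usize, usize)) {
        self.free[c].push(off);
    }
}

fn main() {
    let mut pool = SizeClassPool::new();
    let a = pool.alloc(40); // rounds up to the 64-byte class
    pool.dealloc(a);
    let b = pool.alloc(50); // same class, so the freed slot is reused
    assert_eq!(a, b);
}
```

Because a freed slot can only be refilled by an allocation of the same class, the "gaping holes" problem of a general-purpose first-fit allocator mostly goes away, at the cost of some internal fragmentation from the rounding.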


I’m not sure I follow. GC implementations that don’t copy (relocate) are inherently subject to the performance cost of “fragmentation” (in the sense of scattering memory accesses over non-adjacent regions). This is a very high price to pay when you’re dealing with modern hardware.


The allocator underneath keeps track of freed memory, so the next allocation has a high chance of being squeezed into a memory region that has been used before. It's obviously not as good as, say, a GC that relocates after the sweep, but at least it doesn't leave gaping holes.


Indeed, but it also doesn’t maintain locality of access nearly as well for young objects (the most frequently manipulated ones), or even for older ones that survive.


One related point: the article mentions using Rust's BTreeMap, which manages its heap allocations with cache efficiency in mind: https://doc.rust-lang.org/std/collections/struct.BTreeMap.ht....

The guts of BTreeMap's memory management code are here: https://github.com/rust-lang/rust/blob/master/src/liballoc/c.... (warning: it is some of the most gnarly Rust code I've ever come across: very dense, complex, and heavy on raw pointers. This is not a criticism at all, just an observation about readability). Anecdotally, I've had very good results using BTreeMap in my own projects.

In terms of how the "global" allocator impacts performance, I'd expect it to play a bigger role for Strings (it is, after all, a chat program), and possibly in how the async code "desugars" into storing futures and callbacks on the heap (just guessing; I'm not an expert on the Rust async internals).
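For readers who haven't used it, a quick sketch of why BTreeMap is a natural fit here (the keys and values are hypothetical): unlike HashMap, it keeps several entries per node, so related keys sit close together in memory, and ordered range scans fall out for free.

```rust
use std::collections::BTreeMap;

// BTreeMap packs multiple entries per node, so neighboring keys tend to
// share cache lines, and the keys are always kept in sorted order.
fn main() {
    let mut sessions: BTreeMap<u64, &str> = BTreeMap::new();
    sessions.insert(17, "alice");
    sessions.insert(3, "bob");
    sessions.insert(42, "carol");

    // Range queries over the ordered keys, something a HashMap can't do
    // without a full scan:
    let recent: Vec<_> = sessions.range(10..).map(|(_, v)| *v).collect();
    assert_eq!(recent, ["alice", "carol"]);
}
```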


I was a bit surprised that BTreeMap turned out to be more memory-efficient than HashMap; can anyone shed some light on this?

One of these days I’ll get around to turning my Rust set implementation[0] into a full-blown map (it’s already <50% the size of BTreeSet for ints)...

[0] https://github.com/senderista/rotated-array-set


In the current context, fragmentation refers more to the problem of consuming extra memory through fragmentation, which malloc implementations like the one Go (or Rust, or glibc) uses can often mitigate.


Sure, but the malloc implementation in Rust probably does something similar.

What I wanted to understand is what is the difference in fragmentation between a non-copying, non-compacting GC and a non-GC runtime.


Maybe I've missed this, but why do they need a particularly large LRU cache? Surely this isn't all one process, so presumably they could reduce spikes by splitting the same load across yet more processes?


Larger cache = faster performance and less load on the database.

I only skimmed the article, but the problem they had with Go seems to be the GC cost incurred by having a large cache. Their cache eviction algorithm was efficient, but every 2 minutes there was a GC run that slowed things down. Re-implementing the algorithm in Rust gave them better performance because the memory was freed right after cache eviction.

Splitting it across more processes will result in more cache misses and more DB calls.


I am of course talking about the same amount of total cache RAM, just split among more processes. Depending on distribution of the calls, you might get more cache misses, but I don't think it's guaranteed, and if it is, I don't think we can assume it's significant. Heck, you could even use more cache RAM; the cost of a total rewrite plus ongoing maintenance in a new language covers a fair bit of hardware these days.
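For what it's worth, the sharding idea being debated here can be sketched in a few lines (hypothetical names, not anyone's production setup): a given key always hashes to the same shard, so splitting the cache doesn't by itself cause extra misses; what it can cause is uneven memory use across shards.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Toy sketch: one logical cache split into N independent shards by key
// hash. Each shard could live in its own process with 1/N of the RAM.
struct ShardedCache {
    shards: Vec<HashMap<String, String>>,
}

impl ShardedCache {
    fn new(n: usize) -> Self {
        ShardedCache { shards: (0..n).map(|_| HashMap::new()).collect() }
    }

    // A key deterministically maps to one shard.
    fn shard_for(&self, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.shards.len()
    }

    fn put(&mut self, key: String, val: String) {
        let s = self.shard_for(&key);
        self.shards[s].insert(key, val);
    }

    fn get(&self, key: &str) -> Option<&String> {
        self.shards[self.shard_for(key)].get(key)
    }
}

fn main() {
    let mut cache = ShardedCache::new(4);
    cache.put("user:42".into(), "carol".into());
    // Lookups route to the same shard the key was written to.
    assert_eq!(cache.get("user:42").map(String::as_str), Some("carol"));
    assert_eq!(cache.get("user:7"), None);
}
```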


Great comment and thanks for the reading material.

Now I'm wondering if there's a Rust library for a generational copying arena--one that compacts strings/blobs over time.


Generational arenas, yes, but copying, I'm not aware of one. It's very hard to get the semantics right, since you can't automatically re-write pointers/indices.


Perhaps such a library could record the locations of the variables that contain pointers to the strings and keep those pointers up to date as ownership of the string moves from variable to variable?

In other words, doing some of the work a moving, compacting collector does during compaction, but continuously during normal program execution.


There's no way to hook into the move, so I don't see how it would be possible, or at least, not with techniques similar to compacting GCs.


Maybe by reifying the indirection? The compacting arena would hand out smart pointers which would either always bounce through something (to get from an identity to the actual memory location, at a cost) or it'd keep track of and patch the pointers it handed out somehow.

Possibly half and half. I don't remember what language it was (possibly Obj-C?) which would hand out pointers, and on needing to move the allocations it'd turn the existing site into a "redirection table". Accessing pointers would check whether they were being redirected, and update themselves to the new location if necessary.

edit: might have been the global refcount table? Not sure.


Yeah so I was vaguely wondering about some sort of double indirection; the structure keeps track of "this is a pointer I've handed out", those pointers point into that, which then points into the main structure.

I have no idea if this is actually a good idea; it seems like you lose a lot of the cache locality advantages.
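A minimal sketch of that double indirection (purely illustrative, not an existing crate): callers hold stable handles into a slot table, and the arena is free to move the underlying storage because every access bounces through the slots, which compaction patches up.

```rust
// Compacting arena with stable handles. A Handle is an index into the
// slot table; the arena may move the backing storage during compaction
// and only the slot table needs patching. Illustrative sketch only.
const FREED: usize = usize::MAX; // sentinel for slots whose value is gone

#[derive(Clone, Copy, PartialEq, Debug)]
struct Handle(usize);

struct Arena {
    data: Vec<Option<String>>, // backing storage; None = freed entry
    slots: Vec<usize>,         // handle index -> current index in `data`
}

impl Arena {
    fn new() -> Self {
        Arena { data: Vec::new(), slots: Vec::new() }
    }

    fn insert(&mut self, s: String) -> Handle {
        self.data.push(Some(s));
        self.slots.push(self.data.len() - 1);
        Handle(self.slots.len() - 1)
    }

    // Double indirection: handle -> slot -> data.
    fn get(&self, h: Handle) -> Option<&str> {
        let slot = self.slots[h.0];
        if slot == FREED {
            return None;
        }
        self.data[slot].as_deref()
    }

    fn remove(&mut self, h: Handle) {
        let slot = self.slots[h.0];
        if slot != FREED {
            self.data[slot] = None;
        }
    }

    // Squeeze out freed entries and patch the slot table, so existing
    // handles stay valid even though the data physically moved.
    fn compact(&mut self) {
        let mut new_data = Vec::new();
        for slot in self.slots.iter_mut() {
            if *slot == FREED {
                continue;
            }
            match self.data[*slot].take() {
                Some(s) => {
                    new_data.push(Some(s));
                    *slot = new_data.len() - 1;
                }
                None => *slot = FREED,
            }
        }
        self.data = new_data;
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.insert("alpha".to_string());
    let b = arena.insert("beta".to_string());
    arena.remove(a);
    arena.compact(); // "beta" moves down, but `b` is still valid
    assert_eq!(arena.get(b), Some("beta"));
    assert_eq!(arena.get(a), None);
}
```

The cost is exactly what's discussed above: every `get` is two dependent loads instead of one, which a relocating GC avoids by rewriting the pointers themselves.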


This sounds a lot like classic MacOS (pre-Darwin) memory allocation. You were allocated a handle, which you called HLock on to get a real pointer. After accessing the memory, you called HUnlock to release it. There was definitely a performance hit for that indirection.


It's the same on classic Windows (pre-Windows 95) memory allocation. GlobalAlloc with GMEM_MOVEABLE or LocalAlloc with LMEM_MOVEABLE returned a handle, which you called GlobalLock or LocalLock on to get a real pointer. After accessing the memory, you called GlobalUnlock or LocalUnlock to release it. Of course, this being Microsoft, you can still call these functions inherited from 16-bit Windows even on today's 64-bit Windows. (See Raymond Chen's "A history of GlobalLock" at https://devblogs.microsoft.com/oldnewthing/20041104-00/?p=37...).


I don't know that cache locality would be a big issue (the indirection table would be a smallish array); however, you'd eat the cost of doubling the indirections: each pointer access would be two of them.


On top of the cost of the extra pointer lookup, you also run into cache coherency issues when dealing with threading. So then you need to use atomic ops or locks or cache flushing which makes it even more expensive.

Rust is better suited to deal with it since there's a similar issue with refcounting across threads, so you might be able to get away with doing it for objects that are exclusive to one thread.


I would give out handles and have a guard object that lets you get smart pointers from handles as long as the guard object is in scope. When the guard object goes out of scope, the smart pointers would be invalidated.
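In Rust you can get most of this for free from lifetimes; here's a hypothetical sketch (names like `pin`/`Guard` are made up for illustration). Handles are plain `Copy` indices, and turning one into a reference requires a guard that borrows the arena, so the compiler itself invalidates the references when the guard goes away.

```rust
// Handles are plain Copy indices; resolving one to a reference requires
// a Guard that borrows the arena. While any Guard is alive, the borrow
// checker forbids &mut methods (insert, a hypothetical compact, ...),
// so handed-out references cannot dangle.
#[derive(Clone, Copy)]
struct Handle(usize);

struct Arena {
    items: Vec<String>,
}

struct Guard<'a> {
    arena: &'a Arena,
}

impl Arena {
    fn insert(&mut self, s: String) -> Handle {
        self.items.push(s);
        Handle(self.items.len() - 1)
    }

    // Borrow the arena for the lifetime of the returned guard.
    fn pin(&self) -> Guard<'_> {
        Guard { arena: self }
    }
}

impl<'a> Guard<'a> {
    fn get(&self, h: Handle) -> &'a str {
        &self.arena.items[h.0]
    }
}

fn main() {
    let mut arena = Arena { items: Vec::new() };
    let h = arena.insert("hello".to_string());
    {
        let guard = arena.pin();
        assert_eq!(guard.get(h), "hello");
    } // guard dropped: the arena may be mutated (or compacted) again
    arena.insert("world".to_string());
}
```

What lifetimes alone can't express is *forcing* eventual invalidation; a guard held forever blocks compaction rather than invalidating the pointers, which is where a runtime check (like the redirection table upthread) would have to come in.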


It would be possible (and not that hard) for Copy types, but you'd need some custom derive traits if you had any pointers to managed objects.


> From a purely architectural perspective, I would try to put cacheable material in something like memcache or redis

You cannot use a caching server at that scale with those latency requirements. It has to be embedded


Something like rocksdb (https://rocksdb.org/) then


You cannot use that either, due to I/O block device saturation, even on enterprise PCIe NVMe. Not enough parallel IOPS. Only RAM can be used.


> From a purely architectural perspective, I would try to put cacheable material in something like memcache or redis, or one of the many distributed caches out there. But it might not be an option.

Can you speak to why using something like memcache or redis may not be an option?


For latency-sensitive services, having to traverse the network to access a shared cache may be too slow. To use the current story as an example, you'd be trading off an occasional 100-millisecond latency spike every 2 minutes for an added 1-2ms of latency for every request.


Wow. We literally just published a piece on why not to put a cache in front of your database to mask its bad performance behind a layer of complexity. tl;dr: make sure you have a solid DB to begin with. (Forgive the gated asset, but it's a good read!)

https://go.scylladb.com/7-reasons-no-external-cache-database...


Don't know enough about Rust, but I think Go would benefit immensely from allowing its users to disable the GC and de-allocate memory by hand. GC is great for simpler applications, but in more complex projects you end up fighting so much with memory and the GC that all the benefits of automatic de-allocation are negated. Love every other aspect of Go.



