Seems like you were hitting: runtime: Large maps cause significant GC pauses #94...

jasondclinton · on Feb 4, 2020

Finding out if that does resolve the author's issue would be interesting but I'm not sure that that would be particularly supportive data in favor of Go. If anything it would reinforce the downsides of Go's GC implementation: prone sudden pitfalls only avoidable with obtuse, error-prone fiddling that makes the code more complex.

After spending weeks fighting with Java's GC tuning for a similar production service tail latency problem, I wouldn't want to be caught having to do that again.

masklinn · on Feb 4, 2020

The good news are that Go's GC has basically no tunables, so you wouldn't have spent weeks on that. The bad news is that it has basically no tunables so if it's a tuning issue you're either fucked or have to put "tuning" hacks right into the code if you find any that works (e.g. twitch's "memory ballast" to avoid overly aggressive GC runs: https://blog.twitch.tv/en/2019/04/10/go-memory-ballast-how-i...)

PeterCorless · on Feb 4, 2020

There are tradeoffs with all languages. C++ avoids the GC, but you then have to make sure you know how to avoid the common pitfalls of that language.

We use C++ at Scylla (saw that we got a shout-out in the blog! Woot!) but it's not like there isn't a whole industry about writing blogs avoiding C++ pitfalls.

C++ pitfalls through the years... • https://www.horstmann.com/cpp/pitfalls.html (1997) • https://stackoverflow.com/questions/30373/what-c-pitfalls-sh... (2008) • http://blog.davidecoppola.com/2013/09/cpp-pitfalls/ (2013) • https://www.typemock.com/pitfalls-c/ (2018)

I am not saying any of these (Go, Rust, C++, or even Java) are "right" or "wrong" per se, because that determination is situational. Are you trying to optimize for performance, for code safety, for taking advantage of specific OS hooks, or oppositely, to be generically deployable across OSes, or for ease of development? For the devs at Scylla, the core DB code is C++. Some of our drivers and utilities are Golang (like our shard aware driver). There's also a Cassandra Rust driver — it'd be sweet if someone wants to make it shard-aware for Scylla!

zorkian · on Feb 4, 2020

(Discord infra person here.)

Actually we didn't update the reference to Cassandra in the article -- the read states workload is now on Scylla too, as of last week. ;)

We'll be writing up a blog post on our migration with Scylla at some point in the next few months, but we've been super happy with it. I replaced our TokuMX cluster with it and it's faster, more reliable, _and_ cheaper (including the support contract). Pretty great for us.

PeterCorless · on Feb 4, 2020

Woot! Go you! (Or Rust you! Whichever you prefer!)

mathw · on Feb 5, 2020

What a glorious combination of things! What a shame faster, more reliable and cheaper don't usually go together, but that's the challenge all developers face...

hinkley · on Feb 4, 2020

The common factor in most of my decisions to look for a new job has been realizing that I feel like a very highly compensated janitor instead of a developer.

Once I spend even the plurality of my time cleaning up messes instead of doing something new (and there are ways to do both), then all the life is sucked out of me and I just have to escape.

Telling me that I have to keep using a tool with known issues that we have to process or patches to fix would be super frustrating. And the more times we stumble over that problem the worse my confirmation bias will be.

Even if the new solution has a bunch of other problems, the set that is making someone unhappy is the one that will cause them to switch teams or quit. This is one area where management is in a tough spot with respect to rewrites.

Rewrites don't often fix many things, but if you suspect they're the only thing between you and massive employee turnover, you're between a rock and a hard place. The product is going to change dramatically, regardless of what decision you make.

outworlder · on Feb 4, 2020

While I completely agree with the "janitor" sentiment... and for Newton's sake I feel like Wall-E daily...

> Telling me that I have to keep using a tool with known issues that we have to process or patches to fix would be super frustrating.

All tools have known issues. It's just that some have way more issues than others. And some may hurt more than others.

Go has reached an interesting compromise. It has some elegant constructs and interesting design choices (like static compilation which also happens to be fast). The language is simple, so much so that you can learn the basics and start writing useful stuff in a weekend. But it is even more limiting than Java. A Lisp, this thing is not. You can't get very creative – which is an outstanding property for 'enterprises'. Boring, verbose code that makes you want to pull your teeth out is the name of the game.

And I'm saying this as someone who dragged a team kicking and screaming from Python to Go. That's on them – no-one has written a single line of unit tests in years, so now they at least get a whiny compiler which will do basic sanity checks before things blow up in prod. Things still 'panic', but less frequently.

pstuart · on Feb 4, 2020

I'll take boring over WTF code any day :-)

nicoburns · on Feb 4, 2020

It's not necessarily an either-or though. I'll take clear, concise expressive code over either!

tensor · on Feb 4, 2020

Most development jobs on products that matter involve working on large established code bases. Many people get satisfaction from knowing that their work matters to end users, even if it's not writing new things in the new shiny language or framework. Referring to these people as "janitors" is pretty damn demeaning, and says more about you than the actual job. Rewrites are rarely the right call, and doing simply to entertain developers is definitely not the right call.

heinrich5991 · on Feb 4, 2020

>Referring to these people as "janitors" is pretty damn demeaning,

"Referring to the term of "janitors" as demeaning is pretty demeaning and says more about you than your judgement of the parent."

I don't like this rhetoric device you just used.

Also, I think that janitors do important work as well.

hu3 · on Feb 4, 2020

Let's not fool ourselves.

The demeaning of janitors was introduced by GP by describing it as something they would rather not do.

No mental gymnastics required.

cma · on Feb 5, 2020

He said he felt like a janitor, next guy said he demeaned others as janitors, and now you are saying he demeaned janitors. There is a level of gymnastics going on.

hu3 · on Feb 5, 2020

First paragraph:

> The common factor in most of my decisions to look for a new job has been realizing that I feel like a very highly compensated janitor instead of a developer.

So for that person, feeling like a janitor is incentive for seeking a new job. It's that simple really.

CDSlice · on Feb 5, 2020

That doesn't mean he is demeaning janitors, just that he doesn't want to be one. There are loads of reasons to not want to be a "code janitor" besides looking down at janitors.

rswail · on Feb 5, 2020

I'm not a gymnast, but comparing people comparing their work to janitors and calling it gymnastics demeans gymnasts.

/s

therockhead · on Feb 5, 2020

It helps if your think gardener instead of janitor.

rvcdbn · on Feb 4, 2020

For any tracing GC, costs are going to be proportional to the number of pointers that need to be traced. So I would not call reducing the use of pointers to ameliorate a GC issue "obtuse, error-prone fiddling". On the contrary, it seems like one of the first approaches to look at when faced with the problem of too much GC work.

Really all languages with tracing GC are at a disadvantage when you have a huge number of long-lived objects in the heap. The situation is improved with generational GC (which Go doesn't have) but the widespread use of off-heap data structures to solve the problem even in languages like Java with generational GC suggests this alone isn't a good enough solution.

In Go's defense, I don't know another GC'ed language in which this optimization is present in the native map data structure.

pjmlp · on Feb 5, 2020

Except that plenty of languages with tracing GC have also off GC memory allocation.

Since you mention not knowing such languages, have a look at Eiffel, D, Modula-3, Active Oberon, Nim, C#/F# (specially after the latest improvements).

Also Java will eventually follow the same idea as Eiffel (where inline classes are similar to expanded classes in Eiffel), and ByteBuffers can be off-GC heap.

typical182 · on Feb 4, 2020

Maybe that is what they hit... but it seems there is a pretty healthy chance they could have resolved this by upgrading to a more modern runtime.

Go 1.9 is fairly old (1.14 is about to pop out), and there have been large improvements on tail latency for the Go GC over that period.

One of the Go 1. 12 improvements in particular seems to at least symptomatically line up with what they described, at least at the level of detail covered in the blog post:

https://golang.org/doc/go1.12#runtime

“Go 1.12 significantly improves the performance of sweeping when a large fraction of the heap remains live.“

gwbas1c · on Feb 4, 2020

Everything I've read indicates that RAM caches work poorly in a GC environment.

The problem is that garbage collectors are optimized for applications that mostly have short-lived objects, and a small amount of long-lived objects.

Things like large in-RAM LRU are basically the slowest thing for a garbage collector to do, because the mark-and-sweep phase always has to go through the entire cache, and because you're constantly generating garbage that needs to be cleaned.

Nitramp · on Feb 5, 2020

> The problem is that garbage collectors are optimized for applications that mostly have short-lived objects, and a small amount of long-lived objects.

I think it's not quite that.

Applications typically have a much larger old generation than young generation, i.e. many more long lived objects than short lived objects. So GCs do get optimized to process large heaps of old objects quickly and efficiently, e.g. with concurrent mark/sweep.

However as an additional optimization, there is the observation that once an application has reached steady state, most newly allocated objects die young (think: the data associated with processing a single HTTP request or user interaction in a UI).

So as an additional optimization, GCs often split their heap into a young and an old generation, where garbage collecting the young generation earlier/more frequently overall reduces the mount of garbage collection done (and offsets the effort required to move objects around).

In the case of Go though, the programming language allows "internal pointers", i.e. pointers to members of objects. This makes it much harder (or much more costly) to implement a generational, moving garbage collector, so Go does not actually have a young/old generation split nor the additional optimization for young objects.

sorokod · on Feb 4, 2020

In this[1] video at about 32 min, mark there is a discussion on GC and apps that do caching.

[1] https://www.youtube.com/watch?v=VCeHkcwfF9Q

pjmlp · on Feb 5, 2020

Which is why on GC languages that also support value types and off GC-heap allocations, one makes use of them, instead of throwing out the baby with the water.

pkolaczk · on Feb 4, 2020

A high number of short lived allocations is also a bad thing in a compacting GC environment, because every allocation gets you a reference to a memory region touched very long time ago and it is likely a cache miss. You would like to do an object pool to avoid this but then you run into a pitfall with long living objects, so there is really no good way out.

mcguire · on Feb 4, 2020

???

The allocation is going to be close to the last allocation, which was touched recently, no? The first allocation after a compaction wii be far from recent allocations, but close to the compacted objects?

pkolaczk · on Feb 5, 2020

Close to the last allocation doesn't matter. What matters is the memory returned to the application - and this is memory that has been touched long ago and unlikely in cache. If your new generation size is larger than L3 cache it will have to be fetched from main memory for sure every time you start the next 64 bytes. I believe a smart cpu will notice the pattern and will prefetch to reduce cache miss latency. But a high allocation rate will use a lot of memory bandwidth and would thrash the caches.

An extreme case of that problem happens when using GC in an app that gets swapped out. Performance drops to virtually zero then.

bearcherian · on Feb 4, 2020

The article also mentions the service was on go 1.9.2, which was released 10/2017. I'd be curious to see if the same issues exist on a build based on a more recent version of Go.

kristianp · on Feb 5, 2020

I was thinking that if their cache is just one large hash table, essentially an array of structs, the GC wouldn't need to scan it. What you say about strings contained in the map would explain their problems, however I don't see the reason for it. Wouldn't you make sure every identifier uses a fixed-length GUID or similar, which would be contained in such a struct used in the array-of-structs?

asimpletune · on Feb 4, 2020

Ok but in rust those pointers can just be borrowed obviating the need for gc at all.

masklinn · on Feb 4, 2020

Given it's a cache the entries would not have an existing natural owner… except for the cache itself.

There would be no need for a GC to traverse the entire map, but that's because rust doesn't use a GC.

falcolas · on Feb 4, 2020

While Rust does not have a discrete runtime GC process, it does utilize reference counting for dynamic memory cleanup.

So you could argue that they are still going to suffer some of the downsides of a GC'ed memory allocation. Some potential issues include non-deterministic object lifespan, and ensuring that any unsafe code they write which interacts with the cache does the "right thing" with the reference counts (potentially including de-allocation; I'm not sure what unsafe code needs to do when referencing reference counted boxes).

masklinn · on Feb 4, 2020

> While Rust does not have a discrete runtime GC process, it does utilize reference counting for dynamic memory cleanup.

That's so misleading as to essentially be a lie.

Rust uses reference counting if and only if you opt into it via reference-counted pointers. Using Rc or Arc is not the normal or default course of action, and I'm not aware of any situation where it is ubiquitous.

> So you could argue [nonsense]

No, you really could not.

mcguire · on Feb 4, 2020

On the other hand, Rust's RAII management model behaves similarly to a reference counting system where the counts are limited to 0 and 1 (well, for a loose approximation of the 0 state), right?

steveklabnik · on Feb 4, 2020

Some people say this, but I think it's misleading; refcounting can make things live longer, but the borrow checker cannot.

Jweb_Guru · on Feb 5, 2020

Well, with ownership, a move can make things "live" longer, I guess.

steveklabnik · on Feb 5, 2020

You're not wrong. I just think there's enough difference that the analogy doesn't really work.

ta20200202 · on Feb 5, 2020

RAII for references (pointers) is a no-op. If the cache returns references to the data in its own array there is no overhead.

falcolas · on Feb 5, 2020

I was making an assumpotion that using a vector of ARC<T> would be the best way to handle a global LRU cache. Perhaps I should have specified it, but it seemed pretty obvious. Sorry if it wasn’t.

If there’s a better way to handle a global LRU cache, I’m all ears.

jhgg · on Feb 5, 2020

Assuming only one thread at a time needs to access the LRU cache (not hard with the shared-nothing message passing architecture which we employ here), the lifetime of the object being checked out from the cache is able to be understood at compile time, and we can just use the borrow checker to ensure that it remains that way (we've got a mutable reference to the LRU, and we can use that to get a mutable reference to an object within the LRU. By the time the function that is mutating the data in the LRU finishes, the references to the objects must be dead (the borrow checker will enforce that.) Since all this information is available during compile time, runtime ref-counting (via rc/arc) is not necessary.

This is made possible by rust's memory model, where it understands ownership of data, and the lifetime of each reference that's being taken from that owned data. This means that the compiler can statically determine how long an object needs to live, and that references to the object don't outlive the owned data. For use-cases where the lifetime of references are able to be statically understood, an arc/rc is not required. This blog-post goes into it in much better detail than I can: https://words.steveklabnik.com/borrow-checking-escape-analys...

falcolas · on Feb 5, 2020

Yes, I'm quite familiar with rust's borrow checking model. I've programmed some in rust, and the rest has been beaten into my head quite thoroughly by Rustaceans. I don't care for Rust, but I understand it.

Locking on one thread at a time seems like a pretty obvious performance flaw. It just doesn't seem like an appropriate design for the given workload (lots of requests, lots of stored items, largely write-only (except for its position in the queue)). It would make a lot more sense to grant multiple threads access the LRU at any given time.

And early optimization and all that aside, creating the LRU in such a way that it can be easily restricted to one thread or opened up makes the most sense to me. Otherwise, you get to re-write the LRU (and all the code which accesses it) if it should be identified as a bottleneck.

Of course, I'm not responsible for the code or truly involved in the design process, so my perspective may be limited.

jhgg · on Feb 5, 2020

In practice, for our service, most of our CPU time is not spent in data mutation, but rather networking and serialization (this is btw, the same conclusion Redis came to when they added "multi-threading".)

You can scale-out by running multiple instances of the service (shared-nothing, N many depending on how cores you want to run on.) Or, you can do message-passing between cores.

In this case, we have 2 modes of scale-up/out (add more nodes to the cluster, or add more shared-nothing LRU caches that are partitioned internally that the process runs, allowing for more concurrency).

We however only run one LRU per node, as it turns out that the expensive part is not the bottleneck here, nor will it probably ever be.

jstrong · on Feb 5, 2020

what kind of design do you have in mind? I assume you don't mean simultaneous reads/writes from multiple threads without synchronization - yolo! there's a lot of possible designs, mutex, read/write lock, concurrent hashmap. I've never worked on an LRU cache, asking because interested in what plays well in that use case, and how you would approach it in another language.

iknowstuff · on Feb 4, 2020

I think you're confusing Rust's ownership model with Swift's ARC. Rust doesn't do reference counting unless you use Rc<T> or Arc<T>.

falcolas · on Feb 5, 2020

Given the model of memory we are discussing (a global per-process LRU cache), that’s exactly what I was discussing using. Unless there’s another way to handle such global caches.