Achieving 10Gbps Line-rate Key-value Stores with FPGAs [pdf] (rackcdn.com)
84 points by kiwidrew on Nov 17, 2013 | 17 comments


I've been thinking about something similar for most of the year, but not using custom hardware. With a zero-copy database like LMDB combined with userspace networking (e.g. netmap or PACKET_MMAP), it should already be possible to saturate 1Gbit with a single core, even with just 100-byte values. With a little more effort, given that network cards commonly support scatter-gather IO, and a database (such as LMDB) that avoids copying or (mostly) touching record data, it's easy to imagine a design where most of the grunt work is kept away from the CPU entirely.
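To make the scatter-gather idea concrete, here's a minimal sketch (assuming a POSIX platform; the file layout, header format, and offsets are invented for illustration, not LMDB's real format). The value bytes come straight out of a memory-mapped region, the way LMDB exposes records, and are handed to the kernel alongside a separately built header, so the payload is never copied through an intermediate buffer in userspace:

```python
import mmap
import os
import tempfile

# Stand-in for a database data file: a 4-byte "header" then a 21-byte value.
# All names and offsets here are illustrative, not LMDB's real layout.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"HDR!" + b"value-bytes-live-here")
    path = f.name

fd = os.open(path, os.O_RDONLY)
mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)

header = b"VALUE 21\r\n"          # per-request header, built normally
value = memoryview(mm)[4:25]      # zero-copy view into the mapped record

r, w = os.pipe()                  # stand-in for a client socket
os.writev(w, [header, value])     # one syscall; the value is gathered
os.close(w)                       # directly from the mapped page

reply = os.read(r, 4096)
print(reply)                      # header + record, no userspace copy
```

A real server would point the iovec at the NIC's DMA-capable buffers (or use `sendmsg` on a socket), but the shape of the design is the same: the CPU only ever builds small headers.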

Given a 2x quad-core box and a decent network card, I'm pretty sure you could approach or exceed 10GbE without requiring original research or an insane amount of code, especially as record size increases. Software could at least compete on throughput, although the paper also shows latency benefits for the FPGA design.
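Rough numbers behind that claim, as a back-of-envelope sketch (the Ethernet/IP/UDP overheads are standard, but the 128-byte request payload and the core counts are assumptions):

```python
# Back-of-envelope: how many small key-value responses fit on the wire,
# and what per-request CPU budget that leaves.
PREAMBLE_IFG = 8 + 12          # preamble+SFD and inter-frame gap, bytes
ETH = 14 + 4                   # Ethernet header + FCS
IP_UDP = 20 + 8
PAYLOAD = 16 + 100 + 12        # key + 100-byte value + small app header (assumed)

wire_bits = (PREAMBLE_IFG + ETH + IP_UDP + PAYLOAD) * 8

for gbps, cores in ((1, 1), (10, 8)):
    reqs = gbps * 1e9 / wire_bits        # requests/sec at line rate
    budget_ns = cores / reqs * 1e9       # CPU time available per request
    print(f"{gbps:>2} GbE, {cores} core(s): "
          f"{reqs / 1e6:.2f} M req/s, {budget_ns:.0f} ns/request")
```

With these assumptions you get roughly 0.6M req/s on 1GbE (about 1.5 µs per request on one core) and about 6.4M req/s on 10GbE (about 1.2 µs per request across eight cores). That's enough time for a cache-friendly index lookup plus a scatter-gather send, which is why the per-core claim is plausible for throughput, even though it says nothing about tail latency.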


It's getting a little stale, but you may find this paper interesting: http://simula.no/research/nd/publications/Simula.ND.8/simula...

With anything that provides stable pointers to values, like LMDB, it should be possible to output to the network extremely efficiently. One thing to keep in mind with LMDB is that its performance isn't so great with purely random keys such as SHA hashes. I'm fooling around on something similar myself targeting this use, and so far critbit tries are looking good (though I have yet to implement benchmarks I'm confident of).
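For reference, the core of a critbit (binary PATRICIA) trie really is small. Here's a hedged Python sketch of membership and insert, following the usual djb formulation; note that keys are treated as NUL-padded bytes, so a key that is another key plus trailing NULs would collide, and a real implementation stores key lengths explicitly:

```python
class Node:
    """Internal node: tests one bit (byte index + inverted-bit mask)."""
    __slots__ = ("byte", "mask", "child")
    def __init__(self, byte, mask):
        self.byte, self.mask, self.child = byte, mask, [None, None]

def _dir(key, byte, mask):
    # 1 if the critical bit is set in key, else 0.  djb's trick: mask
    # has every bit set EXCEPT the critical one, so (mask | c) is 255
    # exactly when c has the critical bit.
    c = key[byte] if byte < len(key) else 0
    return (1 + (mask | c)) >> 8

class CritbitSet:
    def __init__(self):
        self.root = None          # None, a bytes leaf, or a Node

    def contains(self, key):
        n = self.root
        if n is None:
            return False
        while isinstance(n, Node):
            n = n.child[_dir(key, n.byte, n.mask)]
        return n == key

    def insert(self, key):
        if self.root is None:
            self.root = key
            return True
        # 1. Walk to the nearest leaf.
        n = self.root
        while isinstance(n, Node):
            n = n.child[_dir(key, n.byte, n.mask)]
        # 2. Find the first differing byte, then its most significant bit.
        for byte in range(max(len(key), len(n))):
            a = key[byte] if byte < len(key) else 0
            b = n[byte] if byte < len(n) else 0
            if a != b:
                diff = a ^ b
                break
        else:
            return False          # key already present
        while diff & (diff - 1):  # clear low bits until only the
            diff &= diff - 1      # most significant one remains
        mask = 255 ^ diff
        # 3. Re-walk from the root to the splice point (bit order:
        #    smaller byte first, more significant bit first).
        parent, pdir, q = None, 0, self.root
        while isinstance(q, Node) and not (
            q.byte > byte or (q.byte == byte and q.mask > mask)
        ):
            parent, pdir = q, _dir(key, q.byte, q.mask)
            q = q.child[pdir]
        # 4. Splice in one new internal node.
        new = Node(byte, mask)
        d = _dir(key, byte, mask)
        new.child[d], new.child[1 - d] = key, q
        if parent is None:
            self.root = new
        else:
            parent.child[pdir] = new
        return True
```

The appeal for random keys (hashes) is that lookups touch O(key length) nodes with no comparisons against whole keys until the final leaf, and there's no page-split behaviour to fall off a cliff when inserts land uniformly across the keyspace.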

I haven't dug through the OP paper yet, but my guess is that while the performance numbers are interesting, the cost of the FPGA itself would make it less desirable unless you really did need the latency performance/predictability.


One thing to note: if, as the paper says,

"Many well-known web-sites deploy distributed in- memory caches, which are implemented with standard DRAM on a large array of x86 servers, to reduce access load on databases"

then a single server with an FPGA card on it is going to be a lot cheaper than an array of x86 servers. Plus, I would imagine that real estate in a data centre is another significant cost, and having one server with an attached FPGA would also be an advantage in this situation.


That may be true. I think you'd have to look at the numbers to see for sure. The paper uses a Virtex-6 with 24GB of DDR on a PCIe card. The evaluation board for one of those is about $2k, though obviously the FPGA itself would hopefully be lower in price. I can't seem to find any open pricing for the Virtex line. They say using a larger FPGA would allow addressing more DRAM, but of course, that raises your chip costs as well. You'll have to add to that small-volume PCB costs, as well as the cost of the hosting server node.

Compare all that to a $2k 1U Xeon with 32GB of RAM and dual 10GbE, and the prices aren't that dissimilar. Also, one is a research project while the other is a commodity a phone call and a FedEx away.

My guess would be you'd need the FPGA design in high volume for the costs to work out. So it'd be good as an appliance product for a vendor, but unless you're Google/Facebook huge, trying to build this sort of thing yourself would likely be a boondoggle. The exception is the latency performance: you'd have trouble duplicating that with the commodity server design. So if you're doing HFT or somesuch, maybe this is interesting again.

A middle road might be something like the Fast Array of Wimpy Nodes design, but with a lot more DRAM and hacking up the kernel to get true zero-copy IO with minimal syscall overhead.


You are correct, doing this at 10GbE wire speed does not require any exotic hardware, just savvy high-performance software. Assuming your NIC and related drivers are of good quality (e.g. Intel) then it is just a matter of good design to have simple database-like operations (insertion, selection) keep up with the wire without ultra-powerful CPUs.

In fact, I've designed systems that go quite a bit beyond simple key-value stores e.g. wire speed polygon indexes with spatial operators. On modern hardware, if the system is efficient, the "put" side of a key-value store should only require a core or two to saturate the wire; the rest of the cores can be dedicated to saturating the outbound "get" side. It is dead simple to do at 1GbE; at 10GbE it does require some skill at high-performance algorithm design but I've seen it done many times.


+1 on not needing exotic hardware. This is—while not trivial—not exactly hard today with an Intel E5 processor and its DPDK.

What's strange is that they used an FPGA at all. The entire industry for network processors is moving the opposite way.


Comparing to a TilePRO running memcached is disingenuous – the TilePRO running Linux doesn't have the zero-copy network I/O the GX does, and shuttling packets through Linux adds a ton of context switching overhead.

You could do this – heck, probably even twice this – easily on a processor with an integrated NIC like the TileGX or the Broadcom XLP. Is there a market for these things? Should I quit my job and make this product?


"the chosen approach resolves collisions only up to a certain degree. If, however unlikely, a further key maps to the same hash index, then the request simply fails and an appropriate response is generated. We believe this to be an acceptable behaviour for what is essentially equiv-alent to a cache full/miss."

So it's more like a cache than a key/value store? Nothing wrong with that but to me a key/value store implies that data isn't lost in the case of a hash collision. Pretty cool stuff nonetheless.
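The quoted behaviour is easy to picture as a set-associative hash table that refuses to either chain or evict. A hedged sketch of that policy (the bucket count, associativity, and hash function here are arbitrary illustrations, not the paper's actual parameters):

```python
import hashlib

# Fixed table of N buckets, each holding up to WAYS entries, like a
# set-associative cache.  A put whose bucket is already full of other
# keys is simply rejected, mirroring the paper's "request fails"
# behaviour, rather than chained, rehashed, or evicted.
N_BUCKETS, WAYS = 1024, 4

table = [[] for _ in range(N_BUCKETS)]

def bucket(key: bytes) -> list:
    h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
    return table[h % N_BUCKETS]

def put(key: bytes, value: bytes) -> bool:
    b = bucket(key)
    for i, (k, _) in enumerate(b):
        if k == key:
            b[i] = (key, value)   # key already present: overwrite
            return True
    if len(b) < WAYS:
        b.append((key, value))
        return True
    return False                  # bucket full: fail, like a cache miss

def get(key: bytes):
    for k, v in bucket(key):
        if k == key:
            return v
    return None
```

The hardware appeal is that every operation touches exactly one fixed-size bucket, so lookups and stores have constant, predictable cost; the price is exactly the dropped-store semantics the grandparent quote describes.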


Yes, more memcached-like. It's a research project, not a production system.


What I'm not getting from this is why it's any different from a properly implemented TCAM.

Key/value lookups are something that's done by existing networking hardware very well. There are existing merchant-silicon chipsets that do this on-die very well.

There was once an evolving market for "network processor units" but it looks like that all cooled down. IIRC, there was an effort to do something akin to a database lookup on an NPU, but the tradeoff between flexibility and what the NPUs could provide wasn't amenable to the kind of work required to bring those solutions to market.


TCAMs have three problems for such applications: 1) limited key length (usually less than 64 bytes), 2) power-hungry, as you are comparing against every key, and 3) limited size -- try getting a 4GB TCAM!


At least as far as I understand the article, it's simplified in that it can drop store requests without breaking its semantic guarantees. Implementing full CAMs on FPGAs is pretty inefficient.


Something's fishy. What was your measurement platform? It feels like you should have been doing much better than "4-5us" (what was your 99.9th percentile, BTW?) with your FPGA approach. Unless you are doing TCP/IP. But it looked like you are not, right? Is it just UDP?


Mmh, Amazon should offer "hardware accelerated KVS" along with its EC2 instances. Maybe even using ASICs?

It could act like a super super fast memcached.


AWS doesn't even offer great hardware solutions for networking needs, such as their ELB and NAT/PAT offerings. It'd be brilliant if they'd start exposing ASIC-backed instances.


What makes you think S3 isn't already implemented in hardware?


My only question is when can I use this myself?



