Casync – A Content-Addressable Data Synchronization Tool (github.com/systemd)
145 points by mmmmkay on April 23, 2022 | hide | past | favorite | 25 comments


I was recently playing[0] with the ZSTD seekable format[1]: a *valid* ZSTD stream format (built on ZSTD skippable frames, which are ignored by the decompressor) w/ an additional index at the end of the file that allows random access to the underlying compressed data based on uncompressed offsets. Combined w/ Content-Defined Chunking[2] and a cryptographic hash[3] in the index, this allows for very efficient content-addressable storage. For example, I've successfully applied it to a bazel cache, which gave me between 10x and 100x wins on repository size w/ negligible CPU usage increase.

[0] https://github.com/SaveTheRbtz/zstd-seekable-format-go

[1] https://github.com/facebook/zstd/blob/dev/contrib/seekable_f...

[2] e.g. FastCDC https://www.usenix.org/system/files/conference/atc16/atc16-p...

[3] https://github.com/facebook/zstd/pull/2737


BitKeeper used a seekable compressed file format for revision control data. It allowed a data structure to be dynamically loaded on demand without needing to decompress the whole file. A large empty memory buffer was allocated and then read permission was removed with mprotect(). A signal handler then populated regions of that buffer on demand with data from the compressed file, using the ability to seek to certain boundaries.

This change achieved a 10X speedup on normal operations compared to the old code that used SCCS files.

The compressed format stored data blocks in arbitrary order, and the index at the end of the file gave the data layout. This allowed writing to the file without rewriting it. BitKeeper itself only needed to append to the end of the file, but the format could support inserting data in the middle while only appending to the physical file.

It also had a data redundancy CRC layer that could detect data corruption and recover data from some types of corruption.

https://www.bitkeeper.org/

https://github.com/bitkeeper-scm/bitkeeper/blob/master/src/l...

https://github.com/bitkeeper-scm/bitkeeper/blob/994fb651a404...


Really wish this was part of official zstd (https://github.com/facebook/zstd/issues/395#issuecomment-535...) and not a contrib / separate tool.


That last proposal is mine! Thanks for the Go implementation, I’ll have to play with it some time. Looks like you’ve built something very close to my intention. Did you come across any additional changes needed to the spec? Any suggestions or results from going through the implementation that would help?

Another thought is whether it is possible, and how, to coordinate re-using dictionaries.


Thanks for creating this proposal! Would love to see it committed upstream. For me it worked just fine -- I'll need to play a bit more with it, and maybe put it into a real-world production setup to see how well it performs on real-world data.


Is your bazel cache implementation open source? I am dabbling in bazel, and I am not sure where zstd fits into the bazel cache model. I'm interested in learning more about this.


I did PoC experiments with compression, chunking, and IPFS here: https://github.com/SaveTheRbtz/bazel-cache

If you need a mature compression implementation for bazel I would recommend using recent bazel versions w/ gRPC-based bazel-remote: https://github.com/buchgr/bazel-remote

bazel nowadays supports end-to-end compression w/ `--experimental_remote_cache_compression`: https://github.com/bazelbuild/bazel/pull/14041


I was wondering how this gets any common chunks at all with the removed file boundaries. Turns out that chunks don't have a set size, just min/max/avg values, so unaligned streams may end up synchronizing. https://github.com/systemd/casync/blob/master/src/cachunker.... If I understood that correctly, that's pretty cool.

But looking at the code I'm having strong "nope" feelings. First, because of lines like "q += m, n -= m;". Second, because of int/enum/semantic abuse: `compression_type` may be _CA_COMPRESSION_TYPE_INVALID (which I hope is -1), `>= 0` for a known compression type, or `-EAGAIN` as an error. (from https://github.com/systemd/casync/blob/99559cd1d8cea69b30022... ) I'd bet that just throwing afl at the decompressor will find issues :( (source: I had this feeling about systemd-resolved, threw afl at it, and found issues)

I do like the idea though.


How hard is it to set up AFL? I’ve never had the opportunity to set up a fuzzer


Depends on the app / file format. If you can isolate some trivial behaviour on a single file, you may not even need to change the app. For example, "casync list --store=/var/lib/backup.castr input.caidx" could maybe be used to fuzz the index reading if it doesn't care about looking into the store immediately. And if it does, you could comment out those parts as needed. You can get very fancy with the process, but you can start with "here's your single valid input sample, go!" -- typically something like `afl-fuzz -i testcases -o findings -- ./casync list --store=/var/lib/backup.castr @@` against a build compiled with afl's compiler wrapper, where `@@` is replaced by each mutated input file. Check out https://fuzzing-project.org/tutorial3.html and the AFL guide.


Thanks! I'll be looking at using that for some of our code


Mentioning the more portable desync is obligatory: https://github.com/folbricht/desync


desync is what git-lfs should have been: rolling hash chunk based storage, with local caching of chunks, proxying of remote chunks, etc.

When I have big 100MB binary files that I want to version, the changes are small (1MB in one place, and a few more KB in others). I also have a few multi-GB SQLite databases I would like to version where this would help. (Changes are less than 5% of the data, but index pages throughout the file mean rsync-style partitions still transfer 50% or so of the file.) To actually achieve storage efficiency, I store textual dumps, and they also fit better with borg/restic/casync than git or git-lfs.

Would have been extra awesome if desync would be able to use a git repo as storage.


git itself should have been rolling hash chunk based storage. Storing very large text files (like bash history or the Debian security team git repo) in git is very inefficient due to the current design.


It is inefficient up until the point where you ask git to do a gc+repack, at which point git will look for deltas -- and if the changes are to the same file, it will likely find them and encode them as a delta.

For text files with really small changes, this is comparable to a "diff" size, and is better than what a rolling hash would achieve. However, for text files with larger changes, and of course for binary files, a rolling hash is much more effective. Additionally, a rolling hash would easily reuse cross-file similarity, whereas the current delta-finding code is likely to only find redundancy in the history of the same file (or a similarly named one in the same directory).


How is Go more portable than C? Genuine question.


I think they mean to OSes that don't run systemd / Linux kernel?


It is not systemd dependent.


The repo linked answers this question.


We did all this stuff for packages in Fuchsia and it works really well. We went further, though: the on-disk storage is a write-once immutable store of read-verified data.


Seems quite similar to IPFS, but no comparison in the readme/article.


It's much more similar to a related Protocol Labs project: CAR (content-addressable archives) https://ipld.io/specs/transport/car/


Why is this under systemd? I hope I don't have to update all of systemd to get updates to a sync tool.


From the announcement blog post (http://0pointer.net/blog/casync-a-tool-for-distributing-file...):

> Is this a systemd project? — casync is hosted under the github systemd umbrella, and the projects share the same coding style. However, the code-bases are distinct and without interdependencies, and casync works fine both on systemd systems and systems without it.


> However, the code-bases are distinct and without interdependencies

Didn't they say that about udev?



