ULIDs and Primary Keys (2022) (daveallie.com)
76 points by hu3 on April 12, 2024 | hide | past | favorite | 33 comments


ULID's spec includes an unnecessarily convoluted monotonically incrementing variant that requires serialization to generate IDs without conflicts. That makes the spec look like it's trying to solve multiple unrelated problems in one shot, which is confusing and reads like a design smell. If you need a sequentially incrementing identifier, just use a big enough int.
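For context, the spec's monotonic variant keeps the same millisecond timestamp and increments the 80-bit random component when two IDs collide in time, which is exactly what forces serialization. A minimal sketch (illustrative, not any particular library's implementation):

```python
import os
import threading
import time

# Crockford base32 alphabet used by ULID (no I, L, O, U)
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

class MonotonicULID:
    """Sketch of the spec's monotonic variant: within the same
    millisecond, the 80-bit random part is incremented rather than
    regenerated -- which forces serialization through a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_ts = -1
        self._last_rand = 0

    def generate(self) -> str:
        with self._lock:  # the serialization the comment objects to
            ts = int(time.time() * 1000)
            if ts == self._last_ts:
                # overflow past 80 bits is ignored in this sketch
                self._last_rand += 1
            else:
                self._last_ts = ts
                self._last_rand = int.from_bytes(os.urandom(10), "big")
            value = (ts << 80) | self._last_rand
            # encode 128 bits as 26 Crockford base32 characters
            chars = []
            for _ in range(26):
                chars.append(ALPHABET[value & 0x1F])
                value >>= 5
            return "".join(reversed(chars))
```

Because the Crockford alphabet is in ascending ASCII order, successive IDs from one generator sort lexicographically, which is the property the variant exists to provide.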

I think the issue above and UUIDv7’s existence make ULID completely unnecessary.


I've found ULID to be really useful in my API backend that runs on AWS Lambda and writes to DynamoDB. This is a use case where I need Lambda's parallelism, meaning no sharing of a common sequential counter. Meanwhile I can roughly order items by creation time, which makes pagination and selecting items by time range possible right on the sort key instead of as a filter. In other words, items might end up out of order by a few milliseconds, but in my use case that's not a big deal, and I get the benefit of communication-free parallelism.
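The time-range trick works because the first 10 Crockford characters of a ULID encode the 48-bit millisecond timestamp, so any timestamp can be turned into a lexicographic sort-key bound. A sketch (illustrative helper, not from a specific library):

```python
# Crockford base32 alphabet used by ULID (no I, L, O, U)
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid_time_prefix(ms: int) -> str:
    """Encode a millisecond timestamp as the first 10 characters of a
    ULID. Usable as a lower/upper bound in a BETWEEN condition on a
    lexicographically sorted key (e.g. a DynamoDB sort key)."""
    chars = []
    for _ in range(10):  # 10 chars * 5 bits = 50 bits, covers 48-bit ts
        chars.append(ALPHABET[ms & 0x1F])
        ms >>= 5
    return "".join(reversed(chars))
```

For a range query you'd use `ulid_time_prefix(start_ms)` as the lower bound and `ulid_time_prefix(end_ms) + "Z" * 16` as the upper bound.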


I guess that was before UUIDv7 was a thing? Is there a reason to pick ULID over it?


The monotonic version of ULID is probably faster at generating the next ID than UUIDv7. Whether that performance difference matters to you is application-specific.


The monotonic version relies on locking to avoid conflicts, so it's not applicable in this case.


Can't you have a per-thread ULID without any locks?
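In principle yes: each thread (or worker) can keep its own monotonic state, giving per-thread monotonicity without locks, while global uniqueness still rests on the random component. A sketch (illustrative, not a library API):

```python
import os
import threading
import time

# Crockford base32 alphabet used by ULID (no I, L, O, U)
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

_state = threading.local()  # per-thread state: no locking needed

def ulid_thread_local() -> str:
    """Lock-free sketch: each thread keeps its own monotonic state, so
    IDs are monotonic per thread (not globally) without serialization.
    Overflow of the 80-bit counter is ignored in this sketch."""
    ts = int(time.time() * 1000)
    if getattr(_state, "last_ts", None) == ts:
        _state.last_rand += 1
    else:
        _state.last_ts = ts
        _state.last_rand = int.from_bytes(os.urandom(10), "big")
    value = (ts << 80) | _state.last_rand
    chars = []
    for _ in range(26):
        chars.append(ALPHABET[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```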


What’s wrong with timestamp + worker number in your situation?


I agree that the spec is a bit messy on this, but monotonicity is not required so in the end it's not a big deal.


It’s a big deal because it causes semantic confusion: for any ID scheme that claims to be “universal”, users expect pervasive uniqueness without resorting to serialization. The “U” in ULID becomes misleading when devs bump into the monotonic variant. That can lead to hard-to-detect bugs, because conflicts or perf degradation may happen only in production. Bad idea to make it part of the spec.

It doesn’t even make sense to have such a scheme in ULID. The only advantage it has over a 64-bit integer is that ULID is 128 bits, but that doesn’t solve any problem or help with anything. It’s just waste, and will cause pain and confusion.


In my eyes, the fact that UUIDs (or specifically v4, as mentioned in the article) encode no information is an advantage. I don’t think encoding the creation date into the UUID is likely to be a huge security hole, but I can imagine some instances where you would want to keep that secret, or at least have no good reason to publish it.

I would also argue, strongly, that if you are planning to use the datetime an account/event/item was generated, you should track it separately from the UUID, in a dedicated column.

This isn’t intended to refute the article (it was an interesting read and I definitely learned some stuff), but rather to provide an argument for why you might want to stick with v4.


It's a security vs convenience tradeoff. For most people, making the ID practically unguessable, while also preserving index locality in databases, is the right balance.


> making the ID practically unguessable

80 bits of randomness is just on the edge of unguessable, though. Most people accept that 128+ bits is cryptographically secure now and in the future.
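A back-of-the-envelope check of "on the edge" (the guess rate of 10^12/s is a hypothetical attacker figure, not from the thread):

```python
import math

# ULID's random component is 80 bits; UUIDv4 has 122 random bits.
ulid_space = 2 ** 80
uuid4_space = 2 ** 122

# Hypothetical attacker testing a trillion guesses per second:
seconds_per_year = 60 * 60 * 24 * 365
years_to_enumerate_ulid = ulid_space / (1e12 * seconds_per_year)

# Birthday bound: expected number of IDs before a ~50% collision chance.
birthday_ulid = math.sqrt(2 * math.log(2) * ulid_space)  # roughly 2**40.2
```

That works out to tens of thousands of years to enumerate the 80-bit space, infeasible online but well short of the 128-bit margin usually demanded for cryptographic secrets.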


It is a huge security hole. The problem is that people are using v4 all over the place as a source of “unguessable” tokens. This is not the purpose of a UUID and is wrong of course, but it doesn’t change the reality that many services will become vulnerable if they change UUID schemes.


You can track it separately, but if you use a database like DynamoDB using a ULID as the SK leads to optimized access patterns. If you use a UUID as your SK, then sort order of two items created at the same second is not guaranteed.


> While there are 48 bits of time information in both UUIDv7 (with millisecond precision) and ULID, the ULID standard encodes the time as Unix Epoch in milliseconds for the whole 48 bits whereas UUIDv7 splits the time into 36 seconds bits and 12 millisecond bits.

This is incorrect / outdated. UUIDv7 switched to 48 bits of milliseconds instead of 36 seconds / 12 milliseconds: https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...


Ironically, the information and link that you've shared is incorrect / outdated.

The newer draft proposal allows for multiple different options of time information to be added to a UUIDv7 encoding, beyond the typical 48-bits of Unix epoch millisecond data.

https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-rfc...
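Assuming the typical layout (48 bits of Unix epoch milliseconds at the top, which is the most common option), extracting the timestamp is a single shift. A sketch:

```python
import uuid

def uuidv7_timestamp_ms(u: uuid.UUID) -> int:
    """Assuming the typical UUIDv7 layout, the top 48 bits carry the
    Unix epoch timestamp in milliseconds."""
    return u.int >> 80

# Example: build a UUIDv7-shaped value with a known timestamp
# (version nibble 7 at bits 76-79, variant bits 10 at bits 62-63).
ts = 1_700_000_000_000
u = uuid.UUID(int=(ts << 80) | (0x7 << 76) | (0b10 << 62) | 12345)
```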


This article is a little old.

First thing that isn't correct or changed since this was written:

> There don't seem to be any existing built-in or extension functions that support generating UUIDv7s in PostgreSQL. To generate a UUIDv7, the function below could be tweaked in order to support the correct formatting.

This is the extension I use until UUIDv7 native support lands in PostgreSQL https://github.com/fboulnois/pg_uuidv7. As for native support, it's definitely underway, though it seems to have a few issues keeping up with the version 17 target release: https://commitfest.postgresql.org/47/4388/


It was already out of date when published in Feb 2022:

> PostgreSQL has built in support for UUIDv4 through the pgcrypto or the uuid-ossp extensions.

PostgreSQL hasn't needed an extension to generate UUIDv4 since v13 which was released in Sep 2020.


I like the ULID properties for the database, but it misses so many of the human factors that would make it excel.

Your tokens/identifiers should look like this:

namespace_crockfordEncodedEntropy

- There must be token/identifier prefixes, like Stripe's. While one could concatenate/prefix these bytes in application code, it's best to just accept the overhead and store them in your records for ease of lookup. It makes an engineer's life so much easier when debugging foreign keys on the fly or during an emergency SEV.

- There should be no hyphens (in string serialization). You probably want to encode and store the Crockford form directly rather than doing bytes-to-hex encoding, again for engineer ease of use.

- You may want to leave some trailing bytes for request routing in active-active high SLA systems. Especially for when record lookup cannot fail.
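A minimal sketch of that shape (the function name and entropy size are illustrative, not from any particular library):

```python
import os

# Crockford base32 alphabet (no I, L, O, U, and no hyphens)
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def typed_id(namespace: str, entropy_bytes: int = 16) -> str:
    """Sketch of a Stripe-style prefixed identifier: a short namespace,
    an underscore, then Crockford-base32-encoded random entropy."""
    value = int.from_bytes(os.urandom(entropy_bytes), "big")
    n_chars = (entropy_bytes * 8 + 4) // 5  # ceil(bits / 5)
    chars = []
    for _ in range(n_chars):
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return f"{namespace}_{''.join(reversed(chars))}"
```

With 16 bytes of entropy this yields IDs like `user_` followed by 26 Crockford characters, so the type is visible at a glance in logs and foreign keys.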


I’ve seen this sort of design referred to as “typed IDs”: https://github.com/jetify-com/typeid

It doesn’t use the Crockford encoding, but it is another one that minimises confusables. Definitely looks pretty solid, so I'm keen to give it a shot in my next project.


I feel like the benchmarks and conclusion did not do any real comparison of uuidv7 vs ulid, and instead only compared uuidv4 and ulid.


uuidv7 will be supported in PostgreSQL 17, at which point generation should be as fast as uuidv4, or if you implement it in PL/pgSQL now it will be about as fast as the proposed ULID algorithm.

Insert performance could be even better; IIRC, for B-tree indexes monotonically increasing values are better than random ones, but feel free to correct me on that ;)


Commenting on this a bit late, but in case anyone reads this later too:

UUIDv7 support unfortunately didn't make it to Postgres 17, since the RFC wasn't completely finalized yet by the time of feature freeze (April 8), see discussion on pgsql-hackers:

https://www.postgresql.org/message-id/flat/ZhzFQxU0t0xk9mA_%...

So I guess we'll unfortunately have to rely on extensions or client-side generation for the time being, until Postgres 18.
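Client-side generation is straightforward in the meantime. A sketch assuming the common layout (48-bit millisecond timestamp, version/variant bits, 74 random bits):

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Client-side UUIDv7 sketch: 48-bit Unix-epoch-millisecond
    timestamp, version nibble 7, variant bits 10, 74 random bits."""
    ts = int(time.time() * 1000) & ((1 << 48) - 1)
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF   # 12 bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)
    value = (
        (ts << 80)        # bits 80-127: timestamp
        | (0x7 << 76)     # bits 76-79: version 7
        | (rand_a << 64)  # bits 64-75: random
        | (0b10 << 62)    # bits 62-63: RFC variant
        | rand_b          # bits 0-61: random
    )
    return uuid.UUID(int=value)
```

The resulting values sort roughly by creation time like a ULID would, so they can be inserted into a `uuid` column today and coexist with a native generator later.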


In that case I don't understand why the author didn't go for uuidv7? It seems like existing tooling (both in database and outside) deals with it better, it seems like there are no downsides unless you expect your identifiers to be generated past 4147 but don't care if they are generated past 10889 (I'd love to hear that use-case, seems like it must be interesting).


As always with databases, it depends: for maximum insert performance you'd often go with random UUIDs so you don't get a hot page.


Hot page?

Using a monotonically increasing PK would cause pages in the index to be allocated and filled sequentially, increasing throughput.

Using random UUIDs would lead to page-splitting and partially-filled pages everywhere, negatively impacting performance and size-on-disk.


Not always, this article describes the problem: https://learn.microsoft.com/en-us/troubleshoot/sql/database-...


Exactly: with one big insert it's better to have a sequential value; for many small concurrent ones it's often better not to.

As with all databases, measure before you cut.


The way to take advantage of the bandwidth of multiple storage devices is to distribute concurrent writes across them, rather than forcing everything to commit sequentially using contended locks or rollbacks.


The DB (or any application) should not have any need to know what devices are underneath its mount point. If you’re striping across disks, that’s a device (or filesystem, for ZFS) level implementation.


Discussed at the time:

ULIDs and Primary Keys - https://news.ycombinator.com/item?id=34175639 - Dec 2022 (46 comments)


Recent and related:

The UX of UUIDs - https://news.ycombinator.com/item?id=39974465 - April 2024 (196 comments)


man, I'm glad someone finds this interesting so that I don't have to.



