Blobz: Your Wobbly Garbage Persistor
This is Vaporware! 💨
Yet I’ve wanted to build it for a while now.
I’m probably going to build it gradually, one step at a time—treating it as both a useful tool for my own projects and an experiment to explore how far I can push the idea. What idea, you ask? A GC-style in-process key-value store with persistence, built in Zig.
I know hardly anyone is ever going to read this piece, but: if you do and you have opinions, please share them with me on X or hop onto the Zap Discord Server and tell me why my approach sucks, if you have better ideas, what existing solution I should use instead, etc.
Heads up: it’s going to be Zig.
What follows is me rambling about a pet project I wanted to start a long time ago but never really needed to, because my Zap apps were so robust that I was never forced to do it.
When you read this, there might even be something to look at on GitHub.
Intro: Garbage Collection Style Persistence
What I have in mind is an in-process key/value store with the following characteristics, vaguely inspired by mark-and-sweep garbage collection:
- it is meant for multithreaded environments like Zap apps.
- you define key and value types
  - I think of a specialization where the key is just an index into an array.
- you can iterate through it fast if you need to (`ArrayHashMap`)
- it has an `upsert` method: you can insert/update the value under a key
- it marks timestamps of upserts as `dirty_time`.
- it runs a “garbage persistor” in its own thread:
  - it iterates through all objects
  - it knows when it last ran (`last_garbage_run`).
  - all objects whose `dirty_time > last_garbage_run` will be snapshotted into JSON (for now)
  - the collected JSONs are persisted to disk
- you can bootstrap it from persisted JSON.
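To make that list concrete, here’s a rough sketch of the shape I have in mind. Everything below, from `Blobz` to `persistDirty`, is hypothetical; none of the names or APIs are final:

```zig
// Hypothetical sketch (Zig 0.14-era std APIs); nothing here is final.
const std = @import("std");

pub fn Blobz(comptime K: type, comptime V: type) type {
    return struct {
        const Self = @This();

        /// A stored value plus the metadata the garbage persistor needs.
        const Entry = struct {
            value: V,
            dirty_time: i64, // timestamp of the last upsert
        };

        // ArrayHashMap, so iterating over all values is a fast linear scan.
        map: std.AutoArrayHashMap(K, Entry),
        last_garbage_run: i64 = 0,

        pub fn init(alloc: std.mem.Allocator) Self {
            return .{ .map = std.AutoArrayHashMap(K, Entry).init(alloc) };
        }

        /// Insert or update the value under `key` and mark it dirty.
        pub fn upsert(self: *Self, key: K, value: V) !void {
            try self.map.put(key, .{
                .value = value,
                .dirty_time = std.time.milliTimestamp(),
            });
        }

        /// What one “garbage persistor” run would do: snapshot everything
        /// that changed since the last run. (The real thing would run in
        /// its own thread and write to disk instead of a generic writer.)
        pub fn persistDirty(self: *Self, writer: anytype) !void {
            const now = std.time.milliTimestamp();
            for (self.map.values()) |entry| {
                if (entry.dirty_time > self.last_garbage_run) {
                    try std.json.stringify(entry.value, .{}, writer);
                    try writer.writeByte('\n');
                }
            }
            self.last_garbage_run = now;
        }
    };
}
```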
Why This Might Actually Be Useful
Consider the following (“my”) scenario: You run an online experiment. The moment you take it live, hundreds or even thousands of users waiting for a new challenge might storm your site. With every action they take, they change their user state. Per-user state changes can be quite frequent—maybe even automated by some JavaScript that hits an API endpoint every half a second or so to keep track of some very important, granular progress.
The frequent updates to the server are meant to ensure accurate information on a dashboard you might create and, more importantly, that users can close their browsers and resume from almost exactly their previous state, even on a different computer, i.e., without relying on client-side storage.
While my scenario happens in the context of online experiments, similar characteristics might apply to other web apps. If not, then Blobz will be just for me. 😊
In-process
I love SQLite. An in-process database. Simple to deploy, battle-tested across millions upon millions of deployments (probably). No database connections, no shit. You link to it and that’s it. Now your app can database. Actually, now your app IS a database.
What I don’t love about SQLite is its SQLiness. I often don’t want schemas, tables, columns, etc. I’d rather have a persistent copy of an array or HashMap. I don’t want to translate Zig structs to SQL tables.
Having such persistence in-process is valuable to me, as it simplifies building and deployment. Just `zig build` to build. True, that can also be achieved with SQLite, but I hope you get the idea. To deploy, just start the one and only process. No need to spin up a docker-compose with a Redis, for example.
Starting Simple…
Blobz, with its limited scope, seems like a good candidate to grow from a basic proof of concept into a basic production-ready version, and then to experiment with alternative, even more robust/performant/better implementations.
For a start, if it can provide SQLite-like or better performance and robustness as a storage layer, I’d call it a first success.
What follows is me rambling about some implementation details that came to mind before I wrote the first line of code…
Locking and JSON: First Experiments in Persistence
The first approach to a working solution is to lock the entire HashMap of users, iterate over its KV pairs, and persist them. It’s simple, but it’s not the most performant: conversion to JSON is not super-efficient, and this approach slows down your entire experiment app for all users until all JSON conversions are done.
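In code, that naive first version might look something like this (a hypothetical sketch; `self.lock` and `self.users` are placeholder names):

```zig
// Hypothetical naive version: one big lock, everything stalls while we
// serialize. `self.lock` guards the whole user map.
pub fn persistAll(self: *Self, writer: anytype) !void {
    self.lock.lock(); // every request handler now waits on us...
    defer self.lock.unlock();

    var it = self.users.iterator();
    while (it.next()) |kv| {
        // ...for as long as ALL of these JSON conversions take
        try std.json.stringify(kv.value_ptr.*, .{}, writer);
        try writer.writeByte('\n');
    }
}
```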
Well, we could take a COPY of each value, essentially creating a snapshot of the entire HashMap (via many individual value snapshots), then slowly convert to JSON and persist it. That seems like a sensible approach. But what if user objects are large and constantly growing? It seems like a waste of memory to convert them all in one go. So we iterate, then copy and jsonify individual user objects.
We need to discuss object consistency here. What if the object changes while we’re taking the copy? The resulting copy might be in an inconsistent state. We can mitigate that by introducing read/write locks to the user objects.
Now that we can briefly lock objects, we don’t want to hold the locks for too long, as that might slow down our app. You might think, “Let’s take copies. It’s OK to waste a bit of memory and then jsonify all the copies in one go.”
But what is a copy? How deep or shallow is it? If a user object holds data slices, only their `.ptr` and `.len` fields are copied. That’s a problem: suppose the app deletes the user object and frees its associated memory after we’ve taken the copy but before we could jsonify it. In that case, our copy holds slices with dangling pointers.
This is where we contemplate abandoning the idea of copying. Sure, we could demand that our app never deletes and destroys any user objects. We could also deep-copy all slices, and the slices they might contain, etc. Or we could require that user objects never hold slices pointing outside their own memory: they would contain fixed buffers for every slice, so the slices only ever point to memory that is part of the object itself. Of course, we would then need to fix up the pointers in the copies. That doesn’t seem like a good idea either. Instead, we could require the user struct to provide a `clone(alloc)` method. Now, I hope you understand why we abandoned the idea of copying: none of the approaches mentioned above seems right. However, I might revisit that thought later if Blobz’s performance turns out to be terrible. 😊
So let’s try this: we add a read/write lock to every user object. Our garbage persistor locks a user object only while taking a copy and jsonifying it. Our app makes sure to make changes in “transactions” by acquiring the write lock. So when we jsonify (i.e., read) objects, they’ll always be in a valid state. Acquiring the write lock also immediately marks the object as dirty.
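A minimal sketch of that, assuming `std.Thread.RwLock` and made-up names like `Locked` and `update`:

```zig
const std = @import("std");

/// Hypothetical per-object wrapper: the app mutates under the write lock,
/// the garbage persistor reads under the read lock.
fn Locked(comptime V: type) type {
    return struct {
        const Self = @This();

        rwlock: std.Thread.RwLock = .{},
        value: V,
        dirty_time: i64 = 0,

        /// App-side “transaction”: mutate under the write lock and
        /// immediately mark the object as dirty.
        pub fn update(self: *Self, mutator: *const fn (*V) void) void {
            self.rwlock.lock();
            defer self.rwlock.unlock();
            mutator(&self.value);
            self.dirty_time = std.time.milliTimestamp();
        }

        /// Persistor-side: jsonify under the read lock, so we always see a
        /// consistent state and block writers only briefly.
        pub fn snapshotJson(self: *Self, writer: anytype) !void {
            self.rwlock.lockShared();
            defer self.rwlock.unlockShared();
            try std.json.stringify(self.value, .{}, writer);
        }
    };
}
```

The app would then wrap every user object in something like `Locked(User)` and funnel all mutations through `update`.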
How should the persistence ACTUALLY work?
That’s actually the wrong question. Because nothing is good enough until it has reached TigerBeetle level.
So the question should rather be: How can we organize persistence so we can avoid the most obvious traps?
For one, let’s not store everything in one big JSON file. Instead, let’s write objects into individual files. At least in the past, some filesystems (especially on Windows) did not cope well with large directories, so we’d better create subdirectories and limit the number of files in them. Everybody does it, even git.
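A possible layout, sketched with a hypothetical `objectPath` helper that shards on the top byte of a key hash, git-style:

```zig
const std = @import("std");

/// Hypothetical layout: shard on the top byte of a key hash, so no
/// directory ever holds more than 1/256th of all objects, e.g.
/// "objects/a3/a3f09c....json".
fn objectPath(alloc: std.mem.Allocator, key: []const u8) ![]u8 {
    const h = std.hash.Wyhash.hash(0, key);
    const shard: u8 = @truncate(h >> 56); // top byte -> subdirectory name
    return std.fmt.allocPrint(alloc, "objects/{x:0>2}/{x:0>16}.json", .{ shard, h });
}
```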
Here, an interesting question arises: how can we make persistence robust, in the sense that a crash while writing does not leave inconsistent state in the file representation? Something like writing into `.json.tmp` files and renaming them to `.json` afterwards.
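A hypothetical helper for that: write the temp file, `fsync` it, then rename:

```zig
const std = @import("std");

/// Hypothetical sketch: write `<path>.tmp`, fsync, then rename over the
/// final name. A crash mid-write leaves a stale .tmp file at worst,
/// never a half-written .json.
fn persistAtomically(alloc: std.mem.Allocator, dir: std.fs.Dir, path: []const u8, json: []const u8) !void {
    const tmp_path = try std.fmt.allocPrint(alloc, "{s}.tmp", .{path});
    defer alloc.free(tmp_path);

    const file = try dir.createFile(tmp_path, .{});
    errdefer dir.deleteFile(tmp_path) catch {};
    {
        defer file.close();
        try file.writeAll(json);
        try file.sync(); // make sure the bytes hit the disk...
    }
    try dir.rename(tmp_path, path); // ...before the atomic rename
}
```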
I guess more inspiration will come automatically once we start writing and testing some code.
The Big Flaw(s)
Our persistence is naive. It relies on the file system and its surrounding infrastructure (e.g. the kernel, `fsync`). Until I find a better solution, this will have to do. TigerBeetle fans will probably stop reading now, and that’s OK. We’re close to the end anyway 🤣.
Oh, and speaking of flaws, the elephant in the room is JSON. I know. But it’s an easy fix for now, and it’s human-readable, too.
Room for Improvement
Improvements I can think of and want to tackle:
No JSON
Objects implementing a persistence method, analogous to `pub fn format(...)`, could get us out of JSON hell for those structures that implement them:
```zig
pub fn toBlobz(self: *const @This(), alloc: std.mem.Allocator) ![]const u8;
pub fn fromBlobz(data: []const u8) !@This();
pub fn writeBlobz(self: *const @This(), writer: std.io.Writer) !void;
pub fn readBlobz(self: *@This(), reader: std.io.Reader) !void;
```
or similar.
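Dispatching on such a method could then be a simple comptime check; here’s a sketch with a made-up `serialize` helper:

```zig
const std = @import("std");

/// Hypothetical dispatch: use the object's own serializer if it declares
/// one, otherwise fall back to JSON.
fn serialize(value: anytype, alloc: std.mem.Allocator) ![]const u8 {
    const T = @TypeOf(value);
    if (comptime @hasDecl(T, "toBlobz")) {
        return value.toBlobz(alloc);
    }
    return std.json.stringifyAlloc(alloc, value, .{});
}
```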
Better Data Partitioning
We might soon abandon the idea of storing one file per object and switch to more optimal partitioning of data.
While we’re at it, we might want to explore imposing a size limit on persisted object state and working just with memory-mapped files.
Where to Go from Here
I created a project page for Blobz, which I’ll populate with links to anything useful related to the project, like follow-up articles, until Blobz can stand on its own feet. 🦶 🦶
TL;DR: I need this. So I’m building it. Let’s see how far it can go. 🤠