Don't Put That PDF in Your Event

Someone asked us recently what to do when an event has to carry a document. A signed contract as a PDF, a Word file, a scanned invoice. The same question comes up every few weeks, and over the years it has arrived in every shape imaginable: not just documents, but images, audio files, and more than once a video. The phrasing changes, the underlying worry does not. Something big needs to be remembered, and the event feels like the place to keep it.

The short answer fits in a single sentence: keep the big thing out of the event and put a reference to it inside. But the short answer is the least interesting part. The interesting part is why, because the reasoning leads straight back to a question most of us never stop to ask. What is an event, really, and what is it for?

The Question Behind the Question

Often the question arrives with a fact attached. In EventSourcingDB, a single event may not exceed 64 KiB, and people quite reasonably ask what they are supposed to do when their data is larger than that. There is a faint note of complaint in it, as if the limit were an obstacle to work around.

We see it the other way. The limit is a feature, not a restriction. Without it, the easy thing and the right thing would diverge, and the easy thing usually wins. You could drop a ten-megabyte PDF into an event, the write would succeed, and nothing would object until much later, when the bill comes due in a place far from where the decision was made. The limit moves that bill forward. It forces the conversation at the moment you are designing the event, which is exactly the moment you can still do something about it.

So when someone hits the limit and stops to ask, that is the system working as intended. The question is not a sign that something is wrong. It is a sign that someone is thinking about how big an event can be, and, more importantly, how big it should be.

What an Event Actually Is

To answer the question well, you have to be clear about what an event is in the first place. An event is the record that something of significance happened in your domain. It answers the obvious questions a story needs to answer: what happened, when, and to what. Depending on the context it may also carry the why and the who. ContractSigned, InvoiceIssued, DocumentApproved – each one is a fact, stated in the language of the business, that will never change again.

Notice what is not in that description. An event is not a container for moving bytes around. It is not a filesystem, not a content delivery network, not a place to park a blob because no better home was handy. An event is a statement about the world, and a statement should be small enough to read. When you reach for an event to hold a video file, you are using a sentence to do the work of a warehouse.

This is the same foundation we laid out in Thinking in Events: events are the natural language of stories, immutable and ordered. A story told well uses words that name what happened. ContractSigned is such a word. The forty pages of the contract itself are not part of the sentence – they are the thing the sentence refers to.

Why Every Kilobyte Travels

There is a concrete, physical reason this matters, and it has nothing to do with taste. Events move. They cross the network when they are written, when they are read, when they are projected into a read model, and when a downstream context subscribes to them. A single event being a few kilobytes larger than it needs to be is invisible. The trouble is that events rarely come one at a time.

The defining operation of an event-sourced system is the replay: reading the history back, in order, to rebuild state. With a few hundred events, the difference between a lean event and a bloated one is academic. With a few hundred thousand, it is the difference between a replay that finishes over a coffee break and one you schedule for the weekend. We once worked out that eighteen months of real events fit on four floppy disks – but that is only true because those events carried domain facts and not attachments. Slip one video into each event and the same history would no longer fit on a stack of hard drives.

If this argument feels familiar, it should. It is the same one the industry settled decades ago about storing files in a relational database. The usual advice was, and still is, to keep the file outside and store only a reference to it, for precisely this reason: when you load one row you accept the cost of that row, and when you load a million rows you do not want that cost multiplied by a blob you did not even need. Event-sourced systems load the equivalent of every row, every time they replay. The discipline matters more here, not less.
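
To put rough numbers on that multiplication, here is a back-of-the-envelope sketch in Python. The event count and both average sizes are illustrative assumptions, not measurements from a real system.

```python
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 * 1024 * MIB


def replay_bytes(event_count: int, average_event_size: int) -> int:
    """Total number of bytes a full replay has to read and transfer."""
    return event_count * average_event_size


# Assumed figures, chosen only for illustration.
event_count = 300_000
lean_size = 1 * KIB       # a domain fact plus a little metadata
bloated_size = 10 * MIB   # the same fact with a document embedded

lean_total = replay_bytes(event_count, lean_size)
bloated_total = replay_bytes(event_count, bloated_size)

print(f"lean replay:    {lean_total / MIB:.0f} MiB")
print(f"bloated replay: {bloated_total / TIB:.2f} TiB")
print(f"factor:         {bloated_total // lean_total}x")
```

Under these assumptions the lean history comes to roughly 293 MiB, while the bloated one lands just under 3 TiB. The exact figures matter less than the factor between them, which every replay pays again.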

An Old Answer to an Old Question

The pattern that solves this has a name. In the world of messaging it is called the Claim Check: instead of sending the heavy payload through the system, you store it once in a place built for heavy payloads and pass around a small claim ticket that can retrieve it. The name comes from a coat check. You do not carry your coat into the theatre; you carry the numbered stub and leave the coat where coats belong.

In practice the place built for heavy payloads is object storage – S3 or one of its many relatives – or even just a directory on a filesystem. You write the document there once, you get back an identifier or a URL, and that is what goes into the event. The event stays a crisp statement of fact, DocumentUploaded with a reference and a little metadata, and the bytes live where bytes are cheap to keep and cheap to serve. This is not a workaround we invented to dodge a limit; it is the documented best practice, spelled out under writing oversized events, and the limit is simply what makes the right choice the path of least resistance.

What goes into the event alongside the reference is worth a moment's thought, too. The useful metadata – a filename, a content type, a size, perhaps a human-readable label – is small, domain-relevant, and exactly the kind of thing an event should carry. It is enough to reason about the document, to display it in a list, to decide whether to fetch it at all, without ever touching the bytes.
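
As a minimal sketch of the pattern, assuming a local directory as a stand-in for object storage: the helper names and the field names (documentId, filename, contentType, size) are hypothetical, and the event is a plain dictionary rather than a call to any real client library.

```python
import json
import uuid
from pathlib import Path
from tempfile import TemporaryDirectory


def store_document(storage_dir: Path, content: bytes) -> str:
    """Write the bytes once and return an opaque identifier, the claim ticket."""
    document_id = str(uuid.uuid4())
    (storage_dir / document_id).write_bytes(content)
    return document_id


def build_document_uploaded(document_id: str, filename: str,
                            content_type: str, size: int) -> dict:
    """Build the event: a reference plus small metadata, and no bytes."""
    return {
        "type": "DocumentUploaded",
        "data": {
            "documentId": document_id,      # hypothetical field names
            "filename": filename,
            "contentType": content_type,
            "size": size,
        },
    }


with TemporaryDirectory() as tmp:
    # Stand-in for a ten-megabyte PDF.
    pdf_bytes = b"%PDF-" + b"x" * (10 * 1024 * 1024)
    document_id = store_document(Path(tmp), pdf_bytes)
    event = build_document_uploaded(
        document_id, "contract.pdf", "application/pdf", len(pdf_bytes)
    )
    # Serialized JSON size as a rough proxy for the size of the event.
    event_size = len(json.dumps(event).encode("utf-8"))

print(f"document: {len(pdf_bytes)} bytes, event: {event_size} bytes")
```

The document weighs in at ten megabytes, while the event that records its upload stays at a few hundred bytes, far below the 64 KiB limit.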

A Reference, and Maybe a Hash

A bare URL works, but you can do something more elegant for very little extra effort. Compute a cryptographic hash of the content – a SHA-256, say – and put that in the event next to the reference. Now the event does not merely point at a document; it pins down which document, exactly, down to the last byte. When you later fetch the file, you can hash what you got and compare. If the two agree, you are holding precisely what was referenced when the event was written. If they do not, something changed underneath you, and you want to know.
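
The hash-and-compare step could look roughly like this, using Python's standard hashlib module. The URL and the field names in the event data are hypothetical.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 digest of the content as a hex string."""
    return hashlib.sha256(content).hexdigest()


def matches_event(fetched: bytes, event_data: dict) -> bool:
    """True if the fetched bytes are exactly what the event referenced."""
    return sha256_hex(fetched) == event_data["sha256"]


original = b"signed contract, forty pages"

# What would go into the event when it is written.
event_data = {
    "url": "https://storage.example.com/contracts/4711.pdf",  # hypothetical
    "sha256": sha256_hex(original),
}

# Later, after fetching the file again, hash what you got and compare.
unchanged_ok = matches_event(original, event_data)
tampered_ok = matches_event(original + b" plus one extra clause", event_data)

print(unchanged_ok, tampered_ok)
```

The first check passes and the second fails, which is exactly the signal you want when something changed underneath you.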

This buys you integrity almost for free, and it fits the spirit of the system perfectly. EventSourcingDB already hashes every event it stores, which is what makes the history tamper-evident in the first place – the property we explored in Proving Without Revealing and the same instinct behind what aviation teaches us about auditing. Extending that discipline to the documents you keep outside the store is a natural continuation, not a new idea. The event guarantees the fact; the hash guarantees the artifact the fact refers to.

You can go one step further and let the hash be the reference, storing the content under its own fingerprint. This is content-addressable storage, and when it fits it is genuinely lovely: identical files collapse to a single copy automatically, and the reference can never point at the wrong thing, because the reference is the thing's identity. It does not fit everywhere – it complicates deletion, and it assumes the content is meant to be immutable – but where those conditions hold, it is hard to beat. Reach for it when it earns its place, not by reflex.
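
A toy version of content-addressable storage, assuming a local directory as the store and SHA-256 as the fingerprint; the put and get helpers are hypothetical names for this sketch, not the API of any particular product.

```python
import hashlib
from pathlib import Path
from tempfile import TemporaryDirectory


def put(store: Path, content: bytes) -> str:
    """Store content under its own SHA-256 digest and return that digest."""
    digest = hashlib.sha256(content).hexdigest()
    target = store / digest
    if not target.exists():  # identical content is already stored
        target.write_bytes(content)
    return digest


def get(store: Path, digest: str) -> bytes:
    """Load content by digest and check that it still matches its address."""
    content = (store / digest).read_bytes()
    if hashlib.sha256(content).hexdigest() != digest:
        raise ValueError("stored content does not match its address")
    return content


with TemporaryDirectory() as tmp:
    store = Path(tmp)
    first = put(store, b"the same scanned invoice")
    second = put(store, b"the same scanned invoice")
    stored_files = len(list(store.iterdir()))
    round_trip = get(store, first)

print(first == second, stored_files)
```

Uploading the same bytes twice yields the same reference and a single stored file. Deletion is where the sketch stays silent: once several events share one digest, removing the file affects all of them, which is the complication mentioned above.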

Load It Only When You Need It

There is one more benefit to keeping the document outside, and it is the one teams appreciate most once they live with it. When the payload is not in the event, fetching the event no longer means fetching the payload. You get the fact and its metadata immediately, and the heavy content stays put until you actually ask for it.

In real systems, you ask for it far less often than you would guess. Rendering a list of uploaded contracts needs the names, the dates, the statuses – not the contracts. Rebuilding a read model needs the facts, not the forty-page attachments behind them. Auditing who did what and when needs the events, not the bytes. The document itself is needed at one specific moment, when a user clicks to open it, and at that moment one extra network call is a perfectly fair price. The point is that you decide when to pay it, instead of paying it on every read whether you needed the document or not.

That inversion of control is the quiet reward. Embedding the blob makes the expensive case the default and gives you no way out. Referencing it makes the cheap case the default and the expensive case a deliberate, on-demand choice. Most of the time the metadata is all anyone wanted anyway.
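
One way to picture this is a small wrapper that holds the reference and fetches the bytes only on first access. Everything here is an assumption made for illustration: the LazyDocument class, the in-memory dictionary standing in for object storage, and the field names and sizes in the events.

```python
from typing import Callable, Optional

# In-memory stand-in for object storage, with a counter for "network calls".
blobs = {"doc-1": b"...forty pages...", "doc-2": b"...scanned invoice..."}
fetch_count = 0


def fetch_from_storage(reference: str) -> bytes:
    """Stand-in for the one extra network call to object storage."""
    global fetch_count
    fetch_count += 1
    return blobs[reference]


class LazyDocument:
    """Holds the reference and loads the bytes only on first access."""

    def __init__(self, reference: str, fetch: Callable[[str], bytes]) -> None:
        self._reference = reference
        self._fetch = fetch
        self._content: Optional[bytes] = None

    def open(self) -> bytes:
        if self._content is None:
            self._content = self._fetch(self._reference)
        return self._content


events = [
    {"type": "DocumentUploaded",
     "data": {"reference": "doc-1", "filename": "contract.pdf", "size": 482113}},
    {"type": "DocumentUploaded",
     "data": {"reference": "doc-2", "filename": "invoice.png", "size": 91532}},
]

# Rendering the list touches only the metadata in the events.
listing = [f"{e['data']['filename']} ({e['data']['size']} bytes)" for e in events]
fetches_after_listing = fetch_count

# A user clicks the first entry: now, and only now, the bytes are loaded.
document = LazyDocument(events[0]["data"]["reference"], fetch_from_storage)
content = document.open()
document.open()  # a second access is served from the cached bytes
fetches_after_open = fetch_count

print(listing, fetches_after_listing, fetches_after_open)
```

Building the list costs no storage calls at all, and opening one document costs exactly one.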

And When It Has to Disappear

There is a subtler payoff, and it is the one that turns a performance argument into a design principle. Event Sourcing makes a promise that sits awkwardly next to privacy law: events are immutable, and we never erase them. That is a feature when the events hold facts, and a genuine problem when they hold a customer's signed contract and that customer invokes their right to be forgotten. You cannot honor the request without rewriting history, which is precisely the thing the whole approach exists not to do.

Keeping the payload outside dissolves the conflict. The sensitive bytes live in storage you control and can delete, while the event keeps only a reference and a hash. When the law, or simply good housekeeping, says the document must go, you delete the document. The event stays exactly as it was: DocumentUploaded still happened, the fact is still true, the history is still intact. A reference that now points at nothing is not a bug – it is an honest statement that the thing existed and was later, deliberately, removed. You do not falsify the past to forget something; you stop storing the part that was never the fact in the first place. That is the same instinct we followed in Soft Delete Is a Workaround: the history is sacred, but what hangs off it does not have to be.
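
Sketched in code, with a dictionary standing in for storage you control and hypothetical fetch and erase helpers, the erasure touches the payload and nothing else:

```python
import copy
import hashlib
from typing import Optional

# Stand-in for deletable storage outside the event store.
storage = {"doc-1": b"signed contract of a customer"}

# The event holds only a reference and a hash (hypothetical field names).
events = [{
    "type": "DocumentUploaded",
    "data": {
        "reference": "doc-1",
        "sha256": hashlib.sha256(storage["doc-1"]).hexdigest(),
    },
}]
history_before = copy.deepcopy(events)


def fetch(reference: str) -> Optional[bytes]:
    """Return the bytes, or None if the document was deliberately removed."""
    return storage.get(reference)


def erase(reference: str) -> None:
    """Delete the payload only; the event history is never touched."""
    storage.pop(reference, None)


erase("doc-1")

history_unchanged = events == history_before
content_after_erasure = fetch("doc-1")

print(history_unchanged, content_after_erasure)
```

Readers of the event then have to treat a missing document as a normal, expected outcome rather than an error, which is why fetch returns None instead of raising.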

So, About That PDF

Back to the question we started with. What do you do when an event has to carry a document? You realize that it does not have to, and that it never should. The document goes into storage built for documents, a reference goes into the event, a hash goes alongside it if integrity matters, and the bytes are fetched only when someone genuinely needs them. The event goes back to doing its one job: stating, briefly and permanently, that something happened.

The 64 KiB limit, the one that prompted the question, turns out to have been pointing at the answer the whole time. It is not in your way. It is the nudge that keeps your events small, portable, and meaningful, the gentle insistence that a fact and a file are two different things that deserve two different homes.

This is the kind of question we love, because it is never really about the limit – it is about what an event is for. If you are weighing how to model large payloads in your own system, or wrestling with a case where the line is genuinely hard to draw, we would enjoy thinking it through with you. Write to us at hello@thenativeweb.io, and tell us what your events are trying to carry.