CQRS & Event Sourcing

TL;DR: Command Query Responsibility Segregation & Event Sourcing are techniques used to achieve both resilience and elasticity in a system.

The Reactive Manifesto states that a system needs to be both Resilient and Elastic in order to achieve Responsiveness. It does this by leveraging a Message Driven framework. But what do those messages look like? And how are they used to construct systems that fit the Reactive principles?

Those messages often take the form of events, which can be used to construct views of the model using a pattern called Command Query Responsibility Segregation (CQRS). This pattern allows us to decouple our Command model from our Query model. By decoupling the models we create isolation in our system. This in turn allows us to build our system in a way that is more elastic and more resilient.

Broadcasting the events in order to build the corresponding read models requires that we have a technique for persisting the events. If we were to lose the events for any reason, it would jeopardize the consistency of the queries. One way to ensure that the models remain consistent is to use a technique called Event Sourcing (ES). With Event Sourcing, we eliminate the persistence of state, and instead focus only on persisting the events. This provides us with a powerful persistence model that is capable of adapting over time.

CQRS and ES allow us to build flexible, decoupled models. Those models can be scaled and replicated, allowing us to achieve the goals of the Reactive Manifesto. The techniques aren't without costs, but when properly implemented, they result in systems that are able to evolve with the changing environment.

What is CQRS/ES

CQRS/ES stands for Command Query Responsibility Segregation and Event Sourcing. They are two different tools that can be used independently, but are often combined. When combined, they give us a set of techniques that allow us to build applications that are more resilient and more elastic.

In essence, they give us a way to build applications that are more Reactive. We do have to be careful, though: these tools come with a cost. In all cases we should look at what our business needs are and decide whether that cost is worth paying in order to meet them.

Keep that cost in mind when you decide whether to adopt these tools. One case where you would want to consider adopting CQRS/ES is when you need applications that are auditable, that is, when you need to know not just what the current state of your application is, but also how you got there.

Banking is a good example of this. In banking, auditability is generally a legal requirement. You need to know not just how much money is in the bank, but how you got to that amount. Accounting is another example. When you keep accounting ledgers, you need to know how you got to the final balance. It's not sufficient just to say what that balance is. Therefore, if auditability is important in your business, then it's worth looking at CQRS/ES.

We should also consider whether this is a portion of the business that provides competitive advantage. If this is a piece of your business that you actually believe will give you an edge over your competitors, then CQRS/ES may be valuable for you. It may give you some of the tools and techniques that allow you to do things that your competitors can't. On the other hand, if this is a small portion of the business that could potentially be replaced by an off-the-shelf product, and it isn't really giving you any competitive advantage, then CQRS/ES may be overkill for you.

If, for example, you need high scalability, or you need your application to be extremely resilient, that might be a case where you want to look at CQRS/ES. But if your application is serving only a couple of hundred requests a minute, and it doesn't really matter that much if it goes down for a period of time, then CQRS/ES is probably overkill.

Event Sourcing

State Based Persistence

State Based Persistence means each update replaces the previous state with a new state.

Adapting to An Evolving Domain

  • The requirements in your domain are not always fixed.
  • Changes to the requirements can show up for many reasons:
    • External legal requirements.
    • New business opportunities.
    • Better domain understanding.
  • We can introduce changes to the model to accommodate the new requirements as they come up.
  • How do we capture changes that require us to look into the past history?

Example: Changing Reservations

  • Our system captures reservations and their locations.
  • Management wants to know how often a reservation is changed from one location to another.
    • This information may help them to plan new locations, it may help with staffing, etc.
  • We can adapt our system to start recording this information in the future.
  • How can we get this information from the past?

Correcting Past Mistakes

  • Despite our best efforts, errors creep into our code.
  • Those errors can result in incorrect state in our databases.
  • We can correct the errors when they happen. This fixes any new state.
  • How can we fix the incorrect state that has already been persisted?

Example: Cancelling Reservations

  • There is a bug in the code.
  • It turns out that when someone wanted to change their reservation, they were actually cancelling it.
  • Cancellations result in a row being deleted.
  • The bug is fixed, but now we need to know all the reservations that were moved during a specific period of time.
  • Is this easy to determine?

The Problem With State

  • Keeping the current state captures where you are, not how you got there.
  • You can’t retroactively apply new domain insights.
  • You can’t fix bad state caused by issues in the code.
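
The reservation examples above can be sketched in a few lines. This is only an illustration, assuming a hypothetical in-memory `reservations` table keyed by reservation ID; the function names `move_reservation` and `cancel_reservation` are invented for the example.

```python
# Hypothetical state-based store: one row per reservation, keyed by ID.
reservations = {"r1": {"customer": "alice", "location": "Downtown"}}

def move_reservation(reservation_id: str, new_location: str) -> None:
    # The update overwrites the old location; the previous value is gone.
    reservations[reservation_id]["location"] = new_location

def cancel_reservation(reservation_id: str) -> None:
    # A cancellation deletes the row, so nothing records that it ever existed.
    del reservations[reservation_id]

move_reservation("r1", "Airport")
location_after_move = reservations["r1"]["location"]  # only the current value survives

cancel_reservation("r1")
# At this point the store cannot tell us that r1 was ever at "Downtown",
# that it moved, or that it was cancelled.
```

Once both calls have run, the store holds no trace of the move or the cancellation, which is exactly the information management asked for.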

Event Sourcing

Building an Audit Log

  • In addition to persisting state, you persist an audit log.
  • The log captures the history and can be used to fix issues with the state.
  • It can be used to retroactively develop new domain insights.
  • What if the audit log gets out of sync with the state (due to a bug, for example)?
    • Which is the source of truth, the log or the state? The obvious answer is the log, because it has the full history of everything that happened, so we can always use it to rebuild the state. So let's say the audit log is the source of truth. But if the audit log is the answer, why do we need the state?

Event Sourcing

  • State based persistence models fail to capture intent.
  • Event Sourcing eliminates the persistence of the state. It captures only the intent.
  • Intent is captured in the form of events which are stored in the log.
  • This log is the single source of truth of the system.
  • Event sourcing captures the journey, rather than the destination.
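
As a rough sketch of what "persisting only the events" can look like, the following uses an in-memory list as the log. The event classes and the `record` helper are hypothetical names chosen for this illustration; a real system would use a durable store.

```python
from dataclasses import dataclass

# Events are immutable facts, so the dataclasses are frozen.
@dataclass(frozen=True)
class ReservationCreated:
    reservation_id: str
    location: str

@dataclass(frozen=True)
class ReservationMoved:
    reservation_id: str
    new_location: str

@dataclass(frozen=True)
class ReservationCancelled:
    reservation_id: str

event_log: list = []  # stand-in for a durable, append-only store

def record(event) -> None:
    # The only write operation is an append; nothing is updated or deleted.
    event_log.append(event)

record(ReservationCreated("r1", "Downtown"))
record(ReservationMoved("r1", "Airport"))
record(ReservationCancelled("r1"))

# The intent survives: we can still see that r1 was moved before it was cancelled.
moves = [e for e in event_log if isinstance(e, ReservationMoved)]
```

Compared with the state-based approach, the cancellation is just another fact in the log rather than a deleted row.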

Recovery

  • When an Event Sourced object is loaded, rather than loading the final state, we replay the events.
  • Each event is replayed causing the same update to the state that occurred from the original command.
  • It is important when replaying events to avoid replaying side effects.
    • The side effect has already occurred. We don’t need to do it again.

Replaying usually involves a short list of events, but in some cases the list can be long. The average case might be five to ten events, while at the upper limits we might get into hundreds or thousands of events. Eventually, replaying all of those events could become problematic.
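
One way to keep side effects out of replay is to separate the pure state transition from the command handler that triggers the side effect. The sketch below assumes a hypothetical `Reservation` object, and an `emails_sent` list stands in for any external side effect.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ReservationCreated:
    location: str

@dataclass(frozen=True)
class ReservationMoved:
    new_location: str

emails_sent = []  # stand-in for an external side effect (hypothetical)

class Reservation:
    def __init__(self) -> None:
        self.location: Optional[str] = None

    def apply(self, event) -> None:
        # Pure state transition: used both when handling commands and when replaying.
        if isinstance(event, ReservationCreated):
            self.location = event.location
        elif isinstance(event, ReservationMoved):
            self.location = event.new_location

    def handle_move(self, new_location: str, log: list) -> None:
        event = ReservationMoved(new_location)
        log.append(event)   # persist the event
        self.apply(event)   # update in-memory state
        emails_sent.append(f"moved to {new_location}")  # side effect: command path only

def load(log: list) -> Reservation:
    # Recovery replays events through apply() only, so no email is re-sent.
    reservation = Reservation()
    for event in log:
        reservation.apply(event)
    return reservation

log = [ReservationCreated("Downtown")]
live = load(log)
live.handle_move("Airport", log)

recovered = load(log)  # e.g. after a restart
```

After recovery the object is back at "Airport", and `emails_sent` still contains a single entry.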

Snapshots

  • Eventually, the time to replay all events may be too long.
  • In addition to persisting events, we can periodically persist a snapshot.
  • A snapshot captures the current state of the object.
  • When replaying, we can start from the most recent snapshot and replay only the events after that snapshot.
    • Note: Snapshots are an optimization. Be wary of optimizing prematurely.
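
A minimal sketch of snapshot-based recovery, assuming a hypothetical account whose state is a single balance. The `Snapshot` type and its `last_sequence_nr` field are illustrative names, not part of any particular library.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Deposited:
    amount: int

@dataclass(frozen=True)
class Snapshot:
    balance: int
    last_sequence_nr: int  # how many events are already folded into the snapshot

def apply(balance: int, event: Deposited) -> int:
    return balance + event.amount

def take_snapshot(events: list) -> Snapshot:
    balance = 0
    for event in events:
        balance = apply(balance, event)
    return Snapshot(balance, len(events))

def recover(events: list, snapshot: Optional[Snapshot] = None) -> int:
    # Start from the snapshot if there is one, then replay only the newer events.
    balance = snapshot.balance if snapshot else 0
    start = snapshot.last_sequence_nr if snapshot else 0
    for event in events[start:]:
        balance = apply(balance, event)
    return balance

events = [Deposited(10), Deposited(20), Deposited(5)]
snapshot = take_snapshot(events[:2])  # snapshot taken after the first two events
events.append(Deposited(7))

from_snapshot = recover(events, snapshot)  # replays only the last two events
from_scratch = recover(events)             # replays all four events
```

Both paths must produce the same state. The snapshot only shortens the replay, and the events remain the source of truth.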

Advantages of Event Sourcing

  • Creates a built in audit log showing where the current state comes from.
  • Allows you to rewind or undo the changes back to a particular time.
  • Enables you to correct errors in your state by fixing computation and replaying the events.
  • Append-only is usually more efficient in databases.
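
Rewinding to a particular time can be sketched as a replay that stops early. The example assumes each event carries a hypothetical logical timestamp `at`; the `BalanceChanged` event is invented for the illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BalanceChanged:
    at: int      # hypothetical logical timestamp
    delta: int

log = [BalanceChanged(1, 100), BalanceChanged(2, -30), BalanceChanged(3, 50)]

def balance_at(events: list, as_of: int) -> int:
    # Replay only the events recorded up to and including the requested time.
    return sum(event.delta for event in events if event.at <= as_of)

current = balance_at(log, as_of=3)
earlier = balance_at(log, as_of=2)
```

The same log answers both "what is the balance now?" and "what was it at time 2?" without storing either value.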

Evolving the Model

Whether we use state-based persistence or event sourcing, there's going to come a time when we will have to evolve our model. We may have some new domain insight that requires us to add something, or we may just want to add a new feature. Whatever the case is, we're going to have to evolve the data model that we've been using up to this point. With state-based persistence techniques, this isn't too bad. We may have to add a new column, for example, or a new table. We may have to provide some default values. It gets more complicated, though, when we do event sourcing.

Event Integrity

  • Event sourced solutions are only as good as the log. It is important to maintain its integrity.
  • The events in the log are facts. They represent the history of the system.
  • History should never be rewritten, therefore, events must be immutable.
  • Your event log is append only. You never update or delete.

Versioning Events

  • Over time, the events may need to change.
  • Because the events are immutable we can’t change them.
  • This requires us to create a versioning scheme for the events.
  • Best Practice: Use a flexible format for the log (e.g. Protobuf or JSON rather than language-native object serialization).

This means that we need a versioning scheme for the events. If our original event was "ReservationMoved", we need something like "ReservationMovedV1", "ReservationMovedV2", "ReservationMovedV3", and so on. Every time we have to make a change to the event, we don't go back and modify the existing events. Instead, we introduce a new version. That allows us to guarantee the integrity of the log.

Our log remains append-only, yet we can still update the events by adding, renaming, or removing fields. Versioning also creates some issues. We have to maintain support for all of the old versions: the software must continue to be able to read and parse them, so old version support has to stay around for a long period of time, theoretically forever.
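
One common way to keep old versions readable, sometimes called upcasting, is to convert an old payload to the newest shape as it is read, leaving the stored event untouched. The sketch below assumes events are stored as JSON with a `version` field; the field names (`location`, `new_location`, `reason`) and the default value are hypothetical.

```python
import json

# Stored events are never modified; both versions live side by side in the log.
stored = [
    '{"type": "ReservationMoved", "version": 1, "id": "r1", "location": "Airport"}',
    '{"type": "ReservationMoved", "version": 2, "id": "r1", '
    '"new_location": "Harbour", "reason": "customer request"}',
]

def upcast_v1_to_v2(payload: dict) -> dict:
    # Assumed change: V2 renamed "location" to "new_location" and added "reason".
    return {
        "type": payload["type"],
        "version": 2,
        "id": payload["id"],
        "new_location": payload["location"],
        "reason": "unknown",  # default for events recorded before the field existed
    }

def read_event(raw: str) -> dict:
    payload = json.loads(raw)
    if payload["version"] == 1:
        payload = upcast_v1_to_v2(payload)
    if payload["version"] != 2:
        raise ValueError(f"unsupported version: {payload['version']}")
    return payload

events = [read_event(raw) for raw in stored]
```

The rest of the application only ever sees the V2 shape, but the V1 reader has to be maintained for as long as V1 events exist in the log.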

Command Sourcing

Command Sourcing is similar to event sourcing. Commands are persisted upon receipt, before being executed. Domain objects can be constructed or updated by executing the commands. Command Sourcing allows commands to be processed asynchronously.

Challenges with Command Sourcing

  • The commands may be executed multiple times (due to failures, etc.). They must be idempotent.
  • If they are not validated first, then the commands could get stuck in the queue as they continually fail.
  • In the event of a bad command, we are decoupled from the sender, so we cannot inform them of the problem.
  • It is often safer to perform validation first, then emit an event to update the state.
  • E.g.: AddItem → AddItemValidated → ItemAdded
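
The validate-first flow can be sketched as follows. This is one possible interpretation, in which validation happens while the sender is still connected, the validated command is queued, and processing is made idempotent by tracking a hypothetical `command_id`. The in-memory `queue`, `events`, and `processed_ids` are stand-ins for durable storage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AddItem:
    command_id: str
    item: str
    quantity: int

@dataclass(frozen=True)
class AddItemValidated:
    command_id: str
    item: str
    quantity: int

@dataclass(frozen=True)
class ItemAdded:
    command_id: str
    item: str
    quantity: int

queue = []             # persisted, validated commands awaiting processing
events = []            # resulting event log
processed_ids = set()  # used to make processing idempotent

def submit(command: AddItem) -> None:
    # Validate while the sender is still waiting, so a bad command is rejected
    # immediately instead of getting stuck in the queue.
    if command.quantity <= 0:
        raise ValueError("quantity must be positive")
    queue.append(AddItemValidated(command.command_id, command.item, command.quantity))

def process_queue() -> None:
    for validated in queue:
        if validated.command_id in processed_ids:
            continue  # already handled; a redelivery must not add the item twice
        processed_ids.add(validated.command_id)
        events.append(ItemAdded(validated.command_id, validated.item, validated.quantity))

submit(AddItem("c1", "widget", 2))
process_queue()
process_queue()  # simulated redelivery after a failure

try:
    submit(AddItem("c2", "widget", 0))
    rejected = False
except ValueError:
    rejected = True
```

The second `process_queue()` call adds nothing, and the invalid command never reaches the queue.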

CQRS

Read Models VS Write Models

When we start doing event sourcing, usually if we're doing something like domain driven design, we would event source our aggregates or our entities.

Event Sourcing Your Aggregates

  • Event Sourcing, when combined with Domain Driven Design, usually starts with the Aggregates or Entities.
  • Entities or Aggregate Roots create events which are then persisted.
  • Problems arise when you need to perform queries that can’t be answered by a single aggregate root.

Complexity of Queries

  • Queries become problematic because an entity or aggregate root must be rebuilt from events.
  • Querying across multiple aggregate roots requires you to rebuild all of them and query them individually.
    • You can’t simply do a database query.

Conflicting Models

  • The problem is we have conflicting concerns within our model.
  • This can happen even when you don’t do event sourcing.
  • The model that you use to persist is often not compatible with the model you want to use for queries.

As complexity increases, supporting all queries through aggregates becomes problematic. The requirements for read and write are very different and supporting both in a single model may not be a good idea. Command Query Responsibility Segregation (CQRS) aims to separate the two.

Simple CQRS

  • The CQRS concept is actually very simple at its core.
  • The write model is used for processing commands (writes).
  • One or more read models are produced to handle queries (reads).
  • Both read and write models are optimized for their precise purpose.

CQRS + Event Sourcing

  • CQRS and Event Sourcing are often combined.
  • Commands go through the write model and events are persisted.
  • A separate process consumes those events.
  • A denormalized model, called a Projection, is created and is used by the read side.

The idea here is to store the events for write purposes. For many read purposes, however, reading those events directly is not ideal. That's why a separate projection is created, optimized for the read model.

In the reservations case, we would have all the reservation events, but in the denormalized store we might have something that organizes reservations by customer, something that organizes reservations by location, and whatever else we need to support our queries. That way, the read model can go directly to that denormalized store and read the data out exactly as it needs it. This makes reads very fast, because there's not a lot of processing. We don't need complex joins or complex queries. We just read the data exactly as it was stored in the database.
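
A minimal sketch of such a projection, assuming hypothetical `ReservationCreated` and `ReservationMoved` events and using in-memory dictionaries in place of a denormalized database:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class ReservationCreated:
    reservation_id: str
    customer: str
    location: str

@dataclass(frozen=True)
class ReservationMoved:
    reservation_id: str
    new_location: str

class ReservationProjections:
    """Consumes events and maintains two denormalized, query-ready views."""

    def __init__(self) -> None:
        self.by_customer = defaultdict(set)
        self.by_location = defaultdict(set)
        self._location_of = {}  # bookkeeping needed to handle moves

    def on_event(self, event) -> None:
        if isinstance(event, ReservationCreated):
            self.by_customer[event.customer].add(event.reservation_id)
            self.by_location[event.location].add(event.reservation_id)
            self._location_of[event.reservation_id] = event.location
        elif isinstance(event, ReservationMoved):
            old = self._location_of[event.reservation_id]
            self.by_location[old].discard(event.reservation_id)
            self.by_location[event.new_location].add(event.reservation_id)
            self._location_of[event.reservation_id] = event.new_location

projections = ReservationProjections()
for event in [
    ReservationCreated("r1", "alice", "Downtown"),
    ReservationCreated("r2", "bob", "Downtown"),
    ReservationMoved("r1", "Airport"),
]:
    projections.on_event(event)

# Queries are plain lookups; no aggregate has to be rebuilt and no joins are needed.
downtown = projections.by_location["Downtown"]
alices = projections.by_customer["alice"]
```

Each view is shaped for exactly one query, which is why the read side stays simple even as the number of queries grows.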

Flexibility of CQRS/ES

  • CQRS allows you to build a very flexible model.
  • The write side can be highly optimized for write purposes.
  • The read side can be highly optimized for read purposes.
  • The read and write side can use different databases if necessary (Polyglot Persistence).

Evolving the Model

  • Creating new Projections is easy in a CQRS/ES based system.
  • Because the full history is captured in the events, new projections are retroactive.
  • The read model and write model are decoupled, and can be evolved independently from each other.
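
To illustrate how a new projection can be retroactive, the sketch below answers the earlier question about how often reservations move from one location to another by replaying a hypothetical event history. The event classes and the `build_move_counts` function are illustrative names.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class ReservationCreated:
    reservation_id: str
    location: str

@dataclass(frozen=True)
class ReservationMoved:
    reservation_id: str
    new_location: str

# History that was recorded long before anyone asked for this report.
history = [
    ReservationCreated("r1", "Downtown"),
    ReservationMoved("r1", "Airport"),
    ReservationCreated("r2", "Downtown"),
    ReservationMoved("r2", "Airport"),
    ReservationMoved("r1", "Harbour"),
]

def build_move_counts(events: list) -> Counter:
    # New projection: number of moves per (from_location, to_location) pair.
    current = {}
    counts = Counter()
    for event in events:
        if isinstance(event, ReservationCreated):
            current[event.reservation_id] = event.location
        elif isinstance(event, ReservationMoved):
            counts[(current[event.reservation_id], event.new_location)] += 1
            current[event.reservation_id] = event.new_location
    return counts

move_counts = build_move_counts(history)
```

The write model did not change at all; the new view is derived entirely from events that already existed.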

Consistency, Availability, & Scalability with CQRS

Fine Grained Microservices

Models as Microservices

  • CQRS allows you to further break apart your bounded context into microservices.
  • The read and write models can each live in their own separate microservices if necessary.
  • You can take it further and have each projection live in its own microservice.
  • Caution: You can have very fine grained microservices with this approach, but it may not be worth it. Make sure you understand your needs before turning everything into a microservice.

We can look at optimizing the queries using CQRS and event sourcing when they become a problem. The great thing about CQRS and event sourcing is that, because we have the full history, we can always make those decisions retroactively, and we will still have all of the data since the beginning of our application. That's the power here. We can start small and then expand as necessary, which allows us to be very pragmatic when we build.

Consistency in CQRS/ES

  • Simple CQRS (without ES) has the same consistency guarantees as any non-CQRS based system.
  • CQRS/ES based systems can be implemented with different consistency concerns for the read and write models.

Write Model Consistency

  • Strong Consistency is often important in the write model.
  • Decisions and computations are often made based on the current state.
  • Making decisions with stale data could result in an incorrect state.
  • The write model often leverages transactions, locks, or more reactive approaches like sharding.

Read Model Consistency

  • Read Model Consistency is more complicated.
  • It is important to understand that strong consistency only matters when you are writing data.
    • When you write data, you may need to ensure the write is based off of the current state.
  • Pure reads are never strongly consistent.
    • The data you read might be changed immediately after you read it.
    • Pure reads are always working with stale data.
    • You can’t lock the data for the duration of the read. It may be unbounded.
  • Read Models don’t need strong consistency.

Scalability in CQRS/ES

  • Because they are separate, the read and the write models can be scaled independently.
  • Most applications are read heavy. The eventually consistent nature of the read side allows high scalability.
    • Eventual consistency is built into the model, so caching is easy.
    • If necessary, different microservices can host different projections, allowing further scalability.
    • Multiple copies of the same projection can be created.
  • The write side of CQRS/ES often requires stronger consistency. It can be scaled using techniques like sharding.
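
A rough sketch of sharding the write side, assuming events are routed by aggregate ID so that every event for a given aggregate lands on the same shard. `NUM_SHARDS`, `shard_for`, and the in-memory `shard_logs` are hypothetical, and a real deployment would also have to handle rebalancing.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(aggregate_id: str) -> int:
    # crc32 is stable across processes, unlike Python's built-in hash() for strings.
    return zlib.crc32(aggregate_id.encode("utf-8")) % NUM_SHARDS

# Stand-in for one event log (or one writer) per shard.
shard_logs = {n: [] for n in range(NUM_SHARDS)}

def append_event(aggregate_id: str, event: str) -> None:
    # All events for one aggregate go to the same shard, so that shard can order
    # them without coordinating with the others.
    shard_logs[shard_for(aggregate_id)].append((aggregate_id, event))

append_event("r1", "ReservationCreated")
append_event("r2", "ReservationCreated")
append_event("r1", "ReservationMoved")

shards_holding_r1 = {
    n for n, log in shard_logs.items() if any(agg == "r1" for agg, _ in log)
}
```

Because each aggregate has exactly one home, strong consistency is only needed within a shard, and shards can be added to increase write capacity.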

Availability in CQRS/ES

  • The write model is often strongly consistent and therefore sacrifices have to be made in availability.
  • The read model is eventually consistent, so high availability is possible.
  • In failure scenarios, the system may have to disable the writing of data, while still allowing data to be read.

Pros and Cons of CQRS/ES

Pros

  • CQRS/ES is often criticized for being complex, but it can be simple.
  • Without CQRS, models become bloated, complex, and rigid.
  • CQRS allows smaller models, which are easier to modify and understand.
  • Eventual Consistency in CQRS can be isolated to where it is necessary.
    • All systems have Eventual Consistency in them. CQRS is just more explicit about it.

Cons

  • Usually results in an increased number of objects/classes (commands, events, read models, etc.).
  • May result in introducing multiple data stores, potentially of different types.
  • UI must be designed to accept the Eventually Consistent architecture.
  • Maintaining support for old event versions can become challenging.
  • May require additional storage due to long event history and read model data duplication.
  • Data duplication can result in synchronization issues.
    • These are solved by rebuilding projections.