Distributed Messaging Patterns

TL;DR: Message driven systems are those that communicate primarily through asynchronous, non-blocking messages.

According to the Reactive Manifesto, a critical element in any Reactive system is that it is message driven. But what does it mean to be message driven? Message driven systems are those that communicate primarily through asynchronous, non-blocking messages. Messages enable us to build systems that are both resilient, and elastic, and therefore responsive under a variety of situations. But when we choose to build systems using asynchronous, non-blocking messages, there are consequences.

One major consequence of building distributed systems is how to deal with transactions. Many solutions have been proposed, such as distributed transactions, and two phase commit protocols, but the reality is that those solutions tend to be too brittle and inflexible for a Reactive system. So how can we deal with problems that resemble transactions in a more resilient way?

As we build out our distributed systems, we are forced to confront the reality that delivering messages in a distributed system is complicated. We need to be careful to ensure that messages are delivered in a way that is predictable. This means guaranteeing that a message is delivered to the correct recipients, but also that the message arrives an expected number of times. In order to provide those guarantees, we need new tools and techniques.

Evolution of Communication

If we try to send an asynchronous message, but we do it in a way where we expect a synchronous response things break down. When we start to do things that really need to be asynchronous and we try to use synchronous messages to accomplish the task, things break down. If we try to use an asynchronous message when a synchronous message is really more appropriate, things break down.

That's really what we're trying to get at here, we need to understand when asynchronous is appropriate and when synchronous is appropriate, and make sure that we're using the right tool for the job, because if we use them incorrectly then our system will start to fall apart.

Message Driven Architecture

One of the key elements of the Reactive Manifesto is that Reactive Systems have to be built on a Message Driven Architecture. That is the foundation of a Reactive System, and it's what allows the system to be Reactive.

Asynchronous and Non-Blocking

Reactive Systems put an emphasis on Async, Non-Blocking Messages
Messages are sent without waiting for a response.
The sender may be interested in a response, but it comes asynchronously.

Advantages to Asynchronous and Non-Blocking

Resources (threads, memory, etc.) can be freed immediately.
Reduced contention means potential for higher scalability.
Messages can be queued for delivery in case the receiver is offline.

The Role of Synchronous Messages

Async messages, should form the backbone of Reactive Systems
Synchronous messages can be used, but their requirements can often be relaxed.
- Can you acknowledge the receipt of a message synchronously, but process it asynchronously?
The need for synchronous messages should be driven by domain requirements rather than technical convenience.
Best Practice: Use asynchronous by default, fallback to synchronous only when forced.
- Understand the consequences of the choice.

The Cost of Asynchronous Messages

In a monolithic architecture, coordination and consensus building is often done using transactions.
Microservices, Distributed Systems, and Async Messages all make transactions more difficult.
Holding transactions open for long periods results in slow, brittle systems.
How can we manage long running transactions that span multiple microservices?

Sagas

The Saga Pattern is a way of representing a long running transaction. Sagas can be represented as Finite State Machines.

Multiple requests are managed by a Saga.
Individual requests are run in sequence or in parallel.
When all requests are complete, the Saga is complete.
What happens when a request fails?

Failure in Sagas

Each request is paired with a compensation action.
If any request fails, compensating actions are executed for all completed steps.
The saga is then completed with a failure.
If a compensating action fails, then it is retried until it succeeds. This requires idempotence.

Timeouts in Distributed Systems

In a Distributed System, timeouts present a difficult problem.
A request timeout presents us with at least 3 possible scenarios.
- The request failed.
- The request succeeded but the reply failed.
- The request is still queued and may succeed (or fail) eventually.
Because the request may have succeeded, we need to run the compensating action.
The request may still be queued, therefore the compensating action may complete first.
- Therefore, compensating action and request must be commutative.

Compensating vs Rolling Back

Compensating actions are different from rollbacks.
Rollback:
- Implies the transaction has not completed.
- Removes evidence from the transaction.
Compensation:
- Applied on top of a previously completed action.
- Evidence of the original action remains.

For example, if we want to apply a compensating action for something like a bank transaction, "I deposit five dollars into my account", a rollback action would be to eliminate that deposit. We erase the deposit from the log, so that when I look at my account, I never see that I attempted to deposit five dollars. Now with a compensating action, the deposit of five dollars is already there. We don't erase it. What we do is we apply a compensating action on top of it, like for example withdraw $5. So withdrawing the $5 will reduce our balance to what we expect it to be, but it will leave the evidence of the original deposit behind, and that's the difference. We leave the evidence behind.

Moving Forward with Failures

Saga’s apply compensating actions to undo changes made during previous states.
An alternative approach would be to move forward despite the failures.
A saga is coupled to the failures, but if we can move forward we break this coupling.
Eg.
- Retries.
- Fallback to cached or default values.
- Move failures to an error queue and precise them separately.

Messaging Patterns

Two Generals

The Generals Problem illustrates the impossibility of reaching a consensus over an unreliable communication channel:

Two armies try to coordinate a synchronized attack.
Messengers travel through enemy territory and may be killed or captured.
The potential for a loss of messengers results in an infinite acknowledgement chain.
No matter how many acks are sent, neither A nor B can ever be 100% sure that the other general is ready to attack.

When we talk about distributed systems, that unreliable communication channel is basically the network. At least at the moment, it's impossible for us to build a bulletproof network. There's always going to be the possibility that messages will be dropped, or delayed, for some period of time. And there's just no way to avoid that, at least with today's technology.

In a distributed system, achieving strong consistency is impossible, instead we have to look for things like eventual consistency and we have to use tools to achieve that.

Delivery Guarantees

The two generals problem shows that over an unreliable network we can to guarantee message receipt. This means that “Exactly Once” delivery of messages is impossible.

Instead, we must be satisfied with either:
- At Most Once
- At Least Once.

We can simulate Exactly Once delivery using At Least Once delivery and Idempotence.

At Most Once Delivery

At most once delivery promises that no message will ever be delivered more than once. An effort is made to deliver the message, but if a failure occurs we never retry.

No retry means:

Messages are never duplicated.
Messages may be lost.

At most once delivery is easy to implement. An example of this kind of messages are simple and plain HTTP requests. It requires no storage of messages.

At Least Once Delivery

At least once delivery guarantees that all messages will eventually be delivered. When a failure occurs:

The message may not have been delivered.
The message may have been delivered, but not acknowledged.

Failure always results in a retry. Meaning that messages may be delivered more than once, but they are never lost.

At least once delivery requires messages to be stored by the sender to enable retries. The store needs to be durable (database, files, etc.) never in memory.

Technically, at least once delivery is also not possible to guarantee because it is possible that the reason that the message didn't get through is that the receiver failed, and then the receiver never comes back. If the receiver never comes back then we can't guarantee delivery, because there's nobody to receive it. Having said that, we generally accept that at least once delivery is possible, because, if your receiver goes away and never comes back you've got bigger problems, and you're going to need to fix those problems pretty quickly.

Exactly Once Delivery

Exactly Once Delivery is not possible. In the event of a network partition, or lost message, we can’t guarantee whether our message was received. Failure requires resending the message, which creates potential duplicates.

Exactly Once Delivery is simulated using:

At Least Once Delivery.
Deduplication or Idempotence of messages.
Requires storage on both the sender and the receiver.

It's very tempting to just do at least once delivery everywhere. That's usually not necessary. Try to do at least once delivery at the edges of your system. So when you're communicating between other micro services, for example you might want at least once delivery there, but within your system at most once is usually fine. You can usually fall back to the at least once delivery that's happening on the edge in order to fix any problems that maybe happen somewhere inside the system.

Messaging Patterns

When managing communications between Microservices, there are two distinct approaches we can take.

Each microservice can directly depend on other microservices, sending messages in a point to point fashion.
Each microservice can leverage a publish/subscriber message broker or bus to decouple it from other services.

Point To Point

Each service depends on other services directly.
Services are directly coupled to each other’s API.
Services know about and understand their dependencies.

Publish/Subscribe

Services publish messages to a common message bus.
Other services subscribe to the messages.
The publishing service has no knowledge of the subscribing services.
Subscribing services also have no knowledge of the publishing services.
Services are coupled to the message format, not to each other.

Point To Point vs Publish/Subscribe

Point To Point
- Dependencies are more clear and distinct than with Pub/Sub
- Coupling is higher.
- Complexity is more directly observable.
Pub/Sub
- Dependencies are more flexible.
- Coupling is lower.
- Complexity may be harder to see.

Building systems that have a backbone of pub/sub with a little bit of point-to-point is a probably a good idea. Building systems that are entirely based on point-to-point it's certainly possible but it does create a lot of coupling that you may want to avoid. It's a bit of a balance in most systems. Most systems will have some pub/sub and some point-to-point, depending on the use case that you're aiming for.

Message Bus

Both Point To Point and Pub/Sub can be done internally or they may leverage an external service. An external message bus may be used as the transport for messages. Not to be confused with an Enterprise Service Bus. In Reactive Systems these tend to be more lightweight and focused. They may use different combinations of Point To Point and Pub/Sub with At Most Once Delivery or At Least Once Delivery. Some examples of those kind of services could be Kafka, Rabbit MQ, Kinesis, etc.