Event-Driven Architecture: What You Need to Know

Welcome to part one of our four-part series on Event-Driven Architecture (EDA).

This first post covers the basics of an EDA and why you might want to consider building one.

The second post will cover whether an EDA is right for your company and, if so, how to win organizational buy-in. In the same post we'll also explore some of the basic questions you need to consider before committing, such as:

Which event router should you run?
Should you run the event router yourself?
How should you organize events?

In the third part, entitled Building for Failure, we will discuss common failure modes of event-driven systems, as well as some patterns you can implement to help manage these challenges.

The final post will cover long-term ownership and maintenance, the least sexy part of any software project and therefore rarely discussed in detail. We hope to change that!

Encore loves Go, so where necessary and possible, examples will be in Go and may reference the Go ecosystem (for example, in the maintenance post we will discuss managing Go modules).

What Is an Event-Driven Architecture and Why Do I Need One?

An event-driven architecture (EDA) is a means of building distributed systems that allows individual services to be decoupled and have their own software lifecycle; that is, they can be built, shipped, scaled (and also fail) independently of other services.

Instead of communicating by more traditional means such as REST, services communicate by producing an event to some sort of router which ensures it reaches other services that are interested in it.
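To make the idea of a router concrete, here is a minimal in-memory sketch. It is not how a production router such as Kafka or Pub/Sub works internally; the `Router` and `Event` types and the "userCreated" event name are made up for illustration. The point is only that the producer calls `Publish` and never learns which services are listening.

```go
package main

import "fmt"

// Event is a hypothetical message shape: a name plus an opaque payload.
type Event struct {
	Name    string
	Payload string
}

// Router is a toy, in-memory stand-in for a real event router.
type Router struct {
	subscribers map[string][]func(Event)
}

// NewRouter returns a router with no subscribers.
func NewRouter() *Router {
	return &Router{subscribers: make(map[string][]func(Event))}
}

// Subscribe registers a handler for every event with the given name.
func (r *Router) Subscribe(name string, handler func(Event)) {
	r.subscribers[name] = append(r.subscribers[name], handler)
}

// Publish hands the event to every interested handler and reports how many
// handlers received it. The producer never learns who they are.
func (r *Router) Publish(e Event) int {
	handlers := r.subscribers[e.Name]
	for _, handle := range handlers {
		handle(e)
	}
	return len(handlers)
}

func main() {
	router := NewRouter()

	// Two independent "services" express interest in the same event.
	router.Subscribe("userCreated", func(e Event) { fmt.Println("email service saw:", e.Payload) })
	router.Subscribe("userCreated", func(e Event) { fmt.Println("billing service saw:", e.Payload) })

	delivered := router.Publish(Event{Name: "userCreated", Payload: "user 42"})
	fmt.Println("delivered to", delivered, "subscribers")
}
```

Adding a third subscriber would not require any change to the producer, which is the decoupling described above.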

Events typically contain information that falls into two categories:

Changes in state (user created, user updated, user deleted).
Identifiers (order with ID 12345 has now been dispatched, block 64564 has been mined).
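As a rough sketch, an event carrying both kinds of information could be modelled like this in Go. The field names and the JSON encoding are assumptions for illustration; your own envelope may look quite different (and may well use Protobuf instead of JSON).

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event is a hypothetical envelope; the field names are illustrative only.
type Event struct {
	// ID uniquely identifies the event itself.
	ID string `json:"id"`
	// Type names the state change, for example "order.dispatched".
	Type string `json:"type"`
	// SubjectID identifies the thing that changed, for example an order ID.
	SubjectID string `json:"subject_id"`
	// OccurredAt records when the change happened.
	OccurredAt time.Time `json:"occurred_at"`
}

// encode serializes an event to JSON so it can be handed to a router.
func encode(e Event) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", fmt.Errorf("encode event: %w", err)
	}
	return string(b), nil
}

func main() {
	e := Event{
		ID:         "evt-1",
		Type:       "order.dispatched",
		SubjectID:  "12345",
		OccurredAt: time.Date(2024, 1, 2, 15, 4, 5, 0, time.UTC),
	}

	s, err := encode(e)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```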

EDAs give you lots of advantages once you hit a certain level of scale (both in terms of requests served and engineers). The biggest are:

You can "scale and fail" independently.
You can build an audit log system relatively easily.
You can move fast on stable infrastructure.
Teams can choose the programming language of their choice.
You can potentially save costs.

Let's dig into each of these a little more.

"Scale and Fail" Independently

When building an EDA, services are not aware of each other. They are only aware of the event router. This means that if one service in our system is experiencing issues or fails, it does not have an impact on others in the system. This is powerful for isolating failure in your system and ensuring that you can still service other workloads and, therefore, your customers.

This ability to isolate failure is not unique to an EDA though, and can also be achieved with well-planned microservices. However, what is unique is what happens when a failing service recovers.

In synchronous microservices, the whole time a service is down, it is not able to serve requests. In our event-driven system, the event router continues to buffer events, and once our service is available again, it can begin processing them. Instead of having to reject entire workloads, we have made a trade-off and chosen to incur lag (the time delay between the occurrence of an event and its processing or consumption by interested services).

So where might this be useful?

On LinkedIn, you can like a post. When this happens, an event such as "postLiked" is emitted to an event router. This event is interesting to a wide suite of services, but there will be one that listens for it and uses the like data and other user activity to update your timeline. A service that does this might be called the Timeline Enrichment Service.

If this service is down, your feed does not get updated and is stale.

However, as soon as it becomes available again, it can start consuming "postLiked" events and use them to reorder your timeline.

This is really powerful. Even though we incurred downtime, we did not display any error messages to our customers and continued to serve all traffic. We did not lose any data and, as soon as we could, we processed our users' events to get back to where we would have been had we not experienced any issues. We will explore this in more detail in a later post entitled Building for Failure.

Relatively Simple to Build an Audit Log System

Depending on your business, audit logs can be one of the most powerful and important features.

For example, if your business allows customers to configure various parts of their infrastructure, it's really important to know who changed what and when, so that in the event of an incident or a misconfiguration, it's possible to quickly identify it and rectify it.

Building an auditing system can be incredibly complex as it depends on every single change in the system being trackable. In some systems I have worked on, this is manual and depends on engineers remembering to do it.

In an event-driven system, by the nature of the way you are building, you have a change stream available to you, and building a useful auditing system becomes much simpler (but not free). One naive approach you could take is to read every event off your event router and store it in a database. You'd need to be really careful here not to leak sensitive information, but this is a good starting point!

There are still going to be some events you may need to manually configure, but a large proportion are available immediately.
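Here is a minimal sketch of that naive approach, with the database replaced by an in-memory slice. The `Event` shape and the `sensitiveFields` deny-list are assumptions for illustration; a real system would need a properly reviewed redaction policy and durable storage.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a hypothetical change event: who did what, when, plus details.
type Event struct {
	Type       string
	Actor      string
	OccurredAt time.Time
	Fields     map[string]string
}

// sensitiveFields is an assumed deny-list of field names to mask.
var sensitiveFields = map[string]bool{"password": true, "card_number": true}

// AuditLog is an in-memory stand-in for a database table.
type AuditLog struct {
	entries []Event
}

// Record stores a redacted copy of the event so that secrets never reach
// the audit store and the caller's map is left untouched.
func (a *AuditLog) Record(e Event) {
	redacted := make(map[string]string, len(e.Fields))
	for k, v := range e.Fields {
		if sensitiveFields[k] {
			v = "[REDACTED]"
		}
		redacted[k] = v
	}
	e.Fields = redacted
	a.entries = append(a.entries, e)
}

func main() {
	audit := &AuditLog{}

	// In a real system this call would sit inside the router subscription.
	audit.Record(Event{
		Type:       "userUpdated",
		Actor:      "admin@example.com",
		OccurredAt: time.Date(2024, 1, 2, 15, 4, 5, 0, time.UTC),
		Fields:     map[string]string{"email": "new@example.com", "password": "hunter2"},
	})

	for _, entry := range audit.entries {
		fmt.Printf("%s by %s at %s: %v\n", entry.Type, entry.Actor,
			entry.OccurredAt.Format(time.RFC3339), entry.Fields)
	}
}
```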

Ability to Move Fast on Stable Infrastructure

Mark Zuckerberg famously changed Facebook's motto from "move fast and break things" to "move fast on stable infrastructure" in ~2014 as an acknowledgement that as a company grows you really can have both. An EDA is one way to have your cake and eat it too.

By adopting an EDA as a company, your infrastructure will be simplified and a lot of boilerplate code can be generated. For example, pretty much every application you write is going to need to read or write from your event router and will probably connect to it using a standard library.

Furthermore, there is no coordination needed between services. This is powerful as it means that other teams can subscribe to events from your system without you having to be involved in onboarding them in any capacity. This also means that both they and you can continue to ship independently at your own cadence.

Enable Teams to Choose the Programming Language of Their Choice

This is a big benefit as a company grows! I still find it hard to believe, but some teams just don't want to write Go (or more accurately, another language meets their needs much better).

An example where you might see this is within a data team. Python has an amazing ecosystem for machine learning and mathematics. By using a language-agnostic approach to managing your event schemas (such as Protobuf), teams can use other languages without consequence. This is powerful and ensures you can continue to use the right tool for the job.

However, there are other ways to achieve this particular benefit without having to adopt an EDA. For example, OpenAPI and gRPC both allow you to generate clients for various programming languages. If your sole reason for adopting an EDA is this benefit, you may be better off exploring those options. In most companies I have worked at, we used both an EDA and gRPC together; having an EDA does not mean all synchronous communication can or should be eliminated!

(Potentially) Save Costs

EDA systems "only" need to be alive when there is work to do, that is, when there is an event to consume. If there is no work to do, your service can sit idle and use less CPU, bandwidth and memory. If you are using a serverless model, it might even cost you nothing depending on your scale, as lots of platforms offer a generous free tier for this.

Using cost savings as a reason to move to an EDA is a tough sell in my opinion and you should proceed with caution if you take this path as the primary means to justify an EDA. Most system types can be optimized to achieve greater cost savings and you need to ensure you have done everything you can to optimize your current system before looking at adopting an EDA in an attempt to save money. This is explored more in next week's post.

Sounds Great! Why Doesn't Everyone Have an Event-Driven Architecture?

EDAs ultimately end up reflecting your organization and any complexity within it. This complexity would still be there in a monolith, but by decentralizing it into potentially hundreds of services, you inherit other problems that you need to ensure you invest time to solve.

For example, without investment, visibility into system behavior as a whole can be much more difficult in an event-driven system. Investing in something like OpenTelemetry and a service catalog is a good idea. Getting started with these is relatively simple, but if you want to store your traces somewhere searchable, you will have to either pay for a SaaS tool that ingests them or run and maintain an open source tool capable of this, such as Jaeger. For service cataloging, Backstage is becoming a very popular option. Depending on the capabilities and capacity of your engineering team, this might be a good option, and many companies do have platform teams that provide tooling such as this. With the average salary of a platform engineer being ~$144k, companies should think carefully about whether the benefits of an EDA are going to outweigh the cost. We will dig deeper into this in parts two and three of the series.

In addition to cost, you need to think about what sort of resilience requirements are necessary for your system and workloads to ensure that you pick the correct patterns for you. For example, what is the consequence in your system if your router provides at-least-once delivery (meaning you might receive the same message more than once)? Can your system handle it today? If not, what changes do you need to make to support it?
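One common way to cope with duplicate deliveries is to make the consumer idempotent by remembering which event IDs it has already processed. The sketch below keeps that record in memory purely for illustration; the `Consumer` type and the event fields are made up, and a real service would need a durable store so the record survives restarts.

```go
package main

import "fmt"

// Event is a hypothetical payment event with a unique ID.
type Event struct {
	ID     string
	Amount int
}

// Consumer applies each event at most once by tracking processed IDs.
type Consumer struct {
	seen  map[string]bool
	total int
}

// NewConsumer returns a consumer that has processed nothing yet.
func NewConsumer() *Consumer {
	return &Consumer{seen: make(map[string]bool)}
}

// Handle applies the event and returns true, or returns false without
// side effects if an event with the same ID was already processed.
func (c *Consumer) Handle(e Event) bool {
	if c.seen[e.ID] {
		return false
	}
	c.seen[e.ID] = true
	c.total += e.Amount
	return true
}

func main() {
	c := NewConsumer()

	// The router redelivers evt-1, as at-least-once delivery permits.
	deliveries := []Event{
		{ID: "evt-1", Amount: 10},
		{ID: "evt-1", Amount: 10},
		{ID: "evt-2", Amount: 5},
	}
	for _, e := range deliveries {
		if !c.Handle(e) {
			fmt.Println("skipped duplicate", e.ID)
		}
	}
	fmt.Println("total:", c.total)
}
```

Without the `seen` check, the redelivered event would be counted twice and the total would be 25 rather than 15.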

Furthermore, even if you use a managed service for your message router, it becomes such an essential piece of infrastructure that you need to ensure you have the expertise in your team to understand its failure modes. I gave a talk at QCon (summarized nicely here) in which I talk through some issues we have seen with Kafka (Cloudflare's event router of choice) and how we solved them. It should not be underplayed what a large undertaking this is, and large companies have big platform teams for running things such as Kafka.

Finally, it's often argued that debugging an issue can be more difficult. I actually don't agree with this and I think you have to reframe how you think of it. In a monolithic application, you can spin up the entire system on your laptop or in a cloud workspace and test every workflow. In an event-driven system, you should not attempt to do this and should think of each individual system as a boundary. You should use the enhanced visibility we discussed earlier to identify the service causing an issue and then test it independently. Once you get used to this pattern of working (and have got your visibility stack in a good place) you'll find this is probably easier than the previous ways of working.
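Treating each service as a boundary is easier when its handlers depend on a small interface rather than on the router directly. The sketch below is one way to do that, under assumed names: `Publisher`, `fakePublisher` and `HandleOrderPaid` are all hypothetical. The handler can then be exercised on its own with a fake in place of the real router.

```go
package main

import (
	"errors"
	"fmt"
)

// Publisher is the only thing the service knows about the outside world.
type Publisher interface {
	Publish(eventType, payload string) error
}

// fakePublisher records events in memory in place of a real router.
type fakePublisher struct {
	published []string
}

// Publish appends the event to the in-memory record and never fails.
func (f *fakePublisher) Publish(eventType, payload string) error {
	f.published = append(f.published, eventType+":"+payload)
	return nil
}

// HandleOrderPaid is a hypothetical handler: it reacts to an "orderPaid"
// event by emitting "orderDispatched" for the same order ID.
func HandleOrderPaid(p Publisher, orderID string) error {
	if orderID == "" {
		return errors.New("orderPaid event has no order ID")
	}
	return p.Publish("orderDispatched", orderID)
}

func main() {
	fake := &fakePublisher{}

	// Exercise the handler in isolation; no router or other service needed.
	if err := HandleOrderPaid(fake, "12345"); err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println(fake.published)
}
```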

Deciding if you should adopt an EDA is a tough decision, and we will explore it much more in our follow-up post, where we discuss how you can decide if an EDA is right for you and, if so, how to win buy-in.

Event-Driven Architectures With Go

Over the course of my career I have built producer and consumer services in PHP, JavaScript, Java, Go and a couple of other languages. In general, I find Go to be an incredible "sensible default" for building systems that will be part of your EDA.

Firstly, Go is well supported by all the major cloud providers. If you want to use serverless functions, Go is a first-class citizen. If you do not want to use serverless or it does not match your use case, Go is incredibly easy to containerise, with a production-ready image creatable in just a few lines.

Secondly, Go is incredibly easy to learn and, in my opinion, maintain. This means that if you're a growing company and expect to onboard new teams and team members, having Go as a basis for your systems should mean that new engineers can get up to speed quickly. Below is a small sample application that can connect to Google Cloud Pub/Sub, subscribe to a topic, send an event and then clean up. In total, it's 82 lines of code including liberal line breaks. Even if you have never written or read a line of Go before, I hope you'll agree that it's quite clear and readable:

package main

import (
	"cloud.google.com/go/pubsub"
	"context"
	"fmt"
	"log"
	"time"
)

const (
	projectID  = "your-project-id"
	topicID    = "your-topic-id"
	subscriber = "your-subscriber-id"
)

func publishMessage(ctx context.Context, topic *pubsub.Topic, messageData []byte) error {
	result := topic.Publish(ctx, &pubsub.Message{
		Data: messageData,
	})

	id, err := result.Get(ctx)
	if err != nil {
		return fmt.Errorf("publish error: %w", err)
	}

	log.Printf("Published a message with ID: %s\n", id)
	return nil
}

func handleMessage(ctx context.Context, message *pubsub.Message) {
	log.Printf("Received message: %s\n", message.Data)
	message.Ack()
}

func subscribeToMessages(ctx context.Context, subscription *pubsub.Subscription) error {
	err := subscription.Receive(ctx, func(ctx context.Context, message *pubsub.Message) {
		handleMessage(ctx, message)
	})
	if err != nil {
		return fmt.Errorf("receive error: %w", err)
	}
	return nil
}

func main() {
	ctx := context.Background()

	// Create a new Pub/Sub client
	client, err := pubsub.NewClient(ctx, projectID)
	if err != nil {
		log.Fatalf("Failed to create client: %v", err)
	}
	defer client.Close()
	topic := client.Topic(topicID) // the topic must already exist

	// Create the subscription first; only messages published after this are delivered
	subscription, err := client.CreateSubscription(ctx, subscriber, pubsub.SubscriptionConfig{
		Topic:       topic,
		AckDeadline: 10 * time.Second,
	})
	if err != nil {
		log.Fatalf("Failed to create subscription: %v", err)
	}

	// Publish a message
	if err := publishMessage(ctx, topic, []byte("Your message here!")); err != nil {
		log.Fatalf("Failed to publish message: %v", err)
	}

	// Receive blocks until its context is done, so stop listening after 30 seconds
	receiveCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	if err := subscribeToMessages(receiveCtx, subscription); err != nil {
		log.Fatalf("failed to subscribe to messages: %v", err)
	}

	// Clean up
	if err := subscription.Delete(ctx); err != nil {
		log.Fatalf("failed to delete subscription: %v", err)
	}
}

Finally, Go is incredibly performant without additional complexity. I really enjoyed this blog post where the author migrated his application from PHP to Go and:

Reduced execution time by 99%.
Reduced CPU usage by 60-70%.
Added concurrency.
Decreased latency.

My favorite thing about success stories such as the above is that the author was not a Go expert and did not have to go looking for "hacks" to attain those performance gains; even badly written Go is pretty performant compared to alternatives!

Wrapping Up

In this post we covered, at a high level, what an EDA is and why you may or may not want to consider it for your company. We also briefly discussed why you might want to consider Go as a sensible default for building out services in this system.

Next, check out part two in the Event-Driven Architecture series: Making a Business Case for an Event-Driven Architecture and Taking the First Steps.

We'll see you there!

About The Author

Matthew Boyle is an experienced technical leader in the field of distributed systems, specializing in using Go.

He has worked at huge companies such as Cloudflare and General Electric, as well as exciting high-growth startups such as Curve and Crowdcube.

Matt has been writing Go for production since 2018 and often shares blog posts and fun trivia about Go over on Twitter (@MattJamesBoyle).

If you enjoyed this blog post, you should check out Matt's book, Domain-Driven Design Using Go, which is available from Amazon here.
