Service Mesh Comparison

What is a Service Mesh?

A service mesh is a dedicated infrastructure layer that adds features to the network between services. It lets you control traffic and gain insights throughout the system. Observability, traffic shifting (for canary releases), resiliency features (such as circuit breaking and retry/timeout) and automatic mutual TLS can be configured once and enforced in a decentralized fashion. In contrast to libraries, which are used for similar functionality, a service mesh does not require code changes. Instead, it adds a layer of additional containers that implement the features reliably and independently of technology or programming language.

Service Mesh Architecture

Figure: a two-service microservice architecture with and without a service mesh.

Without a service mesh, each microservice implements business logic and cross-cutting concerns (CCCs) by itself.

With a service mesh, many CCCs like traffic metrics, routing, and encryption are moved out of the microservice and into a proxy. Business logic and business metrics stay in the microservices. Incoming and outgoing requests are transparently routed through the proxies. In addition to a layer of proxies (the data plane), a service mesh adds a so-called control plane. It distributes configuration updates to all proxies and receives metrics collected by the proxies for further processing, e.g. by a monitoring infrastructure such as Prometheus.
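The split between data plane and control plane can be illustrated with a toy model. This is only a sketch in Python, not how any real mesh is implemented: the class names `SidecarProxy` and `ControlPlane`, the config key, and the example service are all made up for illustration. It shows a proxy that records traffic metrics around an unchanged service function, and a control plane that pushes configuration and collects those metrics.

```python
import time
from collections import defaultdict


class SidecarProxy:
    """Toy stand-in for a data-plane proxy in front of one service (hypothetical)."""

    def __init__(self, service_name, handler):
        self.service_name = service_name
        self.handler = handler            # business logic stays in the service
        self.config = {}                  # filled in by the control plane
        self.metrics = defaultdict(int)   # traffic metrics live in the proxy
        self.latencies_s = []

    def handle(self, request):
        # The service code is not changed; the proxy wraps every request.
        start = time.perf_counter()
        self.metrics["requests"] += 1
        try:
            return self.handler(request)
        except Exception:
            self.metrics["errors"] += 1
            raise
        finally:
            self.latencies_s.append(time.perf_counter() - start)


class ControlPlane:
    """Toy control plane: distributes config and gathers proxy metrics."""

    def __init__(self):
        self.proxies = []

    def register(self, proxy):
        self.proxies.append(proxy)

    def push_config(self, config):
        # A real proxy would apply this to routing, timeouts, mTLS and so on.
        for proxy in self.proxies:
            proxy.config = dict(config)

    def collect_metrics(self):
        return {p.service_name: dict(p.metrics) for p in self.proxies}


# Usage with a made-up service function.
control_plane = ControlPlane()
orders_proxy = SidecarProxy("orders", lambda request: f"accepted {request}")
control_plane.register(orders_proxy)
control_plane.push_config({"request_timeout_s": 2})
orders_proxy.handle("order-1")
print(control_plane.collect_metrics())
```

In a real mesh the proxy is a separate process (for example Envoy) and the interception happens at the network level, so the service does not even call the proxy explicitly.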

Who Needs a Service Mesh?

The value of a service mesh grows with the number of services an application consists of. Logically, microservice architectures are the most common use case for a service mesh. However, how the services interact can matter more than their number when judging how much a service mesh can improve their control, reliability, security, and observability. Even a monolith could benefit from a service mesh, and some microservice applications might not.

Service Mesh Implementations

Istio

If you have heard about service mesh, you have probably heard about Istio too. Istio is by far the most popular service mesh because of its rich feature set and Google's and IBM's support.

Linkerd

Linkerd was the first service mesh. The modern 2.x versions are committed to simplicity, performance, and building on top of Kubernetes as the underlying platform.

Consul

HashiCorp's Consul has been well known as a service discovery solution for a long time. Now that it has adopted the Envoy proxy and the sidecar pattern, Consul can serve as a service mesh for a variety of platforms like Kubernetes and VMs.

AWS App Mesh

Not long after the service mesh hype began, AWS added its own service mesh for applications running on AWS.

Traefik Mesh

As the name already reveals, Traefik Mesh (formerly Maesh) is the service mesh based on the cloud-native API gateway Traefik.

Kuma

Kuma is a service mesh that uses Envoy and the sidecar pattern. It is made by Kong, the developers of the API gateway of the same name. It focuses on multi-cloud and can run non-Kubernetes workloads.

Cilium

Cilium is a network plugin for Kubernetes using the eBPF features of the Linux kernel. Recent releases also include mesh features, with or without sidecars.

Gloo Mesh

Gloo Mesh Enterprise is an Istio-based service mesh. It aims to simplify application networking with unified control, reliability, observability, extensibility, and security.

Gateway API

Gateway API is a collection of resources, initially designed to replace the Kubernetes Ingress API. The expressive, extensible and role-oriented interfaces are now extending into service mesh use cases, with broad industry support.

How to Choose a Service Mesh Implementation

While service meshes have no impact on the code, they change operations procedures and require familiarization with new concepts and technology. So, especially until the Gateway APIs for mesh are complete and widely supported, adopting a service mesh implementation is a long-term decision. Therefore, the implementations should be compared and tested carefully in advance. Choosing the most flexible service mesh with the most features seems logical at first. But the opposite could be true, because features and flexibility are often paid for with cognitive and technical complexity.

The goal of the evaluation is to figure out which features are important to you and how you benefit from them. As service meshes impact the latency and resource consumption, these disadvantages have to be measured, too.

We recommend including the following steps in your decision process:

1. As a team, identify the most important problems to be solved by the service mesh. Keep in mind that libraries or adaptations to the architecture could be good alternatives in some cases (see below).
2. Discuss your requirements regarding simplicity/usability, performance, and compatibility.
3. Identify your top two or three implementations based on your feature and non-feature requirements. The table below should assist you.
4. Try the implementations by executing their respective tutorials (links below) and possibly discard one candidate.
5. Test the latency and resource overhead for your individual application. For each service mesh candidate, set up an identical test environment and install the service mesh. Set up an additional mesh-free environment. Install your application in all environments. Perform a load test in each environment with a tool like Locust or Fortio and measure request latency as well as CPU and memory consumption.
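The latency comparison from the load test can be evaluated with a few lines of code. The following is a minimal Python sketch, assuming that your load test tool lets you export per-request latencies in milliseconds for each environment. The function names are hypothetical and the sample numbers are placeholders, not measurements of any mesh.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Median, tail, and mean latency for one environment."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
        "mean": statistics.mean(latencies_ms),
    }


def overhead(baseline_ms, candidate_ms):
    """Added latency in ms per statistic: candidate minus mesh-free baseline."""
    base = summarize(baseline_ms)
    cand = summarize(candidate_ms)
    return {key: cand[key] - base[key] for key in base}


# Placeholder samples; replace them with the exported results of your load tests.
mesh_free = [10, 12, 11, 13, 40]
candidate_a = [12, 14, 13, 15, 45]

print(overhead(mesh_free, candidate_a))
```

Looking at the tail (p99) in addition to the median is a deliberate choice here: proxies often add little to the typical request but more to the slowest ones. CPU and memory consumption need to be compared separately, for example from your monitoring system.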

Service Mesh Comparison

	Istio	Linkerd	AWS App Mesh	Consul	Traefik Mesh (formerly Maesh)	Kuma	Cilium
Current version	1.20	2.13		1.18	1.4	2.6	1.12
License	Apache License 2.0	Apache License 2.0	Closed Source	Business Source License	Apache License 2.0	Apache License 2.0	Apache License 2.0
Initiated by	Google, IBM, Lyft	Buoyant	AWS	HashiCorp	Traefik Labs	Kong	Cilium
Service Proxy	Envoy, Ambient Mesh (alpha)	Linkerd2-proxy	Envoy	defaults to Envoy, exchangeable	Traefik Proxy on each node	Envoy	Cilium agent on each node, Proxy Injection option for L7 (Envoy)
Ingress Controller	Envoy, support for Kubernetes Gateway API	not provided	AWS Virtual Gateway, exposable with AWS Elastic Load Balancer/AWS Load Balancer Controller	Envoy. Support for Kubernetes Gateway API with Consul API Gateway	not provided	Support for Kubernetes Gateway API, Kuma builtin routes, or Ingress API through delegated Gateways	Cilium Ingress for TLS & path-based routing features, compatible with any other
Governance	see Istio Community and CNCF Charter	see Linkerd Governance and CNCF Charter	AWS	see Contributing to Consul	see Contributing notice	see Contributing notice, Governance, and CNCF Charter	see Governance and CNCF Charter
Tutorial	Istio Tasks	Linkerd Getting Started Guide	AWS App Mesh Getting Started	HashiCorp Learn platform	Traefik Mesh Example	Install Kuma on Kubernetes	Cilium Quick Installation
Used in production	yes	yes	yes	yes			yes
Advantages	Istio is a CNCF Graduated service mesh. It can be adapted and extended like no other mesh. Its many features are available for Kubernetes and other platforms.	Linkerd is a CNCF Graduated service mesh. It provides best-in-class operational simplicity, security, and performance, and is extremely easy to adopt and use.	AWS App Mesh is integrated into the AWS landscape and it is fully managed for you.	Consul service mesh can be used in any Consul environment and therefore does not require a scheduler. The proxy can be changed and extended.	Traefik Mesh focuses on a selection of features to achieve good usability and performance.	Kuma supports both Kubernetes and VMs - including hybrid multi-zone deployments - and scales to many autonomous zones with different network constraints, it also allows you to customize the Envoy Proxy.	Cilium takes a different approach on service mesh by making use of eBPF and therefore it doesn't need sidecars at all for some very simple use cases.
Drawbacks	Istio's flexibility can be overwhelming for teams who don't have the capacity for more complex technology. Also, Istio takes control of the ingress controller.	Linkerd is deeply integrated with Kubernetes and does not currently support non-Kubernetes workloads. It also does not currently support data plane extensions.	AWS App Mesh configuration cannot be migrated to an environment outside AWS.	Consul uses its own internal storage, and does not rely on Kubernetes for persistent storage.	Traefik Mesh currently does not support transparent TLS encryption.	Kuma is possibly the most flexible service mesh. Teams should thoroughly consider whether their project can handle the complexity involved.	Any L7 use cases require the use of an Envoy proxy, which is shared between all workloads on the node and not a recommended security configuration.
Supported Protocols
TCP	yes	yes	yes	yes	yes	yes	yes
HTTP/1.1+	yes	yes	yes	yes	yes	yes	yes
HTTP/2	yes	yes	yes	yes	yes	yes	yes
gRPC	yes	yes	yes	yes	yes	yes	yes
Sidecar / Data Plane
Automatic Sidecar Injection	yes	yes	yes	yes	yes (per Node)	yes	yes (per Node)
CNI plugin to avoid pod network privileges	yes, in beta	yes	yes	yes	no	yes	not necessary
Platform and Extensibility
Platform	Kubernetes, VMs	Kubernetes	ECS, Fargate, EKS, EC2	Kubernetes, Nomad, VMs, ECS, Lambda	Kubernetes	Kubernetes, VMs, ECS	Kubernetes
Cloud Integrations	Google Cloud, Alibaba Cloud, IBM Cloud, Microsoft Azure, Huawei Cloud, Red Hat OpenShift, DaoCloud, VMware Tanzu, Tencent Cloud, Baidu AI Cloud, Oracle Cloud Native Environment	DigitalOcean	AWS	HCP Consul on AWS and Azure		Kong Konnect
Mesh Expansion (extension of the mesh by containers/VMs outside the cluster)	yes	no	yes, within AWS	yes	no	yes	yes
Multi-Cluster Mesh (control and observe multiple clusters)	yes	yes	yes, through AWS Cloud Map	yes	no	yes	yes
Monitoring Features
Service Log Collection	no	no	no, use AWS FireLens for ECS and Fargate instead	no	no	no	no
Access Log Generation	yes	yes and tap feature	yes	yes	yes	yes	yes, via Proxy injection
"Golden Signal" Metrics Generation	yes	yes	yes	yes, depending on the proxy used	yes	yes	yes, L7 metrics via Proxy injection
Integrated, pre-configured Prometheus	yes	yes, in an extension	no	yes, for non-prod environments	yes	yes	yes
Integrated, pre-configured Grafana	yes	yes, in an extension	no	no	yes	yes, including a datasource	yes
Per-Route Metrics (collect values for each HTTP endpoint individually)	experimental	yes		depending on the proxy used	no	no	yes, via Proxy injection
Dashboard	yes, Kiali	yes	yes, AWS CloudWatch	yes	no	yes, with a service topology map in Grafana	yes, Hubble
Compatible Tracing-Backends	Jaeger, Zipkin, SolarWinds	all Backends supporting OpenTelemetry	AWS X-Ray, Jaeger and Datadog	Datadog, Jaeger, Zipkin, OpenTracing, Honeycomb	Jaeger	Jaeger, Zipkin, Datadog, and OpenTelemetry	OpenTelemetry via hubble-otel
Integrated, pre-configured Tracing-Backends	yes, Jaeger or Zipkin for nonprod environments	Jaeger, in an extension	yes, AWS X-Ray	yes	yes, Jaeger	yes, Jaeger	no
Routing Features
Load Balancing	yes (Round Robin, Random, Weighted, Least Request)	yes (EWMA, exponentially weighted moving average)	yes	yes (Round Robin, Random, Weighted, Least Request, Ring Hash, Maglev)	yes	yes (Round Robin, Least Request, Ring Hash, Random, Maglev)	yes
Percentage-based Traffic Splits	yes	yes, through Gateway API or SMI	yes	yes	yes, through SMI	yes	via manual configuration of Envoy proxy
Header- and Path-based Traffic Splits (routing rules based on request header and path)	yes	yes, through Gateway API	yes	yes	no	yes	via manual configuration of Envoy proxy
Resilience Features
Circuit Breaking	yes	yes	yes	yes	yes	yes	via manual configuration of Envoy proxy
Retry & Timeout	yes	yes	yes	yes	yes	Retry and Timeout	via manual configuration of Envoy proxy
Path- & Method-based Retry & Timeout (different retry and timeout config for each endpoint)	yes	yes	yes	yes	no	yes, Retry and Timeout (with targetRef: MeshHTTPRoute)	no
Fault Injection	yes	yes, by adding a deployment and a traffic split config		yes	no	yes	no
Delay Injection	yes	no		yes	no	yes	no
Security Features
mTLS	yes	yes, on by default	yes	yes	no	yes	yes, with manually created certs
mTLS Enforcement	yes	yes	yes, via client policies	yes	no	yes	yes
mTLS Permissive Mode	yes	yes		yes	no	yes
mTLS by default	yes, permissive mode	yes, permissive mode	no	yes	no	yes, optional	no
External CA certificate and key pluggable e.g. Vault, cert-manager	yes, CA cert pluggable and CA integration (experimental), SPIRE Integration	yes	yes	yes, HashiCorp Vault, ACM Private CA, custom CA	no	yes	yes
Service-to-Service Authorization Rules	yes, including External Authorization	yes	no, but support for IAM for user-authorization	yes	no	yes	yes
*Might be possible through manual configuration/templating of proxy
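A first shortlist can be derived from the table mechanically. The sketch below is written in Python and encodes only three rows (Platform, Mesh Expansion, mTLS) as simplified values. Qualifiers from the table such as "within AWS" or "with manually created certs" are deliberately dropped, so treat the result as a starting point only. The names `MESHES` and `shortlist` are made up for this example.

```python
# Simplified excerpt of the comparison table above (three rows only).
# Qualifiers like "within AWS" or "with manually created certs" are omitted.
MESHES = {
    "Istio": {"platforms": {"Kubernetes", "VMs"}, "mesh_expansion": True, "mtls": True},
    "Linkerd": {"platforms": {"Kubernetes"}, "mesh_expansion": False, "mtls": True},
    "AWS App Mesh": {"platforms": {"ECS", "Fargate", "EKS", "EC2"}, "mesh_expansion": True, "mtls": True},
    "Consul": {"platforms": {"Kubernetes", "Nomad", "VMs", "ECS", "Lambda"}, "mesh_expansion": True, "mtls": True},
    "Traefik Mesh": {"platforms": {"Kubernetes"}, "mesh_expansion": False, "mtls": False},
    "Kuma": {"platforms": {"Kubernetes", "VMs", "ECS"}, "mesh_expansion": True, "mtls": True},
    "Cilium": {"platforms": {"Kubernetes"}, "mesh_expansion": True, "mtls": True},
}


def shortlist(required_platforms, required_features):
    """Return the meshes that support all required platforms and features."""
    result = []
    for name, row in MESHES.items():
        if not set(required_platforms) <= row["platforms"]:
            continue  # at least one required platform is missing
        if all(row[feature] for feature in required_features):
            result.append(name)
    return result


# Example: workloads on VMs that need mTLS.
print(shortlist({"VMs"}, ["mtls"]))  # ['Istio', 'Consul', 'Kuma']
```

Non-feature requirements such as simplicity or operational effort do not fit into a yes/no filter and still have to be discussed by the team.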

Found a mistake? Or have something to add? We appreciate your issues or pull requests on GitHub!


Alternatives to Service Meshes

Undoubtedly, the service mesh is a useful pattern and some current implementations are very promising. But they also come with challenges such as cognitive and technical complexity. Like any tool, they are not useful in every situation. Sometimes it might be wise to keep existing well-known "boring" technology or to go with alternative solutions.

Libraries

Libraries are included in the microservices. The drawbacks are dependencies on specific technologies/languages, potential inconsistency in implementations and missing separation of service infrastructure and business logic.

However, developer productivity can (at least in the short term) be better through the familiar use of libraries. Also, sometimes domain knowledge is needed, for example, to configure the fallback for a circuit breaker or to define business metrics. In these cases, a service mesh is of no use.
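The fallback case can be made concrete with a small example. This is a minimal circuit breaker sketch in Python, not the API of any particular resilience library. The class, its parameters, and the recommendation example are assumptions for illustration. The point is the `fallback` argument: what to return when the remote call fails is a domain decision that only the service itself can make.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker with a caller-supplied fallback (illustrative)."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()   # open: skip the remote call entirely
            # Reset window elapsed: let one trial call through (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def fetch_personal_recommendations():
    # Stand-in for a remote call that is currently failing.
    raise ConnectionError("recommendation service unavailable")


def bestsellers():
    # Domain-specific fallback: show bestsellers instead of personal picks.
    return ["bestseller-1", "bestseller-2"]


breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0)
print(breaker.call(fetch_personal_recommendations, bestsellers))
```

A mesh can open the circuit and fail fast at the proxy, but it cannot know that bestsellers are an acceptable substitute for personal recommendations.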

Service meshes require a change to the infrastructure. So it is not possible to use them if the infrastructure cannot or should not be changed. Sometimes the risk of changing the infrastructure is deemed too high, even though service meshes can be applied to specific services only.

No (synchronous) Microservices

Service meshes are particularly helpful for synchronous communication. They usually rely on the HTTP protocol to transfer additional information and, for example, to understand whether a call failed.

One of the reasons for adopting microservices is their potential to reduce the time-to-market for software. Despite several drawbacks such as high latency and tight coupling, it's a common practice to implement microservice communication synchronously.

However, it is often overlooked that there are other ways to implement microservice communication, or even to avoid dependencies in the first place. (Read more in the free Microservices Recipes Book.) Patterns like self-contained systems (SCS) and asynchronous communication aim to mitigate many problems of classic (synchronously communicating) microservices. Of course, you can have asynchronous microservices with HTTP, e.g. by polling a feed for new events. As service meshes rely on HTTP, they would still be of some use. However, their resilience features, for example, are of less use, as asynchronous communication supports resilience anyway.
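Polling a feed for new events can be sketched as follows. This is a minimal Python illustration under several assumptions: `fetch_feed` is a hypothetical callable standing in for an HTTP GET on the producer's feed, and each event is assumed to be a dictionary with a monotonically increasing `id`. Real feeds (for example Atom) differ in detail.

```python
def poll_once(fetch_feed, last_seen_id, handle_event):
    """Handle all events newer than last_seen_id and return the new cursor."""
    events = fetch_feed()
    for event in sorted(events, key=lambda e: e["id"]):
        if event["id"] > last_seen_id:
            handle_event(event)
            last_seen_id = event["id"]
    return last_seen_id


def poll_safely(fetch_feed, last_seen_id, handle_event):
    """Like poll_once, but an unavailable producer only delays processing."""
    try:
        return poll_once(fetch_feed, last_seen_id, handle_event)
    except ConnectionError:
        # Nothing is lost: the cursor is unchanged and the next poll catches up.
        return last_seen_id


# Usage with an in-memory feed instead of a real HTTP endpoint.
feed = [{"id": 1, "type": "order-created"}, {"id": 2, "type": "order-paid"}]
handled = []
cursor = poll_safely(lambda: feed, 0, handled.append)
print(cursor, [event["type"] for event in handled])
```

Because the consumer keeps its own cursor, a failed poll needs no retry policy or circuit breaker from the mesh. The consumer simply tries again on its next interval.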

Unjustifiably, monolithic architectures are often not even considered as a solution. Obviously, service meshes can only help a monolith with communication to other systems but not with internal communication.

Service Mesh Primer

Our free Service Mesh Primer explains the service mesh pattern and features in detail and contains examples for Istio.