Service Mesh Comparison

What is a Service Mesh?

A service mesh is a dedicated infrastructure layer that adds features to the network between services. It lets you control traffic and gain insights throughout the system. Observability, traffic shifting (for canary releases), resiliency features (such as circuit breaking and retries/timeouts), and automatic mutual TLS can be configured once and enforced in a decentralized fashion. In contrast to libraries, which offer similar functionality, a service mesh does not require code changes. Instead, it adds a layer of additional containers that implement the features reliably and independently of technology or programming language.

Service Mesh Architecture

[Image: comparison of a two-service microservice architecture with and without a service mesh.]

Without a service mesh,
... each microservice implements business logic and cross-cutting concerns (CCCs) by itself.

With a service mesh,
... many CCCs like traffic metrics, routing, and encryption are moved out of the microservice and into a proxy. Business logic and business metrics stay in the microservices. Incoming and outgoing requests are transparently routed through the proxies. In addition to a layer of proxies (the data plane), a service mesh adds a so-called control plane. It distributes configuration updates to all proxies and receives the metrics collected by the proxies for further processing, e.g. by a monitoring infrastructure such as Prometheus.
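The split between data plane and control plane can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: the names `SidecarProxy`, `ControlPlane`, and `canary_percent` are hypothetical and do not correspond to the API of any real mesh, and a real proxy intercepts network traffic instead of wrapping Python functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Handler = Callable[[str], str]


@dataclass
class SidecarProxy:
    """Hypothetical data-plane proxy in front of two versions of a service."""
    service_name: str
    stable: Handler              # business logic, unaware of the mesh
    canary: Handler
    canary_percent: int = 0      # routing rule, set by the control plane
    metrics: Dict[str, int] = field(
        default_factory=lambda: {"stable": 0, "canary": 0}
    )

    def handle(self, request: str) -> str:
        """Route one request and record which version served it."""
        total = self.metrics["stable"] + self.metrics["canary"] + 1
        # Deterministic split: pick the canary whenever the running quota
        # (total * percent / 100) crosses the next integer.
        quota_now = (total * self.canary_percent) // 100
        quota_before = ((total - 1) * self.canary_percent) // 100
        to_canary = quota_now > quota_before
        self.metrics["canary" if to_canary else "stable"] += 1
        return (self.canary if to_canary else self.stable)(request)


@dataclass
class ControlPlane:
    """Hypothetical control plane: pushes config to proxies, pulls metrics."""
    proxies: List[SidecarProxy] = field(default_factory=list)

    def push_canary_percent(self, service_name: str, percent: int) -> None:
        for proxy in self.proxies:
            if proxy.service_name == service_name:
                proxy.canary_percent = percent

    def collect_metrics(self) -> Dict[str, Dict[str, int]]:
        return {p.service_name: dict(p.metrics) for p in self.proxies}
```

Neither handler knows about the routing rule or the counters; both live in the proxy and are managed centrally, which is the point of the pattern.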


Who Needs a Service Mesh?

The value of a service mesh grows with the number of services an application consists of. Accordingly, microservice architectures are the most common use case for a service mesh. However, how the services actually interact may matter more than their number when judging how much a service mesh can improve control, reliability, security, and observability. Even a monolith could benefit from a service mesh, and some microservice applications might not.


Service Mesh Implementations



How to Choose a Service Mesh Implementation

While service meshes have no impact on the code, they change operations procedures and require familiarization with new concepts and technology. So, especially until the Gateway APIs for mesh are complete and widely supported, adopting a service mesh implementation is a long-term decision. Therefore, the implementations should be compared and tested carefully in advance. Choosing the most flexible service mesh with the most features seems logical at first. But the opposite could be true, because features and flexibility are often paid for with cognitive and technical complexity.

The goal of the evaluation is to figure out which features are important to you and how you benefit from them. As service meshes impact the latency and resource consumption, these disadvantages have to be measured, too.
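One way to quantify the latency cost is to compare percentiles of request latencies measured with and without the mesh. The following is a minimal sketch, assuming you have already collected latency samples in milliseconds with a load-testing tool of your choice; the function names are hypothetical and the choice of the nearest-rank percentile method is an assumption, not a recommendation from any mesh project.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


def latency_overhead(
    baseline: Sequence[float],
    with_mesh: Sequence[float],
    percentiles: Sequence[int] = (50, 99),
) -> Dict[int, float]:
    """Difference (with mesh minus baseline) for each requested percentile."""
    return {
        p: percentile(with_mesh, p) - percentile(baseline, p)
        for p in percentiles
    }
```

Looking at a high percentile in addition to the median matters because proxies tend to affect tail latency differently from typical latency. Resource consumption (CPU and memory of the proxies) would need to be measured separately.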

We recommend including the following steps in your decision process:


Service Mesh Comparison

Feature | Istio | Linkerd | AWS App Mesh | Consul | Traefik Mesh (formerly Maesh) | Kuma | Cilium
Current version | 1.20 | 2.13 | | 1.18 | 1.4 | 2.6 | 1.12
License | Apache License 2.0 | Apache License 2.0 | Closed Source | Business Source License | Apache License 2.0 | Apache License 2.0 | Apache License 2.0
Initiated by | Google, IBM, Lyft | Buoyant | AWS | HashiCorp | Traefik Labs | Kong | Cilium
Service Proxy | Envoy, Ambient Mesh (alpha) | Linkerd2-proxy | Envoy | defaults to Envoy, exchangeable | Traefik Proxy on each node | Envoy | Cilium agent on each node, Proxy Injection option for L7 (Envoy)
Ingress Controller | Envoy, support for Kubernetes Gateway API | not provided | AWS Virtual Gateway, exposable with AWS Elastic Load Balancer/AWS Load Balancer Controller | Envoy. Support for Kubernetes Gateway API with Consul API Gateway | not provided | Support for Kubernetes Gateway API, Kuma builtin routes, or Ingress API through delegated Gateways | Cilium Ingress for TLS & path-based routing features, compatible with any other
Governance | see Istio Community and CNCF Charter | see Linkerd Governance and CNCF Charter | AWS | see Contributing to Consul | see Contributing notice | see Contributing notice, Governance, and CNCF Charter | see Governance and CNCF Charter
Tutorial | Istio Tasks | Linkerd Getting Started Guide | AWS App Mesh Getting Started | HashiCorp Learn platform | Traefik Mesh Example | Install Kuma on Kubernetes | Cilium Quick Installation
Used in production yes yes yes yes yes
Advantages | Istio is a CNCF Graduated service mesh. It can be adapted and extended like no other mesh. Its many features are available for Kubernetes and other platforms. | Linkerd is a CNCF Graduated service mesh. It provides best-in-class operational simplicity, security, and performance, and is extremely easy to adopt and use. | AWS App Mesh is integrated into the AWS landscape and is fully managed for you. | Consul service mesh can be used in any Consul environment and therefore does not require a scheduler. The proxy can be changed and extended. | Traefik Mesh focuses on a selection of features to achieve good usability and performance. | Kuma supports both Kubernetes and VMs, including hybrid multi-zone deployments, and scales to many autonomous zones with different network constraints. It also allows you to customize the Envoy proxy. | Cilium takes a different approach to service mesh by making use of eBPF, and therefore it doesn't need sidecars at all for some very simple use cases.
Drawbacks | Istio's flexibility can be overwhelming for teams who don't have the capacity for more complex technology. Also, Istio takes control of the ingress controller. | Linkerd is deeply integrated with Kubernetes and does not currently support non-Kubernetes workloads. It also does not currently support data plane extensions. | AWS App Mesh configuration cannot be migrated to an environment outside AWS. | Consul uses its own internal storage and does not rely on Kubernetes for persistent storage. | Traefik Mesh currently does not support transparent TLS encryption. | Kuma is possibly the most flexible service mesh. Teams should thoroughly consider whether their project can handle the complexity involved. | Any L7 use cases require the use of an Envoy proxy, which is shared between all workloads on the node and is not a recommended security configuration.
Supported Protocols
TCP | yes | yes | yes | yes | yes | yes | yes
HTTP/1.1+ | yes | yes | yes | yes | yes | yes | yes
HTTP/2 | yes | yes | yes | yes | yes | yes | yes
gRPC | yes | yes | yes | yes | yes | yes | yes
Sidecar / Data Plane
Automatic Sidecar Injection | yes | yes | yes | yes | yes (per Node) | yes | yes (per Node)
CNI plugin to avoid pod network privileges | yes, in beta | yes | yes | yes | no | yes | not necessary
Platform and Extensibility
Platform | Kubernetes, VMs | Kubernetes | ECS, Fargate, EKS, EC2 | Kubernetes, Nomad, VMs, ECS, Lambda | Kubernetes | Kubernetes, VMs, ECS | Kubernetes
Cloud Integrations | Google Cloud, Alibaba Cloud, IBM Cloud, Microsoft Azure, Huawei Cloud, Red Hat OpenShift, DaoCloud, VMware Tanzu, Tencent Cloud, Baidu AI Cloud, Oracle Cloud Native Environment | DigitalOcean | AWS | HCP Consul on AWS and Azure | | Kong Konnect |
Mesh Expansion (extension of the mesh by containers/VMs outside the cluster) | yes | no | yes, within AWS | yes | no | yes | yes
Multi-Cluster Mesh (control and observe multiple clusters) | yes | yes | yes, through AWS Cloud Map | yes | no | yes | yes
Monitoring Features
Service Log Collection | no | no | no, use AWS FireLens for ECS and Fargate instead | no | no | no | no
Access Log Generation | yes | yes and tap feature | yes | yes | yes | yes | yes, via Proxy injection
"Golden Signal" Metrics Generation | yes | yes | yes | yes, depending on the proxy used | yes | yes | yes, L7 metrics via Proxy injection
Integrated, pre-configured Prometheus | yes | yes, in an extension | no | yes, for non-prod environments | yes | yes | yes
Integrated, pre-configured Grafana | yes | yes, in an extension | no | no | yes | yes, including a datasource | yes
Per-Route Metrics (collect values for each HTTP endpoint individually) experimental yes depending on the proxy used no no yes, via Proxy injection
Dashboard | yes, Kiali | yes | yes, AWS Cloud Watch | yes | no | yes, with a service topology map in grafana | yes, Hubble
Compatible Tracing-Backends | Jaeger, Zipkin, Solarwinds | all Backends supporting OpenTelemetry | AWS X-Ray, Jaeger and DataDog | Datadog, Jaeger, Zipkin, OpenTracing, Honeycomb | Jaeger | Jaeger, Zipkin, Datadog, and OpenTelemetry | OpenTelemetry via hubble-otel
Integrated, pre-configured Tracing-Backends | yes, Jaeger or Zipkin for nonprod environments | Jaeger, in an extension | yes, AWS X-Ray | yes | yes, Jaeger | yes, Jaeger | no
Routing Features
Load Balancing | yes (Round Robin, Random, Weighted, Least Request) | yes (EWMA, exponentially weighted moving average) | yes | yes (Round Robin, Random, Weighted, Least Request, Ring Hash, Maglev) | yes | yes (Round Robin, Least Request, Ring Hash, Random, Maglev) | yes
Percentage-based Traffic Splits | yes | yes, through Gateway API or SMI | yes | yes | yes, through SMI | yes | via manual configuration of Envoy proxy
Header- and Path-based Traffic Splits (routing rules based on request header and path) | yes | yes, through Gateway API | yes | yes | no | yes | via manual configuration of Envoy proxy
Resilience Features
Circuit Breaking | yes | yes | yes | yes | yes | yes | via manual configuration of Envoy proxy
Retry & Timeout | yes | yes | yes | yes | yes | Retry and Timeout | via manual configuration of Envoy proxy
Path- & Method-based Retry & Timeout (different retry and timeout config for each endpoint) | yes | yes | yes | yes | no | yes, Retry and Timeout (with targetRef: MeshHTTPRoute) | no
Fault Injection yes yes, by adding a deployment and a traffic split config yes no yes no
Delay Injection yes no yes no yes no
Security Features
mTLS | yes | yes, on by default | yes | yes | no | yes | yes, with manually created certs
mTLS Enforcement | yes | yes | yes, via client policies | yes | no | yes | yes
mTLS Permissive Mode yes yes yes no yes
mTLS by default | yes, permissive mode | yes, permissive mode | no | yes | no | yes, optional | no
External CA (certificate and key pluggable, e.g. Vault, cert-manager) | yes, CA cert pluggable and CA integration (experimental), SPIRE Integration | yes | yes | yes, HashiCorp Vault, ACM Private CA, custom CA | no | yes | yes
Service-to-Service Authorization Rules | yes, including External Authorization | yes | no, but support for IAM for user-authorization | yes | no | yes | yes
*Might be possible through manual configuration/templating of proxy

Found a mistake? Or have something to add? We appreciate your issues or pull requests on GitHub!




Alternatives to Service Meshes

Undoubtedly, the service mesh is a useful pattern, and some current implementations are very promising. But they also come with challenges such as cognitive and technical complexity. Like any tool, they are not useful in every situation. Sometimes it might be wise to keep existing, well-known "boring" technology or to go with alternative solutions.

Libraries

Libraries are included in the microservices. The drawbacks are dependencies on specific technologies/languages, potential inconsistency in implementations and missing separation of service infrastructure and business logic.

However, developer productivity can (at least in the short term) be higher thanks to the familiar use of libraries. Also, sometimes domain knowledge is needed, for example to configure the fallback for a circuit breaker or to define business metrics. In these cases, a service mesh is of no use.
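The circuit-breaker fallback is a good illustration of where domain knowledge comes in. Below is a minimal sketch of a library-style circuit breaker in which the calling service supplies the fallback. The class `CircuitBreaker`, its `failure_threshold` parameter, and the recommendation example are hypothetical and not taken from any particular library. The sketch is deliberately simplified: once open, the breaker never closes again (no half-open state or reset timeout).

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Simplified circuit breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold: int) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Run the operation, or the fallback if it fails or the breaker is open."""
        if self.is_open:
            # Short-circuit: do not even try to reach the failing service.
            return fallback()
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            return fallback()
        self.consecutive_failures = 0
        return result


def fetch_personal_recommendations() -> List[str]:
    """Stands in for a remote call that currently fails."""
    raise TimeoutError("recommendation service unavailable")


def bestsellers() -> List[str]:
    """Domain decision: show generic bestsellers instead of personal picks."""
    return ["bestseller-1", "bestseller-2"]
```

A proxy can detect the failure and stop sending traffic, but it cannot know that bestsellers are an acceptable substitute for personal recommendations; that decision has to live in the service code.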

Service meshes require a change to the infrastructure. So it is not possible to use them if the infrastructure cannot or should not be changed. Sometimes the risk of changing the infrastructure is deemed too high, even though service meshes can be applied to specific services only.

No (synchronous) Microservices

Service meshes are particularly helpful for synchronous communication. They usually rely on the HTTP protocol to transfer additional information and, for example, to understand whether a call failed.
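To illustrate how a proxy can use HTTP semantics to judge whether a call failed, here is a minimal sketch of a retry decision based on the response status code. The set of retriable codes and the function name `call_with_retries` are assumptions made for this example; which responses a real mesh retries depends on its configuration.

```python
from typing import Callable, Tuple

# Assumption for this sketch: only these gateway-style errors are retried.
RETRIABLE_STATUS_CODES = frozenset({502, 503, 504})

Response = Tuple[int, str]  # (HTTP status code, body)


def call_with_retries(send: Callable[[], Response], max_retries: int) -> Response:
    """Send a request and retry it while the status code signals a
    transient failure, up to max_retries additional attempts."""
    response = send()
    for _ in range(max_retries):
        if response[0] not in RETRIABLE_STATUS_CODES:
            break
        response = send()
    return response
```

The proxy needs no knowledge of the payload for this: the status code alone tells it that a 503 may be worth retrying while a 404 is not.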

One of the reasons for adopting microservices is their potential to reduce the time-to-market for software. Despite several drawbacks such as high latency and tight coupling, it's a common practice to implement microservice communication synchronously.

However, it is often overlooked that there are other approaches to microservice communication, or even ways to avoid dependencies in the first place. (Read more in the free Microservices Recipes Book.) Patterns like SCS and asynchronous communication aim to mitigate many problems of classic (synchronously communicating) microservices. Of course, you can have asynchronous microservices with HTTP, e.g. by polling a feed for new events. As service meshes rely on HTTP, they would still be of some use. However, features such as those for resilience are of less use, as asynchronous communication supports resilience anyway.
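Polling a feed for new events can be sketched as follows. This is an in-memory illustration only: `EventFeed` stands in for an HTTP endpoint (a real one might serve Atom or JSON), and all class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Event = Tuple[int, str]  # (sequence number, payload)


@dataclass
class EventFeed:
    """Stand-in for an HTTP feed endpoint published by another service."""
    events: List[Event] = field(default_factory=list)

    def publish(self, payload: str) -> None:
        self.events.append((len(self.events) + 1, payload))

    def events_after(self, last_seen: int) -> List[Event]:
        """What a GET request asking for events newer than last_seen might return."""
        return [event for event in self.events if event[0] > last_seen]


@dataclass
class PollingConsumer:
    """Remembers the last event it processed and asks only for newer ones."""
    feed: EventFeed
    last_seen: int = 0
    processed: List[str] = field(default_factory=list)

    def poll(self) -> int:
        """Fetch and process new events; return how many were handled."""
        new_events = self.feed.events_after(self.last_seen)
        for sequence, payload in new_events:
            self.processed.append(payload)
            self.last_seen = sequence
        return len(new_events)
```

Because the consumer tracks its own position, it can be offline for a while and simply catch up on the next poll. That is the built-in resilience that makes mesh-level retries and circuit breakers less important in this style.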

Unjustifiably, monolithic architectures are often not even considered as a solution. Obviously, service meshes can only help a monolith with communication to other systems but not with internal communication.

Service Mesh Primer

Our free Service Mesh Primer explains the service mesh pattern and features in detail and contains examples for Istio.