What is a Service Mesh?
A service mesh is a dedicated infrastructure layer that adds features to the network between services. It lets you control traffic and gain insights throughout the system. Observability, traffic shifting (for canary releases), resiliency features (such as circuit breaking and retries/timeouts), and automatic mutual TLS can be configured once and enforced in a decentralized fashion. In contrast to libraries, which provide similar functionality, a service mesh does not require code changes. Instead, it adds a layer of additional containers that implement the features reliably and independently of technology or programming language.
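To make the contrast with libraries concrete, here is a minimal Python sketch of the kind of retry logic a service would otherwise carry in its own code. The function name `call_with_retry`, its parameters, and the backoff values are assumptions made for illustration; they are not taken from any mesh or library. With a service mesh, a comparable policy would be configured in the proxy layer instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    backoff_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on connection errors and timeouts.

    Hypothetical helper: the delay doubles after each failed attempt,
    and the last error is re-raised once `attempts` is exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff before the next attempt.
            sleep(backoff_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Every service written in another language would need its own equivalent of this helper, which is the duplication a mesh avoids by moving the policy out of the application.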
Service Mesh Architecture
Who Needs a Service Mesh?
The value of a service mesh grows with the number of services an application consists of. Microservices architectures are therefore the most common use case for a service mesh. However, the way the services actually interact can matter more than their number when judging how much a service mesh can improve their control, reliability, security, and observability. Even a monolith could benefit from a service mesh, and some microservice applications might not.
Service Mesh Implementations
How to Choose a Service Mesh Implementation
While service meshes have no impact on the code, they change operations procedures and require familiarization with new concepts and technology.
So, especially until the Gateway API for mesh is complete and widely supported, adopting a service mesh implementation is a long-term decision.
Therefore, the implementations should be compared and tested carefully in advance.
Choosing the most flexible service mesh with the most features seems logical at first.
But the opposite could be true, because features and flexibility are often paid for with cognitive and technical complexity.
The goal of the evaluation is to figure out which features are important to you and how you benefit from them. Because service meshes add latency and consume resources, these costs have to be measured, too.
We recommend including the following steps in your decision process:
- As a team, identify the most important problems to be solved by the service mesh. Keep in mind that libraries or adaptations to the architecture could be good alternatives in some cases (see below).
- Discuss your requirements regarding simplicity/usability, performance, and compatibility.
- Identify your top two or three implementations based on your functional and non-functional requirements. The table below should assist you.
- Try the implementations by working through their respective tutorials (links below) and possibly discard one candidate.
- Test the latency and resource overhead for your individual application. For each service mesh candidate, set up an identical test environment and install the service mesh. Set up an additional mesh-free environment. Install your application in all environments. Run a load test in each of them with a tool like Locust or Fortio and measure request latency as well as CPU and memory consumption.
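The comparison in the last step can be sketched as follows. This is a minimal Python example, assuming you have already exported per-request latencies in milliseconds from your load tool for the mesh-free and the meshed environment. The function names and the nearest-rank percentile method are our own choices for illustration, not part of Locust or Fortio.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of `samples` (0 < pct <= 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def latency_overhead(
    baseline_ms: Sequence[float],
    meshed_ms: Sequence[float],
    percentiles: Sequence[int] = (50, 90, 99),
) -> Dict[int, float]:
    """Latency added by the mesh at each percentile, in milliseconds.

    `baseline_ms` comes from the mesh-free environment, `meshed_ms` from
    the environment with the candidate mesh installed.
    """
    return {
        p: percentile(meshed_ms, p) - percentile(baseline_ms, p)
        for p in percentiles
    }
```

Running the same comparison for each candidate gives a like-for-like view of the added median and tail latency. CPU and memory consumption would need to be compared separately from your monitoring data.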
Service Mesh Comparison
|Implementation||Istio||Linkerd||AWS App Mesh||Consul||Traefik Mesh (formerly Maesh)||Kuma||Cilium|
|License||Apache License 2.0||Apache License 2.0||Closed Source||Business Source License||Apache License 2.0||Apache License 2.0||Apache License 2.0|
|Initiated by||Google, IBM, Lyft||Buoyant||AWS||HashiCorp||Traefik Labs||Kong||Cilium|
|Service Proxy||Envoy, Ambient Mesh (alpha)||Linkerd2-proxy||Envoy||defaults to Envoy, exchangeable||Traefik Proxy on each node||Envoy||Cilium agent on each node, Proxy Injection option for L7 (Envoy)|
|Ingress Controller||Envoy, support for Kubernetes Gateway API||not provided||AWS Virtual Gateway, exposable with AWS Elastic Load Balancer/AWS Load Balancer Controller||Envoy. Support for Kubernetes Gateway API with Consul API Gateway||not provided||not provided||Cilium Ingress for TLS & path-based routing features, compatible with any other ingress controller|
|Governance||see Istio Community and CNCF Charter||see Linkerd Governance and CNCF Charter||AWS||see Contributing to Consul||see Contributing notice||see Contributing notice, Governance, and CNCF Charter||see Governance and CNCF Charter|
|Tutorial||Istio Tasks||Linkerd Getting Started Guide||AWS App Mesh Getting Started||HashiCorp Learn platform||Traefik Mesh Example||Install Kuma on Kubernetes||Cilium Quick Installation|
|Used in production||yes||yes||yes||yes||yes|
|Advantages||Istio is a CNCF Graduated service mesh. It can be adapted and extended like no other mesh. Its many features are available for Kubernetes and other platforms.||Linkerd is a CNCF Graduated service mesh. It provides best-in-class operational simplicity, security, and performance, and is extremely easy to adopt and use.||AWS App Mesh is integrated into the AWS landscape and it is fully managed for you.||Consul service mesh can be used in any Consul environment and therefore does not require a scheduler. The proxy can be changed and extended.||Traefik Mesh focuses on a selection of features to achieve good usability and performance.||Kuma supports both Kubernetes and VMs, including hybrid multi-zone deployments, and scales to many autonomous zones with different network constraints. It also allows you to customize the Envoy proxy.||Cilium takes a different approach to service mesh by making use of eBPF, so it doesn't need sidecars at all for some very simple use cases.|
|Drawbacks||Istio's flexibility can be overwhelming for teams who don't have the capacity for more complex technology. Also, Istio takes control of the ingress controller.||Linkerd is deeply integrated with Kubernetes and does not currently support non-Kubernetes workloads. It also does not currently support data plane extensions.||AWS App Mesh configuration cannot be migrated to an environment outside AWS.||Consul uses its own internal storage and does not rely on Kubernetes for persistent storage.||Traefik Mesh currently does not support transparent TLS encryption.||Kuma is possibly the most flexible service mesh. Teams should thoroughly consider whether their project can handle the complexity involved.||Any L7 use cases require the use of an Envoy proxy, which is shared between all workloads on the node and not a recommended security configuration.|
|Sidecar / Data Plane|
|Automatic Sidecar Injection||yes||yes||yes||yes||yes (per Node)||yes||yes (per Node)|
|CNI plugin to avoid pod network privileges||yes, in beta||yes||yes||yes||no||yes||not necessary|
|Platform and Extensibility|
|Platform||Kubernetes, VMs||Kubernetes||ECS, Fargate, EKS, EC2||Kubernetes, Nomad, VMs, ECS, Lambda||Kubernetes||Kubernetes, VMs, ECS||Kubernetes|
|Cloud Integrations||Google Cloud, Alibaba Cloud, IBM Cloud, Microsoft Azure, Huawei Cloud, Red Hat OpenShift, DaoCloud, VMware Tanzu, Tencent Cloud, Baidu AI Cloud, Oracle Cloud Native Environment||DigitalOcean||AWS||HCP Consul on AWS and Azure|
|Mesh Expansion: extension of the mesh by containers/VMs outside the cluster||yes||no||yes, within AWS||yes||no||yes||yes|
|Multi-Cluster Mesh: control and observe multiple clusters||yes||yes||yes, through AWS Cloud Map||yes||no||yes||yes|
|Service Log Collection||no||no||no, use AWS FireLens for ECS and Fargate instead||no||no||no||no|
|Access Log Generation||yes||yes and tap feature||yes||yes||yes||yes||yes, via Proxy injection|
|"Golden Signal" Metrics Generation||yes||yes||yes||yes, depending on the proxy used||yes||yes||yes, L7 metrics via Proxy injection|
|Integrated, pre-configured Prometheus||yes||yes, in an extension||no||yes, for non-prod environments||yes||yes||yes|
|Integrated, pre-configured Grafana||yes||yes, in an extension||no||no||yes||yes, including a datasource||yes|
|Per-Route Metrics: collect values for each HTTP endpoint individually||experimental||yes||depending on the proxy used||no||no||yes, via Proxy injection|
|Dashboard||yes, Kiali||yes||yes, AWS CloudWatch||yes||no||yes, with a service topology map in Grafana||yes, Hubble|
|Compatible Tracing-Backends||Jaeger, Zipkin, SolarWinds||all Backends supporting OpenTelemetry||AWS X-Ray, Jaeger and Datadog||Datadog, Jaeger, Zipkin, OpenTracing, Honeycomb||Jaeger||Jaeger, Zipkin, Datadog||OpenTelemetry via hubble-otel|
|Integrated, pre-configured Tracing-Backends||yes, Jaeger or Zipkin for non-prod environments||Jaeger, in an extension||yes, AWS X-Ray||yes||yes, Jaeger||yes, Jaeger||no|
|Load Balancing||yes (Round Robin, Random, Weighted, Least Request)||yes (EWMA, exponentially weighted moving average)||yes||yes (Round Robin, Random, Weighted, Least Request, Ring Hash, Maglev)||yes||yes (Round Robin, Least Request, Ring Hash, Random, Maglev)||yes|
|Percentage-based Traffic Splits||yes||yes, through Gateway API or SMI||yes||yes||yes, through SMI||yes||via manual configuration of Envoy proxy|
|Header- and Path-based Traffic Splits: routing rules based on request header and path||yes||yes, through Gateway API||yes||yes||no||yes||via manual configuration of Envoy proxy|
|Circuit Breaking||yes||yes||yes||yes||yes||yes||via manual configuration of Envoy proxy|
|Retry & Timeout||yes||yes||yes||yes||yes||yes, retry and timeout||via manual configuration of Envoy proxy|
|Path- & Method-based Retry & Timeout: different retry and timeout config for each endpoint||yes||yes||yes||yes||no||only method-based retry; the rest can be done with proxy templating||no|
|Fault Injection||yes||yes, by adding a deployment and a traffic split config||no*||no||yes||no|
|mTLS||yes||yes, on by default||yes||yes||no||yes||yes, with manually created certs|
|mTLS Enforcement||yes||yes||yes, via client policies||yes||no||yes||yes|
|mTLS Permissive Mode||yes||yes||yes||no||yes|
|mTLS by default||yes, permissive mode||yes, permissive mode||no||yes||no||no||no|
|External CA certificate and key pluggable (e.g. Vault, cert-manager)||yes, CA cert pluggable and CA integration (experimental), SPIRE Integration||yes||yes||yes, HashiCorp Vault, ACM Private CA, custom CA||no||yes||yes|
|Service-to-Service Authorization Rules||yes, including External Authorization||yes||no, but support for IAM for user-authorization||yes||no||yes||yes|
|*Might be possible through manual configuration/templating of proxy|
Found a mistake? Or have something to add? We appreciate your issues or pull requests on GitHub!
Alternatives to Service Meshes
Undoubtedly, the service mesh is a useful pattern and some current implementations are very promising. But service meshes also come with challenges such as cognitive and technical complexity. Like any tool, they are not useful in every situation. Sometimes it might be wise to keep existing, well-known "boring" technology or to go with alternative solutions.
Service Mesh Primer
Our free Service Mesh Primer explains the service mesh pattern and features in detail and contains examples for Istio.