What is a Service Mesh?
A service mesh is a dedicated infrastructure layer that adds features to the network between services. It lets you control traffic and gain insights throughout the system. Observability, traffic shifting (for canary releases), resiliency features (such as circuit breaking and retries/timeouts), and automatic mutual TLS can be configured once and enforced in a decentralized fashion. In contrast to libraries, which provide similar functionality, a service mesh does not require code changes. Instead, it adds a layer of additional containers that implement the features reliably and independently of technology or programming language.
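To make the contrast with libraries concrete, the following is a minimal sketch of what circuit breaking looks like when it lives inside the application code. The class name `CircuitBreaker`, its parameters, and the thresholds are hypothetical and chosen for illustration only; they are not taken from any particular library. With a service mesh, equivalent behavior would be configured in the mesh and enforced by the proxies, so code like this would not need to be written or maintained per service and per language.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Minimal in-process circuit breaker (illustrative, not production-grade)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without calling the service.
                raise CircuitOpenError("circuit open, call rejected")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Every service that calls another service would have to wrap its outgoing calls this way, for example `breaker.call(fetch_orders, customer_id)` where `fetch_orders` is a hypothetical client function.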
Service Mesh Architecture
Who Needs a Service Mesh?
The value of a service mesh grows with the number of services an application consists of. Logically, microservices architectures are the most common use case for a service mesh. However, how the services actually interact may matter more than their number when judging how much a service mesh can improve their control, reliability, security, and observability. Even a monolith could benefit from a service mesh, and some microservice applications might not.
Service Mesh Implementations
How to Choose a Service Mesh Implementation
While service meshes have no impact on the code, they change operations procedures and require familiarization with new concepts and technology.
So, especially until the Service Mesh Interface is widely supported, adopting a service mesh implementation is a long-term decision.
Therefore, the implementations should be compared and tested carefully in advance.
Choosing the most flexible service mesh with the most features seems logical at first.
But the opposite could be true, because features and flexibility often come at the cost of cognitive and technical complexity.
The goal of the evaluation is to figure out which features are important to you and how you benefit from them. As service meshes affect latency and resource consumption, these costs have to be measured, too.
We recommend including the following steps in your decision process:
- As a team, identify the most important problems to be solved by the service mesh. Keep in mind that libraries or adaptations to the architecture could be good alternatives in some cases (see below).
- Discuss your requirements regarding simplicity/usability, performance, and compatibility.
- Identify your top two or three implementations based on your feature and non-feature requirements. The table below should assist you.
- Try the implementations by executing their respective tutorials (links below) and possibly discard one candidate.
- Test the latency and resource overhead for your individual application. For each service mesh candidate, set up an identical test environment and install the service mesh. Set up an additional mesh-free environment. Install your application in all environments. Perform a load test in each environment with a tool like Locust or Fortio and measure request latency, CPU, and memory consumption.
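The comparison in the last step can be sketched as a small script that takes the request latencies recorded in the mesh-free environment and in one candidate environment and reports the added latency. This is only an illustration under stated assumptions: the function names are hypothetical, the latency lists are assumed to have been exported from your load testing tool in milliseconds, and the nearest-rank percentile is one of several common definitions.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Median, tail, and mean latency for one environment."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
        "mean": statistics.mean(latencies_ms),
    }


def overhead(baseline_ms, candidate_ms):
    """Added latency per statistic: candidate environment minus baseline."""
    base = summarize(baseline_ms)
    cand = summarize(candidate_ms)
    return {key: cand[key] - base[key] for key in base}


if __name__ == "__main__":
    # Made-up sample values; replace with measurements from your load test.
    mesh_free = [10, 12, 11, 13, 50]
    candidate = [12, 14, 13, 15, 55]
    for stat, delta in overhead(mesh_free, candidate).items():
        print(f"{stat}: +{delta:.1f} ms")
```

Running the same comparison for every candidate against the same mesh-free baseline keeps the results comparable. CPU and memory consumption can be compared in the same way once they are exported as numbers.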
Service Mesh Comparison
|Feature||Istio||Linkerd||AWS App Mesh||Consul||Traefik Mesh (formerly Maesh)||Kuma||Open Service Mesh (OSM)||Cilium|
|License||Apache License 2.0||Apache License 2.0||Closed Source||Mozilla License||Apache License 2.0||Apache License 2.0||Apache License 2.0||Apache License 2.0|
|Initiated by||Google, IBM, Lyft||Buoyant||AWS||HashiCorp||Traefik Labs||Kong||Microsoft||Cilium|
|Service Proxy||Envoy, proxyless for gRPC (experimental)||Linkerd2-proxy||Envoy||defaults to Envoy, exchangeable||Traefik Proxy on each node||Envoy||Envoy||Cilium agent on each node, Proxy Injection option for L7 (Envoy)|
|Ingress Controller||Envoy / Own Concept, support for Kubernetes Gateway API||any||Envoy. Support for Kubernetes Gateway API with Consul API Gateway||any||any||prepared config for Contour, compatible with any other||Cilium Ingress for TLS & path-based routing features, compatible with any other|
|Governance||see Istio Community and Open Usage Commons||see Linkerd Governance and CNCF Charter||AWS||see Contributing to Consul||see Contributing notice||see Contributing notice, Governance, and CNCF Charter||see Contributing notice and CNCF Charter||see Governance|
|Tutorial||Istio Tasks||Linkerd Getting Started Guide||AWS App Mesh Getting Started||HashiCorp Learn platform||Traefik Mesh Example||Install Kuma on Kubernetes||Install OSM on Kubernetes||Cilium Quick Installation|
|Used in production||yes||yes||yes||yes|
|Advantages||Istio can be adapted and extended like no other mesh. Its many features are available for Kubernetes and other platforms.||Linkerd is designed to be non-invasive and is optimized for performance and usability. Therefore, it requires little time to adopt.||AWS App Mesh is integrated into the AWS landscape and is fully managed for you.||Consul service mesh can be used in any Consul environment and therefore does not require a scheduler. The proxy can be changed and extended.||Traefik Mesh focuses on a selection of features to achieve good usability and performance.||Kuma supports both Kubernetes and VMs, including hybrid multi-zone deployments, and scales to many autonomous zones with different network constraints. It also allows you to customize the Envoy proxy.||Open Service Mesh (OSM) is driven by Microsoft and is therefore expected to be well integrated with Azure. It also supports the SMI API.||Cilium takes a different approach to service mesh by making use of eBPF. It therefore doesn't need sidecars at all, which saves complexity and cost.|
|Drawbacks||Istio's flexibility can be overwhelming for teams who don't have the capacity for more complex technology. Also, Istio takes control of the ingress controller.||Linkerd is deeply integrated with Kubernetes and does not currently support non-Kubernetes workloads. It also does not currently support data plane extensions.||AWS App Mesh configuration cannot be migrated to an environment outside AWS.||Consul uses its own internal storage and does not rely on Kubernetes for persistent storage.||Traefik Mesh currently does not support transparent TLS encryption.||Kuma is possibly the most flexible service mesh. Teams should thoroughly consider whether their project can handle the complexity involved.||Open Service Mesh (OSM) is the latest service mesh implementation and simply too young to be production-ready.||To enable the same feature set as service meshes with sidecars, a lot of manual configuration is required.|
|Sidecar / Data Plane|
|Automatic Sidecar Injection||yes||yes||yes||yes||yes (per Node)||yes||yes||yes (per Node)|
|CNI plugin to avoid pod network privileges||yes, in beta||yes||yes||yes||no||yes||no||not necessary|
|Platform and Extensibility|
|Platform||Kubernetes||Kubernetes||ECS, Fargate, EKS, EC2||Kubernetes, Nomad, VMs, ECS, Lambda||Kubernetes||Kubernetes, VMs, ECS||Kubernetes||Kubernetes|
|Cloud Integrations||Google Cloud, Alibaba Cloud, IBM Cloud||DigitalOcean||AWS||HCP Consul on AWS and Azure||Microsoft Azure|
|Mesh Expansion (extension of the mesh by containers/VMs outside the cluster)||yes||no||yes, within AWS||yes||no||yes||no||yes|
|Multi-Cluster Mesh (control and observe multiple clusters)||yes||yes||yes||no||yes||planned||yes|
|Service Mesh Interface Compatibility|
|Traffic Access Control||yes (unofficial/3rd party support)||no||no||yes||yes||no||yes||no|
|Traffic Specs||yes (unofficial/3rd party support)||no||no||no||yes||no||yes||no|
|Traffic Split||yes (unofficial/3rd party support)||yes||no||no||yes||no||yes||no|
|Traffic Metrics||yes (unofficial/3rd party support)||yes (unofficial/3rd party support)||no||no||no||no||yes||no|
|Service Log Collection||no||no||no, use AWS FireLens for ECS and Fargate instead||no||no||no||yes, using Fluent Bit||no|
|Access Log Generation||yes||no (tap feature instead)||yes||yes||yes||yes||no||yes, via Proxy injection|
|"Golden Signal" Metrics Generation||yes||yes||yes||yes, depending on the proxy used||yes||yes||yes||yes, L7 metrics via Proxy injection|
|Integrated, pre-configured Prometheus||yes||yes, in an extension||no||yes, for non-prod environments||yes||yes||yes||yes|
|Integrated, pre-configured Grafana||yes||yes, in an extension||no||no||yes||yes, including a datasource||yes||yes|
|Per-Route Metrics (collect values for each HTTP endpoint individually)||experimental||yes||depending on the proxy used||no||no||no||yes, via Proxy injection|
|Dashboard||yes, Kiali||yes||yes, AWS CloudWatch||yes||no||yes, with a service topology map in Grafana||no||yes, Hubble|
|Compatible Tracing-Backends||Jaeger, Zipkin, Solarwinds||all Backends supporting OpenTelemetry||AWS X-Ray||Datadog, Jaeger, Zipkin, OpenTracing, Honeycomb||Jaeger||Jaeger, Zipkin, Datadog||Jaeger||OpenTelemetry via hubble-otel|
|Integrated, pre-configured Tracing-Backends||yes, Jaeger or Zipkin for nonprod environments||Jaeger, in an extension||yes, AWS X-Ray||yes||yes, Jaeger||yes, Jaeger||yes (install with flag), Jaeger||no|
|Load Balancing||yes (Round Robin, Random, Weighted, Least Request)||yes (EWMA, exponentially weighted moving average)||yes||yes (Round Robin, Random, Weighted, Least Request, Ring Hash, Maglev)||yes||yes (Round Robin, Least Request, Ring Hash, Random, Maglev)||yes||yes|
|Percentage-based Traffic Splits||yes||yes, through SMI||yes||yes||yes, through SMI||yes||yes, through SMI||via manual configuration of Envoy proxy|
|Header- and Path-based Traffic Splits (routing rules based on request header and path)||yes||planned||yes||yes||no||yes||Header-based via SMI||via manual configuration of Envoy proxy|
|Circuit Breaking||yes||no, planned for 2.12.0||yes||yes||yes||yes||yes||via manual configuration of Envoy proxy|
|Retry & Timeout||yes||yes||yes||yes||yes||yes, retry and timeout||no||via manual configuration of Envoy proxy|
|Path- & Method-based Retry & Timeout (different retry and timeout config for each endpoint)||yes||yes||yes||yes||no||only method-based retry; the rest can be done with proxy templating||no||no|
|Fault Injection||yes||yes, by adding a deployment and a traffic split config||no*||no||yes||no||no|
|mTLS||yes||yes, on by default||yes||yes||no||yes||yes||yes, with manually created certs|
|mTLS Enforcement||yes||yes||yes, via client policies||yes||no||yes||yes, via https://linkerd.io/2.11/features/server-policy/||yes|
|mTLS Permissive Mode||yes||yes||no||no||yes||yes|
|mTLS by default||yes, permissive mode||yes, permissive mode||no||yes||no||no||yes||no|
|External CA certificate and key pluggable e.g. Vault, cert-manager||yes, CA cert pluggable and CA integration (experimental)||yes||yes||yes, HashiCorp Vault, ACM Private CA, custom CA||no||yes||HashiCorp Vault, cert-manager and Azure Key Vault||yes|
|Service-to-Service Authorization Rules||yes||yes||no, but support for IAM for user-authorization||yes||no||yes||yes||yes|
|*Might be possible through manual configuration/templating of proxy|
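To support the shortlisting step, rows of the table above can be encoded as data and filtered by your requirements. The sketch below copies the cell values of only three rows (Platform, Circuit Breaking, mTLS) as they appear in the table; the dictionary layout and the functions `supports` and `shortlist` are hypothetical helpers for illustration. It deliberately treats anything that does not start with "yes" (such as "planned" or "via manual configuration") as unsupported, which is an assumption you may want to relax.

```python
# Cell values copied from the "Platform", "Circuit Breaking", and "mTLS" rows.
MESHES = {
    "Istio": {"platform": "Kubernetes", "circuit_breaking": "yes", "mtls": "yes"},
    "Linkerd": {"platform": "Kubernetes",
                "circuit_breaking": "no, planned for 2.12.0",
                "mtls": "yes, on by default"},
    "AWS App Mesh": {"platform": "ECS, Fargate, EKS, EC2",
                     "circuit_breaking": "yes", "mtls": "yes"},
    "Consul": {"platform": "Kubernetes, Nomad, VMs, ECS, Lambda",
               "circuit_breaking": "yes", "mtls": "yes"},
    "Traefik Mesh": {"platform": "Kubernetes", "circuit_breaking": "yes", "mtls": "no"},
    "Kuma": {"platform": "Kubernetes, VMs, ECS", "circuit_breaking": "yes", "mtls": "yes"},
    "Open Service Mesh (OSM)": {"platform": "Kubernetes",
                                "circuit_breaking": "yes", "mtls": "yes"},
    "Cilium": {"platform": "Kubernetes",
               "circuit_breaking": "via manual configuration of Envoy proxy",
               "mtls": "yes, with manually created certs"},
}


def supports(cell):
    """Count a cell as supported only if its text starts with 'yes'."""
    return cell.strip().lower().startswith("yes")


def shortlist(meshes, platform, required_features):
    """Names of meshes that list the platform and support all required features."""
    result = []
    for name, row in meshes.items():
        platforms = [p.strip() for p in row["platform"].split(",")]
        if platform in platforms and all(supports(row[f]) for f in required_features):
            result.append(name)
    return result


if __name__ == "__main__":
    print(shortlist(MESHES, "Kubernetes", ["circuit_breaking", "mtls"]))
    # prints ['Istio', 'Consul', 'Kuma', 'Open Service Mesh (OSM)']
```

The platform match is a strict string comparison, so AWS App Mesh is not returned for "Kubernetes" even though the table lists EKS for it. A filter like this only narrows the field; the qualitative rows (advantages, drawbacks) still need to be read and discussed.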
Found a mistake? Or have something to add? We appreciate your issues or pull requests on GitHub!
Alternatives to Service Meshes
Undoubtedly, the service mesh is a useful pattern, and some current implementations are very promising. But service meshes also come with challenges such as cognitive and technical complexity. Like any tool, they are not useful in every situation. Sometimes it might be wise to keep existing, well-known "boring" technology or to go with alternative solutions.
Service Mesh Primer
Our free Service Mesh Primer explains the service mesh pattern and features in detail and contains examples for Istio.