What is a Service Mesh?
A service mesh is a dedicated infrastructure layer that adds features to the network between services. It lets you control traffic and gain insights throughout the system. Observability, traffic shifting (for canary releases), resiliency features (such as circuit breaking and retries/timeouts), and automatic mutual TLS can be configured once and enforced in a decentralized fashion. In contrast to libraries, which are used for similar functionality, a service mesh does not require code changes. Instead, it adds a layer of additional containers that implement the features reliably and independently of technology or programming language.
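To illustrate the contrast with libraries, here is a minimal sketch of what a retry looks like when it lives in application code. The helper name `call_with_retry` and its parameters are hypothetical and not taken from any particular library. With a service mesh, this kind of logic would typically move out of the code and into the proxy configuration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(operation: Callable[[], T], attempts: int = 3,
                    backoff_seconds: float = 0.0) -> T:
    """Call `operation`, retrying on any exception up to `attempts` times.

    Hypothetical helper: every service would need code like this (or a
    library providing it) if no mesh handled retries on its behalf.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception | None = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < attempts:
                # Linear backoff before the next attempt.
                time.sleep(backoff_seconds * attempt)
    assert last_error is not None
    raise last_error
```

Each programming language used in the system needs its own version of such a helper, which is the duplication a mesh aims to avoid.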
Service Mesh Architecture
Who Needs a Service Mesh?
The value of a service mesh grows with the number of services an application consists of. Logically, microservices architectures are the most common use case for a service mesh. However, how the services interact may matter more than their number when judging how much a service mesh can improve their control, reliability, security, and observability. Even a monolith could benefit from a service mesh, and some microservice applications might not.
Service Mesh Implementations
How to Choose a Service Mesh Implementation
While service meshes have no impact on the code, they change operations procedures and require familiarization with new concepts and technology.
So, especially until the Service Mesh Interface is widely supported, adopting a service mesh implementation is a long-term decision.
Therefore, the implementations should be compared and tested carefully in advance.
Choosing the most flexible service mesh with the most features seems logical at first.
But the opposite could be true, because features and flexibility are often paid for with cognitive and technical complexity.
The goal of the evaluation is to figure out which features are important to you and how you benefit from them. As service meshes affect latency and resource consumption, these costs have to be measured, too.
We recommend including the following steps in your decision process:
- As a team, identify the most important problems to be solved by the service mesh. Keep in mind that libraries or adaptations to the architecture could be good alternatives in some cases (see below).
- Discuss your requirements regarding simplicity/usability, performance, and compatibility.
- Identify your top two or three implementations based on your feature and non-feature requirements. The table below should assist you.
- Try the implementations by executing their respective tutorials (links below) and possibly discard one candidate.
- Test the latency and resource overhead for your individual application. For each service mesh candidate, set up an identical test environment and install the service mesh. Set up an additional mesh-free environment. Install your application in all environments. Run a load test in every environment with tools like Locust or Fortio and measure request latency as well as CPU and memory consumption.
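The last step can be sketched as a small script that compares latency samples from each mesh environment against the mesh-free baseline. This is a minimal sketch under assumptions: the candidate names `mesh-a` and `mesh-b` and all sample values are made up, and in practice the samples would come from the output of your load testing tool.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_overhead(baseline_ms, candidates_ms):
    """Return the added p50 and p99 latency per candidate, in milliseconds.

    `baseline_ms` holds samples from the mesh-free environment,
    `candidates_ms` maps a candidate name to its samples.
    """
    result = {}
    for name, samples in candidates_ms.items():
        result[name] = {
            f"p{pct}": percentile(samples, pct) - percentile(baseline_ms, pct)
            for pct in (50, 99)
        }
    return result


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    baseline = [10, 12, 11, 13]
    candidates = {
        "mesh-a": [12, 14, 13, 15],
        "mesh-b": [11, 12, 13, 20],
    }
    for name, delta in latency_overhead(baseline, candidates).items():
        print(name, delta)
```

Looking at a high percentile in addition to the median helps, because proxy overhead often shows up most clearly in the tail. CPU and memory consumption would need a similar comparison.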
Service Mesh Comparison
|Istio||Linkerd 2||AWS App Mesh||Consul||Traefik Mesh (formerly Maesh)||Kuma||Open Service Mesh (OSM)|
|License||Apache License 2.0||Apache License 2.0||Closed Source||Mozilla License||Apache License 2.0||Apache License 2.0||MIT License|
|Developed by||Google, IBM, Lyft||Buoyant||AWS||HashiCorp||Containous||Kong||Microsoft|
|Service Proxy||Envoy||linkerd-proxy||Envoy||defaults to Envoy, exchangeable||Traefik||Envoy||Envoy|
|Ingress Controller||Envoy / Own Concept||any||Envoy and Ambassador in Kubernetes||any||any||Nginx, Azure Application Gateway Ingress Controller|
|Governance||see Istio Community and Open Usage Commons||see Linkerd Governance and CNCF Charter||AWS||see Contributing to Consul||see Contributing notice||see Contributing notice, Governance, and CNCF Charter||see Microsoft OpenSource|
|Tutorial||Istio Tasks||Linkerd Tasks||AWS App Mesh Getting Started||HashiCorp Learn platform||Traefik Mesh Example||Install Kuma on Kubernetes||Install OSM on Kubernetes|
|Used in production||yes||yes|
|Advantages||Istio can be adapted and extended like no other mesh. Its many features are available for Kubernetes and other platforms.||Linkerd 2 is designed to be non-invasive and is optimized for performance and usability. Therefore, it requires little time to adopt.||AWS App Mesh is integrated into the AWS landscape and it is fully managed for you.||Consul service mesh can be used in any Consul environment and therefore does not require a scheduler. The proxy can be changed and extended.||Traefik Mesh focuses on a selection of features to achieve good usability and performance.||Kuma supports both Kubernetes and VMs (including hybrid multi-zone deployments) and allows you to customize the Envoy proxy.||Open Service Mesh is driven by Microsoft and therefore expected to be well integrated with Azure. It also supports the SMI API.|
|Drawbacks||Istio's flexibility can be overwhelming for teams who don't have the capacity for more complex technology. Also, Istio takes control of the ingress controller.||Linkerd 2 is deeply integrated with Kubernetes and cannot be expanded. Since Linkerd 2 does not rely on a third-party proxy, it cannot be extended easily.||AWS App Mesh configuration cannot be migrated to an environment outside AWS.||Consul service mesh can only be used in combination with Consul.||Traefik Mesh currently does not support transparent TLS encryption.||Kuma is possibly the most flexible service mesh. Teams should thoroughly consider whether their project can handle the complexity involved.||Open Service Mesh (OSM) is the latest service mesh implementation and simply too young to be production-ready.|
|Sidecar / Data Plane|
|Automatic Sidecar Injection||yes||yes||yes||yes||yes (per Node)||yes||yes|
|CNI plugin to avoid pod network privileges||yes||yes||yes||no||no||yes||no|
|Platform and Extensibility|
|Platform||Kubernetes||Kubernetes||ECS, Fargate, EKS, EC2||Kubernetes, Nomad, VMs (Universal)||Kubernetes||Kubernetes, VMs (Universal)||Kubernetes|
|Cloud Integrations||Google Cloud, Alibaba Cloud, IBM Cloud||DigitalOcean||AWS||Microsoft Azure|
|Mesh Expansion (extension of the mesh by containers/VMs outside the cluster)||yes||no||yes, within AWS||yes||no||yes|
|Multi-Cluster Mesh (control and observe multiple clusters)||yes||yes||yes||no||yes||no|
|Service Mesh Interface Compatibility|
|Traffic Access Control||yes (unofficial/3rd party support)||no||no||yes||yes||no||yes|
|Traffic Specs||yes (unofficial/3rd party support)||no||no||no||yes||no||yes|
|Traffic Split||yes (unofficial/3rd party support)||yes||no||no||yes||no||yes|
|Traffic Metrics||yes (unofficial/3rd party support)||yes (unofficial/3rd party support)||no||no||no||no||yes|
|Access Log Generation||yes||no (tap feature instead)||yes||yes||yes||yes|
|"Golden Signal" Metrics Generation||yes||yes||yes||yes, depending on the proxy used||yes||no*||yes|
|Integrated, pre-configured Prometheus||yes||yes, option to use own installation||no||no||yes||yes||yes|
|Integrated, pre-configured Grafana||yes||yes||no||no||yes||yes||yes|
|Per-Route Metrics (collect values for each HTTP endpoint individually)||experimental||yes||depending on the proxy used||no|
|Dashboard||yes, Kiali||yes||yes, AWS CloudWatch||yes||no||yes, showing configuration only||no|
|Compatible Tracing-Backends||Jaeger, Zipkin, Solarwinds||all backends supporting OpenCensus||AWS X-Ray||Datadog, Jaeger, Zipkin, OpenTracing, Honeycomb||Jaeger||Jaeger, Zipkin||Jaeger|
|Integrated, pre-configured Tracing-Backends||yes, Jaeger or Zipkin for nonprod environments||yes, Jaeger||yes, AWS X-Ray||no||yes, Jaeger||yes, Jaeger|
|Load Balancing||yes (Round Robin, Random, Weighted, Least Request)||yes (EWMA, exponentially weighted moving average)||yes||yes (Round Robin, Random, Weighted, Least Request, Consistent Hash)||yes||yes (Round Robin, Least Request, Ring Hash, Random, Maglev)||yes|
|Percentage-based Traffic Splits||yes||yes, through SMI||yes||yes||yes, through SMI||yes||yes, through SMI|
|Header- and Path-based Traffic Splits (routing rules based on request header and path)||yes||no||yes||yes||no||no*||Header-based via SMI|
|Retry & Timeout||yes||yes||yes||yes||yes||yes, retry and timeout||no|
|Path- & Method-based Retry & Timeout (different retry and timeout config for each endpoint)||yes||yes||yes||yes||no||no||no|
|Fault Injection||yes||yes, by adding a deployment and a traffic split config||no*||no||yes||no|
|mTLS Enforcement||yes||no||yes, via client policies||yes||no||yes|
|External CA certificate and key pluggable e.g. Vault, cert-manager||yes, CA cert pluggable and CA integration (experimental)||yes||yes||HashiCorp Vault, ACM Private CA, custom CA||no||yes||HashiCorp Vault, cert-manager and Azure Key Vault|
|Service-to-Service Authorization Rules||yes||no||no, but support for IAM for user-authorization||yes||no||yes||yes|
|*Might be possible through manual configuration/templating of proxy|
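The table can also be used programmatically to shortlist candidates. The following is a minimal sketch that encodes three rows of the table above (Service-to-Service Authorization Rules, Path- & Method-based Retry & Timeout, and Integrated, pre-configured Prometheus) as booleans. The feature keys and the `shortlist` function are hypothetical names, qualified answers such as "no, but support for IAM for user-authorization" are simplified to `False`, and the values should be verified against the current documentation of each mesh before relying on them.

```python
# Column order follows the comparison table.
MESHES = [
    "Istio", "Linkerd 2", "AWS App Mesh", "Consul",
    "Traefik Mesh", "Kuma", "Open Service Mesh",
]

# Three table rows, simplified to yes/no (assumption: qualifiers dropped).
FEATURES = {
    "service_to_service_authorization": [True, False, False, True, False, True, True],
    "per_endpoint_retry_timeout": [True, True, True, True, False, False, False],
    "integrated_prometheus": [True, True, False, False, True, True, True],
}


def shortlist(required):
    """Return the meshes that support every feature key in `required`."""
    return [
        mesh
        for index, mesh in enumerate(MESHES)
        if all(FEATURES[feature][index] for feature in required)
    ]


if __name__ == "__main__":
    print(shortlist(["service_to_service_authorization",
                     "per_endpoint_retry_timeout"]))
```

A filter like this only narrows the field. The tutorials and the load test described earlier are still needed to judge usability and overhead.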
Found a mistake? Or have something to add? We appreciate your issues or pull requests on GitHub!
That's just a table.
For advice, training, and support around Kubernetes and service meshes, send an email to firstname.lastname@example.org
Alternatives to Service Meshes
Undoubtedly, the service mesh is a useful pattern and some current implementations are very promising. But they also come with challenges such as cognitive and technical complexity. Like any tool, they are not useful in every situation. Sometimes it might be wise to keep existing, well-known "boring" technology or to go with alternative solutions.
Service Mesh Primer
Our free Service Mesh Primer explains the service mesh pattern and features in detail and contains examples for Istio.