What is a Service Mesh?
A service mesh is a dedicated infrastructure layer that adds features to the network between services. It lets you control traffic and gain insight into the whole system. Observability, traffic shifting (for canary releases), resiliency features (such as circuit breaking and retries/timeouts), and automatic mutual TLS can be configured once and enforced in a decentralized fashion. In contrast to libraries that offer similar functionality, a service mesh does not require code changes. Instead, it adds a layer of additional containers that implement these features reliably and independently of technology or programming language.
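To make traffic shifting more tangible, the following Python sketch mimics what a percentage-based split for a canary release does conceptually. It is only an illustration under simplifying assumptions, not how any particular mesh implements routing; the function `pick_backend` and the service names `reviews-v1` and `reviews-v2` are hypothetical.

```python
import random
from collections import Counter


def pick_backend(weights, rng=random):
    """Pick a backend name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical canary rollout: roughly 90% of requests go to the stable
# version and roughly 10% to the canary.
split = {"reviews-v1": 90, "reviews-v2": 10}

if __name__ == "__main__":
    # Simulate 10,000 requests and count where each one was routed.
    counts = Counter(pick_backend(split) for _ in range(10_000))
    print(counts)
```

In a mesh, this routing decision is made by the mesh's proxies rather than by application code, which is why the split can be changed through configuration alone.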
Who needs a Service Mesh?
The value of a service mesh grows with the number of services an application consists of. Consequently, microservice architectures are the most common use case for a service mesh. However, the way the services interact might matter more than the architecture style when judging how much a service mesh can improve their control, reliability, security, and observability. Even a monolith could benefit from a service mesh, and some microservice applications might not.
Service Mesh Implementations
Service Mesh Interface
Faced with the wide variety of service mesh implementations, a group of companies, including Microsoft, Buoyant (developing Linkerd), and HashiCorp (developing Consul), joined forces to create a common standard for service mesh features. The result, the Service Mesh Interface (SMI) specification, is meant to make tools that build on service mesh features (such as Flagger for canary release automation) compatible with any service mesh rather than binding them to a specific set of implementations. Service mesh users also benefit from being able to change their service mesh implementation without changing the configuration.
How to choose a Service Mesh Implementation
While service meshes have no impact on the code, they change operational procedures and require familiarization with new concepts and technology.
So, especially until the Service Mesh Interface is widely supported, adopting a service mesh implementation is a long-term decision.
Therefore, the implementations should be compared and tested carefully in advance.
Choosing the most flexible service mesh with the most features seems logical at first.
But the opposite could be true, because features and flexibility are often paid for with cognitive and technical complexity.
The goal of the evaluation is to figure out which features are important to you and how you benefit from them. Because service meshes affect latency and resource consumption, these costs have to be measured, too.
We recommend including the following steps in your decision process:
- As a team, identify the most important problems to be solved by the service mesh. Keep in mind that libraries or adaptations to the architecture could be good alternatives in some cases (see below).
- Discuss your requirements regarding simplicity/usability, performance, and compatibility.
- Identify your top two or three implementations based on your feature and non-feature requirements. The table below should assist you.
- Try the implementations by executing their respective tutorials (links below) and possibly discard one candidate.
- Test the latency and resource overhead for your individual application. For each service mesh candidate, set up an identical test environment and install the service mesh. Set up an additional mesh-free environment. Install your application in all environments. Perform a load test in every environment and measure request latency, CPU, and memory consumption using tools like Locust or Fortio.
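The comparison in the last step can be sketched in a few lines of Python. The sketch assumes you have exported request latencies in milliseconds from your load-testing tool for each environment; the function names and the sample numbers are placeholders for illustration, not measurements.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def overhead_report(baseline_ms, candidates_ms, percentiles=(50, 99)):
    """Latency added by each candidate compared with the mesh-free baseline."""
    report = {}
    for name, samples in candidates_ms.items():
        report[name] = {
            f"p{p}": percentile(samples, p) - percentile(baseline_ms, p)
            for p in percentiles
        }
    return report


if __name__ == "__main__":
    # Placeholder latencies in milliseconds, invented for this example.
    baseline = [10, 11, 12, 13, 50]
    candidates = {"mesh-a": [12, 13, 14, 15, 60]}
    print(overhead_report(baseline, candidates))
```

Looking at a high percentile in addition to the median is a deliberate choice here, since proxy overhead often shows up more clearly in tail latency than in the typical request.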
Service Mesh Comparison
|Feature||Istio||Linkerd 2||AWS App Mesh||Consul Connect||Maesh||Kuma|
|License||Apache License 2.0||Apache License 2.0||Closed Source||Mozilla License||Apache License 2.0||Apache License 2.0|
|Developed by||Google, IBM, Lyft||Buoyant||AWS||HashiCorp||Containous||Kong|
|Service Proxy||Envoy||linkerd-proxy||Envoy||defaults to Envoy, exchangeable||Traefik||Envoy|
|Ingress Controller||Envoy / Own Concept||any||any||any||any|
|Governance||see Istio Community||see Linkerd Governance and CNCF Charter||AWS||see Contributing to Consul||see Contributing notice||see Contributing notice|
|Tutorial||Istio Tasks||Linkerd Tasks||AWS App Mesh Getting Started||HashiCorp Learn platform||Maesh Example||Kuma Kubernetes Quickstart|
|Platform||Kubernetes||Kubernetes||ECS, Fargate, EKS, EC2||Kubernetes, Nomad, VMs (Universal)||Kubernetes||Kubernetes, VMs (Universal)|
|Automatic Sidecar Injection||yes||yes||yes||yes||yes (per Node)||yes|
|Used in production||yes||yes|
|Advantages||Istio can be adapted and extended like no other mesh. Its many features are available for Kubernetes and other platforms.||Linkerd 2 is designed to be non-invasive and is optimized for performance and usability. Therefore, it requires little time to adopt.||AWS App Mesh is integrated into the AWS landscape and is fully managed for you.||Consul Connect can be used in any Consul environment and therefore does not require a scheduler. The proxy can be changed and extended.||Maesh focuses on a selection of features to achieve good usability and performance.||Kuma supports both Kubernetes and plain VMs and allows you to customize the Envoy proxy.|
|Drawbacks||Istio's flexibility can be overwhelming for teams who don't have the capacity for more complex technology. Also, Istio takes control of the ingress controller.||Linkerd 2 is deeply integrated with Kubernetes and cannot be expanded. Since Linkerd 2 does not rely on a third-party proxy, it cannot be extended easily.||AWS App Mesh configuration cannot be migrated to an environment outside AWS.||Consul Connect can only be used in combination with Consul.||Maesh currently does not support transparent TLS encryption.||Kuma is still at an early stage, which might be a risk for production.|
|Service Mesh Interface compatibility|
|Traffic Access Control||yes||no||no||yes||yes||no|
|Access Log Generation||yes||no (tap feature instead)||yes||yes||yes||yes|
|“Golden Signal” Metrics Generation||yes||yes||yes||yes, depending on the proxy used||yes||no*|
|Integrated, pre-configured Prometheus||yes||yes||no||no||yes||no|
|Integrated, pre-configured Grafana||yes||yes||no||no||yes||no|
|Per-Route Metrics||no||yes||depending on the proxy used||no|
|Dashboard||yes, Kiali||yes||yes, AWS CloudWatch||yes, showing configuration and availability only||no||yes, showing configuration only|
|Compatible Tracing Backends||Jaeger, Zipkin, Solarwinds||all backends supporting OpenCensus||AWS X-Ray||Datadog, Jaeger, Zipkin, OpenTracing, Honeycomb||Jaeger||all backends supporting OpenTracing|
|Load Balancing||yes (Round Robin, Random, Least Connection)||yes (EWMA, exponentially weighted moving average)||yes||yes||yes||yes|
|Percentage-based Traffic Splits||yes||yes, through SMI||yes||yes||yes||yes|
|Header- and Path-based Traffic Splits||yes||no||yes||yes||no||no*|
|Retry & Timeout||yes||yes||yes||Timeout yes, Retry no*||yes||no*|
|Path-based Retry & Timeout||no||yes||yes||no||no||no|
|Fault Injection||yes||yes, by adding a deployment and a traffic split config||no*||no||no*|
|mTLS||yes||yes, not for TCP||In preview||yes. Optional integration with Vault||no||yes|
*Might be possible through manual configuration/templating of the proxy.
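The EWMA (exponentially weighted moving average) strategy named in the load balancing row can be illustrated with a short Python sketch. This is a deliberately simplified model, not Linkerd's actual algorithm; the class `EwmaBalancer`, the smoothing factor `alpha`, and the endpoint names are assumptions made for the example.

```python
class EwmaBalancer:
    """Route to the endpoint with the lowest smoothed latency (simplified)."""

    def __init__(self, endpoints, alpha=0.3):
        # alpha controls how strongly the newest observation counts.
        self.alpha = alpha
        # Endpoints start at 0.0, so unobserved endpoints are tried first.
        self.latency = {endpoint: 0.0 for endpoint in endpoints}

    def observe(self, endpoint, latency_ms):
        """Fold a new latency measurement into the endpoint's moving average."""
        previous = self.latency[endpoint]
        self.latency[endpoint] = (
            self.alpha * latency_ms + (1 - self.alpha) * previous
        )

    def pick(self):
        """Return the endpoint whose moving average is currently lowest."""
        return min(self.latency, key=self.latency.get)


if __name__ == "__main__":
    balancer = EwmaBalancer(["pod-a", "pod-b"])
    balancer.observe("pod-a", 100)  # pod-a answered slowly
    balancer.observe("pod-b", 10)   # pod-b answered quickly
    print(balancer.pick())
```

Compared with round robin, a strategy like this reacts to how endpoints actually perform, at the cost of keeping per-endpoint state in the proxy.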
Found a mistake? Or have something to add? We appreciate your issues or pull requests on GitHub!
Alternatives to Service Meshes
Undoubtedly, the service mesh is a useful pattern, and some current implementations are very promising. But they also come with challenges such as cognitive and technical complexity. Like any tool, they are not useful in every situation. Sometimes it might be wise to keep existing, well-known "boring" technology or to go with alternative solutions.
Service Mesh Primer
Our free Service Mesh Primer explains the service mesh pattern and features in detail and contains examples for Istio.