Gremlin: Systematic Resilience Testing of Microservices

Victor Heorhiadi, Shriram Rajagopalan, Hani Jamjoom, Michael K. Reiter, and Vyas Sekar
IEEE Conference on Distributed Computing Systems (ICDCS)
Nara, Japan, June 2016
Abstract. Modern Internet applications are being disaggregated into a microservice-based architecture, with services being updated and deployed hundreds of times a day. The accelerated software life cycle and heterogeneity of language runtimes in a single application necessitate a new approach for testing the resiliency of these applications in production infrastructures. We present Gremlin, a framework for systematically testing the failure handling capabilities of microservices. Gremlin is based on the observation that microservices are loosely coupled and thus rely on standard message exchange patterns over the network. Gremlin's centralized control plane allows the operator to easily design tests, while the generic data plane manipulates inter-service messages at the network layer to execute these tests. We show how to use Gremlin to express common failure scenarios and how developers of an enterprise application were able to discover previously unknown bugs in their failure handling code without modifying the application.
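The control plane / data plane split described in the abstract can be illustrated with a minimal Python sketch. This is not Gremlin's actual API: the names Message, Rule, FaultProxy, and run_test are hypothetical and chosen only for illustration. The sketch assumes a proxy that sits between two services, matches messages by source and destination, and either aborts or delays them, so the application itself is never modified.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Message:
    source: str
    destination: str
    body: str


@dataclass
class Rule:
    """Hypothetical fault rule: which messages to match and what to do."""
    source: str
    destination: str
    action: str            # "abort" or "delay"
    status: int = 503      # returned to the caller on "abort"
    delay_s: float = 0.0   # seconds to wait before forwarding on "delay"


class FaultProxy:
    """Stand-in for a data-plane agent placed between two services."""

    def __init__(self) -> None:
        self.rules: List[Rule] = []
        self.log: List[str] = []

    def install(self, rule: Rule) -> None:
        self.rules.append(rule)

    def forward(
        self,
        msg: Message,
        send: Callable[[Message], Tuple[int, str]],
    ) -> Tuple[int, str]:
        """Apply the first matching rule, otherwise pass the message through."""
        for rule in self.rules:
            if (rule.source, rule.destination) != (msg.source, msg.destination):
                continue
            if rule.action == "abort":
                # The destination service never sees the message.
                self.log.append(f"abort {msg.source}->{msg.destination} {rule.status}")
                return (rule.status, "injected failure")
            if rule.action == "delay":
                self.log.append(f"delay {msg.source}->{msg.destination} {rule.delay_s}")
                time.sleep(rule.delay_s)
                break
        return send(msg)


def run_test(proxies: List[FaultProxy], rules: List[Rule]) -> None:
    """Hypothetical control-plane step: push the same rules to every proxy."""
    for proxy in proxies:
        for rule in rules:
            proxy.install(rule)


if __name__ == "__main__":
    proxy = FaultProxy()
    run_test([proxy], [Rule("web", "db", "abort", status=503)])
    backend = lambda m: (200, "ok")  # pretend downstream service
    print(proxy.forward(Message("web", "db", "query"), backend))
    print(proxy.forward(Message("web", "cache", "query"), backend))
```

In this sketch the operator only writes Rule objects; the caller's failure handling is then judged by how it reacts to the injected status code or delay, and the proxy log provides the observations a test would assert on.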
Keywords. Failure Injection, Testing, Microservices, DevOps, Cloud
author = {Victor Heorhiadi and Shriram Rajagopalan and Hani Jamjoom and Michael K. Reiter and Vyas Sekar},
title = {{Gremlin: Systematic Resilience Testing of Microservices}},
booktitle = {IEEE Conference on Distributed Computing Systems (ICDCS)},
address = {Nara, Japan},
month = {June},
year = {2016}