I believe we can all agree that microservices, when built right and not just used as a buzzword, are great and make your code and production environment simpler to handle.
Here are some of the clear benefits of a microservice architecture:
- Language agnostic – each service can be written in the language most suitable for the specific task you need from that service.
- Your system is designed to scale with ease – many services with the same purpose can perform the same action concurrently without interrupting one another. So if you are using Docker with auto-discovery, for example, you can automatically scale your system according to your needs.
- If a service crashes for any reason, others can replace it automatically, making your system robust and highly available.
- Failure of one microservice does not affect another microservice, making your system highly decoupled and modular.
Despite all of the benefits listed above, microservices have a very dark side which, if not checked and managed, can wreak havoc on your production system, not only crashing it but also exhausting your cloud resources. Here is an example of such a scenario:
Let’s assume that one of your microservices is responsible for pulling messages from your Kafka queue, processing each message, and inserting it into Elasticsearch. Your service runs on Docker, is fully scaled, can work in parallel with many other services of the same kind, and is managed by an auto-discovery service.
Now let’s assume that you have a bug in your service which only occurs when a certain type of data appears in the pulled messages. That bug causes the service to work slower and slower as time goes by (not due to a memory leak, but because the unexpected data requires heavier calculations), so your service pulls fewer messages each second. And if that’s not enough, your auto-discovery is unable to communicate with this service, because CPU utilization is so high that the handshake with the machine takes too long and fails after a 5-second timeout.
What happens next is not pleasant: your monitoring system will probably detect the backlog in your Kafka and alert your auto scale system, which will start launching more and more microservices on machines whose IPs you have no clue about. This will keep going until you run out of allocated machines or (if you’re smart) reach the limit you’ve pre-defined in your auto scale service. By the time you figure out what’s wrong and rush to shut down the zombie machines from the cloud console, it will probably be too late.
If you’ve seen The Walking Dead, this case is not far from it: you have dozens of zombie machines running hundreds of microservices. The only information you have left is the log records your service wrote, and if you did not insert solid logs, you are probably in a world of pain.
The moral of the story is that it is vital to log your microservices well and to make sure your logs provide all the data you need to control them.
Here are a few rules of thumb to help you prevent catastrophe:
- Define periodic health checks to verify that your microservices work properly. If the results indicate a critical issue, you might want to pull the machines out of service automatically.
- Define a limit on your auto scale. That way, even if every other protection fails, you still don’t wake up to a huge bill.
- Insert a log entry after each important task that your service performs. If that task happens hundreds of times a second, add a log entry after finishing a bulk of these tasks (including the bulk’s metadata, for example, the number of tasks it performed and their outcome).
- Make sure you add a GUID to your host names. That way your services will log with that GUID, allowing you to quickly distinguish between the different hosts and identify the root cause of potential disasters.
- Insert log entries which record the time it took to process each task your microservice performed, and monitor this number. You want to know that your service is slowing down sooner rather than later.
- Do not only log exceptions. Also log error cases which do not result in exceptions, for example, when you receive a message from your Kafka which does not contain the data you expected.
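The logging rules above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a definitive implementation: `process_batch`, `handle`, `required_fields`, and `INSTANCE_ID` are hypothetical names, messages are assumed to be plain dictionaries, and the GUID is generated at startup and appended to the host name rather than set on the machine itself.

```python
import json
import logging
import socket
import time
import uuid

# Assumption: a GUID generated once at startup and appended to the host name,
# so every log line can be traced back to a single instance.
INSTANCE_ID = f"{socket.gethostname()}-{uuid.uuid4()}"

logger = logging.getLogger("message_worker")


def process_batch(messages, handle, required_fields=("id", "body")):
    """Process a batch of dict messages and log one summary entry for the bulk."""
    started = time.perf_counter()
    succeeded = skipped = failed = 0

    for message in messages:
        missing = [field for field in required_fields if field not in message]
        if missing:
            # An error case that raises no exception is still logged.
            logger.warning("host=%s skipped message, missing fields=%s",
                           INSTANCE_ID, missing)
            skipped += 1
            continue
        try:
            handle(message)
            succeeded += 1
        except Exception:
            # logger.exception records the traceback along with the message.
            logger.exception("host=%s failed to process message id=%s",
                             INSTANCE_ID, message["id"])
            failed += 1

    elapsed = time.perf_counter() - started
    # One entry per bulk: task counts, outcomes, and processing time.
    summary = {
        "host": INSTANCE_ID,
        "total": len(messages),
        "succeeded": succeeded,
        "skipped": skipped,
        "failed": failed,
        "elapsed_seconds": round(elapsed, 4),
    }
    logger.info("batch summary %s", json.dumps(summary))
    return summary


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    batch = [{"id": 1, "body": "ok"}, {"id": 2}]  # the second message lacks "body"
    process_batch(batch, handle=lambda message: None)
```

Monitoring `elapsed_seconds` relative to `total` over time is one way to notice that a service is slowing down before the backlog grows.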
To summarize, microservices are an excellent piece of architecture, and if you build them right, this design will give your code flexibility, durability, and modularity. But you need to take into account that a microservice can go rogue, and plan for that day.