What are health checks, and which health checks does XIM use?

Introduction

Hello, I’m Komuro Kohei, a software engineer at Money Forward, Inc.

I primarily work on X-Insight-Marketing (XIM), a financial data analysis platform. XIM enables businesses to perform detailed customer profiling and significantly enhance their sales and marketing efforts. When I researched health checks, I found very few articles explaining how they are actually implemented in real-world products. In this post, I will share how one of Money Forward’s products handles health checks, offering practical insights from our experience.

Contents

  1. What are health checks?
  2. Health checks used by XIM
  3. How to reduce health check logs in Datadog

Target

  • Someone who wants to learn about health checks
  • Someone interested in understanding XIM’s health checks
  • Someone looking for methods to reduce health check logs

1. What are health checks?

Health checks are used to confirm, from outside the server, whether a service running on it can execute its tasks successfully.

Benefits of Using Health Checks

  1. Early Detection of Issues
    • Continuous health checks allow us to catch problems immediately
  2. Automatic Recovery
    • If a problem occurs, Kubernetes can automatically recover from it (for example, by restarting a failed container)
  3. Optimized Load Balancing
    • Load Balancer can route traffic only to healthy instances
  4. Increased Operational Efficiency
    • Regular automated health checks reduce the need for manual system monitoring
  5. Enhanced Reliability
  6. Monitoring and Alerts

and so on.

2. Health checks used by XIM

We use three types of health checks.

(Chart: the number of logs for each health check over a 15-minute period.)

  1. By Kubernetes Probes
  2. By AWS ELB
  3. By Datadog

By Kubernetes Probes


In Kubernetes, the kubelet performs health checks on containers (these requests show up in logs with the kube-probe user agent). This makes it possible to monitor the health of applications within the cluster and respond to problems automatically.

The main objective is to monitor the internal state of the container and ensure its health by restarting the container or controlling traffic.

There are three main types of health check probes in Kubernetes:

  1. Startup Probe
  2. Readiness Probe
  3. Liveness Probe

Startup Probe

  • Confirms whether the container application has completed its startup
  • Until the Startup Probe succeeds, checks by the Liveness Probe and Readiness Probe are disabled to avoid interfering with the application’s startup
  • Prevents slow-starting containers from being forcefully terminated by the kubelet before they are ready

Readiness Probe

  • Checks whether the container is in a state to accept traffic
  • The Pod is marked ready only when all containers within it pass their readiness checks
    • “Ready” can include the readiness of external services the application depends on
  • If not ready, the Pod is removed from load balancing

Liveness Probe

  • Checks whether the container needs to be restarted
  • Can detect deadlock states where the application is running but cannot continue processing

Additionally, there are three methods to execute probes:

  1. HTTP Request
  2. tcpSocket
  3. Command Execution

HTTP Request

  • Typically, a GET request is made to the application’s /healthz endpoint to check for a successful response
  • Can simultaneously check many dependencies, not just the application server itself (see the sketch below)
  • Makes it easy to obtain logs when something is abnormal
  • Consumes CPU resources, since the application must process each request
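
For illustration, an HTTP health check endpoint in Go might look like the following. This is only a sketch, not XIM’s actual code; the Echo framework usage, the /healthz path, the port, and the database wiring are all assumptions.

```go
package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"

	"github.com/labstack/echo/v4"
	_ "github.com/lib/pq" // hypothetical database driver
)

// newHealthzHandler returns 200 only if the server and its dependencies
// (here, the database) are healthy, so one probe covers several components.
func newHealthzHandler(db *sql.DB) echo.HandlerFunc {
	return func(c echo.Context) error {
		ctx, cancel := context.WithTimeout(c.Request().Context(), 2*time.Second)
		defer cancel()

		// Dependency check: fail the probe if the database is unreachable.
		if err := db.PingContext(ctx); err != nil {
			c.Logger().Errorf("healthz: database unreachable: %v", err)
			return c.NoContent(http.StatusServiceUnavailable)
		}
		return c.NoContent(http.StatusOK)
	}
}

func main() {
	// Hypothetical DSN for illustration only.
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		panic(err)
	}

	e := echo.New()
	e.GET("/healthz", newHealthzHandler(db))
	e.Logger.Fatal(e.Start(":8080"))
}
```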

tcpSocket

  • Attempts to open a socket to the specified container and considers it healthy if a connection can be established
  • Since it only checks for connection establishment, it is difficult to obtain internal information

Command Execution

  • Executes a specific command
    • Example: cat /tmp/healthy
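
To make these mechanisms concrete, here is a hypothetical Pod spec (not XIM’s actual manifest) that uses each mechanism once. Any mechanism can back any probe type; the image name, port, paths, and timing values are all assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: example/app:latest      # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:                  # Command Execution
        exec:
          command: ["cat", "/tmp/healthy"]
        failureThreshold: 30
        periodSeconds: 10
      readinessProbe:                # HTTP Request
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
      livenessProbe:                 # tcpSocket
        tcpSocket:
          port: 8080
        periodSeconds: 20
```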

In the XIM provider server, we use the following (a configuration sketch follows this list):

  • Readiness Probe by HTTP Request
    • The role of the Readiness Probe is to check whether the container can accept traffic
    • We check dependencies such as the database to ensure there are no issues handling traffic
  • Liveness Probe by tcpSocket
    • The role of the Liveness Probe is to check whether the container needs to restart
      • Case: server crash
    • An HTTP Request would also work here, but we adopted tcpSocket to reduce the logs sent to Datadog and to save resources (CPU/memory)
    • Deadlocks rarely occur, and when they do, they are mitigated by the caller’s request timeout, so a restart is not necessary. Restarting should be avoided except when the server crashes, since the container cannot respond to requests while it restarts.
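
As a minimal sketch, the probe portion of such a container spec could look like this (the port, path, and timing values are assumptions, not our production settings):

```yaml
readinessProbe:
  httpGet:
    path: /healthz   # HTTP check also exercises dependencies such as the DB
    port: 8080
  periodSeconds: 10
livenessProbe:
  tcpSocket:         # connection check only: fewer Datadog logs, less CPU/memory
    port: 8080
  periodSeconds: 20
```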

By AWS ELB

We use ELB-HealthChecker.

The main objective is to identify healthy backend targets so that user requests are distributed only to them, maintaining system availability.
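
For teams managing the ALB from Kubernetes with the AWS Load Balancer Controller, the ELB health check can be tuned through Ingress annotations. A hypothetical fragment (the name, path, and values are illustrative, not XIM’s settings):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: xim-provider-ingress   # hypothetical name
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /healthz
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
    alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"
# (spec omitted)
```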

By Datadog

We use a Datadog Synthetics API test.

The main objective is to ensure that the entire application is functioning correctly by sending requests to the health check endpoint through the ALB.

Summary (architecture figure)

3. How to reduce health check logs in Datadog

This time, we focused on the health checks performed by Kubernetes probes, which we wanted to minimize the most.

How does Datadog catch logs?

The answer is the Datadog Agent: it collects logs from each container’s standard output and forwards them to Datadog.

(ref: https://moneyforward.kibe.la/notes/292494)

So we realized that it is important not to write to standard output when a health check is successful.

Our options were:

  1. Customize Logger middleware
  2. Add exclusion filters in Datadog

We chose the second option because we don’t require precise control over the logs, and we prefer to avoid making changes to the product code as much as possible.

Customize Logger middleware

Inside Echo’s middleware.Logger(), two steps matter here:

  1. Getting the Buffer:
    • buf := config.pool.Get().(*bytes.Buffer)
    • Here, a buffer is obtained from config.pool.
  2. Writing the Buffer’s Content to Output:
    • _, err = c.Logger().Output().Write(buf.Bytes())
    • This line writes the contents of the buffer to the output of c.Logger().

We should create a custom middleware that skips this Logger middleware when the health check is successful (a sketch follows below).

We can keep these logs out of Datadog simply by not writing them, because the Datadog Agent picks up logs from Kubernetes’ stdout.
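
Here is one possible sketch of such a middleware, assuming an Echo server and a /healthz endpoint (illustrative, not our production code). Successful health check requests bypass the Logger middleware entirely, so nothing is written to stdout, while failed checks still leave an error log:

```go
package main

import (
	"net/http"

	"github.com/labstack/echo/v4"
	"github.com/labstack/echo/v4/middleware"
)

func main() {
	e := echo.New()

	// The stock access logger; we only invoke it for non-health-check paths.
	logger := middleware.Logger()

	e.Use(func(next echo.HandlerFunc) echo.HandlerFunc {
		logged := logger(next) // handler with the access log attached
		return func(c echo.Context) error {
			if c.Path() != "/healthz" { // hypothetical health check path
				return logged(c)
			}
			// Handle the health check without writing an access log to stdout.
			err := next(c)
			if err != nil || c.Response().Status >= http.StatusBadRequest {
				// Failed checks should still leave a trace.
				c.Logger().Errorf("health check failed: status=%d err=%v",
					c.Response().Status, err)
			}
			return err
		}
	})

	e.GET("/healthz", func(c echo.Context) error {
		return c.NoContent(http.StatusOK)
	})

	e.Logger.Fatal(e.Start(":8080"))
}
```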

Add exclusion filters in Datadog

There is another option, which is to add settings in Datadog.

In Datadog’s pricing model, storage costs are incurred based on the amount of log data included in the index. By using Exclusion Filters to exclude specific logs from the index, unnecessary logs are not included in the index, thereby reducing storage costs.

If we want to index these logs again in the future, we can simply delete the filters.

We wanted to reduce successful health check logs, so we added a status:info query to the filter.
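
As a rough illustration, such an exclusion filter query could be scoped like this (the service name and path attribute are assumptions; our actual filter simply used status:info):

```
status:info service:xim-provider @http.url_details.path:/healthz
```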

We’ve successfully slashed the number of health check success logs! Mission accomplished!
