What are health checks, and which health checks does XIM use?

Introduction

Hello, I’m Komuro Kohei, a software engineer at Money Forward, Inc.

I primarily work on X-Insight-Marketing (XIM), a financial data analysis platform. XIM enables businesses to perform detailed customer profiling and significantly enhance their sales and marketing efforts. When I researched health checks, I found very few articles explaining how they are actually implemented in real-world products. In this post, I will share how one of Money Forward’s products handles health checks, offering practical insights from our experience.

Contents

  1. What are health checks?
  2. Health checks used by XIM
  3. How to reduce health check logs in Datadog

Target

  • Someone who wants to learn about health checks
  • Someone interested in understanding XIM’s health checks
  • Someone looking for methods to reduce health check logs

1. What are health checks?

Health checks are used to confirm, from outside the server, whether a service running on it can execute its tasks successfully.

Benefits of Using Health Checks

  1. Early Detection of Issues
    • Continuous health checks allow us to catch problems immediately
  2. Automatic Recovery
    • If a problem occurs, Kubernetes can automatically recover from it (for example, by restarting a failed container)
  3. Optimized Load Balancing
    • Load Balancer can route traffic only to healthy instances
  4. Increased Operational Efficiency
    • Regular automated health checks reduce the need for manual system monitoring
  5. Enhanced Reliability
  6. Monitoring and Alerts

and so on.

2. Health checks used by XIM

We use three types of health checks.

(Chart: the number of logs for each health check over a 15-minute period.)

  1. By Kubernetes Probes
  2. By AWS ELB
  3. By Datadog

By Kubernetes Probes


In Kubernetes, the kubelet performs health checks on containers (these requests show up in logs with the kube-probe user agent). This makes it possible to monitor the health of applications within the cluster and respond to problems automatically.

The main objective is to monitor the internal state of the container and ensure its health by restarting the container or controlling traffic.

There are three main types of health check probes in Kubernetes:

  1. Startup Probe
  2. Readiness Probe
  3. Liveness Probe

Startup Probe

  • Confirms whether the container application has completed its startup
  • Until the Startup Probe succeeds, checks by the Liveness Probe and Readiness Probe are disabled to avoid interfering with the application’s startup
  • Prevents slow-starting containers from being forcefully terminated by the kubelet before they are ready

Readiness Probe

  • Checks whether the container is in a state to accept traffic
  • The Pod is marked ready only when all containers within it pass their readiness checks
    • “Ready” can include the readiness of external services the application depends on
  • If not ready, the Pod is removed from load balancing

Liveness Probe

  • Checks whether the container needs to be restarted
  • Can detect deadlock states where the application is running but cannot continue processing

Additionally, there are three methods to execute probes:

  1. HTTP Request
  2. tcpSocket
  3. Command Execution

HTTP Request

  • Typically, a GET request is made to the application’s /healthz endpoint to check for a successful response
  • Can simultaneously check many dependencies, not just the application server itself (see the sketch below)
  • Makes it easy to obtain logs when something is abnormal
  • Consumes CPU resources, since the application must process each request
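
For illustration, an HTTP health check endpoint in Go might look like the following. This is only a sketch, not XIM’s actual code; the Echo framework usage, the /healthz path, the port, and the database wiring are all assumptions.

```go
package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"

	"github.com/labstack/echo/v4"
	_ "github.com/lib/pq" // hypothetical database driver
)

// newHealthzHandler returns 200 only if the server and its dependencies
// (here, the database) are healthy, so one probe covers several components.
func newHealthzHandler(db *sql.DB) echo.HandlerFunc {
	return func(c echo.Context) error {
		ctx, cancel := context.WithTimeout(c.Request().Context(), 2*time.Second)
		defer cancel()

		// Dependency check: fail the probe if the database is unreachable.
		if err := db.PingContext(ctx); err != nil {
			c.Logger().Errorf("healthz: database unreachable: %v", err)
			return c.NoContent(http.StatusServiceUnavailable)
		}
		return c.NoContent(http.StatusOK)
	}
}

func main() {
	// Hypothetical DSN for illustration only.
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		panic(err)
	}

	e := echo.New()
	e.GET("/healthz", newHealthzHandler(db))
	e.Logger.Fatal(e.Start(":8080"))
}
```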

tcpSocket

  • Attempts to open a socket to the specified container and considers it healthy if a connection can be established
  • Since it only checks for connection establishment, it is difficult to obtain internal information

Command Execution

  • Executes a specific command
    • Example: cat /tmp/healthy
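
To make these mechanisms concrete, here is a hypothetical Pod spec (not XIM’s actual manifest) that uses each mechanism once. Any mechanism can back any probe type; the image name, port, paths, and timing values are all assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: example/app:latest      # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:                  # Command Execution
        exec:
          command: ["cat", "/tmp/healthy"]
        failureThreshold: 30
        periodSeconds: 10
      readinessProbe:                # HTTP Request
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
      livenessProbe:                 # tcpSocket
        tcpSocket:
          port: 8080
        periodSeconds: 20
```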

In the XIM provider server, we use the following (a configuration sketch follows this list):

  • Readiness Probe by HTTP Request
    • The role of the Readiness Probe is to check whether the container can accept traffic
    • We check dependencies such as the database to ensure there are no issues handling traffic
  • Liveness Probe by tcpSocket
    • The role of the Liveness Probe is to check whether the container needs to restart
      • Case: server crash
    • An HTTP Request would also work here, but we adopted tcpSocket to reduce the logs sent to Datadog and to save resources (CPU/memory)
    • Deadlocks rarely occur, and when they do, they are mitigated by the caller’s request timeout, so a restart is not necessary. Restarting should be avoided except when the server crashes, since the container cannot respond to requests while it restarts.
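
As a minimal sketch, the probe portion of such a container spec could look like this (the port, path, and timing values are assumptions, not our production settings):

```yaml
readinessProbe:
  httpGet:
    path: /healthz   # HTTP check also exercises dependencies such as the DB
    port: 8080
  periodSeconds: 10
livenessProbe:
  tcpSocket:         # connection check only: fewer Datadog logs, less CPU/memory
    port: 8080
  periodSeconds: 20
```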

By AWS ELB

We use ELB-HealthChecker.

The main objective is to identify healthy backend targets so that user requests are distributed only to them, maintaining system availability.
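
For teams managing the ALB from Kubernetes with the AWS Load Balancer Controller, the ELB health check can be tuned through Ingress annotations. A hypothetical fragment (the name, path, and values are illustrative, not XIM’s settings):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: xim-provider-ingress   # hypothetical name
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /healthz
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
    alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"
# (spec omitted)
```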

By Datadog

We use a Datadog Synthetics API test.

The main objective is to ensure that the entire application is functioning correctly by sending requests to the health check endpoint through the ALB.

Summary (architecture figure)

3. How to reduce health check logs in Datadog

This time, we focused on the health checks performed by Kubernetes probes, which we wanted to minimize the most.

How does Datadog catch logs?

The answer is the Datadog Agent: it collects logs from each container’s standard output and forwards them to Datadog.

(ref: https://moneyforward.kibe.la/notes/292494)

So we realized that it is important not to write to standard output when a health check is successful.

Our options were:

  1. Customize Logger middleware
  2. Add exclusion filters in Datadog

We chose the second option because we don’t require precise control over the logs, and we prefer to avoid making changes to the product code as much as possible.

Customize Logger middleware

Inside Echo’s middleware.Logger(), two steps matter here:

  1. Getting the Buffer:
    • buf := config.pool.Get().(*bytes.Buffer)
    • Here, a buffer is obtained from config.pool.
  2. Writing the Buffer’s Content to Output:
    • _, err = c.Logger().Output().Write(buf.Bytes())
    • This line writes the contents of the buffer to the output of c.Logger().

We should create a custom middleware that skips this Logger middleware when the health check is successful (a sketch follows below).

We can keep these logs out of Datadog simply by not writing them, because the Datadog Agent picks up logs from Kubernetes’ stdout.
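
Here is one possible sketch of such a middleware, assuming an Echo server and a /healthz endpoint (illustrative, not our production code). Successful health check requests bypass the Logger middleware entirely, so nothing is written to stdout, while failed checks still leave an error log:

```go
package main

import (
	"net/http"

	"github.com/labstack/echo/v4"
	"github.com/labstack/echo/v4/middleware"
)

func main() {
	e := echo.New()

	// The stock access logger; we only invoke it for non-health-check paths.
	logger := middleware.Logger()

	e.Use(func(next echo.HandlerFunc) echo.HandlerFunc {
		logged := logger(next) // handler with the access log attached
		return func(c echo.Context) error {
			if c.Path() != "/healthz" { // hypothetical health check path
				return logged(c)
			}
			// Handle the health check without writing an access log to stdout.
			err := next(c)
			if err != nil || c.Response().Status >= http.StatusBadRequest {
				// Failed checks should still leave a trace.
				c.Logger().Errorf("health check failed: status=%d err=%v",
					c.Response().Status, err)
			}
			return err
		}
	})

	e.GET("/healthz", func(c echo.Context) error {
		return c.NoContent(http.StatusOK)
	})

	e.Logger.Fatal(e.Start(":8080"))
}
```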

Add exclusion filters in Datadog

There is another option, which is to add settings in Datadog.

In Datadog’s pricing model, storage costs are incurred based on the amount of log data included in the index. By using Exclusion Filters to exclude specific logs from the index, unnecessary logs are not included in the index, thereby reducing storage costs.

If we want to index these logs again in the future, we can simply delete the filters.

We wanted to reduce successful health check logs, so we added a status:info query to the filter.
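
As a rough illustration, such an exclusion filter query could be scoped like this (the service name and path attribute are assumptions; our actual filter simply used status:info):

```
status:info service:xim-provider @http.url_details.path:/healthz
```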

We’ve successfully slashed the number of health check success logs! Mission accomplished!
