Owning a Product on a Kafka Foundation: Lessons from Inheriting an Event Pipeline

Who This Post Is For

(Junior) Engineers inheriting or taking over a system with little or no handover
Teams replacing polling-based integrations with event-driven Kafka architectures
Platform builders who need to define clear ownership boundaries between infrastructure and product teams

A Bit About Me

I’m Takudai Kawasaki, a backend engineer working in the Account Aggregation Division at Money Forward, primarily using Go and Apache Kafka. You can reach me via LinkedIn or X.

The Problem

There is a big difference between using Kafka and owning a product built on a Kafka foundation. I learned this the hard way when I took over the event-driven product, roughly six months into my career as a new grad at Money Forward.

The event pipeline is a Publisher/Subscriber application built on our Kafka infrastructure (Amazon Managed Streaming for Apache Kafka, or Amazon MSK) for the Aggregation domain.

It was my first time owning a product end-to-end: from inheriting a partially built system with no handover, to defining its boundaries, onboarding new product teams, and taking full responsibility for its reliability.

This post is both a technical overview of the pipeline’s architecture and a reflection on what it means to own a product you didn’t build. It’s the kind of document I wish I’d had when I started.

By the time I joined the project, the engineers who had originally built it had already moved on. With no formal handover and limited documentation, understanding the original design intent was extremely difficult. Comprehending the existing system felt like assembling a puzzle with missing pieces: scattered code, implicit decisions, and countless unknowns. Before I could move anything forward, I had to piece together what existed and why — and that turned out to be the critical first step.

The Context: From Polling to Push-Based Events

Money Forward’s backend has historically been a large monolithic application. The company has been gradually decomposing it into loosely coupled, domain-oriented services (for more background, see the Tech Day 2024 talk on this initiative). During this transition, two key actors are relevant to this story:

Aggregation Service — the upstream system that manages users’ linked financial accounts and their asset data (balances, transactions, etc.).
Downstream product teams — multiple product teams (e.g., household budgeting, accounting, expense management) that each need to know when a user’s asset data changes so they can update their own features.

Before (the polling problem): Each product team independently polled the Aggregation Service’s database to detect asset data changes. With N products, this meant N separate polling implementations — each team writing its own change-detection logic, each hitting the same database on its own schedule. This duplicated engineering effort across every team and added unnecessary read load to the upstream database.

After (the event pipeline): The event pipeline replaces these N polling implementations with a single push-based model. It detects asset data changes once in the Aggregation Service and broadcasts notifications to all subscribers via Kafka. Each product team simply subscribes to its own topic — no polling, no duplicated detection logic.

How the Event Pipeline Works

At a high level, the event pipeline is responsible for publishing aggregation event notifications using Kafka. It consists of four main components:

Note: Topic names like publisher-topic and {product}-subscriber-topic are examples used throughout this post, not the actual topic names.

Publisher: A shared application that retrieves aggregation change events and publishes them to the central Kafka topic (publisher-topic). It processes a list of users whose assets have been updated.
Subscriber: A product-specific application managed by our team that subscribes to events from the central topic, enriches them, and publishes to product-specific output topics (e.g., {product}-subscriber-topic). Each subscriber serves a specific product — there are currently multiple active subscribers.Product teams then build their own consumers to read from these output topics.
Aggregation Service: The upstream service that manages account assets.
Kafka Cluster Platform: The managed Apache Kafka infrastructure (Amazon MSK) provided by AWS.

Publisher and Subscriber Flow

Publisher Responsibilities

Implements application-level Change Data Capture (CDC). Unlike log-based CDC (e.g., Debezium), the publisher detects changes by querying the Aggregation Service for users whose assets have changed, using a sequence-based cursor to track its position.
Publishes lightweight events (user_id, sequence number, timestamp) to the publisher-topic Kafka topic — just enough to signal that a change occurred, not what changed.
Writes to Kafka using Kafka’s transactional API and persists state to the database to ensure data integrity (no lost or duplicate change signals).

Subscriber Responsibilities (owned by the pipeline team)

Consumes the lightweight publisher messages from the central topic.
Filters users by product relevance (each subscriber only processes users relevant to its product).
Determines the change window (using timestamps and sequence numbers). It then fetches the actual changed asset data from the Aggregation Service.
Publishes enriched messages to a product-specific output topic ({product}-subscriber-topic). These messages contain time windows, sequence ranges, and lists of changed assets — giving downstream products everything they need in a single event.

Product teams consume from these output topics with their own consumers and downstream business logic.

A deliberate two-stage message design separates concerns:

Stage	Topic	Content	Purpose
Publisher	`publisher-topic`	User identifier, sequence number, timestamp	Signal that a change occurred
Subscriber	`{product}-subscriber-topic`	Time windows, sequence ranges, full asset data	Deliver actionable change data

A key design decision was grouping changes by user. In Money Forward, a single user can link multiple bank accounts, credit cards, and other financial services. When assets change, downstream products often need all related changes for that user at once. For example, a household budgeting product needs every updated account balance to recalculate a user’s total assets, and an expense management product needs all new transactions across accounts to categorize spending correctly.

If changes were delivered per-account instead, each product would have to collect and correlate scattered events back to the same user, handling timing gaps when multiple accounts update at different times. The subscriber avoids this by grouping all of a user’s changes into a single message, giving downstream products a complete, consistent view of what changed.

Ordering and Correctness

For financial data, order is critical. We set each message key to a hashed user_id, giving every user consistent partition routing so their events are always processed in sequence. We optimized for correctness over raw throughput.

// Idempotent processing pattern
func (s *Subscriber) Handle(ctx context.Context, msg Message) error {
    lastProcessed := s.repo.GetLastProcessed(msg.UserID)
    if msg.Sequence <= lastProcessed.Sequence {
        return nil // Already processed, skip
    }
    // Process and update state atomically
    return s.processAndCommit(ctx, msg)
}

By tracking the last processed sequence per product, the subscriber can safely skip duplicate messages.

Reliability and Observability

To guarantee message delivery and durability, we tuned the following producer and consumer settings:

# Producer Settings
required_acks: -1 # All in-sync replicas (ISR) must acknowledge (equivalent to `acks=all`)
max_attempts: 10 # Retry up to 10 times
backoff_max: 1s # Cap retry delay

# Consumer Settings
read_timeout: 60s # Allow for slow consumers

Additional operational configuration:

Retention: retention.ms = 1209600000 (two weeks), giving us a substantial buffer for recovery or debugging.
Structured labels: Every pipeline component is tagged with labels that feed into our monitoring platform, enabling per-service dashboards, alerting, and Service Level Objective (SLO) tracking without manual configuration.
Infrastructure-driven configuration: Key runtime parameters like partition count are injected via deployment manifests rather than hardcoded, allowing infrastructure changes without code redeployment.
Kafka CLI for debugging: We maintain a dedicated Kafka CLI pod in the cluster, pre-configured with Amazon MSK credentials and SSL certificates, giving the team immediate access to tools like kafka-console-consumer and kafka-topics without local tooling or VPN tunnels.

What we monitor:

Consumer lag per subscriber
Message throughput rates
Error rates on Aggregation Service API calls
Database connection health
Pod restart counts

Defining the Product’s Boundaries

When inheriting a system, the hardest part was not writing code — it was figuring out where the event-driven product ends and where product teams begin. None of these boundaries were written down, so I had to reconstruct them from code paths, team conversations, and trial and error. This mirrors the company-wide approach to service decomposition: draw clear domain boundaries. The same principle applies at the product integration level in this case.

What the Pipeline Provides vs. What Product Teams Provide

The event pipeline provides:

Publisher service (change detection → central topic)
Product-specific subscribers (e.g., subscriber-{product}) that enrich and republish to output topics
Infrastructure (Kafka Cluster Platform, database, monitoring)

Product teams provide:

Their own consumers that subscribe to product-specific output topics ({product}-subscriber-topic)
Downstream business logic and data processing

Monitoring boundary: The pipeline monitors everything up to and including the output topic. Once a message is published to a product-specific topic, monitoring responsibility shifts to the product team.

The Output Contract

The real contract with product teams is the output topic message schema:

{
  "user_id": "string",
  "time_window": { "from": "ISO8601", "to": "ISO8601" },
  "sequence_range": { "from": "integer", "to": "integer" },
  "accounts": ["string"]
}

Product teams do not need to understand Kafka internals or the subscriber code. They only need to consume from their designated output topic ({product}-subscriber-topic) and deserialize this schema. The output topic message is the contract.

Onboarding Flow

Integrating a new product with the event pipeline follows a structured process:

Intake form: The product team fills out a requirements questionnaire describing what asset data they need.
Kafka access: The team gains access to the Kafka infrastructure.
Development: Our team develops the corresponding subscriber, while the product team develops their consumer for the output topic (using a local tool that simulates the subscriber).
Staging: Our team creates the staging topic and the product team validates end-to-end.
Production: After staging validation, the production topic is created and the subscriber goes live.

Schema Evolution and Compatibility

Current State: JSON Contracts

The JSON-based schema was an inherited design decision — already in use by the time I took over the pipeline. The pipeline currently uses the JSON contract shown above as its output topic schema. Any changes to this schema require careful coordination with product teams, since raw JSON offers no built-in mechanism for backward or forward compatibility. In practice, our team manually communicates schema changes and product teams must add defensive parsing to handle missing or unexpected fields.

While the current JSON approach works, it introduces coordination overhead that motivated our planned migration.

Future Direction: Protobuf + Schema Registry

Money Forward has standardized on Protobuf + Schema Registry for event-driven messaging, and the event pipeline will be aligned with that standard in the future.

Together, Protobuf and Schema Registry address the limitations of raw JSON:

Wire-level compatibility (Protobuf): Unknown fields are silently ignored by older consumers, and new fields receive sensible defaults — our team can evolve the schema without breaking existing product consumers.
Deployment-time enforcement (Schema Registry): Compatibility rules are checked before a schema change reaches production, preventing breaking changes from ever being published.
Developer experience: Product teams get code generation with compile-time guarantees on field names and types, eliminating manual JSON deserialization and defensive parsing.

Uncovering Compliance Gaps

Defining the product’s boundaries also surfaced undocumented compliance obligations. Through conversations with the Aggregation team, it turned out that compliance checks and data privacy requirements had not yet been fully addressed. After escalating to the appropriate stakeholders, development was paused so teams could evaluate alternatives.

The code worked as designed — but the design had never been validated against compliance requirements, and there was no way to confirm it was compliant. Uncovering this put the project on hold, but it was far better than shipping a system that could potentially violate compliance.

What I Learned

Inheriting a product with no handover: Taking over a system mid-development when the previous owners have left is daunting, but it is survivable.
- Do software archaeology: Read every artifact — code, configs, deployment manifests, CI pipelines, Slack threads, decision records. Trace data flows end-to-end, build a dependency map, and write down your findings as you go — this becomes the documentation that didn’t exist.
- Question everything: Don’t assume prior decisions were settled just because work moved forward — whether that’s architectural choices or cross-cutting concerns like compliance and data privacy. Identify the right stakeholders and verify for yourself.
Draw boundaries first: Just as the company-wide decomposition initiative draws boundaries between domain services, we clearly separated infrastructure responsibility (the Kafka Cluster Platform) from application responsibility (the publisher/subscriber layer). Getting explicit agreement on ownership was essential.
- List what your team owns and what integrating teams own — infrastructure, subscribers, output topics, consumers, monitoring — and write it down so both sides can point to it.
Infrastructure is part of the product: Even if a separate team manages infrastructure like the Kafka Cluster Platform, we still need to understand configurations like acks, retention policies, and replication factors in the world of Kafka — they directly affect our application’s correctness and reliability.
Make it easy for others to get started: Provide ready-to-use examples, clear documentation, and a step-by-step onboarding process so product teams can integrate without guesswork.