| Modern products run on data. The pipeline’s job is to get the right data to the right place with the right latency and guarantees. Each phase below is organized around engineering choices, product trade-offs, and concrete examples from public case studies.
What matters
Diverse sources (apps, devices, third-party APIs, logs) and formats.
Early schema control and validation to limit downstream breakage.
Cost/latency balance (sampling and delta triggers vs. a constant firehose).
Concrete patterns & examples
Unified event collection for product analytics & ops: Netflix moved from ad-hoc collectors to a unified, company-wide event publishing model as part of its Keystone initiative, explicitly to support both batch and real-time consumers reliably.
Device/edge → cloud triggers: At scale, products reduce chatter by emitting only on change or when a threshold is crossed, then standardizing payloads at the perimeter (a pattern described in several public engineering and platform blogs). Spotify’s move to a cloud-first event system similarly standardized producers and consumers to reduce coupling during its migration.
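The emit-on-change idea can be sketched in a few lines. This is a minimal illustration, not any company's implementation: the `DeltaTrigger` class, the `to_standard_payload` helper, and the payload field names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DeltaTrigger:
    """Emit a reading only when it moved by at least `threshold` since the last emit."""
    threshold: float
    _last_emitted: Dict[str, float] = field(default_factory=dict)

    def should_emit(self, source_id: str, value: float) -> bool:
        last = self._last_emitted.get(source_id)
        # Always emit the first reading; afterwards emit only on a large enough change.
        if last is None or abs(value - last) >= self.threshold:
            self._last_emitted[source_id] = value
            return True
        return False


def to_standard_payload(source_id: str, value: float, ts: float) -> dict:
    # Hypothetical perimeter schema; the field names are illustrative only.
    return {"source_id": source_id, "value": value, "ts": ts, "schema_version": 1}


trigger = DeltaTrigger(threshold=0.5)
readings = [("thermo-1", 20.0), ("thermo-1", 20.1), ("thermo-1", 20.6), ("thermo-1", 20.7)]

emitted: List[dict] = []
for i, (source_id, value) in enumerate(readings):
    if trigger.should_emit(source_id, value):
        emitted.append(to_standard_payload(source_id, value, ts=float(i)))

print([p["value"] for p in emitted])
```

Only two of the four readings leave the device in this run, which is the cost/latency trade the section describes: fewer messages in exchange for coarser resolution between emits.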
Product call-outs
What matters
Backpressure, ordering, idempotence, and replays.
Stream vs. batch: choose per use case; many orgs run both.
Data contracts (schemas) at the boundary.
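A data contract at the boundary can be as simple as a required-field and type check that routes bad events to a dead-letter list instead of letting them flow downstream. The sketch below uses only the standard library; the `PAGE_VIEW_CONTRACT` fields and the `validate` and `route` helpers are assumptions made for illustration, and a real system would more likely use a schema registry or a schema language such as Avro or JSON Schema.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical contract for a "page_view" event: field name -> required type.
PAGE_VIEW_CONTRACT: Dict[str, type] = {
    "event_id": str,
    "user_id": str,
    "url": str,
    "ts_ms": int,
}


def validate(event: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return a list of contract violations (empty means the event is valid)."""
    errors: List[str] = []
    for name, expected in contract.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif type(event[name]) is not expected:  # strict check, so bool is not accepted as int
            actual = type(event[name]).__name__
            errors.append(f"wrong type for {name}: expected {expected.__name__}, got {actual}")
    return errors


def route(
    events: List[Dict[str, Any]], contract: Dict[str, type]
) -> Tuple[List[dict], List[Tuple[dict, List[str]]]]:
    """Split events into accepted ones and (event, errors) pairs for a dead-letter queue."""
    accepted, dead_letter = [], []
    for event in events:
        errors = validate(event, contract)
        if errors:
            dead_letter.append((event, errors))
        else:
            accepted.append(event)
    return accepted, dead_letter


good = {"event_id": "e1", "user_id": "u1", "url": "/home", "ts_ms": 1700000000000}
bad = {"event_id": "e2", "user_id": "u2", "ts_ms": "1700000000000"}
accepted, dead_letter = route([good, bad], PAGE_VIEW_CONTRACT)
```

Keeping rejected events together with their error messages makes it possible to replay them after a producer fix rather than losing them.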
Concrete patterns & examples
Kafka/Kinesis-style event backbones: LinkedIn created Kafka specifically to unify many brittle ingestion paths (logs, page views, messaging) and scale to billions of daily events, establishing stream logs as the integration fabric.
Peak protection: E-commerce and media pipelines rely on buffering and partitioning to absorb bursts (e.g., sales, launches). Shopify describes a real-time Flink pipeline reading merchant events from Kafka topics, with additional dedupe/relay layers to keep ingestion stable under load.
Dual-run migrations: During major changes, keep the old and new ingestion paths compatible and run them in parallel to validate completeness before cutover (Spotify’s cloud event delivery migration is a documented example).
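Replays and at-least-once delivery are only safe if writes are idempotent. The following is a minimal in-memory sketch of that idea, keyed on an `event_id` field; the `IdempotentSink` class and `deliver` helper are hypothetical names, and a production sink would need a durable store and a bound (for example a TTL) on how long IDs are remembered.

```python
from typing import Dict, Iterable


class IdempotentSink:
    """Stores each event at most once, keyed by its event_id."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def write(self, event: dict) -> bool:
        """Return True if the event was new, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self._rows:
            return False  # already applied; a replay must not double-count
        self._rows[event_id] = event
        return True

    def __len__(self) -> int:
        return len(self._rows)


def deliver(sink: IdempotentSink, batch: Iterable[dict]) -> int:
    """Write a batch and return how many events were actually new."""
    return sum(sink.write(event) for event in batch)


sink = IdempotentSink()
batch = [{"event_id": "e1", "sku": "A"}, {"event_id": "e2", "sku": "B"}]

first = deliver(sink, batch)
# Simulate a replay of the same batch plus one genuinely new event.
replayed = deliver(sink, batch + [{"event_id": "e3", "sku": "C"}])
```

After the replay the sink holds three rows rather than five, so the upstream log can be re-read from an earlier offset without corrupting downstream counts.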
Product call-outs
What matters
Fit store to access patterns: warehouse (structured analytics), lake (raw/unstructured, cheap), or lakehouse (hybrid).
Partitioning, compaction, and tiered storage for cost & performance.
Metadata, lineage, and ownership so teams find and trust data.
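To make the partitioning point concrete, here is a small sketch that derives a Hive-style partition path from an event's type and timestamp, so queries filtered by date and hour can skip unrelated files. The `partition_path` function, the directory layout, and the `s3://example-lake/events` base are assumptions for illustration, not a layout taken from any of the case studies.

```python
from datetime import datetime, timezone


def partition_path(base: str, event_type: str, ts_ms: int) -> str:
    """Build a key=value partition path from an epoch-millisecond timestamp (UTC)."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return f"{base}/event_type={event_type}/dt={dt:%Y-%m-%d}/hour={dt:%H}"


# Hypothetical bucket name; any object-store or filesystem prefix would work the same way.
path = partition_path("s3://example-lake/events", "page_view", 1700000000000)
print(path)
```

Partitioning by hour keeps individual partitions small for streaming writers; a periodic compaction job would typically merge the many small files each hour produces.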
Concrete patterns & examples
Lake + warehouse side by side: Netflix documents using Kafka as the transport while landing into systems that serve both batch and real-time cohorts; Keystone’s goal was unified publishing and routing into the right stores.
Cloud-native analytics stack: Spotify’s customer story highlights BigQuery (analytics), Pub/Sub (transport), and Dataflow (ETL/stream) as core managed components of their storage/compute plane in GCP.
Single source of truth for metrics: Airbnb built Minerva to standardize metric definitions and serving across warehouses so teams stop recomputing the “same” KPI differently. They report thousands of standardized metrics served from one platform.
Product call-outs
Invest early in a catalog + ownership (e.g., DataHub/Atlas). Lack of discoverability, not raw storage price, is what slows product velocity.
What matters
Pick processing mode per SLA and cost: nightly batch for exploration & heavy joins; streaming for time-sensitive signals.
Declarative, tested transforms (SQL+lineage) to control sprawl.
Windowing, dedupe, and exactly-once semantics where required.
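Windowing and dedupe can be illustrated without a stream engine. The sketch below counts events per key in fixed (tumbling) windows and drops duplicates by `event_id` first. It is a simplified, in-memory stand-in: the `tumbling_counts` function and the event field names are hypothetical, and it ignores watermarks, late data, and state expiry, which engines such as Flink or Dataflow handle for you.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def tumbling_counts(events: Iterable[dict], window_ms: int) -> Dict[Tuple[int, str], int]:
    """Count deduplicated events per (window_start_ms, key) using fixed-size windows."""
    seen: Set[str] = set()
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate delivery; count each event_id once
        seen.add(event["event_id"])
        # Align the timestamp down to the start of its window.
        window_start = (event["ts_ms"] // window_ms) * window_ms
        counts[(window_start, event["key"])] += 1
    return dict(counts)


events = [
    {"event_id": "a", "key": "shop-1", "ts_ms": 1_000},
    {"event_id": "b", "key": "shop-1", "ts_ms": 59_000},
    {"event_id": "b", "key": "shop-1", "ts_ms": 59_000},  # redelivered duplicate
    {"event_id": "c", "key": "shop-1", "ts_ms": 61_000},
]
result = tumbling_counts(events, window_ms=60_000)
print(result)
```

Dedupe by ID gives effectively-once counting for this aggregate; true exactly-once output additionally requires the sink write and the offset commit to succeed or fail together.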
Concrete patterns & examples
Company-wide stream processing: Netflix’s Keystone pairs Kafka with stream engines (e.g., Flink in later evolutions) to power real-time ETL and alerting across trillions of events. (Multiple public posts discuss Kafka inside Keystone and evolution toward stream ETL.)
Retail & payments streaming: Shopify details a Flink pipeline that filters, cleans, enriches, and republishes sales/trending data, mixing streaming topics with periodic batch enrichment files.
Managed stream/batch convergence: Spotify relies on Dataflow for both real-time and historical processing with Pub/Sub for transport and BigQuery as the analytical sink, illustrating a managed version of the same pattern.
Product call-outs |