Big Data & Analytics
5 articles · ~3.5 hours · Data engineers, platform engineers
This path is designed for engineers who process, move, or analyze large volumes of data. It begins with the concurrency model that makes high-throughput data processing possible, moves into streaming architectures built for scale, covers the infrastructure that runs those workloads in production, and closes with the cryptographic controls needed to protect sensitive data pipelines.
Prerequisites
Before starting this path, you should be comfortable with:
- Basic programming in Java or a JVM language
- The concept of producer/consumer data flows
- Running services in a cloud or containerized environment (basic familiarity is enough)
You do not need prior experience with distributed systems or stream processing frameworks, though familiarity with one (Kafka, Flink, Spark Streaming) will make the streaming article more immediately applicable.
The Path
Step 1: Asynchronous Programming
Article: Async & Concurrency
Why this matters for big data
Big data pipelines are inherently I/O-bound: reading from object storage, writing to databases, calling external enrichment services. Synchronous, blocking code turns these I/O waits into wasted CPU time and thread exhaustion. Async programming is the foundation that allows a single process to keep many I/O operations in flight simultaneously — which is what gives data pipelines their throughput advantage over naive implementations.
Key concepts to focus on
- Non-blocking I/O and the event loop model
- CompletableFuture chaining and parallel fan-out patterns
- Back-pressure: how downstream capacity limits propagate upstream
- Thread pool sizing for I/O-bound vs. CPU-bound workloads
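The fan-out pattern from the list above can be sketched in plain Java. This is a minimal illustration, not code from the article: `enrich` is a hypothetical stand-in for a real blocking I/O call (an HTTP lookup, a database read), and the pool size of 32 is an arbitrary example of sizing generously for I/O-bound work.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FanOutExample {
    // Hypothetical enrichment call standing in for a real blocking I/O operation.
    static String enrich(String record) {
        return record + ":enriched";
    }

    // Fan out: start all calls concurrently; fan in: join results in input order.
    static List<String> enrichAll(List<String> batch) {
        // I/O-bound work tolerates a pool much larger than the core count,
        // because threads spend most of their time waiting, not computing.
        ExecutorService pool = Executors.newFixedThreadPool(32);
        try {
            List<CompletableFuture<String>> futures = batch.stream()
                    .map(r -> CompletableFuture.supplyAsync(() -> enrich(r), pool))
                    .toList();
            // allOf completes when every future does; join then collects results.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return futures.stream().map(CompletableFuture::join).toList();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(enrichAll(List.of("a", "b", "c")));
        // [a:enriched, b:enriched, c:enriched]
    }
}
```

Because the futures are collected into a list before joining, results come back in input order even though the calls complete in any order.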
What to learn next
Async handles individual I/O operations efficiently. Step 2 addresses how to move and process entire streams of records continuously at high throughput.
Step 2: High-Performance Streaming
Article: High-Performance Streaming
Why this matters for big data
Batch jobs have a fundamental limitation: they process data after it has accumulated, introducing latency. High-performance streaming lets you process records as they arrive, enabling real-time analytics, fraud detection, and live dashboards. This article covers the architectural patterns and engineering trade-offs that distinguish streaming systems that hold up at scale from those that collapse under load.
Key concepts to focus on
- Event-time vs. processing-time semantics and why the distinction matters for correctness
- Windowing strategies: tumbling, sliding, and session windows
- Exactly-once processing guarantees and their cost
- Back-pressure propagation in streaming topologies
- Partitioning and parallelism: how to scale a streaming job horizontally
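The first two bullets above — event-time semantics and windowing — can be made concrete with a tiny sketch. This is an illustrative assignment of events to tumbling windows keyed by event time, not an excerpt from any streaming framework; real engines add watermarks, state stores, and triggers on top of this core idea.

```java
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindow {
    // Maps an event timestamp (millis) to the start of its tumbling window.
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - (eventTimeMillis % windowSizeMillis);
    }

    // Counts events per window, keyed by EVENT time rather than arrival time,
    // so out-of-order records still land in the window they belong to.
    static Map<Long, Long> countPerWindow(long[] eventTimes, long windowSizeMillis) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long t : eventTimes) {
            counts.merge(windowStart(t, windowSizeMillis), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The 3s event arrives last (out of order) but is still counted
        // in the first 10-second window because we key by event time.
        long[] events = {1_000, 9_500, 12_000, 3_000};
        System.out.println(countPerWindow(events, 10_000));
        // {0=3, 10000=1}
    }
}
```

Had the code keyed by processing time instead, the late 3s event would have been counted in whatever window was open when it arrived — which is exactly the correctness gap the event-time vs. processing-time distinction describes.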
What to learn next
You now understand how to design streaming workloads. Step 3 covers how to store their output so that analytical queries over it stay fast.
Step 3: Columnar Data Stores
Article: Columnar Data Stores
Why this matters for big data
Analytics queries — aggregations, time-series scans, wide-column reads — are fundamentally different from transactional queries. Column-oriented storage is designed for exactly these access patterns: it reads only the columns a query needs, applies compression efficiently within each column (since values are homogeneous), and enables vectorised execution. Understanding when and why to reach for a columnar store — rather than a row-oriented database or a streaming system — is a core architectural skill for data engineers.
Key concepts to focus on
- Row-oriented vs. column-oriented storage and the query shapes each favours
- Columnar compression techniques: run-length encoding, dictionary encoding, delta encoding
- Predicate pushdown and projection pushdown: how the query engine avoids reading irrelevant data
- Partitioning and clustering: laying out data to minimise scan ranges
- When to choose a columnar store vs. a streaming system vs. a key-value store
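Run-length encoding, the first compression technique in the list above, is simple enough to sketch directly. This is an illustrative toy, not how any particular engine implements it: a sorted or low-cardinality column collapses into (value, run length) pairs that a query engine can scan and aggregate without fully decompressing.

```java
import java.util.ArrayList;
import java.util.List;

public class RunLengthEncoding {
    // A run of identical adjacent values in a column.
    record Run(String value, int length) {}

    // Collapses adjacent duplicates into runs. Columns sorted on this field
    // (or naturally low-cardinality ones) compress dramatically.
    static List<Run> encode(List<String> column) {
        List<Run> runs = new ArrayList<>();
        for (String v : column) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1).value().equals(v)) {
                Run last = runs.remove(runs.size() - 1);
                runs.add(new Run(v, last.length() + 1));
            } else {
                runs.add(new Run(v, 1));
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // A country column from a sorted table: 6 values become 3 runs.
        List<String> country = List.of("DE", "DE", "DE", "FR", "FR", "US");
        System.out.println(encode(country));
    }
}
```

The same shape explains predicate pushdown: to evaluate `country = 'FR'`, the engine can skip entire runs by comparing one value per run instead of one value per row.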
What to learn next
You now understand the storage layer for analytical workloads. Step 4 covers how to run all of this in production at scale.
Step 4: Serverless & Containers
Article: Serverless & Containers
Why this matters for big data
Data workloads vary enormously in resource demand — a batch job that runs once a day and a streaming pipeline that runs continuously have very different infrastructure needs. Containers give you reproducible, portable environments for long-running streaming jobs. Serverless functions are well-suited to event-driven ingestion and small transformation tasks. Knowing when to use each, and how to configure both for data workloads, is a core production skill.
Key concepts to focus on
- Container resource limits (CPU, memory) and their effect on JVM-based data runtimes
- Stateless design requirements and how they interact with streaming state stores
- Serverless triggers: S3 events, message queue arrivals, scheduled invocations
- Auto-scaling policies for streaming consumers vs. batch processors
- Cost modeling: always-on containers vs. pay-per-invocation serverless
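The cost-modeling bullet above comes down to a break-even calculation. The sketch below uses made-up prices chosen only to show the shape of the comparison — the rates, the 0.5 GB memory size, and the 200 ms duration are all assumptions, not any provider's actual pricing.

```java
public class CostModel {
    // Illustrative rates only — NOT real provider pricing.
    static final double CONTAINER_PER_HOUR = 0.05;          // always-on container
    static final double SERVERLESS_PER_INVOCATION = 0.0000002;
    static final double SERVERLESS_PER_GB_SECOND = 0.0000166667;

    // An always-on container bills for every hour, busy or idle.
    static double containerMonthlyCost() {
        return CONTAINER_PER_HOUR * 24 * 30;
    }

    // Serverless bills per invocation plus memory-seconds actually used.
    static double serverlessMonthlyCost(long invocations, double memoryGb,
                                        double secondsPerInvocation) {
        return invocations * (SERVERLESS_PER_INVOCATION
                + SERVERLESS_PER_GB_SECOND * memoryGb * secondsPerInvocation);
    }

    public static void main(String[] args) {
        System.out.printf("Container:  $%.2f/month%n", containerMonthlyCost());
        // A sparse event-driven ingest: 100k invocations, 0.5 GB, 200 ms each.
        System.out.printf("Serverless: $%.2f/month%n",
                serverlessMonthlyCost(100_000, 0.5, 0.2));
    }
}
```

Under these assumed rates the sparse workload is far cheaper serverless, while a continuously busy streaming consumer inverts the comparison — which is why always-on containers remain the default for long-running jobs.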
What to learn next
Your infrastructure is configured and scaling. The final step addresses how to protect the data flowing through it.
Step 5: Applied Cryptography
Article: Cryptography
Why this matters for big data
Data pipelines frequently handle PII, financial records, health data, and other regulated information. Cryptography provides the technical controls that satisfy data protection requirements: encryption at rest for stored datasets, encryption in transit for data moving between pipeline stages, and integrity verification to detect tampering. Understanding these mechanisms lets you design compliant pipelines rather than retrofitting security after the fact.
Key concepts to focus on
- Envelope encryption: how managed key services (AWS KMS, GCP Cloud KMS) protect data at rest
- TLS for inter-service communication within a pipeline
- Field-level encryption for sensitive columns in data warehouses
- Key rotation strategies that do not require re-encrypting entire datasets
- HMAC and checksums for data integrity verification across pipeline stages
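The last bullet — HMAC for integrity across pipeline stages — fits in a few lines of standard-library Java. This is a minimal sketch: the producer tags each record, the consumer recomputes and compares. The key here is a demo value; in practice it would come from a secrets manager, and production code should compare tags in constant time.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

public class RecordIntegrity {
    // Computes an HMAC-SHA256 tag over a record so a downstream stage
    // can detect tampering or corruption in transit.
    static String tag(byte[] key, String record) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return HexFormat.of().formatHex(
                mac.doFinal(record.getBytes(StandardCharsets.UTF_8)));
    }

    // Recomputes the tag at the consumer and compares.
    // Production code should use a constant-time comparison instead of equals.
    static boolean verify(byte[] key, String record, String expectedTag) throws Exception {
        return tag(key, record).equals(expectedTag);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "demo-key-do-not-use-in-prod".getBytes(StandardCharsets.UTF_8);
        String record = "{\"user\":42,\"amount\":99.50}";

        String t = tag(key, record);                      // computed by the producer
        System.out.println(verify(key, record, t));       // true: record intact
        System.out.println(verify(key, record + "!", t)); // false: record tampered
    }
}
```

Unlike a plain checksum, the HMAC cannot be recomputed by an attacker who modifies the record, because producing a valid tag requires the shared key.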
After This Path
Having completed this sequence, you will be able to:
- Write async data processing code that sustains high throughput without thread exhaustion
- Design streaming pipelines that are correct under out-of-order and late-arriving data
- Choose the right storage format (columnar vs. row-oriented vs. streaming) for each access pattern
- Choose and configure containerized or serverless infrastructure for your data workloads
- Apply encryption and integrity controls to meet data protection requirements
A natural next step is the Building Scalable APIs path, which covers how to expose the results of your data processing through well-designed, secured API endpoints.