
Data Engineer
Kafka | Spark | Python | Fresher
I build production-grade streaming pipelines. Real-time fraud detection, CDC replication, ML feature stores. AWS, Azure, GCP.
About Me
I'm a Data Engineer with expertise in building real-time streaming pipelines and data platforms. My focus is on creating reliable, scalable systems that handle millions of events while maintaining data quality and governance.
I specialize in Apache Kafka, Spark Structured Streaming, and cloud-native services across AWS, Azure, and GCP. I believe in designing for failure modes first, enforcing data contracts, and always considering the downstream consumer.
Education
Coursework: Distributed Systems, Database Management Systems, Big Data Analytics, Data Mining and Warehousing
Coursework: Data Structures and Algorithms, Database Management Systems, Operating Systems, Computer Networks
Projects
Production-grade streaming pipelines built with Kafka, Spark, and cloud-native services. Each project demonstrates real-world patterns for reliability, scalability, and observability.
Technical Skills
Certifications
How I Think as a Data Engineer
Engineering is about making decisions under constraints. Here are the principles that guide my work.
- Network partition? Kafka handles it.
- Process crash? Checkpoint handles it.
- Duplicate message? Idempotent write handles it.
- Schema change? Schema registry handles it.
A pipeline that works under normal conditions is a prototype. A pipeline that recovers correctly from failure is a production system.
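The "duplicate message? idempotent write handles it" point can be sketched in plain Python. This is a minimal illustration, not code from the projects: an in-memory dict stands in for a sink with a unique-key constraint, and the event shape is hypothetical.

```python
# Sketch of an idempotent sink: replaying the same event (as happens
# after a crash-and-recover under at-least-once delivery) must not
# duplicate the write. A dict keyed by event ID stands in for a
# database upsert on a unique key.

store = {}  # event_id -> payload (stands in for the sink table)

def idempotent_write(event: dict) -> bool:
    """Write the event; return True if new, False if a replay."""
    event_id = event["id"]
    if event_id in store:
        return False  # duplicate delivery: safe no-op
    store[event_id] = event
    return True

batch = [{"id": "e1", "amount": 10}, {"id": "e2", "amount": 20}]
for e in batch:
    idempotent_write(e)

# Process crashes before committing offsets; the broker redelivers e2:
assert idempotent_write({"id": "e2", "amount": 20}) is False
assert len(store) == 2  # still exactly two rows
```

The same shape in production is usually a database upsert or a merge keyed on the event ID, which turns at-least-once delivery into effectively-once results.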
- What schema does this topic expect?
- What guarantees does this stage make to the next?
- What happens when this contract is violated?
If every stage enforces its contract, the whole system is reliable. If any stage is loose, the whole system is fragile.
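A stage-level contract check can be sketched like this. The schema and field names below are hypothetical examples, not the actual topic schemas from the projects; the point is that a violated contract fails loudly at the boundary rather than silently downstream.

```python
# Sketch of a per-stage data contract: validate every incoming event
# against the declared schema and raise on violation, so bad data is
# rejected at the stage boundary instead of corrupting the next stage.

CONTRACT = {"event_id": str, "user_id": str, "amount": float}

class ContractViolation(ValueError):
    pass

def enforce_contract(event: dict) -> dict:
    for field, ftype in CONTRACT.items():
        if field not in event:
            raise ContractViolation(f"missing field: {field}")
        if not isinstance(event[field], ftype):
            raise ContractViolation(
                f"{field}: expected {ftype.__name__}, "
                f"got {type(event[field]).__name__}")
    return event

good = {"event_id": "e1", "user_id": "u1", "amount": 9.99}
assert enforce_contract(good) is good

try:
    enforce_contract({"event_id": "e2", "user_id": "u2"})  # no amount
    raised = False
except ContractViolation as exc:
    raised = "amount" in str(exc)
assert raised
```

In a real pipeline the same role is played by a schema registry with compatibility rules; the hand-rolled check above is just the principle made concrete.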
- ML model needs: low latency, no nulls, consistent schema.
- BI dashboard needs: pre-aggregated, partitioned, fast queries.
- Compliance audit needs: immutable, timestamped, queryable history.
The consumer's needs determine the pipeline's design, not the other way around.
- Exactly-once is more correct but slower than at-least-once.
- Event-time is more accurate but more complex than processing-time.
- Normalised schema is cleaner but slower to query than denormalised.
Every choice is a trade-off. My job is to understand the trade-off, make the right call for this context, and document why.
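The event-time vs processing-time trade-off can be made concrete with a small sketch. Window size, timestamps, and field names are illustrative; a real engine would pair this with a watermark to bound how long windows stay open.

```python
# Sketch of event-time windowing: events are bucketed into 60-second
# tumbling windows by the timestamp *inside* the event, so a late
# arrival still counts toward the window where it actually happened.
# The cost is extra complexity: state must be kept open for stragglers.

from collections import defaultdict

WINDOW = 60  # window length in seconds

def window_start(event_ts: int) -> int:
    """Start of the tumbling window this event-time belongs to."""
    return event_ts - (event_ts % WINDOW)

counts = defaultdict(int)
events = [
    {"ts": 100, "user": "a"},
    {"ts": 130, "user": "b"},
    {"ts": 95,  "user": "c"},  # arrives last but belongs to window 60-119
]
for e in events:
    counts[window_start(e["ts"])] += 1

assert counts == {60: 2, 120: 1}
```

Under processing-time windowing, the late event would have been counted in whatever window was open when it arrived, which is simpler but less accurate.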
- How much data am I storing? Do I need all of it?
- How often am I running this query? Can I cache it?
- Am I using the right instance type for this workload?
Correctness and cost efficiency are not opposites. Good engineering achieves both.
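The "how much data am I storing?" question is a back-of-envelope calculation. All figures below are illustrative, not numbers from the projects:

```python
# Sketch of a topic storage estimate: how many GB does a topic
# accumulate over its retention window? Inputs are hypothetical.

def topic_storage_gb(events_per_sec: int, avg_bytes: int,
                     retention_days: int, replication: int = 3) -> float:
    """Raw bytes retained across all replicas, expressed in GB."""
    total = (events_per_sec * avg_bytes * 86_400
             * retention_days * replication)
    return total / 1e9

# 2,000 events/s at ~500 bytes each, 7-day retention, 3x replication:
gb = topic_storage_gb(events_per_sec=2_000, avg_bytes=500,
                      retention_days=7)
assert round(gb) == 1814  # ~1.8 TB on disk for one topic
```

Running this estimate before provisioning is often what reveals that a shorter retention window or compacted topic answers the "do I need all of it?" question.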
How I Built These Projects
I didn't start with the code. I started with the problem.
- P1: Fraud detected 4 hours late = money gone. That's real money.
- P2: Bad data reaching dashboards = wrong decisions. That's real business impact.
- P4: Database changes not captured = analytics 24 hours stale. That breaks real systems.
If I can't explain the business problem in two sentences, I don't start building.
- What happens if this component crashes?
- What happens if this component is slow?
- What happens if data arrives out of order?
- What happens if the schema changes?
Questions I answer before writing line 1.
- Dead letter queues before the main consumer.
- Checkpoint configuration before the processing logic.
- Schema validation before the business logic.
If the failure path works, the happy path is easy.
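The "dead letter queue before the main consumer" pattern looks like this in miniature. The lists stand in for Kafka topics and the validation rule is a hypothetical example; the shape of the loop is what matters.

```python
# Sketch of failure-path-first consumption: any event that fails
# parsing or validation is routed to a dead letter queue with its
# error, so one poison message can never stall the whole pipeline.

import json

main_out, dead_letters = [], []  # stand-ins for output / DLQ topics

def consume(raw: bytes) -> None:
    try:
        event = json.loads(raw)
        if "event_id" not in event:
            raise ValueError("missing event_id")
        main_out.append(event)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        dead_letters.append({"raw": raw, "error": str(exc)})

for msg in [b'{"event_id": "e1"}', b'not json', b'{"x": 1}']:
    consume(msg)

assert len(main_out) == 1
assert len(dead_letters) == 2  # both failures captured, loop never died
```

Because the DLQ keeps the raw bytes and the error, failed events can be inspected and replayed after the bug is fixed instead of being lost.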
- Event count per second (not per day - per second)
- Consumer lag (are we keeping up or falling behind?)
- Checkpoint age (when did we last save state?)
- Error rate (what % of events are failing?)
Observable systems are debuggable systems.
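The consumer-lag metric above reduces to simple arithmetic per partition: the broker's log-end offset minus the group's committed offset. The offsets below are made up; in production they come from the Kafka admin API or a metrics exporter.

```python
# Sketch of the consumer-lag metric: lag per partition is how far the
# consumer group's committed position trails the newest offset on the
# broker. Total lag growing over time means the pipeline is falling
# behind its input.

def consumer_lag(log_end: dict, committed: dict) -> dict:
    """Per-partition lag; an uncommitted partition counts from 0."""
    return {p: log_end[p] - committed.get(p, 0) for p in log_end}

log_end_offsets = {0: 1_500, 1: 1_480, 2: 1_510}    # newest per partition
committed_offsets = {0: 1_500, 1: 1_200, 2: 1_505}  # group's position

lag = consumer_lag(log_end_offsets, committed_offsets)
assert lag == {0: 0, 1: 280, 2: 5}
assert sum(lag.values()) == 285  # total lag across the topic
```

Alerting on total lag (and on its trend) is usually the first signal that a consumer needs scaling or that a partition is stuck.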
- "Why Kafka over Kinesis?" - documented.
- "Why Parquet over Avro?" - documented.
- "Why session windows over sliding windows?" - documented.
Future me (and future teammates) will thank present me.
Get in Touch
Interested in collaborating or have a data engineering challenge? Let's connect.