
Data Engineer
Kafka · Spark · Python · AWS · Azure · GCP
I build production-grade streaming pipelines for real-time fraud detection, CDC replication, and ML feature stores on AWS, Azure, and GCP.
About Me
I am a Data Engineer with expertise in building real-time streaming pipelines and data platforms. My focus is on creating reliable, scalable systems that handle millions of events while maintaining data quality and governance.
I specialize in Apache Kafka, Spark Structured Streaming, and cloud-native services across AWS, Azure, and GCP. I believe in designing for failure modes first, enforcing data contracts, and always considering the downstream consumer.
Education
Coursework: Distributed Systems, Database Management Systems, Big Data Analytics, Data Mining and Warehousing
Coursework: Data Structures and Algorithms, Database Management Systems, Operating Systems, Computer Networks
Open Source Contributions
Apache Airflow
The most popular open-source workflow orchestration platform, used by thousands of companies to programmatically author, schedule, and monitor data pipelines.
Notable Contributions
Deferrable Execution Mode for SFTPOperator
Contributed a deferrable execution mode to Apache Airflow's SFTPOperator, which reduces worker slot occupancy by 95% and enables 10x concurrent file transfer capacity without infrastructure scaling. Architected an async I/O bridge using asyncio.to_thread() to integrate the synchronous paramiko SSH library with Airflow's Triggerer event loop, establishing a reusable pattern for sync-to-async provider conversions. Implemented SSH connection pooling across polling cycles, byte-offset resume for interrupted transfers, and real-time metrics via XCom.
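The sync-to-async bridge described above can be sketched roughly as follows. This is an illustration of the asyncio.to_thread() pattern only, not the code from the contribution: blocking_stat is a hypothetical stand-in for a synchronous paramiko call, and the file paths are made up.

```python
import asyncio
import time


def blocking_stat(path: str) -> dict:
    """Hypothetical stand-in for a synchronous SFTP call (e.g. made via paramiko)."""
    time.sleep(0.05)  # simulate blocking network I/O
    return {"path": path, "size": len(path)}


async def poll_remote_file(path: str) -> dict:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so the event loop stays free to serve other coroutines.
    return await asyncio.to_thread(blocking_stat, path)


async def poll_many(paths: list[str]) -> list[dict]:
    # Run all polls concurrently; gather returns results in input order.
    return await asyncio.gather(*(poll_remote_file(p) for p in paths))


results = asyncio.run(poll_many(["/in/a.csv", "/in/bb.csv"]))
```

The blocking function is left untouched, which is what makes the pattern reusable: only the thin async wrapper needs to know about the event loop.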
Projects
Production-grade streaming pipelines built with Kafka, Spark, and cloud-native services. Each project demonstrates real-world patterns for reliability, scalability, and observability.
Technical Skills
Certifications
How I Think as a Data Engineer
Engineering is about making decisions under constraints. Here are the principles that guide my work.
- Network partition? Kafka handles it.
- Process crash? Checkpoint handles it.
- Duplicate message? Idempotent write handles it.
- Schema change? Schema registry handles it.
A pipeline that works under normal conditions is a prototype. A pipeline that recovers correctly from failure is a production system.
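Of the failure modes above, the idempotent write is the easiest to show in a few lines. This is a minimal sketch under stated assumptions: IdempotentSink is a hypothetical in-memory stand-in for a keyed store, and event_id is an assumed unique key. In a real pipeline the same idea would typically be a keyed upsert in the target store.

```python
class IdempotentSink:
    """Hypothetical in-memory stand-in for a keyed store."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def write(self, event: dict) -> bool:
        """Write by event_id. Returns True only if the row was newly written."""
        key = event["event_id"]
        if key in self.rows:
            return False  # duplicate delivery: stored state is unchanged
        self.rows[key] = event
        return True


sink = IdempotentSink()
events = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 25},
    {"event_id": "e1", "amount": 10},  # redelivered after a simulated retry
]
written = [sink.write(e) for e in events]
```

Replaying the same batch any number of times leaves the sink with the same two rows, which is what lets an at-least-once source be used safely.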
- What schema does this topic expect?
- What guarantees does this stage make to the next?
- What happens when this contract is violated?
If every stage enforces its contract, the whole system is reliable. If any stage is loose, the whole system is fragile.
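A contract check at a stage boundary might look something like the sketch below. The PAYMENT_CONTRACT fields and the ContractViolation type are hypothetical, chosen only for illustration. A production setup would more likely rely on a schema registry than on hand-written checks.

```python
from dataclasses import dataclass

# Hypothetical contract for a payments topic: field name -> required type.
PAYMENT_CONTRACT = {"event_id": str, "amount": float, "currency": str}


@dataclass
class ContractViolation:
    field: str
    reason: str


def validate(record: dict, contract: dict) -> list[ContractViolation]:
    """Return every violation found; an empty list means the record conforms."""
    violations = []
    for field, expected in contract.items():
        if field not in record:
            violations.append(ContractViolation(field, "missing"))
        elif not isinstance(record[field], expected):
            violations.append(
                ContractViolation(field, f"expected {expected.__name__}")
            )
    return violations
```

Returning all violations rather than failing on the first one answers the third question directly: the caller decides whether to reject, quarantine, or alert.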
- ML model needs: low latency, no nulls, consistent schema.
- BI dashboard needs: pre-aggregated, partitioned, fast queries.
- Compliance audit needs: immutable, timestamped, queryable history.
The consumer's needs determine the pipeline's design, not the other way around.
- Exactly-once is more correct but slower than at-least-once.
- Event-time is more accurate but more complex than processing-time.
- Normalised schema is cleaner but slower to query than denormalised.
Every choice is a trade-off. My job is to understand the trade-off, make the right call for this context, and document why.
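The event-time versus processing-time trade-off can be made concrete with a toy example. The timestamps and the 60-second tumbling window below are assumptions for illustration, not figures from any of the projects.

```python
from collections import Counter

WINDOW_SECONDS = 60  # assumed tumbling window size


def window_start(ts: int) -> int:
    """Start of the tumbling window that contains timestamp ts (in seconds)."""
    return ts - (ts % WINDOW_SECONDS)


# (event_time, arrival_time) pairs; the last event arrives 70 s late.
events = [(5, 6), (20, 21), (50, 120)]

# Counting by event time puts the late event in the window where it happened.
by_event_time = Counter(window_start(e) for e, _ in events)

# Counting by arrival time is simpler but shifts the late event to a later window.
by_processing_time = Counter(window_start(a) for _, a in events)
```

Event time gives the accurate count for the first window, but only if the pipeline is willing to wait for late data and hold state for it. That waiting is the extra complexity.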
- How much data am I storing? Do I need all of it?
- How often am I running this query? Can I cache it?
- Am I using the right instance type for this workload?
Correctness and cost efficiency are not opposites. Good engineering achieves both.
How I Built These Projects
I didn't start with the code. I started with the problem.
- P1: Fraud detected 4 hours late = money gone. That's real money.
- P2: Bad data reaching dashboards = wrong decisions. That's real business impact.
- P4: Database changes not captured = analytics 24 hours stale. That breaks real systems.
If I can't explain the business problem in two sentences, I don't start building.
- What happens if this component crashes?
- What happens if this component is slow?
- What happens if data arrives out of order?
- What happens if the schema changes?
Questions I answer before writing line 1.
- Dead letter queues before the main consumer.
- Checkpoint configuration before the processing logic.
- Schema validation before the business logic.
If the failure path works, the happy path is easy.
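Building the failure path first can be as small as the routing function below. It is a sketch under assumptions: messages are JSON strings, event_id is a hypothetical required field, and the dead letter queue is a plain list rather than a real topic.

```python
import json


def route(raw_messages: list[str]) -> tuple[list[dict], list[dict]]:
    """Split raw messages into parsed records and dead-letter entries."""
    good, dead_letters = [], []
    for raw in raw_messages:
        try:
            record = json.loads(raw)
            if not isinstance(record, dict) or "event_id" not in record:
                raise ValueError("missing event_id")
            good.append(record)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            # Keep the original payload and the reason so it can be replayed later.
            dead_letters.append({"payload": raw, "error": str(exc)})
    return good, dead_letters
```

With this in place, the main consumer only ever sees records from the first list, and one bad message cannot stall the partition.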
- Event count per second (not per day - per second)
- Consumer lag (are we keeping up or falling behind?)
- Checkpoint age (when did we last save state?)
- Error rate (what % of events are failing?)
Observable systems are debuggable systems.
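Two of these metrics reduce to simple arithmetic, sketched here. The function names and the offset dictionaries (partition number to offset) are assumptions for illustration. In practice the offsets would come from the broker or a monitoring tool.

```python
def consumer_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> int:
    """Total lag across partitions: latest offset minus committed offset."""
    # A partition with no committed offset is treated as starting from 0.
    return sum(end_offsets[p] - committed.get(p, 0) for p in end_offsets)


def error_rate(failed: int, total: int) -> float:
    """Share of failed events as a percentage; 0.0 when nothing was processed."""
    return 0.0 if total == 0 else 100.0 * failed / total
```

For example, with end offsets of 100 and 250 on two partitions and committed offsets of 90 and 250, the total lag is 10 messages. A lag that keeps growing between checks means the consumer is falling behind.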
- "Why Kafka over Kinesis?" - documented.
- "Why Parquet over Avro?" - documented.
- "Why session windows over sliding windows?" - documented.
Future me (and future teammates) will thank present me.
Get in Touch
Interested in collaborating or have a data engineering challenge? Let's connect.