AWS & Databricks Certified

Data Engineer

Available for Data Engineer roles in Bangalore · Open to Work

Kafka · Spark · Python · AWS · Azure · GCP

2M+ events/day across 6 pipelines

I build production-grade streaming pipelines: real-time fraud detection, CDC replication, and ML feature stores on AWS, Azure, and GCP.


About Me

I am a Data Engineer with expertise in building real-time streaming pipelines and data platforms. My focus is on creating reliable, scalable systems that handle millions of events while maintaining data quality and governance.

I specialize in Apache Kafka, Spark Structured Streaming, and cloud-native services across AWS, Azure, and GCP. I believe in designing for failure modes first, enforcing data contracts, and always considering the downstream consumer.

Education

Master of Computer Applications (MCA)

Himalayan Garhwal University, Uttarakhand, India

GPA: 8.0/10 · Jun 2021 – Apr 2023

Coursework: Distributed Systems, Database Management Systems, Big Data Analytics, Data Mining and Warehousing

Bachelor of Computer Applications (BCA)

Himalayan Garhwal University, Uttarakhand, India

GPA: 7.55/10 · Jun 2018 – Apr 2021

Coursework: Data Structures and Algorithms, Database Management Systems, Operating Systems, Computer Networks

Open Source Contributions

Apache Airflow

Open Source

The most popular open-source workflow orchestration platform used by thousands of companies to programmatically author, schedule, and monitor data pipelines.


Lines Changed: 191+

Notable Contributions

Deferrable Execution Mode for SFTPOperator

Feature · Open

PR #65480

Contributed a deferrable execution mode to Apache Airflow's SFTPOperator that reduces worker slot occupancy by 95% and enables 10x concurrent file transfer capacity without infrastructure scaling. Architected an async I/O bridge using asyncio.to_thread() to integrate the synchronous paramiko SSH library with Airflow's Triggerer event loop, establishing a reusable pattern for sync-to-async provider conversions. Implemented SSH connection pooling across polling cycles, byte-offset resume for interrupted transfers, and real-time metrics via XCom.
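The sync-to-async bridge can be illustrated with a minimal sketch. This is not the code from the pull request: blocking_stat is a hypothetical stand-in for a synchronous SFTP call, and no paramiko or Airflow APIs are used. It only shows how asyncio.to_thread() lets an event loop wait on blocking work without stalling other tasks.

```python
import asyncio
import time


def blocking_stat(path: str) -> int:
    """Hypothetical stand-in for a synchronous SFTP call."""
    time.sleep(0.05)  # simulate blocking network I/O
    return len(path)  # pretend this is the remote file size in bytes


async def poll_remote_size(path: str, polls: int = 3, interval: float = 0.01) -> list[int]:
    """Poll a blocking call from async code without stalling the event loop."""
    sizes = []
    for _ in range(polls):
        # to_thread runs the sync function in a worker thread and awaits its result
        sizes.append(await asyncio.to_thread(blocking_stat, path))
        await asyncio.sleep(interval)
    return sizes


async def main() -> list[list[int]]:
    # Both pollers share one event loop; neither blocks the other.
    return await asyncio.gather(
        poll_remote_size("/data/a.csv"),
        poll_remote_size("/data/report.csv"),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch each blocking call occupies a thread only for its own duration, while the waiting between polls happens on the event loop.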

Projects

Production-grade streaming pipelines built with Kafka, Spark, and cloud-native services. Each project demonstrates real-world patterns for reliability, scalability, and observability.

Real-Time Fraud Detection

End-to-end fraud detection streaming pipeline processing 120K+ events/day with confidence scoring and real-time pattern detection.

AWS

Real-Time Fraud Detection Architecture Diagram

120K+ events/day | AWS | Kafka | Spark | <60s latency

Apache Kafka · Spark Structured Streaming · PySpark · Python · Amazon S3 · Amazon Redshift · Parquet


Data Quality & Governance

Medallion architecture pipeline with 6 quality rules, processing 200K+ records/day through Bronze, Silver, Gold layers.

Azure

Data Quality & Governance Architecture Diagram

200K+ records/day | Azure | Delta Lake | <30s latency

Azure Event Hubs · Azure Databricks · Delta Lake · PySpark · Python · Azure Data Factory


Global Event Processing

Multi-region stateful streaming with event-time windowing, processing 300K+ events/day across 3 global regions.

GCP

Global Event Processing Architecture Diagram

300K+ events/day | GCP | Dataflow | 3 regions | <45s latency

Google Pub/Sub · Cloud Dataflow · BigQuery · Apache Beam · Python · SQL


Real-Time CDC Database Replication

Change Data Capture pipeline with Debezium, processing 500K+ change events/day with schema auto-detection.

AWS

Real-Time CDC Database Replication Architecture Diagram

500K+ events/day | AWS | Debezium | Kafka | <15s latency

PostgreSQL · Debezium · Kafka · Spark · Snowflake · AWS S3 · PySpark · Python


Real-Time ML Feature Store

Dual-store feature platform with online/offline consistency, serving 400K+ events/day with <10ms latency.

AWS

Real-Time ML Feature Store Architecture Diagram

400K+ events/day | AWS | Redis | Kafka | <10ms latency

Apache Kafka · Spark · Redis · AWS S3 · PySpark · Python · Great Expectations · PostgreSQL


Multi-Cloud Data Lakehouse

Cross-cloud Apache Iceberg lakehouse processing 600K+ events/day with unified SQL queries via Trino.

AWS + GCP

Multi-Cloud Data Lakehouse Architecture Diagram

600K+ events/day | AWS + GCP | Iceberg | Trino | <90s latency

Apache Kafka · Spark · Apache Iceberg · AWS S3 · BigQuery · Trino · dbt · PySpark · Python · SQL


Technical Skills

Languages & Query

Python · SQL · Window Functions · CTEs · Query Optimization · Indexing

Core Skills

Data Pipelines · ETL/ELT · Batch Processing · Streaming Processing · Data Warehousing

Databases

PostgreSQL · Redshift · BigQuery · Snowflake · ClickHouse

Big Data & Streaming

Apache Spark · PySpark · Apache Kafka · Apache Airflow · Apache Beam · Hadoop

Cloud - AWS

S3 · Redshift · Glue · Kinesis · Lambda · EMR

Cloud - GCP

Pub/Sub · Dataflow · BigQuery · Cloud Storage · Composer

Cloud - Azure

Event Hubs · Databricks · Data Factory · Synapse · ADLS

Data Engineering

Data Modelling · Data Quality · Data Governance · Medallion Architecture · Delta Lake

Tools & DevOps

OpenTelemetry · Structured Logging · Git · Linux · CI/CD · Docker

Certifications

AWS Certified Data Engineer - Associate

Amazon Web Services

AWS

Issued: Jan 2026

How I Think as a Data Engineer

Engineering is about making decisions under constraints. Here are the principles that guide my work.

Think in Failure Modes

For every system I design, I start by asking: what can go wrong here?

Network partition? Kafka handles it.
Process crash? Checkpoint handles it.
Duplicate message? Idempotent write handles it.
Schema change? Schema registry handles it.

A pipeline that works under normal conditions is a prototype. A pipeline that recovers correctly from failure is a production system.
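One of these failure modes, the duplicate message, can be sketched in a few lines. This is a minimal illustration, assuming every event carries a unique event_id; IdempotentSink is a hypothetical in-memory stand-in for a keyed store, where a real sink might use an upsert or MERGE instead.

```python
class IdempotentSink:
    """Hypothetical in-memory stand-in for a keyed store."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def write(self, event: dict) -> bool:
        """Upsert by event_id. Returns True only when the event is new."""
        key = event["event_id"]
        is_new = key not in self.rows
        self.rows[key] = event  # same key, same row: a replay overwrites, never appends
        return is_new


sink = IdempotentSink()
events = [
    {"event_id": "e1", "amount": 40},
    {"event_id": "e2", "amount": 75},
    {"event_id": "e1", "amount": 40},  # redelivered after a simulated crash
]
new_flags = [sink.write(e) for e in events]
print(new_flags, len(sink.rows))
```

Because the write is keyed, replaying the same event any number of times leaves the store in the same state, which is what makes at-least-once delivery safe to consume.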

Think in Data Contracts

Data flowing through a pipe is easy to visualise. But what matters is the contract at each stage.

What schema does this topic expect?
What guarantees does this stage make to the next?
What happens when this contract is violated?

If every stage enforces its contract, the whole system is reliable. If any stage is loose, the whole system is fragile.
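A contract check at a stage boundary might look like the sketch below. The field names and PAYMENT_CONTRACT are hypothetical, chosen only for illustration; a real pipeline would more likely lean on a schema registry or a validation library than on hand-rolled checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type_: type
    nullable: bool = False


# Hypothetical contract for a payments topic
PAYMENT_CONTRACT = (
    Field("event_id", str),
    Field("amount", float),
    Field("currency", str),
    Field("note", str, nullable=True),
)


def violations(record: dict, contract=PAYMENT_CONTRACT) -> list[str]:
    """Return readable contract violations; an empty list means the record conforms."""
    problems = []
    for field in contract:
        if field.name not in record:
            problems.append(f"missing field: {field.name}")
            continue
        value = record[field.name]
        if value is None:
            if not field.nullable:
                problems.append(f"null not allowed: {field.name}")
        elif not isinstance(value, field.type_):
            problems.append(f"wrong type for {field.name}: expected {field.type_.__name__}")
    return problems


good = {"event_id": "e1", "amount": 9.5, "currency": "INR", "note": None}
bad = {"event_id": "e1", "amount": "9.5", "currency": None, "note": None}
print(violations(good))
print(violations(bad))
```

Returning the list of violations, rather than raising on the first one, lets the stage decide what happens when the contract is broken: reject, quarantine, or alert.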

Think About Downstream First

Before I design a pipeline, I ask: who reads this data, and what do they need?

ML model needs: low latency, no nulls, consistent schema.
BI dashboard needs: pre-aggregated, partitioned, fast queries.
Compliance audit needs: immutable, timestamped, queryable history.

The consumer's needs determine the pipeline's design. Not the other way around.

Think in Trade-offs

There is no universally correct answer in data engineering.

Exactly-once is more correct but slower than at-least-once.
Event-time is more accurate but more complex than processing-time.
Normalised schema is cleaner but slower to query than denormalised.

Every choice is a trade-off. My job is to understand the trade-off, make the right call for this context, and document why.
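The event-time versus processing-time trade-off shows up even in a toy example. The sketch below uses made-up timestamps and a 60-second tumbling window. It ignores watermarks and allowed lateness, which a real streaming engine needs in order to decide how long to wait for late events.

```python
from collections import Counter

WINDOW_SECONDS = 60

# (event_time, arrival_time) in seconds; the third event arrives late.
events = [(5, 6), (50, 52), (58, 65), (70, 71)]


def window_start(ts: int) -> int:
    """Start of the tumbling window that contains ts."""
    return ts - ts % WINDOW_SECONDS


by_event_time = Counter(window_start(event_ts) for event_ts, _ in events)
by_processing_time = Counter(window_start(arrival_ts) for _, arrival_ts in events)

print(dict(by_event_time))
print(dict(by_processing_time))
```

The late event is counted in the window where it happened under event-time, but in the window where it arrived under processing-time. The extra accuracy is paid for with the state and lateness handling this sketch leaves out.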

Think About Cost Alongside Correctness

A pipeline that processes 600K events/day perfectly but costs $10,000/month is not a good pipeline for a small company.

How much data am I storing? Do I need all of it?
How often am I running this query? Can I cache it?
Am I using the right instance type for this workload?

Correctness and cost efficiency are not opposites. Good engineering achieves both.
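A quick unit-cost calculation makes the example figures concrete. This is a back-of-the-envelope sketch that assumes a 30-day month and uses the numbers quoted above; the function name is illustrative.

```python
def cost_per_million_events(monthly_cost_usd: float, events_per_day: int, days: int = 30) -> float:
    """Unit cost in USD per one million events, assuming a fixed-length month."""
    events_per_month = events_per_day * days
    return monthly_cost_usd / events_per_month * 1_000_000


unit_cost = cost_per_million_events(10_000, 600_000)
print(round(unit_cost, 2))
```

At 600K events/day and $10,000/month, that works out to roughly $556 per million events, a number that is easier to compare across designs than the monthly bill alone.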

How I Built These Projects

I didn't start with the code. I started with the problem.

Understand the Business Problem First

For every project, I asked: what is the actual cost of NOT having this pipeline?

P1 (Real-Time Fraud Detection): Fraud detected 4 hours late = money gone. That's real money.
P2 (Data Quality & Governance): Bad data reaching dashboards = wrong decisions. That's real business impact.
P4 (Real-Time CDC Database Replication): Database changes not captured = analytics 24 hours stale. That breaks real systems.

If I can't explain the business problem in two sentences, I don't start building.

Design on Paper Before Code

I draw the pipeline on paper first. Every component. Every arrow. Every failure point.

What happens if this component crashes?
What happens if this component is slow?
What happens if data arrives out of order?
What happens if the schema changes?

These are the questions I answer before writing the first line of code.

Build the Unhappy Path First

Most tutorials build the happy path. I build the failure path first.

Dead letter queues before the main consumer.
Checkpoint configuration before the processing logic.
Schema validation before the business logic.

If the failure path works, the happy path is easy.
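The dead letter queue idea can be sketched without any broker. In this minimal example, parse_amount is hypothetical business logic and the dead letter queue is a plain list. With Kafka it would typically be a separate topic.

```python
import json


def parse_amount(raw: str) -> float:
    """Hypothetical business logic: extract a positive amount from a JSON message."""
    payload = json.loads(raw)
    amount = float(payload["amount"])
    if amount <= 0:
        raise ValueError("amount must be positive")
    return amount


def consume(messages: list[str]) -> tuple[list[float], list[dict]]:
    """Process each message; route failures to a dead letter list instead of crashing."""
    processed, dead_letters = [], []
    for raw in messages:
        try:
            processed.append(parse_amount(raw))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
            # Keep the original payload and the reason so it can be replayed later
            dead_letters.append({"raw": raw, "error": type(exc).__name__})
    return processed, dead_letters


ok, dlq = consume(['{"amount": 12.5}', "not json", '{"amount": -1}', '{"total": 3}'])
print(ok)
print([d["error"] for d in dlq])
```

One bad message no longer stops the consumer, and every failure is kept with enough context to debug or replay it.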

Measure Everything

I don't trust that a pipeline works unless I can see it working.

Event count per second (not per day - per second)
Consumer lag (are we keeping up or falling behind?)
Checkpoint age (when did we last save state?)
Error rate (what % of events are failing?)

Observable systems are debuggable systems.
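The four signals above can be derived from a handful of raw readings. The sketch below is illustrative only: PipelineSnapshot and its field names are assumptions, and in practice the readings would come from the broker, the checkpoint store, and application counters.

```python
from dataclasses import dataclass


@dataclass
class PipelineSnapshot:
    """Hypothetical point-in-time readings; field names are illustrative."""
    latest_offset: int         # newest offset in the partition
    committed_offset: int      # last offset the consumer committed
    events_in_window: int      # events seen in the measurement window
    failed_in_window: int      # events that raised errors in the window
    window_seconds: float      # length of the measurement window
    last_checkpoint_ts: float  # epoch seconds of the last successful checkpoint
    now_ts: float              # epoch seconds when the snapshot was taken


def health(s: PipelineSnapshot) -> dict:
    """Derive the four health signals from one snapshot."""
    return {
        "events_per_second": s.events_in_window / s.window_seconds,
        "consumer_lag": s.latest_offset - s.committed_offset,
        "checkpoint_age_seconds": s.now_ts - s.last_checkpoint_ts,
        "error_rate_pct": 100 * s.failed_in_window / max(s.events_in_window, 1),
    }


snapshot = PipelineSnapshot(10_500, 10_200, 600, 3, 60.0, 1_000.0, 1_045.0)
metrics = health(snapshot)
print(metrics)
```

Each value maps to one question from the list: throughput, whether the consumer is keeping up, how stale the saved state is, and what share of events is failing.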

Document the Decisions

Code explains what. Architecture Decision Records (ADRs) explain why.

"Why Kafka over Kinesis?" - documented.
"Why Parquet over Avro?" - documented.
"Why session windows over sliding windows?" - documented.

Future me (and future teammates) will thank present me.

Get in Touch

Interested in collaborating or have a data engineering challenge? Let's connect.

Contact Information

sunildataengineer@outlook.com · +91 9380691205

Banashankari 3rd Stage, Bangalore, Karnataka

Social Links

github.com/sunildataengineer · linkedin.com/in/suniil-data-engineer