Sunil - Data Engineer
AWS & Databricks Certified

Data Engineer

Kafka | Spark | Python | Fresher

2M+ events/day across 6 pipelines

I build production-grade streaming pipelines. Real-time fraud detection, CDC replication, ML feature stores. AWS, Azure, GCP.

About Me

I am a Data Engineer with expertise in building real-time streaming pipelines and data platforms. My focus is on creating reliable, scalable systems that handle millions of events while maintaining data quality and governance.

I specialize in Apache Kafka, Spark Structured Streaming, and cloud-native services across AWS, Azure, and GCP. I believe in designing for failure modes first, enforcing data contracts, and always considering the downstream consumer.

Education

Master of Computer Applications (MCA)
Himalayan Garhwal University, Uttarakhand, India
GPA: 8.0/10 | Jun 2021 – Apr 2023

Coursework: Distributed Systems, Database Management Systems, Big Data Analytics, Data Mining and Warehousing

Bachelor of Computer Applications (BCA)
Himalayan Garhwal University, Uttarakhand, India
GPA: 7.55/10 | Jun 2018 – Apr 2021

Coursework: Data Structures and Algorithms, Database Management Systems, Operating Systems, Computer Networks

Projects

Production-grade streaming pipelines built with Kafka, Spark, and cloud-native services. Each project demonstrates real-world patterns for reliability, scalability, and observability.

Real-Time Fraud Detection
End-to-end fraud detection streaming pipeline processing 120K+ events/day with confidence scoring and real-time pattern detection.
AWS
Real-Time Fraud Detection Architecture Diagram
120K+ events/day | AWS | Kafka | Spark | <60s latency
Apache Kafka · Spark Structured Streaming · PySpark · Python · Amazon S3 · Amazon Redshift · Parquet
View Code
Data Quality & Governance
Medallion architecture pipeline with 6 quality rules, processing 200K+ records/day through Bronze, Silver, Gold layers.
Azure
Data Quality & Governance Architecture Diagram
200K+ records/day | Azure | Delta Lake | <30s latency
Azure Event Hubs · Azure Databricks · Delta Lake · PySpark · Python · Azure Data Factory
View Code
Global Event Processing
Multi-region stateful streaming with event-time windowing, processing 300K+ events/day across 3 global regions.
GCP
Global Event Processing Architecture Diagram
300K+ events/day | GCP | Dataflow | 3 regions | <45s latency
Google Pub/Sub · Cloud Dataflow · BigQuery · Apache Beam · Python · SQL
View Code
Real-Time CDC Database Replication
Change Data Capture pipeline with Debezium, processing 500K+ change events/day with schema auto-detection.
AWS
Real-Time CDC Database Replication Architecture Diagram
500K+ events/day | AWS | Debezium | Kafka | <15s latency
PostgreSQL · Debezium · Kafka · Spark · Snowflake · AWS S3 · PySpark · Python
View Code
Real-Time ML Feature Store
Dual-store feature platform with online/offline consistency, serving 400K+ events/day with <10ms latency.
AWS
Real-Time ML Feature Store Architecture Diagram
400K+ events/day | AWS | Redis | Kafka | <10ms latency
Apache Kafka · Spark · Redis · AWS S3 · PySpark · Python · Great Expectations · PostgreSQL
View Code
Multi-Cloud Data Lakehouse
Cross-cloud Apache Iceberg lakehouse processing 600K+ events/day with unified SQL queries via Trino.
AWS + GCP
Multi-Cloud Data Lakehouse Architecture Diagram
600K+ events/day | AWS + GCP | Iceberg | Trino | <90s latency
Apache Kafka · Spark · Apache Iceberg · AWS S3 · BigQuery · Trino · dbt · PySpark · Python · SQL
View Code

Technical Skills

Languages & Query
Python · SQL · Window Functions · CTEs · Query Optimization · Indexing
Core Skills
Data Pipelines · ETL/ELT · Batch Processing · Stream Processing · Data Warehousing
Databases
PostgreSQL · Redshift · BigQuery · Snowflake · ClickHouse
Big Data & Streaming
Apache Spark · PySpark · Apache Kafka · Apache Airflow · Apache Beam · Hadoop
Cloud - AWS
S3 · Redshift · Glue · Kinesis · Lambda · EMR
Cloud - GCP
Pub/Sub · Dataflow · BigQuery · Cloud Storage · Composer
Cloud - Azure
Event Hubs · Databricks · Data Factory · Synapse · ADLS
Data Engineering
Data Modelling · Data Quality · Data Governance · Medallion Architecture · Delta Lake
Tools & DevOps
OpenTelemetry · Structured Logging · Git · Linux · CI/CD · Docker

Certifications

AWS Certified Data Engineer - Associate
Amazon Web Services
AWS
Issued: Jan 2026
Databricks Certified Data Engineer Associate
Databricks
Issued: Feb 2026

How I Think as a Data Engineer

Engineering is about making decisions under constraints. Here are the principles that guide my work.

Think in Failure Modes
Every system I design, I start by asking: what can go wrong here?
  • Network partition? Kafka handles it.
  • Process crash? Checkpoint handles it.
  • Duplicate message? Idempotent write handles it.
  • Schema change? Schema registry handles it.

A pipeline that works under normal conditions is a prototype. A pipeline that recovers correctly from failure is a production system.
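The idempotent-write idea above can be sketched in a few lines of plain Python. This is a toy stand-in for a real keyed sink (an upsert into Redshift or Delta Lake, say); the class and event names are illustrative, not taken from the projects:

```python
class IdempotentSink:
    """Toy sink that makes replays safe: writing the same event id twice is a no-op."""

    def __init__(self):
        self._store = {}  # event_id -> payload, stands in for a keyed table

    def write(self, event_id, payload):
        if event_id in self._store:
            return False  # at-least-once replay after a crash: silently skip
        self._store[event_id] = payload
        return True


sink = IdempotentSink()
sink.write("evt-1", {"amount": 42.0})
sink.write("evt-1", {"amount": 42.0})  # duplicate delivery is absorbed, not double-counted
```

With a sink like this, the upstream only has to guarantee at-least-once delivery; correctness no longer depends on every retry being suppressed.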

Think in Data Contracts
Data flowing through a pipe is easy to visualise. But what matters is the contract at each stage.
  • What schema does this topic expect?
  • What guarantees does this stage make to the next?
  • What happens when this contract is violated?

If every stage enforces its contract, the whole system is reliable. If any stage is loose, the whole system is fragile.
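A minimal sketch of contract enforcement at a stage boundary, in plain Python. The field names and types here are made up for illustration; a real pipeline would back this with a schema registry or a tool like Great Expectations:

```python
# The contract this stage promises to its downstream: required fields and types.
CONTRACT = {"user_id": str, "amount": float, "event_ts": int}


def enforce_contract(event, contract=CONTRACT):
    """Return (event, None) if the event honours the contract,
    or (None, violations) so the caller can route it to a dead-letter path."""
    violations = [
        field
        for field, expected in contract.items()
        if field not in event or not isinstance(event[field], expected)
    ]
    return (event, None) if not violations else (None, violations)
```

The key design point: a violating event is never silently passed through or silently dropped; it is rejected with an explicit reason the operator can inspect.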

Think About Downstream First
Before I design a pipeline, I ask: who reads this data, and what do they need?
  • ML model needs: low latency, no nulls, consistent schema.
  • BI dashboard needs: pre-aggregated, partitioned, fast queries.
  • Compliance audit needs: immutable, timestamped, queryable history.

The consumer's needs determine the pipeline's design. Not the other way around.

Think in Trade-offs
There is no universally correct answer in data engineering.
  • Exactly-once is more correct but slower than at-least-once.
  • Event-time is more accurate but more complex than processing-time.
  • Normalised schema is cleaner but slower to query than denormalised.

Every choice is a trade-off. My job is to understand the trade-off, make the right call for this context, and document why.
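The exactly-once vs at-least-once trade-off often comes down to a handful of producer settings. As a sketch, here are the two stances expressed as Kafka producer configurations (the keys are standard Kafka producer settings; the values and the transactional id are illustrative):

```python
# At-least-once: cheaper and simpler, but retries can duplicate messages,
# so downstream consumers must deduplicate (e.g. with an idempotent sink).
at_least_once = {
    "acks": "all",                # wait for all in-sync replicas
    "retries": 5,                 # retries are what introduce duplicates
    "enable.idempotence": False,
}

# Exactly-once: the broker dedupes producer retries, and a transactional id
# allows atomic read-process-write, at the cost of throughput and complexity.
exactly_once = {
    "acks": "all",
    "enable.idempotence": True,
    "transactional.id": "pipeline-1",  # illustrative id
}
```

Neither config is "the right one"; the choice depends on whether the consumer can tolerate duplicates more cheaply than the producer can prevent them.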

Think About Cost Alongside Correctness
A pipeline that processes 600K events/day perfectly but costs $10,000/month is not a good pipeline for a small company.
  • How much data am I storing? Do I need all of it?
  • How often am I running this query? Can I cache it?
  • Am I using the right instance type for this workload?

Correctness and cost efficiency are not opposites. Good engineering achieves both.
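The cost questions above start with back-of-envelope arithmetic. A minimal sketch, where the event rate and the assumed 1 KB average event size are illustrative inputs, not measurements from the projects:

```python
def monthly_storage_gb(events_per_day, avg_event_bytes, days=30, replication=1):
    """Rough raw storage footprint of a stream; every input is an assumption."""
    return events_per_day * avg_event_bytes * days * replication / 1e9


# e.g. 600K events/day at an assumed 1 KB each, single copy, before compression:
gb = monthly_storage_gb(600_000, 1_000)  # 18.0 GB/month
```

Running this before building tells you whether a retention policy or a compression format is a nice-to-have or a requirement.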

How I Built These Projects

I didn't start with the code. I started with the problem.

1
Understand the Business Problem First
For every project, I asked: what is the actual cost of NOT having this pipeline?
  • P1: Fraud detected 4 hours late = money gone. That's real money.
  • P2: Bad data reaching dashboards = wrong decisions. That's real business impact.
  • P4: Database changes not captured = analytics 24 hours stale. That breaks real systems.

If I can't explain the business problem in two sentences, I don't start building.

2
Design on Paper Before Code
I draw the pipeline on paper first. Every component. Every arrow. Every failure point.
  • What happens if this component crashes?
  • What happens if this component is slow?
  • What happens if data arrives out of order?
  • What happens if the schema changes?

These are the questions I answer before writing the first line of code.

3
Build the Unhappy Path First
Most tutorials build the happy path. I build the failure path first.
  • Dead letter queues before the main consumer.
  • Checkpoint configuration before the processing logic.
  • Schema validation before the business logic.

If the failure path works, the happy path is easy.
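The failure-path-first pattern can be sketched in plain Python: the dead-letter route exists before any business logic does, and the handler is just a plug-in. The function and field names are illustrative:

```python
def process_with_dlq(events, handler):
    """Run handler over events; failures are captured in a dead-letter list,
    never dropped and never allowed to kill the whole batch."""
    processed, dead_letters = [], []
    for event in events:
        try:
            processed.append(handler(event))
        except Exception as exc:
            dead_letters.append({"event": event, "error": str(exc)})
    return processed, dead_letters


# Usage: a bad record lands in the DLQ with its error, the good one flows through.
good, dlq = process_with_dlq(
    [{"amount": "10"}, {"amount": "oops"}],
    lambda e: float(e["amount"]),
)
```

In a real pipeline the `dead_letters` list would be a Kafka topic or an S3 prefix, but the shape of the logic is the same.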

4
Measure Everything
I don't trust that a pipeline works unless I can see it working.
  • Event count per second (not per day - per second)
  • Consumer lag (are we keeping up or falling behind?)
  • Checkpoint age (when did we last save state?)
  • Error rate (what % of events are failing?)

Observable systems are debuggable systems.
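Consumer lag, the metric I watch first, is simple arithmetic once you have the offsets. A sketch in plain Python, with made-up partition numbers (a real deployment would read these from the Kafka admin API):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: how far the consumer group trails the head of the log.
    A partition missing from committed_offsets is treated as never consumed."""
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }


# e.g. partition 0 is caught up; partition 1 trails the log head by 150 events.
lag = consumer_lag({0: 1_000, 1: 2_000}, {0: 1_000, 1: 1_850})
```

A lag that is flat is healthy; a lag that grows monotonically means the pipeline is falling behind, long before anything visibly breaks.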

5
Document the Decisions
Code explains what. Architecture Decision Records (ADRs) explain why.
  • "Why Kafka over Kinesis?" - documented.
  • "Why Parquet over Avro?" - documented.
  • "Why session windows over sliding windows?" - documented.

Future me (and future teammates) will thank present me.

Get in Touch

Interested in collaborating or have a data engineering challenge? Let's connect.

Contact Information
sunildataengineer@outlook.com | +91 9380691205
Banashankari 3rd Stage, Bangalore, Karnataka