Kafka Tier 3 Support Engineer (Platform & Operations)
Tata Consultancy Services
2 - 5 years
Bengaluru
Posted: 20/04/2026
Job Description
1. Tier 3 Incident Management & Escalation Support
Act as the highest technical escalation point for Kafka production incidents (Sev1 / Sev2).
Lead deep troubleshooting across:
Broker instability, controller elections, ISR shrinkage
Under-replicated partitions and leader imbalance
Producer/consumer failures, lag spikes, and rebalance storms
Disk, network, JVM, and request handler saturation
Provide hands-on remediation for complex issues, including:
Partition reassignment and leader rebalance
Broker configuration tuning
Throttle/quota strategies for noisy producers or consumers
Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details.
Guide Tier 2 teams during major incidents and validate restoration actions.
2. Kafka Performance Engineering & Optimization
Analyze Kafka workloads for performance and scalability risks:
Partition skew and hot partitions
Inefficient producer batching/compression
Consumer lag root cause analysis
Thread pool, I/O, and network bottlenecks
Recommend and validate:
Topic design (partition count, replication factor, retention, compaction)
Producer and consumer configuration best practices
Quotas, quota enforcement, and multi-tenant controls
Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.
3. Platform Stability, Reliability & Resilience
Diagnose and resolve systemic Kafka stability issues:
Repeated broker failures or flapping
Metadata/controller instability (ZooKeeper or KRaft)
Recovery issues following failovers or maintenance events
Support resilience initiatives:
Multi-AZ cluster health validation
Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)
Failover testing and validation
Define and improve Kafka SLOs for availability, durability, and latency.
4. Change, Upgrade & Configuration Leadership
Lead medium- to high-risk Kafka changes, including:
Broker and cluster configuration changes
Partition expansion or large-scale reassignment
Topic policy changes impacting durability or performance
Support and plan:
Kafka version upgrades
MSK / Confluent upgrade cycles
Client compatibility and rollout strategies
Participate in CAB reviews, assess risk, and design rollback and validation plans.
