Data Engineer for AI
Cloudera
2 - 5 years
Bengaluru
Posted: 3/19/2025
Job Description
Business Area:
Professional Services
Seniority Level:
Mid-Senior level
Job Description:
At Cloudera, we empower people to transform complex data into clear and actionable insights. With as much data under management as the hyperscalers, we're the preferred data partner for the top companies in almost every industry. Powered by the relentless innovation of the open source community, Cloudera advances digital transformation for the world’s largest enterprises.
Role:
As a Customer Enablement Engineer specializing in Data Engineering for AI, you will design, develop, and deliver comprehensive curriculum content, including student guides, labs, quizzes, and certifications on data engineering and data preparation skills. This curriculum will enable Cloudera customers to effectively build AI systems on the Cloudera Hybrid platform.
Objective of this Role:
To ensure customers are successfully enabled to prepare high-quality data that meets the requirements for efficiently building their ML/AI systems, including LLMs.
As the Data Engineer for AI you will:
Develop a high-quality and impactful “data engineering for AI” course
Enable instructors to successfully deliver the course in classrooms to our customers
Deliver hands-on workshops to customers, in person or remotely, on select course topics
Record and publish course content as online modules in digital format
Work with internal and external SMEs and customers to regularly seek input for improvement
Assist Edu sales leaders in selling educational products by serving as a technical resource
Own your self-development and stay resourceful at all times. Enrich your knowledge of data analytics and AI topics by being a self-learner.
We’re excited about you if you have:
Five (5) or more years of data engineering experience with SQL, Python, Hive, Spark, Flink, Kafka, NiFi and Airflow.
Hands-on experience in developing data ingest (batch and real-time) pipelines from various data sources into large analytics platforms, data warehouses, data lakes and lakehouses
Experience with one or more LMS (learning management systems)
Experience or education in preparing data (both structured and unstructured) for ML/AI model development, including training and fine-tuning of LLMs
Experience with data governance, data lineage, and metadata best practices
Experience using data quality and data profiling tools and data catalogs
Experience publishing technology education content on digital media platforms such as Udemy, LinkedIn, YouTube, or your own website, as a curriculum developer or independent contributor
Experience working in public cloud environments from one of the hyperscalers, such as AWS, Google Cloud, or Microsoft Azure. A cloud certification is preferred
Experience working with containers and Kubernetes. A certification in Kubernetes is preferred
Experience in (or training on) the Cloudera platform (CDP, HDP or CDH) and any underlying Apache projects
Experience or training in preparing data for ML/AI model development including LLMs
Experience or training on Iceberg, Trino and vector databases like Pinecone or Milvus
Experience using configuration management tools such as Git, Ansible, Puppet or Chef
Familiarity with scripting tools such as bash shell scripts, Python and/or Perl
Soft Skills Essential
Ability to work closely with the curriculum content development team to define the operational requirements for technical training courses
Ability to build efficient, well-architected, easy-to-use hands-on lab environments
Ability to work as part of a remote, distributed team
It is a plus if you have:
Cloud certification on at least one hyperscaler: AWS, Azure, or GCP
Expertise in preprocessing unstructured data for generative AI, including tokenization and embedding generation
Proficiency with one or more vector databases (e.g., Pinecone, Milvus) for managing embeddings in semantic search and data retrieval.
Skills in handling large-scale datasets for LLMs, including sharding, distributed loading, and parallel data processing.
Knowledge of data lineage, versioning, and metadata tracking to ensure compliant, high-quality training data for generative AI.
What you can expect from us:
Generous PTO Policy
Support for work-life balance with Unplugged Days
Flexible WFH Policy
Mental & Physical Wellness programs
Phone and Internet Reimbursement program
Access to Continued Career Development
Comprehensive Benefits and Competitive Packages
Employee Resource Groups
Cloudera is an Equal Opportunity / Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, pregnancy, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.
#LI-Hybrid
#LI-SN1
About Company
Cloudera provides enterprise data cloud solutions, enabling businesses to manage and analyze large volumes of data. The company specializes in big data technologies such as Apache Hadoop, helping organizations unlock value from their data by providing tools for data processing, storage, and analytics.