Data Engineer for AI

Cloudera

2 - 5 years

Bengaluru

Posted: 3/19/2025

Job Description

Business Area:

Professional Services

Seniority Level:

Mid-Senior level

Job Description: 

At Cloudera, we empower people to transform complex data into clear and actionable insights. With as much data under management as the hyperscalers, we're the preferred data partner for the top companies in almost every industry. Powered by the relentless innovation of the open source community, Cloudera advances digital transformation for the world’s largest enterprises.

Role:

As a Customer Enablement Engineer specializing in Data Engineering for AI, you will design, develop, and deliver comprehensive curriculum content, including student guides, labs, quizzes, and certifications on data engineering and data preparation skills. This curriculum will enable Cloudera customers to effectively build AI systems on the Cloudera Hybrid platform.

Objective of this Role:

To ensure customers are successfully enabled to prepare high-quality data that meets the requirements for efficiently building their ML/AI systems, including LLMs.

As the Data Engineer for AI you will:

  • Develop a high-quality, impactful “data engineering for AI” course

  • Enable instructors to successfully deliver the course in classrooms to our customers 

  • Deliver hands-on workshops to customers, in person or remotely, on select course topics

  • Record and publish course content as online modules in digital format

  • Work with internal and external SMEs and customers to regularly gather input for improvement

  • Assist Edu sales leaders in selling educational products by serving as a technical resource

  • Own your self-development and stay resourceful. Enrich your knowledge of data analytics and AI topics through self-directed learning.

We’re excited about you if you have:

  • Five (5) or more years of data engineering experience with SQL, Python, Hive, Spark, Flink, Kafka, NiFi, and Airflow

  • Hands-on experience developing data ingest pipelines (batch and real-time) from various data sources into large analytics platforms, data warehouses, data lakes, and lakehouses

  • Experience with one or more LMS (learning management systems)

  • Experience or education in preparing data (both structured and unstructured) for ML/AI model development, including training and fine-tuning of LLMs

  • Experience with data governance, data lineage, and metadata best practices

  • Experience using data quality and data profiling tools and data catalogs

  • Experience publishing technology education content on digital media platforms such as Udemy, LinkedIn, YouTube, or your own website, as a curriculum developer or independent contributor

  • Experience working in public cloud environments from one of the hyperscalers (AWS, Google Cloud, or Microsoft Azure). A cloud certification is preferred

  • Experience working with containers and Kubernetes. A certification in Kubernetes is preferred 

  • Experience in (or training on) the Cloudera platform (CDP, HDP, or CDH) and any underlying Apache projects

  • Experience or training in preparing data for ML/AI model development including LLMs

  • Experience or training on Iceberg, Trino, and vector databases like Pinecone or Milvus

  • Experience using configuration management tools such as Git, Ansible, Puppet or Chef

  • Familiarity with scripting tools such as bash shell scripts, Python and/or Perl

Essential Soft Skills

  • Ability to work closely with the curriculum content development team to define the operational requirements for technical training courses

  • Ability to build efficient, well-architected, easy-to-use hands-on lab environments

  • Ability to work as part of a remote, distributed team

It is a plus if you have:

  • Cloud certification from at least one hyperscaler: AWS, Azure, or GCP

  • Expertise in preprocessing unstructured data for generative AI, including tokenization and embedding generation

  • Proficiency with one or more vector databases (e.g., Pinecone, Milvus) for managing embeddings in semantic search and data retrieval.

  • Skills in handling large-scale datasets for LLMs, including sharding, distributed loading, and parallel data processing.

  • Knowledge of data lineage, versioning, and metadata tracking to ensure compliant, high-quality training data for generative AI.

What you can expect from us:

  • Generous PTO Policy 

  • Support for work-life balance with Unplugged Days

  • Flexible WFH Policy 

  • Mental & Physical Wellness programs 

  • Phone and Internet Reimbursement program 

  • Access to Continued Career Development 

  • Comprehensive Benefits and Competitive Packages 

  • Paid Volunteer Time

  • Employee Resource Groups

Cloudera is an Equal Opportunity / Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, pregnancy, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.

#LI-Hybrid

#LI-SN1

About Company

Cloudera provides enterprise data cloud solutions, enabling businesses to manage and analyze large volumes of data. The company specializes in big data technologies such as Hadoop and other Apache projects, helping organizations unlock value from their data by providing tools for data processing, storage, and analytics.
