AI Kernel Engineer

Snaphunt

2 - 5 years

Pune City

Posted: 11/05/2026

Job Description

You will play a key role in building and optimising high-performance AI kernels for a next-generation compute platform. This role focuses on enabling efficient execution of AI and LLM workloads by developing, profiling, and optimising kernels across varying hardware configurations.

You will be responsible for:

  • Developing AI/LLM kernels and operators for efficient inference on a specialised compute platform
  • Optimising kernel performance across different hardware configurations and workloads
  • Profiling and analysing performance across compute, memory, and parallelism to identify bottlenecks
  • Optimising low-level C/C++ code to maximise hardware utilisation
  • Collaborating across the AI inference stack, including runtime, compiler, and system layers
  • Contributing to improvements in toolchain, compiler, and runtime components
  • Supporting internal teams and external stakeholders with technical insights and documentation


Ideal Candidate

  • You have a Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field
  • You have 5+ years of experience in AI kernel development and performance optimisation
  • You have experience in profiling models and kernel inference performance
  • You have hands-on experience with at least one of the following: CUDA, DSP, NEON, or Triton
  • You have strong proficiency in C/C++ and Python; exposure to assembly is a plus
  • You have strong problem-solving, debugging, and communication skills
  • You are comfortable working close to hardware and across system layers


The Offer

  • Competitive compensation with meaningful equity
  • High-impact role in a deeply technical, low-bureaucracy environment
  • Opportunity to work on cutting-edge AI systems and long-term career growth


About the employer

Our client is a Silicon Valley-based deep-tech company building a new compute architecture for real-time AI at the edge. Founded by engineers from leading research backgrounds, the company focuses on closing the gaps in current neural processing approaches through tight integration of hardware and software.

The platform is built to run both neural network inference and conventional compute workloads efficiently across a wide range of edge devices. Unlike typical accelerators that only handle parts of an ML graph, this architecture supports end-to-end execution, including both neural network graph code and standard C++ DSP and control code, enabling greater flexibility and performance in real-world deployments.
