Senior DevOps Lead Engineer (AI Acceleration) - Hybrid
Santa Clara, CA
$80.00 - $110.00 Per Hour (Employer provided)
The Role
You will be the senior DevOps technical lead on the Infrastructure team, owning the CI/CD pipelines, container infrastructure, observability stack, and shared tooling that AI/ML hardware accelerator development runs on in the lab, in the cloud, and across colocations at scale.
Because we design and manufacture AI acceleration silicon, a core part of this is working with internal cloud and lab physical systems: automating and operating on-premises GPU clusters, high-speed interconnects, and lab server infrastructure, not just cloud resources. You will build the automation layer that ties lab hardware, cloud environments, and developer tooling into a single, reliable system.
You will also be instrumental in scaling that system globally, as we build toward a follow-the-sun DevOps model across our expanding engineering sites.
What You Will Do
DevOps Leadership
Own CI/CD pipelines, runners, and execution environments across software, silicon, hardware, and ML teams: GitLab CI, GitHub Actions, and build systems like Bazel.
Build and maintain automated provisioning and deployment pipelines for GPU driver stacks, AI/ML frameworks (PyTorch, TensorFlow), and inference software; implement container-based test harnesses (Docker/Kubernetes/Singularity) that verify driver and framework compatibility across hardware generations (NVIDIA, AMD, Intel).
Improve pipeline performance through parallelization, caching, and architectural changes; maintain the Docker image library supporting AI/ML workload testing across distributions and framework versions.
Automation & Infrastructure as Code
Own IaC and configuration management (Terraform, Ansible, Python, Go, Bash) across lab, on-prem, colo, and cloud (AWS, Azure, GCP), covering GPU/CPU driver provisioning through infrastructure deployments, with remote state management, environment isolation, and plan validation.
Build automation to eliminate toil and enforce consistency across team workflows; implement auto-remediation where appropriate, with blast-radius controls and approval gates for production systems.
Operate and automate Kubernetes clusters and HPC container environments (Singularity/Apptainer) across cloud and on-premises: installation, upgrades, workload management, and troubleshooting.
Observability, Reliability & Incident Response
Design and maintain dashboards, alerting, and monitoring (Prometheus/Grafana, DataDog) across CI runners, lab hardware, GPU utilization, and shared services; define SLOs/SLIs and lead structured incident response when they are breached.
Lead incident triage from bare metal to the application layer, resolving infrastructure, software, and hardware faults across CI/CD, lab, container, and cloud environments, including GPU drivers, framework crashes, and network issues.
Documentation & Global Collaboration
Create and maintain high-quality documentation: architecture diagrams, troubleshooting guides, onboarding materials, and API/tool references.
Partner with Global DevOps and SRE team members to build a consistent, scalable operating model.
Serve as a technical resource across engineering teams: developing and sharing best practices, raising technical debt and reliability risks early, and always coming with a proposed plan.
Drive innovation by supporting R&D activities and leading proof-of-concept (POC) and proof-of-value (POV) evaluations for new tooling, infrastructure patterns, and accelerator technologies.
What You Will Bring
Required
Bachelor's or Master's in Computer Science, Electrical Engineering, or a related field, with 10+ years of hands-on DevOps/infrastructure experience preferred (8 years minimum).
Deep Linux systems expertise: package management, networking (TCP/IP stack, routing, bonding), storage, systemd, kernel parameters, and performance tuning.
Production-grade Git-based CI/CD experience: pipeline design, runner management, merge request workflows, caching, and artifact handling.
Strong Python and/or Bash scripting for automation, with the ability to write clean, tested, maintainable code, not just one-off scripts.
Hands-on Ansible experience: writing playbooks from scratch for complex, multi-host configuration scenarios, and mentoring team members on Ansible and IaC best practices.
Docker/container expertise: multi-stage builds, registry management, security scanning, and container networking.
Kubernetes operational experience: cluster lifecycle, workload debugging, storage, networking, and RBAC.
Prometheus + Grafana observability stack: metric instrumentation, alert design, and dashboard development.
Experience supporting AI/ML or HPC workloads on GPU or accelerator hardware, including driver installation, framework compatibility, and hardware-level troubleshooting.
Comfort operating in fast-moving startups: you ship, document, and iterate rather than wait for perfect requirements.
Cross-site or follow-the-sun DevOps technical leadership experience.
Strongly Preferred
Production Go and/or Python for DevOps services beyond scripting: pipeline validators, health-check microservices, or auto-remediation agents.
Experience with artifact repositories such as Harbor, Nexus, Artifactory, or GitLab Package Registry.
Job scheduling systems: Slurm, LSF, or similar HPC-style cluster job control.
Knowledge of CPU/GPU architectures and high-speed interconnect fabrics: InfiniBand, RoCE (RDMA over Converged Ethernet), or NVLink.
Prior experience speaking at technical conferences or writing public-facing technical documentation or blog posts.