Senior Staff Machine Learning Engineer - DevOps/Site Reliability Engineer
Company: ServiceNow
Location: Santa Clara
Posted on: June 2, 2025
Job Description:
Company Description
It all started in sunny San Diego, California
in 2004 when a visionary engineer, Fred Luddy, saw the potential to
transform how we work. Fast forward to today - ServiceNow stands as
a global market leader, bringing innovative AI-enhanced technology
to over 8,100 customers, including 85% of the Fortune 500. Our
intelligent cloud-based platform seamlessly connects people,
systems, and processes to empower organizations to find smarter,
faster, and better ways to work. Join us as we pursue our purpose
to make the world work better for everyone.
Job Description
This position requires passing a ServiceNow
background screening, USFedPASS (US Federal Personnel Authorization
Screening Standards). This includes a credit check,
criminal/misdemeanor check and taking a drug test. Employment is
contingent upon passing the screening. Due to Federal requirements,
only US citizens, US naturalized citizens or US Permanent
Residents, holding a green card, will be considered.
PLATO (Platform
Engineering and AI Technology Organization) at ServiceNow is a
customer-focused innovative group building intelligent software
using a variety of technology stacks to enable end-to-end,
industry-leading work experiences for our customers. We are deeply
invested in our customers' success, with expertise in advanced
technologies and software engineering best practices. We prioritize
robustness, performance, and user experience over specific
technologies.
We are a team of technology professionals and platform
engineers with a dual mission: to build and evolve the AI platform,
and to partner with teams to create products and end-to-end
AI-powered work experiences. We also focus on foundational
research, experimentation, and de-risking AI technologies for
future innovations.
As a Senior Staff Machine Learning Engineer - Site Reliability Engineer, you will:
- Design, develop, and implement infrastructure, platform,
deployment, and observability features that support AI
workloads.
- Collaborate with researchers, AI engineers, and infrastructure
teams to optimize GPU cluster performance, scalability, and
reliability.
- Enhance the SRE practice by translating operational use cases
into software tooling requirements.
- Support deployment activities for AI/ML developers.
- Write high-quality, scalable, and reusable code, adhering to
best practices like code reviews and unit testing.
- Work closely with product owners to understand requirements and
oversee your code from design through delivery.
- Operate Large Language Models (LLMs) on NVIDIA GPUs.
- Mentor colleagues and promote knowledge sharing.
Qualifications
To succeed in this role, you should have:
- Experience integrating AI into work processes, decision-making,
or problem-solving, including using AI tools, automating workflows,
analyzing AI insights, or exploring AI's industry impact.
- 8+ years in infrastructure, platform operations, deployments,
SRE, and DevOps, focusing on platform health.
- 6+ years managing highly available distributed workloads on
Kubernetes following DevOps principles.
- 6+ years developing with Python, GoLang, Java, or similar
languages.
- Experience with DevOps tools like Helm, Ansible, Kubernetes,
Prometheus, Splunk, GitLab CI.
- Strong experience operating distributed Linux-based systems and
J2EE applications.
- Knowledge of software-defined networking, infrastructure as
code, and configuration management.
- Experience developing compliant and secure software for
regulated environments.
- Ability to lead projects with significant technical risks to
achieve outcomes.
We offer a competitive base salary, equity (when
applicable), incentives, and comprehensive benefits, including
health plans, 401(k), ESPP, matching donations, flexible time off,
and family leave programs. Compensation varies based on geographic
location and other factors.
Additional Information
Work Personas
We
support flexible, remote, and in-office work arrangements depending
on job requirements.
Equal Opportunity Employer
ServiceNow is
committed to diversity and inclusion. We consider all qualified
applicants without regard to race, color, creed, religion, sex,
sexual orientation, national origin, age, disability, gender
identity, marital status, veteran status, or any other protected
category. We also consider applicants with arrest or conviction
records in accordance with legal requirements.
Accommodations
If you
require accommodations during the application process, please
contact .
Export Control Regulations
Employment may be contingent
upon obtaining export licenses or approvals if required by law,
especially for roles with access to controlled technology.
From Fortune. 2024 Fortune Media IP Limited. All rights reserved. Used
under license.