
Senior Staff Machine Learning Engineer - DevOps/Site Reliability Engineer

Company: ServiceNow
Location: Santa Clara
Posted on: June 2, 2025

Job Description:

Company Description

It all started in sunny San Diego, California in 2004 when a visionary engineer, Fred Luddy, saw the potential to transform how we work. Fast forward to today - ServiceNow stands as a global market leader, bringing innovative AI-enhanced technology to over 8,100 customers, including 85% of the Fortune 500. Our intelligent cloud-based platform seamlessly connects people, systems, and processes to empower organizations to find smarter, faster, and better ways to work. Join us as we pursue our purpose to make the world work better for everyone.
Job Description

This position requires passing a ServiceNow background screening, USFedPASS (US Federal Personnel Authorization Screening Standards). This includes a credit check, criminal/misdemeanor check and taking a drug test. Employment is contingent upon passing the screening. Due to Federal requirements, only US citizens, US naturalized citizens or US Permanent Residents, holding a green card, will be considered.

PLATO (Platform Engineering and AI Technology Organization) at ServiceNow is a customer-focused innovative group building intelligent software using a variety of technology stacks to enable end-to-end, industry-leading work experiences for our customers. We are deeply invested in our customers' success, with expertise in advanced technologies and software engineering best practices. We prioritize robustness, performance, and user experience over specific technologies.

We are a team of technology professionals and platform engineers with a dual mission: to build and evolve the AI platform, and to partner with teams to create products and end-to-end AI-powered work experiences. We also focus on foundational research, experimentation, and de-risking AI technologies for future innovations.

As a Senior Staff Machine Learning Engineer - Site Reliability Engineer you will:

  • Design, develop, and implement infrastructure, platform, deployment, and observability features that support AI workloads.
  • Collaborate with researchers, AI engineers, and infrastructure teams to optimize GPU cluster performance, scalability, and reliability.
  • Enhance the SRE practice by translating operational use cases into software tooling requirements.
  • Support deployment activities for AI/ML developers.
  • Write high-quality, scalable, and reusable code, adhering to best practices like code reviews and unit testing.
  • Work closely with product owners to understand requirements and oversee your code from design through delivery.
  • Operate Large Language Models (LLMs) on NVIDIA GPUs.
  • Mentor colleagues and promote knowledge sharing.

Qualifications

To succeed in this role, you should have:
    • Experience integrating AI into work processes, decision-making, or problem-solving, including using AI tools, automating workflows, analyzing AI insights, or exploring AI's industry impact.
    • 8+ years in infrastructure, platform operations, deployments, SRE, and DevOps, focusing on platform health.
    • 6+ years managing highly-available distributed workloads on Kubernetes following DevOps principles.
    • 6+ years developing with Python, GoLang, Java, or similar languages.
    • Experience with DevOps tools like Helm, Ansible, Kubernetes, Prometheus, Splunk, GitLab CI.
    • Strong experience operating distributed Linux-based systems and J2EE applications.
    • Knowledge of software-defined networking, infrastructure as code, and configuration management.
    • Experience developing compliant and secure software for regulated environments.
    • Ability to lead projects with significant technical risks to achieve outcomes.

We offer a competitive base salary, equity (when applicable), incentives, and comprehensive benefits, including health plans, 401(k), ESPP, matching donations, flexible time off, and family leave programs. Compensation varies based on geographic location and other factors.

Additional Information

Work Personas

We support flexible, remote, and in-office work arrangements depending on job requirements.

Equal Opportunity Employer

ServiceNow is committed to diversity and inclusion. We consider all qualified applicants without regard to race, color, creed, religion, sex, sexual orientation, national origin, age, disability, gender identity, marital status, veteran status, or any other protected category. We also consider applicants with arrest or conviction records in accordance with legal requirements.

Accommodations

If you require accommodations during the application process, please contact .

Export Control Regulations

Employment may be contingent upon obtaining export licenses or approvals if required by law, especially for roles with access to controlled technology.

From Fortune. 2024 Fortune Media IP Limited. All rights reserved. Used under license.

