Network Engineer
Company: OpenAI
Location: San Francisco
Posted on: May 3, 2025
Job Description:
As a Network Engineer focused on WAN and LAN, you will play a
critical role in developing, managing, and optimizing the front-end
network components of OpenAI's supercomputing infrastructure. Your
expertise will ensure that our networks are fast, reliable, and
scalable to meet the demands of training frontier AI models. This
includes managing both local (LAN) and long-distance (WAN)
connectivity across our data centers, optimizing performance, and
ensuring seamless communication between compute nodes and clusters.
The role also includes writing code to instrument and observe
the network. Our team primarily uses Python and some Rust, so
familiarity with or interest in working with this stack is
essential.

This role is based in San Francisco, CA, with a hybrid
work model of 3 days per week in the office. Relocation assistance
is available.

In this role, you will:
- Design, manage, and optimize WAN and LAN infrastructure for
OpenAI's supercomputers.
- Develop and maintain data collection and monitoring systems to
ensure network visibility and performance.
- Troubleshoot and resolve network issues at every layer, including
TCP/IP, BGP, and physical-layer problems.
- Automate network issue detection and resolution to reduce
operational overhead.
- Work closely with hardware and systems engineers to meet the
performance demands of distributed AI training workloads.

You might thrive in this role if you:
- Have 5+ years of experience in networking or related
infrastructure roles.
- Possess strong expertise in networking technologies, protocols,
and design principles.
- Have hands-on experience troubleshooting complex networking
issues in both LAN and WAN environments.
- Deeply understand how to set up TCP/IP networks from scratch
(e.g., BGP and ECMP routing).
- Have a deep understanding of network protocols such as TCP/IP,
BGP, and VLAN.
- Are familiar with optical connectors and optical circuit
switches (OCS).
- Understand advanced concepts in routing, forwarding, and
network management systems.
- Have experience with telemetry, traffic engineering, and
congestion management to optimize network performance.
- Are skilled in collaborating across teams, combining technical
expertise with excellent problem-solving and communication
abilities.
- Exhibit ownership of problems end-to-end and maintain a
commitment to continuous learning to effectively solve
challenges.
- Are familiar with InfiniBand, RoCE, or RDMA in HPC
(High-Performance Computing) or similar environments.

About OpenAI
OpenAI is an AI research and deployment company dedicated to
ensuring that general-purpose artificial intelligence benefits all
of humanity. We push the boundaries of the capabilities of AI
systems and seek to safely deploy them to the world through our
products. AI is an extremely powerful tool that must be created
with safety and human needs at its core, and to achieve our
mission, we must encompass and value the many different
perspectives, voices, and experiences that form the full spectrum
of humanity.

We are an equal opportunity employer and do not
discriminate on the basis of race, religion, national origin,
gender, sexual orientation, age, veteran status, disability or any
other legally protected status.

For US-Based Candidates: Pursuant to
the San Francisco Fair Chance Ordinance, we will consider qualified
applicants with arrest and conviction records.

We are committed to
providing reasonable accommodations to applicants with
disabilities, and requests can be made via this link.

At OpenAI, we
believe artificial intelligence has the potential to help people
solve immense global challenges, and we want the upside of AI to be
widely shared. Join us in shaping the future of technology.