Software Engineer — Infrastructure
About Basis
Basis is a nonprofit applied AI research organization with two mutually reinforcing goals.
The first is to understand and build intelligence. This entails establishing the mathematical principles of reasoning, learning, decision-making, understanding, and explaining, and constructing software that embodies these principles.
The second is to advance society’s ability to solve intractable problems. This involves expanding the scale, complexity, and breadth of problems we can solve today and, more importantly, accelerating our ability to solve problems in the future.
To achieve these goals, we are building both a new technological foundation inspired by human reasoning and a new type of collaborative organization that prioritizes human value.
About the Role
Software Engineers on the Platform team at Basis build the infrastructure that accelerates research and enables commercial deployment of Basis innovations. You will create reliable training and evaluation infrastructure, manage compute resources that scale to support medium-scale models, develop SaaS platform offerings, and build the technical foundation that supports both internal research and external customers.
We are looking for people who excel at infrastructure engineering and understand the unique demands of ML systems at scale. The ideal Software Engineer has experience with distributed systems, cloud infrastructure, and ML training pipelines, and brings a reliability-focused mindset that ensures researchers can trust the systems they depend on. You will work at the intersection of cutting-edge research and production-grade infrastructure.
This role is central to Basis’s commercial strategy and scaling objectives. The Platform team develops general-purpose infrastructure separate from individual design partner teams, enabling replication-based growth across multiple domains and clients.
We seek individuals who aspire to build rigorous, high-quality, robust systems, but are not afraid to iterate quickly, learn from production, and explore different architectural approaches to achieve excellence.
Basis is a collaborative effort, both internally and with our external partners; we are looking for people who enjoy building infrastructure for problems larger than any they could tackle alone.
We expect you to:
- Have demonstrated significant technical achievements in infrastructure engineering. Examples include:
  - Building ML training or inference infrastructure for distributed systems
  - Developing cloud platforms or services used by multiple teams or customers
  - Creating developer tools, CI/CD systems, or deployment automation at scale
  - Contributing to infrastructure open-source projects or technical systems with high reliability requirements
- Possess deep understanding of distributed systems principles including consistency, availability, fault tolerance, scalability patterns, and performance optimization for high-throughput, low-latency workloads.
- Have hands-on experience with cloud platforms (AWS, GCP, Azure) including compute orchestration, storage systems, networking, and cost optimization strategies. Experience managing significant cloud budgets is valuable.
- Be proficient in infrastructure technologies including Kubernetes, Docker, infrastructure as code (Terraform), CI/CD pipelines, monitoring and observability (Prometheus, Grafana), and modern DevOps practices.
- Understand ML infrastructure requirements including GPU cluster management, distributed training frameworks (PyTorch Distributed, DeepSpeed, Ray), experiment tracking, model versioning, and reproducible research pipelines.
- Be proficient in Python (the primary language for our ML work) and have familiarity with systems programming languages such as Go, Rust, or C++ for performance-critical components.
- Value reliability and operational excellence. You design systems that fail gracefully, monitor them proactively, and enable teams to debug and recover quickly when issues arise.
- Make progress autonomously on complex technical challenges. You can scope infrastructure projects, make sound architectural decisions, and execute from design through deployment.
- Be excited about enabling breakthrough research that advances society’s ability to solve intractable problems through robust, scalable infrastructure.
Responsibilities
- Design and build ML training infrastructure supporting medium-scale models with distributed training across GPU clusters, experiment tracking, checkpoint management, and reproducible pipelines.
- Develop SaaS platform and API offerings that package Basis research innovations into commercial products, including backend services, API design, authentication, rate limiting, and customer-facing features.
- Manage compute infrastructure as it scales, including capacity planning, resource allocation, cost optimization, cloud and on-premise orchestration, and efficiency monitoring.
- Build developer tools and workflows that accelerate research, including CI/CD pipelines, testing frameworks, deployment automation, and development environment management.
- Implement monitoring and observability that provide comprehensive visibility into system health, performance, costs, and research progress through metrics, logging, alerting, and dashboards.
- Ensure system reliability and scalability by designing fault-tolerant architectures, implementing graceful degradation, conducting load testing, and establishing SLAs appropriate for research and production workloads.
- Collaborate with research teams to understand infrastructure needs, translate experimental techniques into scalable systems, and provide technical consultation on architecture and performance.
- Maintain security and compliance by implementing access controls, encryption, and audit logging, and by adhering to data governance policies as Basis serves external customers.
- Contribute to the culture and direction of Basis by modeling technical excellence, operational discipline, and focus on enabling high-impact research and commercial applications.
Role Details
Exceptional candidates who do not meet every criterion listed here are still encouraged to apply.
- FT/PT: Full-time
- In-person Policy: We are in the office four days a week. Be prepared to attend multi-day Basis-wide in-person events.
- Location: New York City.
- Salary: Competitive.
In addition, the following would be an advantage:
- Experience at companies building ML infrastructure at scale (Anthropic, OpenAI, Google, Meta AI Research, Weights & Biases, HuggingFace).
- Background in ML research or research engineering providing understanding of researcher workflows.
- Experience with on-premise GPU cluster management or hybrid cloud architectures.
- Contributions to infrastructure open-source projects (Kubernetes, PyTorch, Ray).
- SRE background or experience with production ML systems serving external customers.
- Understanding of AI safety and responsible AI deployment practices.
Privacy Notice
By submitting your application, you grant Basis permission to use your materials for both hiring evaluation and recruitment-related research and development purposes. Your information may be processed in different countries, including the US. You retain copyright while providing Basis a license to use these materials for the stated purposes. Read our full Global Data Privacy Notice here.