Experienced Distinguished Engineer, Generative AI Systems and Cloud Infrastructure Development (Remote Eligible)
Posted 2025-10-26
Remote, USA
Full Time
Immediate Start
Introduction to Capital One and Our Mission

At Capital One, we are driven by a mission to create trustworthy, reliable, and human-in-the-loop AI systems that change banking for good. For years, we have been at the forefront of the industry in leveraging machine learning to create real-time, intelligent, automated customer experiences. Our applications of AI and ML have brought humanity and simplicity to banking, from informing customers about unusual charges to answering their questions in real-time. With our investments in public cloud infrastructure and machine learning platforms, we are uniquely positioned to harness the power of AI. We are committed to building world-class applied science and engineering teams and continuing our industry-leading capabilities with breakthrough product experiences and scalable, high-performance AI infrastructure.

About the Role

We are seeking an experienced Senior Distinguished Engineer, AI Systems, to help us build the foundations of our enterprise AI capabilities. As a key member of our team, you will work on a wide range of initiatives, including designing robust, secure infrastructure, building large-scale distributed training clusters, deploying Large Language Models (LLMs) on GPU instances for real-time use cases, and supporting cutting-edge AI research and development, all within our public cloud infrastructure. You will collaborate with a team of AI engineers and researchers to envision the target state of our capabilities while helping to design and implement key services.

Key Responsibilities

Design and build fault-tolerant infrastructure to support long-running large-scale training tasks reliably despite failure of individual nodes, using containers and checkpointing libraries.
Design and build infrastructure for serving large ML models in our public cloud.
Deploy a thousand-node training cluster in our public cloud, optimizing the storage and networking stack, with tightly coupled training pipelines that take advantage of multiple parallelism strategies.
Design and implement benchmarks to measure the performance of software systems within AI capabilities and make recommendations on technology selection.
Develop applications that leverage LLMs and Foundation Models (FMs).
Design and implement capabilities to support MLOps for foundation models.

Essential Qualifications

To be successful in this role, you will need:
A Bachelor's degree in Computer Science, Computer Engineering, or a technical field.
At least 7 years of experience designing and building distributed computing, HPC, and large-scale ML systems.
At least 5 years of experience developing AI and ML algorithms in Python or C/C++.
At least 3 years of experience with the full ML development lifecycle using AI and ML frameworks and public cloud.

Preferred Qualifications

While not required, the following qualifications are preferred:
A Master's degree or PhD in Engineering, Computer Science, a related technical field, or equivalent practical experience with a focus on modern AI techniques.
Experience designing large-scale distributed platforms and/or systems in cloud environments such as AWS, Azure, or GCP.
Experience architecting cloud systems for security, availability, performance, scalability, and cost.
Experience delivering very large models through the MLOps life cycle from exploration to serving.
Experience building GPU clusters in the public cloud with tightly coupled storage and networking.
Experience with the complete stack for distributed training of large models, including ML compilers, distributed training frameworks, and ML development frameworks such as PyTorch, TensorFlow, and Lightning.
Experience with one or more areas of the AI technology stack, including prompt engineering, guardrails, vector databases/knowledge bases, LLM hosting, and fine-tuning.
Authored research publications in top peer-reviewed conferences, or industry-recognized contributions in the space of neural networks, distributed training, and SysML.

Skills and Competencies

To excel in this role, you will need:
Strong technical skills in AI, ML, and software development.
Experience with cloud computing platforms, such as AWS, Azure, or GCP.
Excellent problem-solving skills, with the ability to analyze complex problems and develop creative solutions.
Strong communication and collaboration skills, with the ability to work effectively with cross-functional teams.
A passion for innovation and a desire to stay up-to-date with the latest developments in AI and ML.

Career Growth Opportunities and Learning Benefits

At Capital One, we are committed to helping our employees grow and develop their careers. As a Distinguished Engineer, you will have access to:
Opportunities to work on high-impact projects that drive business results.
Collaboration with experienced engineers and researchers who are passionate about AI and ML.
Professional development opportunities, including training, mentorship, and conference attendance.
A culture that encourages innovation, experimentation, and continuous learning.

Work Environment and Company Culture

At Capital One, we pride ourselves on our inclusive and dynamic work environment. As a remote-eligible employee, you will have the flexibility to work from anywhere while still being connected to our global team. Our company culture is built on a foundation of:
Respect and empathy for our customers and each other.
A passion for innovation and a desire to make a positive impact.
A commitment to diversity, equity, and inclusion.
A focus on employee well-being and work-life balance.

Compensation, Perks, and Benefits

We offer a comprehensive and competitive compensation package, including:
A competitive salary, with a range of $232,900 - $265,800 for remote employees.
Performance-based incentive compensation, including cash bonuses and long-term incentives.
A comprehensive benefits package, including health, financial, and other benefits that support your total well-being.
Opportunities for professional development and growth.

Conclusion

If you are a motivated and experienced engineer with a passion for AI and ML, we encourage you to apply for this exciting opportunity. As a Distinguished Engineer at Capital One, you will have the chance to work on high-impact projects, collaborate with talented engineers and researchers, and contribute to the development of innovative AI systems that change banking for good. Don't miss out on this opportunity to join our team and shape the future of AI in banking. Apply now!