Hamza Abubakar Kheruwala

AI Systems Engineer

I architect production AI systems that hold up under the conditions real users create — not just in the test harness.

What I learn in production becomes the pattern the next team builds on.

About

I'm an AI systems engineer at Morgan Stanley, owning the architecture of production LLM agent systems in a regulated environment. The core constraint: a confident wrong answer is worse than no answer. Every technical decision flows from that.

Most of what I know came from things that failed instructively. I shipped a model that beat every benchmark and degraded for real users. Diagnosing that — and rebuilding the evaluation framework around it — changed how I think about what it means for a system to actually be good. That framework became the org-wide standard.

I care most about systems that outlast the original deployment — documented clearly enough that teams I've never worked with can build on them. Three of the AI systems I've built are being used that way now.

How I Build

Evaluation is an architecture problem, not a QA step.

I decide what to measure before writing a line of code. A benchmark that doesn't correlate with real user outcomes isn't a safety net — it's a false one.

Confident wrong answers are worse than no answer.

In regulated environments, routing to a human beats returning a low-confidence response. I've built confidence-gated routing as a deliberate first-class decision, calibrated against production data — not bolted on as a fallback.
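A minimal sketch of what confidence-gated routing can look like (the threshold, names, and escalation value here are illustrative assumptions, not the production system):

```python
from dataclasses import dataclass

# Illustrative floor; in practice this is calibrated against production data.
CONFIDENCE_FLOOR = 0.85

@dataclass
class Answer:
    text: str
    confidence: float  # assumed to be a calibrated score in [0, 1]

def route(answer: Answer) -> str:
    """Return the model's answer only when confidence clears the floor;
    otherwise escalate to a human analyst instead of guessing."""
    if answer.confidence >= CONFIDENCE_FLOOR:
        return answer.text
    return "escalate-to-human"
```

The point of making this a first-class routing decision is that the floor becomes a tunable, auditable parameter rather than an afterthought buried in a fallback branch.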

Backend reliability is what makes AI systems trustworthy.

The LLM is usually the easiest part to swap. What determines whether users can actually rely on the system is everything around it — pipelines, observability, retrieval, output enforcement.

Every system should outlast the original deployment.

I document failure modes after every production build — not just the architecture, but the specific places where the design breaks. That's what makes a system reusable rather than a one-off.

Experience

  1. Nov 2023 — Present

    Senior Software Engineer II — AI Systems & Infrastructure · Morgan Stanley

    Technical owner of production LLM agent systems in a regulated financial environment — orchestration architecture, evaluation infrastructure, and the patterns other teams build on.

    • Designed a multi-step agent with confidence-gated routing: uncertain outputs route to a human analyst rather than returning a low-confidence answer. Zero compliance incidents.
    • Rebuilt the org's LLM evaluation framework after shipping a model that beat every benchmark yet degraded for real users; re-centered scoring on production outcome signals, now the standard for every model release.
    • The RAG architecture I built has been adopted as the baseline by three downstream teams, packaged with documented failure modes so they didn't have to start from scratch.
    • Partner with enterprise engineering teams to recover broken deliveries; the structural fix from one recovery was institutionalized across all enterprise accounts.
    • LangGraph
    • Python
    • AWS Bedrock
    • Kafka
    • Terraform
  2. Feb 2023 — Nov 2023

    Software Engineer I — AI Systems & Automation · Morgan Stanley

    Shipped the org's earliest production LLM deployments. Built the foundational infrastructure — pipelines, evaluation tooling, human feedback systems — that became the reference for teams that followed.

    • Led one of the org's first production LLM deployments: a RAG-enhanced pipeline for legacy code modernization with validation infrastructure built from scratch.
    • Built a human feedback pipeline from zero prior background in six weeks — preference schema, reward modeling, PPO integration. Adopted by multiple downstream teams; wrote the onboarding doc that became the standard reference.
    • Cut a three-week model selection process to three days by identifying the one decision variable that mattered and designing a targeted experiment around it.
    • LangChain
    • Python
    • AWS Lambda
    • RLHF
    • RAG
  3. Jul 2021 — Sep 2022

    Regional Associate · Accelerator Intern · Hult Prize Foundation

    Backend infrastructure for a global competition platform — distributed systems under real surge conditions.

    • Engineered a fault-tolerant distributed event system for real-time load surges. The gap between load testing and what users actually create at peak is a lesson I've carried into every production system since.
    • Node.js
    • PostgreSQL
    • Distributed Systems
  4. Jan 2021 — May 2021

    AI Engineering Intern · IoTIoT.in

    Built a real-time, device-agnostic gesture recognition framework using ML-driven motion tracking.

    • Developed motion tracking algorithms to improve input reliability across diverse hardware.
    • Built parallelized training pipelines to reduce latency without sacrificing classification accuracy.
    • Python
    • TensorFlow
    • Signal Processing
  5. Jun 2020 — Dec 2020

    Backend & ML Intern · MediaPro Innovations

    Applied ML-driven content filtering and behavior analysis to improve engagement on an ed-tech platform.

    • Built content filtering on user behavior patterns to surface more relevant learning materials.
    • Improved backend efficiency through caching and indexing, supporting a growing user base.
    • Python
    • Machine Learning
    • Backend

Projects

  1. Citation-Grounded Knowledge Agent

    A multi-step AI agent built around one constraint: every answer must trace back to source — enforced at the generation layer, not bolted on after.

    • Confidence-gated abstention: if retrieval isn't confident enough to support a grounded answer, the system returns nothing rather than extrapolating.
    • Multi-model tiering routes each query to the cheapest model that clears the quality bar — inference cost as a first-class concern from day one.
    • LangGraph
    • AWS Bedrock
    • pgvector
    • LangSmith
    • Python
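The cheapest-model-that-clears-the-bar idea can be sketched as follows (model names, costs, and the two callables are placeholders, not the project's actual interfaces):

```python
# Illustrative tiers as (model name, relative cost), cheapest first.
TIERS = [("small-model", 1), ("mid-model", 5), ("large-model", 25)]

def answer_with_cheapest(query, generate, passes_quality_bar):
    """Try models cheapest-first; return the first answer that clears the bar.

    `generate` and `passes_quality_bar` are assumed callables supplied by
    the surrounding system (e.g. an LLM client and a grounding evaluator).
    """
    for model, _cost in TIERS:
        candidate = generate(model, query)
        if passes_quality_bar(candidate):
            return model, candidate
    # Abstain: no tier produced an answer that clears the quality bar.
    return None, None
```

Note that the abstention case falls out naturally: when even the top tier fails the bar, the system returns nothing rather than extrapolating.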
  2. LLM Evaluation Framework — Production Rebuild

    A model can improve on every tracked metric, ship to users, and be worse. This is what I built after that happened — a rebuild of how evaluation actually works.

    • Pulled production outcome signals and ran correlation analysis against every benchmark task. Rebuilt scoring around what actually predicted user outcomes, with a mandatory human judgment gate at each release.
    • Adopted org-wide — not because it was mandated, but because every team building LLM features eventually hits the same benchmark-vs-production divergence. The framework solved it once.
    • Python
    • LLM Evaluation
    • Statistical Analysis
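A toy version of the benchmark-vs-outcome correlation check (the numbers are invented for illustration; the real framework pulled production signals):

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation; Spearman on ranks works the same way."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: benchmark score vs. a production outcome signal
# (e.g. task completion rate) across past model releases.
benchmark = [0.71, 0.75, 0.80, 0.84, 0.88]
outcome   = [0.62, 0.66, 0.64, 0.61, 0.60]

# A near-zero or negative value flags a benchmark that doesn't predict
# user outcomes and should be dropped from the scoring suite.
r = pearson(benchmark, outcome)  # negative here: the benchmark misleads
```

Running this per benchmark task is the mechanical core; the judgment call is deciding which production signals count as ground truth.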
  3. Aarogya — Privacy-Preserving Mental Health Risk Detection

    An NLP pipeline for early detection of depression and suicide-risk signals — privacy as an architectural constraint, not a compliance checkbox. The hardest problem was defining what a detection system that respects user dignity looks like at the architecture level.

    • Python
    • NLP
    • Privacy-Preserving ML
    • AWS
  4. Diverting Public Complaints Based on Textual Analysis

    Built a text-classification pipeline to route financial/public complaints to the right department using comparative ML experiments and evaluation-driven iteration.

    • Python
    • Machine Learning
    • Text Classification
  5. Text Summarization Using Sentiment Analysis

    Implemented sentiment-driven summarization on customer review data, integrating web-scraped inputs with supervised ML baselines.

    • Python
    • NLP
    • Machine Learning
  6. Survey Masters Website

    Designed and built a full-stack survey platform with authenticated workflows for creating surveys and collecting responses.

    • Web
    • JavaScript
    • CSS
    • Backend
  7. Deep Learning for Satellite Imaging

    Produced a deep-learning research report exploring satellite-imaging use cases, modeling approaches, and practical deployment constraints.

    • Deep Learning
    • Computer Vision
    • Research
  8. Question Paper Generator

    Built a web-based application concept for generating structured question papers with configurable templates and sections.

    • Web
    • Automation
    • Product Design

Stack & Tools

AI & Agent Systems

  • LangGraph
  • LangChain
  • AWS Bedrock
  • RAG Pipeline Design
  • LLM Evaluation & Outcome-Aligned Scoring
  • RLHF & Preference Data
  • Confidence Routing & Abstention
  • Structured Output Enforcement
  • Embedding Model Evaluation
  • LangSmith
  • Multi-Step Agent Orchestration

Backend & Infrastructure

  • Python
  • TypeScript
  • Node.js
  • Java
  • SQL
  • REST APIs
  • GraphQL
  • AWS Lambda
  • Step Functions
  • API Gateway
  • Kinesis
  • DynamoDB
  • S3
  • Aurora
  • Apache Kafka
  • Apache Flink
  • Terraform
  • Docker
  • GitHub Actions CI/CD
  • PostgreSQL
  • pgvector
  • Pinecone
  • Redis
  • MongoDB

Certifications

  • AWS Certified Machine Learning Specialty

Education

  1. Buffalo, NY

    University at Buffalo, SUNY

    M.S., Computer Science

  2. Ahmedabad, India

    Nirma University

    B.Tech, Computer Engineering