Hamza Abubakar Kheruwala

AI Systems Engineer

I architect production AI systems that hold up under the conditions real users create — not just in the test harness.

What I learn in production becomes the pattern the next team builds on.

About

I'm an AI systems engineer at Morgan Stanley, owning the architecture of production LLM agent systems in a regulated environment. The core constraint: a confident wrong answer is worse than no answer. Every technical decision flows from that.

Most of what I know came from things that failed instructively. I shipped a model that beat every benchmark and degraded for real users. Diagnosing that — and rebuilding the evaluation framework around it — changed how I think about what it means for a system to actually be good. That framework became the org-wide standard.

I care most about systems that outlast the original deployment — documented clearly enough that teams I've never worked with can build on them. Three of the AI systems I've built are being used that way now.

How I Build

Evaluation is an architecture problem, not a QA step.

I decide what to measure before writing a line of code. A benchmark that doesn't correlate with real user outcomes isn't a safety net — it's a false one.

Confident wrong answers are worse than no answer.

In regulated environments, routing to a human beats returning a low-confidence response. I've built confidence-gated routing as a deliberate first-class decision, calibrated against production data — not bolted on as a fallback.
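A minimal sketch of what confidence-gated routing can look like (the threshold, names, and escalation value here are illustrative assumptions, not the production system):

```python
from dataclasses import dataclass

# Illustrative floor; in practice this is calibrated against production data.
CONFIDENCE_FLOOR = 0.85

@dataclass
class Answer:
    text: str
    confidence: float  # assumed to be a calibrated score in [0, 1]

def route(answer: Answer) -> str:
    """Return the model's answer only when confidence clears the floor;
    otherwise escalate to a human analyst instead of guessing."""
    if answer.confidence >= CONFIDENCE_FLOOR:
        return answer.text
    return "escalate-to-human"
```

The point of making this a first-class routing decision is that the floor becomes a tunable, auditable parameter rather than an afterthought buried in a fallback branch.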

Backend reliability is what makes AI systems trustworthy.

The LLM is usually the easiest part to swap. What determines whether users can actually rely on the system is everything around it — pipelines, observability, retrieval, output enforcement.

Every system should outlast the original deployment.

I document failure modes after every production build — not just the architecture, but the specific places where the design breaks. That's what makes a system reusable rather than a one-off.

Experience

  1. Nov 2023 — Present

    Senior Software Engineer II — AI Systems & Infrastructure · Morgan Stanley

    Technical owner of production LLM agent systems in a regulated financial environment — orchestration architecture, evaluation infrastructure, and the patterns other teams build on.

    • Designed a multi-step agent with confidence-gated routing: uncertain outputs route to a human analyst rather than returning a low-confidence answer. Zero compliance incidents.
    • Rebuilt the org's LLM evaluation framework after shipping a model that beat every benchmark yet degraded for real users; re-centered scoring on production outcome signals, now the standard for every model release.
    • The RAG architecture I built has been adopted as the baseline by three downstream teams, packaged with documented failure modes so they didn't have to start from scratch.
    • Partner with enterprise engineering teams to recover broken deliveries; the structural fix from one recovery was institutionalized across all enterprise accounts.
    • LangGraph
    • Python
    • AWS Bedrock
    • Kafka
    • Terraform
  2. Feb 2023 — Nov 2023

    Software Engineer I — AI Systems & Automation · Morgan Stanley

    Shipped the org's earliest production LLM deployments. Built the foundational infrastructure — pipelines, evaluation tooling, human feedback systems — that became the reference for teams that followed.

    • Led one of the org's first production LLM deployments: a RAG-enhanced pipeline for legacy code modernization with validation infrastructure built from scratch.
    • Built a human feedback pipeline from zero prior background in six weeks — preference schema, reward modeling, PPO integration. Adopted by multiple downstream teams; wrote the onboarding doc that became the standard reference.
    • Cut a three-week model selection process to three days by identifying the one decision variable that mattered and designing a targeted experiment around it.
    • LangChain
    • Python
    • AWS Lambda
    • RLHF
    • RAG
  3. Jul 2021 — Sep 2022

    Regional Associate · Accelerator Intern · Hult Prize Foundation

    Backend infrastructure for a global competition platform — distributed systems under real surge conditions.

    • Engineered a fault-tolerant distributed event system for real-time load surges. The gap between load testing and what users actually create at peak is a lesson I've carried into every production system since.
    • Node.js
    • PostgreSQL
    • Distributed Systems
  4. Jan 2021 — May 2021

    AI Engineering Intern · IoTIoT.in

    Built a real-time, device-agnostic gesture recognition framework using ML-driven motion tracking.

    • Developed motion tracking algorithms to improve input reliability across diverse hardware.
    • Built parallelized training pipelines to reduce latency without sacrificing classification accuracy.
    • Python
    • TensorFlow
    • Signal Processing
  5. Jun 2020 — Dec 2020

    Backend & ML Intern · MediaPro Innovations

    Applied ML-driven content filtering and behavior analysis to improve engagement on an ed-tech platform.

    • Built content filtering on user behavior patterns to surface more relevant learning materials.
    • Improved backend efficiency through caching and indexing, supporting a growing user base.
    • Python
    • Machine Learning
    • Backend

Projects

  1. Citation-Grounded Knowledge Agent

    A multi-step AI agent built around one constraint: every answer must trace back to source — enforced at the generation layer, not bolted on after.

    • Confidence-gated abstention: if retrieval isn't confident enough to support a grounded answer, the system returns nothing rather than extrapolating.
    • Multi-model tiering routes each query to the cheapest model that clears the quality bar — inference cost as a first-class concern from day one.
    • LangGraph
    • AWS Bedrock
    • pgvector
    • LangSmith
    • Python
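The cheapest-model-that-clears-the-bar idea can be sketched as follows (model names, costs, and the two callables are placeholders, not the project's actual interfaces):

```python
# Illustrative tiers as (model name, relative cost), cheapest first.
TIERS = [("small-model", 1), ("mid-model", 5), ("large-model", 25)]

def answer_with_cheapest(query, generate, passes_quality_bar):
    """Try models cheapest-first; return the first answer that clears the bar.

    `generate` and `passes_quality_bar` are assumed callables supplied by
    the surrounding system (e.g. an LLM client and a grounding evaluator).
    """
    for model, _cost in TIERS:
        candidate = generate(model, query)
        if passes_quality_bar(candidate):
            return model, candidate
    # Abstain: no tier produced an answer that clears the quality bar.
    return None, None
```

Note that the abstention case falls out naturally: when even the top tier fails the bar, the system returns nothing rather than extrapolating.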
  2. LLM Evaluation Framework — Production Rebuild

    A model can improve on every tracked metric, ship to users, and be worse. This is what I built after that happened — a rebuild of how evaluation actually works.

    • Pulled production outcome signals and ran correlation analysis against every benchmark task. Rebuilt scoring around what actually predicted user outcomes, with a mandatory human judgment gate at each release.
    • Adopted org-wide — not because it was mandated, but because every team building LLM features eventually hits the same benchmark-vs-production divergence. The framework solved it once.
    • Python
    • LLM Evaluation
    • Statistical Analysis
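A toy version of the benchmark-vs-outcome correlation check (the numbers are invented for illustration; the real framework pulled production signals):

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation; Spearman on ranks works the same way."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: benchmark score vs. a production outcome signal
# (e.g. task completion rate) across past model releases.
benchmark = [0.71, 0.75, 0.80, 0.84, 0.88]
outcome   = [0.62, 0.66, 0.64, 0.61, 0.60]

# A near-zero or negative value flags a benchmark that doesn't predict
# user outcomes and should be dropped from the scoring suite.
r = pearson(benchmark, outcome)  # negative here: the benchmark misleads
```

Running this per benchmark task is the mechanical core; the judgment call is deciding which production signals count as ground truth.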
  3. Aarogya — Privacy-Preserving Mental Health Risk Detection

    An NLP pipeline for early detection of depression and suicide-risk signals — privacy as an architectural constraint, not a compliance checkbox. The hardest problem was defining what a detection system that respects user dignity looks like at the architecture level.

    • Python
    • NLP
    • Privacy-Preserving ML
    • AWS
  4. Diverting Public Complaints Based on Textual Analysis

    Built a text-classification pipeline to route financial/public complaints to the right department using comparative ML experiments and evaluation-driven iteration.

    • Python
    • Machine Learning
    • Text Classification
  5. Text Summarization Using Sentiment Analysis

    Implemented sentiment-driven summarization on customer review data, integrating web-scraped inputs with supervised ML baselines.

    • Python
    • NLP
    • Machine Learning
  6. Survey Masters Website

    Designed and built a full-stack survey platform with authenticated workflows for creating surveys and collecting responses.

    • Web
    • JavaScript
    • CSS
    • Backend
  7. Deep Learning for Satellite Imaging

    Produced a deep-learning research report exploring satellite-imaging use cases, modeling approaches, and practical deployment constraints.

    • Deep Learning
    • Computer Vision
    • Research
  8. Question Paper Generator

    Built a web-based application concept for generating structured question papers with configurable templates and sections.

    • Web
    • Automation
    • Product Design

Stack & Tools

AI & Agent Systems

  • LangGraph
  • LangChain
  • AWS Bedrock
  • RAG Pipeline Design
  • LLM Evaluation & Outcome-Aligned Scoring
  • RLHF & Preference Data
  • Confidence Routing & Abstention
  • Structured Output Enforcement
  • Embedding Model Evaluation
  • LangSmith
  • Multi-Step Agent Orchestration

Backend & Infrastructure

  • Python
  • TypeScript
  • Node.js
  • Java
  • SQL
  • REST APIs
  • GraphQL
  • AWS Lambda
  • Step Functions
  • API Gateway
  • Kinesis
  • DynamoDB
  • S3
  • Aurora
  • Apache Kafka
  • Apache Flink
  • Terraform
  • Docker
  • GitHub Actions CI/CD
  • PostgreSQL
  • pgvector
  • Pinecone
  • Redis
  • MongoDB

Certifications

  • AWS Certified Machine Learning Specialty

Education

  1. Buffalo, NY

    University at Buffalo, SUNY

    M.S., Computer Science

  2. Ahmedabad, India

    Nirma University

    B.Tech, Computer Engineering