Research Experience
Clinical AI, reliability auditing, sports analytics, and predictive modeling projects with emphasis on transparent workflows and usable decision support.
01 · 07/2025 – 10/2025 · First Author, MIDI 2025
Plan–Act–Verify: An Agentic AI Question Answering and Reasoning System Evaluated on the CURE-Bench Challenge
First Author · MIDI 2025 · LLM Agent · Biomedical AI
Served as team lead and first author on a clinical AI project developing an evidence-grounded LLM agent for therapeutic decision support, with a plan-act-verify architecture designed to improve reliability, reduce hallucination risk, and support auditable medical reasoning.
- Integrated a curated biomedical tool stack spanning FDA labels, DailyMed, MedlinePlus, RxNav/RxNorm, OpenTargets, and PubChem, enabling the system to retrieve up-to-date drug evidence instead of relying solely on parametric model memory.
- Built an evidence-grounding workflow that distilled retrieved information into concise, source-attributed “Tool Facts”, improving transparency and making final multiple-choice decisions more auditable for clinical use.
- Achieved 0.69564 accuracy on the hidden test set of the NeurIPS CURE-Bench agentic reasoning challenge after fine-tuning and tool integration, demonstrating strong performance on therapeutic reasoning tasks.
- Addressed medically important reasoning problems in drug decision-making and precision therapeutics, including evidence-grounded treatment selection, medication safety reasoning, dosing-related questions, contraindications, and monitoring logic in a benchmark designed for high-stakes clinical applications.
02 · 10/2025 – 11/2025 · Second Author, Accepted at FLAIRS-39
Reliability Beyond Accuracy: Error Analysis of Agentic Tool-Augmented Reasoning in LLMs on CURE-Bench
FLAIRS-39 · Biomedical AI · LLM Agent
Co-authored a reliability-focused study of agentic clinical reasoning systems, analyzing failure modes that remain hidden when evaluation relies on accuracy alone.
- Audited 2,079 benchmark questions and 347,125 tool calls, uncovering large-scale operational weaknesses in tool-augmented LLM pipelines for therapeutic reasoning.
- Identified 342,515 missing-parameter tool failures, accounting for more than 99% of all failures, and showed that tool integration can appear active while still failing to retrieve usable medical evidence at scale.
- Discovered severe instability in repeated questions, with 154 of 155 duplicated stems receiving different answer letters, revealing a major reproducibility problem for healthcare AI systems.
- Translated empirical findings into a practical deployment audit checklist for healthcare AI, covering tool contract validation, evidence logging, invariance testing, and option-formatting stress tests.
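Two of the checklist items above, tool contract validation and invariance testing, can be illustrated with a minimal sketch. The function names, the argument format, and the `(stem, answer)` record shape are hypothetical and chosen only for illustration; they are not taken from the paper's code.

```python
from collections import defaultdict


def missing_params(call_args, required):
    """Return the required parameter names that are absent or None in a tool call."""
    return [name for name in required if call_args.get(name) is None]


def unstable_stems(records):
    """Return question stems that received more than one distinct answer letter.

    `records` is an iterable of (stem, answer_letter) pairs (assumed format).
    """
    answers = defaultdict(set)
    for stem, answer in records:
        answers[stem].add(answer)
    return sorted(stem for stem, seen in answers.items() if len(seen) > 1)
```

Running `missing_params` over logged calls before execution flags calls that would fail the tool contract, and `unstable_stems` flags duplicated questions whose answers disagree.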
03 · 02/2026 – 05/2025
Medical Insurance Cost Predictor
Machine Learning + Web App Project
Machine Learning · Web App · Predictive Modeling
Contributed to a medical insurance cost prediction system combining neural networks, quantile regression, uncertainty-aware prediction, and web-based analytics.
- Developed a PyTorch MLP regressor and a quantile regression module, enabling both point estimates and uncertainty-aware prediction intervals for annual charges.
- Built the front-end analytics layer in Streamlit, creating an interactive model comparison page that combined machine learning outputs, uncertainty estimates, and accessible visual reporting for non-technical users.
- Contributed to a broader machine learning pipeline comparing Linear Regression, Random Forest, XGBoost, MLP, and mixture-based modeling for a strongly bimodal healthcare cost distribution shaped by smoking status.
- Helped build an interpretable health-finance application that lets users input demographic and health features and receive data-driven insurance charge estimates, connecting ML modeling with practical product design.
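The quantile regression mentioned above is commonly trained with the pinball (quantile) loss. The sketch below shows that loss in plain Python as an illustration of the idea; it is an assumption about the general technique, not the project's PyTorch implementation, and the function name is hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1).

    Under-predictions are weighted by q and over-predictions by (1 - q),
    so minimizing the loss pushes predictions toward the q-th quantile.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += max(q * diff, (q - 1) * diff)
    return total / len(y_true)
```

Fitting one model at a low quantile (for example 0.05) and another at a high quantile (for example 0.95) yields a prediction interval around the point estimate.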
04 · 02/2025 – 05/2025
LLM-Driven NBA Roster Upgrade Agent
LLM Agent · Sports Analytics · Predictive Modeling
Built an LLM-driven sports analytics agent that converts natural-language roster requests into structured constraints and automatically orchestrates modular analysis tools for team-need diagnosis, player filtering, ranking, and report generation.
- Designed an interpretable team weakness diagnosis module using rolling-window statistics, league-wide Z-score normalization, and Ridge Regression coefficients to quantify which performance deficits matter most for winning.
- Constructed standardized PlayerVectors from box-score and advanced statistics, then ranked candidates through a weighted Fit Score that matched player strengths to team-specific needs.
- Automated the generation of explainable scouting reports and visual summaries, including team-need charts and player radar plots, making statistical recommendations easier to interpret for non-technical decision makers.
- According to the project demo, a query about improving the Warriors’ interior defense filtered the pool to 236 players and ranked Anthony Davis as the top fit under the stated constraints.
- Extended the system toward real-world front-office use by planning support for salary constraints, age filters, positional normalization, trade feasibility, and recommendation stability analysis.
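The league-wide Z-score normalization and weighted Fit Score described above can be sketched as follows. This is a simplified illustration: the stat keys, the weights, and the plain weighted-sum form of the score are assumptions, and in the project the weights were informed by Ridge Regression coefficients rather than set by hand.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of league-wide stat values to mean 0 and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def fit_score(player_vector, need_weights):
    """Weighted sum of a player's standardized stats over the team's need weights."""
    return sum(w * player_vector.get(stat, 0.0) for stat, w in need_weights.items())
```

For example, `fit_score({"blk": 2.0, "reb": 1.0}, {"blk": 0.7, "reb": 0.3})` rewards a player whose standardized strengths line up with the team's most heavily weighted needs.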
05 · 03/2025 – 05/2025
Statistical Analysis and Predictive Modeling on Professor Ratings Using RateMyProfessor Dataset
Machine Learning · Predictive Modeling
Built a statistical and predictive modeling project using RateMyProfessor data to analyze rating patterns, gender bias, difficulty-rating relationships, and prediction models.
- Cleaned and preprocessed a large-scale dataset from RateMyProfessor by setting an empirical threshold to exclude biased entries, improving the reliability of statistical comparisons across gender, experience level, and teaching mode.
- Conducted hypothesis-driven analyses using Welch’s t-test, the Mann-Whitney U test, and Pearson correlation to identify significant rating patterns, such as a pro-male gender bias and an inverse correlation between course difficulty and professor rating (r ≈ -0.74).
- Developed linear regression and Ridge regression models to predict professor ratings, achieving an R² of 0.81 and RMSE of 0.37, demonstrating that combining multiple features significantly improves prediction accuracy.
- Built logistic regression classifiers with L2 regularization and class stratification to predict professor “hotness” / pepper status, increasing AUROC from 0.79 to 0.807 with multi-feature integration, showing improved model robustness.
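Welch’s t-test, used in the analyses above, compares two group means without assuming equal variances. A minimal standard-library sketch of the statistic and its Welch-Satterthwaite degrees of freedom is shown below; the function name is hypothetical, and the p-value computation is left to a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```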
06 · 06/2024 – 08/2024 · Team leader
Rental Price Estimation with Machine Learning
Machine Learning · Predictive Modeling
Built a rental price prediction project using machine learning models and housing datasets.
- Built a house rent prediction model using multiple regression algorithms including Linear Regression and Random Forest, leveraging features such as location, size, and furnishing status to optimize model performance.
- Processed and cleaned raw housing datasets with Pandas and NumPy, handling missing values and categorical variables through encoding and scaling, ensuring data quality for model training.
- Visualized feature distributions and correlations using Seaborn and Matplotlib to uncover key drivers of rental prices, enhancing model interpretability and stakeholder insights.
- Achieved a model accuracy of over 85% on test data through hyperparameter tuning and model evaluation using R² and RMSE metrics, demonstrating robust predictive capabilities.