about

Ruize Ma

Data Science/Math student at New York University building LLM agents and data-driven applications. Interested in AI for decision-making, reasoning, and real-world impact.

WECHAT

Marvin041018

GITHUB

Sherlockmrz

UNIVERSITY

New York University

MAJOR

Mathematics and Data Science

IDENTITY

Ruize Lab

AI Systems Lab

CORE COURSE

LLM, Responsible DS, Machine Learning, CS, Data Structures, Discrete Mathematics, Calculus III

TOEFL

112/120

Programming Skills: Java, Python, SQL, R

DELIVERY

fullstack

Next.js frontend + FastAPI backend

working style

Product clarity for technical systems

planner

LLM Parser

query JSON

Converts the natural-language roster request into team, goal, top-k, recent-game window, and availability filters.

planner

GPT5Model.plan

plan JSON

Analyzes the stem and choices, extracts keywords, selects facts needed, and proposes biomedical tools.

classifier

Block 2 Router

segment

Uses known smoker status directly, or estimates smoker probability for unknown status.

planner

Task Router

task type

Classifies a POI query as address correction, duplicate detection, or merchant status checking.
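The LLM Parser stage described above turns a roster request into a query JSON with team, goal, top-k, recent-game window, and availability filters. A minimal sketch of how such output could be validated before downstream tools run is shown here. The field names (`top_k`, `recent_games`, `available_only`) and their defaults are assumptions made for illustration, not the project's actual schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RosterQuery:
    """Structured roster request. Field names and defaults are hypothetical."""
    team: str
    goal: str
    top_k: int = 5
    recent_games: int = 10
    available_only: bool = True


def parse_query_json(raw: str) -> RosterQuery:
    """Validate the JSON an LLM parser might emit and build a RosterQuery."""
    data = json.loads(raw)
    # Team and goal cannot be defaulted, so reject the query if either is absent.
    missing = [key for key in ("team", "goal") if not data.get(key)]
    if missing:
        raise ValueError(f"query JSON is missing required fields: {missing}")
    query = RosterQuery(
        team=str(data["team"]),
        goal=str(data["goal"]),
        top_k=int(data.get("top_k", 5)),
        recent_games=int(data.get("recent_games", 10)),
        available_only=bool(data.get("available_only", True)),
    )
    if query.top_k < 1 or query.recent_games < 1:
        raise ValueError("top_k and recent_games must be positive")
    return query
```

Rejecting incomplete queries at this point keeps a malformed parse from silently reaching the filtering and ranking tools.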

research experience

Research Experience

Clinical AI, reliability auditing, sports analytics, and predictive modeling projects with emphasis on transparent workflows and usable decision support.

01 | 07/2025 – 10/2025 | First Author, MIDI 2025

Plan–Act–Verify: An Agentic AI Question Answering and Reasoning System Evaluated on the CURE-Bench Challenge

First Author, MIDI 2025, LLM Agent, Biomedical AI

Served as team lead and first author on a clinical AI project developing an evidence-grounded LLM agent for therapeutic decision support, with a plan-act-verify architecture designed to improve reliability, reduce hallucination risk, and support auditable medical reasoning.

Integrated a curated biomedical tool stack spanning FDA labels, DailyMed, MedlinePlus, RxNav/RxNorm, OpenTargets, and PubChem, enabling the system to retrieve up-to-date drug evidence instead of relying solely on parametric model memory.
Built an evidence-grounding workflow that distilled retrieved information into concise, source-attributed “Tool Facts”, improving transparency and making final multiple-choice decisions more auditable for clinical use.
Achieved 0.69564 accuracy on the hidden test set of the NeurIPS CURE-Bench agentic reasoning challenge after fine-tuning and tool integration, demonstrating strong performance on therapeutic reasoning tasks.
Addressed medically important reasoning problems in drug decision-making and precision therapeutics, including evidence-grounded treatment selection, medication safety reasoning, dosing-related questions, contraindications, and monitoring logic in a benchmark designed for high-stakes clinical applications.
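The plan-act-verify flow and the source-attributed "Tool Facts" described above can be sketched roughly as follows. This is an illustrative skeleton under stated assumptions, not the system's implementation: the planner, tools, and answerer are plain callables standing in for LLM and API calls, and the verification rule (a valid choice letter plus at least one cited fact) is a simplification chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ToolFact:
    """One retrieved statement with its source (hypothetical structure)."""
    source: str
    text: str


Planner = Callable[[str], List[str]]
Tool = Callable[[str], List[ToolFact]]
Answerer = Callable[[str, Dict[str, str], List[ToolFact]], Tuple[str, List[int]]]


def plan_act_verify(
    question: str,
    choices: Dict[str, str],
    planner: Planner,
    tools: Dict[str, Tool],
    answerer: Answerer,
) -> dict:
    # Plan: the planner proposes which tools to call for this question.
    tool_names = planner(question)

    # Act: call each known tool and collect source-attributed facts.
    facts: List[ToolFact] = []
    for name in tool_names:
        tool = tools.get(name)
        if tool is not None:
            facts.extend(tool(question))

    # Answer: pick a choice letter and cite the indices of supporting facts.
    answer, cited = answerer(question, choices, facts)

    # Verify: the letter must exist and every citation must point at a real fact.
    cited_ok = bool(cited) and all(0 <= i < len(facts) for i in cited)
    verified = answer in choices and cited_ok
    evidence = [facts[i] for i in cited] if cited_ok else []
    return {"answer": answer, "verified": verified, "evidence": evidence}
```

Returning the cited evidence alongside the answer is what makes a multiple-choice decision traceable to its sources after the fact.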

Conference abstract PDF | MIDI 2025 virtual venue

02 | 10/2025 – 11/2025 | Second Author, Accepted at FLAIRS-39

Reliability Beyond Accuracy: Error Analysis of Agentic Tool-Augmented Reasoning in LLMs on CURE-Bench

FLAIRS-39, Biomedical AI, LLM Agent

Co-authored a reliability-focused study of agentic clinical reasoning systems, analyzing failure modes that remain hidden when evaluation relies on accuracy alone.

Audited 2,079 benchmark questions and 347,125 tool calls, uncovering large-scale operational weaknesses in tool-augmented LLM pipelines for therapeutic reasoning.
Identified 342,515 missing-parameter tool failures, accounting for more than 99% of all failures, and showed that tool integration can appear active while still failing to retrieve usable medical evidence at scale.
Discovered severe instability in repeated questions, with 154 of 155 duplicated stems receiving different answer letters, revealing a major reproducibility problem for healthcare AI systems.
Translated empirical findings into a practical deployment audit checklist for healthcare AI, covering tool contract validation, evidence logging, invariance testing, and option-formatting stress tests.
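Two of the checklist items above, tool contract validation and invariance testing, lend themselves to a short sketch. The functions below are hypothetical helpers written for illustration, not code from the study: one lists required parameters that a tool call leaves empty, and the other flags question stems whose repeated runs produced different answer letters. Normalizing stems by trimming and lowercasing is an assumption of this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def missing_parameters(required: Iterable[str], arguments: Dict[str, object]) -> List[str]:
    """Return required parameter names that are absent, None, or empty strings."""
    return [name for name in required if arguments.get(name) in (None, "")]


def unstable_stems(records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (stem, answer_letter) pairs and keep stems with more than one letter."""
    answers: Dict[str, Set[str]] = defaultdict(set)
    for stem, letter in records:
        answers[stem.strip().lower()].add(letter)
    return {stem: letters for stem, letters in answers.items() if len(letters) > 1}
```

Running the first check before every call and logging its result separates "the tool was invoked" from "the tool received a usable request", which is the distinction the missing-parameter finding turns on.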

Accepted papers page

03 | 02/2026 – 05/2025

Medical Insurance Cost Predictor

Machine Learning + Web App Project

Machine Learning, Web App, Predictive Modeling

Contributed to a medical insurance cost prediction system that combines neural networks, quantile regression, uncertainty-aware prediction, and web-based analytics.

Developed a PyTorch MLP regressor and quantile regression module for a medical insurance cost prediction system, enabling both point estimates and uncertainty-aware prediction intervals for annual charges.
Built the front-end analytics layer in Streamlit, creating an interactive model comparison page that combined machine learning outputs, uncertainty estimates, and accessible visual reporting for non-technical users.
Contributed to a broader machine learning pipeline comparing Linear Regression, Random Forest, XGBoost, MLP, and mixture-based modeling for a strongly bimodal healthcare cost distribution shaped by smoking status.
Helped build an interpretable health-finance application that lets users input demographic and health features and receive data-driven insurance charge estimates, connecting ML modeling with practical product design.
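Quantile regression, mentioned above as the basis for prediction intervals, is usually trained by minimizing the pinball (quantile) loss. The sketch below shows that loss and an empirical coverage check in plain Python. It is a generic illustration, not the project's PyTorch module, and the function names are chosen for this example.

```python
from typing import Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], quantile: float) -> float:
    """Mean pinball loss for predictions of a single quantile level."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be strictly between 0 and 1")
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by quantile, over-prediction by (1 - quantile).
        total += max(quantile * diff, (quantile - 1.0) * diff)
    return total / len(y_true)


def interval_coverage(
    y_true: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> float:
    """Fraction of targets that fall inside their [lower, upper] interval."""
    if not (len(y_true) == len(lower) == len(upper)) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)
```

Training one output at the 0.05 level and another at 0.95, for instance, would give a nominal 90% interval, and `interval_coverage` on held-out data shows how close the empirical rate comes to that target.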

04 | 02/2025 – 05/2025

LLM-Driven NBA Roster Upgrade Agent

LLM Agent, Sports Analytics, Predictive Modeling

Built an LLM-driven sports analytics agent that converts natural-language roster requests into structured constraints and automatically orchestrates modular analysis tools for team-need diagnosis, player filtering, ranking, and report generation.

Designed an interpretable team weakness diagnosis module using rolling-window statistics, league-wide Z-score normalization, and Ridge Regression coefficients to quantify which performance deficits matter most for winning.
Constructed standardized PlayerVectors from box-score and advanced statistics, then ranked candidates through a weighted Fit Score that matched player strengths to team-specific needs.
Automated the generation of explainable scouting reports and visual summaries, including team-need charts and player radar plots, making statistical recommendations easier to interpret for non-technical decision makers.
According to the project demo, a query about improving the Warriors’ interior defense filtered the pool to 236 players and ranked Anthony Davis as the top fit under the stated constraints.
Extended the system toward real-world front-office use by planning support for salary constraints, age filters, positional normalization, trade feasibility, and recommendation stability analysis.
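The diagnosis and ranking steps above (league-wide Z-scores, importance weights from regression coefficients, and a weighted Fit Score) can be illustrated with a small sketch. The exact formulas used in the project are not given here, so the ones below are assumptions: a need is a below-average team Z-score scaled by the absolute importance of that statistic, every statistic is treated as "higher is better", and the Fit Score is the need-weighted average of a player's Z-scores.

```python
import statistics
from typing import Dict, List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Standardize values against their own mean and population std deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def team_needs(team_z: Dict[str, float], importance: Dict[str, float]) -> Dict[str, float]:
    """Weight each below-average stat by how much it matters (assumed formula)."""
    return {
        stat: max(0.0, -z) * abs(importance.get(stat, 0.0))
        for stat, z in team_z.items()
    }


def fit_score(player_z: Dict[str, float], needs: Dict[str, float]) -> float:
    """Need-weighted average of a player's standardized stats (assumed formula)."""
    total_weight = sum(needs.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(weight * player_z.get(stat, 0.0) for stat, weight in needs.items())
    return weighted / total_weight
```

In this form a player only scores well by being strong where the team is weak, which is the property that makes the ranking explainable in a scouting report.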

05 | 03/2025 – 05/2025

Statistical Analysis and Predictive Modeling on Professor Ratings Using RateMyProfessor Dataset

Machine Learning, Predictive Modeling

Analyzed RateMyProfessor data for rating patterns, gender bias, and difficulty-rating relationships, and built models to predict professor ratings.

Cleaned and preprocessed a large-scale dataset from RateMyProfessor by setting an empirical threshold to exclude biased entries, improving the reliability of statistical comparisons across gender, experience level, and teaching mode.
Conducted hypothesis-driven analyses using Welch’s t-test, Mann-Whitney U test, and Pearson correlation to identify significant rating patterns, such as pro-male gender bias and the inverse correlation between course difficulty and professor rating, r ≈ –0.74.
Developed linear regression and Ridge regression models to predict professor ratings, achieving an R² of 0.81 and RMSE of 0.37, demonstrating that combining multiple features significantly improves prediction accuracy.
Built logistic regression classifiers with L2 regularization and class stratification to predict professor “hotness” / pepper status, increasing AUROC from 0.79 to 0.807 with multi-feature integration, showing improved model robustness.
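For reference, two of the statistics named above, Welch's t-test and the Pearson correlation, can be computed from first principles as sketched below. This is a generic standard-library illustration rather than the project's analysis code. It returns the test statistic and the Welch-Satterthwaite degrees of freedom, and leaves the p-value to a statistics library.

```python
import math
import statistics
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and degrees of freedom for two independent samples."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each sample needs at least two observations")
    # Per-sample variance of the mean, without assuming equal variances.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    if va + vb == 0:
        raise ValueError("both samples have zero variance")
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least two")
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

Welch's version is the appropriate choice when group sizes and variances differ, as they typically do when comparing rating groups of unequal size.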

06 | 06/2024 – 08/2024 | Team leader

Rental Price Estimation with Machine Learning

Machine Learning, Predictive Modeling

Built a rental price prediction project using machine learning models and housing datasets.

Built a house rent prediction model using multiple regression algorithms including Linear Regression and Random Forest, leveraging features such as location, size, and furnishing status to optimize model performance.
Processed and cleaned raw housing datasets with Pandas and NumPy, handling missing values and categorical variables through encoding and scaling, ensuring data quality for model training.
Visualized feature distributions and correlations using Seaborn and Matplotlib to uncover key drivers of rental prices, enhancing model interpretability and stakeholder insights.
Achieved a model accuracy of over 85% on test data through hyperparameter tuning and model evaluation using R² and RMSE metrics, demonstrating robust predictive capabilities.
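The two evaluation metrics named above, R² and RMSE, are defined as in the following sketch. It is a plain-Python illustration of the standard formulas, not the project's evaluation code.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the target."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - residual SS / total SS."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1.0 - ss_res / ss_tot
```

RMSE reports the typical error in rent units, while R² reports the share of variance explained, so the two are best read together.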

stack

Next.js App Router, TypeScript, Tailwind CSS, Framer Motion, lucide-react