completed fraud-detection

MedicaidGuard

A production machine learning system that analyzes 227 million Medicaid records to automatically detect and flag complex healthcare fraud.

Started: January 1, 2026

Completed: March 1, 2026

Evidence & Verification

Metrics and claims on this page are tied to the linked artifacts below (repository docs, experiment outputs, and deployment pages when available).

Source Repository

Records Processed: 227M+

Technologies

Python, XGBoost, Autoencoders, Isolation Forest

Overview

MedicaidGuard is a machine learning pipeline that ingests, cleans, and analyzes the complete 227-million-record U.S. HHS Medicaid provider dataset to detect anomalous billing behavior and identify potential healthcare fraud.

Instead of relying on heuristic rules alone, MedicaidGuard combines statistical outlier detection, Isolation Forests, deep autoencoders, and XGBoost probabilistic classifiers in an ensemble that assigns a transparent, SHAP-explainable risk score to each of the 617,503 registered providers.
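One way to picture how several detectors can feed a single score is rank-based fusion. This is a minimal sketch and not the project's actual fusion logic: the detector names, the equal weights, and the percentile-rank normalization are all assumptions made for illustration.

```python
from bisect import bisect_right


def percentile_ranks(values):
    """Map each value to its percentile rank in (0, 100] within the list."""
    ordered = sorted(values)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, v) / n for v in values]


def fuse_scores(component_scores, weights):
    """Weighted average of per-detector percentile ranks, giving a 0-100 score."""
    ranked = {name: percentile_ranks(s) for name, s in component_scores.items()}
    total_weight = sum(weights[name] for name in ranked)
    n_providers = len(next(iter(ranked.values())))
    return [
        sum(weights[name] * ranked[name][i] for name in ranked) / total_weight
        for i in range(n_providers)
    ]


# Hypothetical raw detector outputs for four providers (illustrative values only).
component_scores = {
    "statistical_outlier": [0.2, 3.1, 0.5, 1.0],    # e.g. a z-score
    "isolation_forest": [0.40, 0.90, 0.45, 0.50],   # anomaly score
    "autoencoder_error": [0.01, 0.30, 0.02, 0.05],  # reconstruction error
    "xgboost_proba": [0.05, 0.95, 0.10, 0.20],      # predicted probability
}
# Equal weights are an assumption; a real system would tune or learn them.
weights = {name: 1.0 for name in component_scores}

risk = fuse_scores(component_scores, weights)
print(risk)
```

Converting each detector's output to a percentile rank first puts z-scores, reconstruction errors, and probabilities on a common scale before they are averaged.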

🚀 Features

Massive Scale Integration: Ingests the 15 GB+ HHS dataset with chunked streaming and joins it with the NPI registry and the OIG LEIE exclusion list.
38-Dimensional Feature Matrix: Extracts financial features including maximum cost boundaries, geographic code concentration, and month-over-month acceleration vectors.
Hybrid Ensemble Architecture: Fuses deterministic rules with unsupervised (autoencoder) and supervised (XGBoost) learners to produce a single 0-100 risk score.
SHAP Explainability: Every flagged provider comes with a SHAP waterfall trace showing how much each variable contributed to its risk score.
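The chunked-streaming idea from the first feature can be sketched with the standard library alone. The column names (`npi`, `paid_amount`), the tiny in-memory CSV, and the three aggregate features are hypothetical stand-ins. The real pipeline's schema and its 38 features are not shown on this page.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows so the whole file is never loaded."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def aggregate_by_provider(csv_file, excluded_npis, chunk_size=2):
    """Stream rows in chunks, keep running per-provider totals, then flag exclusions."""
    totals = defaultdict(lambda: {"total_paid": 0.0, "max_paid": 0.0, "claims": 0})
    reader = csv.DictReader(csv_file)
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            paid = float(row["paid_amount"])
            stats = totals[row["npi"]]
            stats["total_paid"] += paid
            stats["max_paid"] = max(stats["max_paid"], paid)
            stats["claims"] += 1
    # Join step: mark providers that appear on the (hypothetical) exclusion set.
    return {
        npi: {**stats, "excluded": npi in excluded_npis}
        for npi, stats in totals.items()
    }


# Hypothetical five-row sample standing in for the multi-gigabyte source file.
sample = io.StringIO(
    "npi,paid_amount\n"
    "1111111111,100.0\n"
    "2222222222,50.0\n"
    "1111111111,400.0\n"
    "2222222222,25.0\n"
    "1111111111,10.0\n"
)
features = aggregate_by_provider(sample, excluded_npis={"1111111111"})
print(features["1111111111"])
```

Only the running totals are held in memory, so the memory footprint grows with the number of providers rather than the number of records.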

🧠 System Architecture

The platform calibrates risk scores so that providers on the OIG LEIE exclusion list fall within the top 5% of the risk distribution. The pipeline runs the following steps in order:

Fetching data from HHS databases
Merging with NPI and OIG data
Generating the feature matrix
Training the Ensemble ML models
Generating risk scores and SHAP bounds
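The calibration goal described above, excluded providers landing in the top 5% of scores, can be checked and enforced in a few lines. This sketch shows one possible reading, in which excluded providers' scores are floored at the top-5% cutoff. The function names and the flooring rule are assumptions, and the project's actual calibration method may differ.

```python
import math


def top_quantile_cutoff(scores, top_percent=5):
    """Smallest score that still belongs to the top `top_percent` of providers."""
    ordered = sorted(scores, reverse=True)
    k = max(1, math.ceil(len(ordered) * top_percent / 100))
    return ordered[k - 1]


def floor_excluded(scores, excluded_flags, top_percent=5):
    """Raise each excluded provider's score to at least the top-quantile cutoff."""
    cutoff = top_quantile_cutoff(scores, top_percent)
    return [max(s, cutoff) if ex else s for s, ex in zip(scores, excluded_flags)]


def excluded_recall(scores, excluded_flags, top_percent=5):
    """Fraction of excluded providers whose score is at or above the cutoff."""
    cutoff = top_quantile_cutoff(scores, top_percent)
    hits = sum(1 for s, ex in zip(scores, excluded_flags) if ex and s >= cutoff)
    total = sum(excluded_flags)
    return hits / total if total else 0.0


# Hypothetical data: 40 providers scored 1..40, two of them on the exclusion list.
scores = [float(i) for i in range(1, 41)]
excluded = [False] * 40
excluded[9] = True   # provider with score 10.0
excluded[39] = True  # provider with score 40.0

before = excluded_recall(scores, excluded)
calibrated = floor_excluded(scores, excluded)
after = excluded_recall(calibrated, excluded)
print(before, after)
```

Flooring changes only the excluded providers' scores. Measuring recall before the floor is applied is still useful, since it shows how well the models rank known exclusions on their own.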
