Research-oriented data science

Statistics, machine learning, and tools that make modeling assumptions visible.

I am Wei Dai, a statistician and data scientist with experience in feature selection, mixture models, stochastic simulation optimization, and applied work on electronic health records. This site brings together selected projects, lightweight research notes, and interactive explainers built to clarify difficult ideas.

Selected work

Projects anchored in methodological clarity and practical data work.

The project list is intentionally compact: one public package, one applied data artifact, and one durable archive for reusable research materials.

Public repo

subsampwinner

Feature selection tooling built around the Subsampling Winner Algorithm, with an emphasis on stability under repeated resampling.

feature selection · R package · statistical learning
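The package's own algorithm is not reproduced here; as a rough illustration of what "stability under repeated resampling" means, the sketch below tracks how often each feature survives a simple screen across random subsamples. The screen (absolute correlation), the subsample fraction, and the toy data are all assumptions for illustration, not the Subsampling Winner Algorithm itself.

```python
import numpy as np

# Illustrative sketch only: NOT the Subsampling Winner Algorithm.
# It demonstrates the general idea of selection stability: a feature
# that keeps winning across many subsamples is a stable selection.

def selection_frequency(X, y, n_subsamples=200, frac=0.5, top_k=2, seed=0):
    """Fraction of subsamples in which each feature ranks in the
    top_k by absolute correlation with the response."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    m = int(frac * n)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=m, replace=False)
        Xs, ys = X[idx], y[idx]
        corr = np.abs([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(p)])
        counts[np.argsort(corr)[-top_k:]] += 1
    return counts / n_subsamples

# Toy data: features 0 and 1 drive y, feature 2 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=200)
freq = selection_frequency(X, y)
```

Signal features should be selected in nearly every subsample, while the noise feature's selection frequency stays near zero; that gap is the stability signal.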
Site-hosted material

Heart Transplant Data Showcase

A compact walkthrough of data preprocessing decisions for heart-transplant studies, focused on cohort construction and modeling-ready tables.

EHR · data curation · applied statistics
Public repo

References and Slide Archive

A public repository for reading notes, slide materials, and small teaching artifacts across statistics and machine learning.

reading notes · slide decks · research workflow

Research map

Three recurring themes structure the work.

I tend to return to the same questions across different projects: how assumptions are encoded, how uncertainty is surfaced, and how workflows stay usable once they leave the whiteboard.

01

Model assumptions

Kernel choice, resampling behavior, and selection rules all encode preferences that should be inspectable rather than implicit.

02

Applied data constraints

EHR and observational settings demand careful preprocessing, explicit limitations, and tools that respect imperfect measurement.

03

Legible software

Explanatory software should do more than execute an algorithm. It should help others see what the algorithm is assuming and where it can fail.

Interactive lab

Small demos for building intuition, starting with Gaussian processes.

The lab is where I turn abstract modeling ideas into compact visual tools. The first two demos focus on kernels and Gaussian-process posteriors because they reward visual exploration and benefit from direct manipulation.

Interactive explainer

Kernel Playground

Compare how different kernels change covariance structure and prior sample paths before fitting any data.

Kernels are often introduced abstractly, but the modeling consequences become clearer once you can see the covariance matrix and sampled functions change together.
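The idea the playground visualizes can be sketched in a few lines: a kernel determines a covariance matrix, and prior sample paths are draws from the multivariate normal it defines. The specific kernels and lengthscale below are illustrative assumptions, not the demo's actual settings.

```python
import numpy as np

# Sketch of the kernel -> covariance -> sample-path pipeline.
# Kernel choices and the lengthscale are illustrative assumptions.

def rbf(x1, x2, lengthscale=0.5):
    """Squared-exponential kernel: smooth sample paths."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def matern12(x1, x2, lengthscale=0.5):
    """Matern-1/2 (exponential) kernel: rough, jagged sample paths."""
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-d / lengthscale)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)

for kernel in (rbf, matern12):
    K = kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for stability
    samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
    # Each row of `samples` is one prior draw. Plotting the draws
    # next to an image of K shows how the kernel shapes both at once.
```

Swapping the kernel changes both the covariance matrix's banded structure and the roughness of the sampled functions, which is exactly the pairing the playground makes visible.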

Interactive explainer

GP Posterior Explorer

Place noisy observations directly on the plot and watch the posterior mean and uncertainty band update in real time.

Posterior intuition is easiest to build when the model reacts immediately to new observations and hyperparameter changes.
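Under the hood, each update the explorer shows is the standard Gaussian-process regression posterior. The sketch below computes it with an RBF kernel and Gaussian noise; the kernel, lengthscale, noise level, and example observations are illustrative assumptions, not the demo's code.

```python
import numpy as np

# Minimal GP regression posterior, assuming an RBF kernel and
# Gaussian observation noise. Hyperparameter values are illustrative.

def rbf(x1, x2, lengthscale=0.3):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=0.1):
    """Posterior mean and pointwise variance at x_test."""
    K = rbf(x_train, x_train) + noise**2 * np.eye(len(x_train))
    K_s = rbf(x_train, x_test)
    K_ss = rbf(x_test, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                       # posterior mean
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)  # posterior variance
    return mean, var

# Two noisy observations; the uncertainty band collapses near them
# and widens back toward the prior away from the data.
x_tr = np.array([0.2, 0.8])
y_tr = np.sin(2 * np.pi * x_tr)
x_te = np.linspace(0, 1, 50)
mean, var = gp_posterior(x_tr, y_tr, x_te)
```

Because the update is a single linear solve, recomputing the band after every click or hyperparameter change is cheap enough to feel instantaneous, which is what makes the real-time interaction possible.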

Open to new collaborations

Looking for work that connects statistical rigor with practical systems.