Finding Long-COVID: Temporal Topic Modeling EHR Data

In this study, currently available as a preprint awaiting peer review, I apply an unsupervised learning technique known as Latent Dirichlet Allocation (LDA) to identify condition clusters across millions of patients in the N3C database. Such clusters are not guaranteed to be related to COVID-19, however. To identify those that are, we look at how patients are assigned to clusters before and after COVID-19 infection, integrating the probabilistic nature of LDA and supervised models (repeated-measures logistic regression) to predict how patients will migrate toward or away from a cluster post-infection, moderated by demographic factors such as age, sex, and wave of infection.

Searching and Summarizing PubMed with LLMs

What if an AI could perform a literature review for you? Here I illustrate using Large Language Models (LLMs) and related technologies for search and summarization. We use OpenAI's gpt-3.5-turbo (ChatGPT) and gpt-4 models, SentenceTransformers sentence embeddings, and the PubMed API to build a first-pass literature summarizer that 1) designs an effective search query given a user question, 2) retrieves and indexes PubMed abstracts with embeddings, 3) selects the top-matching abstracts for individual summarization in light of the original question, and 4) summarizes the summaries, including references.
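As a rough illustration of step 3, the sketch below ranks abstracts against a question by cosine similarity between embedding vectors. It is only a sketch under assumptions: the three-dimensional vectors are toy stand-ins for real sentence embeddings, cosine similarity is assumed as the matching score, and the names `cosine`, `top_matches`, and the `abstract_*` keys are hypothetical rather than taken from the post.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(question_vec, abstract_vecs, k=2):
    """Return the ids of the k abstracts most similar to the question vector."""
    ranked = sorted(
        abstract_vecs.items(),
        key=lambda item: cosine(question_vec, item[1]),
        reverse=True,
    )
    return [abstract_id for abstract_id, _ in ranked[:k]]


# Toy vectors standing in for embeddings of a question and three abstracts (assumed data).
question = [1.0, 0.0, 1.0]
abstracts = {
    "abstract_a": [0.9, 0.1, 0.8],
    "abstract_b": [0.0, 1.0, 0.0],
    "abstract_c": [0.5, 0.5, 0.5],
}

selected = top_matches(question, abstracts, k=2)
print(selected)
```

In a real pipeline the vectors would come from an embedding model, and the selected abstracts would then be passed on for individual summarization.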

National COVID Cohort Collaborative

Since starting a new position with the TISLab in early 2020, my primary effort has been as Training Coordinator with the National COVID Cohort Collaborative, or N3C. This truly amazing group of scientists, clinicians, and engineers has accomplished something unique in US healthcare history: sourcing electronic health records from hospital systems and medical centers nationwide (56 and counting) into a single unified database of clinical records related to the COVID-19 pandemic.

DataScience@OSU

DS@OSU was a project I had the privilege of co-leading (along with Dr. Robin Pappas) in collaboration with Oregon State UIT. Driven by a broadly recognized need to enable accessible and scalable infrastructure for data science (and related) instruction, we embarked on campus-wide needs assessments, a steering committee, technical and faculty advisory committees, fact-finding sessions with peer institutions, and eventually platform implementation. Based on the Zero to JupyterHub Kubernetes-based deployment pattern, DS@OSU provides a number of additional custom features:

TidyTensor - More Fun with Deep Learning

TidyTensor is an R package for inspecting and manipulating tensors (multidimensional arrays). It provides an improved print() function for summarizing structure, named tensors, conversion to data frames, and high-level manipulation functions. Designed to complement the excellent keras package, functionality is layered on top of base R types. TidyTensor was inspired by a workshop I taught in deep learning with R, and a desire to explain and explore tensors in a more intuitive way.

Bio

I am an Assistant Professor of Research with the TISLab, Department of Biomedical Informatics, and Center for Health AI at the University of Colorado Anschutz Medical Campus. In previous years I was a Senior Faculty Research Assistant with the Center for Genome Research and Biocomputing (CGRB, now CQLS) at Oregon State University. I earned my Ph.D./M.S. in 2012/2009 from the University of Notre Dame in the Department of Computer Science and Engineering, and my Bachelor's in Computer Science from Northern Michigan University in the beautiful Upper Peninsula of Michigan.

Homelab

Sometimes you just get tired of paying AWS & DigitalOcean. A quick sketch of my homelab, inspired by the good folks at r/homelab. The main goal of the project was to enable the quick creation of virtual machines (or containers), accessible by subdomain such as project1.mydomain.net, project2.mydomain.net, etc., all under a single dynamic IP address. I managed to make it work pretty well, with a little help from NGINX for reverse proxying, pfSense for local DNS and routing, ddclient for dynamic DNS, Proxmox for hypervisor management, and FreeNAS for shared storage.

Random Forests & Gradient Boosting

In an earlier post we considered machine learning “models” as functions producing predictions from data, and training models to be the production of these functions with higher-order functions. From there we built regression trees: models created recursively by determining 1) a splitting column (column 1 or 2 in this case), and 2) a good value in that column to split the dataset on. At each split, we find a column and value that produces two relatively homogeneous sets of y values (in the sense that the values in each y subset can be well-predicted from the column values).
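The split search described above can be sketched roughly as follows. This is a minimal illustration rather than the post's own code: it assumes homogeneity is scored as the sum of squared errors of each y subset around its own mean (a common choice, though the original post may use a different criterion), and the names `sse` and `best_split` and the toy data are hypothetical.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors of values around their mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, ys):
    """Try every column and every observed value as a threshold.

    Returns (column, threshold, total_sse) for the split whose two y subsets
    are most homogeneous under the assumed SSE criterion, or None if no
    split separates the data into two non-empty parts.
    """
    best = None
    for col in range(len(rows[0])):
        for threshold in sorted({row[col] for row in rows}):
            left = [y for row, y in zip(rows, ys) if row[col] <= threshold]
            right = [y for row, y in zip(rows, ys) if row[col] > threshold]
            if not left or not right:
                continue  # a split must produce two non-empty subsets
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (col, threshold, score)
    return best


# Toy dataset (assumed): two feature columns, where y depends mostly on column 0.
rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
ys = [10.0, 11.0, 30.0, 31.0]

split = best_split(rows, ys)
print(split)
```

A full regression tree would apply `best_split` recursively to the left and right subsets until some stopping rule (such as a minimum subset size) is reached.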
