portfolio-data-science-narrative

Statistics & Programming Experience

Herman Autore

LinkedIn | GitHub | Personal Website

To Whom It May Concern:

Please consider the following detailed overviews of some of my experiences. This document is meant to complement my CV by highlighting technical achievements accomplished outside of academic or work settings. I hope these vignettes convey not only my technical abilities but also my intellectual interests, and maybe even my personality. Ideally, you can use this information to decide whether I would be a good fit for your organization.

Full-Stack Data Science

I scraped a premium dating website and collected all the users’ information to build a natural-language recommendation system. Using TensorFlow, I computed the cosine similarity and Euclidean distance between the embeddings representing three aspects of each user’s profile: their self-description, their interests, and their desired partner. I then wrapped the result in a Flask web application with a RESTful API and deployed it on Heroku.
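As a rough illustration of the similarity step (not the project's actual code), the sketch below computes cosine similarity and Euclidean distance between embedding vectors in plain Python and ranks candidates by cosine similarity. The function names, the profile ids, and the three-dimensional vectors are made up for the example; real text embeddings would come from a model and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_matches(query, candidates):
    """Return candidate ids sorted by cosine similarity to the query, best first."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Made-up toy "embeddings" for illustration only.
profiles = {"user_a": [0.9, 0.1, 0.0], "user_b": [0.1, 0.8, 0.3]}
query = [1.0, 0.0, 0.0]
print(rank_matches(query, profiles))  # ['user_a', 'user_b']
```

In a setup like the one described, the same comparison could be run separately for each of the three profile aspects. How the per-aspect scores are combined is a design choice that this sketch leaves open.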

Random Forests and Decision Trees

I used decision trees to build a human-interpretable model that lets rural patients predict their diabetes status. I also compared the decision trees with random forests.
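Part of what makes a decision tree interpretable is that each node is a single threshold rule on one variable. The sketch below shows how one such rule can be chosen by minimizing weighted Gini impurity, a common splitting criterion. It is a simplified illustration, not the model from the project: the function names and the toy "glucose" readings and labels are invented for the example.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of positive labels
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split on one numeric feature."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    # Try the midpoint between each pair of adjacent distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Made-up toy data: a hypothetical glucose reading and a 0/1 diabetes label.
glucose = [90, 100, 110, 150, 160, 170]
diabetic = [0, 0, 0, 1, 1, 1]
print(best_threshold(glucose, diabetic))  # (130.0, 0.0)
```

A full tree applies this search recursively to each resulting subset. A random forest averages many such trees grown on resampled data, which usually improves accuracy at the cost of the single readable rule set.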

Data Mining in Python

I taught myself Python through online resources, which included learning to use a package to retrieve online material. Motivated by this new knowledge and my desire to pursue a graduate degree, I set out to scrape U.S. News & World Report’s rankings of Statistics, Biology, and Computer Science graduate programs. I then uploaded the resulting spreadsheet to Google Maps, where the map is available for viewing.

Data Analysis and Statistical Modeling for Social Causes

I have experience with multiple linear regression. As a capstone project for a class in applied statistics, I collaborated with a partner to acquire data and identify interesting variables. One of these variables was chosen as the outcome variable to be regressed on the explanatory variables. We acquired our data from the Integrated Postsecondary Education Data System (IPEDS). We started with 10 variables and narrowed them down to 9 by merging two highly correlated variables (the 25th and 75th percentiles of ACT scores). The initial regression showed that three variables were not significant (Percent of Students Admitted, Institution Revenue, and For-Profit Categorization), but we kept them until after the transformations were done to see whether the relationships changed.

Residual diagnostics showed whether the variables needed transformation. Where necessary, we applied power transformations to increase the explanatory power of the variables, but only when the result would still produce an interpretable model. A second tentative regression showed that two variables were not significant (Institution Revenue and For-Profit Categorization). Our $R^2$ increased from $0.8724$ to $0.8761$. This was followed by outlier and influence tests. Cases were eliminated only after careful evaluation, since we wanted to preserve as much data as possible to find the true relationship between the variables. After outlier removal, our $R^2$ increased from $0.8761$ to $0.9094$, and Instruction Expenses per Full-Time Equivalent student was shown to be not significant, leading to its removal from the model. The final model had an $R^2$ of $0.9094$, and it was the same model given by both forward selection and backward elimination.
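For reference, the coefficient of determination $R^2$ reported above is the usual one for least-squares regression. The notation here is introduced for illustration and does not come from the original analysis: $y_i$ is the observed outcome for institution $i$, $\hat{y}_i$ is the fitted value, $\bar{y}$ is the sample mean of the outcome, and $n$ is the number of cases.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Because the denominator depends on which cases are included, an $R^2$ computed after outlier removal describes a slightly different dataset than the one before removal.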

Experiment Design

As a chemist, it was my duty to perform accurate and reliable experiments, which required a lot of critical thinking. Since our samples came from different sources, I had to make sure that the tests on them were comparable, so I developed ways to verify that proper controls were in place. Concretely, I noticed that our calibration standards were not filtered like the test samples. After filtering the calibration standards, I found that they gave different results. From that point on, all samples were filtered using a new filter, which gave more accurate readings. Later, when I took graduate classes in statistics, I learned new ways to design more complex experiments that allow accurate comparisons between samples. Such designs are a more mature version of A/B testing.

Statistics Communication

One of my soft skills is being a good communicator. I may not have walked a mile in everyone’s shoes, but I am aware that people’s perspectives differ, and I’m willing to walk in those shoes so that I know how to frame a question or answer for someone, be they a client, coworker, or superior. One professor complimented me on how elaborate my visualizations were.

High-Dimensional Multinomial Classification and Unsupervised Learning

One of my favorite subjects in graduate school was high-dimensional statistics and the concept of sparsity. The class included a project in which I independently analyzed data from beginning to end. The dataset had 801 cases, 5 classes, and 20,541 variables. Using PCA and $\ell_1$-penalized multinomial regression, I discovered that the dataset was highly sparse. I achieved accuracy of 90% or higher on test sets as small as $1/10$ the size of the whole dataset.
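To make the role of the penalty concrete, here is the standard form of the $\ell_1$-penalized multinomial objective. The exact parameterization used in the project is not stated, so this is a textbook formulation with notation introduced here: $x_i$ is the vector of $p = 20{,}541$ variables for case $i$, $y_i \in \{1, \dots, K\}$ is its class with $K = 5$, $n = 801$, and $\lambda \ge 0$ is a tuning parameter.

$$\min_{\beta_{0,1},\dots,\beta_{0,K},\;\beta_1,\dots,\beta_K} \; -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\exp\left(\beta_{0,y_i} + x_i^\top \beta_{y_i}\right)}{\sum_{k=1}^{K} \exp\left(\beta_{0,k} + x_i^\top \beta_k\right)} \; + \; \lambda \sum_{k=1}^{K} \lVert \beta_k \rVert_1$$

Larger values of $\lambda$ drive more coefficients exactly to zero, which is how a fitted model with few nonzero coefficients can indicate that only a small subset of the variables is needed to separate the classes.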

Natural Language Processing

On my own initiative as an instructor of statistics, I decided to analyze students’ prose answers to three questions: 1) What is your favorite thing about our school or town? 2) Why are you taking this class? 3) What do you hope to learn? This was motivated by an observed bimodal distribution in grades on a scale of 0 to 100, with one mode in the range of 50 to 60 and the other in the range of 80 to 90. A quick-and-dirty analysis using term frequency-inverse document frequency (TF-IDF) showed no significant difference between the frequency of words used by high- and low-performing students. The analysis was done using both multinomial and linear regression.

Before withdrawing from my PhD program, I was working on a project with a professor to analyze text information. This led to the following developments:

  1. An object-oriented approach to detecting words associated with numbers.
  2. A visualization method for matches of words inside a large text.
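Returning to the analysis of student answers: the sketch below shows one common way to compute the TF-IDF weights it relied on, using only the Python standard library. It is an illustration under assumptions, not the original analysis code. The tokenizer, the function names, and the two example sentences are invented, and this version uses the plain $\log(N / \text{df})$ form of inverse document frequency without smoothing.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase a text and split it into words, stripping surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Made-up example answers for illustration only.
answers = ["I need this class", "I like this town"]
weights = tf_idf(answers)
```

Words that appear in every answer (here "i" and "this") get a weight of zero, so only distinguishing words carry signal. The resulting per-student weights can then serve as explanatory variables in a regression on grades.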

Other Statistical Methods

I’ve taken a semester-long course on time series that was taught entirely in the SAS programming language. I’ve also taken a machine learning class that introduced me to the theory and practice of support vector machines (SVMs) and neural networks.