PROJECTS

LSTM model predicting AMZN's price, based on a 2-year rolling window

1. AMZN Price Prediction

- First, I visualized the dataset, then compared different models for price prediction.

- ARIMA: The autoregressive integrated moving average (ARIMA) model produced the best fit, using previous data to predict the next timestep.

- LSTMs (a type of RNN) achieved results similar to the statistical ARIMA model, using 2 years of training data to predict the next timestep.

- Prophet is Facebook’s time series forecasting tool. I used it without a train/test split: it is fit on all the data and predicts the next X days.
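To illustrate the "use previous data to predict the next timestep" idea behind the ARIMA and LSTM comparisons, here is a minimal walk-forward sketch. It is not the project's code: the function names, the naive persistence baseline, and the toy numbers are all assumptions made for illustration. Any model can be plugged in through `predict_next`.

```python
from typing import Callable, List, Sequence


def walk_forward(
    series: Sequence[float],
    window: int,
    predict_next: Callable[[Sequence[float]], float],
) -> List[float]:
    """Return one-step-ahead predictions for series[window:].

    Each prediction sees only the `window` observations before it.
    """
    if window < 1 or window >= len(series):
        raise ValueError("window must be between 1 and len(series) - 1")
    predictions = []
    for t in range(window, len(series)):
        history = series[t - window:t]  # rolling window, no look-ahead
        predictions.append(predict_next(history))
    return predictions


def persistence(history: Sequence[float]) -> float:
    """Naive baseline (hypothetical stand-in for ARIMA or an LSTM):
    the next value is assumed to equal the last observed value."""
    return history[-1]


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(predicted)


# Toy numbers for illustration only, not AMZN prices.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]
preds = walk_forward(prices, window=3, predict_next=persistence)
mae = mean_absolute_error(prices[3:], preds)
```

Scoring every model on the same walk-forward predictions keeps the comparison fair, since no model gets to see the value it is asked to predict.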

Tools used: Pandas, seaborn, numpy, matplotlib, tensorflow, keras, lag_plot, statsmodels, sklearn

Check out the code on Github

A diagram of Facebook's Wav2Vec2

2. ZH-CN to US-EN Speech-to-Text translation

- Using Facebook's Wav2Vec2 and bi-directional recurrent neural networks, I created a pipeline that would translate Chinese speech into English text. Thanks to HuggingFace and their team, I was able to use their pretrained Wav2Vec2 model that was finetuned on Zh-CN to dig deeper into this cutting-edge architecture.

- The Wav2Vec2 portion of the model is further trained on the CommonVoice dataset, using character-level embeddings for extra accuracy. Wav2Vec2's self-supervised framework adapted quickly to the new speech domain and generalized well to my voice. Next, the Chinese text is fed into a Bi-RNN, which was trained on 17+ hours of parallel English-Chinese TED talks. The final BLEU score of the pipeline was 11.5.
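For readers unfamiliar with BLEU, the sketch below shows roughly how the metric is computed for a single sentence pair. It is a simplified illustration under my own assumptions (one reference, no smoothing, whitespace tokens), not the evaluation script used for the 11.5 score, which would normally be computed over a whole test corpus.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Single-reference, unsmoothed BLEU on a 0-100 scale."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
score = bleu(candidate, reference, max_n=2)  # unigrams and bigrams only
```

Because the score multiplies precisions across n-gram sizes, even translations that carry the right meaning can score low when their wording differs from the reference.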

Tools used: ASR, PyTorch, Huggingface, Transformers, OpenNMT, Wav2Vec 2.0

Check out the code on Github

A workflow diagram from start to finish

3. Covid-19 Vaccine Tweet Sentiment Analysis

- My team and I created a sentiment analysis model on tweets. Specifically, we analyzed the sentiment towards the vaccines at the end of December 2020. We explored various approaches, from supervised linear models, to unsupervised clustering models, to deep learning models. Ultimately, the neural networks were able to generalize the best while avoiding the overfitting problems that are commonly associated with error propagation. Then, we integrated word embeddings in our sequence-to-sequence model, while tuning our hyperparameters.

- Lastly, I designed the front end and connected it to our backend model, so that it accepts a string of words and predicts the sentiment. I designed the front end with Bootstrap, connected the backend through a Flask API, and deployed it on Heroku.
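A minimal sketch of what a Flask endpoint of this kind can look like is below. The `/predict` route, the JSON field names, and the keyword-based `predict_sentiment` placeholder are hypothetical; in the real app the placeholder would be replaced by the trained PyTorch model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical keyword lists standing in for the trained model.
POSITIVE_WORDS = {"good", "great", "safe", "effective"}
NEGATIVE_WORDS = {"bad", "scared", "unsafe", "worried"}


def predict_sentiment(text: str) -> str:
    """Placeholder predictor: counts positive vs. negative keywords."""
    words = set(text.lower().split())
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


@app.route("/predict", methods=["POST"])
def predict():
    # Accept a JSON body such as {"text": "some tweet"}.
    payload = request.get_json(silent=True) or {}
    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        return jsonify({"error": "field 'text' is required"}), 400
    return jsonify({"sentiment": predict_sentiment(text)})
```

The front end only needs to POST the user's text to this route and display the returned label, which keeps the model and the Bootstrap page loosely coupled.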

Tools used: Linear models, Neural networks, Web scraping, HTML/CSS, Sentiment Analysis, Flask, Heroku, PyTorch

Check out the WebApp, or the code on Github

The front end with the ElasticSearch bar

4. Financial forecasting based on Reddit posts

- The goal of this project was to build a corpus of posts scraped from various finance-related subreddits and see whether a post's sentiment correlates with the respective stock price. In addition, the corpus is searchable through an Elasticsearch implementation.

- Scraped and annotated Reddit posts about the FAANG (Facebook, Apple, Amazon, Netflix, Google) and MANA (Microsoft, Apple, Netflix, Amazon) stocks.

- Annotated and tagged a small batch of posts with either VADER sentiment scores or Mechanical Turk (MTurk) labels. Then created a backend with Elasticsearch and a frontend with basic HTML. Lastly, we dockerized and deployed the website for peer review.
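One simple way to check for a relationship between sentiment and price is the Pearson correlation coefficient. The sketch below is an illustration only: the function is my own stdlib implementation, and the daily sentiment scores and returns are made-up toy values, not data from the corpus.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values: average daily post sentiment and same-day return (%).
daily_sentiment = [0.1, 0.4, -0.2, 0.6]
daily_return = [0.2, 0.8, -0.4, 1.2]
r = pearson(daily_sentiment, daily_return)
```

A value of `r` near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests none. Correlation alone does not show that sentiment drives price.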

Tools used: Python, HTML, CSS, ElasticSearch, Docker, FastAPI, Mechanical Turk, Sentiment Analysis, Data Cleaning

Check out the code on Github

A code snippet on the left, and the interface on the right

5. Blackjack Game

- Created a functional blackjack game, utilizing basic algorithms and data structures in Python. This project was done before my MDS program, and it demonstrated my ability to quickly learn a new skill.

- The interface takes a user’s input for how many chips they want to bet, and whether they want to hit or stand. I created the game to be like the casino, since Blackjack was what got me interested in Data Science in the first place.
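The core of any blackjack game is scoring a hand, where an ace counts as 11 unless that would bust the hand. The sketch below shows one common way to do it. The names `CARD_VALUES` and `hand_value` are hypothetical and not taken from the project's code.

```python
from typing import List

# Face cards are worth 10; an ace starts at 11 and may drop to 1.
CARD_VALUES = {
    "2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8, "9": 9,
    "10": 10, "J": 10, "Q": 10, "K": 10, "A": 11,
}


def hand_value(cards: List[str]) -> int:
    """Return the best blackjack total for a hand of card ranks."""
    total = sum(CARD_VALUES[card] for card in cards)
    aces = cards.count("A")
    # Downgrade aces from 11 to 1, one at a time, while the hand is bust.
    while total > 21 and aces > 0:
        total -= 10
        aces -= 1
    return total


blackjack = hand_value(["A", "K"])
soft_hand = hand_value(["A", "A", "9"])
bust = hand_value(["K", "Q", "5"])
```

A hit-or-stand loop can then call `hand_value` after each draw and end the round as soon as the total exceeds 21.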

Tools used: Python, Probabilities, Statistics, Game Design, various Logic Gates

Check out the Github Code - Or, play a game of blackjack here: Repl.it


All photos used are taken by Daniel
Last Updated: June 1, 2021