Movie Recommender System

Independent project, Summer 2016.
Technologies: Java, Hadoop MapReduce, HDFS, Docker

This project built a movie recommender system based on item-based collaborative filtering, using Hadoop MapReduce in Java. I multiplied the user rating matrix and the movie co-occurrence matrix to generate a list of recommended movies that are similar to the movies the user rated highly.
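As a rough illustration of the matrix multiplication described above, here is a minimal single-machine Python sketch (the project itself ran as Hadoop MapReduce jobs in Java). The function names, the row normalization of the co-occurrence counts, and the sample ratings are all assumptions made for this example, not details taken from the project.

```python
from collections import defaultdict
from itertools import permutations


def build_cooccurrence(ratings):
    """Count, for each ordered movie pair, how many users rated both movies."""
    co = defaultdict(lambda: defaultdict(int))
    for movies in ratings.values():
        for m in movies:
            co[m][m] += 1  # diagonal: number of users who rated m
        for a, b in permutations(movies, 2):
            co[a][b] += 1
    return co


def recommend(ratings, user, top_n=3):
    """Score unseen movies as (normalized co-occurrence row) x (user's ratings)."""
    co = build_cooccurrence(ratings)
    seen = ratings[user]
    scores = defaultdict(float)
    for candidate, row in co.items():
        if candidate in seen:
            continue  # only recommend movies the user has not rated
        total = sum(row.values())  # assumed normalization: divide by row sum
        for rated_movie, rating in seen.items():
            scores[candidate] += row.get(rated_movie, 0) / total * rating
    # Highest score first; ties broken by movie name for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))[:top_n]


# Hypothetical ratings: user -> {movie: rating}
sample_ratings = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 4, "B": 5, "C": 5},
    "u3": {"B": 2, "D": 3},
    "u4": {"A": 5, "C": 4},
}
print(recommend(sample_ratings, "u1"))
```

In a MapReduce setting, the co-occurrence counting and the per-cell multiplication would typically be split into separate jobs, but the arithmetic is the same as in this sketch.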

Github Repository

Google Search Auto Complete

Independent project, Summer 2016.
Technologies: Java, Hadoop MapReduce, HDFS, Docker, MySQL, PHP, Ajax, JQuery

This project implemented a Google-style search autocomplete based on an N-gram model, using Hadoop MapReduce in Java. The language model data was loaded into MySQL, and the web demo was built with PHP, jQuery, and Ajax.
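The N-gram idea can be sketched in a few lines of single-machine Python: count every N-gram, then for each prefix keep the most frequent following words. This is only an illustration under assumed names and parameters (`count_ngrams`, `build_model`, `max_n`, `top_k`); the project did the counting in MapReduce and served the results from MySQL.

```python
from collections import Counter, defaultdict


def count_ngrams(sentences, max_n=3):
    """Count all N-grams with 2 <= N <= max_n over whitespace-split sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def build_model(counts, top_k=2):
    """Map each prefix (all words but the last) to its top_k following words."""
    followers = defaultdict(list)
    for ngram, count in counts.items():
        prefix, last = " ".join(ngram[:-1]), ngram[-1]
        followers[prefix].append((count, last))
    return {
        prefix: [w for _, w in sorted(items, key=lambda x: (-x[0], x[1]))[:top_k]]
        for prefix, items in followers.items()
    }


# Hypothetical corpus for illustration only.
corpus = ["big data is fun", "big data is big", "big data rocks"]
model = build_model(count_ngrams(corpus))
print(model["big data"])
```

The resulting prefix-to-suggestions mapping is the kind of table a web front end could query on each keystroke.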

Github Repository

Data Matching for Restaurants from Yelp and Yellow Pages

CS784 Data Models and Languages, Fall 2015. Team members: Jin Ruan, Clarence Cheung.
Technologies: Python, Web Crawler, Machine Learning

This project crawled HTML pages of restaurants from Yelp and Yellow Pages, performed information extraction to convert the HTML data into two relational tables, and matched the restaurants across the two tables.

Project Webpage Github Repository

This project has three stages: data crawling and extraction, blocking, and matching.

  1. Data Crawling and Extraction: we crawl 9,947 restaurant HTML pages from Yelp and 28,787 from Yellow Pages, then use the BeautifulSoup package to extract the information and convert it into two relational tables (CSV format).
  2. Blocking: the cross product of the two tables has roughly 280,000,000 tuple pairs, far too many for the matching stage, so we perform blocking with a combination of rules before moving on. This reduces the candidate set to roughly 21,000 pairs. Check blocking explanation for more details.
  3. Matching: we perform cross validation on each of the following methods to select the best matcher.
    • Decision Tree (DT)
    • Random Forest (RF)
    • Support Vector Machine (SVM)
    • Naive Bayes (NB)
    • Logistic Regression (LG)
    • Linear Regression (LN)
    Finally, we select Random Forest as our best matcher. Here is the matching explanation.
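To make the blocking stage concrete, here is a minimal Python sketch of rule-based blocking. The two rules used (same zip code, then a Jaccard overlap threshold on name tokens), the field names `id`, `name`, and `zip`, and the sample rows are all hypothetical; the actual rule combination used in the project is described in the blocking explanation, not here.

```python
def name_tokens(name):
    """Lowercase a name, drop apostrophes, and split it into a token set."""
    return set(name.lower().replace("'", "").split())


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def block(table_a, table_b, min_overlap=0.3):
    """Keep only pairs that share a zip code and have similar names."""
    by_zip = {}
    for row in table_b:
        by_zip.setdefault(row["zip"], []).append(row)
    pairs = []
    for a in table_a:
        # Rule 1 (assumed): only compare rows with the same zip code.
        for b in by_zip.get(a["zip"], []):
            # Rule 2 (assumed): require enough overlap between name tokens.
            if jaccard(name_tokens(a["name"]), name_tokens(b["name"])) >= min_overlap:
                pairs.append((a["id"], b["id"]))
    return pairs


# Hypothetical rows for illustration only.
yelp = [
    {"id": "y1", "name": "Joe's Pizza", "zip": "53703"},
    {"id": "y2", "name": "Thai Garden", "zip": "53705"},
]
yellow_pages = [
    {"id": "p1", "name": "Joes Pizza Place", "zip": "53703"},
    {"id": "p2", "name": "Thai Garden", "zip": "53711"},
    {"id": "p3", "name": "Burger Barn", "zip": "53703"},
]
print(block(yelp, yellow_pages))
```

Indexing one table by the blocking key avoids materializing the full cross product, which is the point of blocking at this scale. Note that a strict rule such as exact zip equality can drop true matches (the two "Thai Garden" rows above), which is why rules are usually combined and tuned.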

Food Paradise - Restaurant Rating Web Application

CS564 Database Management Systems, Spring 2015. Team members: Jin Ruan, Qing Li, Shuang Wu.
Technologies: JSP, Apache Struts 2, Apache Tomcat, MySQL, Bootstrap

This project developed a Yelp-like web application that lets users search for nearby restaurants and post reviews and ratings. We built it with JSP and the Apache Struts 2 framework, deployed it on an Apache Tomcat server, and chose MySQL as the database.

Github Repository


  • Review Restaurant: write and edit reviews and rate a restaurant.
  • Check-in Restaurant: check in at restaurants that users have visited.
  • Attend Event: explore and save the events of the restaurants that users are interested in.
  • Follow/Unfollow Users: users can follow other users to see their recent activities in the timeline.
  • Search: users can search for restaurants by the food they like.

Identifying the Zygosity Status of Twins

Spring 2016. Team members: Jin Ruan, Yicun Ni, Ying Zhang.
Technologies: Java, Expectation-Maximization, Bayes Network, Hypothesis Test

This project predicted the zygosity status of twin pairs from their disease histories using the Expectation-Maximization (EM) algorithm with a Bayes network. A two-sample t-test was then conducted on the concordance rates of identical and fraternal twins for each disease, to identify the diseases most likely to correlate with zygosity.
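The comparison of concordance rates can be illustrated with a small Python sketch for a single disease. Several choices here are assumptions rather than details from the project: concordance is computed pairwise (both twins affected, among pairs with at least one affected twin), the test statistic is Welch's unequal-variance form of the two-sample t statistic, and the sample data is made up.

```python
from math import sqrt
from statistics import mean, variance


def concordance_indicators(pairs):
    """Return 1 for pairs where both twins have the disease, 0 where only one does.

    Pairs in which neither twin is affected are skipped.
    """
    return [int(bool(a and b)) for a, b in pairs if a or b]


def welch_t(x, y):
    """Welch's two-sample t statistic for two lists of observations."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


# Hypothetical data for one disease: (twin 1 affected, twin 2 affected)
identical = [(1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]
fraternal = [(1, 0), (0, 1), (1, 1), (1, 0)]

mz = concordance_indicators(identical)
dz = concordance_indicators(fraternal)
print(mean(mz), mean(dz), welch_t(mz, dz))
```

A large positive statistic would suggest that identical twins are concordant for the disease more often than fraternal twins; turning it into a p-value additionally requires the t distribution with the appropriate degrees of freedom.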

Final Report

Parallel Seam Carving for Video Retargeting

CS766 Computer Vision, Spring 2016. Team members: Jin Ruan, Clarence Cheung.
Technologies: C++, OpenCV, Multi-Thread, Min-Cut

This project proposed a new approach to video retargeting that uses discontinuous seam carving in both space and time to resize videos. It addresses the poor speed of the original seam-carving algorithm by reducing computational complexity and parallelizing the work.

Project Webpage Final Report Github Repository

First, we implement the seam-carving algorithms from the papers Seam Carving for Content-Aware Image Resizing and Improved Seam Carving for Video Retargeting. We notice that the seam-carving algorithm for videos described there runs fairly slowly. Thus, in our new approach we calculate the seam for each frame separately while using the look-ahead energy to maintain temporal coherence between frames. We implement and parallelize this algorithm with OpenCV in C++ and achieve considerable speed improvements while keeping the same carving power as the algorithms in the papers.
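The per-frame idea can be sketched in plain Python as follows. This is a simplified illustration, not the project's C++/OpenCV implementation: it uses the classic dynamic-programming search for a minimum-energy vertical seam on a precomputed energy map, and it omits the look-ahead energy and the temporal-coherence term. The function names and the tiny energy map are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def min_vertical_seam(energy):
    """Return, for each row, the column of the minimum-energy vertical seam."""
    rows, cols = len(energy), len(energy[0])
    cost = [row[:] for row in energy]  # cumulative minimum cost table
    for r in range(1, rows):
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            cost[r][c] += min(cost[r - 1][lo:hi + 1])
    # Backtrack from the cheapest cell in the last row.
    c = min(range(cols), key=lambda j: cost[-1][j])
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        c = min(range(lo, hi + 1), key=lambda j: cost[r - 1][j])
        seam.append(c)
    seam.reverse()
    return seam


def seams_for_frames(frames, workers=4):
    """Compute one seam per frame; frames are independent, so they can be mapped."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(min_vertical_seam, frames))


# Hypothetical 3x3 energy map standing in for one video frame.
frame_energy = [
    [3, 1, 4],
    [2, 9, 1],
    [5, 1, 6],
]
print(seams_for_frames([frame_energy, frame_energy]))
```

Because each frame's seam is computed independently, the work distributes naturally across threads. In CPython this thread pool mainly illustrates the structure; a native implementation such as the C++ one is where the real speedup would come from.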

Predicting the Northern Hemisphere Sea Ice Extent

CS761 Advanced Machine Learning, Spring 2016. Team members: Jin Ruan, Guangshan Chen.
Technologies: Python, Neural Networks, Time Series

This project proposed a new approach to predicting the Northern Hemisphere sea ice extent using Time Lagged Neural Networks (TLNN) with added external forcing (solar radiation and CO2). Our model captures the main features of Northern Hemisphere sea ice change predicted by complex numerical models, while being simpler and less computationally expensive.
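A time-lagged network is trained on rows that pair the previous few values of the series with the value to predict. The Python sketch below shows one way to build such rows with the external forcing appended. The function name, the number of lags, the choice to use the forcing values at the prediction time step, and the sample numbers are all assumptions for illustration.

```python
def make_lagged_dataset(extent, solar, co2, lags=3):
    """Build (inputs, targets) for a time-lagged model.

    Each input row holds the previous `lags` sea ice extents followed by the
    solar radiation and CO2 values at the time step being predicted.
    """
    if not (len(extent) == len(solar) == len(co2)):
        raise ValueError("all series must have the same length")
    inputs, targets = [], []
    for t in range(lags, len(extent)):
        inputs.append(list(extent[t - lags:t]) + [solar[t], co2[t]])
        targets.append(extent[t])
    return inputs, targets


# Hypothetical series for illustration only.
extent = [10, 11, 12, 13, 14]
solar = [1, 2, 3, 4, 5]
co2 = [0.1, 0.2, 0.3, 0.4, 0.5]

X, y = make_lagged_dataset(extent, solar, co2, lags=2)
for row, target in zip(X, y):
    print(row, "->", target)
```

The resulting rows can be fed to any feed-forward network; the lag count controls how much history each prediction sees.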

Final Report Github Repository