SRCH is a custom search engine built from scratch. I built it with a team of four classmates. Contact me to request access to the git repository.
The search engine consists of four components: a web crawler, an indexer, a PageRank implementation, and a web interface.
- The web crawler runs on AWS EC2 and writes its results to DynamoDB. It is a Mercator-style distributed crawler, spreading requests across machines while staying polite to individual hosts.
- The indexer builds a lexicon and an inverted index, ranking documents with TF-IDF, term proximity, and other signals. It runs as a MapReduce job and also stores its results in DynamoDB.
- The PageRank component performs link analysis, also implemented as a MapReduce job.
- The web interface uses Jetty to serve a servlet-backed site where users submit queries and see a page of the most relevant results. Beyond the crawled data, it surfaces external information such as results based on the user's location and product data from the Amazon ItemLookup API. It also suggests spelling corrections for queries with few hits, using the data gathered by the crawler.
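The core idea behind a Mercator-style crawler is a URL frontier that cycles across hosts so no single site is overloaded. The sketch below is not our actual implementation (which is distributed and backed by DynamoDB); it is a minimal single-machine illustration of per-host queues polled round-robin, with hypothetical class and method names.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of a Mercator-style frontier: one FIFO queue per host,
// hosts served round-robin for politeness. Names are illustrative only.
public class Frontier {
    private final Map<String, Deque<String>> perHost = new LinkedHashMap<>();
    private final Deque<String> hostOrder = new ArrayDeque<>();

    public void add(String url) {
        String host = URI.create(url).getHost();
        perHost.computeIfAbsent(host, h -> {
            hostOrder.addLast(h);       // first URL for this host: enter rotation
            return new ArrayDeque<>();
        }).addLast(url);
    }

    // Returns the next URL to fetch, cycling across hosts; null when empty.
    public String next() {
        while (!hostOrder.isEmpty()) {
            String host = hostOrder.pollFirst();
            Deque<String> queue = perHost.get(host);
            if (queue != null && !queue.isEmpty()) {
                String url = queue.pollFirst();
                if (!queue.isEmpty()) hostOrder.addLast(host); // back of rotation
                else perHost.remove(host);
                return url;
            }
        }
        return null;
    }
}
```

With two URLs queued for one host and one for another, `next()` alternates hosts rather than draining the first host's queue. The real Mercator design adds prioritized front queues and per-host delay timers on top of this.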
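The TF-IDF ranking the indexer relies on can be stated compactly: a term matters in a document in proportion to how often it appears there, discounted by how common it is across the corpus. A minimal sketch of one common formulation (normalized term frequency times log inverse document frequency; the exact variant our indexer uses may differ):

```java
// Hypothetical sketch of TF-IDF scoring, one common formulation.
public class TfIdf {
    // tf: occurrences of the term in the document
    // docLength: total number of terms in the document
    // docFreq: number of documents in the corpus containing the term
    // totalDocs: number of documents in the corpus
    public static double score(int tf, int docLength, int docFreq, int totalDocs) {
        if (tf == 0 || docFreq == 0) return 0.0;
        double termFreq = (double) tf / docLength;          // normalized TF
        double idf = Math.log((double) totalDocs / docFreq); // rarity bonus
        return termFreq * idf;
    }
}
```

A query's score for a document is then the sum of `score` over the query terms, which is exactly the kind of per-term computation that maps cleanly onto a MapReduce job over the inverted index.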
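PageRank itself is an iterative computation: each page's rank is redistributed along its outlinks, with a damping factor modeling a random jump. The in-memory sketch below shows the iteration our MapReduce job parallelizes (in the real job, each map step emits rank shares along outlinks and each reduce step sums them per page); the dangling-page handling shown is one common convention, not necessarily ours.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical single-machine sketch of the PageRank power iteration.
public class PageRank {
    // links.get(i) = pages that page i links to; pages are ids 0..n-1.
    public static double[] rank(List<List<Integer>> links, int iterations, double damping) {
        int n = links.size();
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);                     // uniform start
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);   // random-jump mass
            for (int i = 0; i < n; i++) {
                List<Integer> out = links.get(i);
                if (out.isEmpty()) {
                    // Dangling page: spread its rank evenly over all pages.
                    for (int j = 0; j < n; j++) next[j] += damping * pr[i] / n;
                } else {
                    double share = damping * pr[i] / out.size();
                    for (int j : out) next[j] += share;
                }
            }
            pr = next;
        }
        return pr;
    }
}
```

Each iteration preserves total rank mass, so the scores remain a probability distribution and converge toward the stationary ranking as iterations increase.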