\documentclass[11pt]{article}
% Character set for input and output
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
% Fonts
\usepackage{libertine}
\usepackage[scaled=0.83]{beramono}
% AMS math-packages
\usepackage{amssymb}
\usepackage{amsmath,amsthm}
% TODO in text
\usepackage{todonotes}
% URL (clickable), references within document, etc./
\usepackage{hyperref}
% Code formatting
\usepackage{listings}
\lstset{
language=java,
extendedchars=true,
basicstyle=\footnotesize\ttfamily,
showstringspaces=false,
showspaces=false,
numbers=left,
numberstyle=\footnotesize,
numbersep=9pt,
tabsize=2,
breaklines=true,
showtabs=false,
frame=single,
extendedchars=false,
inputencoding=utf8,
captionpos=b
}
% Text / Paragraph space
\addtolength{\textwidth}{1.5cm}
\addtolength{\hoffset}{-0.5cm}
\setlength{\parindent}{0pt}
\setlength{\parskip}{1.5ex plus 1ex minus 1ex}
\title{Project: Search Engine}
\author{{\Large Group Z}\\Alice Liddell \quad Bob Ross \quad Cruella de Vil \quad David Attenborough\\\tt{\{alid,bros,cdev,datt\}@itu.dk}}
\date{\today{}}
%\date{December 14, 2019}
\begin{document}
\maketitle
\section*{Introduction}
%
This document reports on the search engine project that we developed
during the Introductory Programming course at the IT University of
Copenhagen.
%
In Sections~1--3 we will report on our solutions to the
mandatory tasks posed in the project description.
%
We also solved some of the challenges posed for these mandatory
tasks.
%
Their solution is described in the respective section along with the
solution to the mandatory task.
%
\todo{This is just an example template; you do not need to do these extensions.}
%
Furthermore, we developed the following extensions which are described
in detail in the following sections:
%
\begin{itemize}
\item
Section 4 describes our changes to the client GUI.
\item
Section 5 describes how we built a webcrawler.
\item
Section 6 describes how we supported the ``fuzzy search'' feature.
\end{itemize}
The description for each solution is roughly split up into the
following parts:
%
\begin{itemize}
\item
\textbf{Task:} A short review on the task that we had to solve.
\item
\textbf{Approach:} An informal, high-level description of how we solved the
task.
\item
\textbf{Solution:} A detailed technical description of our solution to the task.
\item
\textbf{Test:} Description of considerations for testing the
correctness of our solution.
\end{itemize}
%
Other parts (e.g. design, reflection, \ldots{}) will be included where
deemed appropriate. For instance, Task 1 includes a part on the
benchmarking results.
%
The source code of our project is handed in as a single
zip file called \texttt{<filename>.zip}.
%
\texttt{<insert directory>} contains the source code that solves the
mandatory tasks.
%
Our code is also available on ITU's Github:
\url{https://github.itu.dk/...}.
%
\subsection*{Statement of Contribution}
%
All authors contributed equally to all parts of the solution of the
mandatory tasks.
%
Bob Ross made the graphic design for the client GUI.
%
Bob Ross and Cruella de Vil conceived the idea of the web crawler,
Cruella de Vil implemented it, and Bob Ross documented it.
%
Alice Liddell is solely responsible for the ``fuzzy search'' feature.
\section{Inverted Indices}
\paragraph{Task}
%
This task involves reducing the amount of time the search engine takes
to respond to a query, by means of an inverted index.
%
%
An inverted index provides a fast way to access the webpages that
contain a certain word by storing a mapping from words to the list
of webpages that contain the word.
%
Figure~\ref{fig:index:uml} provides an example for such a data structure.
%
We are to implement two variants of an inverted index: one based on a
{\tt TreeMap}, and the other on a {\tt HashMap}. We are then to
evaluate the performance of these implementations, and to pick the
best one to use in our search engine.
%
\paragraph{Approach}
%
As per the code handout, we generalized the notion of an index into what we call an
{\tt Index}, that has the following two methods:
%
\begin{itemize}
\item {\tt build}: add description from your javadoc.
\item {\tt lookup}: add description from your javadoc.
\end{itemize}
%
We then implemented the inverted index by using Java's Map data structure.
%
Here, Java provides different implementations and we tested the
following variants in our code: ..., ....
\paragraph{Solution} We build the index data structures as described in Figure~\ref{fig:index:uml}.
We allowed to switch the implementation of the Map data structure by ... . Building the inverted index was accomplished by the following code:
\begin{lstlisting}[language=Java]
for each webpage page in list of webpages {
for each word p on page {
...
}
}
\end{lstlisting}
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/uml}
\caption{UML Diagram for the Software Architecture of our webserver.}
\label{fig:index:uml}
\end{figure}
\paragraph{Test} We verified the correctness of the {\tt build} and {\tt lookup} method with unit tests. These tests can be found in the file .... Here, the {\tt build} method of the inverted index was tested by .... (May add Java code here.)
The {\tt lookup} method of all three indexes was checked by carefully creating a list of test webpages and checking whether the lookup methods returns the expected results by looking at the following properties of the returned list: (i) its size and (ii) ... ? (Add Java code?)
\paragraph{Benchmark}
%
We evaluated which implementation provides the fastest running time for
lookup operation by ... (describe your benchmarking setup / approach).
%
We choose the following query words for the benchmark:
%
``itu, is, the, best''.
%
The running times of different index implementations can be found in
Table~\ref{tab:benchmark:indices} (or a plot).
%
\begin{table}[t]
\begin{tabular}{l|l|l}
Index & File & Avg. lookup time (ms)\\ \hline \hline
SimpleIndex & enwiki-tiny.txt & ... \\
SimpleIndex & enwiki-small.txt & ... \\
SimpleIndex & enwiki-medium.txt & ... \\
InvertedIndex w/ HashMap & enwiki-tiny.txt & ... \\
... & ... & ... \\
InvertedIndex w/ TreeMap & enwiki-tiny.txt & ... \\
... & ... & ...
\end{tabular}
\caption{Running times for different index implementations.}
\label{tab:benchmark:indices}
\end{table}
%
These benchmark results indicate that ... .
%
Based on these results, we decided that we use ... as the index in our
data structure.
\section{Refined Queries}
\section{Ranking Algorithms}
\section{Extension: Improving the Client GUI}
\section{Extension: Webcrawler}
\section{Extension: Fuzzy-Search}
\end{document}