Add documentation - taxi - Winning entry to the Kaggle taxi competition

commit c5187418bc93c34e3fdce4fdc1a3b5316812b69a
parent 793be7b049cecba43072858341dc7006fef352e7
Author: Alex Auvolat <alex@adnab.me>
Date:   Fri, 10 Jul 2015 19:20:57 -0400

Add documentation

Diffstat:
M .gitignore  | 7 ++++++-
A Makefile  | 23 +++++++++++++++++++++++
A doc/biblio.bib  | 36 ++++++++++++++++++++++++++++++++++++
A doc/report.tm  | 353 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

4 files changed, 418 insertions(+), 1 deletion(-)
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,9 @@
 .idea/*
 
+# Source archive
+submission.tgz
+*.pdf
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
@@ -58,8 +62,9 @@ docs/_build/
 # PyBuilder
 target/
 
-# Vim tmpfiles
+# Vim & other tmpfiles
 *.swp
+*~
 
 # Random datafiles
 *.csv
diff --git a/Makefile b/Makefile
@@ -0,0 +1,23 @@
+FILES=config/__init__.py \
+	  config/dest_mlp_tgtcls_1_cswdtx_alexandre.py \
+	  data/cuts/__init__.py \
+	  data/cuts/test_times_0.py \
+	  data/__init__.py \
+	  data/csv_to_hdf5.py \
+	  data/hdf5.py \
+	  data/init_valid.py \
+	  data/make_valid_cut.py \
+	  data/transformers.py \
+	  model/__init__.py \
+	  model/mlp.py \
+	  model/dest_mlp_tgtcls.py \
+	  data_analysis/cluster_arrival.py \
+	  doc/report.pdf \
+	  __init__.py \
+	  error.py \
+	  ext_saveload.py \
+	  ext_test.py \
+	  train.py
+		
+submission.tgz: $(FILES)
+	tar czf $@ $(FILES)
diff --git a/doc/biblio.bib b/doc/biblio.bib
@@ -0,0 +1,36 @@
+
+@article{bengio2003neural,
+  title={A neural probabilistic language model},
+  author={Bengio, Yoshua and Ducharme, R{\'e}jean and Vincent, Pascal and Janvin, Christian},
+  journal={The Journal of Machine Learning Research},
+  volume={3},
+  pages={1137--1155},
+  year={2003},
+  publisher={JMLR. org}
+}
+
+@inproceedings{bergstra2010theano,
+  title={Theano: a CPU and GPU math expression compiler},
+  author={Bergstra, James and Breuleux, Olivier and Bastien, Fr{\'e}d{\'e}ric and Lamblin, Pascal and Pascanu, Razvan and Desjardins, Guillaume and Turian, Joseph and Warde-Farley, David and Bengio, Yoshua},
+  booktitle={Proceedings of the Python for scientific computing conference (SciPy)},
+  volume={4},
+  pages={3},
+  year={2010},
+  organization={Austin, TX}
+}
+
+@article{bastien2012theano,
+  title={Theano: new features and speed improvements},
+  author={Bastien, Fr{\'e}d{\'e}ric and Lamblin, Pascal and Pascanu, Razvan and Bergstra, James and Goodfellow, Ian and Bergeron, Arnaud and Bouchard, Nicolas and Warde-Farley, David and Bengio, Yoshua},
+  journal={arXiv preprint arXiv:1211.5590},
+  year={2012}
+}
+
+@inproceedings{glorot2011deep,
+  title={Deep sparse rectifier neural networks},
+  author={Glorot, Xavier and Bordes, Antoine and Bengio, Yoshua},
+  booktitle={International Conference on Artificial Intelligence and Statistics},
+  pages={315--323},
+  year={2011}
+}
+
diff --git a/doc/report.tm b/doc/report.tm
@@ -0,0 +1,352 @@
+<TeXmacs|1.99.2>
+
+<style|generic>
+
+<\body>
+  <doc-data|<doc-title|Taxi Destination Prediction Challenge<next-line>Winner
+  Team's Report>|<doc-author|<author-data|<\author-affiliation>
+    <em|Montr�al, July 2015>
+  </author-affiliation>>>>
+
+  <center|<tabular*|<tformat|<table|<row|<cell|<name|Alex
+  Auvolat>>|<cell|<name|Alexandre De Br�bisson>>|<cell|<name|�tienne
+  Simon>>>|<row|<cell|ENS Paris>|<cell|Universit� de Montr�al>|<cell|ENS
+  Cachan>>|<row|<cell|France>|<cell|Qu�bec,
+  Canada>|<cell|France>>|<row|<cell|<verbatim|alexis211@gmail.com>>|<cell|<verbatim|<strong|adbrebs@gmail.com>>>|<cell|<verbatim|esimon@esimon.eu>>>>>>>
+
+  <section|Summary>
+
+  Our model is based on a multi-layer perceptron (MLP), a simple feed-forward
+  neural network architecture. Our MLP model is trained by stochastic
+  gradient descent (SGD) on the training trajectories. The inputs to our MLP
+  are the 5 first and 5 last positions of the known part of the trajectory,
+  as well as embeddings for the context information (date, client and taxi
+  identification). \ The embeddings are trained with SGD jointly with the MLP
+  parameters. The MLP outputs probabilities for 3392 target points, and a
+  mean is calculated to get a unique destination point as an output. We did
+  no ensembling and used no external data.
+
+  <section|Feature Selection/Extraction>
+
+  We used a mean-shift algorithm on the destination points of all the
+  training trajectories to extract 3392 classes for the destination point.
+  These classes were used as a fixed output layer for the MLP architecture.
+
+  We used the embedding method which is common in neural language modeling
+  approaches (see [1]) to take the metainformation into account in our model.
+  The following embeddings were used (listed with corresponding
+  dimensionnality):
+
+  <big-table|<tabular|<tformat|<table|<row|<cell|<tabular|<tformat|<cwith|1|1|1|-1|cell-bborder|1px>|<table|<row|<cell|<strong|Meta-data>>|<cell|<strong|Embedding
+  Dimension>>|<cell|<strong|Number of classes>>>|<row|<cell|Unique caller
+  number>|<cell|10>|<cell|57125>>|<row|<cell|Unique stand
+  number>|<cell|10>|<cell|64>>|<row|<cell|Unique taxi
+  number>|<cell|10>|<cell|448>>|<row|<cell|Week of
+  year>|<cell|10>|<cell|54>>|<row|<cell|Day of
+  week>|<cell|10>|<cell|7>>|<row|<cell|1/4 of hour of the
+  day>|<cell|10>|<cell|96>>|<row|<cell|Day type (invalid
+  data)>|<cell|10>|<cell|3>>>>>>>>>>|Embeddings and corresponding dimensions
+  used by the model>
+
+  The embeddings were first initialized to random variables and were then let
+  to evolve freely with SGD along with the other model parameters.
+
+  The geographical data input in the network is a centered and normalized
+  version of the GPS data points.
+
+  We did no other preprocessing or feature selection.
+
+  <section|Modelling Techniques and Training>
+
+  Here is a brief description of the model we used:
+
+  <\itemize>
+    <item><strong|Input.> The input layer of the MLP is the concatenation of
+    the following inputs:
+
+    <\itemize>
+      <item>Five first and five last points of the known part of the
+      trajectory.
+
+      <item>Embeddings for all the metadata.
+    </itemize>
+
+    <item><strong|Hidden layer.> We use a single hidden layer MLP. The hidden
+    layer is of size 500, and the activation function is a Rectifier Linear
+    Unit (ie <math|f<around*|(|x|)>=max<around*|(|0,x|)>>). See [2] for more
+    information about ReLUs.
+
+    <item><strong|Output layer.> The output layer predicts a probability
+    vector for the 3392 output classes that we obtained with our clustering
+    preprocessing step. If <math|\<b-p\>> is the probability vector output by
+    our MLP (output by a softmax layer) and <math|c<rsub|i>> is the centroid
+    of cluster <math|i>, our prediciton is given by:
+
+    <\eqnarray*>
+      <tformat|<table|<row|<cell|<wide|y|^>>|<cell|=>|<cell|<big|sum><rsub|i>p<rsub|i>*c<rsub|i>>>>>
+    </eqnarray*>
+
+    Since <math|\<b-p\>> sums to one, this is a valid point on the map.
+
+    <item><strong|Cost.> We directly train using an approximation of the mean
+    Haversine Distance as a cost.
+
+    <item><strong|SGD and optimization.> We used a minibatch size of 200. The
+    optimization algorithm is simple SGD with a learning rate of 0.01 and a
+    momentum of 0.9.
+
+    <item><strong|Validation.> To generate our validation set, we tried to
+    create a set that looked like the training set. For that we generated
+    ``cuts'' from the training set, i.e. extracted all the taxi rides that
+    were occuring at given times. The times we selected for our validation
+    set are similar to those of the test set, only one year before.
+  </itemize>
+
+  <section|Code Description>
+
+  Here is a brief description of the Python files in the archive:
+
+  <\itemize>
+    <item><verbatim|config/*.py> : configuration files for the different
+    models we have experimented with
+
+    The model which gets the best solution is
+    <verbatim|mlp_tgtcls_1_cswdtx_alexandre.py>
+
+    <item><verbatim|data/*.py> : files related to the data pipeline:
+
+    <\itemize>
+      <item><verbatim|__init__.py> contains some general statistics about the
+      data
+
+      <item><verbatim|csv_to_hdf5.py> : convert the CSV data file into an
+      HDF5 file usable directly by Fuel
+
+      <item><verbatim|hdf5.py> : utility functions for exploiting the HDF5
+      file
+
+      <item><verbatim|init_valid.py> : initializes the HDF5 file for the
+      validation set
+
+      <item><verbatim|make_valid_cut.py> : generate a validation set using a
+      list of time cuts. Cut lists are stored in Python files in
+      <verbatim|data/cuts/> (we used a single cut file)
+
+      <item><verbatim|transformers.py> : Fuel pipeline for transforming the
+      training dataset into structures usable by our model
+    </itemize>
+
+    <item><strong|<verbatim|data_analysis/*.py>> : scripts for various
+    statistical analyses on the dataset
+
+    <\itemize>
+      <item><verbatim|cluster_arrival.py> : the script used to generate the
+      mean-shift clustering of the destination points, producing the 3392
+      target points
+    </itemize>
+
+    <item><verbatim|model/*.py> : source code for the various models we tried
+
+    <\itemize>
+      <item><verbatim|__init__.py> contains code common to all the models,
+      including the code for embedding the metadata
+
+      <item><verbatim|mlp.py> contains code common to all MLP models
+
+      <item><verbatim|dest_mlp_tgtcls.py> containts code for our MLP
+      destination prediction model using target points for the output layer
+    </itemize>
+
+    <item><verbatim|error.py> contains the functions for calculating the
+    error based on the Haversine Distance
+
+    <item><verbatim|ext_saveload.py> contains a Blocks extension for saving
+    and reloading the model parameters so that training can be interrupted
+
+    <item><verbatim|ext_test.py> contains a Blocks extension that runs the
+    model on the test set and produces an output CSV submission file
+
+    <item><verbatim|train.py> contains the main code for the training and
+    testing
+  </itemize>
+
+  In the archive we have included only the files listed above, which are
+  strictly necessary for reproducing our results. More files for the other
+  models we have tried are available on GitHub at
+  <hlink|https://github.com/adbrebs/taxi|><hlink||https://github.com/adbrebs/taxi>.
+
+  <section|Dependencies>
+
+  We used the following packages developped at the MILA lab:
+
+  <\itemize>
+    <item><strong|Thano.> A general GPU-accelerated python math library, with
+    an interface similar to numpy (see [3, 4]).
+    <hlink|http://deeplearning.net/software/theano/|>
+
+    <item><strong|Blocks.> A deep-learning and neural network framework for
+    Python based on Theano. <hlink|https://github.com/mila-udem/blocks|>
+
+    <item><strong|Fuel.> A data pipelining framework for Blocks.
+    <hlink|https://github.com/mila-udem/fuel|>
+  </itemize>
+
+  We also used the <verbatim|scikit-learn> Python library for their
+  mean-shift clustering algorithm. <verbatim|numpy>, <verbatim|cPickle> and
+  <verbatim|h5py> are also used at various places.
+
+  <section|How To Generate The Solution>
+
+  <\enumerate>
+    <item>Set the <verbatim|TAXI_PATH> environment variable to the path of
+    the folder containing the CSV files.
+
+    <item>Run <verbatim|data/csv_to_hdf5.py> to generate the HDF5 file (which
+    is generated in <verbatim|TAXI_PATH>, along the CSV files). This takes
+    around 20 minutes on our machines.
+
+    <item>Run <verbatim|data/init_valid.py> to initialize the validation set
+    HDF5 file.
+
+    <item>Run <verbatim|data/make_valid_cut.py test_times_0> to generate the
+    validation set. This can take a few minutes.
+
+    <item>Run <verbatim|data_analysis/cluster_arrival.py> to generate the
+    arrival point clustering. This can take a few minutes.
+
+    <item>Create a folder <verbatim|model_data> and a folder
+    <verbatim|output> (next to the training script), which will recieve
+    respectively a regular save of the model parameters and many submission
+    files generated from the model at a regular interval.
+
+    <item>Run <verbatim|./train.py dest_mlp_tgtcls_1_cswdtx_alexandre> to
+    train the model. Output solutions are generated in <verbatim|output/>
+    every 1000 iterations. Interrupt the model with three consecutive Ctrl+C
+    at any times. The training script is set to stop training after 10 000
+    000 iterations, but a result file produced after less than 2 000 000
+    iterations is already the winning solution. The training is quite long
+    though: we trained our model on a GeForce GTX 680 card and it took about
+    an afternoon to generate the winning solution.
+
+    When running the training script, set the following Theano flags
+    environment variable to exploit GPU parallelism:
+
+    <verbatim|THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN>
+
+    Theano is only compatible with CUDA, which requires an Nvidia GPUs.
+    Training on the CPU is also possible but much slower.
+  </enumerate>
+
+  <section|Additional Comments and Observations>
+
+  The training examples fed to the model are not full trajectories, since
+  that would make no sense, but prefixes of those trajectories that are
+  generated on-the-fly by a Fuel transformer, <verbatim|TaxiGenerateSplits>,
+  whose code is available in <verbatim|data/transformers.py>. The data
+  pipeline is as follows:
+
+  <\itemize>
+    <item>Select a random full trajectory from the dataset
+
+    <item>Generate a maximum of 100 prefixes for that trajectory. If the
+    trajectory is smaller than 100 data points, generate all possible
+    prefixes. Otherwise, chose a random subset of prefixes. Keep the final
+    destination somewhere as it is used as a target for the training.
+
+    <item>Take only the 5 first and 5 last points of the trajectory.
+
+    <item>At this points we have a stream of prefixes sucessively taken from
+    different trajectories. We create batches of size 200 with the items of
+    the previous stream, taken in the order in which they come. The prefixes
+    generated from a single trajectory may end up in two sucessive batches,
+    or all in a single batch.
+  </itemize>
+
+  <section|References>
+
+  <\enumerate>
+    <item><label|gs_cit0>Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C.
+    (2003). A neural probabilistic language model.
+    <with|font-shape|italic|The Journal of Machine Learning Research>,
+    <with|font-shape|italic|3>, 1137-1155.
+
+    <item><label|gs_cit0>Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep
+    sparse rectifier neural networks. In <with|font-shape|italic|International
+    Conference on Artificial Intelligence and Statistics> (pp. 315-323).
+
+    <item><label|gs_cit0>Bergstra, J., Bastien, F., Breuleux, O., Lamblin,
+    P., Pascanu, R., Delalleau, O., ... & Bengio, Y. (2011). Theano: Deep
+    learning on gpus with python. In <with|font-shape|italic|NIPS 2011,
+    BigLearning Workshop, Granada, Spain>.
+
+    <item><label|gs_cit0>Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J.,
+    Goodfellow, I., Bergeron, A., ... & Bengio, Y. (2012). Theano: new
+    features and speed improvements. <with|font-shape|italic|arXiv preprint
+    arXiv:1211.5590>.
+  </enumerate>
+</body>
+
+<\initial>
+  <\collection>
+    <associate|page-medium|paper>
+    <associate|page-screen-margin|true>
+  </collection>
+</initial>
+
+<\references>
+  <\collection>
+    <associate|auto-1|<tuple|1|1>>
+    <associate|auto-10|<tuple|8|?>>
+    <associate|auto-2|<tuple|2|1>>
+    <associate|auto-3|<tuple|1|1>>
+    <associate|auto-4|<tuple|3|2>>
+    <associate|auto-5|<tuple|4|2>>
+    <associate|auto-6|<tuple|5|3>>
+    <associate|auto-7|<tuple|6|3>>
+    <associate|auto-8|<tuple|7|4>>
+    <associate|auto-9|<tuple|8|4>>
+    <associate|gs_cit0|<tuple|4|4>>
+  </collection>
+</references>
+
+<\auxiliary>
+  <\collection>
+    <\associate|table>
+      <tuple|normal|Embeddings and corresponding dimensions used by the
+      model|<pageref|auto-3>>
+    </associate>
+    <\associate|toc>
+      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|1<space|2spc>Summary>
+      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+      <no-break><pageref|auto-1><vspace|0.5fn>
+
+      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|2<space|2spc>Feature
+      Selection/Extraction> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+      <no-break><pageref|auto-2><vspace|0.5fn>
+
+      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|3<space|2spc>Modelling
+      Techniques and Training> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+      <no-break><pageref|auto-4><vspace|0.5fn>
+
+      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|4<space|2spc>Code
+      Description> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+      <no-break><pageref|auto-5><vspace|0.5fn>
+
+      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|5<space|2spc>Dependencies>
+      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+      <no-break><pageref|auto-6><vspace|0.5fn>
+
+      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|6<space|2spc>How
+      To Generate The Solution> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+      <no-break><pageref|auto-7><vspace|0.5fn>
+
+      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|7<space|2spc>Additional
+      Comments and Observations> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+      <no-break><pageref|auto-8><vspace|0.5fn>
+
+      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|8<space|2spc>References>
+      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+      <no-break><pageref|auto-9><vspace|0.5fn>
+    </associate>
+  </collection>
+</auxiliary>+
\ No newline at end of file

	taxi Winning entry to the Kaggle taxi competition
	git clone https://esimon.eu/repos/taxi.git
	Log \| Files \| Refs \| README

M	.gitignore	\|	7	++++++-
A	Makefile	\|	23	+++++++++++++++++++++++
A	doc/biblio.bib	\|	36	++++++++++++++++++++++++++++++++++++
A	doc/report.tm	\|	353	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++