commit c5187418bc93c34e3fdce4fdc1a3b5316812b69a
parent 793be7b049cecba43072858341dc7006fef352e7
Author: Alex Auvolat <alex@adnab.me>
Date: Fri, 10 Jul 2015 19:20:57 -0400
Add documentation
Diffstat:
4 files changed, 418 insertions(+), 1 deletion(-)
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,9 @@
.idea/*
+# Source archive
+submission.tgz
+*.pdf
+
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -58,8 +62,9 @@ docs/_build/
# PyBuilder
target/
-# Vim tmpfiles
+# Vim & other tmpfiles
*.swp
+*~
# Random datafiles
*.csv
diff --git a/Makefile b/Makefile
@@ -0,0 +1,23 @@
+FILES=config/__init__.py \
+ config/dest_mlp_tgtcls_1_cswdtx_alexandre.py \
+ data/cuts/__init__.py \
+ data/cuts/test_times_0.py \
+ data/__init__.py \
+ data/csv_to_hdf5.py \
+ data/hdf5.py \
+ data/init_valid.py \
+ data/make_valid_cut.py \
+ data/transformers.py \
+ model/__init__.py \
+ model/mlp.py \
+ model/dest_mlp_tgtcls.py \
+ data_analysis/cluster_arrival.py \
+ doc/report.pdf \
+ __init__.py \
+ error.py \
+ ext_saveload.py \
+ ext_test.py \
+ train.py
+
+submission.tgz: $(FILES)
+ tar czf $@ $(FILES)
diff --git a/doc/biblio.bib b/doc/biblio.bib
@@ -0,0 +1,36 @@
+
+@article{bengio2003neural,
+ title={A neural probabilistic language model},
+ author={Bengio, Yoshua and Ducharme, R{\'e}jean and Vincent, Pascal and Janvin, Christian},
+ journal={The Journal of Machine Learning Research},
+ volume={3},
+ pages={1137--1155},
+ year={2003},
+ publisher={JMLR. org}
+}
+
+@inproceedings{bergstra2010theano,
+ title={Theano: a CPU and GPU math expression compiler},
+ author={Bergstra, James and Breuleux, Olivier and Bastien, Fr{\'e}d{\'e}ric and Lamblin, Pascal and Pascanu, Razvan and Desjardins, Guillaume and Turian, Joseph and Warde-Farley, David and Bengio, Yoshua},
+ booktitle={Proceedings of the Python for scientific computing conference (SciPy)},
+ volume={4},
+ pages={3},
+ year={2010},
+ organization={Austin, TX}
+}
+
+@article{bastien2012theano,
+ title={Theano: new features and speed improvements},
+ author={Bastien, Fr{\'e}d{\'e}ric and Lamblin, Pascal and Pascanu, Razvan and Bergstra, James and Goodfellow, Ian and Bergeron, Arnaud and Bouchard, Nicolas and Warde-Farley, David and Bengio, Yoshua},
+ journal={arXiv preprint arXiv:1211.5590},
+ year={2012}
+}
+
+@inproceedings{glorot2011deep,
+ title={Deep sparse rectifier neural networks},
+ author={Glorot, Xavier and Bordes, Antoine and Bengio, Yoshua},
+ booktitle={International Conference on Artificial Intelligence and Statistics},
+ pages={315--323},
+ year={2011}
+}
+
diff --git a/doc/report.tm b/doc/report.tm
@@ -0,0 +1,352 @@
+<TeXmacs|1.99.2>
+
+<style|generic>
+
+<\body>
+ <doc-data|<doc-title|Taxi Destination Prediction Challenge<next-line>Winner
+ Team's Report>|<doc-author|<author-data|<\author-affiliation>
+ <em|Montréal, July 2015>
+ </author-affiliation>>>>
+
+ <center|<tabular*|<tformat|<table|<row|<cell|<name|Alex
+ Auvolat>>|<cell|<name|Alexandre De Brébisson>>|<cell|<name|Étienne
+ Simon>>>|<row|<cell|ENS Paris>|<cell|Université de Montréal>|<cell|ENS
+ Cachan>>|<row|<cell|France>|<cell|Québec,
+ Canada>|<cell|France>>|<row|<cell|<verbatim|alexis211@gmail.com>>|<cell|<verbatim|<strong|adbrebs@gmail.com>>>|<cell|<verbatim|esimon@esimon.eu>>>>>>>
+
+ <section|Summary>
+
+ Our model is based on a multi-layer perceptron (MLP), a simple feed-forward
+ neural network architecture. Our MLP model is trained by stochastic
+ gradient descent (SGD) on the training trajectories. The inputs to our MLP
+ are the 5 first and 5 last positions of the known part of the trajectory,
+ as well as embeddings for the context information (date, client and taxi
+ identification). \ The embeddings are trained with SGD jointly with the MLP
+ parameters. The MLP outputs probabilities for 3392 target points, and a
+ mean is calculated to get a unique destination point as an output. We did
+ no ensembling and used no external data.
+
+ <section|Feature Selection/Extraction>
+
+ We used a mean-shift algorithm on the destination points of all the
+ training trajectories to extract 3392 classes for the destination point.
+ These classes were used as a fixed output layer for the MLP architecture.
+
+ We used the embedding method which is common in neural language modeling
+ approaches (see [1]) to take the metainformation into account in our model.
+ The following embeddings were used (listed with corresponding
+ dimensionnality):
+
+ <big-table|<tabular|<tformat|<table|<row|<cell|<tabular|<tformat|<cwith|1|1|1|-1|cell-bborder|1px>|<table|<row|<cell|<strong|Meta-data>>|<cell|<strong|Embedding
+ Dimension>>|<cell|<strong|Number of classes>>>|<row|<cell|Unique caller
+ number>|<cell|10>|<cell|57125>>|<row|<cell|Unique stand
+ number>|<cell|10>|<cell|64>>|<row|<cell|Unique taxi
+ number>|<cell|10>|<cell|448>>|<row|<cell|Week of
+ year>|<cell|10>|<cell|54>>|<row|<cell|Day of
+ week>|<cell|10>|<cell|7>>|<row|<cell|1/4 of hour of the
+ day>|<cell|10>|<cell|96>>|<row|<cell|Day type (invalid
+ data)>|<cell|10>|<cell|3>>>>>>>>>>|Embeddings and corresponding dimensions
+ used by the model>
+
+ The embeddings were first initialized to random variables and were then let
+ to evolve freely with SGD along with the other model parameters.
+
+ The geographical data input in the network is a centered and normalized
+ version of the GPS data points.
+
+ We did no other preprocessing or feature selection.
+
+ <section|Modelling Techniques and Training>
+
+ Here is a brief description of the model we used:
+
+ <\itemize>
+ <item><strong|Input.> The input layer of the MLP is the concatenation of
+ the following inputs:
+
+ <\itemize>
+ <item>Five first and five last points of the known part of the
+ trajectory.
+
+ <item>Embeddings for all the metadata.
+ </itemize>
+
+ <item><strong|Hidden layer.> We use a single hidden layer MLP. The hidden
+ layer is of size 500, and the activation function is a Rectifier Linear
+ Unit (ie <math|f<around*|(|x|)>=max<around*|(|0,x|)>>). See [2] for more
+ information about ReLUs.
+
+ <item><strong|Output layer.> The output layer predicts a probability
+ vector for the 3392 output classes that we obtained with our clustering
+ preprocessing step. If <math|\<b-p\>> is the probability vector output by
+ our MLP (output by a softmax layer) and <math|c<rsub|i>> is the centroid
+ of cluster <math|i>, our prediciton is given by:
+
+ <\eqnarray*>
+ <tformat|<table|<row|<cell|<wide|y|^>>|<cell|=>|<cell|<big|sum><rsub|i>p<rsub|i>*c<rsub|i>>>>>
+ </eqnarray*>
+
+ Since <math|\<b-p\>> sums to one, this is a valid point on the map.
+
+ <item><strong|Cost.> We directly train using an approximation of the mean
+ Haversine Distance as a cost.
+
+ <item><strong|SGD and optimization.> We used a minibatch size of 200. The
+ optimization algorithm is simple SGD with a learning rate of 0.01 and a
+ momentum of 0.9.
+
+ <item><strong|Validation.> To generate our validation set, we tried to
+ create a set that looked like the training set. For that we generated
+ ``cuts'' from the training set, i.e. extracted all the taxi rides that
+ were occuring at given times. The times we selected for our validation
+ set are similar to those of the test set, only one year before.
+ </itemize>
+
+ <section|Code Description>
+
+ Here is a brief description of the Python files in the archive:
+
+ <\itemize>
+ <item><verbatim|config/*.py> : configuration files for the different
+ models we have experimented with
+
+ The model which gets the best solution is
+ <verbatim|mlp_tgtcls_1_cswdtx_alexandre.py>
+
+ <item><verbatim|data/*.py> : files related to the data pipeline:
+
+ <\itemize>
+ <item><verbatim|__init__.py> contains some general statistics about the
+ data
+
+ <item><verbatim|csv_to_hdf5.py> : convert the CSV data file into an
+ HDF5 file usable directly by Fuel
+
+ <item><verbatim|hdf5.py> : utility functions for exploiting the HDF5
+ file
+
+ <item><verbatim|init_valid.py> : initializes the HDF5 file for the
+ validation set
+
+ <item><verbatim|make_valid_cut.py> : generate a validation set using a
+ list of time cuts. Cut lists are stored in Python files in
+ <verbatim|data/cuts/> (we used a single cut file)
+
+ <item><verbatim|transformers.py> : Fuel pipeline for transforming the
+ training dataset into structures usable by our model
+ </itemize>
+
+ <item><strong|<verbatim|data_analysis/*.py>> : scripts for various
+ statistical analyses on the dataset
+
+ <\itemize>
+ <item><verbatim|cluster_arrival.py> : the script used to generate the
+ mean-shift clustering of the destination points, producing the 3392
+ target points
+ </itemize>
+
+ <item><verbatim|model/*.py> : source code for the various models we tried
+
+ <\itemize>
+ <item><verbatim|__init__.py> contains code common to all the models,
+ including the code for embedding the metadata
+
+ <item><verbatim|mlp.py> contains code common to all MLP models
+
+ <item><verbatim|dest_mlp_tgtcls.py> containts code for our MLP
+ destination prediction model using target points for the output layer
+ </itemize>
+
+ <item><verbatim|error.py> contains the functions for calculating the
+ error based on the Haversine Distance
+
+ <item><verbatim|ext_saveload.py> contains a Blocks extension for saving
+ and reloading the model parameters so that training can be interrupted
+
+ <item><verbatim|ext_test.py> contains a Blocks extension that runs the
+ model on the test set and produces an output CSV submission file
+
+ <item><verbatim|train.py> contains the main code for the training and
+ testing
+ </itemize>
+
+ In the archive we have included only the files listed above, which are
+ strictly necessary for reproducing our results. More files for the other
+ models we have tried are available on GitHub at
+ <hlink|https://github.com/adbrebs/taxi|><hlink||https://github.com/adbrebs/taxi>.
+
+ <section|Dependencies>
+
+ We used the following packages developped at the MILA lab:
+
+ <\itemize>
+ <item><strong|Thano.> A general GPU-accelerated python math library, with
+ an interface similar to numpy (see [3, 4]).
+ <hlink|http://deeplearning.net/software/theano/|>
+
+ <item><strong|Blocks.> A deep-learning and neural network framework for
+ Python based on Theano. <hlink|https://github.com/mila-udem/blocks|>
+
+ <item><strong|Fuel.> A data pipelining framework for Blocks.
+ <hlink|https://github.com/mila-udem/fuel|>
+ </itemize>
+
+ We also used the <verbatim|scikit-learn> Python library for their
+ mean-shift clustering algorithm. <verbatim|numpy>, <verbatim|cPickle> and
+ <verbatim|h5py> are also used at various places.
+
+ <section|How To Generate The Solution>
+
+ <\enumerate>
+ <item>Set the <verbatim|TAXI_PATH> environment variable to the path of
+ the folder containing the CSV files.
+
+ <item>Run <verbatim|data/csv_to_hdf5.py> to generate the HDF5 file (which
+ is generated in <verbatim|TAXI_PATH>, along the CSV files). This takes
+ around 20 minutes on our machines.
+
+ <item>Run <verbatim|data/init_valid.py> to initialize the validation set
+ HDF5 file.
+
+ <item>Run <verbatim|data/make_valid_cut.py test_times_0> to generate the
+ validation set. This can take a few minutes.
+
+ <item>Run <verbatim|data_analysis/cluster_arrival.py> to generate the
+ arrival point clustering. This can take a few minutes.
+
+ <item>Create a folder <verbatim|model_data> and a folder
+ <verbatim|output> (next to the training script), which will recieve
+ respectively a regular save of the model parameters and many submission
+ files generated from the model at a regular interval.
+
+ <item>Run <verbatim|./train.py dest_mlp_tgtcls_1_cswdtx_alexandre> to
+ train the model. Output solutions are generated in <verbatim|output/>
+ every 1000 iterations. Interrupt the model with three consecutive Ctrl+C
+ at any times. The training script is set to stop training after 10 000
+ 000 iterations, but a result file produced after less than 2 000 000
+ iterations is already the winning solution. The training is quite long
+ though: we trained our model on a GeForce GTX 680 card and it took about
+ an afternoon to generate the winning solution.
+
+ When running the training script, set the following Theano flags
+ environment variable to exploit GPU parallelism:
+
+ <verbatim|THEANO_FLAGS=floatX=float32,device=gpu,optimizer=FAST_RUN>
+
+ Theano is only compatible with CUDA, which requires an Nvidia GPUs.
+ Training on the CPU is also possible but much slower.
+ </enumerate>
+
+ <section|Additional Comments and Observations>
+
+ The training examples fed to the model are not full trajectories, since
+ that would make no sense, but prefixes of those trajectories that are
+ generated on-the-fly by a Fuel transformer, <verbatim|TaxiGenerateSplits>,
+ whose code is available in <verbatim|data/transformers.py>. The data
+ pipeline is as follows:
+
+ <\itemize>
+ <item>Select a random full trajectory from the dataset
+
+ <item>Generate a maximum of 100 prefixes for that trajectory. If the
+ trajectory is smaller than 100 data points, generate all possible
+ prefixes. Otherwise, chose a random subset of prefixes. Keep the final
+ destination somewhere as it is used as a target for the training.
+
+ <item>Take only the 5 first and 5 last points of the trajectory.
+
+ <item>At this points we have a stream of prefixes sucessively taken from
+ different trajectories. We create batches of size 200 with the items of
+ the previous stream, taken in the order in which they come. The prefixes
+ generated from a single trajectory may end up in two sucessive batches,
+ or all in a single batch.
+ </itemize>
+
+ <section|References>
+
+ <\enumerate>
+ <item><label|gs_cit0>Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C.
+ (2003). A neural probabilistic language model.
+ <with|font-shape|italic|The Journal of Machine Learning Research>,
+ <with|font-shape|italic|3>, 1137-1155.
+
+ <item><label|gs_cit0>Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep
+ sparse rectifier neural networks. In <with|font-shape|italic|International
+ Conference on Artificial Intelligence and Statistics> (pp. 315-323).
+
+ <item><label|gs_cit0>Bergstra, J., Bastien, F., Breuleux, O., Lamblin,
+ P., Pascanu, R., Delalleau, O., ... & Bengio, Y. (2011). Theano: Deep
+ learning on gpus with python. In <with|font-shape|italic|NIPS 2011,
+ BigLearning Workshop, Granada, Spain>.
+
+ <item><label|gs_cit0>Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J.,
+ Goodfellow, I., Bergeron, A., ... & Bengio, Y. (2012). Theano: new
+ features and speed improvements. <with|font-shape|italic|arXiv preprint
+ arXiv:1211.5590>.
+ </enumerate>
+</body>
+
+<\initial>
+ <\collection>
+ <associate|page-medium|paper>
+ <associate|page-screen-margin|true>
+ </collection>
+</initial>
+
+<\references>
+ <\collection>
+ <associate|auto-1|<tuple|1|1>>
+ <associate|auto-10|<tuple|8|?>>
+ <associate|auto-2|<tuple|2|1>>
+ <associate|auto-3|<tuple|1|1>>
+ <associate|auto-4|<tuple|3|2>>
+ <associate|auto-5|<tuple|4|2>>
+ <associate|auto-6|<tuple|5|3>>
+ <associate|auto-7|<tuple|6|3>>
+ <associate|auto-8|<tuple|7|4>>
+ <associate|auto-9|<tuple|8|4>>
+ <associate|gs_cit0|<tuple|4|4>>
+ </collection>
+</references>
+
+<\auxiliary>
+ <\collection>
+ <\associate|table>
+ <tuple|normal|Embeddings and corresponding dimensions used by the
+ model|<pageref|auto-3>>
+ </associate>
+ <\associate|toc>
+ <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|1<space|2spc>Summary>
+ <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+ <no-break><pageref|auto-1><vspace|0.5fn>
+
+ <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|2<space|2spc>Feature
+ Selection/Extraction> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+ <no-break><pageref|auto-2><vspace|0.5fn>
+
+ <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|3<space|2spc>Modelling
+ Techniques and Training> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+ <no-break><pageref|auto-4><vspace|0.5fn>
+
+ <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|4<space|2spc>Code
+ Description> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+ <no-break><pageref|auto-5><vspace|0.5fn>
+
+ <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|5<space|2spc>Dependencies>
+ <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+ <no-break><pageref|auto-6><vspace|0.5fn>
+
+ <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|6<space|2spc>How
+ To Generate The Solution> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+ <no-break><pageref|auto-7><vspace|0.5fn>
+
+ <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|7<space|2spc>Additional
+ Comments and Observations> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+ <no-break><pageref|auto-8><vspace|0.5fn>
+
+ <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|8<space|2spc>References>
+ <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
+ <no-break><pageref|auto-9><vspace|0.5fn>
+ </associate>
+ </collection>
+</auxiliary>+
\ No newline at end of file