First of all - brace yourselves and put on your tin-foil hats - this article is not particularly friendly towards the hosts of the recent Power Laws challenges on Driven Data. That aside, we applied new models and learned a lot, as usual =)
-1 Negative TLDR - the business / community part
Recently there were 3 electricity-consumption-related challenges (anomaly detection, forecasting and optimization) hosted on Driven Data - a platform that I have otherwise praised for its nice challenges, datasets and simplicity (here and here). This time something obviously went wrong - it looks like corporate bureaucracy could not pay enough attention to host decent challenges. Let me explain why. If you do not want to read the whole explanation, the short version is that only one of the three was a decent challenge, and the rest were just there for some reason.
First of all, although my colleague and I managed to claim ~16th place on both the public and private LB in the forecasting challenge (our position remained more or less the same in spite of a severe shake-up), we have a serious bone to pick with the hosts of these challenges. Also, as I have noted many times, ML competitions seem to be losing their appeal, because hosts do not bring a decent combination of dataset quality, data-science expertise, proper organization and proper prizes to the table. It seems that after the acquisition of Kaggle by Google, companies are starting to use the platform for marketing purposes, and/or business is finally understanding the real applications of ML and uses competitions just to annotate data and get some stacked models for pocket change. A recent case where a US$1m competition's winning solution was ... 10+ stacked ResNets was also laughable.
So, one by one, here is why the organization of the challenges sucked:
- Anomaly detection
- Some "proprietary algorithm" was used for annotation. In reality, all the reasonable people who tried to come up with a decent anomaly detection pipeline (as opposed to just probing the leaderboard by submitting one value at a time) reported that honestly nothing worked. Nice. It quickly became apparent that the whole leaderboard consisted of such "probing", so I stopped following it;
- In practice there were only 3 (THREE, yes, three) time series, compared to ~5,000 in the forecasting challenge. Why could they not apply the same "proprietary" algorithm to, say, 1,000 time series?;
- Also, the fact that there was a prize for "the best report" heavily implies that the hosts did not really have confidence in their data;
- Optimization challenge
- With no clear metric, the final decisions were to be made by a "panel of judges" - which simply means the whole thing was rubbish;
- Forecasting challenge
- With ~5,000 time series and a mostly reasonable metric (and you can be sure that the annotation is correct - you just slice the time series into windows, and boom, you have train and test data), it seemed mostly OK;
- But while they went as far as saying that each forecast_id (time series) would be evaluated separately, they did not delete the information about how these time series were sliced, i.e. you could just interpolate (see the image below);
- There was no mention of a "model freeze", dockerization, a 2-stage competition or any other decent method of preventing the use of future data and/or interpolation - just some basic rules on the forum;
- So you could not really tell whether the LB contained real or fake scores ... also, the shake-up between the public and private scores was severe;
- As a consequence, there was a much lower incentive to try as many models as possible;
An example of how the data was sliced
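To make the interpolation loophole concrete, here is a toy sketch with invented data (not the competition files): if two sliced windows from the same meter overlap in time, the "future" a model is supposed to forecast can simply be stitched together and interpolated.

```python
import numpy as np
import pandas as pd

# Hypothetical illustration of the slicing leak: window A knows hours
# 0-23 and must "forecast" hours 24-35; window B, sliced from the same
# meter, happens to contain hours 30-47 as known history.
idx = pd.date_range("2017-01-01", periods=48, freq="H")
true = pd.Series(np.sin(np.arange(48) / 4.0), index=idx)

known_a = true.iloc[:24]    # history of window A
known_b = true.iloc[30:]    # history of window B (overlaps A's "future")

# Stitch together everything known somewhere and interpolate the small
# remaining gap (hours 24-29) - no forecasting model needed at all.
stitched = pd.concat([known_a, known_b]).reindex(idx)
leaked = stitched.interpolate(method="time")

print(int(leaked.isna().sum()))  # 0 - the whole "test" span is filled
```

The interpolation error over the gap is tiny compared to the signal amplitude, which is exactly why the lack of a 2-stage setup made the LB scores hard to trust.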
0 Positive TLDR - the data science / ML part and what we learned
In a nutshell, you simply cannot predict highly cyclical energy consumption patterns with standard plain-vanilla linear models (AR, ARMA, ARIMA, etc.) for 2 reasons:
- Each time series would require manual fiddling with the model, which is intractable (there are 5000 of them);
- In 75% of cases these time series do not really look like AR processes;
So, obviously, ensembles of random forests (which basically over-fit micro-trends) and deep RNNs with embeddings spring to mind. On this task, we established the following approximate pattern:
Linear models (AR) << naive time-series-level MLP << tree ensembles (XGB, LightGBM, CatBoost, random forests from sklearn) << deep RNNs with embeddings to learn categorical data representations
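For readers unfamiliar with the embedding trick mentioned above, here is a minimal numpy sketch (dimensions and names are made up, not our model) of what "embeddings for categorical data" boil down to: each category id indexes a row of a trainable matrix, and that dense vector replaces a one-hot encoding.

```python
import numpy as np

# Hypothetical sizes: 5,000 meters, 16-dimensional embeddings.
n_sites, emb_dim = 5000, 16
rng = np.random.default_rng(42)
# In a real network this table is a trainable parameter (nn.Embedding).
embedding = rng.normal(0, 0.1, size=(n_sites, emb_dim))

site_ids = np.array([3, 3, 4999, 17])   # categorical inputs for a batch
dense = embedding[site_ids]             # lookup -> shape (4, 16)

# The dense vectors are concatenated with numeric features
# (e.g. temperature, lags) and fed into the rest of the network.
numeric = rng.normal(size=(4, 8))
model_input = np.concatenate([dense, numeric], axis=1)
print(model_input.shape)                # (4, 24)
```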
I am also tempted to say that you can swap the RNNs for dilated convolutions (they produced stellar local validation results) - but for some reason they did not fare well on the LB.
This is a brief depiction of the models we ended up using to produce our best submission (my colleague did not use past data, but his model design was very similar):
1 The ugly - anomaly detection
As I mentioned, there were only 3 time series in this part of the challenge (wat?). I applied a panel of 10 methods to it (from naive random forests that estimate the unexplained residual, to moving averages, exponential moving averages, etc.) and then produced an average metric (or embedded the scores into 1-2 dimensions with PCA). This produced stellar visualizations (datashader is awesome, BTW), but did not match their "proprietary algorithm", which was either much cruder or relied on a lot of hand-coded rules.
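A rough sketch of what such a panel of detectors can look like - this is a toy reconstruction of the idea, not my actual pipeline: several simple per-point outlier scores are stacked into a vector, which is then embedded into 2D with PCA.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy series: a daily-ish cycle plus noise, with one planted anomaly.
rng = np.random.default_rng(0)
s = pd.Series(np.sin(np.arange(2000) / 24) + rng.normal(0, 0.1, 2000))
s.iloc[500] += 5.0

# A small panel of detectors; each column is one outlier score.
scores = pd.DataFrame({
    "ma_resid":  (s - s.rolling(24, center=True).mean()).abs(),
    "ema_resid": (s - s.ewm(span=24).mean()).abs(),
    "zscore":    ((s - s.mean()) / s.std()).abs(),
    "diff":      s.diff().abs(),
}).fillna(0.0)

# Embed the score vectors into 2D (what I plotted with datashader).
emb = PCA(n_components=2).fit_transform(scores.values)
idx = int(np.argmax(np.linalg.norm(emb, axis=1)))
print(idx)  # near 500 - the planted anomaly stands out
```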
Note also that this is not only my judgment - the majority of people from the ODS.ai community arrived at the same conclusion independently.
It produced a very nice visualization, though =)
A 2D embedding of my 10D outlier vectors. Looks cool - but is useless AF
Also, anomaly detection in time series is far from the most popular topic on the Internet, but I found a couple of decent links, all of which more or less revolve around moving averages and exponential moving averages:
- http://alumni.cs.ucr.edu/~ratana/SSDBM05.pdf => essentially, this is for very LARGE datasets, and it replaces the moving average with hashing, which in principle is more or less the same;
- this awesome method in pandas;
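Assuming the pandas method in question is `Series.ewm`, a common recipe is to flag points that escape an exponentially weighted mean ± k·std band (the span and threshold below are arbitrary toy values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series(rng.normal(0, 1, 500))
s.iloc[250] = 12.0                      # planted anomaly

# Exponentially weighted mean and std; shift(1) so each point is
# compared only against its own past, not itself.
mu = s.ewm(span=50).mean().shift(1)
sd = s.ewm(span=50).std().shift(1)
flags = (s - mu).abs() > 4 * sd         # outside the +/- 4 std band

print(bool(flags.iloc[250]))            # True - the spike is flagged
```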
2 The bad - time series forecasting
See some of the time series for yourself - highly cyclical, sometimes really erratic, and sometimes with no pattern at all.
I guess the picture above and the code will tell you more, but in a nutshell, I just applied an ensemble of random forest regressors and a deep RNN with embeddings and an encoder-decoder architecture. You may want to read a more in-depth review of XGB vs. LightGBM vs. CatBoost, but here is my comparison from my Telegram channel.
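To illustrate the tree-ensemble half of that pipeline, here is a self-contained toy sketch (synthetic data and an assumed feature set, not our actual features): the series is turned into a supervised problem with lagged values plus calendar features, and a random forest regresses the next point.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic hourly series with a daily cycle, 60 days long.
rng = np.random.default_rng(0)
t = np.arange(24 * 60)
y = np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=t.size)

df = pd.DataFrame({"y": y, "hour": t % 24, "dow": (t // 24) % 7})
for lag in (1, 2, 24, 168):             # hourly, daily and weekly lags
    df[f"lag_{lag}"] = df["y"].shift(lag)
df = df.dropna()

X, target = df.drop(columns="y"), df["y"]
split = len(df) - 48                    # hold out the last two days
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X.iloc[:split], target.iloc[:split])

mae = np.abs(model.predict(X.iloc[split:]) - target.iloc[split:]).mean()
print(round(float(mae), 3))             # well below the signal amplitude of 1
```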
I also tried dilated convolution models, but they did not generalize well to the public LB.
3 Useful links and papers
- Naive PyTorch + LSTMs for simple time series, and the same on the forums (1, 2);
- More in-depth articles about LSTMs and time series - notice the simplistic tasks (1, 2);
- The best article I saw so far on this topic (I based my solution mostly on it) - Wikipedia traffic prediction, first place on Kaggle;
If anything, the Wikipedia-traffic article tells us a couple of things (I tried using the author's code, but it is a mix of TF / Keras / CUDA, which is not really reusable per se):
- RNNs (LSTMs and GRUs) can model only "short" time series, i.e. 100-200 points;
- Just introducing lagged features beats attention (and it is much easier) - but one-step-at-a-time prediction becomes painful, so it is reasonable to predict seq2seq (the whole horizon at once) in this case;
- Just plain RNNs can handle only very simplistic cases;
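The seq2seq point above can be illustrated even without an RNN: instead of recursively feeding each one-step prediction back in (which is where the pain comes from), train a model that emits the whole horizon in one shot. A toy sketch of this, with a multi-output random forest standing in for the decoder (my analogy, not the article's code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic hourly series with a daily cycle.
rng = np.random.default_rng(0)
t = np.arange(3000)
y = np.sin(2 * np.pi * t / 24) + 0.05 * rng.normal(size=t.size)

window, horizon = 48, 24
n = len(y) - window - horizon
X = np.stack([y[i:i + window] for i in range(n)])            # input lags
Y = np.stack([y[i + window:i + window + horizon] for i in range(n)])

# RandomForestRegressor natively supports multi-output targets,
# so one fit() learns the entire 24-step horizon.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X[:-1], Y[:-1])

pred = model.predict(X[-1:])            # whole horizon in a single call
print(pred.shape)                       # (1, 24)
```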
I also tried this approach, swapping the FCN layers and the encoder and decoder in the network for dilated convolution layers. It worked locally, but not on the leaderboard.
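For reference, a dilated-convolution encoder of the kind I mean can be sketched in PyTorch like this (an assumed WaveNet-style architecture, not my exact model): doubling the dilation at every layer lets 6 layers with kernel size 2 see a receptive field of 64 time steps.

```python
import torch
import torch.nn as nn

class DilatedEncoder(nn.Module):
    """Stack of causal dilated 1D convolutions replacing an RNN encoder."""

    def __init__(self, channels: int = 32, layers: int = 6):
        super().__init__()
        convs, in_ch = [], 1
        for i in range(layers):
            d = 2 ** i                          # dilation 1, 2, 4, ..., 32
            convs.append(nn.Conv1d(in_ch, channels, kernel_size=2,
                                   dilation=d, padding=d))
            in_ch = channels
        self.convs = nn.ModuleList(convs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time); trimming the right side of the symmetric
        # padding keeps each layer causal (no peeking at the future).
        for conv in self.convs:
            x = torch.relu(conv(x))[..., :x.size(-1)]
        return x

x = torch.randn(8, 1, 192)                      # e.g. 192 hourly points
enc = DilatedEncoder()
print(enc(x).shape)                             # torch.Size([8, 32, 192])
```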
I also noticed that when I added a third layer of GRUs to the model, it worked much better locally, but not on the LB. Visually, the predictions also shifted to the left:
On the chart above you can compare the predictions of the encoder-decoder + MLP model (red), the random forest ensemble (orange), my colleague's model (green) and a deeper RNN.
BTW, my colleague also used an encoder-decoder model with embeddings, but he used only metadata to predict the current values, without past data.
4 Models, code, files
At first I wanted to do a proper release, but since I was disappointed after having spent a decent amount of time (the first proper time-series challenge seemed intriguing at first), I will just post my final code and pre-processed data. This is a step down in quality compared to the satellite competition, but I will make up for it later =)