A few notes on social phenomena descriptions available in social networks (vk.com)

Or the best features are not what you expect

Posted by snakers41 on December 27, 2017

This article in a nutshell - one set of best descriptors (among a couple more you will see below)


  • This is more of a summary than a tutorial;
  • I did a small side project on predicting some social phenomena (I cannot divulge the exact topic and the code) using scraped social network data;
  • So far it looks like I did it more or less well: ROC AUC is ~0.9 on train and ~0.7 on validation, though with a very small amount of annotated data (5-7k items); a small survey (100+ people) also mostly agreed with my predictions;
  • Unsurprisingly the best features are:
    • The most plain indicators (age, how many friends / audios you have - activity and popularity counts);
    • Or the features that best correlate with your actual latest actions where you have intent (in my case - group reposts, your friend / network metrics - and probably likes will be better, but they are harder to collect) - nobody gives a shit about data you have not updated in your profile in years;
    • Natural log of your id in the social network is a good feature (it is least prone to trickery and may indicate age / tech savviness / social group etc);
  • In a nutshell social networks are much more mundane and harsh than you may think. Herd mentality, 90% of content being stupid shit and reposts, shitty popular public groups and posts at their finest;
  • Despite what the top solutions of the best 2017 tabular data Kaggle competitions may suggest, you actually need very moderate hardware for such tasks (if you live in practical reality);
  • A bit of inside information - as of 2017, clever work with vk.com API timeouts and access to API tokens with high limits actually lets you get the necessary data legally and w/o any botnets or proxy lists;
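To make the "plain indicators" point concrete, here is a minimal sketch of how such features could be built with pandas. The column names (`id`, `friends_count`, `audios_count`) are hypothetical, not the actual vk.com field names, and the idea that a larger id means a newer account is my assumption about why the log of the id works.

```python
import numpy as np
import pandas as pd


def add_plain_features(profiles: pd.DataFrame) -> pd.DataFrame:
    """Add log(user id) and clean up the plain counters (hypothetical column names)."""
    out = profiles.copy()
    # Assumption: ids grow with registration date, so log(id) is a rough proxy for account age
    out["log_id"] = np.log(out["id"].astype(float))
    # Missing counters are treated as zero activity
    counter_cols = ["friends_count", "audios_count"]
    out[counter_cols] = out[counter_cols].fillna(0).astype(int)
    return out


# Toy data, not real profiles
profiles = pd.DataFrame({
    "id": [1, 1000, 450000000],
    "friends_count": [150, None, 42],
    "audios_count": [None, 300, 12],
})
features = add_plain_features(profiles)
print(features)
```

The log just squashes the huge range of ids into something a tree model or a linear model can split on comfortably.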

This is probably also a good reason why the best people (and sometimes even rich people) use social media only for professional reasons. If you want to have a strong voice - best be faceless and voiceless.

For this reason, btw, oppressive companies and countries fear third-party messengers so much. I personally advocate Telegram - it is fast, easy to use and cross-platform, with a vibrant community and a plethora of simple yet efficient features (I have no idea about its security, though). Also, as of 2017 the messenger itself has not sold out, but the majority of public channels are kind of cringe now.

1 So, what is the fuss about?

Some random guys asked me to finish off a prediction of a social phenomenon for ~200k+ vk.com users (2 other people started it, with different coding styles I have mixed feelings about =) ). The data was scraped from vk.com public API. I did the task and in doing  so I was able to compare some of the best performing features.
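Scraping through a public API mostly comes down to not hammering it. Below is a minimal sketch of a throttled fetch loop; it is not the code used in the project. `fetch_one` is a hypothetical stand-in for whatever function calls the API, and the default of 3 calls per second is an arbitrary illustration, not a documented vk.com limit.

```python
import time
from typing import Callable, Iterable, List


def fetch_throttled(
    user_ids: Iterable[int],
    fetch_one: Callable[[int], dict],
    max_calls_per_second: float = 3.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call fetch_one for each id, keeping a minimum interval between calls."""
    min_interval = 1.0 / max_calls_per_second
    results = []
    last_call = None
    for user_id in user_ids:
        if last_call is not None:
            # Wait out whatever is left of the minimum interval since the previous call
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(fetch_one(user_id))
    return results


def fake_fetch(user_id: int) -> dict:
    """Hypothetical stand-in for a real API call."""
    return {"id": user_id, "friends_count": user_id % 7}


rows = fetch_throttled(range(5), fake_fetch, max_calls_per_second=50.0)
print(rows)
```

The clock and sleep functions are passed in only so that the loop can be tested without actually waiting.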

Also, I remember that a year ago some people were trying to sell me some bs that such tasks are really difficult. No, they are not. IT/ML fields have a veneer of exclusivity, like a reserved club, when in fact they are nothing of the sort.

Scraping each user on vk.com daily for 5 years is difficult. The data science and ETL parts are not.

2 Why is it useful?

Well, for 2 basic reasons:

  • Some ideas for your company's ML algorithms;
  • Your data is not yours and being private is real currency nowadays. Just pause and ponder at:
    • This abomination related to your facebook privacy;
    • This lawsuit between a social network scoring company and vk.com;
    • The fact that I have personally heard people claiming to be doing such data collection for law enforcement agencies;

You probably get my point.

3 My stack + a couple of tricks

Basically you just need a set of plain vanilla ML-libraries and this XGBoost tutorial. That's it.

Basic libraries:

  • xgboost
  • pandas
  • numpy
  • scipy
  • sklearn
  • matplotlib

Also read this series for simple ETL jobs in pandas before jumping into something like this. Arguably you can do all the ETL with various methods (extended SQL, functional style scripts, shell scripts, even some higher level python libraries like Luigi or Dask), but with the leaps in modern PC power, chances are that a single powerful workstation (plus probably something along the lines of bcolz if there is not enough RAM or you are HDD bound) will be more than enough.
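For a flavour of what "simple ETL in pandas" means here, a minimal sketch follows. The table layout and column names (`user_id`, `is_repost`, `source_group_id`) are made up for illustration; the real scraped data will look different.

```python
import pandas as pd

# Hypothetical raw table: one row per wall post
posts = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "is_repost": [True, True, False, True, False, False],
    "source_group_id": [10, 10, None, 20, None, None],
})

# Per-user activity counters
counters = posts.groupby("user_id").agg(
    posts_count=("is_repost", "size"),
    reposts_count=("is_repost", "sum"),
).reset_index()
counters["repost_share"] = counters["reposts_count"] / counters["posts_count"]

# Long-format user x group repost counts, a convenient input for a sparse matrix later
reposts = posts[posts["is_repost"]]
user_group_counts = (
    reposts.groupby(["user_id", "source_group_id"])
    .size()
    .rename("n_reposts")
    .reset_index()
)

print(counters)
print(user_group_counts)
```

Everything here is a couple of groupby calls, which is why a single workstation goes a long way.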

I ended up using a slightly modified version of the training script in the tutorial above:

# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
#Import libraries:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import model_selection, metrics   #Additional sklearn functions (cross_validation was renamed to model_selection)
from sklearn.model_selection import GridSearchCV   #Performing grid search
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
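# Assumption: the argument list and the middle xgb.cv arguments follow the tutorial linked above
def modelfit_w_test(alg, dtrain, dtest, predictors, target,
                    useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        xgtest = xgb.DMatrix(dtest[predictors].values)
        cvresult = xgb.cv(xgb_param,
                          xgtrain,
                          num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds,
                          metrics='auc',
                          early_stopping_rounds=early_stopping_rounds,
                          verbose_eval = False)
        #Use the number of boosting rounds chosen by CV (as in the tutorial)
        alg.set_params(n_estimators=cvresult.shape[0])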
    #Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain[target],eval_metric='auc')
    #Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]
    #Print model report:
    print ("\nModel Report")
    print ("Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions))
    print ("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain[target], dtrain_predprob))
    # Predict on testing data:
    dtest_predprob = alg.predict_proba(dtest[predictors])[:,1]
    # results = test_results.merge(dtest[['ID','predprob']], on='ID')
    print ('AUC Score (Test): %f' % metrics.roc_auc_score(dtest[target], dtest_predprob))
    # Newer xgboost versions expose the booster via get_booster() instead of booster()
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
    return alg

Also I can share my Dockerfile for the n-th time (also links to relevant telegram channel posts 1 2 3) - but it is more geared towards CNNs:

# Shell commands for installing jupyter notebook extensions (run separately, not part of the Dockerfile)
# sometimes pip install does not work on some systems
# installation from source solves the problem
git clone https://github.com/ipython-contrib/jupyter_contrib_nbextensions.git
pip install -e jupyter_contrib_nbextensions
jupyter contrib nbextension install --system
# or install via pip from repository
pip install git+https://github.com/ipython-contrib/jupyter_contrib_nbextensions
jupyter contrib nbextension install --system

# The Dockerfile itself
FROM nvidia/cuda:8.0-cudnn6-devel
RUN apt-get update && apt-get install -y openssh-server
RUN apt-get install -y unrar-free && \
    apt-get install -y p7zip-full
RUN mkdir /var/run/sshd
RUN echo 'root:Ubuntu@41' | chpasswd
RUN sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
# SSH login fix. Otherwise user is kicked off after login
RUN sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
ENV NOTVISIBLE "in users profile"
RUN echo "export VISIBLE=now" >> /etc/profile
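ENV CONDA_DIR /opt/conda
# Put conda on PATH for the RUN / CMD steps below (they do not source /etc/profile)
ENV PATH $CONDA_DIR/bin:$PATH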
# writing env variables to /etc/profile as mentioned here https://docs.docker.com/engine/examples/running_ssh_service/#run-a-test_sshd-container
RUN echo "export CONDA_DIR=/opt/conda" >> /etc/profile
RUN echo "export PATH=$CONDA_DIR/bin:$PATH" >> /etc/profile
RUN mkdir -p $CONDA_DIR && \
    echo export PATH=$CONDA_DIR/bin:'$PATH' > /etc/profile.d/conda.sh && \
    apt-get update && \
    apt-get install -y wget git libhdf5-dev g++ graphviz openmpi-bin nano && \
    wget --quiet https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-x86_64.sh && \
    echo "c59b3dd3cad550ac7596e0d599b91e75d88826db132e4146030ef471bb434e9a *Miniconda3-4.2.12-Linux-x86_64.sh" | sha256sum -c - && \
    /bin/bash /Miniconda3-4.2.12-Linux-x86_64.sh -f -b -p $CONDA_DIR && \
    ln /usr/lib/x86_64-linux-gnu/libcudnn.so /usr/local/cuda/lib64/libcudnn.so && \
    ln /usr/lib/x86_64-linux-gnu/libcudnn.so.6 /usr/local/cuda/lib64/libcudnn.so.6 && \
    ln /usr/include/cudnn.h /usr/local/cuda/include/cudnn.h  && \
    rm Miniconda3-4.2.12-Linux-x86_64.sh
RUN echo "export NB_USER=keras" >> /etc/profile
RUN echo "export NB_UID=1000" >> /etc/profile
RUN echo "export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH" >> /etc/profile
RUN echo "export CPATH=/usr/include:/usr/include/x86_64-linux-gnu:/usr/local/cuda/include:$CPATH" >> /etc/profile
RUN echo "export LIBRARY_PATH=/usr/local/cuda/lib64:/lib/x86_64-linux-gnu:$LIBRARY_PATH" >> /etc/profile
RUN echo "export CUDA_HOME=/usr/local/cuda" >> /etc/profile
RUN echo "export CPLUS_INCLUDE_PATH=$CPATH" >> /etc/profile
RUN echo "export KERAS_BACKEND=tensorflow" >> /etc/profile
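# NB_USER / NB_UID are only exported in /etc/profile above, which RUN steps do not source
ENV NB_USER keras
ENV NB_UID 1000
RUN useradd -m -s /bin/bash -N -u $NB_UID $NB_USER && \
    mkdir -p $CONDA_DIR && \
    chown keras $CONDA_DIR -R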
USER keras
RUN  mkdir -p /home/keras/notebook
# Python
ARG python_version=3.5
RUN conda install -y python=${python_version} && \
    pip install --upgrade pip && \
    pip install tensorflow-gpu && \
    conda install Pillow scikit-learn notebook pandas matplotlib mkl nose pyyaml six h5py && \
    conda install theano pygpu bcolz && \
    pip install keras kaggle-cli lxml opencv-python requests scipy tqdm visdom imgaug && \
    conda install pytorch torchvision cuda80 -c soumith && \
    conda clean -yt
# try alternative approach - 
RUN pip install jupyter_contrib_nbextensions && \
    pip install 'html5lib==0.9999999' && \
    jupyter contrib nbextension install --user
ENV LD_LIBRARY_PATH /usr/local/cuda/lib64:/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
ENV CPATH /usr/include:/usr/include/x86_64-linux-gnu:/usr/local/cuda/include:$CPATH
ENV LIBRARY_PATH /usr/local/cuda/lib64:/lib/x86_64-linux-gnu:$LIBRARY_PATH
ENV CUDA_HOME /usr/local/cuda
WORKDIR /home/keras/notebook
EXPOSE 8888 6006 22 8097
CMD jupyter notebook --port=8888 --ip= --no-browser

4 Input data + best performing features

There were ~200k+ users in the sample and ~5-7k annotated users. There were several major types of input data:

  • Profile features;
  • Friends profiles;
  • Songs and videos;
  • Users' posts;

I measured feature performance by running a model using only a particular set of features.
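The idea of scoring one feature set at a time can be sketched as below. This is not the original code: to keep the example light it uses sklearn's GradientBoostingClassifier instead of XGBoost, synthetic data instead of real profiles, and cross-validated ROC AUC instead of the train AUC quoted in this post. The group names are made up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def score_feature_groups(X, y, groups, cv=3, seed=0):
    """Return {group name: mean cross-validated ROC AUC} using only that group's columns."""
    scores = {}
    for name, cols in groups.items():
        model = GradientBoostingClassifier(n_estimators=50, random_state=seed)
        auc = cross_val_score(model, X[:, cols], y, cv=cv, scoring="roc_auc")
        scores[name] = float(auc.mean())
    return scores


# Synthetic data: two columns that carry signal and two columns of pure noise
rng = np.random.RandomState(0)
n = 600
y = rng.randint(0, 2, size=n)
informative = y[:, None] + rng.normal(scale=1.0, size=(n, 2))
noise = rng.normal(size=(n, 2))
X = np.hstack([informative, noise])

scores = score_feature_groups(X, y, {"informative": [0, 1], "noise": [2, 3]})
for name, auc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {auc:.2f}")
```

Scoring each group in isolation ignores interactions between groups, but it is a cheap way to see which data source is worth collecting at all.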

Here are the most notable feature sets, roughly sorted by their performance on the binary classification task:

  • Repost counts from 150k groups projected into a 50-dimensional space using SVD - AUC Score (Train): 0.80;
  • Populated numerical profile features (age, log of user id, friend / video / photo counters) - AUC Score (Train): 0.76;
  • Friend counters and extended friend counters (see images below) - AUC Score (Train): 0.73 / 0.76;
  • 5-dimensional audio and video embeddings - ~0.65 AUC;
  • Categorical profile features - AUC Score (Train): 0.69;
  • Plain wall activity counters - AUC Score (Train): 0.65;
  • Surprisingly, time / activity related features and manual group and post annotation features performed very poorly;
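The best performing feature set above (repost counts projected with SVD) can be sketched roughly like this. It is a toy illustration with random data and a much smaller matrix than the real ~200k users x 150k groups; using sklearn's TruncatedSVD on a scipy sparse matrix is my choice for the sketch, not necessarily what the original pipeline used.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD


def repost_embeddings(user_idx, group_idx, counts, n_users, n_groups,
                      n_components=50, seed=0):
    """Build a sparse user x group repost-count matrix and project it with truncated SVD."""
    # Duplicate (user, group) pairs are summed by the csr_matrix constructor
    matrix = csr_matrix(
        (counts, (user_idx, group_idx)),
        shape=(n_users, n_groups),
        dtype=np.float32,
    )
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    return svd.fit_transform(matrix), svd


# Toy data: 200 users, 300 groups, 2000 random repost records
rng = np.random.RandomState(0)
n_users, n_groups, n_records = 200, 300, 2000
user_idx = rng.randint(0, n_users, n_records)
group_idx = rng.randint(0, n_groups, n_records)
counts = rng.randint(1, 5, n_records)

embeddings, svd = repost_embeddings(user_idx, group_idx, counts, n_users, n_groups)
print(embeddings.shape)
print(svd.explained_variance_ratio_.sum())
```

Each user ends up with a dense 50-number vector that can go straight into the same tabular model as the other features.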

Numerical features from profiles

Extended friend counters

Friend counters

5 What did not work / I did not try

Well, with info on 10M user posts you would think that unleashing the raw power of LSTMs with pre-trained embeddings would help ... but 90%+ of this content is just public page reposts, which in turn are well covered by a plain SVD matrix decomposition.