Yet another SQL challenge

Or how to waste your life "properly"

Posted by yara_tchk on July 25, 2017

While trying to find another job in addition to the one I already have, I am constantly pissed off and in a state of permanent moral pain. There are many reasons for this misfortune, and you probably know them as well as I do.

One of those reasons is the huge, enormous test tasks that you are expected to complete for free, spending dozens of hours of serious effort, like writing the entire documentation for a large prototype mobile application. And of course, no guarantee of anything is provided. To make it more tangible, imagine your life as a stream pouring from a finite bottle of time down into a dark, deep chasm of void. Once the bottle is empty, you are dead. So why should I waste my life like that?

For some reason, none of my employers, including the current one, has ever asked me to do anything like that. So I conclude that a normal employer wouldn't ask you for something insane, and that from now on I probably shouldn't spend my time on insane test tasks anymore.

This article is about one example of that kind of test task. The task itself is really interesting and I actually enjoyed solving it. Still, I mostly wasted my life on it, and I didn't get the job.

The company that interviewed me is SkyEng. You may have heard of it if you live in Russia.

So they give you a single CSV file containing a simple table with three columns. One is the user's unique identification number, user_id for short; another is a page identification string, named simply "page"; and the last one is the time of the action (the action is either clicking the page or using some resource on it; it doesn't actually matter which). Here is an example from the file:

vbx = pd.read_csv("vimbox_pages_2.csv")
vbx.head()
 
Out[4]:
        57529  rooms.homework-showcase  2017-03-01T00:00:07.710000
0       57529  rooms.view.step.content  2017-03-01T00:00:10.275000
1       57529  rooms.view.step.content  2017-03-01T00:00:10.436000
2      168671  rooms.view.step.content  2017-03-01T00:00:12.035000
3      168671  rooms.view.step.content  2017-03-01T00:00:50.632000
4       64788  rooms.view.step.content  2017-03-01T00:01:21.460000

 

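For anyone who wants to reproduce this in Postgres rather than pandas: the preview above suggests the CSV has no header row (the first data row ended up as the column labels), so loading it into the table used in the queries below could look roughly like this. This is a sketch of mine, not part of the original task; the file path is a placeholder, and the column names are simply the ones I use later.

create table vimbox_pages_raw (
	user_id integer,
	page    text,
	dttm    timestamp
);

-- the CSV appears to have no header row, so plain FORMAT csv is enough
-- (or use \copy from psql if the file lives on the client)
copy vimbox_pages_raw (user_id, page, dttm)
from '/path/to/vimbox_pages_2.csv'
with (format csv);

-- extra column for the page coding step further down
alter table vimbox_pages_raw add column page_code integer;
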
Then they give you the definition of the term "session": a session is a series of actions with less than an hour between consecutive actions. The session begins with the first action in that series and ends an hour after the last action.

And what they want from you is to find all the sessions that contain certain pages, among all the other pages, in a specific order, say page 1, page 2 and page 3, but not necessarily one right after another. That was really confusing. I interpreted it as: "11112222333" is fine, but "111122221333" is not, because the order is broken. That was a mistake. Also, I solved part of the task in Python rather than SQL, because they use an old-fashioned service with an old version of Postgres that lacks powerful tools like recursive queries and JSON aggregation functions, and that was deemed unacceptable.

I was 100% sure I had solved the task correctly, but HR said I hadn't, and they refused to provide anything that would let me check my solution. That made me really angry.

So I found the email address of the guy who had checked my solution and wrote to him. I asked what was wrong with my solution and how the task could be solved in SQL without recursive SQL. He replied with a single session string that was missing from my output, and that's how I learned I had misunderstood that confusing part of the task. He also gave a hint about the SQL: window functions were enough to solve it. I immediately saw how it could be done with window functions, and that was the single useful thing I learned from this task, for which I am actually grateful.
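
If I understand the corrected interpretation right, "page 1, then page 2, then page 3" only has to hold somewhere inside the session string, with anything allowed in between, which is what the LIKE pattern in my final query below expresses. A quick check with the two example strings from above (both match under the looser reading, while my strict reading rejected the second one):

-- both return true: the looser reading accepts "111122221333" as well
select
	'11112222333'  like '%1%2%3%' as my_strict_example,
	'111122221333' like '%1%2%3%' as the_string_i_rejected;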

I argued that the part of the task about page order was confusing and that this was why I made the mistake. He said the ambiguity was intentional and that a good analyst should have spotted it. I would mostly agree with him, but only in a real working environment. Come on, it's a test task, it already eats too much time, and who am I supposed to ask about it, the HR person? Are you serious, guys? So I asked if I could resubmit the task if I managed to solve it with window functions, and he said yes, thanks to him for the kindness.

Yes, I managed to solve it. And I had an interview. Apparently, I wasn't good enough for them anyway. And they didn't say why. =)

Maybe it's because I didn't like their working environment, with its old-fashioned service and its limits on the tools for solving the task; I can only guess.

So here's how I solved that interesting task and wasted my life on it. This is the final version, without any messing around with Python, although Python is more convenient and faster for this task. Note that I use the current Postgres version (9.6) here, not the old one they use, because someone might find it helpful.


First, let's encode the target pages with their order and encode all the other pages with zero (I created a column for the page code):

update "public"."vimbox_pages_raw" set page_code = 0;
update "public"."vimbox_pages_raw" set page_code = 1 where page = 'rooms.homework-showcase';
update "public"."vimbox_pages_raw" set page_code = 2 where page = 'rooms.view.step.content';
update "public"."vimbox_pages_raw" set page_code = 3 where page = 'rooms.lesson.rev.step.content';
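
A side note, not part of my original submission: if you'd rather not modify the raw table, the same coding can be done on the fly with a CASE expression, for example in a view. The view name here is made up.

-- hypothetical alternative: derive page_code on the fly instead of updating the table
create view vimbox_pages_coded as
select
	user_id,
	dttm,
	case page
		when 'rooms.homework-showcase'       then 1
		when 'rooms.view.step.content'       then 2
		when 'rooms.lesson.rev.step.content' then 3
		else 0
	end as page_code
from vimbox_pages_raw;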
 

Now we use a set of built-in window functions to build a session string: all the page codes (actions) of a session, concatenated in the proper order. We use rank (not strictly necessary, I guess, since we have the datetime, but I like it that way), lag, first_value and last_value. The hardest part for me was creating the sub-windows that correspond to the different sessions of the same user; thanks to a solution from Stack Overflow for that, it is great. In the end we just apply a LIKE pattern filter to find our target sessions (a proper regular expression would work too, see the note after the query).
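
Before the full query, here is the session-labelling trick in isolation, the way I understand it. This is a minimal sketch: it keeps a running sum per user, while the full query below uses a single global running sum, which comes to the same thing once you group by user_id and label.

-- flag every action whose gap from the previous action of the same user exceeds one hour,
-- then turn the flags into a session label with a running sum
select
	user_id,
	dttm,
	page_code,
	sum(is_new_session) over (partition by user_id order by dttm) as session_label
from
	(select
		user_id,
		dttm,
		page_code,
		case when dttm - lag(dttm) over (partition by user_id order by dttm) > interval '1 hour'
			then 1 else 0 end as is_new_session
	from vimbox_pages_raw) flagged;

The full query combines this labelling with first_value/last_value to get the session boundaries and with string_agg to build the session string: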

select 
	user_id, 
	"first_value" as session_begins, 
	"last_value" + interval '1 hour' as session_ends, 
	string_agg(page_code::varchar, '' order by acrank) as session_string
from
	-- sub3: attach the first and last timestamp of each session (user_id + label) to every action
	(select 
		user_id, 
		acrank, 
		page_code, 
		label, 
		last_value(dttm) over (partition by user_id, label order by acrank rows between unbounded preceding and unbounded following) as last_value,
		first_value(dttm) over (partition by user_id, label order by acrank rows between unbounded preceding and unbounded following) as first_value
	from
		-- sub2: a running sum of the "new session" flags produces the session label
		(select 
			user_id, 
			acrank, 
			page_code, 
			dttm,  
			sum(isnewgroup) over (order by user_id, acrank) as label
		from
			-- sub1: flag an action as the start of a new session when the gap from the previous action exceeds one hour
			(select 
				user_id, 
				acrank, 
				page_code, 
				dttm,
				(case when (dttm - lag(dttm) over (partition by user_id order by user_id, acrank) > interval '1 hour') then 1 else 0 end) as isnewgroup
			from
				-- sub: rank each user's actions by time
				(select 
					user_id, 
					dttm, 
					page_code,
					rank() over (partition by user_id order by dttm) as acrank
				from vimbox_pages_raw) 
				sub) 
			sub1) 
		sub2
	order by user_id, acrank, label) 
	sub3
group by 1, 2, 3
having string_agg(page_code::varchar, '' order by acrank) like '%1%2%3%'

 

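A small note on terminology: '%1%2%3%' is a LIKE pattern, not a true regular expression. Postgres also has the regex match operator ~, which does the same job here and leaves room for more intricate conditions later; for instance, these two filters are equivalent:

-- the LIKE pattern and the regex express the same "1, then 2, then 3, in this order" condition
select
	'01120233' like '%1%2%3%' as like_match,   -- true
	'01120233' ~ '1.*2.*3'    as regex_match;  -- true

So the having clause above could equally end with ~ '1.*2.*3'.
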
I am pretty glad that I found a solution. It is general, so you can filter sessions with any pattern or regular expression. There is another, simpler solution that joins the table onto itself twice, but I think it is a dumb, one-case solution, although it works.
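
For completeness, here is roughly what I mean by that self-join variant. This is my own reconstruction, not the reviewer's code, and it assumes the per-action session labels from the query above have been materialized into a hypothetical table vimbox_events_labeled(user_id, label, dttm, page_code).

-- hardcoded to one specific pattern: a page 1, later a page 2, later a page 3, inside one session
select distinct e1.user_id, e1.label
from vimbox_events_labeled e1
join vimbox_events_labeled e2
	on e2.user_id = e1.user_id and e2.label = e1.label and e2.dttm > e1.dttm
join vimbox_events_labeled e3
	on e3.user_id = e1.user_id and e3.label = e1.label and e3.dttm > e2.dttm
where e1.page_code = 1
	and e2.page_code = 2
	and e3.page_code = 3;

It answers exactly this one question; change the required pattern and you have to rewrite the joins, which is why I prefer the string_agg version.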

So I've complained here a lot about wasting my time, even though this time I did learn something. But most of the time I don't, and most of the time such tasks are time-consuming, dull and boring, and you can't even write about them. So here is an example of a great, interesting and time-consuming task that I could write about. As I said before, someone might find it helpful.

Still pissed off about it and triggered though...