PLEASE NOTE: We'd prefer you send your answers to [email protected] rather than post them publicly in another Gist, or in comments on this Gist ;-)
The following is a series of questions related to skills that we commonly use in our projects.
This is not a timed or monitored test, and you may use any print or online resources you would like to assist you in your work.
Please submit your answers in the form of text, markdown, or other lightweight data formats. Please indicate clearly which parts of your response correspond to which question. If your answers span multiple files, please attach them as a zip file.
You have a simple and straightforward table which holds information about people. It has this structure:
```sql
people (
    id integer not null,    -- PK
    first_name text not null,
    last_name text not null,
    gender char not null,
    father_id integer,      -- FK: people.id
    mother_id integer,      -- FK: people.id
    age integer
);
```
Write down an SQL query which selects the count of all adult first cousins for a given `people.id`.
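For illustration only (one possible answer sketch, not part of the question itself), here is such a query checked against an in-memory SQLite database. "Adult" is assumed to mean `age >= 18`, `:person_id` stands for the given `people.id`, and the sample names are invented:

```python
import sqlite3

# First cousins = children of the siblings of the person's parents.
# Siblings are people sharing a father or a mother (so half-siblings of a
# parent also count here); the parents themselves are excluded.
COUSIN_COUNT = """
SELECT count(DISTINCT c.id)
FROM people me
JOIN people par ON par.id IN (me.father_id, me.mother_id)
JOIN people sib ON (sib.father_id = par.father_id OR sib.mother_id = par.mother_id)
               AND sib.id NOT IN (me.father_id, me.mother_id)
JOIN people c   ON c.father_id = sib.id OR c.mother_id = sib.id
WHERE me.id = :person_id
  AND c.age >= 18
"""

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE people (
  id integer not null, first_name text not null, last_name text not null,
  gender char not null, father_id integer, mother_id integer, age integer
);
-- grandparents 1,2; father 3 and mother 4; uncle 5 and aunt 6;
-- "me" is 7; one adult cousin (8) and one minor cousin (9)
INSERT INTO people VALUES
 (1,'G','F','M',NULL,NULL,80), (2,'G','M','F',NULL,NULL,78),
 (3,'Dad','X','M',1,2,50),     (4,'Mom','X','F',NULL,NULL,48),
 (5,'Uncle','X','M',1,2,52),   (6,'Aunt','X','F',NULL,NULL,51),
 (7,'Me','X','M',3,4,20),
 (8,'CousinAdult','X','F',5,6,25),
 (9,'CousinKid','X','M',5,6,10);
""")
(n,) = db.execute(COUSIN_COUNT, {"person_id": 7}).fetchone()
print(n)  # -> 1 (only the adult cousin is counted)
```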
You have designed a system which collects an event stream into a database table called `fact_event` for further processing. The table consists of `event_id`, `event_date`, `event_type_id` and other event-specific attributes. The table also has a B-tree index over the `event_date` field. You have just implemented a report which shows (`event_type_id`: count) statistics over `fact_event` table data for the last couple of years.
You use this query:

```sql
SELECT event_type_id, count(*)
FROM fact_event
WHERE event_date < now()
  AND event_date >= now() - interval '2 years'
GROUP BY event_type_id;
```
It works fast and smoothly on staging and uses a Bitmap Index Scan over the `event_date` index. But when you promote your changes to production, you notice that the query is unacceptably slow. You run `EXPLAIN` on the query and see that, instead of using a Bitmap Index Scan on the `event_date` index, the optimizer chooses a sequential scan (`Seq Scan`).

What could be a reason to prefer a table sequential scan over a bitmap index scan? What could you do in order to speed up the report?
Here you still have the `fact_event` table from the previous task, but now it has a really simple structure:
```sql
fact_event (
    event_date datetime,
    event_id integer,
    previous_event_id integer,
    referrer_host text
)
```

Primary key: `event_id`

Foreign key: `previous_event_id`, points to `fact_event.event_id`, nullable
So events in the table can either have a parent event or not. Let's call a chain of linked events `event_id_1 -> event_id_2 -> ... -> event_id_n` a session. Each session starts with an event whose `previous_event_id` is null. Now, for each event in the table you have to find out the referrer host of the very first event in that event's session. Let's also assume `(select count(*) from fact_event) ~ 5M`.

Please describe how you would solve this problem. What would be an effective solution from your perspective, and why?
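For illustration only, here is one possible approach tried out on SQLite: a recursive CTE that walks every chain up to its root, carrying the starting `event_id` along. (At ~5M rows you would want `previous_event_id` indexed; an alternative at that scale is to propagate the root's host iteratively, level by level.) The sample data is invented:

```python
import sqlite3

# For each event, follow previous_event_id until it is NULL; the row that
# terminates the walk holds the session's first referrer_host.
ROOT_HOST = """
WITH RECURSIVE chain(event_id, host, prev) AS (
  SELECT event_id, referrer_host, previous_event_id FROM fact_event
  UNION ALL
  SELECT chain.event_id, fe.referrer_host, fe.previous_event_id
  FROM chain JOIN fact_event fe ON fe.event_id = chain.prev
)
SELECT event_id, host AS first_referrer_host
FROM chain
WHERE prev IS NULL
ORDER BY event_id
"""

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE fact_event (
  event_date datetime, event_id integer PRIMARY KEY,
  previous_event_id integer, referrer_host text
);
-- one session 1 -> 2 -> 3, plus a standalone session 4
INSERT INTO fact_event VALUES
 ('2024-01-01', 1, NULL, 'google.com'),
 ('2024-01-01', 2, 1,    'example.com'),
 ('2024-01-01', 3, 2,    'example.com'),
 ('2024-01-02', 4, NULL, 'bing.com');
""")
rows = db.execute(ROOT_HOST).fetchall()
print(rows)  # -> [(1, 'google.com'), (2, 'google.com'), (3, 'google.com'), (4, 'bing.com')]
```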
You're working on a cool branch of the foobar project, and your branch contains two commits A and B. The git lineage is:
```
X -- Y -- Z        <-- master
      \
       A -- B      <-- your-cool-branch
```
You want to contribute the code back to the master foobar repo, but you realize there is a really dumb typo in one of your source code comments in commit A. You'd still like to submit the pull request as two commits A' and B', where A' is the fixed version of A, and B' is the same exact diff as B. How do you rewrite git history to make this happen? Is B' the same hash as B? Why or why not?
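One way to do it, sketched here as a runnable demo in a throwaway repository (repository layout, file names and messages are all invented): record the fix as a fixup commit and replay the branch with `git rebase -i --autosquash`. B' cannot keep B's hash, because a commit's hash covers its parent's hash; A' differs from A, so B' must differ from B:

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"
git init -q repo; cd repo
git config user.email dev@example.com
git config user.name "Dev"

echo upstream > base.txt; git add base.txt; git commit -qm "X"
git checkout -qb your-cool-branch
echo "// a relly dumb typo" > a.txt; git add a.txt; git commit -qm "A"
echo "more work" > b.txt; git add b.txt; git commit -qm "B"
B_BEFORE=$(git rev-parse HEAD)

# Fix the typo and record it as a fixup of commit A (HEAD~1 at this point).
echo "// a really dumb typo" > a.txt; git add a.txt
git commit -q --fixup=HEAD~1

# Replay the branch: --autosquash folds the fixup into A', then replays B
# as B'. GIT_SEQUENCE_EDITOR=true accepts the generated todo list, so no
# editor ever opens.
GIT_SEQUENCE_EDITOR=true git rebase -i --autosquash HEAD~3 >/dev/null 2>&1

B_AFTER=$(git rev-parse HEAD)
[ "$B_BEFORE" != "$B_AFTER" ] && echo "B' hash differs from B"
```

Equivalently, you can mark commit A as `edit` in a plain `git rebase -i master` and amend it by hand; the hash argument is the same either way.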
In one Unix command, find all of the files in `/usr/local` whose contents contain the word "aardvark" (case-insensitive), and list them all in order from most-recently created to least-recently created.
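For illustration, one possible pipeline, demonstrated on a temporary directory instead of `/usr/local` (the file names are invented). Note that classic Unix filesystems do not record creation time, so modification time (`ls -t`) is the usual stand-in:

```shell
set -eu
dir=$(mktemp -d)
echo "an AARDVARK lives here" > "$dir/older.txt"
echo "no match here"          > "$dir/ignored.txt"
sleep 1   # ensure distinct timestamps
echo "aardvark again"         > "$dir/newer.txt"

# -r recurse, -i case-insensitive, -l print matching file names only;
# ls -1t sorts the resulting paths newest-first by mtime
grep -ril aardvark "$dir" | xargs ls -1t
```

For file names containing spaces or newlines, the NUL-separated variant `grep -rilZ aardvark /usr/local | xargs -0 ls -1t` is the robust form.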
Take a look at the following:
```js
function *foo(x) {
    while (x < 4) {
        x += yield x;
    }
    return x;
}

var bar = foo(3);

console.log( bar.next(1) );
console.log( bar.next(1) );
console.log( bar.next(1) );
```
What is the output of this code snippet, and why does it behave in the way it does?
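For reference, here is the same snippet annotated with what each `next()` call does (the results shown are what Node.js prints):

```javascript
function* foo(x) {
  while (x < 4) {
    x += yield x;         // pauses here yielding x; resumes with next()'s argument
  }
  return x;
}

const bar = foo(3);       // creates the generator; no body code runs yet
console.log(bar.next(1)); // runs to the first yield: { value: 3, done: false } (this 1 is discarded)
console.log(bar.next(1)); // yield evaluates to 1, x becomes 4, loop exits: { value: 4, done: true }
console.log(bar.next(1)); // generator already finished: { value: undefined, done: true }
```

The argument to the first `next()` is always discarded, because there is no suspended `yield` expression yet for it to become the value of.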
You have an interesting and complex task: to analyse all available Blablacar offers for a given country and try to find a correlation between the route distance and the number of offers for the route per day. In order to determine the route distance you have decided to use the Google Distance Matrix API: a convenient way to get the distance between two cities. So you wrote a web scraper for the Blablacar website which parses all the offers and produces pairs `('City1', 'City2')`. Also, you have implemented a Google Distance Matrix API wrapper function:
```python
import googlemaps

gmaps = googlemaps.Client(key='Your_API_key')

def distance(departure, destination):
    """A Google Distance Matrix API wrapper; returns the distance between two cities in meters."""
    my_dist = gmaps.distance_matrix(departure, destination)['rows'][0]['elements'][0]
    return my_dist['distance']['value']
```
However, the Google Distance Matrix API is paid, and you definitely don't want to pay for redundant API calls: more than one offer can exist for the same pair of cities, and the City1-City2 distance is the same as the City2-City1 distance. Assuming that your solution is a one-time script, which improvements could you propose in order to reduce the money spent on Google Distance Matrix API calls in your code?
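For illustration only, a sketch of the usual fixes, with a fake `distance()` standing in for the paid API so the effect is measurable: normalise each pair so `(A, B)` and `(B, A)` share one cache key, then memoise, so each unordered city pair costs at most one API call:

```python
from functools import lru_cache

api_calls = 0  # instrumentation for the demo

def distance(departure, destination):
    """Fake stand-in for the paid API wrapper; counts how often it is hit."""
    global api_calls
    api_calls += 1
    return abs(hash(departure) - hash(destination)) % 1_000_000  # fake metres

@lru_cache(maxsize=None)
def _cached(pair):
    return distance(*pair)

def cheap_distance(a, b):
    # sorting makes the pair order-independent, halving distinct cache keys
    return _cached(tuple(sorted((a, b))))

for a, b in [("Paris", "Lyon"), ("Lyon", "Paris"),
             ("Paris", "Lyon"), ("Paris", "Nice")]:
    cheap_distance(a, b)

print(api_calls)  # -> 2 (four lookups, two unordered pairs, two API calls)
```

A plain `lru_cache` on `cheap_distance` itself would not merge `(A, B)` with `(B, A)`, which is why the sorted-tuple key matters. If the "one-time" script might be rerun, persisting the cache to disk (e.g. with `shelve`) avoids paying again on the next run.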