

Image by Author
# Introduction
The technical screening for data science roles at FAANG companies can be very thorough. However, even these companies can't come up with an endless stream of unique interview questions. Once you've gone through the grind enough times, you start to notice that certain SQL patterns keep showing up.
Here are the top five, with examples and code (PostgreSQL) to practice on.


Image by Author | Napkin AI
Master these and you'll be ready for most SQL interviews.
# Pattern #1: Aggregating Data with GROUP BY
Using aggregate functions with GROUP BY allows you to aggregate metrics across categories.
This pattern is often combined with data filtering, which means using one of these two clauses:
- WHERE: Filters rows before the aggregation.
- HAVING: Filters groups after the aggregation.
Example: This Meta interview question asks you to find the total number of comments made 30 or fewer days before 2020-02-10, per user. Users with no comments should be excluded from the output.
We use the SUM() function with a GROUP BY clause to sum the number of comments per user. Outputting only the comments within the specified period is achieved by filtering the data before aggregation, i.e., using WHERE. There's no need to work out which date is "30 days before 2020-02-10" by hand; we simply subtract 30 days from that date using INTERVAL arithmetic.
```sql
SELECT user_id,
       SUM(number_of_comments) AS number_of_comments
FROM fb_comments_count
WHERE created_at BETWEEN '2020-02-10'::DATE - 30 * INTERVAL '1 day'
                     AND '2020-02-10'::DATE
GROUP BY user_id;
```
Here's the output.
| user_id | number_of_comments |
|---|---|
| 5 | 1 |
| 8 | 4 |
| 9 | 2 |
| … | … |
| 99 | 2 |
Business Use:
- User activity metrics: DAU & MAU, churn rate.
- Revenue metrics: revenue per region/product/time period.
- User engagement: average session length, average clicks per user.
# Pattern #2: Filtering with Subqueries
When using subqueries for filtering, you create a data subset, then filter the main query against it.
The two main subquery types are:
- Scalar subqueries: Return a single value, e.g., a maximum amount.
- Correlated subqueries: Reference and depend on the rows of the outer query to return their values.
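A scalar subquery is easy to demonstrate on toy data. Here's a minimal sqlite3 sketch (the `orders` table and its values are made up for illustration): the inner SELECT MAX(...) returns a single value that the outer query compares every row against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, cost INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 40), (2, 75), (3, 75), (4, 10)])

# The scalar subquery evaluates to one value (75), so the outer query
# keeps every order tied at the maximum cost.
top_orders = conn.execute(
    "SELECT id, cost FROM orders "
    "WHERE cost = (SELECT MAX(cost) FROM orders) "
    "ORDER BY id").fetchall()
print(top_orders)  # all orders at the maximum cost
```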
Example: This interview question from Meta asks you to create a recommendation system for Facebook. For each user, you should find pages that this user doesn't follow, but at least one of their friends does. The output should consist of the user ID and the ID of the page that should be recommended to this user.
The outer query returns all the user-page pairs where the page is followed by at least one friend.
Then, we use a subquery in the WHERE clause to remove the pages that the user already follows. There are two conditions in the subquery: one restricts it to the pages followed by this specific user (checks for this user only), and the other checks whether the page considered for recommendation is among those followed by the user (checks for this page only).
Since the subquery finds the pages the user already follows, using NOT EXISTS in WHERE excludes all those pages from the recommendations.
```sql
SELECT DISTINCT f.user_id,
       p.page_id
FROM users_friends f
JOIN users_pages p ON f.friend_id = p.user_id
WHERE NOT EXISTS
      (SELECT *
       FROM users_pages pg
       WHERE pg.user_id = f.user_id
         AND pg.page_id = p.page_id);
```
Here's the output.
| user_id | page_id |
|---|---|
| 1 | 23 |
| 1 | 24 |
| 1 | 28 |
| … | … |
| 5 | 25 |
Business Use:
- Customer activity: most recent login per user, latest subscription change.
- Sales: highest order per customer, top revenue order per region.
- Product performance: most purchased product in each category, highest-revenue product per month.
- User behaviour: longest session per user, first purchase per customer.
- Reviews & feedback: top reviewer, latest review for each product.
- Operations: latest shipment status per order, fastest delivery time per region.
# Pattern #3: Ranking with Window Functions
Window functions such as ROW_NUMBER(), RANK(), and DENSE_RANK() allow you to order rows within data partitions and then identify the first, second, or nth record.
Here's what each of these ranking window functions does:
- ROW_NUMBER(): Assigns a unique sequential number within each partition; tied values get different row numbers.
- RANK(): Assigns the same rank to tied values and skips the following ranks for the next non-tied value.
- DENSE_RANK(): Same as RANK(), only it doesn't skip ranks after ties.
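The differences only show up with ties, so here's a quick side-by-side sketch on invented data (Python's sqlite3; window functions require SQLite 3.25+, which ships with recent Python builds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 90), ("b", 90), ("c", 70)])

# Same ORDER BY in all three OVER() clauses; only the tie handling differs
rows = conn.execute("""
    SELECT player,
           ROW_NUMBER() OVER (ORDER BY points DESC) AS row_num,
           RANK()       OVER (ORDER BY points DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY points DESC) AS dense_rnk
    FROM scores
    ORDER BY points DESC, player
""").fetchall()

for player, row_num, rnk, dense_rnk in rows:
    print(player, row_num, rnk, dense_rnk)
# a and b tie at 90: ROW_NUMBER still hands out 1 and 2 (in arbitrary order),
# RANK gives both 1 and then jumps to 3 for c,
# DENSE_RANK gives both 1 and then 2 for c
```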
Example: In an Amazon interview question, we need to find the highest daily order cost between 2019-02-01 and 2019-05-01. If a customer has more than one order on a certain day, sum the order costs per day. The output should contain the customer's first name, the total cost of their order(s), and the date of the order.
In the first common table expression (CTE), we find the orders between the required dates and sum each customer's daily totals for each date.
In the second CTE, we use RANK() to rank customers by order cost in descending order for each date.
Now, we join the two CTEs to output the required columns and keep only the orders with the first rank assigned to them, i.e., the highest order.
```sql
WITH customer_daily_totals AS (
    SELECT o.cust_id,
           o.order_date,
           SUM(o.total_order_cost) AS total_daily_cost
    FROM orders o
    WHERE o.order_date BETWEEN '2019-02-01' AND '2019-05-01'
    GROUP BY o.cust_id, o.order_date
),
ranked_daily_totals AS (
    SELECT cust_id,
           order_date,
           total_daily_cost,
           RANK() OVER (PARTITION BY order_date ORDER BY total_daily_cost DESC) AS rnk
    FROM customer_daily_totals
)
SELECT c.first_name,
       rdt.order_date,
       rdt.total_daily_cost AS max_cost
FROM ranked_daily_totals rdt
JOIN customers c ON rdt.cust_id = c.id
WHERE rdt.rnk = 1
ORDER BY rdt.order_date;
```
Here's the output.
| first_name | order_date | max_cost |
|---|---|---|
| Mia | 2019-02-01 | 100 |
| Farida | 2019-03-01 | 80 |
| Mia | 2019-03-01 | 80 |
| … | … | … |
| Farida | 2019-04-23 | 120 |
Business Use:
- User activity: "Top 5 most active users last month".
- Revenue: "The second-highest revenue region".
- Product popularity: "Top 10 best-selling products".
- Purchases: "The first purchase of each customer".
# Pattern #4: Calculating Moving Averages & Cumulative Sums
The moving (rolling) average is the average over the last N rows, typically months or days. It's calculated using the AVG() window function with the window defined as ROWS BETWEEN N PRECEDING AND CURRENT ROW.
The cumulative sum (running total) is the sum from the first row up to the current row, which is reflected in defining the window as ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW in the SUM() window function.
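Both window frames can be sketched side by side on toy monthly data (a sqlite3 sketch with invented numbers; the frame syntax is the same in PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (month TEXT, amount INTEGER)")
conn.executemany("INSERT INTO revenue VALUES (?, ?)",
                 [("2020-01", 10), ("2020-02", 20), ("2020-03", 30), ("2020-04", 40)])

# mov_avg: average of the current and up to two preceding months (3-month window)
# run_tot: sum of everything from the first month up to the current one
rows = conn.execute("""
    SELECT month,
           AVG(amount) OVER (ORDER BY month
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS mov_avg,
           SUM(amount) OVER (ORDER BY month
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS run_tot
    FROM revenue
    ORDER BY month
""").fetchall()
for r in rows:
    print(r)
# the first two months average over fewer than three rows,
# since the frame is clipped at the start of the partition
```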
Example: This interview question from Amazon wants us to find the 3-month rolling average of total revenue from purchases. We should output the year-month (YYYY-MM) and the 3-month rolling average, sorted from the earliest to the latest month.
Also, returns (negative purchase values) shouldn't be included.
We use a subquery to calculate monthly revenue using SUM() and convert the purchase date to a YYYY-MM format with the TO_CHAR() function.
Then, we use AVG() to calculate the moving average. In the OVER() clause, we order the data by month and define the window as ROWS BETWEEN 2 PRECEDING AND CURRENT ROW; this gives the 3-month moving average, which takes into account the current and the previous two months.
```sql
SELECT t.month,
       AVG(t.monthly_revenue) OVER (ORDER BY t.month
                                    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg_revenue
FROM
  (SELECT TO_CHAR(created_at::DATE, 'YYYY-MM') AS month,
          SUM(purchase_amt) AS monthly_revenue
   FROM amazon_purchases
   WHERE purchase_amt > 0
   GROUP BY 1) AS t
ORDER BY t.month ASC;
```
Here's the output.
| month | avg_revenue |
|---|---|
| 2020-01 | 26292 |
| 2020-02 | 23493.5 |
| 2020-03 | 25535.666666666668 |
| … | … |
| 2020-10 | 21211 |
To calculate a cumulative sum, we'd do it like this.
```sql
SELECT t.month,
       SUM(t.monthly_revenue) OVER (ORDER BY t.month
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_sum
FROM
  (SELECT TO_CHAR(created_at::DATE, 'YYYY-MM') AS month,
          SUM(purchase_amt) AS monthly_revenue
   FROM amazon_purchases
   WHERE purchase_amt > 0
   GROUP BY 1) AS t
ORDER BY t.month ASC;
```
Here's the output.
| month | cum_sum |
|---|---|
| 2020-01 | 26292 |
| 2020-02 | 46987 |
| 2020-03 | 76607 |
| … | … |
| 2020-10 | 239869 |
Business Use:
- Engagement metrics: 7-day moving average of DAU or messages sent, cumulative cancellations.
- Financial KPIs: 30-day moving average of costs/conversions/stock prices, revenue reporting (cumulative YTD).
- Product performance: moving average of logins per user, cumulative app installs.
- Operations: cumulative orders shipped, tickets resolved, bugs closed.
# Pattern #5: Applying Conditional Aggregations
Conditional aggregation lets you compute multiple segmented metrics in a single pass by placing CASE WHEN statements inside aggregate functions.
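In its simplest form, the pattern looks like this (a sqlite3 sketch on an invented `events` table): each SUM() counts a different segment, all in one scan of the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, platform TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, "ios"), (1, "web"), (2, "ios"), (3, "android"), (3, "ios")])

# Each CASE WHEN contributes 1 only for rows in its segment, so one query
# produces one count per platform instead of three separate filtered queries
row = conn.execute("""
    SELECT SUM(CASE WHEN platform = 'ios'     THEN 1 ELSE 0 END) AS ios_events,
           SUM(CASE WHEN platform = 'android' THEN 1 ELSE 0 END) AS android_events,
           SUM(CASE WHEN platform = 'web'     THEN 1 ELSE 0 END) AS web_events
    FROM events
""").fetchone()
print(row)  # (ios, android, web) event counts
```

In PostgreSQL, the same idea can be written more compactly with an aggregate FILTER clause, e.g. COUNT(*) FILTER (WHERE platform = 'ios').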
Example: A question from an Amazon interview asks you to identify returning active users by finding users who made a second purchase within 1 to 7 days after their first purchase. The output should consist only of those users' IDs. Same-day purchases should be ignored.
The first CTE identifies the users and the dates of their purchases, excluding same-day duplicates by using the DISTINCT keyword.
The second CTE ranks each user's purchase dates from the oldest to the newest.
The last CTE finds the first and second purchases for each user by using conditional aggregation. We use MAX() to pick the single non-NULL value for the first and second purchase dates.
Finally, we use the result of the last CTE and retain only users who made a second purchase (non-NULL) within 7 days of their first purchase.
```sql
WITH daily AS (
    SELECT DISTINCT user_id,
           created_at::DATE AS purchase_date
    FROM amazon_transactions
),
ranked AS (
    SELECT user_id,
           purchase_date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_date) AS rn
    FROM daily
),
first_two AS (
    SELECT user_id,
           MAX(CASE WHEN rn = 1 THEN purchase_date END) AS first_date,
           MAX(CASE WHEN rn = 2 THEN purchase_date END) AS second_date
    FROM ranked
    WHERE rn <= 2
    GROUP BY user_id
)
SELECT user_id
FROM first_two
WHERE second_date IS NOT NULL
  AND (second_date - first_date) BETWEEN 1 AND 7
ORDER BY user_id;
```
Here's the output.
| user_id |
|---|
| 100 |
| 103 |
| 105 |
| … |
| 143 |
Business Use:
- Subscription reporting: paid vs. free users, active vs. churned users by plan tier.
- Marketing funnel dashboards: signed-up vs. purchased users by traffic source, emails opened vs. clicked vs. converted.
- E-commerce: completed vs. refunded vs. cancelled orders by region, new vs. returning buyers.
- Product analysis: iOS vs. Android vs. Web usage, feature adopted vs. not adopted counts per cohort.
- Finance: revenue from new vs. existing customers, gross vs. net revenue.
- A/B testing & experiments: control vs. treatment metrics.
# Conclusion
If you want a job at FAANG companies (and others, too), focus on these five SQL patterns for the interviews. Of course, they're not the only SQL concepts tested, but they are the most commonly tested ones. By focusing on them, you make sure your interview preparation is as efficient as possible for most SQL interviews at FAANG companies.
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.