Modeling DAU with Markov Chain. How to predict DAU using Duolingo's… | by Vladimir Kukushkin | Dec, 2024

How to predict DAU using Duolingo's growth model and control the prediction

Vladimir Kukushkin

Towards Data Science

Arguably, DAU, WAU, and MAU — daily, weekly, and monthly active users — are the most important business metrics. The article "How Duolingo reignited user growth" by Jorge Mazal, former CPO of Duolingo, is #1 in the Growth section of Lenny's Newsletter blog. In that article, Jorge paid special attention to the methodology Duolingo used to model the DAU metric (see also "Meaningful metrics: how data sharpened the focus of product teams" by Erin Gustafson). This methodology has a number of strengths, but I'd like to focus on how to use this approach for DAU forecasting.

The new year is coming soon, so many companies are planning their budgets for the next year these days. Cost estimations often require DAU forecasts. In this article, I'll show how to get such a prediction using Duolingo's growth model. I'll explain why this approach is better than standard time-series forecasting methods and how to adjust the prediction according to your teams' plans (e.g., those of the marketing, activation, or product teams).

The article text goes along with the code, and a simulated dataset is attached, so the research is fully reproducible. The Jupyter notebook version is available here. In the end, I'll share a DAU "calculator" designed as a Google Spreadsheet.

I'll be narrating on behalf of the collective "we," as if we're working through this together.

A quick recap of how Duolingo's growth model works. On day d (d = 1, 2, …) of a user's lifetime, the user can be in one of the following 7 (mutually exclusive) states: new, current, reactivated, resurrected, at_risk_wau, at_risk_mau, or dormant. The states are defined according to indicators of whether the user was active today, within the last 7 days, or within the last 30 days. A summary of the definitions is given in the table below:
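(The original summary table is an image and isn't reproduced here; the reconstruction below is derived from the SQL state definitions in Section 3.3, whose windows exclude the current day.)

state          active today?            active in previous 6 days?   active in previous 29 days?
new            yes (registration day)   –                            –
current        yes                      yes                          any
reactivated    yes                      no                           yes
resurrected    yes                      no                           no
at_risk_wau    no                       yes                          any
at_risk_mau    no                       no                           yes
dormant        no                       no                           no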

Having defined these states (as a set S), we can treat user behavior as a Markov chain. Here's an example of a user's trajectory: new → current → current → at_risk_wau → … → at_risk_mau → … → dormant. Let M be the transition matrix associated with this Markov process: m_{i,j} = P(s_j | s_i) is the probability that a user moves to state s_j right after being in state s_i, where s_i, s_j ∈ S. Such a matrix is inferred from the historical data.

If we assume that user behavior is stationary (independent of time), the matrix M fully describes the states of all users in the future. Suppose that the vector u_0 of length 7 contains the counts of users in each state on a given day, denoted day 0. According to the Markov model, on the next day, day 1, we expect the following counts of users in each state:

u_1 = M^T * u_0
Applying this formula recursively, we derive the number of users in each state on any arbitrary day t > 0 in the future.
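To make the recursion concrete, here is a toy sketch with three states instead of seven and made-up transition probabilities (not the article's data):

import numpy as np

# u_{t+1} = M^T @ u_t, illustrated on a reduced state set
M = np.array([
    [0.0, 0.8, 0.2],   # new     -> (new, current, dormant)
    [0.0, 0.7, 0.3],   # current -> (new, current, dormant)
    [0.0, 0.1, 0.9],   # dormant -> (new, current, dormant)
])
u = np.array([100.0, 500.0, 10_000.0])  # state counts on day 0

for t in range(1, 4):
    u = M.T @ u
    print(f'day {t}:', u.round().astype(int))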

Besides the initial distribution u_0, we need to provide the number of new users appearing in the product on each future day. We'll treat this as a regular time-series forecasting problem.

Now, having u_t calculated, we can determine the DAU value on day t:

DAU_t = #New_t + #Current_t + #Reactivated_t + #Resurrected_t

Moreover, we can easily calculate the WAU and MAU metrics:

WAU_t = DAU_t + #AtRiskWau_t,
MAU_t = DAU_t + #AtRiskWau_t + #AtRiskMau_t.

Finally, here's the algorithm outline:

  1. For each prediction day t = 1, …, T, calculate the expected number of new users #New_1, …, #New_T.
  2. For each lifetime day of each user, assign one of the 7 states.
  3. Calculate the transition matrix M from the historical data.
  4. Calculate the initial state counts u_0 corresponding to day t = 0.
  5. Recursively calculate u_{t+1} = M^T * u_t.
  6. Calculate DAU, WAU, and MAU for each prediction day t = 1, …, T.

This section is dedicated to the technical aspects of the implementation. If you're interested in studying the model's properties rather than the code, you may skip it and go straight to Section 4.

3.1 Dataset

We use a simulated dataset based on the historical data of a SaaS app. The data is stored in the dau_data.csv.gz file and contains three columns: user_id, date, and registration_date. Each record indicates a day when a user was active. The dataset covers the activity of 51,480 users from 2020-11-01 to 2023-10-31. Additionally, data from October 2020 is included so that the user states can be calculated properly, since the at_risk_mau and dormant states require data from one month prior.

import pandas as pd

df = pd.read_csv('dau_data.csv.gz', compression='gzip')
df['date'] = pd.to_datetime(df['date'])
df['registration_date'] = pd.to_datetime(df['registration_date'])

print(f'Shape: {df.shape}')
print(f'Total users: {df["user_id"].nunique()}')
print(f'Data range: [{df["date"].min()}, {df["date"].max()}]')
df.head()

Shape: (667236, 3)
Total users: 51480
Data range: [2020-10-01 00:00:00, 2023-10-31 00:00:00]

This is what the DAU time-series looks like.

df.groupby('date').size().plot(title='DAU, historical');

Suppose that today is 2023-10-31 and we want to predict the DAU metric for the next year, 2024. We define a couple of global constants, PREDICTION_START and PREDICTION_END, which bound the prediction period.

PREDICTION_START = '2023-11-01'
PREDICTION_END = '2024-12-31'

3.2 Predicting the number of new users

Let's start with the new users prediction. We use the prophet library as one of the easiest ways to forecast time-series data. The new_users Series contains exactly this data. We extract it from the original df dataset by selecting the rows where the registration date equals the date.

new_users = df[df['date'] == df['registration_date']].groupby('date').size()
new_users.head()
date
2020-10-01 4
2020-10-02 4
2020-10-03 3
2020-10-04 4
2020-10-05 8
dtype: int64

prophet requires a time-series as a DataFrame with two columns, ds and y, so we reformat the new_users Series into the new_users_prophet DataFrame. Another thing we need to prepare is the future variable containing the days to predict: from prediction_start to prediction_end. This logic is implemented in the predict_new_users function below. The plot below illustrates predictions for both past and future periods.

import logging
import matplotlib.pyplot as plt
from prophet import Prophet

# suppress prophet logs
logging.getLogger('prophet').setLevel(logging.WARNING)
logging.getLogger('cmdstanpy').disabled = True

def predict_new_users(prediction_start, prediction_end, new_users_train, show_plot=True):
    """
    Forecasts a time-series of new users.

    Parameters
    ----------
    prediction_start : str
        Date in YYYY-MM-DD format.
    prediction_end : str
        Date in YYYY-MM-DD format.
    new_users_train : pandas.Series
        Historical data for the time-series preceding the prediction period.
    show_plot : boolean, default=True
        If True, a chart with the train and predicted time-series values is displayed.

    Returns
    -------
    pandas.Series
        Series containing the predicted values.
    """
    m = Prophet()

    new_users_train = new_users_train.loc[new_users_train.index < prediction_start]
    new_users_prophet = pd.DataFrame({
        'ds': new_users_train.index,
        'y': new_users_train.values
    })

    m.fit(new_users_prophet)

    periods = len(pd.date_range(prediction_start, prediction_end))
    future = m.make_future_dataframe(periods=periods)
    new_users_pred = m.predict(future)
    if show_plot:
        m.plot(new_users_pred)
        plt.title('New users prediction');

    new_users_pred = (
        new_users_pred
        .assign(yhat=lambda _df: _df['yhat'].astype(int))
        .rename(columns={'ds': 'date', 'yhat': 'count'})
        .set_index('date')
        .clip(lower=0)
        ['count']
    )

    return new_users_pred

new_users_pred = predict_new_users(PREDICTION_START, PREDICTION_END, new_users)

The new_users_pred Series stores the predicted number of new users.

new_users_pred.tail(5)
date
2024-12-27 52
2024-12-28 56
2024-12-29 71
2024-12-30 79
2024-12-31 74
Name: count, dtype: int64

3.3 Getting the states

In practice, most of these calculations are best executed as SQL queries against the database where the data is stored. Hereafter, we'll simulate such querying using the duckdb library.

We want to assign one of the 7 states to each day of a user's lifetime within the app. According to the definitions, for each day we need to consider at least the previous 30 days. This is where SQL window functions come in. However, since the df data contains only records for active days, we need to explicitly extend it to include the days when a user was not active. In other words, instead of this list of records:

user_id    date          registration_date
1234567    2023-01-01    2023-01-01
1234567    2023-01-03    2023-01-01

we'd like to get a list like this:

user_id    date          is_active    registration_date
1234567    2023-01-01    TRUE         2023-01-01
1234567    2023-01-02    FALSE        2023-01-01
1234567    2023-01-03    TRUE         2023-01-01
1234567    2023-01-04    FALSE        2023-01-01
1234567    2023-01-05    FALSE        2023-01-01
...        ...           ...          ...
1234567    2023-10-31    FALSE        2023-01-01

For clarity, we split the SQL query into several subqueries:

  • full_range: Create the full sequence of dates for each user.
  • dau_full: Get the complete list of both active and inactive records.
  • states: Assign one of the 7 states to each day of a user's lifetime.
import duckdb

DATASET_START = '2020-11-01'
DATASET_END = '2023-10-31'
OBSERVATION_START = '2020-10-01'

query = f"""
WITH
full_range AS (
    SELECT
        user_id,
        UNNEST(generate_series(greatest(registration_date, '{OBSERVATION_START}'), date '{DATASET_END}', INTERVAL 1 DAY))::date AS date
    FROM (
        SELECT DISTINCT user_id, registration_date FROM df
    )
),
dau_full AS (
    SELECT
        fr.user_id,
        fr.date,
        df.date IS NOT NULL AS is_active,
        registration_date
    FROM full_range AS fr
    LEFT JOIN df USING(user_id, date)
),
states AS (
    SELECT
        user_id,
        date,
        is_active,
        first_value(registration_date IGNORE NULLS) OVER (PARTITION BY user_id ORDER BY date) AS registration_date,
        SUM(is_active::int) OVER (PARTITION BY user_id ORDER BY date ROWS BETWEEN 6 PRECEDING AND 1 PRECEDING) AS active_days_back_6d,
        SUM(is_active::int) OVER (PARTITION BY user_id ORDER BY date ROWS BETWEEN 29 PRECEDING AND 1 PRECEDING) AS active_days_back_29d,
        CASE
            WHEN date = registration_date THEN 'new'
            WHEN is_active = TRUE AND active_days_back_6d BETWEEN 1 AND 6 THEN 'current'
            WHEN is_active = TRUE AND active_days_back_6d = 0 AND IFNULL(active_days_back_29d, 0) > 0 THEN 'reactivated'
            WHEN is_active = TRUE AND active_days_back_6d = 0 AND IFNULL(active_days_back_29d, 0) = 0 THEN 'resurrected'
            WHEN is_active = FALSE AND active_days_back_6d > 0 THEN 'at_risk_wau'
            WHEN is_active = FALSE AND active_days_back_6d = 0 AND IFNULL(active_days_back_29d, 0) > 0 THEN 'at_risk_mau'
            ELSE 'dormant'
        END AS state
    FROM dau_full
)
SELECT user_id, date, state FROM states
WHERE date BETWEEN '{DATASET_START}' AND '{DATASET_END}'
ORDER BY user_id, date
"""
states = duckdb.sql(query).df()

The query results are stored in the states DataFrame:

3.4 Calculating the transition matrix

Having obtained these states, we can calculate the state transition frequencies. In Section 4.3 we'll study how the prediction depends on the period over which the transitions are considered, so it's reasonable to pre-aggregate this data on a daily basis. The resulting transitions DataFrame contains the date, state_from, state_to, and cnt columns.
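The aggregation query itself isn't shown in this excerpt; a minimal sketch consistent with the states table built above might look as follows (pairing each day's state with the next day's state via LEAD is an assumption):

query = """
WITH pairs AS (
    SELECT
        user_id,
        date,
        state AS state_from,
        LEAD(state) OVER (PARTITION BY user_id ORDER BY date) AS state_to
    FROM states
)
SELECT date, state_from, state_to, COUNT(*) AS cnt
FROM pairs
WHERE state_to IS NOT NULL
GROUP BY date, state_from, state_to
ORDER BY date, state_from, state_to
"""
transitions = duckdb.sql(query).df()
transitions['date'] = pd.to_datetime(transitions['date'])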

Now we can calculate the transition matrix M. We implement the get_transition_matrix function, which accepts the transitions DataFrame and a pair of dates that bound the period over which the transitions are considered.
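The implementation of get_transition_matrix isn't included in the excerpt either; below is a minimal sketch under the same assumptions. It also defines states_order, the canonical ordering of the 7 states, which the later code relies on but which is never defined in the excerpt:

states_order = ['new', 'current', 'reactivated', 'resurrected',
                'at_risk_wau', 'at_risk_mau', 'dormant']

def get_transition_matrix(transitions, start_date, end_date):
    # sum the daily transition counts within the period...
    counts = (
        transitions.loc[transitions['date'].between(start_date, end_date)]
        .pivot_table(index='state_from', columns='state_to',
                     values='cnt', aggfunc='sum', fill_value=0)
        .reindex(index=states_order, columns=states_order, fill_value=0)
    )
    # ...and normalize each row into transition probabilities
    # (a state absent from the period would yield a NaN row)
    return counts.div(counts.sum(axis=1), axis=0)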

As a baseline, let's calculate the transition matrix for the whole year from 2022-11-01 to 2023-10-31.

M = get_transition_matrix(transitions, '2022-11-01', '2023-10-31')
M

Each row of a transition matrix sums to 1, since it represents the probabilities of moving from the given state to all the other states.
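As a quick check of this property (assuming M is the pandas DataFrame produced above):

import numpy as np

# every row of M should be a probability distribution over the next state
assert np.allclose(M.sum(axis=1), 1.0)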

3.5 Getting the initial state counts

The initial state counts are retrieved from the states DataFrame by the get_state0 function and the corresponding SQL query. The only argument of the function is the date for which we want to get the counts. We assign the result to the state0 variable.

def get_state0(date):
    query = f"""
    SELECT state, count(*) AS cnt
    FROM states
    WHERE date = '{date}'
    GROUP BY state
    """

    state0 = duckdb.sql(query).df()
    state0 = state0.set_index('state').reindex(states_order)['cnt']

    return state0

state0 = get_state0(DATASET_END)
state0
state
new               20
current          475
reactivated       15
resurrected       19
at_risk_wau      404
at_risk_mau     1024
dormant        49523
Name: cnt, dtype: int64

3.6 Predicting DAU

The predict_dau function below accepts all the previously defined variables required for the DAU prediction and produces the prediction for the date range defined by the start_date and end_date arguments.

def predict_dau(M, state0, start_date, end_date, new_users):
    """
    Predicts DAU over a given date range.

    Parameters
    ----------
    M : pandas.DataFrame
        Transition matrix representing user state changes.
    state0 : pandas.Series
        Counts of the initial user states.
    start_date : str
        Start date of the prediction period in 'YYYY-MM-DD' format.
    end_date : str
        End date of the prediction period in 'YYYY-MM-DD' format.
    new_users : int or pandas.Series
        The expected amount of new users for each day between `start_date` and `end_date`.
        If a Series, it should have dates as the index.
        If an int, the same amount is used for each day.

    Returns
    -------
    pandas.DataFrame
        DataFrame containing the predicted DAU, WAU, and MAU for each day in the date range,
        with columns for the different user states and the totals.
    """

    dates = pd.date_range(start_date, end_date)
    dates.name = 'date'
    dau_pred = []
    new_dau = state0.copy()
    for date in dates:
        new_dau = (M.transpose() @ new_dau).astype(int)
        if isinstance(new_users, int):
            new_users_today = new_users
        else:
            new_users_today = new_users.astype(int).loc[date]
        new_dau.loc['new'] = new_users_today
        dau_pred.append(new_dau.tolist())

    dau_pred = pd.DataFrame(dau_pred, index=dates, columns=states_order)
    dau_pred['dau'] = dau_pred['new'] + dau_pred['current'] + dau_pred['reactivated'] + dau_pred['resurrected']
    dau_pred['wau'] = dau_pred['dau'] + dau_pred['at_risk_wau']
    dau_pred['mau'] = dau_pred['dau'] + dau_pred['at_risk_wau'] + dau_pred['at_risk_mau']

    return dau_pred

dau_pred = predict_dau(M, state0, PREDICTION_START, PREDICTION_END, new_users_pred)
dau_pred

This is what the DAU prediction dau_pred looks like for the PREDICTION_START – PREDICTION_END period. Besides the predicted dau, wau, and mau columns, the output contains the number of users in each state for each prediction date.

Finally, we calculate the ground-truth values of DAU, WAU, and MAU (along with the user state counts), keep them in the dau_true DataFrame, and plot the predicted and true values together.

query = """
SELECT date, state, COUNT(*) AS cnt
FROM states
GROUP BY date, state
ORDER BY date, state;
"""

dau_true = duckdb.sql(query).df()
dau_true['date'] = pd.to_datetime(dau_true['date'])
dau_true = dau_true.pivot(index='date', columns='state', values='cnt')
dau_true['dau'] = dau_true['new'] + dau_true['current'] + dau_true['reactivated'] + dau_true['resurrected']
dau_true['wau'] = dau_true['dau'] + dau_true['at_risk_wau']
dau_true['mau'] = dau_true['dau'] + dau_true['at_risk_wau'] + dau_true['at_risk_mau']

dau_true.head()

pd.concat([dau_true['dau'], dau_pred['dau']]).plot(title='DAU, historical & predicted');
plt.axvline(PREDICTION_START, color='k', linestyle='--');

We've obtained a prediction, but so far it's not clear whether it's any good. In the next section, we'll evaluate the model.

4.1 Baseline model

First of all, let's check whether we really need to build a complex model to predict DAU. Wouldn't it be better to predict DAU as a regular time-series using the aforementioned prophet library? The predict_dau_prophet function below implements this. We apply some of the tweaks available in the library in order to make the prediction more accurate. In particular:

  • we use the logistic model instead of the linear one to avoid negative values;
  • we explicitly add monthly and yearly seasonality;
  • we remove the outliers;
  • we explicitly define a peak period in January and February as "holidays".
def predict_dau_prophet(prediction_start, prediction_end, dau_true, show_plot=True):
    # assigning peak days for the new year
    holidays = pd.DataFrame({
        'holiday': 'january_spike',
        'ds': pd.date_range('2022-01-01', '2022-01-31', freq='D').tolist() +
              pd.date_range('2023-01-01', '2023-01-31', freq='D').tolist(),
        'lower_window': 0,
        'upper_window': 40
    })

    m = Prophet(growth='logistic', holidays=holidays)
    m.add_seasonality(name='monthly', period=30.5, fourier_order=3)
    m.add_seasonality(name='yearly', period=365, fourier_order=3)

    train = dau_true.loc[(dau_true.index < prediction_start) & (dau_true.index >= '2021-08-01')]
    train_prophet = pd.DataFrame({'ds': train.index, 'y': train.values})
    # removing outliers
    train_prophet.loc[train_prophet['ds'].between('2022-06-07', '2022-06-09'), 'y'] = None
    train_prophet['new_year_peak'] = (train_prophet['ds'] >= '2022-01-01') & \
                                     (train_prophet['ds'] <= '2022-02-14')
    m.add_regressor('new_year_peak')
    # setting the logistic upper and lower bounds
    train_prophet['cap'] = dau_true.max() * 1.1
    train_prophet['floor'] = 0

    m.fit(train_prophet)

    periods = len(pd.date_range(prediction_start, prediction_end))
    future = m.make_future_dataframe(periods=periods)
    future['new_year_peak'] = (future['ds'] >= '2022-01-01') & (future['ds'] <= '2022-02-14')
    future['cap'] = dau_true.max() * 1.1
    future['floor'] = 0
    pred = m.predict(future)

    if show_plot:
        m.plot(pred);

    # converting the predictions to a suitable format
    pred = (
        pred
        .assign(yhat=lambda _df: _df['yhat'].astype(int))
        .rename(columns={'ds': 'date', 'yhat': 'count'})
        .set_index('date')
        .clip(lower=0)
        ['count']
        .loc[lambda s: (s.index >= prediction_start) & (s.index <= prediction_end)]
    )

    return pred

The fact that the code ends up rather sophisticated indicates that one can't simply apply prophet to the DAU time-series as is.

Hereafter, we test the prediction over several horizons: 3, 6, and 12 months. As a result, we get 3 test sets:

  • 3-month horizon: 2023-08-01 – 2023-10-31,
  • 6-month horizon: 2023-05-01 – 2023-10-31,
  • 1-year horizon: 2022-11-01 – 2023-10-31.

For each test set, we calculate the MAPE loss function.

from sklearn.metrics import mean_absolute_percentage_error

mapes = []
prediction_end = '2023-10-31'
prediction_horizon = [3, 6, 12]

for offset in prediction_horizon:
    prediction_start = pd.to_datetime(prediction_end) - pd.DateOffset(months=offset - 1)
    prediction_start = prediction_start.replace(day=1)
    pred = predict_dau_prophet(prediction_start, prediction_end, dau_true['dau'], show_plot=False)
    mape = mean_absolute_percentage_error(dau_true['dau'].reindex(pred.index), pred)
    mapes.append(mape)

mapes = pd.DataFrame({'horizon': prediction_horizon, 'MAPE': mapes})
mapes

The MAPE error turns out to be high: 18% – 35%. The fact that the shortest horizon has the highest error indicates that the model is tuned for long-term predictions. This is another inconvenience of the approach: we would have to tune the model for each prediction horizon. Anyway, this is our baseline. In the next section, we'll compare it with the more advanced model.

4.2 General evaluation

In this section, we evaluate the model implemented in Section 3.6. So far, we set the transition period to 1 year before the prediction start; we'll study how the prediction depends on the transition period in Section 4.3. As for the new users, we run the model with two options: the true values and the predicted ones. Similarly, we fix the same 3 prediction horizons and test the model on them.

The make_prediction helper function below implements the described options. It accepts the prediction_start and prediction_end arguments defining the prediction period for a given horizon, new_users_mode, which can be either true or predict, and transition_period. The options of the latter argument will be explained later.

import re

def make_prediction(prediction_start, prediction_end, new_users_mode='predict', transition_period='last_30d'):
    prediction_start_minus_1d = pd.to_datetime(prediction_start) - pd.Timedelta('1d')
    state0 = get_state0(prediction_start_minus_1d)

    if new_users_mode == 'predict':
        new_users_pred = predict_new_users(prediction_start, prediction_end, new_users, show_plot=False)
    elif new_users_mode == 'true':
        new_users_pred = new_users.copy()

    if transition_period.startswith('last_'):
        shift = int(re.search(r'last_(\d+)d', transition_period).group(1))
        transitions_start = pd.to_datetime(prediction_start) - pd.Timedelta(shift, 'd')
        M = get_transition_matrix(transitions, transitions_start, prediction_start_minus_1d)
        dau_pred = predict_dau(M, state0, prediction_start, prediction_end, new_users_pred)
    else:
        transitions_start = pd.to_datetime(prediction_start) - pd.Timedelta(240, 'd')
        M_base = get_transition_matrix(transitions, transitions_start, prediction_start_minus_1d)
        dau_pred = pd.DataFrame()

        month_starts = pd.date_range(prediction_start, prediction_end, freq='1MS')
        N = len(month_starts)

        for i, prediction_month_start in enumerate(month_starts):
            prediction_month_end = pd.offsets.MonthEnd().rollforward(prediction_month_start)
            transitions_month_start = prediction_month_start - pd.Timedelta('365D')
            transitions_month_end = prediction_month_end - pd.Timedelta('365D')

            M_seasonal = get_transition_matrix(transitions, transitions_month_start, transitions_month_end)
            if transition_period == 'smoothing':
                i = min(i, 12)
                M = M_seasonal * i / (N - 1) + (1 - i / (N - 1)) * M_base
            elif transition_period.startswith('seasonal_'):
                seasonal_coef = float(re.search(r'seasonal_(0\.\d+)', transition_period).group(1))
                M = seasonal_coef * M_seasonal + (1 - seasonal_coef) * M_base

            dau_tmp = predict_dau(M, state0, prediction_month_start, prediction_month_end, new_users_pred)
            dau_pred = pd.concat([dau_pred, dau_tmp])

            state0 = dau_tmp.loc[prediction_month_end][states_order]

    return dau_pred

def prediction_details(dau_true, dau_pred, show_plot=True, ax=None):
    y_true = dau_true.reindex(dau_pred.index)['dau']
    y_pred = dau_pred['dau']
    mape = mean_absolute_percentage_error(y_true, y_pred)

    if show_plot:
        prediction_start = str(y_true.index.min().date())
        prediction_end = str(y_true.index.max().date())
        if ax is None:
            y_true.plot(label='DAU true')
            y_pred.plot(label='DAU pred')
            plt.title(f'DAU prediction, {prediction_start} - {prediction_end}')
            plt.legend()
        else:
            y_true.plot(label='DAU true', ax=ax)
            y_pred.plot(label='DAU pred', ax=ax)
            ax.set_title(f'DAU prediction, {prediction_start} - {prediction_end}')
            ax.legend()
    return mape

In total, we have 6 prediction configurations: 2 options for the new users and 3 prediction horizons. The diagram below illustrates the results. The charts on the left relate to the new_users_mode = 'predict' option, while the ones on the right relate to the new_users_mode = 'true' option.

fig, axs = plt.subplots(3, 2, figsize=(15, 6))
mapes = []
prediction_end = '2023-10-31'
prediction_horizon = [3, 6, 12]

for i, offset in enumerate(prediction_horizon):
    prediction_start = pd.to_datetime(prediction_end) - pd.DateOffset(months=offset - 1)
    prediction_start = prediction_start.replace(day=1)
    args = {
        'prediction_start': prediction_start,
        'prediction_end': prediction_end,
        'transition_period': 'last_365d'
    }
    for j, new_users_mode in enumerate(['predict', 'true']):
        args['new_users_mode'] = new_users_mode
        dau_pred = make_prediction(**args)
        mape = prediction_details(dau_true, dau_pred, ax=axs[i, j])
        mapes.append([offset, new_users_mode, mape])

mapes = pd.DataFrame(mapes, columns=['horizon', 'new_users', 'MAPE'])
plt.tight_layout()

And here are the MAPE values summarizing the prediction quality:

mapes.pivot(index='horizon', columns='new_users', values='MAPE')

We notice several things.

  • In general, the model demonstrates much better results than the baseline. Indeed, the baseline relies on the historical DAU data only, while the model leverages the user states information.
  • However, for the 1-year horizon with new_users_mode='predict', the MAPE error is huge: 65%. This is 3 times higher than the corresponding baseline error (21%). On the other hand, the new_users_mode='true' option gives a much better result: 8%. This means that the new users prediction has a major impact on the model, especially for long-term predictions; for shorter periods the difference is less dramatic. The main reason for this gap is that the 1-year period includes Christmas with its extreme values. As a result, i) it is hard to predict such extreme new-user values, and ii) the period heavily impacts user behavior, the transition matrix, and, consequently, the DAU values. Hence, we strongly recommend implementing the new users prediction carefully. The baseline model was specially tuned for this Christmas period, so it is not surprising that it outperforms the Markov model here.
  • When the new users prediction is accurate, the model captures trends well. This means that using the last 12 months for the transition matrix calculation is a reasonable choice.
  • Interestingly, the true new users data produces worse results for the 3-month prediction. This is nothing but a coincidence: the wrong new users prediction in October 2023 reversed the predicted DAU trend and made the MAPE slightly lower.

Now, let's decompose the prediction error and see which states contribute the most. By error we mean the dau_pred - dau_true values; by relative error, (dau_pred - dau_true) / dau_true. See the left and right diagrams below, respectively. In order to focus on this aspect, we narrow the configuration down to the 3-month prediction horizon and the new_users_mode='true' option.

dau_component_cols = ['new', 'current', 'reactivated', 'resurrected']

dau_pred = make_prediction('2023-08-01', '2023-10-31', new_users_mode='true', transition_period='last_365d')
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

(
    dau_pred[dau_component_cols]
    .subtract(dau_true[dau_component_cols])
    .reindex(dau_pred.index)
    .plot(title='Prediction error by state', ax=ax1)
)

(
    dau_pred[['current']]
    .subtract(dau_true[['current']])
    .div(dau_true[['current']])
    .reindex(dau_pred.index)
    .plot(title='Relative prediction error (current state)', ax=ax2)
);

From the left chart, we see that the error is contributed mostly by the current state. This is not surprising, since this state contributes to DAU the most. The errors for the reactivated and resurrected states are rather low. Another interesting observation is that the error is mostly negative for the current state and mostly positive for the resurrected state. The former might be explained by the new users who appeared within the prediction period being more engaged than the users from the past. The latter means that, in reality, the resurrected users contribute to DAU less than the transition matrix expects, i.e., the dormant → resurrected conversion rate is overestimated.

As for the relative error, it makes sense to analyze it for the current state only, because the daily amounts of the reactivated and resurrected states are low, so their relative errors are high and noisy. The relative error for the current state varies between -25% and 4%, which is quite high. And since we have fixed the new users prediction, this error is explained by the transition matrix inaccuracy alone. In particular, the current → current conversion rate is about 0.8, which is high, so it contributes a lot to the error. Hence, if we want to improve the prediction, we should consider tuning this conversion rate first.

4.3 Transition period impact

In the previous section, we kept the transition period fixed: 1 year before the prediction start. Now we study how long this period should be for the most accurate prediction. We consider the same prediction horizons of 3, 6, and 12 months. In order to mitigate the noise from the new users prediction, we use the true values of the new users amount: new_users_mode='true'.

Here the variety of the transition_period argument comes into play. Its values follow the last_Nd pattern, where N stands for the number of days in the transition period. For each prediction horizon, we try 12 different transition periods of 1, 2, …, 12 months. Then we calculate the MAPE error for each option and plot the results.

result = []

for prediction_offset in prediction_horizon:
    prediction_start = pd.to_datetime(prediction_end) - pd.DateOffset(months=prediction_offset - 1)
    prediction_start = prediction_start.replace(day=1)

    for transition_offset in range(1, 13):
        dau_pred = make_prediction(
            prediction_start, prediction_end, new_users_mode='true',
            transition_period=f'last_{transition_offset*30}d'
        )
        mape = prediction_details(dau_true, dau_pred, show_plot=False)
        result.append([prediction_offset, transition_offset, mape])
result = pd.DataFrame(result, columns=['prediction_period', 'transition_period', 'mape'])

result.pivot(index='transition_period', columns='prediction_period', values='mape') \
    .plot(title='MAPE by prediction and transition period');

It turns out that the optimal transition period depends on the prediction horizon. Shorter horizons require shorter transition periods: the minimal MAPE error is achieved at transition periods of 1, 4, and 8 months for the 3-, 6-, and 12-month horizons, respectively. Apparently, this is because longer horizons contain seasonal effects that can be captured only by longer transition periods. It also seems that for the longer prediction horizons the MAPE curve is U-shaped, meaning that transition periods that are either too long or too short hurt the prediction. We'll develop this idea in the next section.

4.4 Obsolescence and seasonality

Still, fixing a single transition matrix to predict a whole year ahead doesn't seem like a good idea: such a model would be too rigid. Usually, user behavior varies with the seasons. For example, users who appear after Christmas might show shifts in behavior. Another typical situation is users changing their behavior in summer. In this section, we'll try to take such seasonal effects into account.

So we want to predict DAU for 1 year ahead, starting from November 2022. Instead of using a single transition matrix M_base, calculated over the last 8 months before the prediction start in line with the previous subsection's results (and labeled as the last_240d option below), we'll consider a mixture of this matrix and a seasonal one, M_seasonal. The latter is calculated on a monthly basis, lagging 1 year behind. For example, to predict DAU for November 2022 we define M_seasonal as the transition matrix for November 2021; then we shift the prediction horizon to December 2022 and calculate M_seasonal for December 2021, and so on.

In order to mix M_base and M_seasonal, we define the following two options.

  • seasonal_0.3: M = 0.3 * M_seasonal + 0.7 * M_base. The weight 0.3 was chosen as a local minimum after some experiments.
  • smoothing: M = i/(N-1) * M_seasonal + (1 - i/(N-1)) * M_base, where N is the number of months within the prediction period and i = 0, …, N-1 is the month index. The idea of this configuration is to gradually switch from the most recent transition matrix M_base to the seasonal ones as the prediction month moves away from the prediction start.
result = pd.DataFrame()
for transition_period in ['last_240d', 'seasonal_0.3', 'smoothing']:
    result[transition_period] = make_prediction(
        '2022-11-01', '2023-10-31',
        'true',
        transition_period
    )['dau']
result['true'] = dau_true['dau']
result['true'] = result['true'].astype(int)
result.plot(title='DAU prediction by different transition matrices');

mape = pd.DataFrame()
for col in result.columns:
    if col != 'true':
        mape.loc[col, 'mape'] = mean_absolute_percentage_error(result['true'], result[col])
mape

According to the MAPE errors, the seasonal_0.3 configuration provides the best results. Interestingly, the smoothing approach turned out to be even worse than last_240d. From the diagram above, we see that all three models start to underestimate the DAU values in July 2023, the smoothing model especially. It seems that the new users who started appearing in July 2023 are more engaged than the users from 2022. Probably the app was improved substantially, or the marketing team did a great job. As a result, the smoothing model, which relies heavily on the older transitions data from July 2022 – October 2022, fails harder than the other models.

4.5 Final solution

To sum things up, let's make the final prediction for the year 2024. We use the seasonal_0.3 configuration and the predicted values for the new users.

dau_pred = make_prediction(
    PREDICTION_START, PREDICTION_END,
    new_users_mode='predict',
    transition_period='seasonal_0.3'
)
dau_true['dau'].plot(label='true')
dau_pred['dau'].plot(label='seasonal_0.3')
plt.title('DAU, historical & predicted')
plt.axvline(PREDICTION_START, color='k', linestyle='--')
plt.legend();

In Section 4, we studied the model's performance from the prediction accuracy perspective. Now let's discuss the model from a practical standpoint.

Besides its poor accuracy, predicting DAU as a plain time-series (see Section 4.1) makes the approach very stiff. Essentially, it produces a prediction that fits the historical data best. In practice, when planning for the next year, we usually have certain expectations about the future. For example,

  • the marketing team is going to launch some new, more effective campaigns,
  • the activation team is planning to improve the onboarding process,
  • the product team will release some new features that will engage and retain users better.

Our model can take such expectations into account. For the examples above, we can adjust the new users prediction, the new → current conversion rate, and the current → current conversion rate, respectively. As a result, we can get a prediction that doesn't fit the historical data but is nevertheless more realistic. This property of the model isn't just flexibility; it's interpretability. You can easily discuss all these adjustments with stakeholders, and they can understand how the prediction works.
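As a minimal sketch of such an adjustment (the adjust_transition_rate helper is hypothetical, not part of the original code), we can bump a single conversion rate and renormalize the rest of the row so that it still sums to 1:

def adjust_transition_rate(M, state_from, state_to, multiplier):
    # hypothetical helper: scale one transition probability and rescale
    # the remaining probabilities of the row proportionally
    M = M.copy()
    new_p = min(M.loc[state_from, state_to] * multiplier, 1.0)
    others = M.loc[state_from].drop(state_to)
    M.loc[state_from, others.index] = others / others.sum() * (1 - new_p)
    M.loc[state_from, state_to] = new_p
    return M

# e.g., the product team expects the current -> current retention to grow by 5%
M_adjusted = adjust_transition_rate(M, 'current', 'current', 1.05)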

Another advantage of the model is that it doesn't require predicting whether each individual user will be active on a given day. Sometimes binary classifiers are used for this purpose. The downside of that approach is that such a classifier has to be applied to every user, including all the dormant ones, for every day of the prediction horizon, at a tremendous computational cost. In contrast, the Markov model requires only the initial amounts of the states (state0). Moreover, such classifiers are often black-box models: poorly interpretable and hard to adjust.

The Markov model also has limitations. As we've already seen, it's sensitive to the new users prediction: a wrong new users amount can easily wreck the whole prediction. Another problem is that the Markov model is memoryless, meaning it doesn't take a user's history into account. For example, it doesn't distinguish whether a current user is a newbie, an experienced user, or a reactivated/resurrected one, although the retention rates of these user types should be quite different. Also, as we discussed earlier, user behavior may differ by season, marketing source, country, and so on. So far, our model isn't able to capture these differences. This could, however, be a subject for further research: we could extend the model by fitting additional transition matrices for different user segments.

Finally, as promised in the introduction, we provide a DAU spreadsheet calculator. In the Prediction sheet, you'll need to fill in the initial states distribution row (marked in blue) and the new users prediction column (marked in red). In the Conversions sheet, you can adjust the transition matrix values. Remember that each row of the matrix must sum to 1.
