Having the following schema of model's predictions over time per user:
timestamp
user_id
model_id
version
prediction
2022-06-22 05:29:36.344
1
model_a
1
[0.1226]
2022-06-22 05:29:41.307
1
model_a
1
[0.932]
...
1
model_a
1
...
2022-06-22 05:29:43.511
2
model_a
1
[0.0226]
2022-06-22 05:29:43.870
2
model_a
1
[0.132]
...
2
model_a
1
...
I would like to select the last prediction per user (by timestamp), then take the result vector and calculate the boundary values of 100 percentiles. I have come out with the following query that get the job done:
with preds as
(select user_id, last_prediction
from
(select user_id, round(prediction[1],5) last_prediction, timestamp curr_ts, max(timestamp) over (partition by user_id) max_ts
FROM "my-schema"."my-table"
where date(timestamp) BETWEEN date('2022-08-08') AND date('2022-08-09')
AND version = '1' AND model_id = 'model_a')
where curr_ts = max_ts),
with_ntiles as
(select *,NTILE(100) OVER(ORDER BY last_prediction) calculated_ntile
from preds)
select calculated_ntile, min(last_prediction) min_pred, max(last_prediction) max_pred
from with_ntiles
group by 1 order by 1
Result:
#
calculated_ntile
min_pred
max_pred
1
1
0.00172
0.00261
2
2
0.00263
0.00305
3
3
0.00305
0.00345
...
...
...
...
I'm looking for a more elegant and maybe a faster solution.