Popularity

How does it work in Catalog?

What is popularity?

The popularity of a table or dashboard tells you how frequently it is used by human users.

Popularity is a score assigned to each data asset, ranging from 1 to 1 million, which is then reduced to a 5-star scale.

The computation is quantile-based: each asset is ranked by its number of queries (for tables) or number of views (for dashboards) against all tables or dashboards of the same source.

Table queries from specific users, usually bots or services, can be excluded through the extraction settings.

How is it computed?

  • Ranking: We sort the assets by their usage count (number of queries or views).

  • Bucketing: We place each asset into a bucket according to its rank. This is the "global scoring".

    • There are 8 buckets of varying size with the following thresholds:

      Bucket 0: assets ranked between 0% -> 33% -- Bottom 33%
      Bucket 1: assets ranked between 33% -> 48%
      Bucket 2: assets ranked between 48% -> 63%
      Bucket 3: assets ranked between 63% -> 73%
      Bucket 4: assets ranked between 73% -> 83%
      Bucket 5: assets ranked between 83% -> 93%
      Bucket 6: assets ranked between 93% -> 98%
      Bucket 7: assets ranked between 98% -> 100% -- Top 2%

  • In-bucket ranking: Within a single bucket, we sort the assets again and spread them evenly across that bucket.

  • Final score: Finally, for all assets of a given source, we compute a score out of the maximum popularity (MAX).

    • With MAX being 1 000 000

    • With the number of buckets being 8

  • From the number to the stars 💫

    • Everything before this step produces a score that we store in our database. A separate process in the frontend then converts that score into the number of stars you see.

    • As mentioned, the popularity can range from 0 to 1 000 000 (or be undefined). We then bucket it into 11 states, corresponding to stars and half stars: 0, 0.5, 1, ..., 4.5, 5.
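
To make the ranking, bucketing, in-bucket ranking and final score steps concrete, here is a minimal Python sketch. It is an illustration, not the actual implementation: the function name `popularity_scores`, the way ties are broken, and the in-bucket formula (position in the bucket divided by the bucket size) are assumptions. Only the bucket thresholds, MAX and the number of buckets come from this page.

```python
from collections import defaultdict

MAX_SCORE = 1_000_000
# Upper rank bound (in percent) of each of the 8 buckets listed above.
BUCKET_UPPER_PCT = [33, 48, 63, 73, 83, 93, 98, 100]


def popularity_scores(usage):
    """Map {asset_name: query_or_view_count} to {asset_name: score}."""
    # Ranking: least used first. Ties keep their input order (stable sort),
    # which is an assumption about how ties are broken.
    ranked = sorted(usage, key=lambda name: usage[name])
    total = len(ranked)

    # Bucketing: place each asset by its rank percentile.
    buckets = defaultdict(list)
    for position, name in enumerate(ranked, start=1):
        bucket = next(
            index
            for index, upper in enumerate(BUCKET_UPPER_PCT)
            if position * 100 <= upper * total
        )
        buckets[bucket].append(name)

    # In-bucket ranking and final score (assumed formula):
    # score = MAX * (bucket index + position in bucket / bucket size) / 8
    n_buckets = len(BUCKET_UPPER_PCT)
    scores = {}
    for bucket, names in buckets.items():
        for in_bucket_rank, name in enumerate(names, start=1):
            fraction = in_bucket_rank / len(names)
            scores[name] = round(MAX_SCORE * (bucket + fraction) / n_buckets)
    return scores
```

Calling `popularity_scores({"orders": 120, "users": 45, "logs": 3})` returns one score per asset, with the most used asset of the source at 1 000 000. Because the buckets are filled by rank and not by raw count, assets with identical counts can still land in different buckets under this sketch.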

Which queries are used for the computation?

First of all, popularity is calculated on 30 days of activity following the last refresh of your source.

Then there are a few exclusions that allow us to determine a more accurate popularity:

  • We only use read queries, because popularity should reflect usage, not updates.

  • We exclude queries that are extremely large or too small.

  • We exclude service accounts from the calculation, as we want to measure human usage and behavior. Queries run by service accounts are reflected in the lineage instead, where you'll find the parent/child assets.
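
The three exclusions above can be pictured as a simple filter. The sketch below is hypothetical: the `Query` fields, the `size` measure and the `min_size` / `max_size` thresholds are placeholders, since this page does not say how query size is measured or where the limits sit.

```python
from dataclasses import dataclass


@dataclass
class Query:
    user: str      # account that ran the query
    is_read: bool  # True for read queries, False for writes or updates
    size: int      # hypothetical size measure; the real metric is not documented here


def counts_toward_popularity(query, service_accounts, min_size, max_size):
    """Return True if the query passes all three exclusions."""
    if not query.is_read:
        return False  # only read queries reflect usage
    if not (min_size <= query.size <= max_size):
        return False  # too small or too large
    if query.user in service_accounts:
        return False  # bots and services are not human usage
    return True
```

For example, with `service_accounts={"etl_bot"}`, a read query from `"alice"` within the size limits is kept, while the same query from `"etl_bot"` is dropped.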

Some facts

  • There is exactly 1 asset per source with a perfect score of 1M

  • Due to the way the bucketing is done, two assets with the same number of queries might end up in 2 different buckets

  • The top 2% of assets are all in the 8th bucket (Bucket 7) and hence have a score over 875 000, which means a number of stars between 4.5 and 5.

  • The bottom 33% of assets are all in the 1st bucket (Bucket 0) and as such have a score lower than 125 000, which means a number of stars between 0 and 0.5.
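
The star ranges quoted in these facts can be checked with a small sketch of the score-to-stars conversion. The exact frontend rule is not described on this page; rounding to the nearest half star is an assumption, chosen because it yields 11 states and matches the ranges above. The function name `stars_from_score` is hypothetical.

```python
def stars_from_score(score):
    """Convert a stored score (0 to 1 000 000, or None) to a star count."""
    if score is None:
        return None  # undefined popularity: no stars to display
    # Assumed rule: each half star covers 100 000 points, rounded to the
    # nearest half star (exact midpoints round up).
    half_stars = (score + 50_000) // 100_000
    return half_stars / 2
```

Under this assumption, any score above 875 000 maps to 4.5 or 5 stars, and any score below 125 000 maps to 0 or 0.5 stars.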
