BigQuery set up

It's time to start the technical onboarding with Catalog. It's pretty simple; you need to create a Service Account for Catalog and add it to the Catalog App.

Which rights do you need to complete the onboarding ?

In order to create the Google Service account, you need to have at minimum one of the following roles

  • 🧑‍💼 Service Account Admin

  • 🧑‍💻 Editor Basic

What will Castor do with this access ?

Catalog's access: The given roles will only allow Catalog to see the data model, the queries and the users, by running queries on the metadata. It is the minimum roles allowing us to read your metadata while not being able to read your data.

Need some more info ?

  • If you want some further information on service accounts, :

  • 🧑‍🔧 Service account role documentation is here

  • 🔑 Roles and permission documentation is here

1. Whitelist Catalog's IP on BigQuery

Needed only if you've set up network policies, meaning your BigQuery instance only accepts connections coming from specific IPs.

To allow Catalog to connect to BigQuery, you will first need to whitelist Catalog's IP address:

35.246.176.138

BiqQuery's documentation on configuring public IP connectivity can be found here

2. Create Catalog User on BigQuery

2.1 Create Google Service Account for Catalog

Client creates a service account for Catalog from the Google Console. See how to here

Make sure to create and download a json key for that service account. See how-to here

2.2 Grant Roles to Google Service Account

Client grants the needed access to this new service account with the following roles. See how to here.

You can find the documentation on the different roles here.

  • Bigquery Read Session User

    The Bigquery Read Session User role allows to Catalog create and use read sessions. It allows read capabilities via SQL yet does not grant by itself any data access.

  • Bigquery Metadata Viewer

    The metadataViewer role allows Catalog to fetch the schemas of your data

  • Bigquery Job User

    The jobUser role is used by Catalog to parse queries and compute lineage and usage stats

  • Bigquery Resource Viewer

    The resourceViewer is also used by Catalog to parse queries and compute lineage and usage stats

3. Add new credentials in Catalog App

You must a Catalog admin to do it.

You can now enter the newly created credentials in the Catalog App here.

  • Go to "Settings > Integrations"

  • Click on "BigQuery Add"

  • Add the credentials

If you are using a separate project to host your service accounts: Make sure update the json key. You'll have to overwrite the "project_id" with a project from which the service user can run query from.

Appendix

Queries Ran by the Castor User

  • Databases

    WITH catalogs AS (
        SELECT DISTINCT catalog_name
        FROM `{project}.region-{region}.INFORMATION_SCHEMA.SCHEMATA`
    )
    
    SELECT
        catalog_name AS database_id,
        catalog_name AS database_name
    FROM catalogs
  • Schemas

    SELECT
        catalog_name AS database_id,
        catalog_name AS database_name,
        schema_name,
        schema_owner,
        creation_time,
        last_modified_time,
        location,
        CONCAT(catalog_name, '.', schema_name) AS schema_id
    FROM `{project}.region-{region}.INFORMATION_SCHEMA.SCHEMATA`
  • Tables

    WITH d AS (
        SELECT
            table_catalog,
            table_schema,
            table_name,
            option_value AS `comment`
        FROM
            `{project}.region-{region}.INFORMATION_SCHEMA.TABLE_OPTIONS`
        WHERE TRUE
            AND option_name = 'description'
            AND option_value IS NOT NULL
            AND option_value != ''
            AND option_value != '""'
    ),
    t AS
    (
        SELECT
            table_catalog,
            table_schema,
            table_name,
            option_value AS tags
        FROM
            `{project}.region-{region}.INFORMATION_SCHEMA.TABLE_OPTIONS`
        WHERE TRUE
            AND option_name = 'labels'
            AND option_value IS NOT NULL
            AND option_value != ''
            AND option_value != '""'
    ),
    
    dt AS
    (
        SELECT
            catalog_name,
            schema_name,
            option_value AS tags
        FROM
            `{project}.region-{region}.INFORMATION_SCHEMA.SCHEMATA_OPTIONS`
        WHERE TRUE
            AND option_name = 'labels'
            AND option_value IS NOT NULL
            AND option_value != ''
            AND option_value != '""'
    )
    
    SELECT
        i.table_catalog AS database_name,
        i.table_catalog AS database_id,
        i.table_schema AS `schema_name`,
        i.table_name AS table_name,
        i.table_type,
        i.is_insertable_into,
        i.is_typed,
        i.creation_time,
        d.comment,
        CONCAT(
            COALESCE(t.tags, ""),
            COALESCE(dt.tags, "")
        ) AS tags,
        CONCAT(i.table_catalog, '.', i.table_schema) AS schema_id,
        CONCAT(i.table_catalog, '.', i.table_schema, '.', i.table_name) AS table_id
    FROM
        `{project}.region-{region}.INFORMATION_SCHEMA.TABLES` AS i
    LEFT JOIN d ON i.table_catalog = d.table_catalog
                       AND i.table_schema = d.table_schema
                       AND i.table_name = d.table_name
    LEFT JOIN t ON i.table_catalog = t.table_catalog
                    AND i.table_schema = t.table_schema
                    AND i.table_name = t.table_name
    LEFT JOIN dt ON i.table_catalog = dt.catalog_name
                    AND i.table_schema = dt.schema_name
  • Columns

    WITH field_path AS (
        SELECT
            table_catalog,
            table_schema,
            table_name,
            column_name,
            description,
            field_path,
            data_type
        FROM
            `{project}.region-{region}.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
    )
    
    SELECT
        c.table_catalog AS database_id,
        c.table_catalog AS database_name,
        c.table_schema AS `schema_name`,
        c.table_name AS table_name,
        f.field_path AS column_name,
        c.ordinal_position,
        c.is_nullable,
        f.data_type,
        c.is_generated,
        c.generation_expression,
        c.is_stored,
        c.is_hidden,
        c.is_updatable,
        c.is_system_defined,
        c.is_partitioning_column,
        c.clustering_ordinal_position,
        f.description AS `comment`,
        CONCAT(c.table_catalog, '.', c.table_schema) AS schema_id,
        CONCAT(c.table_catalog, '.', c.table_schema, '.', c.table_name) AS table_id,
        CONCAT(c.table_catalog, '.', c.table_schema, '.', c.table_name, '.', f.field_path) AS column_id
    FROM
        `{{project}}.region-{{region}}.INFORMATION_SCHEMA.COLUMNS` AS c
        LEFT JOIN field_path AS f ON
                c.table_catalog = f.table_catalog
                AND c.table_schema = f.table_schema
                AND c.table_name = f.table_name
                AND c.column_name = f.column_name
    WHERE TRUE
        AND c.column_name != '_PARTITIONTIME'
  • Users

    SELECT DISTINCT user_email AS user_email
    FROM `{project}.region-{region}.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
    WHERE TRUE
        AND DATE(creation_time) >= DATE_ADD(CURRENT_DATE, INTERVAL -30 DAY)
  • Queries

    SELECT
        job_id AS query_id,
        creation_time,
        project_id AS database_name,
        user_email AS user_name,
        user_email AS user_id,
        job_type,
        statement_type,
        priority,
        start_time,
        end_time,
        query AS query_text,
        state,
        reservation_id,
        total_bytes_processed,
        total_bytes_billed,
        total_slot_ms,
        error_result,
        cache_hit,
        destination_table,
        referenced_tables,
        labels,
        parent_job_id
    FROM `{project}.region-{region}.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
    WHERE TRUE
        AND DATE(creation_time) = DATE_ADD(CURRENT_DATE, INTERVAL -1 DAY)
        AND EXTRACT(hour FROM creation_time) BETWEEN 0 AND 23
        AND job_type = 'QUERY'
  • View DDL

    SELECT
        table_catalog AS database_name,
        table_schema AS schema_name,
        table_name AS view_name,
        '{{"project_id": "' || table_catalog || '", "dataset_id": "' || table_schema || '", "table_id": "' || table_name || '"}}' as destination_table,
        view_definition
     FROM `{project}.{dataset}.INFORMATION_SCHEMA.VIEWS`

References

Here's a list of all Google Cloud documentation referenced above

  • Configuring public IP connectivity: click here

  • Service account role documentation: click here

  • Roles and permission documentation: click here

  • Setting roles and permission: click here

Last updated

Was this helpful?