☑️File checker

This tool checks generic CSV files (data types, mandatory columns, broken references, etc.).

Introduction

◀️ The tool takes two things as input:

  • A file template, specifying the rules:

    • Mandatory columns

    • Data types

    • Reference (Foreign Keys)

    • Unicity (Primary Keys)

  • A set of CSV files, to be checked

▶️ After running, the tool produces two outputs:

  • A status, indicating if the files are valid or not

  • A summary of detected errors (empty when the files are valid)

  • A detailed log of errors, when the verbose mode is activated

This tool allows to track issues, fix them, and re-run the checks. Once you're happy with the % of valid rows, you can push the files to Catalog.

Example

Here's an example of the output logs (without the verbose mode):

INFO - DATABASE -- 2 rows -- valid

INFO - SCHEMA -- 12 rows -- valid

INFO - TABLE -- 62 rows -- valid

INFO - COLUMN -- 616 rows -- ERROR (3 invalid rows)
| MISSING_VALUE: 1
| UNAUTHORIZED_VALUE: 2
| UNKNOWN_REFERENCE: 1

INFO - USER -- 4 rows -- valid

INFO - VIEW_DDL -- 2 rows -- valid

INFO - QUERY -- 153 rows -- valid

In that case, the files are almost valid, except for COLUMNS:

  • 3 rows are not valid (4 issues were detected)

  • If you push those files to Catalog, 613 columns will be loaded

List of issues

The following errors can be detected:

  • MISSING_VALUE

    • some columns are mandatory and must be provided

    • check the header: some columns might be missing or have a wrong name

    • check the rows: some cells might be empty

  • WRONG_DATATYPE

    • dates, floats, integers

  • UNAUTHORIZED_VALUE

    • this column accepts only certain values

    • example: { TABLE | VIEW | EXTERNAL}

    • detailed logs show the list of authorized values

  • DUPLICATE_VALUE

    • this column must be unique and a duplicate was detected

  • UNKNOWN_REFERENCE

    • the column references another column which has not been found

    • check that the reference is not broken

    • check that the referred row is valid itself (otherwise it will be considered as unknown)

  • REPEATED_QUOTES

    • the value is surrounded by repeated quotes that will be interpreted as such

    • the checker fails to avoid loading quoted text in Catalog, such as "dim_client"

    • it's probably not intentional and it generally happens when using spreadsheet to generate the CSV (File > Export As > CSV)

Spreadsheet handles quoting itself when generating CSV files.

Templates

Generic warehouse

It allows to push generic warehouse CSV files to Catalog:

  • Databases

  • Schemas

  • Tables

  • etc.

More info about this template here

For now, this is the only template provided with the tool.

Usage

castor-file-check [arguments]
  • --directory: Path to the directory containing your CSV files

  • --verbose: (optional) When provided, a log of each invalid row will be displayed 🔍

The given directory must contain all files mentioned in the template:

  • database.csv

  • schema.csv

  • etc.

The files can be prefixed (with timestamps or anything else):

978193-database.csv

978193-schema.csv

etc.

If several files are found, the most recently created will be used 🕑

Last updated

Was this helpful?