
How to convert a JSON file into structured data in Snowflake with the FLATTEN function

Pablo Sáenz de Tejada

Do you need to transform semi-structured data coming from a JSON file into structured data? It’s easier than it sounds with the FLATTEN function in Snowflake.

Nowadays, semi-structured data, such as the data we find in JSON files, is very common. The INE1, for example, allows you to download data as semi-structured JSON files, and many websites and APIs return data in this format. The structure of JSON files can be a great advantage when handling large volumes of information, but it also tends to make analyzing and visualizing the data more difficult. In these cases, we can quickly load the data into our data lake or data warehouse in its semi-structured format. But how do we later facilitate the analysis? This is where Snowflake's FLATTEN function comes into play: it is incredibly useful for transforming that semi-structured data into structured data.

What does Snowflake’s FLATTEN function do?

In a short and direct way:

The FLATTEN function allows us to start from a nested data structure, such as a JSON file, and convert it into a flat table where each nested element is converted into an individual row, while maintaining the columns of the original table.

In other words, it allows us to transform data from this structure:

| id | COD       | Data                                                            |
|----|-----------|-----------------------------------------------------------------|
| 1  | ADRH18589 | [{"Year":2021, "Value":20587.0},{"Year":2020, "Value":20118.0}] |
| 2  | ADRH18587 | [{"Year":2021, "Value":12271.0},{"Year":2020, "Value":11226.0}] |

To this one:

| id | COD       | Year | Value   |
|----|-----------|------|---------|
| 1  | ADRH18589 | 2021 | 20587.0 |
| 1  | ADRH18589 | 2020 | 20118.0 |
| 2  | ADRH18587 | 2021 | 12271.0 |
| 2  | ADRH18587 | 2020 | 11226.0 |

How can we do this?

The SQL syntax in Snowflake is not complex. Based on the previous example, it will be something like this:

    SELECT
        t.id,
        t.COD,
        d.value:Year::INTEGER as Year,
        d.value:Value::FLOAT as Value
    FROM
        myTable t,
        LATERAL FLATTEN (input => t.Data) d;

We specify that we want to query the id and COD columns of the myTable table (t), and that we want to query the values of Year and Value as two separate columns. Those values live inside the Data column, which we expand with the LATERAL FLATTEN function.
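It can help to inspect what FLATTEN itself returns before projecting specific fields. Each row of its output exposes the columns SEQ, KEY, PATH, INDEX, VALUE and THIS; for an array input, INDEX is the position of the element and VALUE is the element itself:

```sql
-- Inspect the raw FLATTEN output for the Data array
SELECT d.index, d.value
FROM myTable t,
     LATERAL FLATTEN (input => t.Data) d;
```

Each Data element (e.g. {"Year":2021, "Value":20587.0}) arrives as a VARIANT in d.value, which is why the main query extracts fields from it with the : path operator.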

Practical example: Income by census area data from the INE

The INE offers data about the average net income per person in Spain by census area. This data is easily downloadable through the website, and it is also easy to load into modern platforms like Snowflake.

JSON file data structure from INE

When we download the data in JSON format, we get a file with the following structure (the metadata is in Spanish, but it is basically the average net income per person for each census area and each year):

    [
        {
            "COD": "ADRH18589",
            "Nombre": "Melilla sección 01001. Dato base. Renta neta media por persona. ",
            "T3_Unidad": "Euros",
            "T3_Escala": " ",
            "Metadata": [
                {"Id":366733, "Variable":{"Id":847, "Nombre":"Secciones", "Codigo":"SECC"}, "Nombre":"Melilla sección 01001", "Codigo":"5200101001"},
                {"Id":72, "Variable":{"Id":3, "Nombre":"Tipo de dato", "Codigo":""}, "Nombre":"Dato base", "Codigo":""},
                {"Id":284048, "Variable":{"Id":482, "Nombre":"SALDOS CONTABLES", "Codigo":""}, "Nombre":"Renta neta media por persona ", "Codigo":""}
            ],
            "Data": [
                {"Fecha":"2021-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2021, "Valor":20587.0},
                {"Fecha":"2020-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2020, "Valor":20118.0},
                {"Fecha":"2019-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2019, "Valor":19863.0},
                {"Fecha":"2018-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2018, "Valor":18507.0},
                {"Fecha":"2017-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2017, "Valor":17919.0},
                {"Fecha":"2016-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2016, "Valor":16805.0},
                {"Fecha":"2015-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2015, "Valor":16237.0}
            ]
        },
        {
            "COD": "ADRH18587",
            "Nombre": "Melilla sección 01002. Dato base. Renta neta media por persona. ",
            "T3_Unidad": "Euros",
            "T3_Escala": " ",
            "Metadata": [
                {"Id":366734, "Variable":{"Id":847, "Nombre":"Secciones", "Codigo":"SECC"}, "Nombre":"Melilla sección 01002", "Codigo":"5200101002"},
                {"Id":72, "Variable":{"Id":3, "Nombre":"Tipo de dato", "Codigo":""}, "Nombre":"Dato base", "Codigo":""},
                {"Id":284048, "Variable":{"Id":482, "Nombre":"SALDOS CONTABLES", "Codigo":""}, "Nombre":"Renta neta media por persona ", "Codigo":""}
            ],
            "Data": [
                {"Fecha":"2021-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2021, "Valor":12271.0},
                {"Fecha":"2020-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2020, "Valor":11226.0},
                {"Fecha":"2019-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2019, "Valor":11145.0},
                {"Fecha":"2018-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2018, "Valor":10769.0},
                {"Fecha":"2017-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2017, "Valor":10204.0},
                {"Fecha":"2016-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2016, "Valor":10024.0},
                {"Fecha":"2015-01-01T00:00:00.000+01:00", "T3_TipoDato":"Definitivo", "T3_Periodo":"A", "Anyo":2015, "Valor":9752.0}
            ]
        }
    ]

The JSON file above contains the data for two census areas. Each element within the JSON, i.e. each census area, is made up of six elements. Two of them (Metadata and Data) contain sub-elements:

  • COD
  • Nombre
  • T3_Unidad
  • T3_Escala
  • Metadata - With 3 sub-elements, each of those formed by:
    • Id
    • Variable
    • Nombre
    • Codigo
  • Data - With 7 sub-elements, each of those formed by:
    • Fecha
    • T3_TipoDato
    • T3_Periodo
    • Anyo
    • Valor

Table structure when uploading the data to Snowflake

One of Snowflake's advantages is that we can directly load data from JSON files like this one. Doing so will store all the information in two rows, one per census area, and in six columns: COD, Nombre, T3_Unidad, T3_Escala, Metadata, and Data.
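One possible way to do that load (stage name, file name, and column types are placeholders, not the INE's or Snowflake's defaults) is a COPY INTO with a JSON file format, matching the top-level JSON keys to table columns:

```sql
-- Hypothetical load sketch: names are placeholders
CREATE OR REPLACE TABLE myTable (
    COD       VARCHAR,
    Nombre    VARCHAR,
    T3_Unidad VARCHAR,
    T3_Escala VARCHAR,
    Metadata  VARIANT,   -- nested arrays stay as VARIANT
    Data      VARIANT
);

COPY INTO myTable
FROM @my_stage/ine_income.json
FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```

STRIP_OUTER_ARRAY splits the outer JSON array so each census area becomes its own row, and the nested Metadata and Data arrays land in VARIANT columns ready for FLATTEN.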

Our table will look something like this:

| COD       | Nombre | T3_Unidad | T3_Escala | Metadata   | Data   |
|-----------|--------|-----------|-----------|------------|--------|
| ADRH18589 | Texto  | Euros     |           | [Metadata] | [Data] |
| ADRH18587 | Texto  | Euros     |           | [Metadata] | [Data] |
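Even before flattening anything, individual nested values can be reached with Snowflake's path syntax on the VARIANT columns. A quick sketch, using the table above:

```sql
-- Peek at the first Data element of each census area by path
SELECT
    t.COD,
    t.Data[0]:Anyo::INTEGER AS first_year,
    t.Data[0]:Valor::FLOAT  AS first_value
FROM myTable t;
```

This only reads one array position per row, though; to get one row per year we need FLATTEN.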

How can we analyze the data about the average net income per person, for each census area and each year?

Somehow, we have to get a column with the Date or Anyo and the Value of each census area from within Data. In addition, we may also need to keep information from the Metadata section as separate columns; for example, the Nombre and Codigo, but only for the first element of the Metadata array. It is precisely for these types of use cases that Snowflake's native FLATTEN function becomes very useful.

The SQL query in Snowflake with the FLATTEN function

This is the query we need to run in Snowflake. It expands the array of elements in the Data column of the JSON file while maintaining the rest of the structure, and it also generates additional columns for the census area name and census area code found in the first element of the Metadata array:

    SELECT
        t.COD,
        d.value:Anyo::INTEGER as year,
        d.value:Fecha::date as Date,
        d.value:T3_TipoDato::VARCHAR(50) as DataType,
        d.value:T3_Periodo::VARCHAR(1) as PeriodType,
        d.value:Valor::FLOAT as avgNetIncomePerPerson,
        t.metadata[0]:Nombre::STRING as censusAreaName,
        t.metadata[0]:Codigo::STRING as censusAreaCode
    FROM
        myTable t,
        LATERAL FLATTEN(input => t.DATA) d;

What we are defining in the query is:

  • The columns from the raw table that we want to keep as they are, like the COD column.
  • The columns we don't want to expand but from which we want to extract a specific element of the array, like the Metadata column. From this column we want the Nombre and Codigo information, but only for the first element of the array; that's why we write metadata[0] (the first element of the Metadata array).
  • The values inside the Data array that we want to expand into separate rows, specifying with :: the data type for those new columns. For example, d.value:Fecha::date as Date indicates that we want to take the value called Fecha inside the d element (the Data column), treat it as a DATE, and name the new column Date.
  • And what does d.value mean? This is where the last part of the query, LATERAL FLATTEN(input => t.DATA) d, comes into play. We are telling Snowflake to look into the table myTable t and "unnest" the DATA column, generating one new row for each element inside the DATA array. What is also very powerful is that we can chain several LATERAL FLATTEN calls, because each LATERAL view can refer to the previous one to reach elements in multiple levels of nested arrays.
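That chaining can be sketched like this. The table and column names here are hypothetical, just to illustrate a two-level array where each row holds an array of objects and each object contains its own inner array:

```sql
-- Hypothetical two-level expansion: payload is an array of objects,
-- and each object has an inner array under "items"
SELECT
    f1.index       AS outer_idx,
    f2.value       AS leaf_value
FROM nestedTable t,
     LATERAL FLATTEN(input => t.payload) f1,
     LATERAL FLATTEN(input => f1.value:items) f2;
```

The second FLATTEN takes the VARIANT produced by the first (f1.value) as its input, so each additional nesting level just adds one more LATERAL FLATTEN to the FROM clause.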

The output

Using the previous query, we can build a new table or view with the following structure:

| COD       | year | Date       | DataType   | PeriodType  | avgNetIncomePerPerson | censusAreaName        | censusAreaCode      |
|-----------|------|------------|------------|-------------|-----------------------|-----------------------|---------------------|
| ADRH18589 | 2021 | 2021-01-01 | Definitivo | A           | 20587.0               | Melilla sección 01001 | 5200101001          |
| ADRH18589 | 2020 | 2020-01-01 | Definitivo | A           | 20118.0               | Melilla sección 01001 | 5200101001          |
| ADRH18589 | 2019 | 2019-01-01 | Definitivo | A           | 19863.0               | Melilla sección 01001 | 5200101001          |
| ADRH18589 | 2018 | 2018-01-01 | Definitivo | A           | 18507.0               | Melilla sección 01001 | 5200101001          |
| ADRH18589 | 2017 | 2017-01-01 | Definitivo | A           | 17919.0               | Melilla sección 01001 | 5200101001          |
| ADRH18589 | 2016 | 2016-01-01 | Definitivo | A           | 16805.0               | Melilla sección 01001 | 5200101001          |
| ADRH18589 | 2015 | 2015-01-01 | Definitivo | A           | 16237.0               | Melilla sección 01001 | 5200101001          |
| ADRH18587 | 2021 | 2021-01-01 | Definitivo | A           | 12271.0               | Melilla sección 01002 | 5200101002          |
| ADRH18587 | 2020 | 2020-01-01 | Definitivo | A           | 11226.0               | Melilla sección 01002 | 5200101002          |
| ADRH18587 | 2019 | 2019-01-01 | Definitivo | A           | 11145.0               | Melilla sección 01002 | 5200101002          |
| ADRH18587 | 2018 | 2018-01-01 | Definitivo | A           | 10769.0               | Melilla sección 01002 | 5200101002          |
| ADRH18587 | 2017 | 2017-01-01 | Definitivo | A           | 10204.0               | Melilla sección 01002 | 5200101002          |
| ADRH18587 | 2016 | 2016-01-01 | Definitivo | A           | 10024.0               | Melilla sección 01002 | 5200101002          |
| ADRH18587 | 2015 | 2015-01-01 | Definitivo | A           | 9752.0                | Melilla sección 01002 | 5200101002          |

With this new structure, it is much easier to analyze the data with any analytical platform, such as Tableau.
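For instance, ordinary SQL aggregations now apply directly. A sketch of an average income per year across all census areas (directly on the FLATTEN query, so no intermediate view is assumed):

```sql
-- Average net income per person per year, across census areas
SELECT
    d.value:Anyo::INTEGER     AS year,
    AVG(d.value:Valor::FLOAT) AS avg_income
FROM myTable t,
     LATERAL FLATTEN(input => t.Data) d
GROUP BY year
ORDER BY year;
```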

Advantages of the FLATTEN function

The FLATTEN function is an easy and fast way to restructure semi-structured data in Snowflake. Some of its advantages are:

  • Access to nested elements: it makes working with nested elements and data quite easy.
  • Fast changes to the data structure: it lets us transform semi-structured data into structured data, which helps with analyzing and visualizing the data in most of the tools available.
  • Flexible queries: it makes working with big volumes of semi-structured data much easier.
  • Time savings compared to other data preparation alternatives: we can skip heavier and longer ETL processes by loading the raw data directly into our data warehouse and transforming it in a single query, without other tools and with less transformation time.
English is not my first language. Sorry for any grammatical, punctuation or any other errors.

  1. National Statistics Institute of Spain ↩︎