Schemas

Is normalization the best solution?

Like in RDB's schema delevelopment is a question for the data & analytics team.
Shall a normalization create a granularity of data with many tables, or shall we go the fully denormalized route and just store all levels of granularity in a single big table.

Normalization

relational schemas are build with many table and this causes afterwards a cost intensive operation with JOINS

Denormalization

using one big table causes the side effect that data is repeated.

Cost intensive means due RDBMs need to open each record entirely. Because the RDBMS are record based.

Example of structs for a booking schema

Events (Event struct colon event.status, event.time)
Pick-offs colon (Pickup struct colon pickup.latitude, pickup.longitude)
Destination colon (Destination colon destination.latitude, destination.longitude)
Duration (Duration struct colon duratation.booking_to_dispatch, duration.to_pickup)

Which database supports nested and repeated fields?

BigQuery support for nested and repeated fields.

What are Nested and Repeated Fields?

Nested Field are complex data types like arrays. This allows a data warehouse schema which is to combine data.

Struct is a logical container and can be used instead a table
Sstruct colon pre-joined table within a table

Example nested fields
Take a look on the prefix and on the type RECORD
destination
destination.latitude

Example Repeated Fields
Take a look on the prefix transaction and on the type list REPEATED
nested field in nested field
transaction
transaction_inputs.input_script_bytes

Creating your own arrays with ARRAY_AGG()

You can do some pretty useful things with arrays like:

finding the number of elements with ARRAY_LENGTH(array)
deduplicating elements with ARRAY_AGG(DISTINCT field)
ordering elements with ARRAY_AGG((field) ORDER BY (field))
limiting ARRAY_AGG((field) LIMIT 5)

How does an unnest look like?

Here an example query which does not run correctly due nested Table error:
Cannot access field product on a value with type ARRAY> at [5:8]

SELECT visitId, hits.page.pageTitle FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501570398

Here the example query how get the data from the nested Table

SELECT DISTINCT visitId, h.page.pageTitle FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`, UNNEST(hits) AS h WHERE visitId = 1501570398 LIMIT 10

DISTINCT to our ARRAY_AGG()

Which field is the STRUCT? How do you know? The participants field is the STRUCT because it is of type RECORD Which field is the ARRAY? The participants.splits field is an array of floats inside of the parent participants struct. It has a REPEATED Mode which indicates an array. Values of that array are called nested values since they are multiple values inside of a single field. A SQL STRUCT is simply a container of other data fields which can be of different data types. The word struct means data structure. Recall the example from earlier: __STRUCT(__"Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits__)__AS runner STRUCTs are given an alias (like runner above) and can conceptually be thought of as a table inside of your main table. STRUCTs (and ARRAYs) must be unpacked before you can operate over their elements. Wrap an UNNEST() around the name of the struct itself or the struct field that is an array in order to unpack and flatten it.