Is normalization the best solution?
Like in RDB's schema delevelopment is a question for the data & analytics team.
Shall a normalization create a granularity of data with many tables, or shall we go the fully denormalized route and just store all levels of granularity in a single big table.
- Normalization
- relational schemas are build with many table and this causes afterwards a cost intensive operation with JOINS
- Denormalization
- using one big table causes the side effect that data is repeated.
Cost intensive means due RDBMs need to open each record entirely. Because the RDBMS are record based.
Example of structs for a booking schema
- Events (Event struct colon event.status, event.time)
- Pick-offs colon (Pickup struct colon pickup.latitude, pickup.longitude)
- Destination colon (Destination colon destination.latitude, destination.longitude)
- Duration (Duration struct colon duratation.booking_to_dispatch, duration.to_pickup)
Which database supports nested and repeated fields?
BigQuery support for nested and repeated fields.
What are Nested and Repeated Fields?
Nested Field are complex data types like arrays. This allows a data warehouse schema which is to combine data.
Struct is a logical container and can be used instead a table
Sstruct colon pre-joined table within a table
Example nested fields
Take a look on the prefix and on the type RECORD
destination
destination.latitude
Example Repeated Fields
Take a look on the prefix transaction and on the type list REPEATED
nested field in nested field
transaction
transaction_inputs.input_script_bytes
Creating your own arrays with ARRAY_AGG()
You can do some pretty useful things with arrays like:
- finding the number of elements with ARRAY_LENGTH(array)
- deduplicating elements with ARRAY_AGG(DISTINCT field)
- ordering elements with ARRAY_AGG((field) ORDER BY (field))
- limiting ARRAY_AGG((field) LIMIT 5)
How does an unnest look like?
Here an example query which does not run correctly due nested Table error:
Cannot access field product on a value with type ARRAY> at [5:8]
SELECT
visitId,
hits.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE visitId = 1501570398
Here the example query how get the data from the nested Table
SELECT DISTINCT
visitId,
h.page.pageTitle
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
UNNEST(hits) AS h
WHERE visitId = 1501570398
LIMIT 10
DISTINCT to our ARRAY_AGG()
Which field is the STRUCT? How do you know?
The participants field is the STRUCT because it is of type RECORD
Which field is the ARRAY?
The participants.splits field is an array of floats inside of the parent participants struct. It has a REPEATED Mode which indicates an array. Values of that array are called nested values since they are multiple values inside of a single field.
A SQL STRUCT is simply a container of other data fields which can be of different data types. The word struct means data structure. Recall the example from earlier:
__STRUCT(__"Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits__)__AS runner
STRUCTs are given an alias (like runner above) and can conceptually be thought of as a table inside of your main table.
STRUCTs (and ARRAYs) must be unpacked before you can operate over their elements. Wrap an UNNEST() around the name of the struct itself or the struct field that is an array in order to unpack and flatten it.