BigQuery is a serverless data warehouse. Tables in BigQuery are organized into datasets. In this lab, messages published into Pub/Sub will be aggregated and stored in BigQuery.
BigQuery replaces a typical data warehouse hardware setup.
BigQuery organizes data tables into units called datasets.
BigQuery defines schemas and issues queries directly on external data sources.
Queries and functions work the same way as in a traditional data warehouse; under the hood, BigQuery uses column-based storage rather than record-based storage.
Cloud IAM grants permission to perform specific actions on datasets; more granular access to individual tables is better realized with views.
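As a sketch of the view-based approach (using the google-cloud-bigquery Python client; the project, dataset, and column names below are placeholders), you can expose only selected columns of a table through a view and grant users access to the view's dataset instead of the underlying table:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical identifiers; replace with your own project, dataset, and table.
    view = bigquery.Table("my-project.reporting_views.orders_summary")
    view.view_query = """
        SELECT order_id, order_date, total_amount
        FROM `my-project.sales.orders`
    """
    client.create_table(view)  # grant users access to reporting_views, not to sales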
When you reference a table from the command line or SQL queries, you refer to it by using the construct: project.dataset.table
Like Cloud Storage, BigQuery datasets can be regional or multi-regional. Regional datasets are replicated across multiple zones in the region, while multi-regional datasets are replicated across multiple regions.
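For example, a minimal sketch with the Python client that pins a dataset to a single region (the project, dataset, and region names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    dataset = bigquery.Dataset("my-project.lab_dataset")
    dataset.location = "us-central1"  # regional; use "US" or "EU" for a multi-regional dataset
    client.create_dataset(dataset, exists_ok=True)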
Every table has a schema, which can be entered manually (for example, in the web UI or as a JSON schema definition). A schema can also be specified, or auto-detected, as part of a load job.
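A minimal sketch of defining a schema in code, assuming the Python client and placeholder table and field names:

    from google.cloud import bigquery

    client = bigquery.Client()

    schema = [
        bigquery.SchemaField("message_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("publish_time", "TIMESTAMP"),
        bigquery.SchemaField("payload", "STRING"),
    ]
    table = bigquery.Table("my-project.lab_dataset.messages", schema=schema)
    client.create_table(table)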
There are daily limits for load jobs on each project. Load jobs are asynchronous, so you don’t need to maintain a client connection while the job is being executed.
Schema auto-detection works well for self-describing formats such as Avro, where the schema is read from the file itself. For CSV, the header row can be skipped as part of the load job. For other formats, defining the schema manually is recommended.
You can launch load jobs through the BigQuery web UI, or automate the process by setting up Cloud Functions to listen to Cloud Storage events.
BigQuery supports various data formats like CSV, newline-delimited JSON, Avro, Parquet, and Apache ORC.
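A sketch of an asynchronous load job from Cloud Storage, assuming a CSV file with a header row (the bucket, file, and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # let BigQuery infer the schema
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/messages.csv",
        "my-project.lab_dataset.messages",
        job_config=job_config,
    )
    load_job.result()  # the job runs server-side; result() simply waits for completion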
The BigQuery query service is separate from the BigQuery storage service.
Querying native tables is the most common and performant way to use BigQuery.
You can query external data sources without loading the data into BigQuery; these are called federated queries.
Supported formats include: CSV, newline-delimited JSON, Avro files, Parquet files, and Apache ORC.
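A sketch of a federated query over a CSV file in Cloud Storage, using a temporary external table definition (the bucket and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-bucket/events.csv"]
    external_config.autodetect = True
    external_config.options.skip_leading_rows = 1  # skip the CSV header row

    job_config = bigquery.QueryJobConfig(table_definitions={"events": external_config})
    rows = client.query("SELECT COUNT(*) AS n FROM events", job_config=job_config).result()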
BigQuery dynamically allocates storage and query resources based on usage patterns. Storage resources are allocated as you consume them and deallocated as you remove data or drop tables.
Query resources are allocated based on the query type and complexity. Each query uses slots—units of computation comprising CPU and RAM.
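Once a query finishes, the job statistics report how much slot time it consumed; a sketch with the Python client against a public sample table:

    from google.cloud import bigquery

    client = bigquery.Client()

    query_job = client.query(
        "SELECT corpus, SUM(word_count) AS words "
        "FROM `bigquery-public-data.samples.shakespeare` GROUP BY corpus"
    )
    query_job.result()  # wait for completion so the statistics are populated
    print("Slot-milliseconds consumed:", query_job.slot_millis)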
To calculate pricing, you can use BigQuery's query validator combined with the pricing calculator for estimates.
BigQuery offers 1 TB of querying for free every month, making public datasets a great way to try out BigQuery.
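A sketch of estimating query cost programmatically with a dry run against a public dataset; this returns the same byte count the query validator shows in the UI:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(
        "SELECT name, SUM(number) AS total "
        "FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name",
        job_config=job_config,
    )
    print("This query would process", job.total_bytes_processed, "bytes")
    # Compare the byte count against the 1 TB monthly free tier or the pricing calculator.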
You can separate the cost of storage from the cost of queries by keeping the dataset in one project (A) and running query jobs from another project (B). Users can query the dataset in project A without running jobs in project A; query charges go to the user's project, while storage charges stay with the dataset's project.
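A sketch of that separation with the Python client (both project IDs are placeholders): the client's project is billed for the query job even though the dataset lives in the other project.

    from google.cloud import bigquery

    # Query jobs run (and are billed) in project B ...
    client = bigquery.Client(project="project-b")

    # ... while the data being queried lives in project A.
    rows = client.query(
        "SELECT * FROM `project-a.shared_dataset.events` LIMIT 10"
    ).result()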
BigQuery uses run-length encoding and dictionary encoding for data storage.
Open the command-line tool to create a dataset in BigQuery.
SELECT * FROM tablename.realtime LIMIT 10
Aggregations can be performed on the stream of incoming data to generate reports.
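For example, a sketch of such an aggregation over the streamed table from the query above (the dataset, table, and column names are placeholders from this lab):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Count messages per minute in the streaming table (placeholder names).
    report = client.query("""
        SELECT TIMESTAMP_TRUNC(publish_time, MINUTE) AS minute,
               COUNT(*) AS message_count
        FROM `tablename.realtime`
        GROUP BY minute
        ORDER BY minute DESC
    """).result()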