BigQuery is a serverless data warehouse. Tables in BigQuery are organized into datasets. In this lab, messages published into Pub/Sub will be aggregated and stored in BigQuery.
BigQuery replaces a typical data warehouse hardware setup.
BigQuery organizes data tables into units called datasets.
BigQuery defines schemas and issues queries directly on external data sources.
Queries and functions work the same way as in a traditional data warehouse; under the hood, BigQuery uses column-based storage rather than record-based storage.
Cloud IAM grants permission to perform specific actions on datasets; more granular access to individual tables is better realized with views.
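As a sketch of the view-based approach (using the google-cloud-bigquery Python client; the project, dataset, and column names below are placeholders), you can expose only selected columns of a table through a view and grant users access to the view's dataset instead of the underlying table:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical identifiers; replace with your own project, dataset, and table.
    view = bigquery.Table("my-project.reporting_views.orders_summary")
    view.view_query = """
        SELECT order_id, order_date, total_amount
        FROM `my-project.sales.orders`
    """
    client.create_table(view)  # grant users access to reporting_views, not to sales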
When you reference a table from the command line or SQL queries, you refer to it by using the construct: project.dataset.table
Like Cloud Storage, BigQuery datasets can be regional or multi-regional. Regional datasets are replicated across multiple zones in the region, while multi-regional datasets are replicated across multiple regions.
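For example, a minimal sketch with the Python client that pins a dataset to a single region (the project, dataset, and region names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    dataset = bigquery.Dataset("my-project.lab_dataset")
    dataset.location = "us-central1"  # regional; use "US" or "EU" for a multi-regional dataset
    client.create_dataset(dataset, exists_ok=True)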
Every table has a schema, which can be entered manually (for example, in the web UI or as a JSON schema definition). A schema can also be specified, or auto-detected, as part of a load job.
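A minimal sketch of defining a schema in code, assuming the Python client and placeholder table and field names:

    from google.cloud import bigquery

    client = bigquery.Client()

    schema = [
        bigquery.SchemaField("message_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("publish_time", "TIMESTAMP"),
        bigquery.SchemaField("payload", "STRING"),
    ]
    table = bigquery.Table("my-project.lab_dataset.messages", schema=schema)
    client.create_table(table)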
There are daily limits for load jobs on each project. Load jobs are asynchronous, so you don’t need to maintain a client connection while the job is being executed.
Schema auto-detection works well for self-describing formats such as Avro, where the schema is read from the file itself. For CSV, the header row can be skipped as part of the load job. For other formats, defining the schema manually is recommended.
You can launch load jobs through the BigQuery web UI, or automate the process by setting up Cloud Functions to listen to Cloud Storage events.
BigQuery supports various data formats like CSV, newline-delimited JSON, Avro, Parquet, and Apache ORC.
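A sketch of an asynchronous load job from Cloud Storage, assuming a CSV file with a header row (the bucket, file, and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # let BigQuery infer the schema
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/messages.csv",
        "my-project.lab_dataset.messages",
        job_config=job_config,
    )
    load_job.result()  # the job runs server-side; result() simply waits for completion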
The BigQuery query service is separate from the BigQuery storage service.
Querying native tables is the most common and performant way to use BigQuery.
You can query external data sources without loading the data into BigQuery; these are called federated queries.
Supported formats include: CSV, newline-delimited JSON, Avro files, Parquet files, and Apache ORC.
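A sketch of a federated query over a CSV file in Cloud Storage, using a temporary external table definition (the bucket and table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-bucket/events.csv"]
    external_config.autodetect = True
    external_config.options.skip_leading_rows = 1  # skip the CSV header row

    job_config = bigquery.QueryJobConfig(table_definitions={"events": external_config})
    rows = client.query("SELECT COUNT(*) AS n FROM events", job_config=job_config).result()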
BigQuery dynamically allocates storage and query resources based on usage patterns. Storage resources are allocated as you consume them and deallocated as you remove data or drop tables.
Query resources are allocated based on the query type and complexity. Each query uses slots—units of computation comprising CPU and RAM.
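Once a query finishes, the job statistics report how much slot time it consumed; a sketch with the Python client against a public sample table:

    from google.cloud import bigquery

    client = bigquery.Client()

    query_job = client.query(
        "SELECT corpus, SUM(word_count) AS words "
        "FROM `bigquery-public-data.samples.shakespeare` GROUP BY corpus"
    )
    query_job.result()  # wait for completion so the statistics are populated
    print("Slot-milliseconds consumed:", query_job.slot_millis)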
To calculate pricing, you can use BigQuery's query validator combined with the pricing calculator for estimates.
BigQuery offers 1 TB of querying for free every month, making public datasets a great way to try out BigQuery.
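A sketch of estimating query cost programmatically with a dry run against a public dataset; this returns the same byte count the query validator shows in the UI:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(
        "SELECT name, SUM(number) AS total "
        "FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name",
        job_config=job_config,
    )
    print("This query would process", job.total_bytes_processed, "bytes")
    # Compare the byte count against the 1 TB monthly free tier or the pricing calculator.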
You can separate the cost of storage from the cost of queries by keeping the dataset in one project (A) and running query jobs from another project (B). Users can query the dataset in project A without running jobs in project A; query charges go to the user's project, while storage charges stay with the dataset's project.
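A sketch of that separation with the Python client (both project IDs are placeholders): the client's project is billed for the query job even though the dataset lives in the other project.

    from google.cloud import bigquery

    # Query jobs run (and are billed) in project B ...
    client = bigquery.Client(project="project-b")

    # ... while the data being queried lives in project A.
    rows = client.query(
        "SELECT * FROM `project-a.shared_dataset.events` LIMIT 10"
    ).result()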
BigQuery uses run-length encoding and dictionary encoding for data storage.
Open the command-line tool to create a dataset in BigQuery.
SELECT * FROM tablename.realtime LIMIT 10
Aggregations can be performed on the stream of incoming data to generate reports.
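For example, a sketch of such an aggregation over the streamed table from the query above (the dataset, table, and column names are placeholders from this lab):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Count messages per minute in the streaming table (placeholder names).
    report = client.query("""
        SELECT TIMESTAMP_TRUNC(publish_time, MINUTE) AS minute,
               COUNT(*) AS message_count
        FROM `tablename.realtime`
        GROUP BY minute
        ORDER BY minute DESC
    """).result()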