
Cloud Storage

Cloud Storage is a very handy service when it comes to working with data, especially unstructured data: not transactional data, and not data for analytics. It is an object store, so it just stores and retrieves binary objects without regard to what data is contained in them. It also provides file system compatibility and can make objects look and work as if they were files, so you can copy files in and out of it.

Cloud Storage allows world-wide storage and retrieval of any amount of data at any time. You can use Cloud Storage for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via direct download.
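
For example, thanks to the file system compatibility you can move data with ordinary copy semantics. A minimal sketch (the bucket and file names are placeholders):

gsutil cp report.csv gs://my-bucket/reports/report.csv    # upload a local file
gsutil cp gs://my-bucket/reports/report.csv ./report.csv  # download it again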

What is a bucket?

Buckets are containers for data: objects exist inside of buckets and not apart from them.
All buckets live in a single global namespace, so every bucket name must be globally unique. Because bucket names are global, it is recommended not to put sensitive information in them.
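
A common way to get a globally unique name is to prefix it with your project ID. A sketch (the -my-data suffix is arbitrary):

gsutil mb gs://$DEVSHELL_PROJECT_ID-my-data/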

Bucket properties

Example: in gs://declass/de/modules/O2/script.sh, the bucket name is declass.

How do I secure the data in my bucket?

Controlling access with Cloud IAM and access control lists (ACLs).
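
A quick sketch of both mechanisms (member, bucket, and object names are placeholders):

gsutil iam ch user:alice@example.com:objectViewer gs://my-bucket          # IAM: bucket-level role
gsutil acl ch -u bob@example.com:R gs://my-bucket/reports/report.csv      # ACL: per-object read access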

What is GMEK? Google-managed encryption keys

All data is encrypted at rest as well as in transit, and this cannot be switched off.

Two levels of encryption are used. First, the data is encrypted using a data encryption key (DEK), and then the DEK itself is encrypted using a key encryption key, or KEK. The KEKs are automatically rotated on a schedule, and the current KEK is stored in Cloud KMS, the Key Management Service. You don't have to do anything; this happens automatically.
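
A conceptual sketch of this envelope-encryption pattern using openssl (purely illustrative, not what Cloud Storage actually runs; the openssl passphrase mechanism stands in for direct key use):

openssl rand -out kek.bin 32                                                      # key encryption key
openssl rand -out dek.bin 32                                                      # data encryption key
openssl enc -aes-256-cbc -pbkdf2 -in data.txt -out data.enc -pass file:dek.bin    # encrypt the data under the DEK
openssl enc -aes-256-cbc -pbkdf2 -in dek.bin -out dek.enc -pass file:kek.bin      # wrap the DEK under the KEK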

What is CMEK? Customer-managed encryption keys

I control the creation and existence of the KEK in Cloud KMS.
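
With CMEK you can, for example, set your own Cloud KMS key as a bucket's default. A sketch (the key ring and key names are hypothetical and must already exist in Cloud KMS):

gsutil kms encryption \
  -k projects/$DEVSHELL_PROJECT_ID/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key \
  gs://$DEVSHELL_PROJECT_ID-vcm/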

What is CSEK? Customer-supplied encryption keys

I provide the KEK myself.
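
gsutil accepts a customer-supplied key through its boto configuration. A sketch (the key is generated on the spot purely for illustration; in practice you must keep it safe, since Google cannot recover the data without it):

CSEK=$(openssl rand -base64 32)    # base64-encoded AES-256 key
gsutil -o "GSUtil:encryption_key=$CSEK" cp data.txt gs://my-bucket/data.txt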

Encryption vs. Data Locking

Data locking is different from encryption. Where encryption prevents somebody from understanding the data, locking prevents them from modifying the data.
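
For example, a temporary hold locks a single object against deletion or overwriting until the hold is released. A sketch (bucket and object names are placeholders):

gsutil retention temp set gs://my-bucket/reports/report.csv        # place a temporary hold
gsutil retention temp release gs://my-bucket/reports/report.csv    # release it again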

What is an object?

When an object is stored, Cloud Storage replicates that object. It then monitors the replicas, and if one of them is lost or corrupted, it replaces it automatically with a fresh copy. This is how Cloud Storage gets its many nines of durability.
In a multi-region bucket the objects are replicated across regions; in a single-region bucket, as you might expect, the objects are replicated across zones within that one region.
Objects are stored with metadata, which is information about that object.
In the example gs://declass/de/modules/O2/script.sh, the object name is de/modules/O2/script.sh.
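
You can inspect and edit an object's metadata with gsutil. A sketch (bucket and object names are placeholders):

gsutil stat gs://my-bucket/reports/report.csv                                  # show size, storage class, hashes, etc.
gsutil setmeta -h "Content-Type:text/csv" gs://my-bucket/reports/report.csv    # change a metadata header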

Create a Cloud Storage bucket

Use the environment variable $DEVSHELL_PROJECT_ID, which Cloud Shell sets to your current project ID, and simply append -vcm to the end to form the bucket name.

gsutil mb -p $DEVSHELL_PROJECT_ID \
-c regional \
-l us-central1 \
gs://$DEVSHELL_PROJECT_ID-vcm/

Copy training images into the Cloud Storage bucket

The training images are publicly available in a Cloud Storage bucket. Use the gsutil command-line utility for Cloud Storage to copy the training images into your bucket:

gsutil -m cp -r gs://cloud-training/automl-lab-clouds/* gs://$DEVSHELL_PROJECT_ID-vcm/

Load data from a Cloud Storage bucket into a BigQuery data warehouse

Note that gsutil -m cp ... only copies objects between buckets or between a bucket and your machine; loading bucket data into BigQuery is done with the bq command-line tool instead.
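
A minimal sketch of such a load, assuming a CSV file with a header row (the dataset, table, and file names are hypothetical):

bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --autodetect \
  mydataset.mytable \
  gs://$DEVSHELL_PROJECT_ID-vcm/data.csv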

List folders in Cloud Storage bucket

List all folders in the bucket:

gsutil ls gs://$DEVSHELL_PROJECT_ID-vcm/

List all files in all folders of a Cloud Storage bucket

Use * to list all files in all folders:

gsutil ls gs://$DEVSHELL_PROJECT_ID-vcm/*

Retention policies and retention policy locks

You can add a retention policy to a bucket to specify a retention period: objects in the bucket cannot be deleted or overwritten until they reach the age given by the policy. Locking a retention policy makes it permanent; a locked policy can never be removed or shortened.
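
A sketch of setting and locking a retention policy (the bucket is a placeholder; note that locking is irreversible):

gsutil retention set 30d gs://my-bucket    # objects must be at least 30 days old before deletion
gsutil retention lock gs://my-bucket       # permanently lock the policy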

Directory rename in HDFS is not the same as in Cloud Storage

Cloud Storage is at its core an object store; it only simulates a file directory.
Because Cloud Storage has no concept of directories, a directory rename cannot be a single metadata operation as in HDFS; it is executed object by object. For example:

mv gs://foo/bar gs://foo/bar2

under the hood becomes:

list   gs://foo/bar
copy   gs://foo/bar/baz1, gs://foo/bar/baz2 to gs://foo/bar2/
delete gs://foo/bar/baz1, gs://foo/bar/baz2
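
The same applies to gsutil: moving a prefix is a per-object copy followed by a delete, so it is not atomic and takes time proportional to the number of objects. A sketch (bucket and prefixes are placeholders):

gsutil -m mv gs://my-bucket/logs gs://my-bucket/logs-archive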