b07-Cloud Dataflow

Data Processing

Solution:
Apache Beam + Cloud Dataflow

Cloud Dataflow

  • Auto scaling, No-Ops, Stream and Batch Processing
  • Built on Apache Beam
  • Pipelines are regional resources

Data Transformation

Cloud Dataproc vs Cloud Dataflow

Key Terms

  • Element: a single entry of data (e.g. a table row)
  • PCollection: a distributed data set; pipeline inputs and outputs
  • Transform: a data processing step in the pipeline
  • ParDo: a type of transform that processes elements in parallel

b06-Cloud-PubSub

Cloud Pub/Sub

  • Global scale messaging buffer/coupler
  • No-ops
  • Decouples senders and receivers
  • Equivalent to Kafka
  • At-least-once delivery

Pub/Sub Core Concepts

  • Topic: A named resource to which messages are sent by publishers
  • Subscription: A named resource representing the stream of messages from a single, specific topic, to be delivered to the subscribing application.
  • Message: The combination of data and (optional) attributes that a publisher sends to a topic and is eventually delivered to subscribers.
  • Message attribute: A key-value pair that a publisher can define

Pub/Sub Message flow

  1. A publisher sends messages to the topic
  2. Messages are stored in the message store until they are delivered and acknowledged by subscribers
  3. Pub/Sub forwards messages from a topic to subscribers. Messages can either be pushed by Pub/Sub to subscribers or pulled by subscribers from Pub/Sub
  4. The subscriber receives pending messages from its subscription and acknowledges each one to Pub/Sub
  5. After a message is acknowledged by the subscriber, it is removed from the subscription's queue of messages
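
The steps above can be sketched as a toy in-memory model (illustration only, not the Cloud Pub/Sub client library): messages stay in a subscription's queue until the subscriber acknowledges them.

```python
# Toy in-memory model of the Pub/Sub message flow (illustration only,
# not the Cloud Pub/Sub client API).
from collections import deque

class Topic:
    def __init__(self):
        self.subscriptions = []

    def publish(self, message):
        # Steps 1-2: the message is fanned out and stored per subscription.
        for sub in self.subscriptions:
            sub.pending.append(message)

class Subscription:
    def __init__(self, topic):
        self.pending = deque()
        topic.subscriptions.append(self)

    def pull(self):
        # Steps 3-4: deliver pending messages; they are NOT removed yet.
        return list(self.pending)

    def ack(self, message):
        # Step 5: only an acknowledgement removes the message.
        self.pending.remove(message)

topic = Topic()
sub = Subscription(topic)
topic.publish("A")
topic.publish("B")
print(sub.pull())   # ['A', 'B']
sub.ack("A")
print(sub.pull())   # ['B']
```
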

Basic flow of messages

Messages A and B are sent to the topic.
There are two subscriptions. Subscriber1 and Subscriber2 are both subscribed to Subscription1, and each one receives one of the messages.
Subscriber3 is subscribed to Subscription2 and receives both messages A and B.

Publishing

  • Message format

    • Message data
    • Ordering Key
    • Attributes
  • Using Schema

    • Avro
    • Protocol Buffer

Receiving

Subscriber

  • At-least-once delivery

  • Retention duration
    How long unacknowledged messages are stored in Pub/Sub
    10 minutes to 7 days

  • Ack deadline
    Time a subscriber has to acknowledge a message before Pub/Sub redelivers it
    10 seconds by default

  • Expiration period
    The subscription is deleted if there is no subscriber activity
    31 days by default
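
Because delivery is at-least-once, a message whose ack deadline expires may be redelivered, so subscribers are usually written to be idempotent. A minimal sketch, deduplicating on a message id (names are illustrative, not the client library):

```python
# Sketch of an idempotent subscriber for at-least-once delivery
# (illustration only): duplicates are detected by message id, so a
# redelivered message is processed exactly once.
processed_ids = set()
results = []

def handle(message_id, data):
    if message_id in processed_ids:
        return  # duplicate redelivery: already handled, just re-ack
    processed_ids.add(message_id)
    results.append(data)

handle("m1", "hello")
handle("m2", "world")
handle("m1", "hello")   # redelivery after a missed ack deadline
print(results)  # ['hello', 'world']
```
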

Delivery mode

  • Push = lower latency, more real-time
    • Push subscribers must be webhook endpoints that accept POST over HTTPS
  • Pull = ideal for large volumes of messages; batch delivery

Replaying

Seeking to a timestamp

  • retain_acked_messages set to true
  • messageRetentionDuration
    7 days by default

Seeking to a snapshot

Define a snapshot; the messages to replay are those that had not been acknowledged when the snapshot
was created, plus any messages published afterwards.
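
The snapshot semantics can be pictured with a small sketch (illustration only): the replayed set is the union of the messages unacknowledged at snapshot time and the messages published after it.

```python
# Toy sketch of seeking a subscription to a snapshot (illustration
# only): a snapshot captures the messages that were unacknowledged at
# creation time; seeking replays those plus anything published since.
pending = ["A", "B", "C"]       # unacked messages in the subscription
snapshot = list(pending)        # snapshot taken at this moment

pending.remove("A")             # "A" is later consumed and acked
published_after = ["D"]         # "D" is published after the snapshot

# Seek: replay = unacked-at-snapshot-time + messages published since.
replayed = snapshot + published_after
print(replayed)  # ['A', 'B', 'C', 'D']
```
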

b05-Cloud-Spanner

Cloud Spanner

  • Fully managed, highly scalable/available, relational database
  • Similar architecture to Bigtable

What is it used for?

  • A relational database for workloads that need strong transactional consistency (ACID)
  • Wide scale
  • Higher workloads than Cloud SQL can handle
  • Standard SQL format

Spanner vs Cloud SQL

  • Cloud SQL = MySQL
  • Spanner is not a "drop-in" replacement for MySQL
    • Not MySQL/PostgreSQL compatible
    • Migration requires work

Spanner architecture

  • Nodes handle computation, each node serves up to 2 TB of storage
  • Storage is replicated across zones, compute and storage are separated
  • Replication is automatic

b04-Bigtable

Cloud Bigtable

  • High performance, massively scalable NoSQL
  • Ideal for large analytical workloads

Bigtable infrastructure

  • Front-end server pool serves requests to nodes
  • Compute and storage are separate; no data is stored on the node except the metadata needed to direct requests to the correct tablet
  • Tables are sharded into tablets, which are stored on Colossus, Google's file system. Because storage is separate from the compute nodes,
    replication and recovery of node data is very fast, as only metadata/pointers need to be updated

Instances

An entire Bigtable deployment is called an 'instance'

Clusters

Nodes are grouped into clusters

  • 1 or more clusters per instance

Instance types

  • Development - low cost, single node, no replication
  • Production - 3+ nodes per cluster, replication available

Schema Design

  • Per table - the row key is the only indexed item
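
Because the row key is the only index, rows are stored in lexicographic key order, and efficient reads are key lookups or prefix/range scans; a common pattern is to pack query dimensions into the key. A plain-Python sketch with hypothetical sensor data:

```python
# Plain-Python sketch (not the Bigtable API) of why row-key design
# matters: rows are kept sorted by key, and the only efficient reads
# are key lookups and prefix/range scans over that ordering.
rows = {
    "sensor1#2024-01-01": 20,   # hypothetical time-series values
    "sensor1#2024-01-02": 21,
    "sensor2#2024-01-01": 18,
}

def prefix_scan(rows, prefix):
    # Walk keys in sorted (Bigtable) order, keeping only the prefix.
    return [(k, rows[k]) for k in sorted(rows) if k.startswith(prefix)]

# All readings for sensor1, retrieved with a single prefix scan:
print(prefix_scan(rows, "sensor1#"))
# [('sensor1#2024-01-01', 20), ('sensor1#2024-01-02', 21)]
```
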

Hands on

install cbt in Google Cloud SDK

gcloud components update
gcloud components install cbt

set env variable

echo -e "project=[PROJECT_ID]\ninstance=[INSTANCE_ID]">~/.cbtrc

create table

cbt createtable my-table

list table

cbt ls

add column family

cbt createfamily my-table cf1

list column family

cbt ls my-table

add value to row1, column family cf1, column qualifier c1

cbt set my-table r1 cf1:c1=testvalue

read table

cbt read my-table

delete table

cbt deletetable my-table

b03-Cloud-Datastore

What is Cloud Datastore?

  • NoSQL
  • Flexible structure/relationship
  • No Ops
  • No provisioning of instances
  • Compute layer is abstracted away
  • Scalable
  • Multi-region access
  • Sharding/replication is automatic
  • Only one Datastore database per project

When to use Datastore

  • Applications need scale
  • Product catalog - real time inventory
  • User profiles - mobile apps
  • Game save states
  • ACID transactions, eg, transferring funds

When not to use Datastore

  • Analytics (full SQL semantics)
    • Use BigQuery/Cloud Spanner
  • Extreme scale (10M+ reads/writes per second)
    • Use Bigtable
  • Don't need ACID
    • Use Bigtable
  • Lift and shift (existing MySQL)
    • Use Cloud SQL
  • Near-zero latency (< 10 ms)
    • Use an in-memory database (Redis)

Relational Database vs Datastore

Entities can be hierarchical
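
One way to picture the hierarchy: an entity is identified by a key path of (kind, name) pairs, and an ancestor query matches every entity whose path starts with the ancestor's path. A plain-Python sketch (not the client library; the names are illustrative):

```python
# Sketch of Datastore's hierarchical keys (illustration only, not the
# client library): an entity's key is a path of (kind, name) pairs, so
# a child entity embeds its ancestors' keys in its own path.
user_key = (("User", "alice"),)
order_key = user_key + (("Order", "order-42"),)

def has_ancestor(key, ancestor):
    # Ancestor queries match entities whose key path starts with
    # the ancestor's path.
    return key[:len(ancestor)] == ancestor

print(has_ancestor(order_key, user_key))  # True
```
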

Queries and Indexing

Query

  • Retrieves entities from Datastore
  • Query methods:
    • Programmatic
    • Web console
    • GQL (Google Query Language)

Indexing

  • Queries get their results from indexes
  • Index types
    • Built-in: allows single-property queries
    • Composite: defined in index.yaml
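
A composite index is declared in index.yaml. A hypothetical entry for a query that filters on one property and orders by another (the kind and property names are made up for illustration):

```yaml
# Hypothetical index.yaml entry: supports a query on kind Product
# that filters by category and orders by price descending.
indexes:
- kind: Product
  properties:
  - name: category
  - name: price
    direction: desc
```
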

Danger - Exploding Indexes!

  • Solutions:
    • Use index.yaml to narrow index scope
    • Do not index properties that don't need indexing

Data Consistency

Performance vs Accuracy

  • Strongly consistent
    • Parallel processes see results in a guaranteed order
    • Use case: financial transactions
  • Eventually consistent
    • Parallel processes may see results out of order
    • Use case: census population counts, where order is not important

b01-Fundamental-Concepts

Database Types

Relational database

  • SQL
  • ACID ( Atomicity, Consistency, Isolation, Durability)
  • Transactional
  • Examples
    • MySQL, Microsoft SQL Server, Oracle, PostgreSQL
  • Pros
    • Standard, consistent, reliable, data integrity
  • Cons
    • Poor scaling, not fast, not good for semi-structured data

“Consistency and Reliability over Performance”

Non-Relational Database

  • Non-structured
  • Some have ACID (Datastore)
  • Examples
    • Redis, MongoDB, Cassandra, HBase, Bigtable, RavenDB
  • Pros
    • Scalable, High Performance, Not Structure-Limited
  • Cons
    • Consistency, Data Integrity

“Performance over Consistency”

How to choose the right storage

a08-GCP Big Data and Machine Learning

Big Data

Serverless means you don't have to worry about provisioning compute instances to run your jobs. The services are fully managed.

Cloud Dataproc

is managed Hadoop

  • Fast, easy, managed way to run Hadoop, Spark/Hive/Pig on GCP
  • Create clusters in 90 seconds
  • Scale clusters up and down even when jobs are running
  • Easily migrate on-premises Hadoop jobs to the cloud
  • Save money with preemptible instances

Cloud Dataflow

is managed data pipelines

  • Processes data using Compute Engine
    • Clusters are sized for you
    • Automated scaling
  • Write code for batch and streaming

Why use Cloud Dataflow?

  • ETL
  • Data analytics: batch or streaming
  • Orchestration: create pipelines that coordinate services, including external services
  • Integrates with GCP services

BigQuery

is managed data warehouse

  • It provides near real-time interactive analysis of massive datasets using SQL
  • No cluster maintenance
  • Compute and Storage are separated with a terabit network in between
  • You only pay for storage and processing used
  • Automatic discount for long-term data storage (when data reaches 90 days in BigQuery, Google drops the price of storage)

Cloud Pub/Sub

is scalable, reliable messaging

  • Supports many-to-many asynchronous messaging
  • Push/pull to topics
  • Support for offline consumers
  • At least once delivery policy

Cloud Datalab

interactive data exploration(Notebook)

Built on Jupyter(formerly IPython)

Easily deploy models to BigQuery. You can visualize data with Google Charts or matplotlib

Cloud Machine Learning Platform

  • TensorFlow
  • Cloud ML
  • Machine Learning APIs

Why use Cloud Machine Learning Platform?

  • For structured data
    • Classification and regression
    • Recommendation
    • Anomaly detection
  • For unstructured data
    • Image and video analytics
    • Text analytics

Cloud Vision API

  • Gain insight from images
  • Detect inappropriate content
  • Analyze sentiment
  • Extract text

Cloud Speech API

  • Can return text in real time
  • Highly accurate, even in noisy environments
  • Access from any device

Cloud Translation API

  • Translate strings
  • Programmatically detect a document’s language
  • Support for dozen’s languages

Cloud Video Intelligence API

  • Annotate the contents of video
  • Detect scene changes
  • Flag inappropriate content
  • Support for a variety of video formats

a07-GCP Developing, Deploying and Monitoring

Cloud Source Repositories

Fully featured Git repositories hosted on GCP

Cloud Functions

  • Create single-purpose functions that respond to events without a server or runtime
  • Written in JavaScript; executes in a managed Node.js environment on GCP

Deployment Manager

  • Provides repeatable deployments
  • Create a .yaml template describing your environment and use Deployment Manager to create resources
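
A Deployment Manager template is plain YAML describing resources. A minimal sketch for a single Compute Engine VM (the resource name, zone, machine type, and image are all example values):

```yaml
# Minimal Deployment Manager template sketch: one Compute Engine
# instance described declaratively (all names are example values).
resources:
- name: example-vm
  type: compute.v1.instance
  properties:
    zone: us-central1-a
    machineType: zones/us-central1-a/machineTypes/f1-micro
    disks:
    - deviceName: boot
      boot: true
      autoDelete: true
      initializeParams:
        sourceImage: projects/debian-cloud/global/images/family/debian-11
    networkInterfaces:
    - network: global/networks/default
```
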

Stackdriver

is GCP’s tool for monitoring, logging and diagnostics

a06-GCP-Applications

App Engine

is a PaaS for building scalable applications

  • it makes deployment, maintenance and scalability easy
  • suited for building scalable web applications and mobile backends

It offers 2 environments:

  • standard environment

    • easily deploy applications
    • autoscale
    • free daily quota
    • usage based pricing
    • Specific versions of Java, Python, PHP and Go are supported

    Sandbox constraints:

    • No writing to local files
    • All requests time out at 60s
    • Limits on third-party software
  • flexible environment

    • Build and deploy containerized apps with a click
    • No sandbox constraints
    • Can access App Engine resources

Comparing standard and flexible environments

Comparing Kubernetes Engine and App Engine

GCP provides 2 API management tools

Cloud Endpoints

  • Control access and validate calls with JSON Web Token and Google API keys
    • Identify web, mobile users with Auth0 and Firebase Authentication
  • Generate client libraries

Apigee Edge

  • A platform for making APIs available to your customers and partners
  • Contains analytics, monetization, and a developer portal
  • The backend services for Apigee need not be in GCP