b07-Cloud Dataflow

Data Processing

Solution:
Apache Beam + Cloud Dataflow

Cloud Dataflow

  • Auto scaling, No-Ops, Stream and Batch Processing
  • Built on Apache Beam
  • Pipelines are regional resources

Data Transformation

Cloud Dataproc vs Cloud Dataflow

Key Terms

  • Element: a single entry of data (e.g. a table row)
  • PCollection: a distributed data set; pipeline inputs and outputs
  • Transform: a data processing step in the pipeline
  • ParDo: a type of transform that processes elements in parallel

b06-Cloud-PubSub

Cloud Pub/Sub

  • Global scale messaging buffer/coupler
  • No-ops
  • Decouples senders and receivers
  • Equivalent to Kafka
  • At-least-once delivery

Pub/Sub Core Concepts

  • Topic: A named resource to which messages are sent by publishers
  • Subscription: A named resource representing the stream of messages from a single, specific topic, to be delivered to the subscribing application.
  • Message: The combination of data and (optional) attributes that a publisher sends to a topic and is eventually delivered to subscribers.
  • Message attribute: A key-value pair that a publisher can define

Pub/Sub Message flow

  1. A publisher sends messages to the topic
  2. Messages are stored in the message store until they are delivered and acknowledged by subscribers
  3. Pub/Sub forwards messages from a topic to subscribers. Messages can either be pushed by Pub/Sub to subscribers or pulled by subscribers from Pub/Sub
  4. The subscriber receives pending messages from its subscription and acknowledges each one to Pub/Sub
  5. After a message is acknowledged by the subscriber, it is removed from the subscription's queue of messages
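
The steps above can be sketched as a toy in-memory model (illustration only, not the Cloud Pub/Sub client library): messages stay in a subscription's queue until the subscriber acknowledges them.

```python
# Toy in-memory model of the Pub/Sub message flow (illustration only,
# not the Cloud Pub/Sub client API).
from collections import deque

class Topic:
    def __init__(self):
        self.subscriptions = []

    def publish(self, message):
        # Steps 1-2: the message is fanned out and stored per subscription.
        for sub in self.subscriptions:
            sub.pending.append(message)

class Subscription:
    def __init__(self, topic):
        self.pending = deque()
        topic.subscriptions.append(self)

    def pull(self):
        # Steps 3-4: deliver pending messages; they are NOT removed yet.
        return list(self.pending)

    def ack(self, message):
        # Step 5: only an acknowledgement removes the message.
        self.pending.remove(message)

topic = Topic()
sub = Subscription(topic)
topic.publish("A")
topic.publish("B")
print(sub.pull())   # ['A', 'B']
sub.ack("A")
print(sub.pull())   # ['B']
```
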

Basic flow of messages

Messages A and B are sent to the topic.
There are two subscriptions. Subscriber1 and Subscriber2 are both subscribed to Subscription1, and each one receives one of the messages.
Subscriber3 is subscribed to Subscription2 and receives both messages A and B.

Publishing

  • Message format

    • Message data
    • Ordering Key
    • Attributes
  • Using Schema

    • Avro
    • Protocol Buffer

Receiving

Subscriber

  • At-least-once delivery

  • Retention duration
    How long unacknowledged messages are stored in Pub/Sub
    10 minutes to 7 days

  • Ack deadline
    Time a subscriber has to acknowledge a message before Pub/Sub redelivers it
    10 seconds by default

  • Expiration period
    The subscription is deleted if there is no subscriber activity
    31 days by default
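
Because delivery is at-least-once, a message whose ack deadline expires may be redelivered, so subscribers are usually written to be idempotent. A minimal sketch, deduplicating on a message id (names are illustrative, not the client library):

```python
# Sketch of an idempotent subscriber for at-least-once delivery
# (illustration only): duplicates are detected by message id, so a
# redelivered message is processed exactly once.
processed_ids = set()
results = []

def handle(message_id, data):
    if message_id in processed_ids:
        return  # duplicate redelivery: already handled, just re-ack
    processed_ids.add(message_id)
    results.append(data)

handle("m1", "hello")
handle("m2", "world")
handle("m1", "hello")   # redelivery after a missed ack deadline
print(results)  # ['hello', 'world']
```
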

Delivery mode

  • Push = lower latency, more real-time
    • Push subscribers must be webhook endpoints that accept POST over HTTPS
  • Pull = ideal for large volumes of messages; batch delivery

Replaying

Seeking to a timestamp

  • retain_acked_messages set to true
  • messageRetentionDuration
    7 days by default

Seeking to a snapshot

Define a snapshot; the messages to replay are those that had not been acknowledged when the snapshot
was created, plus any messages published afterwards.
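
The snapshot semantics can be pictured with a small sketch (illustration only): the replayed set is the union of the messages unacknowledged at snapshot time and the messages published after it.

```python
# Toy sketch of seeking a subscription to a snapshot (illustration
# only): a snapshot captures the messages that were unacknowledged at
# creation time; seeking replays those plus anything published since.
pending = ["A", "B", "C"]       # unacked messages in the subscription
snapshot = list(pending)        # snapshot taken at this moment

pending.remove("A")             # "A" is later consumed and acked
published_after = ["D"]         # "D" is published after the snapshot

# Seek: replay = unacked-at-snapshot-time + messages published since.
replayed = snapshot + published_after
print(replayed)  # ['A', 'B', 'C', 'D']
```
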

b05-Cloud-Spanner

Cloud Spanner

  • Fully managed, highly scalable/available, relational database
  • Similar architecture to Bigtable

What is it used for?

  • A relational database for workloads that need strong transactional consistency (ACID)
  • Wide scale
  • Higher workloads than Cloud SQL can handle
  • Standard SQL format

Spanner vs Cloud SQL

  • Cloud SQL = MySQL
  • Spanner is not a "drop-in" replacement for MySQL
    • Not MySQL/PostgreSQL compatible
    • Migration requires work

Spanner architecture

  • Nodes handle computation, each node serves up to 2 TB of storage
  • Storage is replicated across zones, compute and storage are separated
  • Replication is automatic

b04-Bigtable

Cloud Bigtable

  • High performance, massively scalable NoSQL
  • Ideal for large analytical workloads

Bigtable infrastructure

  • Front-end server pool serves requests to nodes
  • Compute and storage are separate; no data is stored on the node except the metadata needed to direct requests to the correct tablet
  • Tables are sharded into tablets, which are stored on Colossus, Google's file system. Because storage is separate from the compute nodes,
    replication and recovery of node data is very fast, as only metadata/pointers need to be updated

Instances

An entire Bigtable deployment is called an 'instance'

Clusters

Nodes are grouped into clusters

  • 1 or more clusters per instance

Instance types

  • Development - low cost, single node, no replication
  • Production - 3+ nodes per cluster, replication available

Schema Design

  • Per table - the row key is the only indexed item
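
Because the row key is the only index, rows are stored in lexicographic key order, and efficient reads are key lookups or prefix/range scans; a common pattern is to pack query dimensions into the key. A plain-Python sketch with hypothetical sensor data:

```python
# Plain-Python sketch (not the Bigtable API) of why row-key design
# matters: rows are kept sorted by key, and the only efficient reads
# are key lookups and prefix/range scans over that ordering.
rows = {
    "sensor1#2024-01-01": 20,   # hypothetical time-series values
    "sensor1#2024-01-02": 21,
    "sensor2#2024-01-01": 18,
}

def prefix_scan(rows, prefix):
    # Walk keys in sorted (Bigtable) order, keeping only the prefix.
    return [(k, rows[k]) for k in sorted(rows) if k.startswith(prefix)]

# All readings for sensor1, retrieved with a single prefix scan:
print(prefix_scan(rows, "sensor1#"))
# [('sensor1#2024-01-01', 20), ('sensor1#2024-01-02', 21)]
```
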

Hands on

install cbt in Google Cloud SDK

gcloud components update
gcloud components install cbt

set env variable

echo -e "project=[PROJECT_ID]\ninstance=[INSTANCE_ID]">~/.cbtrc

create table

cbt createtable my-table

list table

cbt ls

add column family

cbt createfamily my-table cf1

list column family

cbt ls my-table

add value to row1, column family cf1, column qualifier c1

cbt set my-table r1 cf1:c1=testvalue

read table

cbt read my-table

delete table

cbt deletetable my-table

b03-Cloud-Datastore

What is Cloud Datastore?

  • NoSQL
  • Flexible structure/relationship
  • No Ops
  • No provisioning of instances
  • Compute layer is abstracted away
  • Scalable
  • Multi-region access
  • Sharding/replication is automatic
  • Only one Datastore database per project

When to use Datastore

  • Applications need scale
  • Product catalog - real time inventory
  • User profiles - mobile apps
  • Game save states
  • ACID transactions, eg, transferring funds

When not to use Datastore

  • Analytics (full SQL semantics)
    • Use BigQuery/Cloud Spanner
  • Extreme scale (10M+ reads/writes per second)
    • Use Bigtable
  • Don't need ACID
    • Use Bigtable
  • Lift and shift (existing MySQL)
    • Use Cloud SQL
  • Near-zero latency (< 10 ms)
    • Use an in-memory database (Redis)

Relational Database vs Datastore

Entities can be hierarchical
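
One way to picture the hierarchy: an entity is identified by a key path of (kind, name) pairs, and an ancestor query matches every entity whose path starts with the ancestor's path. A plain-Python sketch (not the client library; the names are illustrative):

```python
# Sketch of Datastore's hierarchical keys (illustration only, not the
# client library): an entity's key is a path of (kind, name) pairs, so
# a child entity embeds its ancestors' keys in its own path.
user_key = (("User", "alice"),)
order_key = user_key + (("Order", "order-42"),)

def has_ancestor(key, ancestor):
    # Ancestor queries match entities whose key path starts with
    # the ancestor's path.
    return key[:len(ancestor)] == ancestor

print(has_ancestor(order_key, user_key))  # True
```
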

Queries and Indexing

Query

  • Retrieves entities from Datastore
  • Query methods:
    • Programmatic
    • Web console
    • GQL (Google Query Language)

Indexing

  • Queries get their results from indexes
  • Index types
    • Built-in: allows single-property queries
    • Composite: defined in index.yaml
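
A composite index is declared in index.yaml. A hypothetical entry for a query that filters on one property and orders by another (the kind and property names are made up for illustration):

```yaml
# Hypothetical index.yaml entry: supports a query on kind Product
# that filters by category and orders by price descending.
indexes:
- kind: Product
  properties:
  - name: category
  - name: price
    direction: desc
```
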

Danger - Exploding Indexes!

  • Solutions:
    • Use index.yaml to narrow index scope
    • Do not index properties that don't need indexing

Data Consistency

Performance vs Accuracy

  • Strongly consistent
    • Parallel processes see results in a guaranteed order
    • Use case: financial transactions
  • Eventually consistent
    • Parallel processes may see results out of order
    • Use case: census population counts, where order is not important

b01-Fundamental-Concepts

Database Types

Relational database

  • SQL
  • ACID ( Atomicity, Consistency, Isolation, Durability)
  • Transactional
  • Examples
    • MySQL, Microsoft SQL Server, Oracle, PostgreSQL
  • Pros
    • Standard, consistent, reliable, data integrity
  • Cons
    • Poor scaling, not fast, not good for semi-structured data

“Consistency and Reliability over Performance”

Non-Relational Database

  • Non-structured
  • Some have ACID (Datastore)
  • Examples
    • Redis, MongoDB, Cassandra, HBase, Bigtable, RavenDB
  • Pros
    • Scalable, High Performance, Not Structure-Limited
  • Cons
    • Consistency, Data Integrity

“Performance over Consistency”

How to choose the right storage

a08-GCP Big Data and Machine Learning

Big Data

Serverless means you don't have to worry about provisioning compute instances to run your jobs. The services are fully managed.

Cloud Dataproc

is managed Hadoop

  • Fast, easy, managed way to run Hadoop, Spark/Hive/Pig on GCP
  • Create clusters in 90 seconds
  • Scale clusters up and down even when jobs are running
  • Easily migrate on-premises Hadoop jobs to the cloud
  • Save money with preemptible instances

Cloud Dataflow

is managed data pipelines

  • Processes data using Compute Engine
    • Clusters are sized for you
    • Automated scaling
  • Write code for batch and streaming

Why use Cloud Dataflow?

  • ETL
  • Data analytics: batch or streaming
  • Orchestration: create pipelines that coordinate services, including external services
  • Integrates with GCP services

BigQuery

is managed data warehouse

  • It provides near real-time interactive analysis of massive datasets using SQL
  • No cluster maintenance
  • Compute and Storage are separated with a terabit network in between
  • You only pay for storage and processing used
  • Automatic discount for long-term data storage (when data reaches 90 days in BigQuery, Google drops the price of storage)

Cloud Pub/Sub

is scalable, reliable messaging

  • Supports many-to-many asynchronous messaging
  • Push/pull to topics
  • Support for offline consumers
  • At least once delivery policy

Cloud Datalab

interactive data exploration(Notebook)

Built on Jupyter(formerly IPython)

Easily deploy models to BigQuery. You can visualize data with Google Charts or matplotlib

Cloud Machine Learning Platform

  • TensorFlow
  • Cloud ML
  • Machine Learning APIs

Why use Cloud Machine Learning Platform?

  • For structured data
    • Classification and regression
    • Recommendation
    • Anomaly detection
  • For unstructured data
    • Image and video analytics
    • Text analytics

Cloud Vision API

  • Gain insight from images
  • Detect inappropriate content
  • Analyze sentiment
  • Extract text

Cloud Speech API

  • Can return text in real time
  • Highly accurate, even in noisy environments
  • Access from any device

Cloud Translation API

  • Translate strings
  • Programmatically detect a document’s language
  • Support for dozen’s languages

Cloud Video Intelligence API

  • Annotate the contents of video
  • Detect scene changes
  • Flag inappropriate content
  • Support for a variety of video formats

a07-GCP Developing, Deploying and Monitoring

Cloud Source Repositories

Fully featured Git repositories hosted on GCP

Cloud Functions

  • Create single-purpose functions that respond to events without a server or runtime
  • Written in JavaScript; executes in a managed Node.js environment on GCP

Deployment Manager

  • Provides repeatable deployments
  • Create a .yaml template describing your environment and use Deployment Manager to create resources
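
A Deployment Manager template is plain YAML describing resources. A minimal sketch for a single Compute Engine VM (the resource name, zone, machine type, and image are all example values):

```yaml
# Minimal Deployment Manager template sketch: one Compute Engine
# instance described declaratively (all names are example values).
resources:
- name: example-vm
  type: compute.v1.instance
  properties:
    zone: us-central1-a
    machineType: zones/us-central1-a/machineTypes/f1-micro
    disks:
    - deviceName: boot
      boot: true
      autoDelete: true
      initializeParams:
        sourceImage: projects/debian-cloud/global/images/family/debian-11
    networkInterfaces:
    - network: global/networks/default
```
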

Stackdriver

is GCP’s tool for monitoring, logging and diagnostics

a06-GCP-Applications

App Engine

is a PaaS for building scalable applications

  • it makes deployment, maintenance and scalability easy
  • suited for building scalable web applications and mobile backends

It offers 2 environments:

  • standard environment

    • easily deploy applications
    • autoscale
    • free daily quota
    • usage based pricing
    • Specific versions of Java, Python, PHP and Go are supported

    Sandbox constraints:

    • No writing to local files
    • All requests time out at 60s
    • Limits on third-party software
  • flexible environment

    • Build and deploy containerized apps with a click
    • No sandbox constraints
    • Can access App Engine resources

Comparing standard and flexible environments

Comparing Kubernetes Engine and App Engine

GCP provides 2 API management tools

Cloud Endpoints

  • Control access and validate calls with JSON Web Token and Google API keys
    • Identify web, mobile users with Auth0 and Firebase Authentication
  • Generate client libraries

Apigee Edge

  • A platform for making APIs available to your customers and partners
  • Contains analytics, monetization, and a developer portal
  • The backend services for Apigee need not be in GCP