a05-GCP-Containers

Kubernetes

an orchestrator for containers so you can better manage and scale your applications

Kubernetes Engine

Kubernetes as a managed service in the cloud.

You can create a Kubernetes cluster with Kubernetes Engine
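
For example, a minimal cluster can be created from the command line with the Cloud SDK (the cluster name, zone and node count below are placeholder values):

gcloud container clusters create my-cluster --zone us-central1-a --num-nodes 3
# fetch credentials so that kubectl can talk to the new cluster
gcloud container clusters get-credentials my-cluster --zone us-central1-a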

Cluster

is a set of master components that control the system as a whole and a set of nodes that run containers
In Kubernetes, a node represents a computing instance
In GCP, a node is a VM running in Compute Engine

Pod

the smallest deployable unit in Kubernetes. It often has a single container, but it can have multiple containers, in which case the containers
share the networking and the same storage volumes

Deployment

a Deployment represents a group of replicas of the same pod. It keeps your pods running even if a node fails
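
As a minimal sketch (the deployment name and image are placeholder values), a Deployment can be created and scaled with kubectl:

kubectl create deployment nginx --image=nginx
kubectl scale deployment nginx --replicas=3   # keep 3 replicas of the pod running
kubectl get deployments                       # check desired vs. current replicas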

Service

is the fundamental way Kubernetes represents load balancing. It can have a public IP so that clients outside the cluster can reach the pods

In GKE, this kind of load balancer is a network load balancer
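
For example, assuming the nginx Deployment sketched above, exposing it as a Service of type LoadBalancer makes GKE provision the load balancer and assign an external IP:

kubectl expose deployment nginx --port=80 --type=LoadBalancer
kubectl get services   # the EXTERNAL-IP column shows the load balancer address once it is ready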

Anthos

Hybrid and multi-cloud distributed systems and service management from Google. It is built on Kubernetes and on Kubernetes Engine deployed on-premises

a04-GCP-Storage

Cloud Storage

It is binary large-object storage (you store an object and access it with a unique key), which is not the same as
file storage (you manage your data as a hierarchy of folders) or
block storage (the operating system manages data as chunks of a disk)

Cloud storage works well with web technologies and is scalable

Each object in Cloud Storage has a URL and is immutable
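
A quick sketch with gsutil (bucket and file names are placeholder values):

gsutil mb gs://my-example-bucket              # create a bucket
gsutil cp report.pdf gs://my-example-bucket/  # upload an object
gsutil ls gs://my-example-bucket              # list the objects in the bucket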

Usage:

  • serving website content
  • storing data for archival and disaster recovery
  • distributing large data objects to your end users via Direct Download

For most cases, IAM is sufficient, but if you need finer control, you can create ACLs (access control lists), as in the example after this list.
Each ACL consists of:

  • scope: a user or group
  • permission: read or write
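
For example, read access could be granted to a single user on one object with gsutil (the user and object names are placeholders):

gsutil acl ch -u jane@example.com:R gs://my-example-bucket/report.pdf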

Object Versioning

Cloud Storage keeps a history of modifications: you can list the archived versions of an object, and restore or delete them

This setting is not available in the Cloud Console; you change it via the command line.
NB: after enabling versioning and adding several versions of an object, if you later disable versioning, those versions
still exist in the bucket and you have to delete them manually.
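
A minimal sketch with gsutil (the bucket name is a placeholder):

gsutil versioning set on gs://my-example-bucket    # enable object versioning
gsutil versioning set off gs://my-example-bucket   # disable it again
gsutil ls -a gs://my-example-bucket                # -a also lists archived (noncurrent) versions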

lifecycle

Cloud Storage offers lifecycle management policies, e.g. you could delete objects older than 5 days
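
A sketch of that 5-day rule with gsutil, assuming the policy is first written to a local lifecycle.json file:

cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 5}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://my-example-bucket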

Cloud Storage has different types of storage classes: Multi-Regional, Regional, Nearline, Coldline

3 Ways to bring data into Cloud Storage:

  • Online Transfer
  • Storage Transfer Service (online)
  • Transfer Appliance (offline)

Cloud Bigtable

It is Google's NoSQL big data database service. It supports high throughput for both reads and writes.

It offers the same API as HBase (the native database for Apache Hadoop). Differences:

  • Bigtable scales faster and is easier to manage
  • Bigtable encrypts data in flight and at rest
  • Access to Bigtable can be controlled with IAM
  • Bigtable drives major applications such as Google Analytics and Gmail

Bigtable access patterns:

  • application API
  • streaming
  • batch processing

Cloud SQL

managed RDBMS

  • MySQL and PostgreSQL databases
  • Automatic replication
  • Managed backups
  • Vertical scaling (read and write)
  • Horizontal scaling (read)
  • Google security
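
As a hedged sketch, a MySQL instance could be created from the command line (the instance name, tier and region are placeholder values):

gcloud sql instances create my-sql-instance \
    --database-version=MYSQL_8_0 \
    --tier=db-n1-standard-1 \
    --region=us-central1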

Cloud Spanner

horizontally scalable RDBMS

  • Strong global consistency
  • High availability
  • SQL queries
  • Automatic replication

Cloud Datastore

horizontally scalable NoSQL DB

  • Designed for application backend
  • Supports transactions
  • includes a free daily quota

Cloud Storage Comparison

a03-GCP-Virtual-Machines

We can run virtual machines by using Compute Engine

VPC Network

Virtual Private Cloud: it connects your GCP resources to each other and to the internet.

Google Cloud VPC networks are global, but the subnets are regional

In the example below, VMs in us-east1-b and us-east1-c are on the same subnet but in different zones

VPCs have routing tables, and you can define firewall rules in terms of tags on Compute Engine instances.
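
For instance (the network name and tag are placeholders), a rule allowing HTTP traffic to instances tagged "web" might look like:

gcloud compute firewall-rules create allow-http \
    --network=my-vpc \
    --allow=tcp:80 \
    --target-tags=web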

VPC Peering: establishes a peering relationship between VPCs in different projects
Shared VPC: shares a VPC across projects; you can use IAM to control who may use it

Cloud Load Balancing

Users get a single, global anycast IP address

Cloud CDN(Content Delivery Network)

Use Google’s globally distributed edge caches to cache content close to your users

Compute Engine

Create a virtual machine by using:

  • GCP Console
  • gcloud

You can run images of linux or windows servers.

You can configure the memory and CPUs of each VM, and you can choose between 2 kinds of disk storage: standard or SSD

You can choose a preemptible VM to save money (the instance can be terminated at any time and runs for at most 24 hours; it suits applications that
distribute processing across multiple instances in a cluster, or testing)
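
A minimal sketch with gcloud (the name, zone, machine type and image are placeholder values):

gcloud compute instances create my-vm \
    --zone=us-central1-a \
    --machine-type=e2-standard-2 \
    --image-family=debian-11 \
    --image-project=debian-cloud \
    --preemptible   # optional: cheaper, but the VM may be terminated at any time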

  • Scale up:
    use big VMs for memory and compute-intensive applications

  • Scale out:
    Use Autoscaling for resilient, scalable applications

spark cross join

//todo: to complete

Cross join matches every row from left with every row from right, generating a Cartesian cross product.

val joinDF = statesPopulationDF.crossJoin(statesTaxRatesDF)

Note that no join condition is used.

a02-Getting-Started-With-GCP

IAM

IAM: Google Cloud’s Identity and Access Management.
It has 3 parts:

  • Who:
    can be defined by a Google account, a Google group, or a service account
  • Can do what: can be defined by an IAM role, which is a collection of permissions (see the example after this list)
    • There are 3 kinds of roles:
      • Primitive role
      • Predefined role
      • Custom role: can only be defined at the organization or project level, not in folders
  • On which resource
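
For example, granting a role to a member on a project could look like this (the project ID, user and role are placeholder values):

gcloud projects add-iam-policy-binding my-project \
    --member="user:jane@example.com" \
    --role="roles/viewer"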

GCP resource hierarchy

Policies can be defined at the organization, folder, and project levels, and they are inherited down the hierarchy.

Projects are the main way you organise your GCP resources.

Each project has

  • Project ID: immutable (assigned by you)
  • Project Name: mutable (assigned by you)
  • Project number: immutable (assigned by GCP)

Policies defined in organisation level can be inherited to all children.

GCP applies the principle of least privilege in managing any kind of compute infrastructure.

IAM policies are additive: a policy at one level in this hierarchy can't take away access that's granted at another level.
Eg: if you grant the Editor role on the organisation and the Viewer role on a folder, the folder effectively still has the Editor role.

Projects can have different owners and users - they are billed separately and managed separately.

Service Account

For example: if you want to give permissions to a Compute Engine instance rather than to a person, you would use a service account.
A service account is also a resource, so it can have IAM policies of its own attached to it.
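
A minimal sketch (the account name, project and role are placeholders): create a service account, grant it a role, then attach it to a VM:

gcloud iam service-accounts create my-sa --display-name="example service account"
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"
gcloud compute instances create my-vm --zone=us-central1-a \
    --service-account=my-sa@my-project.iam.gserviceaccount.com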

There are 4 ways to interact with GCP’s management layer:

  • GCP console
    • Manage and create projects
    • Access to GCP APIs
    • Offers access to Cloud Shell
  • Cloud shell and Cloud SDK
    • Includes command line tools: gcloud, gsutil (Cloud Storage), bq (BigQuery)
  • API
    • Enabled through GCP console
    • APIs Explorer is an interactive tool that lets you easily try GCP APIs using a browser
    • Use libraries within your code
      • Cloud Client Libraries
        (Latest and recommended libraries)
      • Google API Client Libraries
  • Cloud Console Mobile App

When using GCP, Google handles most of the lower security layers, while the upper layers remain the customer's responsibility

Cloud Marketplace (formerly Cloud Launcher)

It’s a tool for quickly deploying functional software packages on GCP

a01-GCP-introduction

What is GCP?

GCP stands for Google Cloud Platform, a suite of public cloud computing services offered by Google. It offers 4 main kinds of services: compute, storage, big data, and machine learning, all running on Google hardware.

what is cloud computing?

  • On-demand self-service: use interface to get processing, no human intervention needed
  • Broad network access: access from anywhere
  • Resource pooling: provider shares resources to customers
  • Rapid elasticity: get more resources quickly as needed
  • Measured service: pay only for what you consume

compute evolution

  1. Physical/colo: user-configured, managed and maintained
  2. Virtualized: user-configured; provider-managed and maintained
  3. Serverless: container based architecture, fully automated

gcp computing architectures

  • IaaS: Infrastructure as a Service
    • provides raw compute, storage, and network resources, organized in ways that are familiar from physical data centers
  • PaaS: Platform as a Service
    • provides a platform allowing customers to develop, run and manage applications without maintaining infrastructure
  • SaaS: Software as a Service
    • software is licensed on a subscription basis and is centrally hosted

regions and zones

GCP is organized into regions and zones. A zone is a deployment area for GCP resources. For example, a Compute Engine VM runs in a zone that you specify.

Locations within regions usually have network latencies of under 5 milliseconds

To build a fault-tolerant application, you can spread its resources across multiple regions

open apis

GCP services are compatible with open-source products, for example:

  • Cloud Bigtable <==> Apache HBase
  • Cloud Dataproc <==> Hadoop

gcp services

GCP offers services for compute, storage, big data, and machine learning

budgets and billing

GCP provides 4 tools to help:

budgets and alerts: define a budget per billing account or per project; you can create alerts at 50 percent, 90 percent, and 100 percent of the budget

billing: each GCP project is associated with a billing account

export: store detailed billing information

reports and quotas:

  • report is a visual tool to monitor your expenditure.
  • quotas are designed to prevent over-consumption of resources
    • rate quotas
      reset after a specific time, e.g. the Kubernetes Engine service sets a quota of 1,000 calls per 100 seconds
    • allocation quotas
      govern the number of resources you can have in your projects, e.g. no more than 5 VPC networks per project

spark listeners

When a Spark job is running, the driver collects metrics from the different executors and sends events via an event bus to the Web UI and to the EventLog listener simultaneously.

The EventLog listener writes the events to a directory (for example on HDFS), and the Spark History Server then exposes these events in its interface.
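
A hedged sketch of the related configuration (the HDFS path and listener class are placeholder values): event logging and an extra listener can be enabled when submitting the application:

spark-submit \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=hdfs:///spark-event-logs \
    --conf spark.extraListeners=com.example.MyCustomListener \
    --class com.example.MyApp my-app.jar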

We could also define our own custom listener; several listeners have already been developed and we can use them directly. Here I will present 3 examples.

sparklens

Sparklens is a profiling tool for Spark with a built-in Spark scheduler simulator. It reports:

  • Estimated completion time and estimated cluster utilisation with different numbers of executors

  • Job/Stage timeline which shows how the parallel stages were scheduled within a job. This makes it easy to visualise the DAG with stage dependencies at the job level.

Here is a screen capture of Sparklens reporting (in my example, I added the Sparklens jar to the classpath of Zeppelin, imported the Sparklens package, and then used Sparklens directly)

import package

import com.qubole.sparklens.QuboleNotebookListener
val QNL = new QuboleNotebookListener(sc.getConf)
sc.addSparkListener(QNL)

sparklint

Sparklint is a profiling tool for Spark with advanced metrics and better visualization of your Spark application's resource utilization. It helps you find out where the bottlenecks are

We can use it in application code, or we can use it to analyse event logs. At the time of writing, the current version of Sparklint is 1.0.13, and it cannot yet analyse History Server event logs if compression is enabled in the configuration.
But we can decompress them to JSON files and then work on those.

Here is a screen capture of Sparklint

sparkMeasure

sparkMeasure is a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark performance metrics.

It is a tool with multiple uses: it can instrument both interactive (notebook) and batch workloads
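
As a hedged sketch, it can be pulled into an interactive session via the --packages option (the artifact version below is an assumption; check the sparkMeasure documentation for current coordinates):

spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17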

Hive vs Presto

Presto

Presto is a distributed SQL query engine for big data, it allows users to query a variety of data sources such as Hadoop, RDBMS, Mongo etc. One can even query data from multiple data sources within a single query.

Hive

Hive is data warehouse software built on top of Apache Hadoop that provides data query and analysis. It translates SQL queries into multiple stages of MapReduce and is powerful enough to handle huge numbers of jobs

Comparisons

                        | Presto                          | Hive
best use                | interactive queries             | large data aggregation (joins, for example)
SQL standard fidelity   | ANSI SQL                        | HiveQL
window functions        | yes                             | yes
optimized for           | latency                         | query throughput
execution mode          | intermediate data in memory     | spills intermediate data to the file system
fault tolerance         | no                              | yes
speed                   | about 2 times faster than Hive  | slower
data processing model   | push model                      | pull model
limitation              | limited by the maximum amount of memory each task can store |

Jenkins multiple pipelines

Jenkins architecture is fundamentally "Master + Agent": the master does the coordination and provides the GUI and API endpoints, and the agents execute the work.

Jenkins can run in distributed mode, whether for scale or to provide different tool environments, and we can launch multiple pipelines simultaneously; they run in parallel.

Here I will show an example of how to attach multiple agents to a Jenkins master. In my example, I use Docker to run the Jenkins master and Vagrant to create two machines that serve as agents.

Create agent machines with Vagrant

I created 2 machines with Vagrant using the Vagrantfile below; go to the directory containing this file and run the command:

vagrant up

[Vagrantfile]

Run the master via Docker

docker run -p 8080:8080 -p 50000:50000 jenkins/jenkins:lts

Copy the initial admin password into the Jenkins setup page, configure the user information, install the recommended plugins, and then we can access the Jenkins home page

Add agent

Go to Manage Jenkins / Manage Nodes and click the "New Node" button; give the node a name and check Permanent Agent

SSH connection

In the first example, we will add an agent via SSH.
Before doing that, I generated an SSH key inside the Docker container and copied id_rsa.pub to ~/.ssh/authorized_keys on the agent

ssh-keygen -t rsa -b 4096 -C "sdmj45@gmail.com"

Labels: group name of the agent
Host: agent IP
Credentials: configure the credential as below; we give the username vagrant and the private key generated in the Docker container

Launch agent:
We can click the "Launch agent" button to launch the agent

Java Web Start connection

In the second example, we will add an agent via Java Web Start:

Labels: group name of the agent

Launch agent:
We can launch the agent with "Run from agent command line": copy agent.jar to the agent machine and run the command, changing localhost:8080 to host_ip:8080
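
A hedged sketch of what that command looks like (the agent name, secret and work directory are placeholders copied from the Jenkins node page):

java -jar agent.jar \
    -jnlpUrl http://host_ip:8080/computer/agent-1/slave-agent.jnlp \
    -secret <secret-from-node-page> \
    -workDir /home/vagrant/jenkins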

Multiple pipeline build

I added a new test build that executes a simple command; I added "sleep 30" to keep the build running for a while

echo "hello slave"
sleep 30
echo "bye slave"

and then I can launch multiple pipeline builds as follows (as there are 2 executors on each agent, I can launch 4 builds simultaneously)

Kubernetes Deploy Example - Part 2

This is part 2 of the Kubernetes Deploy Example; in this example, we will use a YAML file to deploy nginx.

In this example, we have combined the Deployment part and the Service part in one file, so we can deploy them at the same time.

apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
---
# https://kubernetes.io/docs/concepts/services-networking/service/#defining-a-service
kind: Service
apiVersion: v1
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
    nodePort: 32000
  type: NodePort

Step 1: Apply the applications with the YAML file

kubectl apply -f ./deployment.yaml
deployment.apps "nginx-deployment" created
service "nginx-service" created

Step 2: Get the Kubernetes IP

kubectl get endpoints kubernetes

NAME ENDPOINTS AGE
kubernetes 192.168.99.100:8443 143d

Step 3: Access nginx

Open your browser and enter the Kubernetes node URL with the nginx port (we defined 32000 in the YAML file); you will see the nginx welcome page

http://192.168.99.100:32000