2021-01-17

Python-a04-List,Tuple,Set

This article shows examples of using List, Tuple, and Set in Python.

## List 
# List basics
    courses = ['History', 'Math', 'Physics', 'CompSci']
    print(courses)
    print(len(courses))
    print(courses[0])
    print(courses[-1]) # negative indexes count from the end
    print(courses[4]) # IndexError: list index out of range
    print(courses[0:2]) # the start index is included, the end index is not
    print(courses[:2]) # from the start up to, but not including, index 2
    print(courses[2:]) # from index 2 to the end

# Add elements to List
    courses1 = ['History', 'Math', 'Physics', 'CompSci']    
    courses1.append('Art')
    print(courses1) # ['History', 'Math', 'Physics', 'CompSci', 'Art']
    
    courses2 = ['History', 'Math', 'Physics', 'CompSci']    
    courses2.insert(0, 'Art')
    print(courses2) # ['Art', 'History', 'Math', 'Physics', 'CompSci']
    
    
    courses3 = ['History', 'Math', 'Physics', 'CompSci'] 
    courses4 = ['Art', 'Education']
    courses3.insert(0, courses4)
    print(courses3) # [['Art', 'Education'], 'History', 'Math', 'Physics', 'CompSci'] 
    
    courses5 = ['History', 'Math', 'Physics', 'CompSci'] 
    courses6 = ['Art', 'Education']
    courses5.extend(courses6)
    print(courses5) # ['History', 'Math', 'Physics', 'CompSci', 'Art', 'Education'] 
    
# Remove elements from List
    courses = ['History', 'Math', 'Physics', 'CompSci']
    courses.remove('Math')
    print(courses) #['History', 'Physics', 'CompSci']
    
    courses1 = ['History', 'Math', 'Physics', 'CompSci']
    courses1.pop() # removes the last element
    print(courses1) #['History', 'Math', 'Physics']
    
    
    courses2 = ['History', 'Math', 'Physics', 'CompSci']
    popped = courses2.pop() # pop() also returns the removed element
    print(popped) # CompSci
    
# Order in List       
    courses = ['History', 'Math', 'Physics', 'CompSci']
    courses.reverse()
    print(courses) #['CompSci', 'Physics', 'Math', 'History']
    
    courses1 = ['History', 'Math', 'Physics', 'CompSci']
    courses1.sort()
    print(courses1) #['CompSci', 'History', 'Math', 'Physics']
     
    courses2 = ['History', 'Math', 'Physics', 'CompSci']
    courses2.sort(reverse=True)
    print(courses2) #['Physics', 'Math', 'History', 'CompSci']
    
 
    courses3 = ['History', 'Math', 'Physics', 'CompSci']
    courses4 = sorted(courses3)
    print(courses4) #['CompSci', 'History', 'Math', 'Physics']
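
The difference between `sort()` and `sorted()` is easy to miss, so here is a small sketch of it. The `key=len` argument is an addition that is not used elsewhere in this article; it only shows that the comparison can be customized.

```python
courses = ['History', 'Math', 'Physics', 'CompSci']

# sorted() builds a new list and leaves the original untouched
by_name = sorted(courses)

# key= changes what is compared; here it is the length of each string
by_length = sorted(courses, key=len)

# list.sort() reorders the list in place and returns None
result = courses.sort()

print(by_name)    # ['CompSci', 'History', 'Math', 'Physics']
print(by_length)  # ['Math', 'History', 'Physics', 'CompSci']
print(result)     # None
```

Because `sort()` returns `None`, writing `courses = courses.sort()` loses the list.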
    
# Min, Max, Sum in List  
    nums = [1,5,2,4,3]
    print(min(nums)) #1
    print(max(nums)) #5
    print(sum(nums)) #15
    
# index in List  
    courses = ['History', 'Math', 'Physics', 'CompSci']
    print(courses.index('CompSci')) # 3
    print(courses.index('Art')) # ValueError: 'Art' is not in list
    print('Art' in courses) # False
    print('Math' in courses) # True
    
    for item in courses:
        print(item)  # History Math Physics CompSci
    
    for index, course in enumerate(courses):
        print(index, course)  # 0 History  1 Math  2 Physics  3 CompSci
    
    for index, course in enumerate(courses, start=1):
        print(index, course)  # 1 History  2 Math  3 Physics  4 CompSci
    
# join in List
    courses = ['History', 'Math', 'Physics', 'CompSci']
    course_str = ', '.join(courses)
    new_courses = course_str.split(', ')
    print(course_str) # History, Math, Physics, CompSci
    print(new_courses) # ['History', 'Math', 'Physics', 'CompSci']
    
## Tuple
# List is mutable, Tuple is immutable: we can't add, remove, or modify elements in a Tuple
    list1 = ['History', 'Math', 'Physics', 'CompSci']
    list2 = list1
    print(list1) #['History', 'Math', 'Physics', 'CompSci']
    print(list2) #['History', 'Math', 'Physics', 'CompSci']
    
    list1[0] = 'Art'
    print(list1) #['Art', 'Math', 'Physics', 'CompSci']
    print(list2) #['Art', 'Math', 'Physics', 'CompSci']
    
    tuple1 = ('History', 'Math', 'Physics', 'CompSci')
    tuple2 = tuple1
    print(tuple1) #('History', 'Math', 'Physics', 'CompSci')
    print(tuple2) #('History', 'Math', 'Physics', 'CompSci')
    
    tuple1[0] = 'Art'  # TypeError: 'tuple' object does not support item assignment
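
A short sketch of how to work with these two behaviors: copying a list to avoid the shared-reference effect, and building a new tuple instead of modifying one. The variable names are only for illustration.

```python
list1 = ['History', 'Math']
list_copy = list1.copy()   # an independent (shallow) copy, not a second name for list1
list1[0] = 'Art'
print(list1)      # ['Art', 'Math']
print(list_copy)  # ['History', 'Math']

tuple1 = ('History', 'Math')
try:
    tuple1[0] = 'Art'
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
print(error_name)  # TypeError

# "Changing" a tuple means building a new one from the old parts
tuple2 = ('Art',) + tuple1[1:]
print(tuple2)  # ('Art', 'Math')
```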
    
    
## Set
# Set values are unordered and unique (duplicates are dropped)
    set1 = {'History', 'Math', 'Physics', 'CompSci'}
    print(set1) #{'Math', 'History', 'Physics', 'CompSci'} order may change
    
    set2 = {'History', 'Math', 'Physics', 'CompSci', 'Math'}
    print(set2) #{'Math', 'History', 'Physics', 'CompSci'} order may change
    
    set3 = {'History', 'Math', 'Physics', 'CompSci'}
    print('Math' in set3) # True; a Set is optimized for membership checks
     
    # Set makes it easy to find what two sets have in common and how they differ
    set4 = {'History', 'Math', 'Physics', 'CompSci'}
    set5 = {'History', 'Math', 'Art', 'Design'}
    print(set4.intersection(set5)) #{'History', 'Math'}
    print(set4.difference(set5)) #{'Physics', 'CompSci'}
    print(set4.union(set5)) #{'History', 'Math', 'Physics', 'CompSci', 'Art', 'Design'}
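
The same operations can also be written with operators. This sketch wraps the results in `sorted()` only to get a stable order for printing; the symmetric difference (`^`) and the de-duplication line are additions not covered above.

```python
set4 = {'History', 'Math', 'Physics', 'CompSci'}
set5 = {'History', 'Math', 'Art', 'Design'}

common = set4 & set5        # same as set4.intersection(set5)
only_in_4 = set4 - set5     # same as set4.difference(set5)
either = set4 | set5        # same as set4.union(set5)
in_one_only = set4 ^ set5   # symmetric difference: in exactly one of the two sets

print(sorted(common))       # ['History', 'Math']
print(sorted(only_in_4))    # ['CompSci', 'Physics']
print(sorted(in_one_only))  # ['Art', 'CompSci', 'Design', 'Physics']

# A set is a quick way to drop duplicates from a list
unique_courses = sorted(set(['Math', 'Art', 'Math']))
print(unique_courses)       # ['Art', 'Math']
```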
    
## How to create empty List, Tuple, Set
    empty_list = []    
    empty_list = list()    
     
    empty_tuple = ()    
    empty_tuple = tuple()    
    
    empty_set = {} # Wrong: {} creates an empty dict, not a set
    empty_set = set()
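
A quick check of what each literal really creates. The one-element tuple case is a related pitfall added here for illustration.

```python
print(type({}))     # <class 'dict'>
print(type(set()))  # <class 'set'>
print(type(()))     # <class 'tuple'>
print(type([]))     # <class 'list'>

# A one-element tuple needs a trailing comma; parentheses alone are not enough
single = ('Math',)
not_a_tuple = ('Math')
print(type(single))       # <class 'tuple'>
print(type(not_a_tuple))  # <class 'str'>

# Empty containers are falsy, which gives a simple emptiness test
is_empty = not set()
print(is_empty)  # True
```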

2021-01-15

python►a-basics

Python-a03-Numeric

This article shows examples of using Integer and Float in Python.

# different types
    num1 = 4
    print(type(num1))
    
    num2 = 2.1
    print(type(num2))
      
# arithmetic operators
    print(5 + 2)
    print(5 - 2)
    print(5 * 2)
    print(5 / 2)
    print(5 // 2) # floor division
    print(3 ** 2) # exponent
    print(3 % 2) # modulus
    print(abs(-3)) # absolute
    print(round(3.6)) # round number 4
    print(round(3.75, 1)) # round number 3.8
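
A few results of these operators are worth seeing explicitly. This sketch adds negative floor division, `divmod`, rounding of exact halves, and float comparison with `math.isclose`, none of which appear in the list above.

```python
import math

print(5 / 2)         # 2.5, true division always returns a float
print(5 // 2)        # 2, floor division rounds down
print(-5 // 2)       # -3, "down" means toward negative infinity
print(divmod(5, 2))  # (2, 1), quotient and remainder together

# round() sends exact halves to the nearest even integer
print(round(2.5))    # 2
print(round(3.5))    # 4

# Floats are binary approximations, so compare them with a tolerance
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True
```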
    
# calculation order
    print(3 + 2 * 2)
    print((3+2)*2)   
     
# increment
    num1 = 1
    num1 = num1 + 1
    print(num1)
    
    num2 = 1
    num2 += 1
    print(num2)

# comparisons
    print(3 == 2) # equal
    print(3 != 2) # not equal
    print(3 > 2)  # greater than
    print(3 < 2)  # less than
    print(3 >= 2) # greater or equal
    print(3 <= 2) # less or equal

# cast
    num1 = '10'
    num2 = '20'
    print(num1 + num2) # 1020

    num3 = int(num1)
    num4 = int(num2)
    print(num3 + num4) # 30
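
`int()` raises a `ValueError` when the string is not a valid integer. A minimal sketch of a safe conversion follows; `to_int` is a hypothetical helper written for this example, not a built-in.

```python
def to_int(text, default=None):
    """Convert text to int, returning default when the text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return default

print(to_int('10') + to_int('20'))  # 30
print(to_int('3.7'))                # None, int() rejects decimal strings
print(int(float('3.7')))            # 3, convert to float first, then truncate
print(str(30) + '!')                # 30!, casting back to a string
```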

2021-01-14

python►a-basics

Python-a02-String

This article shows examples of using String in Python.

# 1
   message1 = 'Hello World'
   print(message1)

# 2   
   message2 = "hello 'Jimmy'"
   print(message2)
   
# 3   
   message3 = 'Hello World'
   print(len(message3))
   
# 4   
   message4 = 'Hello World'
   print(message4[1])
   print(message4[-1])
   print(message4[0:5]) # the start index is included, the end index is not
   print(message4[:5]) # from the start up to, but not including, index 5
   print(message4[6:]) # from index 6 to the end
   print(message4[11]) # IndexError: string index out of range

# 5   
   message5 = 'Hello World'
   print(message5.lower())
   print(message5.upper())
   print(message5.count('l')) # count l's number in the message
   print(message5.find('World')) # return 6
   print(message5.find('Unknown')) # return -1

# 6   
   message6 = 'Hello World'
   # replace() returns a new string; to keep one variable: message6 = message6.replace('World', 'You')
   new_message6 = message6.replace('World', 'You')
   print(new_message6)
   
# 7   
   greeting='Hello'
   name='Bob'
   message7 = greeting + ', ' + name
   print(message7)  
   
   message8 = '{}, {}. Welcome!'.format(greeting, name)
   print(message8)
   
   message9 = f'{greeting}, {name}. Welcome!' # f-strings need Python 3.6 or later
   print(message9)
   
   message10 = f'{greeting}, {name.upper()}. Welcome!' # expressions are allowed inside {}
   print(message10)
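
f-strings also accept a format specification after a colon. A short sketch with made-up values (`price` and `count` are only for illustration):

```python
name = 'Bob'
price = 3.14159
count = 42

two_decimals = f'{price:.2f}'   # fixed number of decimals
zero_padded = f'{count:05d}'    # pad with zeros to width 5
right_aligned = f'{name:>6}|'   # right-align in a field of width 6
thousands = f'{1234567:,}'      # thousands separator
with_quotes = f'{name!r}'       # repr() of the value

print(two_decimals)   # 3.14
print(zero_padded)    # 00042
print(right_aligned)  #    Bob|
print(thousands)      # 1,234,567
print(with_quotes)    # 'Bob'
```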
   
# 8
   help(str)  # help() prints the documentation itself; wrapping it in print() only adds "None"
   help(str.lower)

2021-01-11

python►a-basics

Python-a01-How to install embedded python on windows

Install an embedded Python on Windows

Download the zip file from https://www.python.org/downloads/windows/, unzip it, and add the folder to the PATH environment variable.
Download get-pip.py from https://bootstrap.pypa.io/get-pip.py
Run the command:

    python get-pip.py

Add the pip path to the PATH environment variable.
Running pip -V then fails with this error:

    ModuleNotFoundError: No module named 'pip'

To fix this, open python38._pth and add the following paths to it:

    C:\Dev\softwares\python-3.8.0-embed-amd64\Scripts
    C:\Dev\softwares\python-3.8.0-embed-amd64\Lib
    C:\Dev\softwares\python-3.8.0-embed-amd64\Lib\site-packages
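
One way to confirm that the edited `._pth` file was picked up is to look at `sys.path` from the embedded interpreter. The helper below is a hypothetical sketch (`missing_from_sys_path` is not part of Python); the comparison ignores case and trailing slashes, which is an assumption suited to Windows paths.

```python
import sys

def missing_from_sys_path(expected_dirs, search_path=None):
    """Return the expected directories that are absent from the module search path."""
    if search_path is None:
        search_path = sys.path
    normalized = {p.rstrip("\\/").lower() for p in search_path}
    return [d for d in expected_dirs if d.rstrip("\\/").lower() not in normalized]

expected = [
    r"C:\Dev\softwares\python-3.8.0-embed-amd64\Lib",
    r"C:\Dev\softwares\python-3.8.0-embed-amd64\Lib\site-packages",
]
# An empty list means every expected directory is on sys.path
print(missing_from_sys_path(expected))
```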

2020-10-02

gcp►b-data-engineer

b06-01-Cloud-PubSub-HelloWorld

In this article I will show you how to publish and receive messages in Pub/Sub with Java.

create topic

    gcloud pubsub topics create my-topic

create subscription to this topic

    gcloud pubsub subscriptions create my-sub --topic my-topic

git clone project into cloud shell

    git clone https://github.com/googleapis/java-pubsub.git

go into the sample (from inside the cloned java-pubsub directory)

    cd samples/snippets/

modify PublisherExample.java and SubscribeAsyncExample.java to put the right project id, topic id and subscription id

compile project

    mvn clean install -DskipTests

run subscriber

    mvn exec:java -Dexec.mainClass="pubsub.SubscribeAsyncExample"

run publisher in another terminal and observe subscriber

    mvn exec:java -Dexec.mainClass="pubsub.PublisherExample"

2020-08-20

gcp►kubernetes

GCP-Kubernetes-Manually

In this article, we will show you how to deploy a web application with Kubernetes on GCP.

run nginx as a daemon (detached mode)

    docker run -d -p 8080:80 nginx:latest

change index.html in nginx container

    docker cp index.html 607de9f58775:/usr/share/nginx/html/

create docker image from the new container version

    docker commit 607de9f58775 daccfrance:version1

create tag of docker image with project id

    docker tag daccfrance:version1 eu.gcr.io/kube-test-286917/daccfrance:version1

push docker image to gcp container registry

    docker push eu.gcr.io/kube-test-286917/daccfrance:version1

kill docker container

    docker container kill <container_id>

set the default compute zone

    gcloud config set compute/zone europe-west1-b

create a kubernetes cluster

    gcloud container clusters create gk-cluster --num-nodes=1

get authentication credentials for the cluster

    gcloud container clusters get-credentials gk-cluster

create kubernetes deployment

    kubectl create deployment web-server --image=eu.gcr.io/kube-test-286917/daccfrance:version1

create kubernetes service

    kubectl expose deployment web-server --type LoadBalancer --port 80 --target-port 80

get kubernetes pods

    kubectl get pods

get kubernetes service

    kubectl get service web-server

2020-08-06

spark skewness

Here we have an example of key salting to resolve the problem of skewness in spark.

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object SparkSkewnessExample extends App {

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("SparkSkewnessExample")

  val spark = SparkSession
    .builder()
    .config(conf)
    .getOrCreate()

  import spark.implicits._

  // DataFrame 1
  val df1 = Seq(
    ("a", "12"),
    ("a", "31"),
    ("a", "24"),
    ("a", "0"),
    ("a", "24"),
    ("b", "45"),
    ("c", "24")
  ).toDF("id", "value")
  df1.show(10,false)

  //DataFrame2
  val df2 = Seq(
    ("a", "45"),
    ("b", "575"),
    ("c", "54")
  ).toDF("id", "value")
  df2.show(10,false)

  // eliminate skewness
  def eliminateSkewness(leftDf: DataFrame, leftCol: String, rightDf: DataFrame) = {
    val df1 = leftDf
      .withColumn(leftCol, concat(
        leftDf.col(leftCol), lit("_"), lit(floor(rand(123456) * 10))))

    val df2 = rightDf
      .withColumn("saltCol",
        explode(
          // salts 0..9 match floor(rand * 10) on the left side
          array((0 until 10).map(lit(_)): _ *)
        ))

    (df1, df2)
  }

  val (df3, df4) = eliminateSkewness(df1, "id", df2)

  df3.show(100, false)
  df4.show(100, false)

  //join after eliminating data skewness
    df3.join(
      df4,
      df3.col("id") <=> concat(df4.col("id"), lit("_"), df4.col("saltCol"))
    ).drop("saltCol")
      .show(100,false)
}
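
The idea behind the Spark code can also be followed in plain Python, without Spark. This is only a sketch of the technique on lists of tuples: the helper names (`salt_left`, `explode_right`, `join`) are made up for this example, and the salt count of 10 mirrors the Scala code above.

```python
import random
from collections import Counter

NUM_SALTS = 10  # number of salt buckets, same as in the Scala example

def salt_left(rows, num_salts, rng):
    """Append a random salt to each key: ('a', v) becomes ('a_7', v)."""
    return [(f"{key}_{rng.randrange(num_salts)}", value) for key, value in rows]

def explode_right(rows, num_salts):
    """Replicate each row once per salt so every salted key finds its match."""
    return [(f"{key}_{salt}", value) for key, value in rows for salt in range(num_salts)]

def join(left, right):
    """Inner join on the first element of each tuple."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in lookup.get(key, [])]

left = [("a", "12"), ("a", "31"), ("a", "24"), ("a", "0"),
        ("a", "24"), ("b", "45"), ("c", "24")]
right = [("a", "45"), ("b", "575"), ("c", "54")]

rng = random.Random(123456)
salted = join(salt_left(left, NUM_SALTS, rng), explode_right(right, NUM_SALTS))
plain = join(left, right)

# Removing the salt gives back exactly the rows of the plain join
unsalted = [(key.rsplit("_", 1)[0], lv, rv) for key, lv, rv in salted]
print(sorted(unsalted) == sorted(plain))  # True

# The hot key "a" is now spread over several salted keys
print(Counter(key for key, _, _ in salted))
```

The cost of salting is visible in `explode_right`: the small side grows by a factor of `NUM_SALTS`, which is why the technique suits a large skewed table joined to a small one.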

2020-06-16

gcp►b-data-engineer

b10-Machine Learning

What is Machine Learning ?

Process of combining inputs to produce useful predictions

How it works

Train a model with examples (example = input + label)
Training = adjust model to learn relationship between features and labels
Feature = input variables
Inference = apply trained model to unlabeled examples
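
A minimal sketch of these terms in plain Python: a one-feature linear model trained by gradient descent, then used for inference. The data, learning rate, and epoch count are made-up values for illustration, and the model choice is an assumption, not something these notes prescribe.

```python
def train(features, labels, epochs=2000, learning_rate=0.05):
    """Adjust weight and bias so that weight * x + bias approximates the labels."""
    weight, bias = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):  # one epoch = one pass over the training data
        errors = [weight * x + bias - y for x, y in zip(features, labels)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, features))
        grad_b = 2 / n * sum(errors)
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias

def predict(x, weight, bias):
    """Inference: apply the trained model to an unlabeled example."""
    return weight * x + bias

features = [1.0, 2.0, 3.0, 4.0]  # inputs
labels = [3.0, 5.0, 7.0, 9.0]    # generated from y = 2x + 1

weight, bias = train(features, labels)
print(round(weight, 3), round(bias, 3))  # close to 2 and 1
print(round(predict(5.0, weight, bias), 3))  # close to 11
```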

Learning types

Supervised learning
- Regression - Continuous, numeric variables
- Classification - categorical variables: yes/no
Unsupervised Learning
- Clustering - finding patterns
- Data is not labeled or categorized
Reinforcement learning
- Use positive/negative reinforcement to complete a task
  - Complete a maze, learn chess

Neural network

Neural network - model composed of layers, consisting of neurons
Neuron - node, combines input values and creates one output value
Feature - input variables used to make predictions
Hidden layer - set of neurons operating from same input set
Feature engineering - deciding which features to use in a model
Epoch - single pass through training dataset
Deep and Wide in neural network
- Wide - memorization: many features
- Deep - generalization: many hidden layers
- Deep and Wide - both: good for recommendation engines
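
To make the neuron and hidden layer definitions concrete, here is a small sketch of a forward pass. The sigmoid activation and all weights are assumptions chosen for illustration; real networks learn their weights during training.

```python
import math

def neuron(inputs, weights, bias):
    """Combine the input values into one output: weighted sum, then sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    """A layer is a set of neurons that all read the same inputs."""
    return [neuron(inputs, weights, bias) for weights, bias in zip(weight_rows, biases)]

features = [0.5, -1.0]

# Hidden layer with three neurons, then a single output neuron
hidden = layer(features, [[0.1, 0.2], [-0.3, 0.4], [0.0, 0.0]], [0.0, 0.1, 0.0])
output = neuron(hidden, [0.5, -0.5, 1.0], 0.0)

print(len(hidden))  # 3
print(hidden[2])    # 0.5, a neuron with zero weights and bias outputs sigmoid(0)
print(0.0 < output < 1.0)  # True
```

Adding more layers between `features` and `output` makes the network deeper; feeding in more features makes it wider.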

What is Overfitting?

The model is ‘overfitted’ to the training data: it cannot generalize to new data

Causes of Overfitting

Not enough training data
Too many features
Model fitted to unnecessary features unique to training data: “noise”

Solving Overfitting

more data
make model less complex
remove “noise”
- increase “regularization” parameters
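
What increasing a regularization parameter does can be seen on a tiny example. This sketch assumes an L2 penalty on a one-weight model `y ~ w * x`, where the best weight has the closed form `sum(x*y) / (sum(x*x) + l2)`; the data and the name `fit_weight` are made up for illustration.

```python
def fit_weight(xs, ys, l2=0.0):
    """Least-squares weight for y ~ w * x with an L2 penalty of l2 * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

no_penalty = fit_weight(xs, ys)             # fits the data exactly
some_penalty = fit_weight(xs, ys, l2=14.0)  # weight is pulled toward zero
strong_penalty = fit_weight(xs, ys, l2=100.0)

print(no_penalty)    # 2.0
print(some_penalty)  # 1.0
print(strong_penalty < some_penalty)  # True
```

A larger penalty gives a smaller weight, that is, a simpler model that follows the training data less closely.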

AI Platform

Fully managed TensorFlow platform
Distributed training and predictions
Hyperparameter tuning with Hypertune

How AI Platform works

Master - manages other nodes
Workers - each works on a portion of the training job
Parameter servers - coordinate shared model state between workers

2020-06-12

gcp►b-data-engineer

b09-BigQuery

What is BigQuery ?

Fully managed data warehouse
- Near real-time analysis of petabyte-scale databases
Serverless (no ops)
Auto scaling
Both storage and analysis
Interact with SQL

How BigQuery works

Columnar data store
It does not update existing records
Not transactional
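
The columnar idea can be pictured with plain Python structures. This is only a toy illustration of row layout versus column layout with made-up data; it says nothing about how BigQuery is implemented internally.

```python
# Row-oriented: one record per row, every field stored together
rows = [
    {"name": "alice", "country": "FR", "amount": 10},
    {"name": "bob", "country": "US", "amount": 25},
    {"name": "carol", "country": "FR", "amount": 7},
]

# Column-oriented: one list per column
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate over one column touches only that column's values
total_amount = sum(columns["amount"])

print(columns["country"])  # ['FR', 'US', 'FR']
print(total_amount)        # 42
```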

Structure

Dataset: contains tables/views
Table: collections of columns
Job: long running action/query

IAM

can control by project, dataset, view
cannot control at table level

2020-06-05

gcp►b-data-engineer

b08-Cloud Dataproc

Dataproc

Hadoop, Spark, Hive, Pig
Lift and shift to GCP

Map Reduce

Converting from HDFS to Google Cloud Storage

Copy data to GCS
- Install connector or copy manually
Update file prefix in scripts
- From hdfs:// to gs://
Use Dataproc and run against/output to GCS
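
The prefix update can be scripted. A minimal sketch, assuming the directory layout stays the same inside the bucket; `to_gcs_uri` and the bucket name `my-bucket` are hypothetical.

```python
from urllib.parse import urlparse

def to_gcs_uri(path, bucket):
    """Rewrite an hdfs:// URI to gs://<bucket>/..., leaving other paths unchanged."""
    parsed = urlparse(path)
    if parsed.scheme != "hdfs":
        return path
    # Drop the namenode host and port, keep the file path
    return f"gs://{bucket}{parsed.path}"

print(to_gcs_uri("hdfs://namenode:8020/data/in.csv", "my-bucket"))  # gs://my-bucket/data/in.csv
print(to_gcs_uri("hdfs:///data/out", "my-bucket"))                  # gs://my-bucket/data/out
print(to_gcs_uri("gs://other/x", "my-bucket"))                      # gs://other/x
```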

Dataproc performance optimization

Keep your data close to your cluster
- Place Dataproc cluster in same region as storage bucket
Larger persistent disk = better performance
- Using SSD over HDD
Allocate more VMs
- Use preemptible VM to save on costs

MA Jian's Blog

Enthusiasm for developing
