
Comparing changes

base repository: rockthejvm/spark-optimization, base: master
head repository: codebind26/spark-optimization, compare: master

Can’t automatically merge. Don’t worry, you can still create the pull request.

Commits on Aug 5, 2020

  1. scala recap (daniel-ciocirlan, d0b5058)
  2. spark recap (daniel-ciocirlan, d2f06f9)
  3. job anatomy (daniel-ciocirlan, 4ce35b9)
  4. query plans (daniel-ciocirlan, b2fdb10)
  5. 65e52ee
  6. 81f2a35
  7. e4af17a
  8. 76bb1a5
  9. af45822
  10. broadcast joins (daniel-ciocirlan, 21c5be2)
  11. broadcast joins (daniel-ciocirlan, 0a4500b)
  12. column pruning (daniel-ciocirlan, b5fa750)
  13. pre-partitioning (daniel-ciocirlan, 6b232aa)
  14. bucketing (daniel-ciocirlan, dc6c50f)
  15. f197439
  16. a14f62a
  17. RDD joins (daniel-ciocirlan, 84205f8)
  18. RDD cogrouping (daniel-ciocirlan, 841bcc2)
  19. RDD broadcast joins (daniel-ciocirlan, c93479a)
  20. RDD skewed joins (daniel-ciocirlan, e5fd1f8)
  21. 1e6117d
  22. reusing JVM objects (daniel-ciocirlan, 2369b8f)
  23. i2i transformations (daniel-ciocirlan, 2b57396)
  24. cfa40bd
  25. readme (daniel-ciocirlan, 366947c)

Commits on Sep 10, 2020

  1. spark version 3.0.1 (daniel-ciocirlan, e259b52)
82 changes: 82 additions & 0 deletions README.md
@@ -0,0 +1,82 @@
# The official repository for the Rock the JVM Spark Optimization with Scala course

Powered by [Rock the JVM!](rockthejvm.com)

This repository contains the code we wrote during [Rock the JVM's Spark Optimization with Scala](https://rockthejvm.com/course/spark-optimization) course. Unless explicitly mentioned otherwise, the code in this repository is exactly what was written on camera.

### Install and setup

- install [IntelliJ IDEA](https://jetbrains.com/idea)
- install [Docker Desktop](https://docker.com)
- either clone the repo or download as zip
- open with IntelliJ as an SBT project

When you open the project, the IDE will take care of downloading and applying the appropriate library dependencies.

To set up the dockerized Spark cluster we will be using in the course, do the following:

- open a terminal and navigate to `spark-cluster`
- run `build-images.sh` (if you don't have a bash terminal, just open the file and run each line one by one)
- run `docker-compose up`
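
In a bash terminal, starting from the repository root, these steps amount to:

```
cd spark-cluster
./build-images.sh
docker-compose up
```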

To interact with the Spark cluster, the folders `data` and `apps` inside the `spark-cluster` folder are mounted onto the Docker containers under `/opt/spark-data` and `/opt/spark-apps` respectively.

To run a Spark shell, first run `docker-compose up` inside the `spark-cluster` directory, then in another terminal, do

```
docker exec -it spark-cluster_spark-master_1 bash
```

and then

```
/spark/bin/spark-shell
```
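
Once inside the shell, you can read files from the mounted data folder. For example, assuming you have placed a (hypothetical) `movies.json` under `spark-cluster/data`:

```
val moviesDF = spark.read.json("/opt/spark-data/movies.json")
moviesDF.show()
```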

### How to use intermediate states of this repository

Start by cloning this repository and checking out the `start` tag:

```
git checkout start
```

### How to run an intermediate state

The repository was built while recording the lectures. I tagged the commit prior to each lecture, so you can easily go back to an earlier state of the repo!

The tags are as follows:

* `start`
* `1.1-scala-recap`
* `1.2-spark-recap`
* `2.2-spark-job-anatomy`
* `2.3-query-plans`
* `2.3-query-plans-exercises`
* `2.4-spark-ui`
* `2.5-spark-apis`
* `2.6-deploy-config`
* `3.1-join-mechanics`
* `3.2-broadcast-joins`
* `3.3-column-pruning`
* `3.4-prepartitioning`
* `3.5-bucketing`
* `3.6-skewed-joins`
* `4.1-rdd-joins`
* `4.2-cogroup`
* `4.3-rdd-broadcast`
* `4.4-rdd-skews`
* `5.1-rdd-transformations`
* `5.2-by-key-ops`
* `5.3-reusing-objects`
* `5.5-i2i-transformations`
* `5.6-i2i-transformations-exercises`

When you watch a lecture, you can `git checkout` the appropriate tag and the repo will go back to the exact code I had when I started the lecture.
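
For example, to load the repo state for the broadcast joins lecture:

```
git checkout 3.2-broadcast-joins
```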

### For questions or suggestions

If you have changes to suggest to this repo, you can:
- submit a GitHub issue
- tell me in the course Q/A forum
- submit a pull request!
2 changes: 1 addition & 1 deletion spark-cluster/docker/base/Dockerfile
@@ -3,7 +3,7 @@ LABEL author="Daniel Ciocirlan" email="daniel@rockthejvm.com"
LABEL version="0.2"

ENV DAEMON_RUN=true
-ENV SPARK_VERSION=3.0.0
+ENV SPARK_VERSION=3.0.1
ENV HADOOP_VERSION=2.7
ENV SCALA_VERSION=2.12.4
ENV SCALA_HOME=/usr/share/scala
64 changes: 64 additions & 0 deletions src/META-INF/MANIFEST.MF
@@ -0,0 +1,64 @@
Manifest-Version: 1.0
Class-Path: commons-compiler-3.0.15.jar hadoop-mapreduce-client-common
-2.7.4.jar hadoop-yarn-server-nodemanager-2.7.4.jar hadoop-yarn-api-2
.7.4.jar avro-1.8.2.jar avro-mapred-1.8.2-hadoop2.jar hadoop-mapreduc
e-client-jobclient-2.7.4.jar jackson-mapper-asl-1.9.13.jar scala-xml_
2.12-1.2.0.jar commons-compress-1.8.1.jar javassist-3.22.0-CR2.jar ha
doop-yarn-common-2.7.4.jar commons-httpclient-3.1.jar spark-catalyst_
2.12-3.0.0-preview2.jar jersey-common-2.29.1.jar jackson-core-2.10.0.
jar spark-tags_2.12-3.0.0-preview2.jar parquet-column-1.10.1.jar json
4s-scalap_2.12-3.6.6.jar javax.servlet-api-3.1.0.jar jsr305-3.0.2.jar
jackson-module-scala_2.12-2.10.0.jar metrics-graphite-4.1.1.jar metr
ics-jmx-4.1.1.jar leveldbjni-all-1.8.jar guice-3.0.jar curator-recipe
s-2.7.1.jar avro-ipc-1.8.2.jar hadoop-mapreduce-client-core-2.7.4.jar
jersey-hk2-2.29.1.jar spark-core_2.12-3.0.0-preview2.jar RoaringBitm
ap-0.7.45.jar hadoop-yarn-server-common-2.7.4.jar metrics-json-4.1.1.
jar jackson-core-asl-1.9.13.jar hadoop-annotations-2.7.4.jar pyrolite
-4.30.jar orc-shims-1.5.8.jar jakarta.inject-2.6.1.jar jetty-util-6.1
.26.jar httpcore-4.2.4.jar hk2-locator-2.6.1.jar xz-1.5.jar commons-m
ath3-3.4.1.jar commons-cli-1.2.jar gson-2.2.4.jar jsp-api-2.1.jar act
ivation-1.1.1.jar curator-framework-2.7.1.jar parquet-hadoop-1.10.1.j
ar hadoop-common-2.7.4.jar slf4j-api-1.7.28.jar jersey-container-serv
let-2.29.1.jar jetty-sslengine-6.1.26.jar commons-crypto-1.0.0.jar ao
palliance-repackaged-2.6.1.jar jakarta.ws.rs-api-2.1.6.jar jcl-over-s
lf4j-1.7.16.jar jackson-databind-2.10.0.jar osgi-resource-locator-1.0
.3.jar arrow-memory-0.15.1.jar aopalliance-1.0.jar orc-mapreduce-1.5.
8.jar kryo-shaded-4.0.2.jar commons-io-2.4.jar stax-api-1.0-2.jar par
quet-jackson-1.10.1.jar log4j-1.2.17.jar jersey-client-2.29.1.jar sna
ppy-java-1.1.7.3.jar parquet-format-2.4.0.jar flatbuffers-java-1.9.0.
jar metrics-core-4.1.1.jar slf4j-log4j12-1.7.25.jar xercesImpl-2.9.1.
jar chill-java-0.9.3.jar jakarta.validation-api-2.0.2.jar jakarta.ann
otation-api-1.3.5.jar jersey-server-2.29.1.jar jersey-container-servl
et-core-2.29.1.jar zstd-jni-1.4.4-3.jar jackson-annotations-2.10.0.ja
r objenesis-2.5.1.jar scala-parser-combinators_2.12-1.1.2.jar commons
-beanutils-1.7.0.jar ivy-2.4.0.jar json4s-core_2.12-3.6.6.jar commons
-net-3.1.jar oro-2.0.8.jar spark-launcher_2.12-3.0.0-preview2.jar ant
lr4-runtime-4.7.1.jar hadoop-mapreduce-client-app-2.7.4.jar hadoop-cl
ient-2.7.4.jar hk2-api-2.6.1.jar stream-2.9.6.jar commons-configurati
on-1.6.jar zookeeper-3.4.14.jar orc-core-1.5.8.jar xbean-asm7-shaded-
4.15.jar log4j-api-2.4.1.jar api-asn1-api-1.0.0-M20.jar curator-clien
t-2.7.1.jar protobuf-java-2.5.0.jar compress-lzf-1.0.3.jar jackson-ja
xrs-1.9.13.jar arrow-format-0.15.1.jar scala-library-2.12.4.jar spark
-unsafe_2.12-3.0.0-preview2.jar spark-sql_2.12-3.0.0-preview2.jar air
compressor-0.10.jar jline-0.9.94.jar minlog-1.3.0.jar lz4-java-1.7.0.
jar unused-1.0.0.jar chill_2.12-0.9.3.jar commons-text-1.6.jar py4j-0
.10.8.1.jar parquet-encoding-1.10.1.jar jackson-xc-1.9.13.jar hadoop-
mapreduce-client-shuffle-2.7.4.jar audience-annotations-0.5.0.jar jet
tison-1.1.jar netty-all-4.1.42.Final.jar jaxb-api-2.2.2.jar jersey-me
dia-jaxb-2.29.1.jar apacheds-kerberos-codec-2.0.0-M15.jar janino-3.0.
15.jar hadoop-yarn-client-2.7.4.jar arrow-vector-0.15.1.jar log4j-cor
e-2.4.1.jar hive-storage-api-2.6.0.jar guava-16.0.1.jar spotbugs-anno
tations-3.1.9.jar spark-sketch_2.12-3.0.0-preview2.jar xmlenc-0.52.ja
r json4s-ast_2.12-3.6.6.jar scala-reflect-2.12.4.jar hk2-utils-2.6.1.
jar spark-network-common_2.12-3.0.0-preview2.jar paranamer-2.8.jar ap
acheds-i18n-2.0.0-M15.jar jul-to-slf4j-1.7.16.jar commons-lang3-3.9.j
ar metrics-jvm-4.1.1.jar jackson-module-paranamer-2.10.0.jar hadoop-h
dfs-2.7.4.jar spark-network-shuffle_2.12-3.0.0-preview2.jar xml-apis-
1.3.04.jar json4s-jackson_2.12-3.6.6.jar htrace-core-3.1.0-incubating
.jar javax.inject-1.jar httpclient-4.2.5.jar hadoop-auth-2.7.4.jar co
mmons-codec-1.10.jar commons-collections-3.2.2.jar shims-0.7.45.jar s
park-kvstore_2.12-3.0.0-preview2.jar netty-3.10.6.Final.jar parquet-c
ommon-1.10.1.jar univocity-parsers-2.8.3.jar api-util-1.0.0-M20.jar c
ommons-lang-2.6.jar commons-digester-1.8.jar
Main-Class:

18 changes: 9 additions & 9 deletions src/main/scala/generator/DataGenerator.scala
@@ -32,17 +32,17 @@ object DataGenerator {
  // Laptop models generation - skewed data lectures
  /////////////////////////////////////////////////////////////////////////////////

-  val laptopModelsSet: Seq[LatopModel] = Seq(
-    LatopModel("Razer", "Blade"),
-    LatopModel("Alienware", "Area-51"),
-    LatopModel("HP", "Omen"),
-    LatopModel("Acer", "Predator"),
-    LatopModel("Asus", "ROG"),
-    LatopModel("Lenovo", "Legion"),
-    LatopModel("MSI", "Raider")
+  val laptopModelsSet: Seq[LaptopModel] = Seq(
+    LaptopModel("Razer", "Blade"),
+    LaptopModel("Alienware", "Area-51"),
+    LaptopModel("HP", "Omen"),
+    LaptopModel("Acer", "Predator"),
+    LaptopModel("Asus", "ROG"),
+    LaptopModel("Lenovo", "Legion"),
+    LaptopModel("MSI", "Raider")
  )

-  def randomLaptopModel(uniform: Boolean = false): LatopModel = {
+  def randomLaptopModel(uniform: Boolean = false): LaptopModel = {
    val makeModelIndex = if (!uniform && random.nextBoolean()) 0 else random.nextInt(laptopModelsSet.size) // 50% of the data is of the first kind
    laptopModelsSet(makeModelIndex)
  }
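
The `randomLaptopModel` generator above is deliberately skewed: with `uniform = false` (the default), roughly half of all draws land on the first model. A standalone sketch of the same idea (illustrative only, not part of this diff):

```scala
import scala.util.Random

object SkewDemo extends App {
  val random = new Random()
  val models = Seq("Blade", "Area-51", "Omen", "Predator", "ROG", "Legion", "Raider")

  // ~50% of draws pick index 0; the rest are uniform over all models
  def skewedPick(): String =
    if (random.nextBoolean()) models(0)
    else models(random.nextInt(models.size))

  val counts = (1 to 10000)
    .map(_ => skewedPick())
    .groupBy(identity)
    .map { case (model, picks) => (model, picks.size) }

  counts.toSeq.sortBy(-_._2).foreach(println) // the first model dominates
}
```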
2 changes: 1 addition & 1 deletion src/main/scala/generator/LaptopsDomain.scala
@@ -1,5 +1,5 @@
package generator

-case class LatopModel(make: String, model: String)
+case class LaptopModel(make: String, model: String)
case class Laptop(registration: String, make: String, model: String, procSpeed: Double)
case class LaptopOffer(make: String, model: String, procSpeed: Double, salePrice: Double)
113 changes: 113 additions & 0 deletions src/main/scala/part1recap/ScalaRecap.scala
@@ -0,0 +1,113 @@
package part1recap

import scala.concurrent.Future
import scala.util.{Failure, Success}

object ScalaRecap extends App {

  // values and variables
  val aBoolean: Boolean = false

  // expressions
  val anIfExpression = if(2 > 3) "bigger" else "smaller"

  // instructions vs expressions
  val theUnit = println("Hello, Scala") // Unit = "no meaningful value" = void in other languages

  // functions
  def myFunction(x: Int) = 42

  // OOP
  class Animal
  class Cat extends Animal
  trait Carnivore {
    def eat(animal: Animal): Unit
  }

  class Crocodile extends Animal with Carnivore {
    override def eat(animal: Animal): Unit = println("Crunch!")
  }

  // singleton pattern
  object MySingleton

  // companions
  object Carnivore

  // generics
  trait MyList[A]

  // method notation
  val x = 1 + 2
  val y = 1.+(2)

  // Functional Programming
  val incrementer: Int => Int = x => x + 1
  val incremented = incrementer(42)

  // map, flatMap, filter
  val processedList = List(1,2,3).map(incrementer)

  // Pattern Matching
  val unknown: Any = 45
  val ordinal = unknown match {
    case 1 => "first"
    case 2 => "second"
    case _ => "unknown"
  }

  // try-catch
  try {
    throw new NullPointerException
  } catch {
    case _: NullPointerException => "some returned value"
    case _: Throwable => "something else"
  }

  // Future
  import scala.concurrent.ExecutionContext.Implicits.global
  val aFuture = Future {
    // some expensive computation, runs on another thread
    42
  }

  aFuture.onComplete {
    case Success(meaningOfLife) => println(s"I've found $meaningOfLife")
    case Failure(ex) => println(s"I have failed: $ex")
  }

  // Partial functions
  val aPartialFunction: PartialFunction[Int, Int] = {
    case 1 => 43
    case 8 => 56
    case _ => 999
  }

  // Implicits

  // auto-injection by the compiler
  def methodWithImplicitArgument(implicit x: Int) = x + 43
  implicit val implicitInt = 67
  val implicitCall = methodWithImplicitArgument

  // implicit conversions - implicit defs
  case class Person(name: String) {
    def greet = println(s"Hi, my name is $name")
  }

  implicit def fromStringToPerson(name: String) = Person(name)
  "Bob".greet // fromStringToPerson("Bob").greet

  // implicit conversion - implicit classes
  implicit class Dog(name: String) {
    def bark = println("Bark!")
  }
  "Lassie".bark

  /*
    - local scope
    - imported scope
    - companion objects of the types involved in the method call
   */

}
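
To illustrate the implicit resolution scopes listed in the closing comment, here is a standalone sketch (not part of the course code) of the companion-object rule:

```scala
object ImplicitScopeDemo extends App {
  case class Box(value: Int)
  object Box {
    // lives in the companion object of Box, so it is in implicit scope
    // for any method call involving the Box type
    implicit val defaultBox: Box = Box(0)
  }

  def unwrap(implicit box: Box): Int = box.value

  println(unwrap) // the compiler finds Box.defaultBox; prints 0
}
```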