This article is a step-by-step guide to writing, building, and running a standalone Scala app on Spark 1.0. It assumes that you have a working installation of Spark 1.0.
Install SBT
We’ll use sbt (Simple Build Tool), the Scala build tool, to build our standalone app. To install sbt, follow these steps.
Note: Replace parambirs:parambirs with your own username:groupname combination.
$ cd ~/Downloads
$ wget http://dl.bintray.com/sbt/native-packages/sbt/0.13.2/sbt-0.13.2.tgz
$ tar zxvf sbt-0.13.2.tgz
$ sudo mv sbt /usr/local
$ cd /usr/local
$ sudo chown -R parambirs:parambirs sbt
$ vim ~/.bashrc
Add the following at the end of the file
export PATH=$PATH:/usr/local/sbt/bin
Reload the bash environment so that sbt is available on the PATH
$ source ~/.bashrc
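Before moving on, it can help to confirm that the shell actually finds sbt. The following is a minimal sketch; check_on_path is a hypothetical helper name introduced here for illustration, not part of sbt.

```shell
# Hypothetical helper: report whether a command is reachable through PATH.
check_on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found on PATH" >&2
    return 1
  fi
}

# If sbt is missing, the export line in ~/.bashrc is the first thing to re-check.
check_on_path sbt || echo "re-check the export line in ~/.bashrc"
```

If sbt is reported as missing, open a new terminal or run the source command again before continuing.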
Download some sample data for our app and add it to HDFS
$ cd ~/Downloads
$ wget http://www.gutenberg.org/ebooks/1342.txt.utf-8
$ hadoop dfs -put 1342.txt.utf-8 /user/parambirs/pride.txt
$ hadoop dfs -ls /user/parambirs
-rw-r--r-- 1 parambirs supergroup 717569 2014-05-20 13:56 /user/parambirs/pride.txt
Write a simple app
$ cd ~
$ mkdir -p SimpleApp/src/main/scala
$ cd SimpleApp
$ vim src/main/scala/SimpleApp.scala
Add the following code to the SimpleApp.scala file. The code loads the text of Jane Austen’s “Pride and Prejudice” and counts the lines that contain the letter “a” and the lines that contain the letter “b”. It then prints both counts to the console.
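/*** SimpleApp.scala ***/
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Input file uploaded to HDFS in the previous step
    val logFile = "hdfs://localhost:9000/user/parambirs/pride.txt"
    val conf = new SparkConf().setAppName("Simple App")
    val sc = new SparkContext(conf)
    // Read the file with at least 2 partitions and cache it, since it is scanned twice
    val logData = sc.textFile(logFile, 2).cache()
    // Count the lines containing "a" and the lines containing "b"
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    // Release the resources held by the context
    sc.stop()
  }
}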
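The two counts can also be computed locally as a sanity check, since counting lines that contain a substring is what grep -c does. This is only a sketch: count_lines_with is a hypothetical helper, and the path assumes the book is still in ~/Downloads from the earlier download step.

```shell
# Hypothetical helper: count the lines of a file that contain a fixed,
# case-sensitive substring, mirroring filter(line => line.contains(s)).count().
count_lines_with() {
  grep -cF -- "$1" "$2"
}

# Assumed location of the local copy downloaded earlier.
book="$HOME/Downloads/1342.txt.utf-8"
if [ -f "$book" ]; then
  echo "Lines with a: $(count_lines_with a "$book"), Lines with b: $(count_lines_with b "$book")"
fi
```

If the numbers printed here differ from what the Spark app reports, the file in HDFS is probably not the same as the local copy.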
Create sbt build file
$ vim build.sbt
Add the following content to the build file
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0"
The final structure of the project folder should look like this
$ find .
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
Building the App
Before we build our app, we need to publish the Spark 1.0 artifacts to the local Ivy repository (which is where publish-local puts them), because our build file depends on the 1.0.0-SNAPSHOT version of spark-core.
$ cd /usr/local/spark
$ sbt/sbt publish-local
Now we can build our app
$ cd ~/SimpleApp
$ sbt package
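The jar that sbt package writes is named after the values in the build file, which is where the jar path passed to spark-submit comes from. The sketch below reproduces that mapping under one assumption: that sbt normalizes the project name by lowercasing it and replacing non-alphanumeric characters with hyphens. jar_path is a hypothetical helper, not an sbt command.

```shell
# Hypothetical helper: derive the expected jar path from the build settings.
# Assumption: sbt lowercases the name and replaces runs of non-alphanumeric
# characters with a single hyphen.
jar_path() {
  name=$1
  version=$2
  scala_version=$3
  normalized=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9][^a-z0-9]*/-/g')
  scala_binary=${scala_version%.*}   # 2.10.4 -> 2.10
  echo "target/scala-$scala_binary/${normalized}_$scala_binary-$version.jar"
}

jar_path "Simple Project" "1.0" "2.10.4"
# prints target/scala-2.10/simple-project_2.10-1.0.jar
```

If the build succeeds but the jar is not at this path, listing the target directory shows the name sbt actually used.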
Running the app on YARN
We’ll use YARN cluster mode (via spark-submit) to run our app. This sends our app, along with the Spark assembly, to the YARN cluster, where the code is executed remotely.
$ cd /usr/local/spark
$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar \
    HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop \
    ./bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class SimpleApp \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    ~/SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
Checking the output
The output of the program is stored in the YARN logs. On my machine, the application ID for this app was application_1400566266818_0007, so the output could be read with the following command
$ cat /usr/local/hadoop/logs/userlogs/application_1400566266818_0007/container_1400566266818_0007_01_000001/stdout
Lines with a: 10559, Lines with b: 5874
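The log path above follows a pattern: the container directory reuses the numeric part of the application ID. The sketch below builds the path for any application ID. It assumes the same log root as on my machine and that the driver ran in attempt 01, container 000001, as in the example; driver_stdout_path is a hypothetical helper.

```shell
# Hypothetical helper: build the path to the first container's stdout.
# Assumptions: $1 is the NodeManager userlogs directory, and the driver ran
# in attempt 01, container 000001 (true for the run shown above).
driver_stdout_path() {
  log_root=$1
  app_id=$2
  suffix=${app_id#application_}   # e.g. 1400566266818_0007
  echo "$log_root/$app_id/container_${suffix}_01_000001/stdout"
}

driver_stdout_path /usr/local/hadoop/logs/userlogs application_1400566266818_0007
```

The printed path can be passed straight to cat. On clusters where log aggregation is enabled, yarn logs -applicationId followed by the application ID is an alternative way to read the same output.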