Apache Spark Introduction
Apache Spark is an open-source cluster computing framework. It is designed for general-purpose big data processing with high performance and is well suited to machine learning algorithms. In this post we will cover basic aspects such as its components, installation, and the Python console.
Spark Components
Spark Core and Resilient Distributed Datasets
Spark Core contains the basic functionality of the project. It provides distributed task dispatching, scheduling, and basic I/O. Spark Core is also where the Resilient Distributed Datasets (RDD) API lives. This API hides the complexity inherent to distributed systems and exposes the functionality through client APIs in Java, Python, and Scala. From the point of view of an application, processed data is not scattered across several nodes: the data appears to live on a single machine and can be accessed in much the same way as a local file.
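As a minimal sketch of what the RDD API feels like from the Python shell (the names below are illustrative and assume the SparkContext sc that pyspark provides):

# a tiny RDD example, assuming an existing SparkContext named sc
numbers = sc.parallelize([1, 2, 3, 4, 5])        # distribute a local list across the cluster
squares = numbers.map(lambda x: x * x)           # lazy transformation, nothing runs yet
print squares.collect()                          # action: brings the results back -> [1, 4, 9, 16, 25]
print squares.reduce(lambda a, b: a + b)         # action: sums the elements -> 55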
Spark SQL
This component provides query support. Through it, data can be accessed with SQL or with a domain-specific language that manipulates SchemaRDDs in Scala, Java, or Python. This component is very powerful because standard SQL queries and SchemaRDD operations can be mixed to express complex data manipulation processes.
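A hedged sketch of how that mixing looks from Python in Spark 1.2 (the people data and column names are made up for illustration, and sc is the SparkContext from the pyspark shell):

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

# build a SchemaRDD from an ordinary RDD of Row objects (sample data is hypothetical)
people = sc.parallelize([Row(name="alice", age=34), Row(name="bob", age=23)])
people_schema = sqlContext.inferSchema(people)
people_schema.registerTempTable("people")

# standard SQL ...
adults = sqlContext.sql("SELECT name FROM people WHERE age > 30")

# ... mixed with regular RDD operations on the result
print adults.map(lambda row: row.name).collect()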
Spark Streaming
This component ingests live data streams and processes them as a sequence of small batches, applying RDD transformations to each batch.
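A minimal sketch of a streaming word count in Python (the Python streaming API is available since Spark 1.2; the host, port, and 1-second batch interval are arbitrary choices for illustration):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)                      # 1-second micro-batches

# count words arriving on a TCP socket (e.g. fed by: nc -lk 9999)
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()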
MLlib Machine Learning Library
This component implements many common machine learning and statistical algorithms to simplify large scale machine learning pipelines.
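For instance, here is a small hedged sketch of clustering with MLlib's KMeans from the pyspark shell (the points and the choice of k=2 are made up for illustration; NumPy is required by MLlib's Python API):

from numpy import array
from pyspark.mllib.clustering import KMeans

# a tiny, made-up dataset with two obvious clusters
data = sc.parallelize([
    array([0.0, 0.0]), array([1.0, 1.0]),
    array([9.0, 8.0]), array([8.0, 9.0]),
])

model = KMeans.train(data, 2, maxIterations=10, runs=10, initializationMode="random")
print model.clusterCenters
print model.predict(array([0.5, 0.5]))   # index of the cluster the point falls into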
GraphX
This component provides an API for expressing graph computation. As of Spark 1.2 the GraphX API is available only in Scala, so it is not covered by the Python examples below.
Installation
> You can download and uncompress the latest version from the Spark website, or you can use a Chef recipe that will install it for you.
First Steps
We will assume that you have downloaded and installed spark-1.2.1 with support for Hadoop 2.4 in the /opt directory. Let's check what is inside:
$ cd /opt/spark-1.2.1-bin-hadoop2.4/
$ ls
bin  conf  data  ec2  examples  lib  LICENSE  NOTICE  python  README.md  RELEASE  sbin
Python Interpreter
You can use the bin/pyspark command to interact with the Python shell:
$ bin/pyspark
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.2.1
      /_/

Using Python version 2.7.3 (default, Mar 13 2014 23:03:55)
SparkContext available as sc
Let's load /opt/spark-1.2.1-bin-hadoop2.4/README.md and try some basic operations:
>>> text = sc.textFile("README.md")
>>> print text
README.md MappedRDD[1] at textFile at NativeMethodAccessorImpl.java:-2
>>> text.count()
98
>>> text.first()
u'# Apache Spark'
>>> lines_with_spark = text.filter(lambda x: 'Spark' in x)
>>> lines_with_spark.count()
19
>>> text.filter(lambda x: 'Spark' in x).count()
19
Find the number of words in the line with the most words:
>>> text.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
14
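If we want the line itself rather than its word count, one way (a sketch, not part of the original session) is to reduce over the lines directly:

>>> text.reduce(lambda a, b: a if len(a.split()) > len(b.split()) else b)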
Find the largest word count among the lines that have an odd number of words:
>>> text.map(lambda line: len(line.split()) if len(line.split()) % 2 != 0 else 0).reduce(lambda a, b: a if (a > b) else b)
13
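An alternative way to express the same computation (an illustrative sketch, equivalent here as long as at least one line has an odd word count) is to filter first and then take the maximum:

>>> text.filter(lambda line: len(line.split()) % 2 != 0).map(lambda line: len(line.split())).reduce(lambda a, b: a if a > b else b)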
We can create a Python file named process_file.py with the code below:
from operator import add
from pyspark import SparkContext

logFile = "/opt/spark-1.2.1-bin-hadoop2.4/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

# count the lines that have an odd number of words
odd_words_count = logData.filter(lambda line: len(line.split()) % 2 != 0).count()

# get the list of unique words together with their counts
unique_words = logData.flatMap(lambda line: line.split(' ')).map(lambda x: (x, 1)).reduceByKey(add).collect()

print "there are %s lines with odd number of words\n" % odd_words_count
print "there are %s unique words\n" % len(unique_words)
print "and the unique words are :"
for w in unique_words:
    print "%s %s times" % w
We can execute this program with the command below (local[4] runs Spark locally with four worker threads):
$ bin/spark-submit \
    --master local[4] \
    process_file.py
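As a final hedged variant (not part of the original script), if we only wanted the most frequent words instead of printing all of them, we could replace collect() in process_file.py with takeOrdered():

# top 10 most frequent words, sorted by descending count
top_words = logData.flatMap(lambda line: line.split(' ')) \
                   .map(lambda x: (x, 1)) \
                   .reduceByKey(add) \
                   .takeOrdered(10, key=lambda pair: -pair[1])
for word, count in top_words:
    print "%s appears %s times" % (word, count)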