Apache Spark Introduction

Apache Spark is an open-source cluster computing framework. It is designed for general-purpose big data processing with high performance, and it is well suited to machine learning algorithms. In this post we will cover basic aspects such as its components, installation, and the Python console.

Spark Components

Spark Core and Resilient Distributed Datasets

Spark Core contains the basic functionality of the project. It provides distributed task dispatching, scheduling, and basic I/O. Spark Core is also where the Resilient Distributed Datasets (RDD) API lives. This API hides the complexity inherent to distributed systems and exposes functionality through client APIs in Java, Python, and Scala. From the point of view of an application, the processed data does not look distributed across several nodes: it appears to live on a single machine and can be accessed much like any local file.
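
The abstraction can be pictured with a toy sketch in plain Python (this is not Spark code; `ToyRDD` is a made-up class for illustration): data lives in several partitions, but `map` and `collect` behave as if it were a single local list.

```python
# Toy sketch (plain Python, NOT Spark) of how an RDD-style API can hide
# partitioning from the caller.
class ToyRDD(object):
    def __init__(self, partitions):
        self.partitions = partitions  # one list per simulated node

    def map(self, f):
        # apply f inside each partition, keeping the partitioning intact
        return ToyRDD([[f(x) for x in part] for part in self.partitions])

    def collect(self):
        # concatenate the partitions so the caller sees one local list
        return [x for part in self.partitions for x in part]

rdd = ToyRDD([[1, 2], [3, 4]])  # data "distributed" over two partitions
print(rdd.map(lambda x: x * 10).collect())  # [10, 20, 30, 40]
```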

Spark SQL

This component provides query support: data can be accessed using SQL or a domain-specific language for manipulating SchemaRDDs in Scala, Java, or Python. It is very powerful because standard SQL queries and SchemaRDD operations can be mixed in order to build complex data manipulation processes.
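
To picture what "mixing SQL with programmatic manipulation" means, here is a rough analogy using the standard-library sqlite3 module (this is not Spark SQL; the table and data are made up): a SQL query produces rows that are then post-processed with ordinary Python code in the same program.

```python
import sqlite3

# Analogy only (sqlite3, NOT Spark SQL): run a SQL query, then keep
# manipulating the result programmatically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, n INTEGER)")
conn.executemany("INSERT INTO words VALUES (?, ?)",
                 [("spark", 2), ("data", 5)])
rows = conn.execute(
    "SELECT word, n FROM words WHERE n > 1 ORDER BY n DESC").fetchall()

# post-process the query result with plain Python
total = sum(n for _, n in rows)
print(total)  # 7
```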

Spark Streaming

This component ingests data in small batches (micro-batches) and performs RDD transformations on each batch.
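
The micro-batch idea can be sketched in plain Python (this is not Spark Streaming code; the batch size and data are illustrative): incoming records are grouped into small batches, and each batch is processed like a regular dataset.

```python
# Toy sketch (plain Python, NOT Spark Streaming) of micro-batching.
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch

# count the even numbers in each batch of an incoming "stream"
counts = [len([x for x in b if x % 2 == 0])
          for b in micro_batches(range(10), 4)]
print(counts)  # [2, 2, 1]
```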

MLlib Machine Learning Library

This component implements many common machine learning and statistical algorithms to simplify large-scale machine learning pipelines.


GraphX

This component provides an API for expressing graph computation.



Installation

You can download and uncompress the latest version from the Spark website, or you can use a Chef recipe that will install it for you.

First Steps

We will assume that you have downloaded and installed spark-1.2.1 with support for hadoop2.4 in the /opt directory. Let's check what is inside:

    $ cd /opt/spark-1.2.1-bin-hadoop2.4/
    $ ls
    bin  conf  data  ec2  examples  lib  LICENSE  NOTICE  python  README.md  RELEASE  sbin

Python Interpreter

You can use the following command to start the Python shell:

    $ bin/pyspark
    Python 2.7.3 (default, Mar 13 2014, 11:03:55) 
    [GCC 4.7.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Spark assembly has been built with Hive, including Datanucleus jars on classpath
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.2.1
          /_/

    Using Python version 2.7.3 (default, Mar 13 2014 23:03:55)
    SparkContext available as sc  

Let's open /opt/spark-1.2.1-bin-hadoop2.4/README.md and try some basic operations:

    >>> text = sc.textFile("README.md")
    >>> print text
    README.md MappedRDD[1] at textFile at NativeMethodAccessorImpl.java:-2
    >>> text.count()
    >>> text.first()
    u'# Apache Spark'

    >>> lines_with_spark = text.filter(lambda x: 'Spark' in x)
    >>> lines_with_spark.count()

    >>> text.filter(lambda x: 'Spark' in x).count()

Find the word count of the line with the most words:

    >>> text.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
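
The same map/reduce logic can be checked locally on a plain Python list (the sample lines below are made up; `functools.reduce` plays the role of `RDD.reduce`):

```python
from functools import reduce

# plain-Python version of the PySpark one-liner: map every line to its
# word count, then reduce the counts down to the maximum
lines = ["# Apache Spark",
         "Spark is a fast and general cluster computing system",
         ""]
word_counts = [len(line.split()) for line in lines]  # [3, 9, 0]
longest = reduce(lambda a, b: a if a > b else b, word_counts)
print(longest)  # 9
```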

Find the largest odd word count among the lines (lines with an even number of words are mapped to 0):

    >>> text.map(lambda line: len(line.split()) if len(line.split()) % 2 != 0 else 0).reduce(lambda a, b: a if (a > b) else b)
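
Again, the expression can be sanity-checked on a plain list (sample lines made up): even word counts map to 0, so the reduce returns the largest odd count.

```python
from functools import reduce

# plain-Python version of the "largest odd word count" expression
lines = ["one two three", "one two", "a b c d e"]
mapped = [len(l.split()) if len(l.split()) % 2 != 0 else 0 for l in lines]
biggest_odd = reduce(lambda a, b: a if a > b else b, mapped)
print(mapped)       # [3, 0, 5]
print(biggest_odd)  # 5
```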

We can create a python file named process_file.py with the code below:

    from operator import add
    from pyspark import SparkContext

    logFile = "/opt/spark-1.2.1-bin-hadoop2.4/README.md"  # Should be some file on your system
    sc = SparkContext("local", "Simple App")
    logData = sc.textFile(logFile).cache()

    # count lines with an odd number of words
    odd_words_count = logData.filter(lambda line: len(line.split()) % 2 != 0).count()

    # get (word, count) pairs for every unique word
    unique_words = logData.flatMap(lambda line: line.split(' ')).map(lambda x: (x, 1)).reduceByKey(add).collect()

    print "there are %s lines with an odd number of words\n" % odd_words_count
    print "there are %s unique words\n" % len(unique_words)
    print "and the unique words are :" 
    for w in unique_words:
        print "%s %s times" % w
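
What the `flatMap` / `map` / `reduceByKey(add)` chain computes can be mimicked with a plain dict (the sample text is made up): every word is paired with 1, and the 1s are summed per word.

```python
# plain-Python sketch of flatMap -> map -> reduceByKey(add)
text_lines = ["spark makes big data simple", "and spark is fast"]
pairs = [(w, 1) for line in text_lines for w in line.split(' ')]  # flatMap + map
counts = {}
for word, n in pairs:  # what reduceByKey(add) does per key
    counts[word] = counts.get(word, 0) + n
print(counts["spark"])  # 2
print(len(counts))      # 8 unique words
```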

We can execute this program with the lines below:

    $bin/spark-submit \
    --master local[4] \