Tuesday, June 23, 2015

Getting started with Spark, Eclipse, maven and Scala


There is good documentation at https://nosqlnocry.wordpress.com/2015/03/05/setup-eclipse-to-start-developing-in-spark-scala/. But in case you still need help:


  1. Use the Eclipse Juno release; Luna has some problems
  2. Install the Scala plugin from http://scala-ide.org/download/current.html
  3. Create a Maven build with the sample pom.xml below and you are good to go.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- Fill in your own project coordinates -->
  <groupId></groupId>
  <artifactId></artifactId>
  <version></version>
  <!-- Use jar packaging so the compiled classes actually get packaged -->
  <packaging>jar</packaging>

  <properties>
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <!-- Put the Scala version of the cluster; the binary version below
         must match the suffix of the spark-core artifact -->
    <scala.tools.version>2.11</scala.tools.version>
    <scala.version>2.11.0</scala.version>
  </properties>

  <!-- All artifacts below are available from Maven Central, so no extra
       plugin repository is needed -->

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.tools.version}</artifactId>
      <version>1.4.0</version>
    </dependency>
  </dependencies>

  <build>
    <resources>
      <resource>
        <directory>src</directory>
      </resource>
    </resources>

    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4.1</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

</project>
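
To check that the build works end to end, here is a minimal word-count application you can drop into the project. This is just a sketch, not part of the original setup: the WordCount object name, the input.txt path, and the local[*] master are placeholders to adapt.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal word count against the Spark 1.x RDD API.
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local[*]") // run in-process; remove when submitting to a cluster
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")  // placeholder input path
      .flatMap(_.split("\\s+"))            // split each line into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                  // sum the counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}

Build with "mvn package"; the assembly plugin produces a *-jar-with-dependencies.jar under target that you can run with spark-submit or directly from Eclipse.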

Wednesday, December 24, 2014

Getting started with eclipse, apache spark and python



OK, it took me a while to figure out how to configure Eclipse to run Python/Spark, so I am publishing the steps to save you a bit of time.

At the time of writing, 12/24/2014, I used the following versions:

  • Eclipse Juno
  • Spark 1.2.0 (with Hadoop 2.4)
  • Python 2.7 (anaconda)
  • Windows 

Steps:
  1. Set environment variables
    1. Path: it should include only the root and root\Scripts directories of your Python 2 installation
    2. PYTHONPATH: set it to the Python 2 root directory
    3. JAVA_HOME: set it to your Java root directory. I am using Oracle JDK 1.8
    4. SPARK_HOME: set it to your Spark root directory
    5. HADOOP_HOME: set it to your Hadoop directory
  2. On Windows, Hadoop does not ship with the needed common tools (winutils). Copy the files from https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip into your Hadoop bin directory. This step is optional, but without it you will see "Could not locate executable null\bin\winutils.exe in the Hadoop binaries" and you will not be able to save data
  3. Eclipse
    1. Install and configure PyDev; search online for instructions
    2. Install py4j for the Python version you are using. I use "pip install py4j"
    3. In PyDev -> Interpreters -> Python Interpreter
      1. add the python directory of your Spark installation, Spark's python lib directory, and the Scripts directory of your Python installation to the interpreter paths (screenshot omitted here; a verification sketch follows this list)
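
To confirm everything is wired up, a small PySpark script can be run from Eclipse as an ordinary Python run configuration. This is a minimal sketch, assuming the environment variables from step 1 are set; the app name and the toy job are arbitrary choices of mine.

import os
import sys

# If the Spark paths were not added through the PyDev dialog above, the same
# directory can also be appended programmatically (illustrative only):
spark_home = os.environ["SPARK_HOME"]
sys.path.append(os.path.join(spark_home, "python"))

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("EclipseSparkTest").setMaster("local[2]")
sc = SparkContext(conf=conf)

# Tiny job: count the even numbers in 0..99; expect 50.
count = sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()
print(count)
sc.stop()

If this runs but saving with saveAsTextFile fails, revisit the winutils.exe step above.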



And you are good to GO.