Tuesday, June 23, 2015

Getting started with Spark, Eclipse, maven and Scala


There is good documentation at https://nosqlnocry.wordpress.com/2015/03/05/setup-eclipse-to-start-developing-in-spark-scala/. But in case you still need help:


  1. Use the Eclipse Juno release; Luna has some problems
  2. Install the Scala plugin from http://scala-ide.org/download/current.html
  3. Create a Maven build with the sample pom.xml below and you are good to go.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- Fill in your own project coordinates -->
  <groupId></groupId>
  <artifactId></artifactId>
  <version></version>
  <!-- Use jar packaging so the compiled classes actually get packaged -->
  <packaging>jar</packaging>

  <properties>
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <!-- Put the Scala version of the cluster; the binary version below
         must match the suffix of the spark-core artifact -->
    <scala.tools.version>2.11</scala.tools.version>
    <scala.version>2.11.0</scala.version>
  </properties>

  <!-- All artifacts below are available from Maven Central, so no extra
       plugin repository is needed -->

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.tools.version}</artifactId>
      <version>1.4.0</version>
    </dependency>
  </dependencies>

  <build>
    <resources>
      <resource>
        <directory>src</directory>
      </resource>
    </resources>

    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4.1</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

</project>
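
To check that the build works end to end, here is a minimal word-count application you can drop into the project. This is just a sketch, not part of the original setup: the WordCount object name, the input.txt path, and the local[*] master are placeholders to adapt.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal word count against the Spark 1.x RDD API.
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local[*]") // run in-process; remove when submitting to a cluster
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")  // placeholder input path
      .flatMap(_.split("\\s+"))            // split each line into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                  // sum the counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}

Build with "mvn package"; the assembly plugin produces a *-jar-with-dependencies.jar under target that you can run with spark-submit or directly from Eclipse.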

Wednesday, December 24, 2014

Getting started with eclipse, apache spark and python



OK, it took me a while to figure out how to configure Eclipse to run Python/Spark, so I am publishing the steps to save you a bit of time.

At the time of writing, 12/24/2014, I used the following versions:

  • Eclipse Juno
  • Spark 1.2.0 (with Hadoop 2.4)
  • Python 2.7 (anaconda)
  • Windows 

Steps:
  1. Set environment variables
    1. Path: it should include only the root and root\Scripts directories of your Python 2 installation
    2. PYTHONPATH: set it to the Python 2 root directory
    3. JAVA_HOME: set it to your Java root directory. I am using Oracle JDK 1.8
    4. SPARK_HOME: set it to your Spark root directory
    5. HADOOP_HOME: set it to your Hadoop directory
  2. On Windows, Hadoop does not ship with the needed common tools (winutils). Copy the files from https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip into your Hadoop bin directory. This step is optional, but without it you will see "Could not locate executable null\bin\winutils.exe in the Hadoop binaries" and you will not be able to save data
  3. Eclipse
    1. Install and configure PyDev; search online for instructions
    2. Install py4j for the Python version you are using. I use "pip install py4j"
    3. In PyDev -> Interpreters -> Python Interpreter
      1. add the python directory of your Spark installation, Spark's python lib directory, and the Scripts directory of your Python installation to the interpreter paths (screenshot omitted here; a verification sketch follows this list)
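
To confirm everything is wired up, a small PySpark script can be run from Eclipse as an ordinary Python run configuration. This is a minimal sketch, assuming the environment variables from step 1 are set; the app name and the toy job are arbitrary choices of mine.

import os
import sys

# If the Spark paths were not added through the PyDev dialog above, the same
# directory can also be appended programmatically (illustrative only):
spark_home = os.environ["SPARK_HOME"]
sys.path.append(os.path.join(spark_home, "python"))

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("EclipseSparkTest").setMaster("local[2]")
sc = SparkContext(conf=conf)

# Tiny job: count the even numbers in 0..99; expect 50.
count = sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()
print(count)
sc.stop()

If this runs but saving with saveAsTextFile fails, revisit the winutils.exe step above.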



And you are good to GO.