Apache Spark in Java: A Simple Word Counter Program

Discover how to use Apache Spark in Java to create an efficient word counter program! From project setup to execution – explained step by step. Dive into the world of big data processing with this informative post!

Introduction to Apache Spark

Apache Spark is an open-source data processing framework that can perform analytical operations on big data in a distributed environment. It began in 2009 as an academic project by Matei Zaharia in the AMPLab at UC Berkeley. Spark was originally built on top of the Mesos cluster manager and was later extended to run distributed processing workloads in other cluster environments as well.

Example Project Setup

For demonstration purposes, Maven is used to create an example project. Run the following command in a directory that you want to use as a workspace:

 
mvn archetype:generate -DgroupId=com.journaldev.sparkdemo -DartifactId=JD-Spark-WordCount -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Adding Maven Dependencies

Once the project is created, add the appropriate Maven dependencies. Here is the relevant dependency section of the `pom.xml` file:

 

    
    
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>1.4.0</version>
    </dependency>
</dependencies>

Creating an Input File

Our word counter needs some text to process, so place an example input file named `input.txt` in the root directory of your project. Use the following text or your own:

 
Hello, my name is Max, and I am a writer at JournalDev. JournalDev is a great website to read great lessons about Java, Big Data, Python, and many other programming languages.

Big Data lessons are hard to find, but at JournalDev, you will find some excellent lessons on Big Data.
Feel free to use any text in this file.

Implementing the Word Counter

Now we are ready to write our program. The main logic resides in the `wordCount` method: it loads the input file, splits each line into words, maps every word to a count of one, and sums the counts per word. Here is the complete class:

 
package com.journaldev.sparkdemo;

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCounter {

    private static void wordCount(String fileName) {
        // Run locally; change the master URL when submitting to a cluster.
        SparkConf conf = new SparkConf().setMaster("local").setAppName("JD Word Counter");
        JavaSparkContext context = new JavaSparkContext(conf);
        // Split each line of the input file into words.
        JavaRDD<String> words = context.textFile(fileName)
                .flatMap(line -> Arrays.asList(line.split(" ")));
        // Map each word to (word, 1) and sum the counts per word.
        JavaPairRDD<String, Integer> countData = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        // Write the word counts into the CountData output directory.
        countData.saveAsTextFile("CountData");
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("No input file provided.");
            System.exit(0);
        }
        wordCount(args[0]);
    }
}
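
The transformation Spark distributes here — split lines into words, pair each word with a count of one, and sum per word — can be sketched locally with plain Java streams. This is only an illustration of the same map/reduce shape running in a single JVM; the class and method names below are not part of the project:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LocalWordCount {

    // Same shape as Spark's flatMap -> mapToPair -> reduceByKey, but in one JVM.
    static Map<String, Long> countWords(String text) {
        return Arrays.stream(text.split("\\s+"))   // "flatMap": split into words
                .collect(Collectors.groupingBy(    // "reduceByKey": group and count
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countWords("to be or not to be"));
    }
}
```

The difference is that Spark evaluates the same pipeline lazily across a cluster of machines, while this version holds everything in local memory.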

Running the Application

To run the application, go to the root directory of the project and run the following command:

 
mvn exec:java -Dexec.mainClass=com.journaldev.sparkdemo.WordCounter -Dexec.args="input.txt"

Conclusion

In this post, we have seen how to use Apache Spark in a Maven-based project to create a simple yet effective word counter program. For more information on big data tools and processing frameworks, check out our other posts.
