Distributed Thinking: A gentle introduction to distributed processing using Apache Storm and Apache Spark - Part 3
Welcome to Part 3 of Distributed Thinking! Here I discuss how to look at data in a distributed environment, using a more complex example: Word Count. In case you missed anything, click here for Part 0.
1. Distributed Thinking
1.3 How to look at data?
By now you must have a fair idea about how to look at and treat your data.
Let’s look at the Word Count problem in blocks:
The Connector and the Output remain the same, with the Connector reading out of a file line by line, and the Output printing the result onto the screen.
We have two Transforms:
- Transform 1 converts the data from a line into its component token words
- Transform 2 groups words into their respective counters
This is a perfect example of a system implemented in the MapReduce style:
- Map the data into small immutable chunks, in this case Lines 0 to (N-1)
- Map the data further into smaller immutable tokens, in this case word_1 to word_n
- Reduce the tokens into one in-memory table of word counts, which is the result
Let’s look at a simple single threaded Java implementation:
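A minimal sketch along these lines is shown below. The class and method names (WordCountSingleThreaded, mapLineToWords, filterNonEmpty) are illustrative, chosen to match the mapXYZ/filterXYZ/reduce naming convention described next; they are not the original source.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSingleThreaded {

    // Map: split one immutable line into its component token words.
    static String[] mapLineToWords(String line) {
        return line.trim().split("\\s+");
    }

    // Filter: drop empty tokens. No side effects; input treated as immutable.
    static boolean filterNonEmpty(String word) {
        return !word.isEmpty();
    }

    // Reduce: fold one word into the in-memory table of counts.
    static void reduce(Map<String, Integer> counts, String word) {
        counts.merge(word, 1, Integer::sum);
    }

    public static void main(String[] args) {
        // Stand-in for lines read from a file via the Connector.
        List<String> lines = List.of("to be or not to be", "to see or not to see");

        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {                    // Lines 0 to (N-1)
            for (String word : mapLineToWords(line)) { // word_1 to word_n
                if (filterNonEmpty(word)) {
                    reduce(counts, word);
                }
            }
        }
        System.out.println(counts); // the Output: print the count table
    }
}
```

Note that only reduce mutates state; the map and filter steps are pure functions over immutable inputs, which is exactly what makes them safe to distribute.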
Here I’ve labeled the transformations as mapXYZ methods for mapping, filterXYZ methods for filtering, and reduce for the reduction into the result. This format of code is very close to being ready for a distributed system: the mapping and filtering methods do not have any side effects, and the data passed to them is treated as immutable blocks.
Once all the steps are free of side effects, and the data is treated as immutable, we’re ready to distribute the processing on different machines. But before that, let’s look at the current block-diagram version of the Word Count program:
Now, to convert this to a multi-threaded application, the map portion needs to be executed in parallel. Let’s look at that block diagram:
The map method is executed in multiple threads. We need a Queue that delivers the data to the multiple threads with thread safety. The reduce method is drawn outside these n threads to indicate that it runs in one place, with all the data collected into one final step. In reality this might not be the case: the reduction might happen in multiple threads, with the data finally returned to the main thread. Let’s look at the code:
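A sketch of this multi-threaded version is below. The class name and thread count are illustrative, but the structure mirrors the description: a thread-safe queue feeding worker threads, and a synchronized reduce collecting all results into one table.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class WordCountMultiThreaded {

    private final Map<String, Integer> counts = new HashMap<>();

    // reduce is synchronized: all worker threads funnel into one table.
    synchronized void reduce(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    Map<String, Integer> run(List<String> lines, int nThreads) throws InterruptedException {
        // Thread-safe queue that delivers lines to the worker threads.
        BlockingQueue<String> dataQueue = new LinkedBlockingQueue<>(lines);

        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < nThreads; i++) {
            executor.submit(() -> {
                String line;
                // Each thread processes whatever data is left on the queue.
                while ((line = dataQueue.poll()) != null) {
                    for (String word : line.trim().split("\\s+")) { // map
                        if (!word.isEmpty()) {                      // filter
                            reduce(word);                           // reduce
                        }
                    }
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for lines read from a file.
        List<String> lines = List.of("to be or not to be", "to see or not to see");
        System.out.println(new WordCountMultiThreaded().run(lines, 4));
    }
}
```

The synchronized reduce is the simplest way to keep the shared table correct, at the cost of the threads contending on one lock.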
Here we have a main method which reads from the FileIterator and dumps all the data into the dataQueue. It then starts an ExecutorService with a few threads, which perform the transformations by reading from the dataQueue and write the final result into a Counter.
As the method progresses, all the data is read from the FileIterator and dumped into the queue, and each thread processes whatever data is left on the queue and writes it into the counter, whose reduce method is now made synchronized. There are much better ways to perform multi-threaded reduction, which we will discuss later when we look at implementing the system using Spark and Storm.
We now have all the building blocks to create a distributed system which can process your data concurrently on multiple threads. Next, we will look at solutions to the Word Count problem using Spark and Storm, and see how they help us solve some of the problems that a single-machine, multi-threaded solution poses.
Click on any of the links below to go directly to any part:
Part 0: Introduction
Part 1: When and Where
Part 2: How to look at data
Part 3: How to look at data (continued)
Part 4: Processing Data Streams
Part 5: Processing Large Data Chunks