Distributed Thinking: A gentle introduction to distributed processing using Apache Storm and Apache Spark - Part 3
Welcome to Part 3 of Distributed Thinking! Here I discuss how to look at data in a distributed environment, using a more complex example: Word Count. In case you missed anything, click here for Part 0.
1. Distributed Thinking
1.3 How to look at data?
By now you must have a fair idea about how to look at and treat your data.
Let’s look at the Word Count problem in blocks:
Here the Connector and the Output remain the same, with the Connector reading out of a file line by line, and the Output printing the result onto the screen.
We have two Transform methods:
- Transform 1 converts the data from a line into its component token words, word_1 to word_n
- Transform 2 groups the words into their respective counters
This is a perfect example of a system implemented in Map-Reduce:
- Map the data into small immutable chunks, in this case Lines 0 to (N-1)
- Map the data further into smaller immutable tokens, in this case word_1 to word_n
- Reduce the tokens into one in-memory table of word counts, which is the result
Let’s look at a simple single threaded Java implementation:
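A minimal sketch of what such a single-threaded version might look like — the class and method names (WordCount, mapLineToWords, filterNonEmpty) are illustrative, chosen to match the mapXYZ/filterXYZ/reduce naming described below:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCount {

    // Transform 1 (map): split a line into its component token words
    static String[] mapLineToWords(String line) {
        return line.trim().split("\\s+");
    }

    // Filter: drop empty tokens; no side effects
    static boolean filterNonEmpty(String word) {
        return !word.isEmpty();
    }

    // Transform 2 (reduce): fold one word into the running counts
    static void reduce(Map<String, Integer> counts, String word) {
        counts.merge(word, 1, Integer::sum);
    }

    public static void main(String[] args) throws IOException {
        // Connector: read the file line by line
        List<String> lines = Files.readAllLines(Paths.get(args[0]));

        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {                      // Lines 0 to (N-1)
            for (String word : mapLineToWords(line)) {   // word_1 to word_n
                if (filterNonEmpty(word)) {
                    reduce(counts, word);
                }
            }
        }

        // Output: print the result onto the screen
        counts.forEach((word, count) -> System.out.println(word + ": " + count));
    }
}
```

Note that the map and filter steps never touch shared state; only the final reduce step mutates the counts table.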
Here I’ve labeled the transformations as mapXYZ methods for mapping, filterXYZ methods for filtering, and reduce for reduction into the result. This format of code is very close to being ready for a distributed system: the mapping and filtering methods do not have any side effects, and the data passed to them is treated as immutable blocks.
Once all the steps are free of side effects, and the data is treated as immutable, we’re ready to distribute the processing on different machines. But before that, let’s look at the current block-diagram version of the Word Count program:
Now, to convert this to a multi-threaded application, the map portion needs to be executed in parallel. Let’s look at that block diagram:
The map method is executed in multiple threads, so we need a queue that delivers the data to the multiple threads with thread safety. The reduce method sits outside these n threads to indicate that it runs on one machine, where all the data is collected in one final step. In reality this might not be the case: the reduction might happen in multiple threads, with the data finally returned to the main thread. Let’s look at the code:
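A sketch of what this multi-threaded version might look like — the class name ConcurrentWordCount and the thread count of 4 are illustrative; the dataQueue, ExecutorService, and synchronized counter match the description below:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ConcurrentWordCount {

    // The counter: reduce is synchronized so multiple threads can update it safely
    static class Counter {
        private final Map<String, Integer> counts = new HashMap<>();

        synchronized void reduce(String word) {
            counts.merge(word, 1, Integer::sum);
        }

        synchronized Map<String, Integer> result() {
            return new HashMap<>(counts);
        }
    }

    public static void main(String[] args) throws Exception {
        // Connector: dump all lines onto a thread-safe queue
        BlockingQueue<String> dataQueue = new LinkedBlockingQueue<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            dataQueue.put(line);
        }

        Counter counter = new Counter();
        int nThreads = 4;
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < nThreads; i++) {
            executor.submit(() -> {
                String line;
                // Each thread processes whatever data is left on the queue
                while ((line = dataQueue.poll()) != null) {
                    for (String word : line.trim().split("\\s+")) {
                        if (!word.isEmpty()) {
                            counter.reduce(word);
                        }
                    }
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);

        // Output: print the final counts
        counter.result().forEach((w, c) -> System.out.println(w + ": " + c));
    }
}
```

Since every line is already on the queue before the workers start, a thread that finds the queue empty can simply exit; a streaming connector would need a poison-pill or timed poll instead.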
Here we have a main method which reads from the FileIterator and dumps all the data into the dataQueue. It then starts an ExecutorService with a few threads, which perform the transformations by reading from the dataQueue and write the final result into a counter.
As the method progresses, all the data is read from the FileIterator and dumped into the queue; each thread then processes whatever data is left on the queue and writes its results into the counter, whose reduce method is now made synchronized. There are much better ways to perform multi-threaded reduction, which we will discuss later when we look at implementing the system using Spark and Storm.
We now have all the building blocks to create a distributed system which can process your data concurrently on multiple threads. We will now look at solutions to the Word Count problem using Spark and Storm, and see how they help us solve some of the problems that a single-machine, multi-threaded solution poses.
Click here to go to the next post which discusses Processing Data Streams.
Click on any of the links below to go directly to any part:
Part 0: Introduction
Part 1: When and Where
Part 2: How to look at data
Part 3: How to look at data (continued)
Part 4: Processing Data Streams
Part 5: Processing Large Data Chunks