Companies want to increase revenue, profit, brand image, and customer acquisition rates;
reduce customer attrition and fraud; gain deeper insight into their customers; and so on.
Traditional BI analysis:
Multiple databases -> ETL – DW -> Analyze – Decide -> Act (show picture here)
Distributed parallel processing:
Shared-nothing approach. Processing is moved to the data nodes.
Hadoop is the standard for big data distributed parallel processing. Its main components are HDFS and
MapReduce.
HDFS is a distributed file system designed for processing large data sets on commodity servers with
linear scalability. HDFS has a master node (NameNode) that distributes the original data across the slave
nodes based on a set of rules. Each slave node gets a portion of the original data. On each slave
node there is a DataNode that manages the data on that node.
Each data block is replicated to multiple slave nodes to ensure high availability. By default, each
data block is replicated three times. Usually the primary data block and one copy are stored on
servers in the same rack. An additional copy is saved on a server in another rack, so
even if an entire rack goes down, the NameNode can reach that data block on the other rack.
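The placement rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HDFS code: the `place_replicas` function, the `topology` dictionary, and the rack and node names are all hypothetical.

```python
def place_replicas(topology, primary_node, replication=3):
    """Pick nodes for one block: primary, a same-rack neighbour, then other racks.

    topology maps a rack name to the list of node names in that rack
    (a hypothetical structure, not an HDFS API).
    """
    # Find the rack that holds the primary node.
    primary_rack = next(r for r, nodes in topology.items() if primary_node in nodes)
    chosen = [primary_node]

    # Second copy: another server in the same rack, if one exists.
    same_rack = [n for n in topology[primary_rack] if n != primary_node]
    if same_rack and len(chosen) < replication:
        chosen.append(same_rack[0])

    # Remaining copies: servers in other racks, so a rack failure is survivable.
    for rack, nodes in topology.items():
        if rack == primary_rack:
            continue
        for node in nodes:
            if len(chosen) < replication:
                chosen.append(node)
    return chosen


topology = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5"],
}
replicas = place_replicas(topology, "node1")
print(replicas)
```

With this toy topology, losing all of rack1 still leaves one copy of the block reachable on rack2.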
The NameNode manages the metadata, so it knows which data blocks belong to which files,
where each data block is located, what the capacity of each slave node is, how full each
of those nodes is, etc. Signals (heartbeats) are transmitted periodically between the NameNode and
the DataNodes, so the NameNode knows which DataNodes are working and which have failed.
When a DataNode fails, the NameNode removes that node from the Hadoop cluster and
assigns a new DataNode for the job.
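The failure detection described above could look roughly like the sketch below. It is a simplified illustration, not the NameNode implementation: the `HeartbeatMonitor` class, the 30-second timeout, and the node names are assumptions made up for the example, and time is passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each DataNode (illustrative only)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> timestamp of last heartbeat

    def heartbeat(self, node, now):
        # Record that `node` reported in at time `now`.
        self.last_seen[node] = now

    def failed_nodes(self, now):
        # A node is considered failed if it has been silent longer than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)  # timeout value is an assumption
monitor.heartbeat("datanode1", now=0)
monitor.heartbeat("datanode2", now=0)
monitor.heartbeat("datanode1", now=25)   # datanode2 stays silent
failed = monitor.failed_nodes(now=40)
print(failed)
```

Here only datanode2 is reported as failed, because datanode1 sent a second signal within the timeout window.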
Hadoop MapReduce works the same way. The master node (JobTracker) divides the job into
multiple tasks (map tasks) and distributes these tasks to multiple slave nodes (TaskTrackers).
This way the data is processed simultaneously on multiple nodes.
Data is shuffled between nodes after the map tasks are completed. To reduce bottlenecks in the shuffle
phase, a non-blocking switch network can be used between the nodes.
The goal is to keep data transfer between map and reduce nodes as low as possible. A
combine operation can also reduce these transfers.
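The map, combine, shuffle, and reduce phases can be illustrated with a word count run in a single Python process. This is a sketch of the idea only, not Hadoop's API: the function names and the two input chunks are made up, and each chunk stands in for the data held by one slave node.

```python
from collections import Counter, defaultdict


def map_task(chunk):
    # Map: emit a (word, 1) pair for every word in this node's chunk.
    return [(word, 1) for word in chunk.split()]


def combine(pairs):
    # Combine: pre-aggregate on the map node so fewer pairs are shuffled.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())


def shuffle(all_pairs):
    # Shuffle: group values by key across the outputs of all map nodes.
    grouped = defaultdict(list)
    for pairs in all_pairs:
        for word, n in pairs:
            grouped[word].append(n)
    return grouped


def reduce_task(word, values):
    # Reduce: sum the partial counts for one key.
    return word, sum(values)


chunks = ["big data big cluster", "data node data"]
mapped = [map_task(c) for c in chunks]
combined = [combine(p) for p in mapped]
result = dict(reduce_task(w, v) for w, v in shuffle(combined).items())

# Number of pairs that would cross the network with and without the combiner.
pairs_without_combine = sum(len(p) for p in mapped)
pairs_with_combine = sum(len(p) for p in combined)
print(result, pairs_without_combine, pairs_with_combine)
```

In this toy input the combiner cuts the shuffled pairs from 7 to 5; on real data with many repeated keys the saving is typically much larger.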
Intermediate results of a MapReduce operation are written to local files on the DataNodes.
This architecture has one weak spot: the JobTracker, which is a single point of failure.
Hence it is important to have as much redundancy as possible on this node.
MapReduce can be used to transform data in addition to doing analytics on that data.
YARN is an improvement on the MapReduce v1 framework. While in v1 the JobTracker handled both
cluster resource management and application control, YARN splits them in two. An
ApplicationMaster is created that can run on any slave node, while the ResourceManager still
manages resources as a central manager. A new ApplicationMaster is created for every job
to manage that job exclusively.
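The split of responsibilities can be sketched as two small Python classes. This is a toy model under stated assumptions, not YARN's API: the class and method names, the slot counts, and the "most free capacity first" allocation rule are all invented for illustration.

```python
class ResourceManager:
    """Central manager that only tracks free capacity per node (illustrative)."""

    def __init__(self, free_slots):
        self.free_slots = dict(free_slots)  # node name -> free container slots

    def allocate(self):
        # Grant one slot on the node with the most free capacity, or None if full.
        node = max(self.free_slots, key=self.free_slots.get)
        if self.free_slots[node] == 0:
            return None
        self.free_slots[node] -= 1
        return node


class ApplicationMaster:
    """Manages exactly one job; a new instance is created per job."""

    def __init__(self, job_name, resource_manager):
        self.job_name = job_name
        self.rm = resource_manager
        self.assignments = []

    def run(self, num_tasks):
        # Ask the central manager for one slot per task until capacity runs out.
        for task_id in range(num_tasks):
            node = self.rm.allocate()
            if node is None:
                break
            self.assignments.append((task_id, node))
        return self.assignments


rm = ResourceManager({"node1": 2, "node2": 1})
job_a = ApplicationMaster("job-a", rm).run(num_tasks=2)
job_b = ApplicationMaster("job-b", rm).run(num_tasks=2)
print(job_a, job_b)
```

The ResourceManager knows nothing about jobs and tasks, and each ApplicationMaster knows nothing about other jobs, which is the separation the paragraph above describes.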
Hadoop has many subprojects:
Sqoop – to load large amounts of data as batch jobs between HDFS and RDBMS
Flume – to import data streams such as web logs into HDFS
Avro – to serialize structured data.
HBase, Cassandra, Accumulo – NoSQL databases
Mahout – for machine learning and data mining
ZooKeeper – modules to implement coordination and synchronization services in a Hadoop cluster
Oozie – for workflow automation
Chukwa – monitors large Hadoop environments
Ambari – checks health status and resource consumption of nodes. Simplifies management of Hadoop environments.
Spark – in-memory processing of data. Offers a large speed advantage in machine learning, where data goes through multiple iterations.
In-memory platforms carry the risk of losing everything when power is lost, so memory contents are replicated to disk at regular intervals. Updates are logged to reduce the risk of data loss between replication snapshots.
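The snapshot-plus-log idea can be sketched with a small in-memory key-value store. This is a minimal illustration, not how any particular product does it: the `DurableStore` class and the file names are hypothetical, and a real system would also force writes to stable storage (for example with `os.fsync`) and handle partially written log lines.

```python
import json
import os
import tempfile


class DurableStore:
    """In-memory dict with periodic snapshots plus an append-only update log."""

    def __init__(self, directory):
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.log_path = os.path.join(directory, "updates.log")
        self.data = {}

    def put(self, key, value):
        # Append the update to the log file, then apply it in memory.
        with open(self.log_path, "a") as log:
            log.write(json.dumps([key, value]) + "\n")
        self.data[key] = value

    def snapshot(self):
        # Write the full memory contents to disk and start a fresh log.
        with open(self.snapshot_path, "w") as f:
            json.dump(self.data, f)
        open(self.log_path, "w").close()

    def recover(self):
        # Rebuild memory: load the last snapshot, then replay logged updates.
        self.data = {}
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path) as f:
                self.data = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    key, value = json.loads(line)
                    self.data[key] = value
        return self.data


workdir = tempfile.mkdtemp()
store = DurableStore(workdir)
store.put("a", 1)
store.snapshot()
store.put("b", 2)          # logged, but not yet in any snapshot

restarted = DurableStore(workdir)   # simulates a restart after power loss
recovered = restarted.recover()
print(recovered)
```

The update to "b" survives the simulated restart only because it was logged; the snapshot alone would have restored just "a".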
In-Memory Database (IMDB): the entire database is loaded into cluster memory. No disk I/O. Data can
be compressed 10 to 50 times.
In-Memory Datagrid (IMDG): helps reduce I/O.
Caching: cache frequently accessed tables. Reads will be from cache and writes go to disk.
Consider RAM disks, flash disks, or solid-state disks for extra disk speed.
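The caching behaviour listed above (reads served from cache, writes going to disk) can be sketched as follows. This is an illustrative toy, not a specific product's design: the `CachedTable` class is hypothetical, and a plain dictionary stands in for the on-disk table.

```python
class CachedTable:
    """Read-through, write-through cache in front of a slow backing store."""

    def __init__(self, backing_store):
        self.disk = backing_store   # stand-in for the on-disk table
        self.cache = {}
        self.disk_reads = 0

    def read(self, key):
        # Serve from cache when possible; fall back to disk on a miss.
        if key not in self.cache:
            self.disk_reads += 1
            self.cache[key] = self.disk[key]
        return self.cache[key]

    def write(self, key, value):
        # Writes go to disk; the cached copy is refreshed so reads stay consistent.
        self.disk[key] = value
        self.cache[key] = value


table = CachedTable({"customer:1": "Alice"})
first = table.read("customer:1")    # miss: one disk read
second = table.read("customer:1")   # hit: served from cache
table.write("customer:1", "Alicia")
third = table.read("customer:1")
print(first, second, third, table.disk_reads)
```

Three reads cost only one disk access here, and the write is visible both on disk and in the cache.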
NoSQL (Not Only SQL) databases: can help overcome the limitations of an RDBMS. They scale linearly
with more nodes and have no rigid schema. Data is replicated to several nodes for fault tolerance and
recovery. They implement a caching function to address high-speed access needs.
Key-value stores: use cases include internet click streams and search functions in email systems.
Document stores: where data is stored as documents. Key and document as value. Twitter
Columnar stores: used when structured data is very large (billions of rows), which causes all
rows to be scanned for various queries. Best for use cases where data is needed from large and
Graph databases: data is represented by vertices and edges. Traversal is efficient, which makes them good for
schedule optimization, GIS, web hyperlink structures, relationships in social networks, etc.
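To make the vertices-and-edges model concrete, the sketch below finds the shortest chain of relationships between two people with a breadth-first traversal. It is plain Python, not a graph database query: the `shortest_path` function and the friendship data are made up for illustration.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected graph given as (u, v) edge pairs."""
    # Build an adjacency list: vertex -> list of neighbouring vertices.
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)

    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection between start and goal


friendships = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "eve")]
path = shortest_path(friendships, "ann", "dan")
print(path)
```

Each step only follows edges from the current vertex, so the cost depends on the neighbourhood explored rather than on the total number of records.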
High-speed, real-time event processing.