A reference architecture lays out the building blocks from which solutions for specific use cases can be composed by combining a subset of those blocks.
Extract/Collect -> Clean, Transform -> Analyze/Visualize -> Decide/Act
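The stages above can be sketched as a chain of plain functions (a minimal illustration; the stage names map to the arrows, but the toy records and thresholds are assumptions):

```python
# Minimal sketch of the pipeline stages; records and logic are illustrative.

def extract():
    # Extract/Collect: gather raw records from a source (hard-coded here).
    return ["  alice,3 ", "bob,7", "  alice,5 "]

def clean_transform(raw):
    # Clean: strip whitespace. Transform: parse into (name, value) pairs.
    pairs = []
    for line in raw:
        name, value = line.strip().split(",")
        pairs.append((name, int(value)))
    return pairs

def analyze(pairs):
    # Analyze: aggregate values per name -- the "information" distilled
    # from the raw data.
    totals = {}
    for name, value in pairs:
        totals[name] = totals.get(name, 0) + value
    return totals

def decide(totals):
    # Decide/Act: act on the analysis, e.g. flag names above a threshold.
    return [name for name, total in totals.items() if total > 5]

print(decide(analyze(clean_transform(extract()))))  # → ['alice', 'bob']
```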
The main reason to consider Big Data is to generate high-quality information from large amounts of data. Companies often have too much data and do not know where to start: how many resources are needed, what to look for, how to analyze it, and which tools to use.
Work on the right use cases, share results with business teams, and get their feedback. You also
need to take care of security, privacy, compliance, liability, etc. There will be cultural
aspects as well, such as previous system owners giving up their systems.
IaaS (infrastructure), PaaS (Hadoop), SaaS (tools), DaaS (data science as a service). Find the needle in the haystack.
The way data is stored in NoSQL databases is usually governed by the use cases that will consume that data.
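For example, document stores are typically modeled around the read path: the page that needs the data dictates the document's shape. A hypothetical sketch (the collection layout, field names, and the "profile page" use case are assumptions, not tied to any particular product):

```python
# Sketch: query-driven NoSQL modeling. The use case "show a user's recent
# orders on one page" governs the layout, so the orders are denormalized
# into the user document and a single read serves the whole page.
user_doc = {
    "_id": "user:42",
    "name": "Alice",
    # Orders embedded directly in the user document -- the page that needs
    # them is the use case that dictated this shape.
    "recent_orders": [
        {"order_id": "o-1001", "total": 25.00},
        {"order_id": "o-1002", "total": 9.99},
    ],
}

def render_profile(doc):
    # One lookup returns everything the page needs -- no joins.
    return (doc["name"], len(doc["recent_orders"]))

print(render_profile(user_doc))  # → ('Alice', 2)
```

The trade-off is the usual one: reads are cheap because the write path pre-joins the data into the document.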
Hadoop is essentially a batch-oriented architecture. For real-time access, one needs Storm-,
Terracotta-, or Solr-type approaches.
Visualizing aggregates is an effective way of speeding up visualization. A first-glimpse
pattern pulls detail data only when needed – lazy loading.
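A minimal sketch of that lazy-loading pattern (the aggregate/detail split and the fake fetch function are assumptions for illustration):

```python
# Sketch: serve precomputed aggregates first; pull detail rows only on demand.

PRECOMPUTED = {"2023": 120, "2024": 340}   # cheap "first glimpse" aggregates

FETCHES = []  # records each expensive detail fetch, for illustration

def fetch_details(year):
    # Stand-in for an expensive query against the full data set.
    FETCHES.append(year)
    return ["%s-row-%d" % (year, i) for i in range(3)]

class LazyDetails:
    def __init__(self, year):
        self.year = year
        self._rows = None

    @property
    def rows(self):
        # Detail data is pulled only on first access (lazy loading),
        # then cached so repeated access costs nothing.
        if self._rows is None:
            self._rows = fetch_details(self.year)
        return self._rows

view = {year: LazyDetails(year) for year in PRECOMPUTED}
print(PRECOMPUTED)        # first glimpse: no detail fetch has happened yet
print(len(FETCHES))       # → 0
print(view["2024"].rows)  # drilling down triggers exactly one fetch
print(FETCHES)            # → ['2024']
```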
To speed up ingestion, Flume or similar agents can be used in a tree network pattern,
where data is ingested along parallel tracks from the data sources into HDFS.
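A toy sketch of those parallel ingestion tracks using threads (the sources and the sink are stand-ins; a real deployment would run Flume agents feeding HDFS):

```python
# Toy sketch: several ingestion tracks run in parallel and fan in to one
# sink, mimicking Flume agents feeding HDFS. Sources/sink are stand-ins.
import queue
import threading

sink = queue.Queue()  # stands in for the HDFS endpoint

def ingest_track(source_id, records):
    # One parallel track: read from its source and forward to the sink.
    for record in records:
        sink.put((source_id, record))

sources = {
    "web-logs": ["GET /", "GET /a"],
    "app-logs": ["login ok"],
    "sensors":  ["t=21.5", "t=21.7"],
}

threads = [threading.Thread(target=ingest_track, args=(sid, recs))
           for sid, recs in sources.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sink.qsize())  # → 5 records landed via three parallel tracks
```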
Deployment/configuration management – Chef or Puppet?
Hadoop monitoring tool – Nagios or Zenoss?
What is the minimum authorization support on a Hadoop system? Recent Hadoop versions have authentication
for HTTP web clients, Kerberos-based RPC authentication, access control for HDFS files,
delegation tokens, network encryption, etc.
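The usual knobs for those features live in core-site.xml and hdfs-site.xml; a fragment as a sketch (the values shown are the common Kerberos-enabled settings, not taken from a specific cluster):

```xml
<!-- core-site.xml: switch from "simple" auth to Kerberos and turn on
     service-level authorization checks. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml: encrypt the data-transfer protocol on the wire. -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```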
NFRs (non-functional requirements): reliability, operability, maintainability, availability, security, scalability.