Big data analytics – top frameworks to use (and 3 cases when you need to develop your own)

I love Big Data, and I cannot lie  . . .

It’s one of the things a savvy sales guy can use to add a few zeros to any proposal or mid-flight project change request.

If you have been reading any of the previous articles, you will know that I subscribe to the belief that a little bit of unexpected knowledge, deployed offensively at just the right time, will cause a rapid reduction and revision of the obscurely worded change request before you.

So, Big Data. Everyone knows what data is (I think), but what, in this context, does it have to be to become “Big”? The popular mechanism for determining this is the Six V’s of Big Data.

Let’s take a tour:-

Fun Fact: Previously, there were only three V's (Volume, Variety, and Velocity), but three more have since been added.

Volume: Big data will typically involve, you guessed it, a lot of data (WOW!) – often terabytes or even exabytes of the stuff. But not always.

Variety: Big data problems are often centered around the ability to deal with unstructured (or semi-structured) data, such as free-form text within a stream of tweets.

Velocity: This is one of my favorites. Big Data problems are often caused not by the volume of the data but by the speed of generation and the number of sources generating it.

Another Fun Fact: Often, with time-series data, an additional complication is the requirement to time stamp the data (Preferably at source), which increases the volume and presents the further difficulty of ensuring all of the acquisition nodes are suitably and accurately time-synchronized.
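As a rough illustration of stamping at source, here is a minimal Python sketch. The field names (`sensor`, `ts`, `temperature_c`, `humidity_pct`) and the `stamp_reading` helper are made up for this example, and it assumes the acquisition node's clock is already synchronized.

```python
import json
from datetime import datetime, timezone

def stamp_reading(sensor_id, values, now=None):
    """Attach a UTC acquisition timestamp to a reading at the point of capture."""
    # `now` is injectable for testing; in normal use the node's own clock
    # (assumed to be time-synchronized) is read here, not the database's.
    acquired = now or datetime.now(timezone.utc)
    return {"sensor": sensor_id, "ts": acquired.isoformat(), **values}

reading = stamp_reading("temp-01", {"temperature_c": 21.4, "humidity_pct": 48})
payload = json.dumps(reading)

# The same reading without a timestamp, to show the extra bytes per message.
bare = json.dumps({"sensor": "temp-01", "temperature_c": 21.4, "humidity_pct": 48})
overhead_bytes = len(payload) - len(bare)
```

The `overhead_bytes` figure is the per-message cost of the timestamp. Multiply it by every reading from every node and you have the volume increase mentioned above.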

Veracity: To what extent can the data you are processing be trusted? Trust comes in several forms: is the data time-stamped, is the sensor calibrated, is the data complete, or is it out of context?

Value: What is the business value of the collected data? Can we turn a BIG data problem into a non-big data problem by throwing away the junk within the collection pipeline?

Variability: How your collected and collated data set can be used and formatted to deliver business value.

As an example: here at Umbric, unsurprisingly, we like to tinker and play with stuff in our spare time. One of the things we play with is IoT and home automation. We have a few rather old Raspberry Pi machines running some fairly standard open-source software frameworks. Scattered around the place we have:-

  • Ten temperature sensors – temperature, humidity, and battery state every 10 seconds from all of them (BLE beacon packets, captured, decoded, and forwarded over MQTT by several BLE hubs running on Pi Zeros – de-duplicated on the hub node)
  • Five electrical submeters – each generating readings for power, energy, voltage, phase, and frequency every 6 seconds (Read via MBUS, again, each forwarded over MQTT)
  • Many PIR sensors, linked to controls for floodlights and alarm systems
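The de-duplication step for the BLE beacons can be sketched in a few lines of Python. This is not our actual flow. It assumes each beacon packet carries a sensor ID and a sequence counter, and the `Deduplicator` class and hub names are invented for the example.

```python
from collections import OrderedDict

class Deduplicator:
    """Drop repeat copies of a packet that arrive via more than one BLE hub."""

    def __init__(self, max_keys=1000):
        self._seen = OrderedDict()   # insertion-ordered, so the oldest key is first
        self._max_keys = max_keys

    def is_new(self, sensor_id, sequence):
        key = (sensor_id, sequence)
        if key in self._seen:
            return False
        self._seen[key] = True
        if len(self._seen) > self._max_keys:
            self._seen.popitem(last=False)   # forget the oldest key to bound memory
        return True

dedup = Deduplicator()
# (hub, sensor_id, sequence): the same beacon can be heard by two hubs
incoming = [
    ("hub-a", "temp-01", 41),
    ("hub-b", "temp-01", 41),   # duplicate of the packet above
    ("hub-a", "temp-02", 7),
    ("hub-b", "temp-01", 42),
]
unique = [(sensor, seq) for _hub, sensor, seq in incoming if dedup.is_new(sensor, seq)]
```

Bounding the memory of seen keys matters on a small machine: the hub only needs to remember packets for as long as a duplicate could plausibly still arrive.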

The Hub node for all of this is a rather elderly Raspberry Pi 3b. It is running:-

  • An MQTT broker
  • Node-RED: Data mediation
  • Two databases: InfluxDB & MySQL
  • Grafana: Dashboard engine
  • (A Tor HTTP proxy, the least said about that the better . . . )
  • Pi-hole (ad-blocking DNS sinkhole)

Alongside the minimum broker load of seven events per second, it is also storing approximately five years of the same data in the two databases and replicating this to a small NAS. All in all, about 18 million records a month and 220 million records a year. The venerable Pi is not even trying hard.
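For the curious, the arithmetic behind those figures looks roughly like this in Python. It assumes every individual reading (temperature, humidity, battery, power, and so on) is published as its own event, and it leaves out the PIR sensors, which only fire when something moves. That is why seven per second is a minimum.

```python
# Assumption: each individual reading is published as its own event.
temp_rate = 10 * 3 / 10    # 10 sensors x 3 readings, every 10 seconds
meter_rate = 5 * 5 / 6     # 5 submeters x 5 readings, every 6 seconds
events_per_second = temp_rate + meter_rate

seconds_per_day = 24 * 60 * 60
per_month = events_per_second * seconds_per_day * 30
per_year = events_per_second * seconds_per_day * 365

print(f"{events_per_second:.2f} events/s")
print(f"{per_month / 1e6:.1f} million records a month")
print(f"{per_year / 1e6:.0f} million records a year")
```

Under those assumptions the rate comes out a shade over seven events per second, which lands in the same ballpark as the rounded monthly and yearly totals quoted above.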

Let’s do a 6V assessment:-

  • Volume: 0.25 GB a year, current total 1.1 GB – sizeable, but not huge.
  • Variety: Well, the data is coming from several sources, but we are normalizing at the edge into a standard JSON format – So the variety beyond the edge is LOW
  • Velocity: Pretty high (Compared to the size of the available computing resource)
  • Veracity: Pretty high (All the sensors are MID certified and of the industrial class. Data mediators are NTP synced against a robust time source, and the time stamp (Rather than commit time) is used as a record index)
  • Value: Fairly low – I’d be miffed but not devastated if I lost this data or the facility.
  • Variability: Pretty high – I have some beautiful dashboards of the building performance, and automation set up to warn me if I leave the aircon or lights on overnight.

On a big data index, this rates (roughly) a 3.5, with only one of the primary three V's (Velocity) rated high. This data is NOT BIG. Of course, it’s not an enterprise configuration, nor remarkably resilient, BUT the example is clear.
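There is no official formula for that index, but one way to arrive at the number is to score each V as low, middling, or high and add them up. The scale below (0, 0.5, 1) is purely an assumption made for this sketch, not an industry standard.

```python
# Hypothetical scoring scale: one reading of the ratings above, not a standard.
SCORE = {"low": 0.0, "medium": 0.5, "high": 1.0}

assessment = {
    "Volume": "medium",       # "sizeable, but not huge"
    "Variety": "low",
    "Velocity": "high",
    "Veracity": "high",
    "Value": "low",           # "fairly low"
    "Variability": "high",
}
PRIMARY = ("Volume", "Variety", "Velocity")   # the original three V's

index = sum(SCORE[rating] for rating in assessment.values())
primary_hits = [v for v in PRIMARY if assessment[v] == "high"]

print(index)          # total across all six V's
print(primary_hits)   # which of the original three rate high
```

The useful part is less the total and more the `primary_hits` list: a data set that rates high on only one of the original three V's is unlikely to need heavyweight tooling.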

There are many more resilient and capable enterprise frameworks for managing data that is Truly Big:

Apache Hadoop: One of the most well-known and longest-lived popular frameworks, revolutionary when it first emerged and not to be disregarded now.

Under the hood, Hadoop has three primary building blocks.

  • HDFS: The Hadoop Distributed File System
  • MapReduce: A system of processing large data volumes within a clustered environment – a product in its own right
  • YARN: The resource management core

Hadoop is excellent for customer analytics, enterprise projects, and the creation of data lakes.

Apache MapReduce: The processing “brains” of the Hadoop environment. It originally handled cluster resource management as well, until that job was split out into YARN. It is a highly efficient framework for processing data in automatically allocated parallel chunks.
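The map, shuffle, and reduce idea is easier to see than to describe. The plain-Python sketch below is not the Hadoop API; the function names and the sample temperature readings are invented for illustration, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict

def map_chunk(chunk):
    """Map step: emit (key, value) pairs for one chunk of records."""
    return [(sensor, temp) for sensor, temp in chunk]

def shuffle(mapped_chunks):
    """Shuffle step: group every emitted value under its key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_group(values):
    """Reduce step: collapse one key's values into a single result."""
    return max(values)

# Three chunks of (sensor, temperature) records, as if split across nodes.
chunks = [
    [("kitchen", 21.5), ("office", 19.0)],
    [("kitchen", 23.1), ("office", 18.4)],
    [("kitchen", 22.0)],
]
mapped = [map_chunk(c) for c in chunks]   # a cluster would run these in parallel
result = {key: reduce_group(vals) for key, vals in shuffle(mapped).items()}
```

Because each chunk is mapped independently, the work can be farmed out to as many nodes as there are chunks. Only the shuffle needs to bring related values back together.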

Apache Spark: An open-source framework, more advanced than Hadoop, with in-memory data processing being its primary advantage. Spark supports four principal languages: Scala, Java, Python, and R.

Apache Hive: Originally developed at Facebook, Hive is a data analytics framework that takes SQL-like queries and translates them into a chain of MapReduce tasks.

Apache Storm: A framework for stream processing of substantial real-time data streams.
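Storm describes a stream as sources ("spouts") feeding processing steps ("bolts"). The sketch below borrows those two words but is plain Python generators, not the Storm API, and the readings are made up. It shows the core idea of stream processing: handle each event as it arrives while keeping only a small, bounded amount of state.

```python
from collections import deque

def spout(readings):
    """Source of the stream: yields one event at a time."""
    for reading in readings:
        yield reading

def rolling_mean_bolt(stream, window=3):
    """Processing step: for each event, emit the mean of the last `window` values."""
    recent = deque(maxlen=window)   # older values fall off automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)

means = list(rolling_mean_bolt(spout([20.0, 22.0, 24.0, 26.0]), window=3))
```

Nothing here ever holds the whole data set in memory, which is the property that lets a real stream processor keep up with data that never stops arriving.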

We could go on, and the attentive will notice that the Apache Software Foundation has an authoritative presence in this space. There are different tools for different jobs, and the distinctions between them may seem esoteric but are essential. Choosing the proper framework for each application is a job where it pays to have skilled advice.

Umbric has extensive skills and experience in assessing your big-data requirements and choosing the correct framework to offer the best service for your needs. Grab a quick call with us, and we’ll be happy to get the ball rolling on your next project without all the headaches!



About Umbric Data Services

Forget knowledge; data is power – especially when hooked up to custom web applications leveraging the latest in big data, machine learning, and AI to deliver profitable results for business.

Umbric Data Services combines the latest in tech with good old-fashioned customer service to deliver innovative, efficient software that drives productive business insight and revenues at the click of a button. Isn’t it time you worked smart, not hard? Find out more about how we help businesses to grow – visit today.
