Credit : The Simplilearn Edge: Big Data and Analytics (2018)

Working with Big Data — Scaling Data Discovery

Big Data, as its name implies, involves enormous amounts of data. The term “Big Data” first appeared in academic publications in the 1990s. By 2008 it was in wide use, and its use has continued to grow with the spread of cloud infrastructure, machine learning and artificial intelligence. [1]

Today there is a “rush to compute unlike anything we’ve ever seen before”, wrote Matt Day[2]. Big Data is increasingly sought after to detect hidden patterns, since the presence of patterns in data indicates the possibility of prediction and discovery. Once detected, the patterns can be used to predict, for example, when a customer is ready to make a purchase, which sample is likely to yield a novelty such as a new gene or a new drug, or when an aircraft jet engine needs servicing.

Predictions in healthcare are helping to discover new drugs and assess their effectiveness prior to release, using in silico (math-based virtual) evaluation. The US Food and Drug Administration announced on July 7, 2017 that it is scaling Big Data analytics using high-performance computing (HPC) to make drug development and testing more effective.[3]

In business, according to McKinsey, retailers who leverage the full power of Big Data can improve their operating margins by as much as 60%.[4]

A poll conducted in April 2017 by Big Data Zone[5] listed several real-world problems solved with Big Data across sectors ranging from retail, healthcare, media, and telecommunications to finance, government, IT, and fleet management. The survey involved executives from 22 companies who are working with Big Data or providing Big Data solutions to their clients.

Big Data — Opportunity for discovery

The patterns identified in Big Data through in silico evaluation (math-based virtual screening) are helping to discover new genes in crops. Some of the genes discovered in recent years had been searched for in vain in the past. Applied to Big Data, in silico evaluation not only helped tap its potential but also shortened the time to discovery, which is crucial for business as well as for research and development to keep pace with rapid global changes, including climate change.

The 2016 OECD[6] report considered Big Data a driving force for knowledge acquisition and value creation, fostering research and innovation with the potential to transform most if not all sectors[7]. The report referred to Big Data as the new research and development (R&D) for 21st-century innovation systems, highlighting Big Data and its analytics as fundamental inputs to innovation, akin to R&D. Based on the available evidence, the report found that companies using data-driven innovation (DDI) have raised productivity approximately 5–10% faster than non-users.

Big Data is being used across sectors to speed up discovery and enhance creativity.

Big Data’s global market is expected to grow at an average of 25% per year through 2020[8]. Industrial companies such as GE (General Electric) and Siemens are turning to Big Data and promoting their respective corporations as Big Data firms.[9]

In the banking industry, banks are shifting towards new computer applications oriented around customer experience using Big Data and machine learning, moving away from exclusively defensive applications involving security and risk. BMO (Bank of Montreal)’s recent shift has saved over $100 million through data re-use and data warehouse rationalization, according to a recent article by Tom Davenport & Randy Bean (2017). The bank has also established a data science platform including analytics sandboxes, open-source software for machine learning, and software for robotic process automation (RPA). Overall, the bank has already achieved several times more value in additional revenues than it saved through data rationalization.[10]

According to Gartner, in 2017 the world’s most successful companies were using technology to scale and outcompete traditional organisations[11]. Big Data moved from the “Peak of Inflated Expectations” in 2012 through the “Trough of Disillusionment” in 2014 and is now entering the “Plateau of Productivity”. As anticipated in 2012, Big Data was then 2 to 5 years away from the plateau of productivity it has reached today.

Figure 1: The Gartner Hype Cycle consists of five key phases of a technology’s life cycle: 1) Innovation Trigger, 2) Peak of Inflated Expectations, 3) Trough of Disillusionment (a stage of refocus), 4) Slope of Enlightenment and 5) Plateau of Productivity.

Gartner’s Hype Cycle graph, above, helps track the maturity and adoption of different technologies from their early proof-of-concept stages to their application and adoption in production. The trend, according to Gartner, is that the core areas of the Nexus of Forces (cloud, mobile, social, and information) are rapidly moving toward the Plateau of Productivity, wrote Hank Barnes.[12]

Big Data today is also empowering machine learning and artificial intelligence, which are becoming mainstream and transforming entire sectors of the economy. Big Data helps fine-tune ML and AI algorithms, which require massive amounts of already-labelled data to recognize patterns and make accurate predictions.

Figure 2: Deep Learning’s performance is highly dependent on the amount of data: the more data it is fed, the better it gets (Nicola Jones, Nature 2014)[13].

To support this process of empowering machine learning and artificial intelligence, organisations are releasing large datasets to feed and fine-tune ML and AI algorithms. The National Institutes of Health Clinical Center is providing one of the largest publicly available chest X-ray datasets (NIH, September 27, 2017). Google has released other high-quality datasets, Open Images and YouTube-8M, which provide millions of annotated links for researchers to train their ML/AI systems.[14] The Open Images set is a collaborative initiative involving Google, Carnegie Mellon and Cornell, with 9 million entries tagged first by computers.

Big Data and the Internet of Things (IoT)

Big Data is growing rapidly as a result of the myriad of devices and the Internet of Things (IoT) generating almost 2.5 exabytes (EB), or 2.5 quintillion bytes, of data every day, which is the equivalent of 5,000,000 laptops in terms of storage capacity. A quintillion is a million (10^6) raised to the third power, i.e. (10^6)^3 or 10^18. According to Gartner Inc., there will be even more data with the expansion of IoT, as the number of connected devices is expected to reach 20.8 billion by the year 2020.[15]
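The arithmetic behind these figures is easy to check. The sketch below assumes a laptop holds roughly 500 GB of storage, a figure that is an assumption and not stated in the text:

```python
# Back-of-envelope check of the daily data-volume figures cited above.
# Assumption: a "laptop" here means roughly 500 GB (5e11 bytes) of storage.
daily_bytes = 2.5 * 10**18     # 2.5 exabytes = 2.5 quintillion bytes per day
laptop_bytes = 500 * 10**9     # 500 GB per laptop (assumed)

laptops = daily_bytes / laptop_bytes
print(int(laptops))            # 5000000 laptops' worth of data per day
```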

As a result of IoT and other devices, Big Data is rapidly growing not only in volume and variety but also in velocity. There is also growing public interest in Big Data. Gartner[16] foresees that citizen data scientists will foster greater depth of business analytics. Newspapers and media outlets now have sections on data visualization, such as the “Information is Beautiful” initiative at The Guardian, conceived and designed by David McCandless[17]. Meanwhile, because of the cloud skills gap, large enterprises across the world are losing more than $250 million yearly.[18]

What is Big Data?

Big Data is characterised by its volume, variety and velocity. It is generated at all times through a myriad of devices and every digital process, including people’s interaction with those devices (e.g. social media).

Most of the Big Data generated daily is raw and unstructured, requiring laborious work to put it into structured, usable formats such as tables with rows and columns. To extract insights from this complex data, Big Data projects often rely on cutting-edge analytics involving machine learning[19]. There is a “rush to compute unlike anything we’ve ever seen before”, wrote Matt Day, technology reporter of the Seattle Times.[20]

Gartner defines Big Data as “high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.”[21]

Big Data’s definition is not yet fully established. Steve Olenski (2015) referred to the work of Ward and Barker, who reviewed early definitions and suggested that: “Big Data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce, and machine learning.”[22]

NoSQL, meaning “Not only SQL”, refers to a group of database management systems in which data is not stored in the table format of a relational database (Table 1). There is also NewSQL, which aims to provide the same scalable performance as NoSQL systems while maintaining the guarantees of a traditional database system.
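To make the contrast concrete, the sketch below stores the same climate record both ways: as a relational row queried with SQL (using Python’s built-in sqlite3) and as a schema-free JSON document of the kind a NoSQL document store would hold. The table, field names, and temperature value are illustrative, not drawn from the article:

```python
import json
import sqlite3

# Relational style: a fixed schema of rows and columns, queried with SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE climate (city TEXT, month TEXT, avg_temp_c REAL)")
db.execute("INSERT INTO climate VALUES ('Montreal', 'January', -9.7)")
row = db.execute(
    "SELECT avg_temp_c FROM climate WHERE city = 'Montreal'"
).fetchone()

# Document ("NoSQL") style: schema-free JSON; nested fields can be added
# per record without redesigning a table.
doc = {"city": "Montreal", "climate": {"January": {"avg_temp_c": -9.7}}}

print(row[0])                                   # -9.7
print(json.dumps(doc["climate"]["January"]))    # {"avg_temp_c": -9.7}
```

The trade-off shown here is the core of the NoSQL/NewSQL debate: the relational table enforces structure and supports SQL guarantees, while the document stores whatever shape each record needs.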

Table 1: Example of a table with rows (month) and columns (climate)

Montreal monthly average temperature
Figure 3 : Type of relational database — Linking locations with their climate data

Scaling Big Data processing

Big Data consists of datasets too large to fit on a single computer or device, so distributed storage has become fundamental: the enormous datasets are stored across a large number of networked computer hard drives. Such a distributed storage system allows scalability, as more drives can be added to the network as Big Data continues to grow in size[23].

Along with the increase in storage capacity, devices’ processing power has also increased dramatically, including that of mobile phones. The Apple-designed 64-bit A10 chip, based on the ARM architecture, has more than a billion transistors and a multicore processor that can execute approximately 3.5 billion instructions per second. A mobile phone CPU running at speeds of more than 1 GHz can execute millions of calculations each second. Compared with the instruction-processing rate of the computers used to guide spacecraft to the moon 45 years ago, today’s mobile CPU is tens of thousands of times faster[24], [25].
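As a rough sanity check of the “tens of thousands of times faster” claim, the snippet below divides the chip’s cited instruction rate by a commonly cited throughput for the Apollo Guidance Computer (about 85,000 instructions per second). The AGC figure is an assumption, not stated in the text:

```python
# Rough ratio behind the "tens of thousands of times faster" claim.
modern_ips = 3.5e9   # ~3.5 billion instructions/second (from the text)
agc_ips = 85_000     # assumed Apollo Guidance Computer throughput

ratio = modern_ips / agc_ips
print(round(ratio))  # on the order of 40,000x
```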

Today’s smartphones have the processing potential of a vintage supercomputer[26]. The original Apollo 11 guidance computer (AGC) source code can easily be stored in 1 GB of RAM on these phones. The code for the Command Module (Comanche055) and the Lunar Module (Luminary099) has been made available at —[27], [28]

To capitalise on the increase in CPU and GPU processing power, new tools are also emerging to scale Big Data processing either vertically or horizontally. Hadoop, which grew out of a web search engine project started in 2002, has emerged as an approach for Big Data processing accommodating both structured and unstructured data. It has become an ecosystem and framework of open-source tools, libraries and methodologies for Big Data processing and analysis. Its core is maintained by the Apache Software Foundation.[29]

Hadoop has evolved from a single MapReduce methodology to embrace the emerging data lake concept, a paradigm shift, and it continues to evolve as data volumes increase and new processing models become available. It is designed to run on commodity hardware and in the cloud, and to scale tasks from a single server to thousands of machines and tens of thousands of processor cores. Hadoop applications are not limited to any single programming language.
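The MapReduce model at Hadoop’s core can be sketched in a few lines: a map step emits (key, value) pairs, a shuffle groups them by key, and a reduce step aggregates each group. The word-count example below is the classic illustration, run here on a single machine rather than distributed across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values (here, sum the counts).
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big discovery", "scaling big data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # 3 2
```

On a real Hadoop cluster, the map and reduce functions are shipped to the nodes that hold the data, and the shuffle happens over the network.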

Scaling Big Data Analytics

Scaling analytics has also been a challenge for Big Data, leading to the emergence of another framework, known as Apache Spark. Like Hadoop, Spark runs in a distributed processing style: a driver process, performing in-memory core processing, partitions a Spark application into tasks that are distributed across many executor processes. This allows processing to be scaled up and down as the application requires.

The Spark platform overlaps with the Hadoop platform and addresses two of Hadoop’s limitations: speed, by performing computation in memory, and fault tolerance, through the Resilient Distributed Dataset (RDD) approach. The RDD concept allows Spark to process only the partitions required at a given time in a computing cluster, enabling scalable parallel processing.
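A toy sketch of the RDD idea, not Spark’s actual API: data is split into partitions, a transformation runs per partition (in parallel on a real cluster), and a lost partition is recomputed from its lineage, i.e. the source data plus the transformation, rather than restored from a replica:

```python
def partition(data, n):
    # Split a list into n roughly equal chunks ("partitions").
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def transform(chunk):
    # The transformation recorded in the lineage (here: square each element).
    return [x * x for x in chunk]

source = list(range(10))
parts = partition(source, 3)

# Apply the transformation per partition (executors would do this in parallel).
results = [transform(p) for p in parts]

# Fault tolerance, RDD-style: if partition 1 is lost, recompute it from its
# lineage (source partition + transformation) instead of reading a replica.
results[1] = None
results[1] = transform(parts[1])

total = sum(sum(p) for p in results)
print(total)  # 285, the sum of squares 0..9
```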

Because Spark can scale and speed up processing, it can also harness the R programming language’s analytical capabilities, such as R machine learning libraries that go far beyond what is available natively in Spark or other Big Data systems. The R language and the Spark platform can share code and data and carry out powerful large-scale machine learning.[30]

Figure 4: Big Data working platform for scaling data processing and data analytics (In-memory analytics platform).

This article is a chapter of the book titled “Working with Big Data — Scaling Data Discovery”.

Book available at

In Chapters 5 and 6 of this book you will learn more about Spark and R programming: how to install and set up a local R/Spark session and then connect to Big Data with the R language. Once the connection is established, you can apply different machine-learning algorithms to Big Data. Other possibilities involve querying, merging and aggregating Big Data stored in Spark, which you can then move into the R environment for in-depth analysis and prediction.

[1] Tom Boellstorff (2013) Making big data, in theory. First Monday 18(10)

[2] Amazon cloud unit’s grand plan: data centers in every major country worldwide. By Matt Day , Seattle Times technology reporter. Seattle Times (2017)

[3] Avanade’s survey (2017) —

[4] Scott Gottlieb, M.D. (July 7, 2017) How FDA Plans to Help Consumers Capitalize on Advances in Science. FDA Voice

[5] Tom Smith (2017) Executive Insights on the State of Big Data. Big Data Zone

[6] The Organisation for Economic Cooperation and Development (2016)

[7] The Organisation for Economic Cooperation and Development (2016)

[8] Frost & Sullivan (2014).

[9] The Economist (May 6th, 2017)

[10] Tom Davenport & Randy Bean (2017) Setting The Table For Data Science And AI At Bank Of Montreal, Forbes.

[11] Gartner, Inc — 2017 Hype Cycles Highlight Enterprise and Ecosystem Digital Disruptions

[12] Hank Barnes (2017) Digital Disruption Demands Demystification (Hype Cycle Season).

[13] Nicola Jones (09 January 2014) Computer science: The learning machines. Nature 505, 146–148

[14] Richard Lawler (2016) Google releases massive visual databases for machine learning. Engadget —

[15] Harriet Green (2017) You May Be Surprised About What The IoT Connects You With. Forbes

[16] Gartner

[17] The Guardian News — Data-

[18] Enterprise Innovation editors (2017-10-15)

[19] IBM —

[20] Amazon cloud unit’s grand plan: data centers in every major country worldwide. By Matt Day , Seattle Times technology reporter. Seattle Times (2017)

[21] Steve Olenski (2015). Big Data Solving Big Problems. Forbes

[22] Steve Olenski (2015). Big Data Solving Big Problems. Forbes

[23] Bernard Marr (2015) Spark Or Hadoop: Which Is The Best Big Data Framework? DataCentral

[24] Tibi Puiu (Sep 10, 2017) Your smartphone is millions of times more powerful than all of NASA’s combined computing in 1969. ZME Science newsletter

[25] Nick T. (14 Jun 2014) A modern smartphone or a vintage supercomputer: which is more powerful?

[26] Nick T. (14 Jun 2014) A modern smartphone or a vintage supercomputer: which is more powerful?

[27] Jay Bennett (Jul 11, 2016) The (Surprisingly Funny) Code for the Apollo Moon Landings Is Now on GitHub.

[28] Keith Collins (July 09, 2016) The code that took America to the moon was just published to GitHub, and it’s like a 1960s time capsule

[29] Douglas Eadline (2016) Hadoop V2.

[30] The Hidden Biases in Big Data Kate Crawford — APRIL 01, 2013

OperAI develops IoT devices with embedded math and AI solutions to speed up and streamline operational processes at the edges of the cloud.