After gazing into my magic crystal ball for the first two months of 2016, I can now confidently (with 63.4% ± 42.3657 certainty) predict what’s going to be hot in 2016 in the world of data. Since it’s unlikely that anyone else in the entire tech community is writing an article like this, I feel compelled to share my prescient insights with you so that you won’t be surprised by what’s coming the rest of this year. So, without further ado (cue magician’s background music) my predictions for the next 12 months.
1. Big Data Strategy Beyond Hadoop. After years of rapid technology-focused adoption of Hadoop and related alternatives to conventional databases, we will see a shift toward more business-focused data strategies. These carefully crafted strategies will involve chief data officers (CDOs) and other business leaders, and will be guided by innovation opportunities and the creation of business value from data. The latest generation of exciting advances in data science and data engineering techniques will spark creative business opportunities and the data infrastructure will play a supporting role. Real benefits will best be achieved through the strategic alignment of high value opportunities with the right technologies to support innovative solutions.
2. Apache Spark Over MapReduce. Apache Spark, the in-memory data processing engine, burst onto the scene in 2014 as a Top-Level Apache Project. It was the dominant buzz in 2015 as being enterprise ready, and it saw significant early adoption. Expect 2016 to see an explosion of Spark adoption by fast followers and by organizations seeking to replace legacy data management platforms. Spark on Hadoop YARN is likely to dominate this explosion and will greatly reduce the need for MapReduce processing.
3. Deep Learning and Open Machine Learning. In late 2015, Google open-sourced TensorFlow, its machine learning platform. Just a few weeks later IBM released its machine learning technology, SystemML, into the open source community. These latest projects join a growing plethora of existing open source machine learning platforms such as DL4J (for implementing deep learning in Java). Data scientists and technologists now have, at their fingertips, the world’s leading algorithms for advanced predictive analytics. Expect this to propel the innovative creation of value from data in ways that we’ve never previously imagined.
4. World Enabled by AI. Out of favor since the 1970s, artificial intelligence (AI) is becoming hot again. Examples like autonomous vehicles, facial recognition, stock trading, and medical diagnosis are exciting the imaginations of the current generation of technologists. Moreover, the power of distributed, parallel computing is more accessible than ever before, making it possible to experiment with many novel ideas. At the same time the rich data needed to feed machine learning algorithms is more prolific, diverse, and readily available than ever before. While you may have to wait a few more years to get your self-driving car, you can expect your life to get a little bit better in 2016 because of the innovative uses of AI.
5. IoT Matures. As far back as 1999 Kevin Ashton coined the phrase “internet of things” (IoT), and the world has seen interesting advances in the use of sensors and interconnected devices. The IoT phenomenon has rapidly been gathering steam in recent years with companies like GE, Cisco Systems, and Ericsson contributing. According to Gartner, IoT will include 26 billion operational units by 2020, and IoT product and service providers will generate over $300 billion in incremental revenue as a result.
Expect 2016 to see the embracing of open standards that improve device monitoring, data acquisition and analysis, and overall information sharing. We will also see a divergence in the issues surrounding types of data collected by these devices. Personal, consumer-driven data will increase security and privacy complexities. Enterprise-driven data will increase the complexities of issues like knowledge sharing, storage architectures, and usage patterns.
All of these sensors and devices produce large volumes of data about many things, some of which have never before been monitored. The combination of ever cheaper sensors and devices and the ease with which the collected data can be analyzed will generate an explosion of innovative new products and concepts in 2016.
6. Analysis of “Unstructured” Content Becomes Routine. The analysis of free text, audio, video, images, spam, emojis, and other non-tabular data (it isn’t really unstructured) has been a specialty area within the data science world for some years now. The convergence of more accessible semantic analysis techniques, the explosion of free text content, and libraries such as word2vec and doc2vec (in DL4J) is leading to more mainstream use of text mining techniques. Google’s FaceNet system is showing facial recognition accuracy of 99.96% (this DailyMail.com article), and Carnegie Mellon researchers have recently open-sourced their OpenFace project, which they claim can recognize faces in real time with only 10 reference photos. The availability of free tools to analyze social media continues to grow (see this Butler Analytics report). These are just a few examples of the maturation of techniques, tools, and libraries that enable the analysis of non-tabular data by a more general community of technologists. Expect to see more widespread use of these techniques in 2016, and of course the security and privacy debates that are sure to follow.
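To make the word2vec idea a little more concrete: such libraries represent each word as a vector of numbers, and words used in similar contexts end up with vectors that point in similar directions. The sketch below is only an illustration of that comparison step, not how word2vec or DL4J is implemented. The three-dimensional vectors are invented toy values (real models learn hundreds of dimensions from text), and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy, hand-made vectors (an assumption for illustration only).
# A trained model would learn these values from a large text corpus.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

def most_similar(word, vectors):
    """Return the other word whose vector is closest in direction to `word`."""
    target = vectors[word]
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(target, vectors[w]))

print(most_similar("king", vectors))  # prints "queen"
```

With these toy values, "king" lands next to "queen" rather than "banana", which is the kind of similarity lookup that text mining tools build on.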
7. General Purpose GPUs for Distributed Computing. Unlike multicore CPUs, which have a dozen or so cores, GPUs (Graphics Processing Units) integrate hundreds to thousands of computing cores. GPUs were originally developed to accelerate computationally expensive graphics functions. Recently, however, General Purpose GPU (GP-GPU) adaptations have stretched this technology to handle parallel and distributed tasks. The supercomputer segment has embraced GPU technology as an integral part of computational advancement.
Until 2015, general-purpose programming for GPUs was labor-intensive, requiring developers to manage the hardware-level details of this infrastructure. Nvidia’s CUDA, however, is a parallel computing platform and programming model that provides an API abstracting the underlying hardware from the program. Additionally, Khronos Group’s Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, digital signal processors (DSPs), field-programmable gate arrays (FPGAs), and other processors or hardware accelerators.
With these programming abstractions comes the realistic ability for many organizations to consider GPU infrastructure rather than (or in addition to) CPU compute clusters. Look for the combination of open source cloud computing software such as OpenStack and Cloud Foundry to enable the use of GPU hardware to build private and public cloud computing platforms.
8. Hybrid Transaction/Analytic Processing Adoption. The past three decades of analytical computing have emphasized the separation between operational and analytical concerns. Data warehouse architectures integrate copies of data from operational source systems and remodel them for analytical purposes. Similarly, modern “data lake” style architectures use Big Data technologies to replicate data into an integrated pool of operational data for exploration and discovery. The problem with these models is that they require the duplication of data, sometimes in multiple copies, which does not accommodate the data explosion that we are currently experiencing.
In 2014 Gartner coined the acronym HTAP (Hybrid Transaction/Analytic Processing) to describe a new type of technology that supports both operational and analytical use cases without any additional data management infrastructure. HTAP enables the real-time detection of trends and signals, which in turn allows an immediate response. For example, HTAP can enable retailers to quickly identify items that are trending as best-sellers within the past hour and immediately create customized offers for those items.
Conventional DBMS technologies are not capable of supporting HTAP due to their inherent locking contention and inability to scale (I/O and memory). However, the emergence of NewSQL technologies couples the performance and scalability of NoSQL technologies with the ACID properties of traditional DBMS technologies to enable this hybrid ability to handle OLTP, OLAP, and other analytical queries. HTAP functionality is offered by database companies such as MemSQL, VoltDB, NuoDB and InfinitumDB. Expect to see the adoption of these technologies by organizations looking to avoid the complexities of separate data management solutions.
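The retailer scenario above boils down to running an analytical query directly against the same table that is receiving live sales. The sketch below shows only that query pattern. It uses Python's built-in sqlite3 module purely as a stand-in: SQLite is not an HTAP database and would not scale this way. The table name, columns, timestamps, and the `trending_items` helper are all hypothetical.

```python
import sqlite3

def trending_items(conn, now, window_seconds=3600, top_n=3):
    """Return (item, units_sold) for the best sellers inside the time window."""
    cutoff = now - window_seconds
    rows = conn.execute(
        "SELECT item, SUM(quantity) AS units FROM sales "
        "WHERE sold_at >= ? AND sold_at <= ? "
        "GROUP BY item ORDER BY units DESC, item ASC LIMIT ?",
        (cutoff, now, top_n),
    )
    return rows.fetchall()

# One in-memory table plays both roles: it takes transactional writes
# and answers the analytical query, with no copy into a warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, quantity INTEGER, sold_at INTEGER)")

now = 1_000_000  # hypothetical "current" time, in seconds

# Operational side: sales arrive as ordinary inserts in a transaction.
with conn:
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("scarf", 2, now - 7200),    # older than one hour, so ignored
            ("umbrella", 5, now - 1800),
            ("boots", 3, now - 900),
            ("umbrella", 4, now - 60),
            ("gloves", 1, now - 30),
        ],
    )

# Analytical side: best sellers over the past hour, straight from live data.
print(trending_items(conn, now))
# prints [('umbrella', 9), ('boots', 3), ('gloves', 1)]
```

In a real HTAP product the difference lies in doing this at high write volume without the query and the inserts contending for locks, which is the part this toy example cannot demonstrate.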
9. Data Security, Privacy, and Encryption. The ongoing whack-a-mole fight against cybercrime will continue to escalate in 2016 as cybercriminals and hacktivists continue to become more sophisticated. Consumers are growing increasingly aware that their personal data has value and that their data privacy is at risk. Likewise, corporations are increasingly concerned about the theft of sensitive data, the cost of recovery, and the corresponding reputational damage. Meanwhile, technology users are ever more hyper-connected, thereby increasing the vulnerability of data. These factors mean that advanced data security strategies will continue to be a high priority for IT organizations worldwide.