Big data has evolved a lot over the past few years: from a happy buzzword to a hated buzzword, and from a focus on volume to a focus on variety and velocity. The term “big data” and the technologies that encompass it have been pored over, picked over and bastardized, sometimes beyond recognition. Yet we’re at a point now where it’s finally becoming clear what all of this talk has been leading up to.
It’s a world of automation and intelligence, where it’s easier than ever not only to mine data, but also to build intelligence into everything from mobile apps to transportation systems. Big was never really the end goal, but the models driving this change generally feed on data to get smarter. Variety was never really a goal, either; it’s just that the more we can quantify, the more we can learn about the world around us.
It’s a world we’ll delve into in great detail at our Structure Data conference, which kicks off just a week from today (March 19) in New York. We have speakers from nearly every tech company that matters, as well as from some of the biggest companies in the world and some of the smartest startups around. They’ll be talking about everything from fighting human trafficking to the future of Hadoop and the cutting edge in artificial intelligence.
Here are five of the big trends I have been watching that helped shape who’ll be speaking and what they’ll be talking about. Hopefully, they give you something to think about and, if you’re planning to attend, something to look forward to.
1. Hadoop’s march toward true platform status
At its core, Apache Hadoop might still be a distributed file system and the MapReduce processing framework, but Hadoop is now so much more. Thanks to advances such as YARN, Hadoop clusters can now run any number of different processing frameworks for any number of different workloads, all taking advantage of the same underlying storage infrastructure. What was once a MapReduce cluster for ETL jobs, for example, can now also operate simultaneously as a Spark cluster for machine learning, a Storm cluster for stream processing and a Tez cluster for interactive SQL.
Essentially, Hadoop is transforming from a tool useful for certain tasks into a bona fide platform capable of supporting all sorts of applications. Early adopters such as Airbnb and Twitter are already taking advantage of this new reality, and efforts by Hadoop vendors such as Cloudera, Hortonworks and MapR to build new capabilities into their products and support new frameworks suggest mainstream Hadoop users also will at some point. Startups such as Continuuity, Mortar Data and WibiData should speed this evolution as they make it easier to build big data applications, all the while open sourcing some of their technological underpinnings and thus giving tools to even more developers.
Of course, it won’t just be developers feeling the effects of Hadoop as a platform, but also incumbent software vendors. Traditional data warehouse, database, and even statistics software companies will have to find ways to cope with the fact that Hadoop can now store a lot more data than they can for a lot less money, and also analyze it in a variety of ways.
2. The rise of artificial intelligence, finally
We have the computers, we have the data, and we have the algorithms, so we now have the artificial intelligence. No, it’s not yet the fear-mongering stuff of science fiction or the human-replacing stuff of Her, but AI is finally for real. Thanks to advances in machine learning, we have smartphones that can recognize voice commands, media services that predict what movies we’ll like, software that can identify the relationships among billions of data points, and apps that can predict when we’re most fertile.
We have IBM’s Watson system putting together ingredient lists for chefs.
Looking forward, the work being done in areas like deep learning will make our AI systems more useful and more powerful. Set loose on complex datasets, these models are able to extract and identify the features of what they’re analyzing at a level that can’t be programmed. Without human supervision, deep learning projects have figured out what certain objects look like, mapped word usage across languages and even learned the rules of video games. All of a sudden, it looks possible to automate certain tasks such as tagging content to make it searchable, or predicting with high accuracy what someone’s words mean or what they’ll type next.
Applied to new types of content in new areas, these methods could prove even more valuable. What are the features that comprise a certain type of cancer cell? Can we help nurses know as much as doctors? What combination of previously unmeasurable variables might signal a suicide risk among teenagers? How do we make self-driving cars and drone delivery services commercial realities? We can’t call it a savior yet, but AI does seem to hold a lot of promise.
3. Analytic power to the people
It might not seem like a big deal compared with the really hard infrastructure and algorithm work being done elsewhere, but efforts to make data analysis a standard and easy-to-achieve skill could prove transformative to our society. Just giving everyday people the power to visualize the data around them in new ways can open up entirely new ways of thinking about our lives.
Yesterday, for example, I used free software to build a network graph of my iTunes library and compare the words Edward Snowden used in a recent interview to those used by NSA boss Gen. Keith Alexander. I wasn’t doing data science or deep learning, but I was able to perform simple analysis on, and then visualize, data that I found interesting. Previously, I’ve mapped my Twitter followers, analyzed Gigaom writers’ headlines and even visualized my food intake and exercise. Who knows, getting young people interested in analyzing their own data with interesting visualizations might actually help spur that data-savvy workforce everyone seems to think we need.
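For readers curious what that kind of simple word comparison might look like in code, here is a minimal Python sketch. It is not the free software mentioned above, just one plausible way to do it with the standard library. The stopword list, the function names and the two sample sentences are all made-up placeholders, not quotes from any real interview.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "we", "it", "for"}

def word_counts(text):
    """Lowercase the text, split it into words, and count the non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def compare_speakers(text_a, text_b, top_n=5):
    """Return the words both speakers use, plus each speaker's most frequent unshared words."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    shared = sorted(set(counts_a) & set(counts_b))
    only_a = [w for w, _ in counts_a.most_common() if w not in counts_b][:top_n]
    only_b = [w for w, _ in counts_b.most_common() if w not in counts_a][:top_n]
    return shared, only_a, only_b

# Made-up placeholder sentences standing in for two interview transcripts.
speaker_a = "Privacy matters and surveillance threatens privacy for everyone."
speaker_b = "Security matters and collection protects security for the nation."

shared, only_a, only_b = compare_speakers(speaker_a, speaker_b)
print("shared:", shared)
print("distinctive to A:", only_a)
print("distinctive to B:", only_b)
```

Swapping in real transcripts for the two placeholder strings would give a rough picture of which words each speaker leans on, which could then be fed into whatever visualization tool you prefer.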
And as the tools available to laypeople get more advanced, and as we accumulate even more data about ourselves via fitness trackers, connected cars and the internet of things, in general, being able to get a sense of our quantified selves will become much more important. We are, for many purposes, becoming numbers fed into and spit out of algorithms. Our personal data will influence everything from the ads we see to the job offers we get, and it will behoove individuals to see at least a modicum of what companies, institutions and the government are seeing.
4. The cloud
I said three years ago that cloud computing and big data were on a collision course, and it’s finally happening, only in a much broader way than I predicted. In fact, the biggest impacts of this convergence might have little to do with being able to consume Hadoop, business intelligence suites or any other sort of analytic software as a service. Those things are happening, and they’ll make life easier for startups and established companies that want to move new workloads into the cloud, but the beauty of the cloud to me is now its ability to democratize hard computer science.
Already, some of the technologies and techniques I’ve highlighted are being delivered as services, often via API, and the list will only grow. If you’re a developer and you want to learn Hadoop and use Elastic MapReduce, that’s available. But if you just want to connect to a service like IBM’s Watson cloud or the MindMeld API and have someone else’s algorithms provide a layer of artificial intelligence to your data, that’s an option, too. The work being done at places from Google to Pinterest to Netflix ensures that many of these techniques will just be embedded into the services we consume.
Assuming these approaches really work and let developers deliver real intelligence (as opposed to, say, generic recommendation features that become more of a plague than a benefit), they will raise the bar for what consumers expect, even for mundane tasks. Many of us will expect to know not just that our shopping list contains lettuce, but also what recipes it’s good for, what the good alternatives are if the store is out and where we can get the lowest prices. Paired with the available processing power and data capacity of our smartphones and other computers, well-designed apps can actually make this a reality whatever signal we’re picking up from AT&T’s towers.
5. The law
And, finally, the legal system is the potential rain — or, depending on how you look at it, parental chaperone — on this big data party. Already, judges, legislators, regulators and even the president are trying to get their heads around what all this data collection means and then carve out some semblance of order. It is not easy terrain to navigate, especially with all the competing interests at play.
One of the trickiest areas to govern will be the consumer privacy area, where there’s great potential to elevate the consumer experience but also great risk of invading individuals’ privacy. Oh, and lots of lobbyist money is now coming into play. We want to get the best deals on food or new clothes, and we want to be able to have our DNA sequenced for $99. We also need to make sure the potentially sensitive information we supply isn’t used in unexpected ways or doesn’t pop up in places we didn’t expect it to, such as in banner ads on a shared computer.
It will be a big challenge for lawmakers and others in the legal field to craft a framework of laws, regulations, and case law that lets consumers have their cake and eat it too when it comes to privacy. Frankly, I’m not sure they can without understanding the technology and where it’s headed, and I’m not sure we’ll ever really be happy with the results.
Sure, we don’t want Facebook, Google and ultimately even someone like Geico analyzing the heck out of all our data, but we also don’t want to go back to a world of weirdly designed websites, waiting for taxis, and generally inefficient, not-personalized lives.