INTERVIEWS

‘Spark is becoming more important than Hadoop’


Brian Pereira
8th March, 2016

— Vinita Gupta Malu

Q. What are the Big Data trends that you observe in 2016?

Moshe: I’ve observed the following trends:

Spark is becoming more important than Hadoop: Hadoop has been around since the mid-2000s, and has evolved to the point where it can efficiently and reliably perform big data analytics. Spark has the advantage of being a fast follower, able to learn from and avoid Hadoop’s mistakes. Spark has a more generic and extensible programming model, which makes it easier to use for analytics. It can also handle big data in motion, via Spark Streaming, and serves as the basis for a powerful graph processing library (GraphX) and a full-featured data science library (MLlib). Spark’s closest relative in the Hadoop world is Tez, which, like Spark, can execute algorithms organized as directed acyclic graphs. The open source community, recognizing the similarity, has crowned Spark as the converged platform of choice, and it will soon replace Tez in the Hadoop platform. Spark is the future of big data computing.

There will be fewer big data startups: Venture capital investors view big data as last year’s trend. They have already doubled down on a variety of startups, and want to see those investments pan out. Hence, this year it will be difficult for big data startups to win over new investors.

Oracle continues to lose market share to open source big data technologies: The mainstream software giants have adopted various strategies to cope with the competition from open source big data platforms. Some have formed alliances (e.g., Microsoft and Hortonworks), some have embraced and extended (e.g., IBM Watson). The company that least seems to get this brave new world is Oracle, which continues to sell Exadata (an expensive alternative for big data analytics), and has launched its own proprietary NoSQL database that has no advantage over open source alternatives. Oracle is having trouble understanding that in 2016, most customers prefer to avoid vendor lock-in.

Cassandra is becoming a dominant player in the NoSQL space: Cassandra was always the fastest NoSQL database, especially for write-heavy applications, and it provides an active-active distributed datacenter topology out of the box. The knock on Cassandra was that it was hard to deploy, maintain and program. DataStax, the commercial vendor for Cassandra, seems to have noticed: The CQL language makes Cassandra far easier to program, and the OpsCenter management tool makes maintenance a lot simpler.

Q. What, according to you, will be the big data challenges in 2016? Explain the ways to tackle them.

Moshe: If there is one common thread that links 95% of Ness’s big data customers, it is FUD – fear, uncertainty, doubt. The field is full of competing or overlapping products, each of which claims to be the Holy Grail for big data. The big boys (Oracle, IBM, SAP, Teradata, etc.) all want to steer you towards their (costly) offering. The upstarts (Cloudera, Hortonworks, Databricks, DataStax, etc.) all talk about use cases with dramatic savings, but do not tell you about use cases where their product fails miserably. Depending on the vertical, there are dozens of one-stop shops that offer to take the data and come back with vertical-specific insights.

Left to evaluate this cacophony of conflicting voices is your organization, with little experience in the big data minefield. No wonder we read so often about big data failures. The real culprit is not Hadoop – the real culprit is the silo approach that drove the company to make a choice without the benefit of outside advice or experience. The only way to cut through the hype around a product is to try it yourself, and/or talk to someone you trust who is using it. Another option is to partner with a company like Ness Software Engineering Services that has seen a broad range of big data projects and technologies, and has a proven track record of success.

Q. Will Spark overtake Hadoop?

Moshe: Hadoop as a concept revolutionized the world of data processing, and ushered in the era of big data. But Hadoop as a product ecosystem is certainly showing its age, and, for many use cases, it has been upstaged by more modern technologies. Therefore, Hadoop is not necessarily the safe choice for your big data use case. In the long run, Hadoop may lose out to newer products like Spark and Cassandra, which had the benefit of learning from Hadoop’s growing pains.

Q. What are the best practices for migrating a project from Hadoop to Spark?

Moshe: Do the migration in stages.
Get your code working on Spark as quickly as possible: If you are using Hive, you will find the transition to Spark SQL fairly painless. If you are using MapReduce, you can often reuse your mapper and reducer functions and just call them in Spark, from Java or Scala.
Once your code is running on Spark, tune the performance, and consider re-coding your algorithm to take advantage of Spark’s ability to execute algorithms structured as directed acyclic graphs.
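
As a rough illustration of the first stage, the sketch below keeps a word-count mapper and reducer as plain functions with MapReduce-style signatures. The interview mentions Java or Scala; Python is used here only for brevity. The word-count job, the function names and the run_locally driver are all hypothetical, and the driver merely stands in for the framework so the example can be run without a cluster. The commented-out lines show how the same two functions might be passed to PySpark's flatMap and groupByKey, assuming an existing SparkContext called sc and an input path.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """MapReduce-style map step: emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def reducer(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """MapReduce-style reduce step: sum all counts seen for one key."""
    yield key, sum(values)


def run_locally(lines: Iterable[str]) -> Dict[str, int]:
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    result = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            result[out_key] = out_value
    return result


# On Spark (assumption: PySpark is installed, `sc` is a SparkContext and
# `path` points at the input), the same functions could be reused as-is:
#   counts_rdd = (sc.textFile(path)
#                   .flatMap(mapper)
#                   .groupByKey()
#                   .flatMap(lambda kv: reducer(kv[0], kv[1])))

counts = run_locally(["spark and hadoop", "spark streaming"])
print(counts)
```

In the later tuning stage, a job shaped like this one would typically swap groupByKey for reduceByKey with a two-argument sum, which combines values on each partition before the shuffle.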

————————————————————————————————————————

The views and opinions expressed in this article are entirely those of Moshe Kranc, Chief Technology Officer, Ness Software Engineering Services.

———————————————————————————————————————
