6 and 7 February 2017

Gent, Belgium

Big Data Track: Focus on MapReduce and Spark Jobs, Not Deploying

Tuesday, February 7, 14:40 - B4.039

As big data has evolved, the software behind it has grown more complex. Big data infrastructures, such as Hadoop and Spark clusters, require progressively more time and effort to set up, configure, maintain and integrate with existing systems. In the absence of a big data "expert", users are often discouraged from using such solutions. Consuming big data infrastructure as a service seems a viable alternative, yet it is not without drawbacks: it a) is costly, b) often locks users in to a vendor, and c) is limited to what the vendor decides to make available.

In this talk we discuss how we capture, as open source software, the knowledge needed to operate production-grade multi-node Hadoop and Spark clusters. This lets us focus on the science and, as a community, build operational knowledge in a shareable, repeatable, and executable form. In doing so we distill best practices into software, so that anyone can deploy a recommended production-grade Hadoop and Spark cluster. Attendees will leave with the knowledge and code to deploy their own Hadoop or Spark clusters on the infrastructure of their choice (cloud, bare metal, VMs, or containers).

  • Konstantinos Tsakalozos is currently employed by Canonical Ltd on the Juju Big Software team. Prior to this, he worked on Big Data for Microsoft. He holds a Ph.D. in IaaS cloud resource allocation from the Department of Informatics at the University of Athens. His interests include cloud computing, distributed architectures and multidimensional indexing.