Welcome to our next meeting. We have three interesting talks and a new hot location in the centre of Berlin. The details:
• Talk 1: "Meet Emma: A quotation-based Scala DSL for Scalable Data Analysis", by Alexander Alexandrov
• Talk 2: "Spark and Flink at the limit: Benchmarking Data Flow Systems for Scalable Machine Learning" by Christoph Boden
• Talk 3: "Kafka Streams Test-Drive" by Christoph Bauer
The location is: DATA SPACE by SAP, Rosenthaler Straße 38, 10178 Berlin Mitte. More info: Data-Space, IoT-StartUp
Detailed talk info 1:
Title: Meet Emma: A quotation-based Scala DSL for Scalable Data Analysis
Abstract: Scala DSLs for data-parallel collection processing are usually embedded through types (e.g., RDD, DataFrame, Dataset in Spark; Dataset, Table in Flink). This approach introduces a design trade-off between two important DSL features: deep reuse of syntactic constructs from the host language (e.g., for comprehensions, while loops, conditionals, pattern matching) on the one hand, and the ability to lift DSL terms to an intermediate representation (IR) suitable for automatic optimizations on the other. We argue that a different embedding approach based on quotations allows for reconciling these features. As a proof of concept, we present Emma - a Scala DSL for scalable data analysis based on quotations.
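For readers unfamiliar with the trade-off, here is a minimal sketch of a type-based embedding. It is an illustration only: `Plan`, `Source`, `FilterOp`, `MapOp`, `Bag`, and `EmbeddingSketch` are hypothetical names made up for this example and are not the API of Emma, Spark, or Flink. The sketch shows that method calls on the DSL type can be recorded as an IR, while the bodies of the passed functions and host-language conditionals stay invisible to it.

```scala
// Hypothetical mini-DSL for illustration only; not the API of Emma, Spark, or Flink.
sealed trait Plan
final case class Source(name: String) extends Plan
final case class FilterOp(input: Plan) extends Plan
final case class MapOp(input: Plan) extends Plan

// Type-based embedding: each method records an IR node instead of computing.
// The functions f and p are opaque closures, so the IR cannot inspect their bodies.
final class Bag[A](val plan: Plan) {
  def map[B](f: A => B): Bag[B] = new Bag[B](MapOp(plan))
  def withFilter(p: A => Boolean): Bag[A] = new Bag[A](FilterOp(plan))
}

object EmbeddingSketch {
  val xs: Bag[Int] = new Bag[Int](Source("xs"))

  // The for comprehension desugars to xs.withFilter(...).map(...),
  // so the operator structure reaches the IR.
  val ys: Bag[Int] = for (x <- xs if x > 0) yield x * 2

  // A host-language conditional is evaluated before the DSL sees anything:
  // the resulting plan contains only the branch that was taken.
  def choose(flag: Boolean): Bag[Int] = if (flag) ys else xs

  def main(args: Array[String]): Unit = {
    println(ys.plan)            // MapOp(FilterOp(Source(xs)))
    println(choose(false).plan) // Source(xs)
  }
}
```

In this sketch the IR knows that a filter is followed by a map, but not that the predicate is `x > 0` or that a conditional chose between two plans. As the abstract puts it, a quotation-based embedding is meant to reconcile host-language syntax reuse with lifting to an IR.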
Bio: Alexander Alexandrov is a PhD candidate at the Database and Information Management (DIMA) group at Technische Universität Berlin. His main research interest is in bridging the gap between the demands of modern data analysis platforms and the need for high-level, declarative analytics languages. In addition, he is also interested in methods and techniques for scalable data generation and benchmarking of data analysis platforms.
He is the lead developer of two open-source projects:
- Peel - a framework that helps you to define, execute, analyze, and share experiments for distributed systems and algorithms.
- Emma - a quotation-based Scala DSL for scalable data analysis.
Detailed talk info 2:
Title: Spark and Flink at the limit: Benchmarking Data Flow Systems for Scalable Machine Learning
Abstract: Distributed data flow systems such as Apache Spark or Apache Flink are popular choices for scaling machine learning algorithms in production. Industry applications of large-scale machine learning, such as click-through rate prediction, rely on models trained on billions of data points which are both highly sparse and high-dimensional. This talk will shed light on the performance of both systems for scalable machine learning workloads. Rather than relying on existing library implementations, we implemented a representative set of distributed machine learning algorithms suitable for large-scale distributed settings. These algorithms closely resemble industry-relevant applications and provide generalizable insights into system performance. We tuned relevant system parameters and ran a comprehensive set of experiments to assess the scalability of Apache Flink and Apache Spark for data of up to four billion data points and 100 million dimensions. This talk will present the results and insights of these experiments, as well as lessons learned and pitfalls encountered while tuning the relevant system parameters.
Bio: Christoph Boden is currently a Computer Science Research Associate at the TU Berlin Database Systems and Information Management Group (DIMA), where he contributes to the coordination and management of the Berlin Big Data Center (BBDC). His research focuses on benchmarking distributed data processing platforms such as Apache Spark and Apache Flink for scalable machine learning workloads, large-scale data analysis, and text mining. He studied Industrial Engineering at Technische Universität Dresden, Technische Universität Berlin and the University of California, Berkeley, and received a master's degree ("Dipl.-Ing.") from TU Berlin in 2011. Christoph teaches a graduate-level course on Scalable Data Analytics at TU Berlin and is a laureate of the Software Campus program. He has published numerous peer-reviewed scientific papers at prestigious international conferences, workshops and journals in the field of distributed data processing systems and data mining.
Detailed talk info 3:
Title: Kafka Streams Test-Drive
Abstract: In March 2016, Confluent, Inc. introduced yet another stream processing API. With Apache Spark, Apache Storm, Apache Flink, etc. out there, the question arises: why another one? This talk will give a general introduction to the library and will try to answer some fundamental questions.
We will cover:
- what it is,
- what it looks like,
- how it works,
- deployments, and
- what it's good for.
Bio: Christoph is a Big Data Engineer and consultant based in Berlin. His first visit to the zoo was in 2010 and he has been with the elephants, pigs, bees, ... ever since. He has seen a lot of weather conditions like storms and electrical discharges in the form of sparks.