Data Wrangling & Spark^2

February 28, 19:00

Berlin, Germany
#openspace CoBa, Bülowstraße 80, 10783 Berlin

Welcome to the next meeting with three exciting new talks! 
Join us for some of the latest tech buzz, good drinks and a pleasant get-together. 

The talks are:
Talk 1: "Data wrangling - The Key to Successful Data Science" by Lars Grammel (Trifacta)
Talk 2: "Tips & tricks for making Spark lightning fast - selected case studies on caching and shuffle avoidance"
Talk 3: "Visual framework for Spark"
Talks 2 and 3 are by Adam Jakubowski and Michał Iwanowski.
We are also happy to visit the marvellous new venue #openspace this time.

Detailed Description Talk 1:
Data preparation can take up between 60% and 80% of the time spent in data science projects and is crucial for the success of such projects. In this talk, the different data preparation activities such as data discovery, structuring, cleaning, enriching and validating are examined, challenges are highlighted, and combinations of intelligent algorithms and user interfaces to speed up data preparation are explored.
Lars is the manager of the German office of Trifacta, a San Francisco-based startup that is the leader in the data transformation space. At Trifacta, he has led the development of central functionality for each major Trifacta release, starting with the first private beta release. In 2015, he started and built out the Trifacta office in Berlin. Lars holds a PhD in computer science (specialising in data visualization) from the University of Victoria, Canada, and a Master's degree in computer science (specialising in software engineering) from RWTH Aachen University, Germany.
Detailed Description Talk 2:
Our experience has shown numerous examples of suboptimal usage of Spark that lead to severe performance bottlenecks. Such issues very often crystallise around two specific problem areas: caching and shuffling. Having helped multiple companies to regain the lost performance, we’ll share our common observations. In this talk, we’ll describe:
● what the most common misconceptions about Spark’s caching are,
● when and how caching should be used,
● why it is the user’s responsibility to cache (instead of Spark’s),
● the difference between caching and checkpointing,
● how to substitute shuffling with aggregation.
Detailed Description Talk 3:
In this session, we will present a tool for building Spark applications visually, with limited coding skills required. A graphical user interface based on manipulating directed acyclic graphs of operations speeds up the process of building data processing pipelines and helps avoid writing boilerplate code. During the presentation, we will give a live demo of Seahorse and show how to build, test, and productionize a distributed computing application on Spark.
Biography Speakers Talk 2 & 3:
Michał Iwanowski: Michał earned a Master’s degree in Computer Science from Warsaw University of Technology, specializing in software engineering and machine learning. Prior to he worked with Big Data processing, predictive analytics and data warehousing at IBM. The author of a number of publications and invention disclosures, he has collaborated with medical researchers on statistical analysis of medical data and built systems for computer-aided experiment design.
Adam Jakubowski is a Software Engineer at He studied Computer Science at the Warsaw University of Technology. He is passionate about creating clean, simple and scalable solutions. While working at he significantly contributed to two main Deepsense products - Seahorse and Neptune.
We wish to thank #openspace for sponsoring this event!

Big Data Beers
