SCALE@NTU Invited Talk: Understanding Data Flow in Entity-Relationship Diagrams
This research seminar is organized by Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU). Please find below the registration information:
To attend the seminar physically (limited seats available), click here
To attend the seminar on Teams, click here
Abstract: In data science, data pre-processing and data exploration require many intricate steps such as creating variables, merging data sets, filtering records, transforming and replacing values, and normalization. By analyzing the source code behind analytic pipelines, it is possible to infer how data objects are used and how they relate to each other. We defend the idea of analyzing data science source code to provide a data-centric view. Separately, two diagrams have proven essential for managing database and software development projects: (1) Entity-Relationship (ER) diagrams (to understand data structure and data interrelationships) and (2) flow diagrams (to capture the main processing steps). Historically, these two diagrams have been used separately, complementing each other. We propose combining them in a unified view of data pre-processing and data exploration. With this motivation, we present a hybrid diagram called FLOWER (FLOW+ER) that combines modern UML notation with data flow symbols in order to understand complex data pipelines embedded in source code (most commonly Python). The goal of FLOWER is to assist data scientists by providing a reverse-engineered, data-centric analytic view. We present a preliminary demonstration of the FLOWER concept, incorporated into a prototype that traces a representative data pipeline and automatically builds a diagram capturing data relationships and data flow.
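To give a flavor of what inferring data flow from pipeline source code can look like, here is a minimal sketch using Python's standard ast module. It is not the FLOWER prototype and makes no claim about how that prototype works. The sample pipeline, its file and column names, and the function name data_flow_edges are all hypothetical, chosen only for illustration. The sketch parses the pipeline text without running it, so pandas does not need to be installed.

```python
import ast

# Hypothetical pipeline source, used only as input text (it is never executed).
PIPELINE = '''
import pandas as pd
sales = pd.read_csv("sales.csv")
stores = pd.read_csv("stores.csv")
merged = sales.merge(stores, on="store_id")
recent = merged[merged["year"] >= 2020]
summary = recent.groupby("region").sum()
'''


def data_flow_edges(source):
    """Return (input_variable, output_variable) pairs for top-level assignments.

    An edge (a, b) means that b is assigned from an expression that reads a,
    where a was itself assigned earlier in the same script.
    """
    tree = ast.parse(source)
    defined = set()  # variables assigned so far
    edges = []
    for stmt in tree.body:
        if not isinstance(stmt, ast.Assign):
            continue  # skip imports and bare expressions
        # Previously assigned variables that appear on the right-hand side.
        used = sorted(
            {
                node.id
                for node in ast.walk(stmt.value)
                if isinstance(node, ast.Name) and node.id in defined
            }
        )
        for target in stmt.targets:
            if isinstance(target, ast.Name):
                edges.extend((name, target.id) for name in used)
                defined.add(target.id)
    return edges


for src, dst in data_flow_edges(PIPELINE):
    print(f"{src} -> {dst}")
```

Each printed edge corresponds to one arrow in a flow diagram. A fuller tool would also need to handle functions, loops, reassignment, and the join keys that indicate relationships between data sets, all of which this sketch ignores.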
Speaker: Carlos Ordonez studied at UNAM in Mexico, earning a B.Sc. in actuarial science (applied math, similar to data science degrees) and an M.S. in computer science. He then pursued a PhD at the Georgia Institute of Technology, advised by Edward Omiecinski and focusing on accelerating machine learning algorithms, and received the degree in 2000. Carlos worked at NCR from 1998 to 2006, collaborating on the optimization of machine learning and cube query processing algorithms on the Teradata parallel DBMS. In 2006 Carlos joined the Department of Computer Science at the University of Houston, where he currently leads the Big Data Systems (BDS) lab. From 2013 to 2015 Carlos regularly visited MIT and collaborated with Michael Stonebraker, working on new-generation parallel DBMSs (columnar, arrays). From July 2014 to July 2015 Carlos worked as a visiting researcher at AT&T Labs-Research (formerly AT&T Bell Labs), where he conducted research with Divesh Srivastava on stream analytics, extensions to the R language, and data quality. His research projects have been funded by three NSF grants.