Working with the WiDS Datathon dataset over the past week has been a thrilling exercise. This dataset presents an opportunity to learn about interesting, real-world modeling challenges, and it differs from the curated datasets found in textbooks and classic machine learning exercises. For that reason, I discuss some of the challenges you may encounter around missing data, multicollinearity, and linear/nonlinear approaches. I will also provide resources to help you with these topics.
In this podcast episode, Margot interviews Telle Whitney, a highly accomplished woman in the tech industry. Telle is best known for her 15 years as CEO of the Anita Borg Institute for Women and Technology, also known as AnitaB.
We spoke to Elena Grewal when she was Airbnb’s head of data science, where she discovered the key to making successful career jumps: “Often it’s more about words being different than about skills being different”. Elena founded and runs Data 2 the People, a community of data scientists working on projects for the public good.
Janet George spoke to us when she was Western Digital’s chief data officer and first female Fellow. In this episode Janet explains that from manufacturing to product development, data science plays an important role in the storage industry. Janet is now the Group Vice President, Autonomous Enterprise, Advanced Analytics, Machine Learning & Artificial Intelligence at Oracle.
Panel: Putting our values into practice in data science work
Moderator:
Megan Price, Executive Director, Human Rights Data Analysis Group (HRDAG). As the Executive Director of the Human Rights Data Analysis Group, Megan drives the organization’s overarching strategy, leads scientific projects, and presents HRDAG’s work to diverse audiences. Her scientific work includes analyzing documents from the National Police Archive in Guatemala and contributing analyses submitted as evidence in multiple court cases in Guatemala. Her work in Syria includes collaborating with the Office of the United Nations High Commissioner for Human Rights (OHCHR) and Amnesty International on several analyses of conflict-related deaths in that country. In 2022 she was named a Fellow in the American Statistical Association.
Panelists:
Jennifer Pan is a Professor of Communication and Senior Fellow at the Freeman Spogli Institute at Stanford University. Her research resides at the intersection of political communication and authoritarian politics. Using large-scale datasets on political activity in China and other authoritarian countries, her work answers questions about how autocrats perpetuate their rule; how political censorship, propaganda, and information manipulation work in the digital age; and how preferences and behaviors are shaped as a result. Her papers have appeared in peer-reviewed publications such as Science, the American Political Science Review, the American Journal of Political Science, and Journal of Politics. She graduated from Princeton University, summa cum laude, and received her Ph.D. from Harvard University’s Department of Government.
Trina Reynolds-Tyler, Data Director, Invisible Institute, is an abolitionist and a native of Chicago’s South Side. She leads Beneath the Surface, a project employing machine learning to identify gender-based violence at the hands of Chicago police. Trina works to document how communities unable to depend on the police are creating safety and accountability outside of the carceral state. As a data scientist, she centers the practice of narrative justice in her inquiries.
Trina organizes with Not Me We, and is serving on a University of Chicago council attempting to measure the institution’s impact on the south side population. She developed the skills to use data science for real world problems as a Pozen Center for Human Rights intern with the Human Rights Data Analysis Group (HRDAG), and was a Pearson Institute Fellow. Trina holds a masters degree in public policy from the University of Chicago.
Wendy Ku, Computer Vision Tech Lead, Senior Data Scientist, Getty Images presents the Technical Vision Talk “ML through a wide-angle lens: Real World Successes and Lessons Learned in Deploying ML Models”. Image search has been a well-established problem area across industries, with a wide range of applications including e-commerce, social media, and search engines. As we collectively create and consume more visual content, image search capabilities are becoming increasingly important. In recent years, multiple large-scale image-text models have been released, redefining performance on image-text understanding tasks. However, applying these generalized models out of the box often results in less than desired performance. In practice, deploying and maintaining an image search system presents a different set of challenges.
Wondering what else is involved in a machine learning solution besides training and deployment? Or how real-world model evaluations differ from Kaggle leaderboards? This talk will cover the less discussed journey of bringing language and image-text models to production.
Biography:
Wendy is a Senior Data Scientist at Getty Images, where she develops multilingual and visual-language representation models to improve users’ search experience. She leads Getty Images’ efforts on diagnosing bias and improving fairness in machine learning systems. Prior to joining Getty Images, Wendy was involved in product and operations optimization projects in cybersecurity, consumer finance and restaurant companies. When she’s not working, Wendy enjoys working on her art and running.
—
Panel: Preparing for a career in data science
Moderator:
Sanne Smith, Director of Master’s Program, Education Data Science, Stanford University, is the director of the master’s program in Education Data Science and a lecturer at the Stanford Graduate School of Education. She teaches courses that introduce students to coding, data wrangling and visualization, various statistical methods, and the interpretation of quantitative research. She studies social networks and thriving in diverse contexts.
Panelists:
Montse Cordero, Mathematics Designer, youcubed, is a mathematics designer for youcubed, a center at Stanford University that aims to inspire, educate and empower teachers of mathematics, transforming the latest research on maths learning into accessible and practical forms. Montse is a co-author and professional development provider for youcubed’s Explorations in Data Science high school curriculum and has participated in multiple national summits for the advancement of data science in K-12 education (Data Science 4 Everyone Coalition, National Academies of Sciences Engineering and Medicine). Montse is also a mathematician interested in work at the intersection of combinatorics, algebra, and geometry. In all facets of their work, Montse endeavors to change the ways our culture thinks and talks about mathematics.
Adriana Velez Thames, Geophysicist-Data Scientist, Springboard Alumni. Adriana recently completed a transition to Data Science after many years in the Oil and Gas industry as a Senior Geophysicist. Her primary focus was in seismic data processing for imaging the Earth’s subsurface to guide energy exploration projects. From 2012-2019, she worked at TGS where her responsibilities included QC of deliverables, testing of internal software updates, and conducting test projects and benchmarks. This involved extensive analysis and manipulation of terabyte-sized digital subsurface data using sophisticated algorithms. She believes that data-driven decisions are the best way to solve problems in any industry. Having been born in Colombia and attained post-graduate degrees in Russia, she is fluent in English, Spanish, and has working proficiency in Russian. Currently she continues educational studies in data science and spatial data science.
Elaine Yi Xu, Staff Business Data Analyst, Intuit, is a passionate data analytics and data science practitioner, putting her undergrad degree in Statistics and MS in Info Sys and DS into everyday business decision-making. She’s been working in-house in web analytics, product analytics, and marketing analytics for multiple industries, including retail (lululemon), automotive (Kelley Blue Book), and most recently at Intuit, the global technology platform. She specializes in the measurement of Go-To-Market marketing strategies, assessment of marketing campaign effectiveness, optimization of user experience, and A/B testing. She strives to be the connective tissue between business, analytics, engineering, and data science, combining all facets of science to help arrive at optimal business decisions.
Panel: Data democratization: a powerful means for creating sustainable and equitable communities
Moderator:
Michela Taufer is an ACM Distinguished Scientist and holds the Dongarra Professorship in High-Performance Computing in the Department of Electrical Engineering and Computer Science at the University of Tennessee Knoxville (UTK). She earned her undergraduate degree (Laurea) in Computer Engineering from the University of Padova (Italy) and her doctoral degree (Ph.D.) in Computer Science from the Swiss Federal Institute of Technology or ETH (Switzerland). From 2003 to 2004, she was a La Jolla Interfaces in Science Training Program (LJIS) Postdoctoral Fellow at the University of California San Diego (UCSD) and The Scripps Research Institute (TSRI), where she worked on interdisciplinary projects in computer systems and computational chemistry.
Michela is well-known for her work in establishing trustworthy scientific discoveries on heterogeneous cyberinfrastructures. Throughout her career, she has put the principle of trustworthiness into practice. She has promoted scientific computing for the general population through volunteer computing, defined accurate scientific applications on accelerators and GPUs, and developed in situ analysis methods for scientific workflows on converging HPC and Cloud platforms. She has been serving as the principal investigator of several NSF collaborative projects. She has significant experience in mentoring a diverse population of students on interdisciplinary research and establishing long-lasting workforce development.
Panelists:
Priya Donti, Co-Founder and Executive Director, Climate Change AI (CCAI). Climate Change AI is a global non-profit initiative to catalyze impactful work at the intersection of climate change and machine learning, which she currently runs through the Cornell Tech Runway Startup Postdoc Program. She will also join MIT EECS as an Assistant Professor in Fall 2023. Her research focuses on developing physics-informed machine learning methods for forecasting, optimization, and control in high-renewables power grids. Priya received her Ph.D. in Computer Science and Public Policy from Carnegie Mellon University, and is a recipient of the MIT Technology Review’s 2021 “35 Innovators Under 35” award, the ACM SIGEnergy Doctoral Dissertation Award, the Siebel Scholarship, the U.S. Department of Energy Computational Science Graduate Fellowship, and best paper awards at ICML (honorable mention), ACM e-Energy (runner-up), PECI, the Duke Energy Data Analytics Symposium, and the NeurIPS workshop on AI for Social Good.
Julia Stewart Lowndes, Director, Openscapes is a marine ecologist working at the intersection of actionable environmental science, data science, and open science. Julia’s main focus is mentoring teams to develop technical and leadership mindsets and skills for data-intensive research, grounded in climate solutions, inclusion, and kindness. She founded Openscapes in 2018 as a Mozilla Fellow and Senior Fellow at the National Center for Ecological Analysis and Synthesis (NCEAS) at the University of California Santa Barbara (UCSB), having earned her PhD from Stanford University in 2012 studying drivers and impacts of Humboldt squid in a changing climate.
Nikki Tulley, Doctoral Student, University of Arizona; Indigenous Researcher, NASA Ames Research Center. Nikki is from the Navajo Nation (NN), an Indigenous Nation located in the United States. The work and research Nikki does is influenced by her upbringing. Born and raised on the NN Reservation, she has seen firsthand the impacts of water access and water quality challenges rural communities face. The NN has wicked water problems related to anthropogenic activities and climate change. Now, as an Indigenous Scientist, she recognizes the opportunity to braid traditional ecological knowledge and western science together to address water challenges. Taking a step beyond braiding the two knowledge systems together, she has begun to use Earth Observation satellite imagery to tell a story of the changes being monitored from space and those observed from the landscapes. Nikki’s passion is empowering communities through data access and capacity building. She believes that community involvement in research can significantly aid in seeking solutions for resilient and sustainable communities.
Momona Yamagami, Incoming Assistant Professor, Electrical and Computer Engineering, Rice University presents the Technical Vision Talk on “Making Biosignal Interfaces Accessible”. Biosignal interfaces that use electromyography sensors, accelerometers, and other biosignals as inputs show promise for improving accessibility for people with disabilities. However, generalized models that are not personalized to the individual’s abilities, body sizes, and skin tones may not perform well. Individualized interfaces that are personalized to the individual and their abilities could significantly enhance accessibility.
In this talk, I discuss how continuous (i.e., 2-dimensional trajectory-tracking) and discrete (i.e., gesture) electromyography (EMG) interfaces can be personalized to the individual. For the continuous task, we used methods from game theory to iteratively optimize a linear model that mapped EMG input to cursor position. For the discrete task, we developed a dataset of participants with and without disabilities performing gestures that are accessible to them. As biosignal interfaces become more commonly available, it is important to ensure that such interfaces have high performance across a wide spectrum of users.
Biography:
Momona will be an Assistant Professor at Rice University Electrical & Computer Engineering starting summer 2023 as part of the Digital Health Initiative. Her research focuses on modeling and enhancing human-machine interaction (HMI) to support accessibility and health using biosignals and control theory applied to the field of HCI (human-computer interaction). She is currently a CREATE postdoctoral scholar at the University of Washington in Seattle, WA, advised by Prof. Jennifer Mankoff.
Momona’s dissertation research leveraged control theory methods to model and enhance continuous HMIs and explore biosignals like electromyography (EMG) as accessible machine inputs for people with and without disabilities. Her current research interests include how multi-input biosignals can improve HMI accessibility for new and emerging technology like virtual reality and support the health of people with disabilities.
—
Gayatree Ganu, Vice President, Data Science, Facebook presents Keynote Address “Put the horse before the cart: Why ‘users first’ is important for a good monetization strategy”.
Meta has over 3B users on our platform engaging with our different products and services. Meta also makes over $100B annually through advertising. There is a strong connection between user engagement on our platform and how we build a sustainable business. Our mission statement for ads at Meta is “Make meaningful connections between people and businesses”. Connecting users to monetization or ads is an important part of Meta’s long term success. In this talk I will describe the frameworks to connect user engagement and revenue potential, allowing us to focus our products and services. We will also discuss how high quality and relevant ads can actually bring more engagement to our platform, making it a win-win situation. We will cover a lot of fun and challenging data science topics from weighted metrics, producer-consumer experimental setups, counterfactuals, incrementality, all at an extraordinary scale of 3B users and $100B!
Biography:
Gayatree Ganu leads the Engagement Ecosystem and Monetization Data Science teams at Facebook. The Engagement Ecosystem team’s mission is to inform Facebook’s strategy through better understanding and forecasting the health of the app. The Monetization team’s mission is to give everyone a voice and to champion economic prosperity. Gayatree leads a Data Science team with a diverse portfolio spanning modeling and machine learning, product optimizations of user experience, and strategic innovations. Gayatree has a PhD in Computer Science in Search and Recommendations from Rutgers University. She joined Facebook (now Meta) in 2013 and has worked on several problems and product areas through the last 10 years.
Gayatree believes deeply in fairness and equality in opportunity and is passionate about bringing more representation and providing sustained support to women and under-represented minorities in Tech. She leads recruiting for all Data Science roles at Meta, and is helping build an organization that values diverse perspectives as well as strong technical and analytical skills.
—
Priya Donti, Co-Founder and Executive Director, Climate Change AI presents Technical Vision Talk “Optimization-in-the-loop machine learning for energy and climate”. Addressing climate change will require concerted action across society, including the development of innovative technologies. While machine learning (ML) methods have the potential to play an important role, these methods often struggle to contend with the physics, hard constraints, and complex decision-making processes that are inherent to many climate and energy problems. To address these limitations, I present the framework of “optimization-in-the-loop ML,” and show how it can enable the design of ML models that explicitly capture relevant constraints and decision-making processes. For instance, this framework can be used to design learning-based controllers that provably enforce the stability criteria or operational constraints associated with the systems in which they operate. It can also enable the design of task-based learning procedures that are cognizant of the downstream decision-making processes for which a model’s outputs will be used. By significantly improving performance and preventing critical failures, such techniques can unlock the potential of ML for operating low-carbon power grids, improving energy efficiency in buildings, and addressing other high-impact problems of relevance to climate action.
Biography:
Priya Donti is the Co-founder and Executive Director of Climate Change AI, a global non-profit initiative to catalyze impactful work at the intersection of climate change and machine learning, which she is currently running through the Cornell Tech Runway Startup Postdoc Program. She will also join MIT EECS as an Assistant Professor in Fall 2023. Her research focuses on developing physics-informed machine learning methods for forecasting, optimization, and control in high-renewables power grids. Priya received her Ph.D. in Computer Science and Public Policy from Carnegie Mellon University, and is a recipient of the MIT Technology Review’s 2021 “35 Innovators Under 35” award, the ACM SIGEnergy Doctoral Dissertation Award, the Siebel Scholarship, the U.S. Department of Energy Computational Science Graduate Fellowship, and best paper awards at ICML (honorable mention), ACM e-Energy (runner-up), PECI, the Duke Energy Data Analytics Symposium, and the NeurIPS workshop on AI for Social Good.
—
Lisa Martin, Tracy Zhang, & Hannah Freitag kick off WiDS 2023 at the Arrillaga Alumni Center at Stanford University.
Rhonda Crate, Principal Data Scientist, ATF, Boeing talks with Lisa Martin & Tracy Zhang at WiDS 2023 at Stanford University.
Gabriela de Queiroz, Principal Cloud Advocate Manager, Microsoft, talks with Lisa Martin & Tracy Zhang at WiDS 2023 at Stanford University.
Kelly Hoang, Data Scientist, Gilead talks with Lisa Martin & Tracy Zhang at WiDS 2023 at Stanford University.
What key principles of design and data viz do you need to know to create effective and clear graphs? This talk will cover preattentive attributes, Gestalt principles, and principles of color use. It will provide the key concepts from design and data viz research that you need to know to communicate data effectively. The talk will include examples to demonstrate applying the concepts and comparing data viz effectiveness.
This workshop was conducted by Jenn Schilling, Founder of Schilling Data Studio.
Linear regression is a fundamental tool in statistics and data science for modeling the relationship between variables. It can be used for prediction, forecasting, and error reduction by fitting a predictive model between a response variable and a collection of explanatory variables based on an observed data set. Through linear regression analysis, we can quantify the strength of the linear relationship between the response and different explanatory variables, and we can identify variables that may contain redundant information.
This workshop introduces the basics of simple and multiple linear regression. We will present both mathematical theory and applications in the context of real data sets, ranging from National Health and Nutrition Examination Survey (NHANES) results collected by the US National Center for Health Statistics to real estate listings in Sacramento, CA. After the talk, the R code used will be provided, so attendees can revisit examples of how to apply this foundational modeling method.
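For orientation before the workshop materials arrive, here is a minimal sketch of simple linear regression in Python (the workshop itself uses R); the data below are synthetic, not the NHANES or Sacramento sets.

```python
# A minimal simple linear regression sketch on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)            # explanatory variable
y = 2.5 + 1.3 * x + rng.normal(0, 2, 200)   # response with noise

X = sm.add_constant(x)                      # add an intercept column
model = sm.OLS(y, X).fit()                  # ordinary least squares fit

print(model.params)                         # estimated intercept and slope
print(model.rsquared)                       # strength of the linear relationship
```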
This workshop was conducted by Laura Lyman, Instructor of Mathematics, Statistics, and Computer Science (MSCS) at Macalester College
The 6th Annual Women in Data Science (WiDS) Datathon launches in January 2023, in the lead-up to the WiDS conferences in March 2023. In this year’s datathon challenges participants…
Precision medicine aims to learn from data how to match the right treatment to the right person at the right time. One common goal in precision medicine is the estimation of optimal dynamic treatment regimens (DTRs), sequences of decision rules that recommend treatments to patients in a way that, if followed, would optimize outcomes for each individual and overall, in the targeted population. In this presentation, we will describe how the precision medicine framework formalizes sequential clinical decision-making and briefly review a subset of the most popular strategies for learning optimal dynamic treatment regimes. We will then invite the workshop group to ideate and discuss the critical opportunities and challenges for the translation of DTRs to clinical and community care, the role of stakeholder engagement and cross-disciplinary collaboration, and considerations for evaluating DTRs in practice.
This workshop was conducted by Nikki Freeman and Anna Kahkoska from the University of North Carolina at Chapel Hill.
Slides and resources used in this workshop: https://bit.ly/precision_medicine_slides
In this workshop, I would like to share my journey transitioning from an electrical engineer focusing on ultra-low power integrated circuit design to an AI Solution Architect. Through specific examples of how the two fields connect, I will discuss the fundamentals of deep learning and data-driven hardware design. I will start with my experience in the semiconductor industry designing application-specific and data-dependent hardware for IoT systems and then discuss how this experience led to my career in AI specializing in areas including high-performance computing, edge computing, and more recently, federated learning.
I hope the attendees will not only find the technical content informative but also see how a growth mindset truly helped me find my career passion. Having a broad knowledge of the ecosystem that supports AI applications, such as the hardware stack, hardware-level optimization, and application-specific hardware design, can be very helpful for understanding and choosing the right platform for operational AI. I also hope to use this opportunity to connect with fellow AI/hardware enthusiasts in WiDS.
This workshop was conducted by Chu Lahlou, AI Specialized Cloud Solution Architect at Microsoft.
The least squares method is one of the most widely used techniques in data science and is used to fit a linear model to data. In this workshop, we will study least squares problems from a linear algebraic perspective and discuss the techniques to solve them.
This workshop assumes that you have a basic understanding of linear algebra including concepts such as matrices, rank, range space, orthogonality, and matrix decompositions (Cholesky, QR, SVD).
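As a small illustration of the linear algebraic viewpoint (not part of the workshop materials), here is a sketch of solving a least squares problem via the economy QR decomposition and checking it against numpy's built-in solver.

```python
# Solve min ||Ax - b|| via QR: A = QR, then back-substitute R x = Q^T b.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))               # tall, full-rank design matrix
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.normal(size=100)

Q, R = np.linalg.qr(A)                      # Q has orthonormal columns, R is upper triangular
x_qr = np.linalg.solve(R, Q.T @ b)          # triangular solve of R x = Q^T b

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_qr, x_ref))             # both recover (approximately) x_true
```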
This workshop was conducted by Abeynaya Gnanasekaran, a Senior Research Engineer at Raytheon Technologies Research Center.
While there have been amazing achievements with machine learning in recent years, reproducing results for state-of-the-art deep learning methods is seldom straightforward. Three leading data scientists share their views from recent WiDS conferences on the importance of establishing structures, standards, and best practices to guide us toward consistently producing high-quality science and reliable findings.
This workshop aims to enable young data scientists to start their first ML project. It will help them understand the process, from gathering data to building an ML model. Building an ML model is easy, but building it the correct way is a lot harder than it seems.
This workshop was conducted by Manogna Mantripragada, Data Scientist at Greenlink Analytics.
Access resources for this workshop: https://bit.ly/energy_burden_analysis…
During the workshop, we show a simple exploratory data analysis using Deepnote. We focus on personal data from the Camino de Santiago pilgrimage, which we retrieved via the Strava API, and show you how to get the same data from your own device. Using this data, we explain the theory behind exploratory data analysis and walk through some use cases.
This workshop was conducted by Tereza Vaňková and Alleanna Clark of Deepnote.
Resources used in this workshop:
– https://bit.ly/deepnote_notebook
– https://bit.ly/deepnote_slides
Make answering ‘what if’ analysis questions a whole lot easier by learning about state-of-the-art, end-to-end applied frameworks for causal inference.
We will cover:
Microsoft’s DoWhy package (Causal Impact in Python): DoWhy, an end-to-end library for causal inference (microsoft.github.io)
Bayesian Causal Impact in R
MLE Causal Impact in Python
Bonus: A/A testing, when to use it and why it matters
We will apply these models in the context of understanding the impact of a marketing rewards campaign, as well as understanding the impact of a product/feature upgrade.
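As a rough illustration of the DoWhy workflow mentioned above (a sketch on synthetic data, not the workshop's notebooks), the standard quickstart pattern looks roughly like this; the column names below are invented for the example.

```python
# A minimal DoWhy sketch: estimate the effect of a rewards campaign on spend,
# adjusting for an observed confounder. All data and column names are synthetic.
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(2)
n = 5000
engagement = rng.normal(size=n)                                # confounder
rewarded = (engagement + rng.normal(size=n) > 0).astype(int)   # treatment assignment
spend = 10 + 2.0 * rewarded + 3.0 * engagement + rng.normal(size=n)
df = pd.DataFrame({"rewarded": rewarded, "spend": spend, "engagement": engagement})

model = CausalModel(data=df, treatment="rewarded", outcome="spend",
                    common_causes=["engagement"])
estimand = model.identify_effect()
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)   # should land near the true effect of 2.0
```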
This workshop was conducted by Jennifer Vlasiu, Data Science & Big Data Instructor at York University
Useful resources for this workshop:
– https://bit.ly/github_casual_impact
Image classification is a task in the Computer Vision domain that takes an image as input and outputs a label for that image. Deep learning is the most effective modern method for modeling this task. In this interactive workshop, we will walk through a Jupyter Notebook that shows how to perform multi-class image classification in Python using the PyTorch library. The intention is to give the audience a broad overview of the classification task and inspire participants to explore the vast fields of visual recognition and computer vision at large.
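To give a flavor of what the notebook covers, here is a stripped-down PyTorch sketch (not the workshop's actual notebook): a tiny convolutional classifier run on random tensors standing in for a real image dataset.

```python
# A tiny CNN for multi-class image classification, exercised on fake data.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)                  # (N, 32, 8, 8) for 32x32 inputs
        return self.classifier(x.flatten(1))  # per-class logits

model = TinyCNN()
images = torch.randn(4, 3, 32, 32)            # a fake batch of RGB images
labels = torch.randint(0, 10, (4,))
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()                               # gradients for one training step
print(loss.item())
```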
This workshop was conducted by Cindy Gonzales, Data Science Team Lead for the Biosecurity and Data Science Applications Group at Lawrence Livermore National Laboratory
Useful resources for this workshop:
– https://bit.ly/deep_learning_files
– https://bit.ly/deep_learning_notebook
As data scientists, the ability to understand our models’ decisions is important, especially for models that could have a high impact on people’s lives. This may pose several challenges, as most models used in the industry are not inherently explainable. Today, the most popular explainability methods are SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-Agnostic Explanation). Each method offers convenient APIs, backed by solid mathematical foundations, but falls short in intuitiveness and actionability.
In this workshop/article, I will introduce a relatively new model explanation method: Counterfactual Explanations (CFs). CFs are explanations based on minimal changes to a model’s input features that lead the model to output a different (mostly opposite) predicted class. CFs have been shown to be more intuitive for humans to comprehend and to provide more actionable feedback, compared to the traditional SHAP and LIME methods. I will review the challenges in this novel field (such as how to ensure that a CF proposes changes that are feasible), provide a bird's-eye view of the latest research, and give my perspective, based on my research in collaboration with Tel Aviv University, on the various aspects in which CFs can transform the way data science practitioners understand their ML models.
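To make the idea concrete, here is a deliberately naive counterfactual search (not one of the dedicated CF libraries): given an instance a simple classifier rejects, find the smallest single-feature change that flips its prediction. The data and features are invented for the sketch.

```python
# Toy counterfactual explanation: smallest one-feature change that flips the prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))                        # features, e.g. income and debt
y = (X[:, 0] - X[:, 1] > 0).astype(int)              # "approve" when income exceeds debt
clf = LogisticRegression().fit(X, y)

x0 = np.array([-0.5, 0.4])                           # an instance predicted "reject"
deltas = np.linspace(-3, 3, 601)
deltas = deltas[np.argsort(np.abs(deltas))]          # try the smallest changes first

best = None
for feature in range(2):
    for delta in deltas:
        x_cf = x0.copy()
        x_cf[feature] += delta
        if clf.predict([x_cf])[0] == 1:              # prediction flips: a counterfactual
            if best is None or abs(delta) < best[2]:
                best = (feature, x_cf, abs(delta))
            break                                    # smallest flip for this feature found

feature, x_cf, size = best
print(f"change feature {feature} by {x_cf[feature] - x0[feature]:+.2f} to get approved")
```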
This workshop was conducted by Aviv Ben Arie, Data Science Manager at Intuit
Markov chains are a special type of random process which can be used to model many natural processes. This workshop will be a gentle introduction to Markov chains, giving basic properties and many examples. The second part of the workshop will focus on one specific application of Markov chains to data science: Sampling from posterior distributions in Bayesian inference. Introductory background in probability, statistics, and linear algebra is assumed.
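As a warm-up in the same spirit (a toy example, not taken from the workshop), the sketch below simulates a three-state Markov chain and checks that the long-run visit frequencies match the stationary distribution computed from the transition matrix.

```python
# Simulate a 3-state Markov chain and recover its stationary distribution.
import numpy as np

P = np.array([[0.7, 0.2, 0.1],    # rows: current state, columns: next state
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

rng = np.random.default_rng(4)
state, visits = 0, np.zeros(3)
for _ in range(100_000):                         # a long simulated trajectory
    visits[state] += 1
    state = rng.choice(3, p=P[state])

eigvals, eigvecs = np.linalg.eig(P.T)            # stationary pi satisfies pi P = pi
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()

print(visits / visits.sum())                     # empirical occupation frequencies
print(pi)                                        # agree with the stationary distribution
```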
This workshop was conducted by Mackenzie Simper, PhD Student at Stanford University.
Slides for this workshop: https://bit.ly/markov_chains_ppt
A propensity model attempts to estimate the propensity (probability) of a behavior (e.g., conversion, churn, purchase, etc.) happening during a well-defined time period into the future based on historical data. It is a widely used technique by organizations or marketing teams for providing targeted messages, products or services to customers. This workshop shares an open-sourced package developed by Google, for building an end-to-end Propensity Modeling solution using datasets like GA360, Firebase or CRM and using the propensity predictions to design, activate and measure the impact of a media campaign. The package has enabled companies from e-commerce, retail, gaming, CPG and other industries to make accelerated data-driven marketing decisions.
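For readers new to the idea, here is a generic propensity-model sketch (this is not Google's open-sourced package; the features and data are synthetic): predict the probability that a customer converts in the next period from simple historical features.

```python
# A generic propensity model: probability of conversion from historical behavior.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
n = 10_000
df = pd.DataFrame({
    "visits_last_30d": rng.poisson(3, n),
    "days_since_last_purchase": rng.exponential(30, n),
    "emails_opened": rng.poisson(2, n),
})
logit = (0.4 * df.visits_last_30d - 0.03 * df.days_since_last_purchase
         + 0.3 * df.emails_opened - 1.5)
df["converted"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="converted"), df["converted"], test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

propensity = model.predict_proba(X_test)[:, 1]        # predicted conversion probability
print("AUC:", roc_auc_score(y_test, propensity))      # ranking quality of the scores
```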
This workshop was conducted by Lingling Xu, Bingjie Xu, Shalini Pochineni and Xi Li, data scientists on the Google APAC team.
Useful resources for this workshop:
– Workshop #1: https://youtu.be/rQhQca8RCuM
– https://bit.ly/propensity_modeling_pa…
– https://bit.ly/bigquery_export_schema
– https://bit.ly/ga_sample_dataset
– https://bit.ly/ml_windowing_pipeline
Neural networks have been widely celebrated for their power to solve difficult problems across a number of domains. We explore an approach for leveraging this technology within a statistical model of customer choice. Conjoint-based choice models are used to support many high-value decisions at GM. In particular, we test whether using a neural network to model customer utility enables us to better capture non-compensatory behavior (i.e., decision rules where customers only consider products that meet acceptable criteria) in the context of conjoint tasks. We find the neural network can improve hold-out conjoint prediction accuracy for synthetic respondents exhibiting non-compensatory behavior only when trained on very large conjoint data sets. Given the limited amount of training data (conjoint responses) available in practice, a mixed logit choice model with a traditional linear utility function outperforms the choice model with the embedded neural network.
This workshop was conducted by Kathryn Schumacher, Staff Researcher in the Advanced Analytics Center of Expertise within General Motors’ Chief Data and Analytics Office.
The workshop focuses on basic to intermediate SQL. We will start with querying a database and using filters to clean the data, then cover joining different tables; aggregate functions and the use of ‘CASE WHEN’ for better query performance; subqueries and common table expressions (CTEs), and a comparison between them; window functions, including LEAD and LAG and the scenarios in which they can be used; and pivot tables, and when not to use them!
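As a small self-contained taste of two of these topics, a CTE combined with the LAG window function and a CASE expression can be run through Python's sqlite3 module; the table and column names below are invented, and a SQLite build with window-function support (3.25 or newer) is assumed.

```python
# A CTE + LAG window function + CASE WHEN, run against an in-memory SQLite database.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('ana',   '2023-01-05', 40.0),
        ('ana',   '2023-02-11', 55.0),
        ('bruno', '2023-01-20', 10.0),
        ('bruno', '2023-03-02', 80.0);
""")

query = """
WITH ranked AS (                                      -- CTE: add per-customer ordering
    SELECT customer, order_date, amount,
           LAG(amount) OVER (PARTITION BY customer ORDER BY order_date) AS prev_amount
    FROM orders
)
SELECT customer, order_date, amount,
       CASE WHEN prev_amount IS NULL THEN 'first order'
            WHEN amount > prev_amount THEN 'increased'
            ELSE 'decreased' END AS vs_previous
FROM ranked;
"""
for row in con.execute(query):
    print(row)
```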
This workshop was conducted by Sreelaxmi Chakkadath, Data Science Master’s student at Indiana University Bloomington.
Useful resources for this workshop:
– PostgreSQL install link: https://www.postgresql.org/
– https://bit.ly/sql_workshop_script
– https://bit.ly/sql_workshop_codes
– https://bit.ly/sql_ppt_slides
Maria Gargiulo, Statistician, Human Rights Data Analysis Group, talks with theCUBE’s Stephanie Chan for WiDS 2022
Cecilia Aragon, Professor, Human Centered Design & Engineering, University of Washington, presents a Keynote at the WiDS Worldwide conference.
Very often, the words ‘rigorous’ and ‘human-centered’ have been used as opposites in technical fields, with the implication that a focus on human aspects makes science ‘soft’ or ‘insufficiently technical’. In this talk, Cecilia will argue that this is a false dichotomy.
While extraordinary advances in our ability to collect, analyze, and interpret vast amounts of data have been transforming the fundamental nature of data science, the human aspects of data science, including how to support scientific creativity and human insight, how to address ethical concerns, and the consideration of societal impacts, have been less studied. Yet these human issues are becoming increasingly vital to the future of data science. Cecilia will reflect on a 30-year career in data science in industry, government, and academia, discuss what it means for data science to be both rigorous and human-centered, and speculate upon future directions for data science.
Maria Gargiulo, Statistician, Human Rights Data Analysis Group, presents a Technical Vision Talk at the WiDS Worldwide conference.
Collecting data on human rights violations in conflict settings is difficult and dangerous, and the data that results is often incomplete on multiple levels. Some victims’ stories are never recorded, and those whose stories are documented may still be missing critical information about the victim, the perpetrator, or other contextual details about the violation. Furthermore, the data that is documented may not be statistically representative of the victim population as a whole. Drawing population-level inferences from this data without correcting for the missingness risks incorrectly answering questions about patterns of violence.
This talk will demonstrate how multiple systems estimation and multiple imputation can be used together to address both levels of missingness in order to draw population level inferences that are statistically valid and include a measure of uncertainty.
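For intuition only: the simplest ancestor of multiple systems estimation is two-list capture-recapture, where the estimated total population is N ≈ n1 × n2 / m (n1 and n2 are the list sizes and m is the overlap). Real multiple systems estimation uses three or more lists and models dependence between them; the counts below are invented.

```python
# Two-list (Lincoln-Petersen) capture-recapture: a simplified stand-in for the idea
# behind multiple systems estimation. All numbers here are made up for illustration.
list_a = 420          # victims documented by source A
list_b = 310          # victims documented by source B
overlap = 95          # victims appearing on both lists

n_hat = list_a * list_b / overlap                    # estimated total, documented or not
undocumented = n_hat - (list_a + list_b - overlap)   # estimated victims missed by both lists
print(round(n_hat), round(undocumented))
```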
WiDS Worldwide panel: Data Science in Healthcare: Opportunities & Challenges
Moderated by Tina Hernandez Boussard, Associate Professor, Stanford University
Panelists:
– Sylvia K. Plevritis, Chair of Biomedical Data Science, Stanford University
– Tanveer Syeda-Mahmood, IBM Fellow, IBM Research Center
– Jinoos Yazdany, Chief of Rheumatology, Zuckerberg San Francisco General Hospital
WiDS Worldwide panel: Algorithms and Data for Equity
Moderated by Jenny Suckale, Associate Professor, Stanford University
Panelists:
– Tierra Bills, Assistant Professor of Civil and Environmental Engineering and Public Policy, UCLA
– Jessica Granderson, Director for Building Technology, White House Council on Environmental Quality
– Ling Jin, Research Scientist, Lawrence Berkeley National Laboratory
Nadia Fawaz, Senior Staff Applied Research Scientist – Tech Lead Inclusive AI at Pinterest, presents a Technical Vision Talk at the WiDS Worldwide conference.
In this tech talk, Nadia shows how machine learning technologies are paving the way for more inclusive inspirations in Search and in Pinterest’s augmented reality Try-On technology, and are driving advances toward more diverse recommendations across the platform. Developing inclusive AI in production requires an end-to-end, iterative, and collaborative approach.
Tammy Kolda, Mathematical Consultant at MathSci.ai, presents a Technical Vision Talk at the WiDS Worldwide conference.
A (trained) machine learning model, such as a deep neural network, operates loosely as follows: it takes features as an input and produces a classification as an output. Watch Tammy argue that “more data” and “bigger models” are not a panacea, and instead develop mathematical methodology for understanding how to move beyond the current limits of machine learning.
WiDS 2022 Career Panel
Moderated by Suzanne Weekes, Executive Director, SIAM
Panelists:
– Cecilia Aragon, Professor, Human Centered Design & Engineering, University of Washington
– Sharon Hutchins, VP & Chief of Operations, Intuit AI+Data
– Tamara Kolda, Mathematical Consultant, MathSci.ai
– Maggie Wang, Robotics Software Engineer, Skydio
Join us online on March 7, 2022, for the Women in Data Science (WiDS) Worldwide conference, a technical conference featuring outstanding women doing exceptional work in data science and related fields, in a wide variety of domains. Everyone is welcome and encouraged to attend. Broadcast live from Stanford University, 8am – 5pm PST.
Forecasting using time series data is a hot topic of research and is applied to a variety of use-cases to make important decisions – wherever there are changes with time (seasonal or trend) such as e-commerce orders, stock market prices, weather prediction, demand and usage of products, etc. This workshop will cover time series analysis that attempts to understand the nature of the series and is useful for future forecasting along with the overview of popular forecasting models such as ARIMA, SMA, SES, Prophet followed by a case-study walk-through.
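As a quick sense of the fit-then-forecast pattern the workshop walks through (a sketch on a synthetic monthly series, not the workshop's case study), an ARIMA model in statsmodels looks roughly like this; SES and Prophet follow the same fit/forecast shape.

```python
# A minimal ARIMA forecasting sketch on a synthetic monthly series with trend + seasonality.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(6)
t = np.arange(120)
series = pd.Series(
    50 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 2, 120),
    index=pd.date_range("2013-01-01", periods=120, freq="MS"))

model = ARIMA(series, order=(1, 1, 1),               # (p, d, q)
              seasonal_order=(1, 0, 0, 12))          # simple yearly seasonality
fit = model.fit()
print(fit.forecast(steps=6))                         # forecast the next six months
```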
This workshop was conducted by Apurva Sinha & Sinduja Subramaniam at Walmart Global Tech.
Cognitive neuroscientists are often interested in broad research questions, yet use overly narrow experimental designs by considering only a small subset of possible experimental conditions. This limits the generalizability and reproducibility of many research findings. In this workshop, I present an alternative approach, “The AI Neuroscientist”, that resolves these problems by combining real-time brain imaging with a branch of machine learning, Bayesian optimization. Neuroadaptive Bayesian optimization is an active sampling approach that intelligently searches large experiment spaces with the aim of optimizing an unknown objective function. It thus provides a powerful strategy to efficiently explore many more experimental conditions than is currently possible with standard brain imaging methodology. Alongside methodological details on non-parametric Bayesian optimization using Gaussian process regression, I will present results from a clinical study where we applied the method to map cognitive dysfunction in stroke patients. Our results demonstrate that this technique is both feasible and robust for clinical cohorts. Moreover, our study highlights the importance of moving beyond traditional ‘one-size-fits-all’ approaches where patients are treated as one group. Our approach can be combined with brain stimulation or other therapeutics, thereby opening new avenues for precision medicine targeting a diverse range of neurological and psychiatric conditions.
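For readers unfamiliar with the core loop, here is a bare-bones, offline Bayesian optimization sketch with a Gaussian process surrogate; the toy 1-D objective and the upper-confidence-bound acquisition are assumptions for illustration, standing in for the real-time brain-imaging objective used in the workshop.

```python
# Sequential Bayesian optimization: fit a GP to observed points, pick the next
# "experimental condition" where the acquisition function is largest, repeat.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):                                    # unknown function to maximize
    return np.exp(-(x - 2.0) ** 2) + 0.1 * np.sin(5 * x)

rng = np.random.default_rng(7)
candidates = np.linspace(0, 5, 500).reshape(-1, 1)
X = rng.uniform(0, 5, size=(3, 1))                   # a few initial random conditions
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(15):                                  # actively chosen samples
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 2.0 * sigma                           # explore where uncertain, exploit where good
    x_next = candidates[np.argmax(ucb)].reshape(1, 1)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("best condition found:", X[np.argmax(y)][0], "value:", y.max())
```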
This workshop was conducted by Romy Lorenz, Postdoctoral Fellow at Stanford University and the University of Cambridge.
In this workshop, we focus on the temporal domain from the perspective of both traditional recommender systems and deep neural networks. We first start with the classic latent factor model, introduce temporal dynamics into it, and show how this improves performance. We then move into sequential modeling using deep neural networks, presenting the state of the art in the field and discussing the advantages and disadvantages.
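To ground the latent factor model the workshop starts from, here is a minimal matrix factorization sketch trained with stochastic gradient descent on a toy rating matrix; the temporal extension discussed in the workshop would add time-dependent bias terms, which are omitted here.

```python
# A minimal latent factor (matrix factorization) model fit by SGD on toy ratings.
import numpy as np

rng = np.random.default_rng(8)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (1, 2, 1.0), (2, 0, 4.0), (2, 2, 2.0)]
n_users, n_items, k = 3, 3, 2

P = 0.1 * rng.normal(size=(n_users, k))              # user latent factors
Q = 0.1 * rng.normal(size=(n_items, k))              # item latent factors
lr, reg = 0.05, 0.02

for epoch in range(200):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                        # prediction error for this rating
        P[u] += lr * (err * Q[i] - reg * P[u])       # gradient step on the user factors
        Q[i] += lr * (err * P[u] - reg * Q[i])       # gradient step on the item factors

print(np.round(P @ Q.T, 2))                          # reconstructed user-item rating matrix
```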
In this third workshop in linear algebra, we will investigate the link between Principal Component Analysis and the Singular Value Decomposition. Along the way, we will be introduced to several linear algebra concepts including linear regression, eigenvalues and eigenvectors, and the conditioning of a system. We will use shared Python scripts and several examples to demonstrate the ideas discussed.
This workshop builds on the previous two workshops in linear algebra (Part I and Part II), and we will assume that the linear algebra concepts introduced in those workshops are familiar to the audience. They include: vector algebra (including inner products and the angle between vectors), matrix-vector multiplications, matrix-matrix multiplications, matrix-vector solves, singularity, and singular values.
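The PCA-SVD link can be stated in a few lines of numpy (a sketch independent of the workshop's shared scripts): the principal directions of centered data are the right singular vectors, and the squared singular values divided by n - 1 are the variances explained.

```python
# PCA via SVD: right singular vectors of centered X are the principal directions,
# and S^2 / (n - 1) matches the eigenvalues of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 3)) @ np.array([[3, 0, 0], [0, 1, 0], [0, 0, 0.2]])
Xc = X - X.mean(axis=0)                              # center the data

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components_svd = Vt                                  # principal directions (rows)
explained_var = S ** 2 / (len(X) - 1)                # variance along each direction

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
print(np.allclose(np.sort(eigvals)[::-1], explained_var))   # the two routes agree
```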
Links:
1. Code is available for viewers to follow along: https://github.com/lalyman/lin-alg-wo…
2. The covariance matrix is defined for centered X, and the inequality n 1 given is strict.
This workshop was conducted by Laura Lyman, PhD student at Stanford University, ICME.
In this workshop, you will learn about the core concepts of Bayesian machine learning (BML): how it differs from frequentist approaches, the building blocks of Bayesian inference, and what well-known ML techniques look like in a Bayesian setup. You will also learn how to use various sampling techniques for Bayesian inference and why we need such techniques in the first place. The workshop will also provide links and materials to continue your Bayesian journey afterwards.
This workshop is meant as an introduction to select BML topics; we strongly recommend that you continue exploring the Bayesian world once you have taken this first step.
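As one tiny example of the sampling techniques the workshop motivates (a sketch, not the workshop's materials), a random-walk Metropolis-Hastings sampler can approximate the posterior of a coin's bias after observing 7 heads in 10 flips under a uniform prior.

```python
# Random-walk Metropolis-Hastings for the posterior of a coin's bias theta.
import numpy as np

heads, flips = 7, 10

def log_posterior(theta):
    if not 0 < theta < 1:
        return -np.inf                               # outside the prior's support
    # Binomial likelihood times a flat prior (up to a constant)
    return heads * np.log(theta) + (flips - heads) * np.log(1 - theta)

rng = np.random.default_rng(10)
theta, samples = 0.5, []
for _ in range(20_000):
    proposal = theta + rng.normal(0, 0.1)            # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal                             # accept the move
    samples.append(theta)

print(np.mean(samples[2000:]))                       # near the exact posterior mean 8/12
```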
This workshop was conducted by Ashwini Chandrashekharaiah & Debanjana Banerjee at Walmart Global Tech.
We live in an era of big data, with data sets that require computational analysis to gain insights and knowledge. The volume of big data has been increasing steadily, and will only continue to climb. Since we started the WiDS initiative in 2015, Statista estimates that the volume of data has increased from 15.5 to 74 zettabytes, and they forecast that data volume will double again by 2024.
Yet with all of this data, one of the biggest challenges that data scientists and researchers face is dealing with missing data. In some cases, the missing data is due to not readily having access to the data sets that are required to perform the analysis, while other cases involve data sets that are incomplete and not uniformly populated.
Welcome to the world of artificial intelligence (AI) and augmented reality (AR)! This workshop explains AI and AR via hands on exercises where you will interact with your augmented world. You will learn about applications where the technologies of AI+AR are combined, their limitations, and their impacts in society. You’ll leave armed with code, inspiration, and an ethical framework for your own projects!
Artificial intelligence (AI) is used in a variety of industries for many applications, and it can be combined with other technologies to better understand the implications of how it is applied. In this workshop, you will explore how pose estimation results produced with deep learning change depending on the location supplied through augmented reality. Combining these technologies provides insight into how poses could be interpreted differently based on the scene. The workshop also raises awareness of the consequences of using AI for applications different from its originally intended use, which can lead to both technical and ethical challenges.
Specific topics that will be covered in this workshop are listed below:
• understand how AI and AR can be used for applications
• explore how to implement AI and AR
• discover what tools can be used to implement AI and AR
• review code that implements pose estimation using AI and changing background scenes using AR
• gain guidance regarding challenges to address societal impacts of the results from applications that use AI and AR
In addition to receiving an overview of terminology and an understanding of the workflows for each topic, code will be provided to demonstrate how to implement these workflows with tools from MathWorks.
This workshop was conducted by Louvere Walker-Hannon, Shruti Karulkar, & Sarah Mohamed from MathWorks.
Graph theory provides an effective way to study relationships between data points, and is applied to everything from deep learning models to social networks. This workshop is part I in a series of three workshops. Throughout the series we will progress from introductory explanations of what a graph is, through the most common algorithms performed on graphs, and end with an investigation of the attributes of large-scale graphs using real data.
And in particular for Part III:
Many of the systems we study today can be represented as graphs, from social media networks to phylogenetic trees to airplane flight paths. In this workshop we will explore real-world examples of graphs, discussing how to extract graphs from real data, data structures for storing graphs, and measures to characterize graphs. We will work with real examples of graph data to create a table of values that summarize different example graphs, exploring values such as the centrality, assortativity, and diameter of each graph. Python code will be provided so that attendees can get hands-on experience analyzing graph data.
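As a preview of the kind of summary table the workshop builds (a sketch using networkx and its built-in Zachary karate-club graph rather than the workshop's datasets), a handful of library calls already give diameter, assortativity, and centrality.

```python
# Summarize a small real-world graph with a few standard measures.
import networkx as nx

G = nx.karate_club_graph()

summary = {
    "nodes": G.number_of_nodes(),
    "edges": G.number_of_edges(),
    "diameter": nx.diameter(G),                                   # longest shortest path
    "assortativity": nx.degree_assortativity_coefficient(G),      # do hubs link to hubs?
    "max_degree_centrality": max(nx.degree_centrality(G).values()),
}
for name, value in summary.items():
    if isinstance(value, float):
        print(f"{name:>22}: {value:.3f}")
    else:
        print(f"{name:>22}: {value}")
```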
This workshop was conducted by Stanford ICME PhD student, Julia Olivieri.
Natural language processing has direct real-world applications, from speech recognition to automatic text generation, from lexical semantics understanding to question answering. In just a decade, neural machine learning models have become widespread, largely displacing statistical methods that required elaborate feature engineering. Popular techniques include the use of word embeddings to capture the semantic properties of words. In this workshop, we take you through the ever-changing journey of neural models while addressing their boons and banes.
The workshop will address concepts of word-embedding, frequency-based and prediction-based embedding, positional embedding, multi-headed attention and application of the same in unsupervised context.
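To make the frequency-based flavor of embedding concrete (a toy sketch, not the workshop's materials), one can build a word co-occurrence matrix from a tiny corpus, factor it with SVD, and compare words by cosine similarity.

```python
# Frequency-based word embeddings: co-occurrence counts + SVD + cosine similarity.
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug",
          "the cat chased the dog", "a dog and a cat played"]
window = 2

vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                C[index[w], index[words[j]]] += 1     # count co-occurrences in the window

U, S, Vt = np.linalg.svd(C)
emb = U[:, :3] * S[:3]                               # 3-dimensional embeddings

def cosine(a, b):
    va, vb = emb[index[a]], emb[index[b]]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

# Words appearing in similar contexts ("cat", "dog") get similar vectors.
print(cosine("cat", "dog"), cosine("cat", "rug"))
```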
This workshop was conducted by Riyanka Bhowal, Senior Data Scientist at Walmart Global Tech.
In this workshop, Dora Demszky, a Stanford PhD student, illustrates how natural language processing (NLP) can be used to answer social science questions. The workshop will focus on applying NLP to analyze the content of 15 US history textbooks used in Texas, to analyze the representation of historically marginalized people and groups.
The workshop is based on a paper (https://journals.sagepub.com/doi/pdf/…) that also has an associated toolkit, and it will provide examples of how this toolkit can be used using a Jupyter notebook that will be made available.
Want to learn more about trends like AI, IoT and wearable tech? In less than one hour, we will cut through the hype by building a “smart” fitness tracker using your own mobile device.
We’ll do hands-on exercises: you’ll acquire data from sensors, design a step counter and train a human activity classifier. You will leave motivated and ready to use machine learning and sensors in your own projects!
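For a rough idea of the step-counter exercise (a sketch on a synthetic signal, not the workshop's MATLAB materials), peak detection on the magnitude of an accelerometer trace is the core trick.

```python
# Count steps by detecting peaks in (synthetic) accelerometer magnitude data.
import numpy as np
from scipy.signal import find_peaks

fs = 50                                              # samples per second
t = np.arange(0, 10, 1 / fs)                         # 10 seconds of "walking"
rng = np.random.default_rng(11)
accel_mag = 9.8 + 2.0 * np.sin(2 * np.pi * 2.0 * t) + 0.3 * rng.normal(size=t.size)  # ~2 steps/s

peaks, _ = find_peaks(accel_mag,
                      height=10.5,                   # ignore small wobbles
                      distance=int(0.3 * fs))        # at most one step every 0.3 s
print("steps counted:", len(peaks))                  # roughly 20 for this 10-second signal
```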
This workshop was conducted by Louvere Walker-Hannon, Shruti Karulkar, & Sarah Mohamed from MathWorks.
Emily Miller, Senior Data Scientist at Drivendata.org hosts a workshop on ‘Actionable Ethics for Data Scientists’ in which she illustrates the different types of ethical concerns that arise in the course of data science work, grounding these in concrete examples of times where things have gone wrong.
Julia Ling, CTO at Citrine Informatics hosts a workshop on ‘Machine Learning for Scientific R&D: Why it’s Hard and Why it’s Fun’ in which she covers some of the key challenges in machine learning for R&D applications: the small, often-messy, sample-biased datasets; the exploratory nature of scientific discovery; and the curious, hands-on approach of scientific users. Julia discusses potential solutions to these challenges, including transfer learning, integration of scientific domain knowledge, uncertainty quantification, and machine learning model interpretability.
Madeleine Udell, Assistant Professor at Cornell hosts a workshop on ‘Automating Machine Learning’ in which she surveys interesting strategies for automated machine learning.
Sita Syal, Ph.D. Candidate of Mechanical Engineering at Stanford University hosts a workshop on ‘Design Thinking for Data Science Problems’.
Margot Gerritsen, Professor at Stanford university hosts a workshop on ‘Linear Algebra: What Would We be Without It’ in which she explores the beauty and power of linear algebra, and discuss the most critical linear algebra concepts and algorithms used in data science.
Debanjana Banerjee, Data Scientist, and Sinduja Subramaniam, Staff Data Scientist, with Walmart host a workshop, ‘Evolution of Applied Recommender Systems’, where they take you through the whirlwind journey of the recommender system from GroupLens in the 1990s, content-based filtering, matrix factorization and hybrid recommender systems in the late 2000s, all the way to the deep learning-based recommenders of today. The workshop addresses foundational concepts such as the user-item interaction matrix, user/item profiles, the cold-start problem, sparsity, and scalability, along with the mathematical formulation of different types of recommender systems, using applications in retail.
Zhamak Dehghani, Director, Emerging Technologies, North America at Thoughtworks hosts a workshop on ‘An introduction to Data Mesh: a paradigm shift in analytical data management’, where Zhamak shares her observations on the failure modes of the centralized paradigm of a data lake and its predecessor, the data warehouse. She introduces Data Mesh, a paradigm shift in big data management that draws from modern distributed architecture: considering domains as the first-class concern, applying self-sovereignty to distribute the ownership of data, applying platform thinking to create self-serve data infrastructure, and treating data as a product.
Have an opportunity to Meet-the-Speakers from WiDS Worldwide! Speaker Geetha Manjunath, Founder and CEO of Niramai is interviewed by Radhika Kannan, Staff Technical Program Manager, Intuit
Have an opportunity to Meet-the-Speakers from WiDS Worldwide! Speaker Kalinda Griffiths, Scientia Lecturer at the Centre for Big Data Research in Health, UNSW is interviewed by Margot Gerritsen, Professor at Stanford University.
Have an opportunity to Meet-the-Speakers from WiDS Worldwide! Speaker Maria Schuld, Senior Researcher at University of KwaZulu-Natal is interviewed by Margot Gerritsen, Professor at Stanford University.
Best of WiDS features Ema Rie on her talk ‘The Fusion of Science and Fashion’ from WiDS Tokyo @ Yokohama City University, 2020!
Best of WiDS features Sanghamitra Bandyopadhyay in her ‘Fireside Chat’ from WiDS Bengaluru @ Intuit, 2020!
Best of WiDS features Agate Ponder-Sutton on her talk ‘Data Slayer – How historical sword principles apply to data science’ from WiDS Auckland 2020!
Best of WiDS features Daphne Koller on her talk ‘Machine Learning: A New Approach to Drug Discovery’ from Stanford 2020!
Best of WiDS features Claudia Perlich on her talk ‘The Secret Life of Predictive Models’ from Stanford 2017!
Best of WiDS features Marzyeh Ghassemi on her talk ‘Improving Healthcare with Machine Learning’ from Stanford 2019!
Best of WiDS features Hila Gonen on her talk ‘Gender Bias in Word Embeddings’ from Tel Aviv 2019!
The WiDS Datathon is a 100% community-driven challenge, where participants gain experience and training from partners, ambassadors, and data enthusiasts on social impact challenges! See how it has evolved over the last four years.
Andrea Goldsmith, Dean of Engineering and Applied Science at Princeton University discusses why data scientists with diverse perspectives, experiences, and knowledge are needed for the field to thrive and achieve maximum impact. She paints a vision for a diverse and inclusive culture in data science, and proposes how to achieve that vision.
Best of WiDS features Megan Price on her talk ‘Machine Learning to Determine How Many People Have Been Killed in Syria’ from Stanford 2017!
Best of WiDS features Latanya Sweeney on her talk ‘Data Science to Save the World’ from Stanford 2018!
Susan Athey | Driven Marketplace Design: Experiments, Machine Learning, Econometrics | Stanford 2015
Best of WiDS features Susan Athey on her talk ‘Driven Marketplace Design: Experiments, Machine Learning and Econometrics’ from Stanford 2015!
Best of WiDS features Leda Braga on her talk ‘When Data Science IS the Business’ from Stanford 2018!
Best of WiDS features Alicia Carriquiry on her talk ‘Machine Learning and the Evaluation of Criminal Evidence’ from Stanford 2019!
Emily Glassberg-Sands | Data Science for Unlocking Teaching & Learning at Scale | WiDS Stanford 2019
Best of WiDS features Emily Glassberg-Sands on her talk ‘Data Science for Unlocking Teaching & Learning at Scale’ from WiDS Stanford 2019!
Best of WiDS features Caitlin Smallwood on her talk ‘Netflix: A confluence of metrics, algorithms, and experimentation’ from WiDS Stanford 2019!
Join us in a fireside chat with Nobel Prize winner and Professor of Physics and Astronomy at UCLA, Andrea Ghez.
Cindy Orozco Bohorquez, Ph.D. Candidate in Computational and Mathematical Engineering at Stanford University, studies which algorithm is the right choice for a classical problem in computer graphics and satellite communication called point-set registration. She focuses on the special case of recovering the rotation that aligns two data sets that lie on the d-dimensional sphere. She combines results from statistics, optimization, and differential geometry to compare the solutions given by competing algorithms.
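For context, one classical baseline in this problem family is the SVD-based orthogonal Procrustes (Kabsch) solution; the sketch below (synthetic data, not from the talk) recovers a known rotation aligning two point sets on the unit sphere.

```python
# Recover the rotation aligning two point sets via the SVD (Kabsch / Procrustes).
import numpy as np

rng = np.random.default_rng(12)
P = rng.normal(size=(50, 3))
P /= np.linalg.norm(P, axis=1, keepdims=True)        # points on the unit sphere

theta = 0.7                                          # known ground-truth rotation about z
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
Q = P @ R_true.T                                     # rotated copy of the data

H = P.T @ Q                                          # cross-covariance of the two sets
U, S, Vt = np.linalg.svd(H)
d = np.sign(np.linalg.det(Vt.T @ U.T))               # guard against reflections
R_est = Vt.T @ np.diag([1, 1, d]) @ U.T

print(np.allclose(R_est, R_true))                    # the rotation is recovered
```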
Panel: Energy and Sustainability | Rosalind Archer, Xin Ma, Lesly Goh, Nida Rizwan Farid | WiDS 2021
Panel Discussion on ‘Energy and Sustainability’
Moderator: Rosalind Archer, Professor, University of Auckland
Panelists:
-Xin Ma, Managing Director, Asia Platform, TOTAL
-Lesly Goh, Senior Fellow, National University of Singapore Lee Kuan Yew School of Public Policy
-Nida Rizwan Farid, Aerospace Engineer and Energy Efficiency Consultant, Save Joules
Kalinda Griffiths, Scientia Lecturer at Centre for Big Data Research in Health, UNSW Sydney discusses priority issues when identifying Indigenous people in the national data in Australia’s colonial context.
Maria Schuld, Senior Researcher at Xanadu and the University of KwaZulu-Natal, provides an overview of quantum machine learning research and illustrates that quantum algorithms can be trained like neural nets, but look formally very similar to kernel methods.
Panel discussion on ‘Ethics and Responsible Data Science’
Moderator: Shir Meir Lador, Data Science Group Manager, Intuit
Panelists:
-Andrea Martin, Leader IBM Watson Center Munich & EMEA Client Centers, IBM Distinguished Engineer, IBM
-Monica Scannapieco, Head of the Division “Information and Application Architecture”, Italian National Institute of Statistics
-Nazareen Ebrahim, AI Ethics Officer, Socially Acceptable – South Africa
Joelle Pineau, Computer Scientist and Associate Professor at McGill University and Lead of Facebook’s Artificial Intelligence Research lab, talks about challenges that arise in experimental techniques and reporting procedures in deep learning, with a particular focus on reinforcement learning and applications to healthcare. She describes several recent results and guidelines designed to make future results more reproducible, reusable and robust!
Leveraging Data & Analytics to Solve Challenges in the Pandemic and Beyond | Gina Papush | WiDS 2021
Gina Papush, Global Chief Data and Analytics Officer at Evernorth, speaks on how data science improves people’s well-being, helping them stay healthy or get back to healthy. COVID-19 is straining our health care system unlike any other event in modern history. We are leveraging data and analytics to solve challenging problems for our customers during this pandemic, as well as addressing current and emerging needs in healthcare. We are focused on tackling disparities in healthcare, improving prediction, and enabling improved outcomes. Gina shares obstacles and solutions when working with massive amounts of data and algorithms.
Shafi Goldwasser, Director of the Simons Institute for the Theory of Computing, Professor of Electrical Engineering and Computer Science at the University of California Berkeley, Professor of Electrical Engineering and Computer Science at MIT and Professor of Computer Science and Applied Mathematics at the Weizmann Institute of Science Israel, speaks on how cryptographic models and tools can and should play a role in ensuring the trustworthiness of AI and machine learning and address problems such as privacy of training input, model verification and robustness against adversarial examples.
Fernanda Viégas, Principal Scientist at Google discusses a variety of ways in which data visualization can help people effectively engage with data: from generating scientific insight and enabling public debate to boosting artistic expression. Fernanda presents projects that illustrate how the coupling of visualization technique and design thinking not only empowers experts, but also welcomes lay viewers into the world of data and statistics.
Hulya Emir-Farinas, Director of Data Science at FitBit discusses how machine learning is a key capability in making any solution smart and more personalized.
Kristian Lum, a statistician and research assistant professor at the University of Pennsylvania, describes how following her interests has led her on an ever-changing career path across business, public service, and academia, on a recent episode of the WiDS Podcast.
The WiDS Next Gen program inspires secondary school students to consider careers involving data science, artificial intelligence (AI), and related fields. We particularly encourage young women and girls by showing examples of successful women who are having an impact in the field.
This introductory video provides an overview of what data science is and how data science is being applied in the real world. The video also features a day in the life of four data scientists who are having a positive impact in the field today.
Karen Matthys and Caroline Blair Bauhaus of Stanford University announce the WiDS High School Outreach program at WiDS Stanford on March 2, 2020.
Daphne Koller, CEO and Founder at insitro delivers a Keynote presentation at WiDS Stanford University on March 2, 2020:
Modern medicine has given us effective tools to treat some of the most significant and burdensome diseases. At the same time, it is becoming consistently more challenging to develop new therapeutics: clinical trial success rates hover around the mid-single-digit range; the pre-tax R&D cost to develop a new drug (once failures are incorporated) is estimated to be greater than $2.5B; and the rate of return on drug development investment has been decreasing linearly year by year, and some analyses estimate that it will hit 0% before 2020. A key contributor to this trend is that the drug development process involves multiple steps, each of which involves a complex and protracted experiment that often fails.
We believe that, for many of these phases, it is possible to develop machine learning models to help predict the outcome of these experiments, and that those models, while inevitably imperfect, can outperform predictions based on traditional heuristics. The key will be to train powerful ML techniques on sufficient amounts of high-quality, relevant data.
To achieve this goal, we are bringing together cutting edge methods in functional genomics and lab automation to build a bio-data factory that can produce relevant biological data at scale, allowing us to create large, high-quality datasets that enable the development of novel ML models. Our first goal is to engineer in vitro models of human disease that, via the use of appropriate ML models, are able to provide good predictions regarding the effect of interventions on human clinical phenotypes. Our ultimate goal is to develop a new approach to drug development that uses high-quality data and ML models to design novel, safe, and effective therapies that help more people, faster, and at a lower cost.
Moderated by Margot Gerritsen, WiDS Co-Director, Stanford University
Panelists:
– Aslihan Demirkaya, Research Scientist, Vianai Systems, Inc
– Lucy Bernholz, PhD, Senior Research Scholar, Stanford Center on Philanthropy + Civil Society
– Lynn Kirabo, PhD student, Carnegie Mellon University
Moderated by Martina Lauchengco, Operating Partner, Board Member, Costanoa Ventures
Panelists:
– Rukmini Iyer, Distinguished Engineer, Bing Advertising Marketplace & Serving, AI & Research, Microsoft
– Talithia Williams, Associate Dean, Associate Professor of Mathematics, Harvey Mudd
– Denice Ross, Senior Fellow, National Conference on Citizenship Fellow, Georgetown University
– Lillian Carrasquillo, Insights Manager, Spotify
Nhung Ho, Director of Data Science at Intuit, delivers a Technical Vision Talk at WiDS Stanford University on March 2, 2020:
In today’s digital world, cloud adoption is mainstream. Whether for business or higher education, organizations are on an accelerated migration path to achieve greater flexibility, speed and cost efficiencies. For data scientists, the cloud can serve as an underlying platform to speed AI innovation by offering the processing capability needed to manage massive amounts of data, sophisticated algorithms and complex models that must be as performant as possible. In this talk, Nhung Ho, Director of Data Science for Intuit AI, will draw upon real-world experiences in academia and building and modernizing Intuit Mint’s categorization models to describe what every data scientist needs to know about data science in a cloud world.
Talithia Williams, Associate Dean and Associate Professor of Mathematics at Harvey Mudd College delivers a talk at WiDS Stanford University on March 2, 2020:
The leading edge breed of high-tech, wearable health technology is changing how we monitor personal data. We can quantify everything from heart rate and sleep patterns to body temperature and sex life. But, what is the average person to do with the massive amounts of data being collected? This talk makes a compelling case that all of us should be recording simple data about our bodies and will help you begin to analyze and understand your body’s data. Surprisingly, your own data can reveal much more than even your doctors may know!
Emily Glassberg Sands, Head of Data Science at Coursera delivers a Technical Vision Talk at WiDS Stanford University on March 2, 2020:
Coursera is the world’s largest platform for higher education, providing 50 million learners access to life-transforming skills and credentials. With the rich data generated as over 50 million learners engage on the platform, we have the unique opportunity to use data science and machine learning to unlock high-quality teaching and learning at scale. This talk will take you behind-the-scenes of some of our latest data products — from the personalized coaching that motivates and unblocks learners, to the algorithmic skill scores that track real-time progress against career goals, to the human-in-the-loop systems accelerating grading and student support. We’ll touch on the math, the product, the impact, and our own learnings along the way.
Talithia Williams, Host of NOVA Wonders PBS & Associate Professor of Mathematics, Harvey Mudd College | @Dr_TalithiaW sits down with Sonia Tagare for WiDS 2020 in Stanford, CA.
#WiDS2020 #WomenInTech #theCUBE
https://siliconangle.com/2020/03/05/i…
Harvey Mudd College professor highlights importance of personal health data, diversity in tech
There’s no doubt that the use of data is valuable for businesses. But it’s not just companies that can benefit from data insights.
Individuals can and should also collect their own body data and use it to have a better life, according to Talithia Williams (pictured), associate dean and associate professor of mathematics at Harvey Mudd College.
“We have so many devices that collect data automatically for us, and often we don’t pause long enough to actually look at that history,” she said. “It’s really challenging people to think about how they can use data that they collect about their bodies to help make better health decisions.”
Williams spoke with Sonia Tagare, host of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the Women in Data Science conference in Stanford, California. They discussed the ways in which people can obtain their own data, the importance of including women of color in the technology industry and the privacy challenges related to using data for business purposes.
Understanding the information
The information people can collect about themselves includes, for example, blood pressure, blood sugar, and temperature. But just as important as collecting the data is being active in interpreting it, according to Williams.
“It’s not like if you take this data, you will be healthier or you will live to 100,” she added. “It’s really a matter of challenging people to own the data that they have and get excited about understanding it.”
Data is also important in enabling individuals to set goals to change their lifestyle practices.
“When I take my heart rate data or my pulse, I’m really trying to see if I can get lower than how it was before,” Williams said. “So, the push is really how my exercise and my diet are changing so that I can bring my resting heart rate down.”
Diversity in STEM fields
With a doctorate in statistics, in addition to her role as a professor, Williams is host of a PBS program called “NOVA Wonders,” which “follows researchers as they tackle unanswered questions about life and the cosmos.” She also wrote the book “Power in Numbers: The Rebel Women of Mathematics,” which aims to inspire women of color to work in technology-related industries.
“I really wanted to highlight sort of where we have been, but also where we are going and the amazing women that are doing work on it,” she explained.
It’s the responsibility of those in STEM fields to find ways to advocate for women and especially for women of color, according to Williams.
“Often it takes someone who’s already at the table to invite other people to the table,” she said. “I think the onus is more on people who occupy those spaces already to think about how they can be more intentional in bringing diversity.”
Emily Glassberg Sands, Head of Data Science, Coursera sits down with Sonia Tagare for WiDS 2020 at Stanford, CA.
#WiDS2020 #WomenInTech #theCUBE
https://siliconangle.com/2020/03/13/q…
Q&A: Coursera uses student skill tracking data to help companies create a more diverse workforce
Distance learning has been around since the late 18th century, when students received assignments via mail, completed them, and sent them back for grading. Today, massive open online courses, known as MOOCs, can have hundreds of thousands of students. Some of the most popular free lectures on YouTube, such as Stanford University’s lecture on Einstein’s theory of relativity, have millions of views.
MOOCs started to gain popularity back in 2012, when Stanford professors Daphne Koller and Andrew Ng decided to make their lectures available online. Those courses became the foundation for Coursera Inc., which today has around 50 million students and is the world’s largest platform for higher education.
Emily Glassberg Sands (pictured), senior director of data science at Coursera, joined Sonia Tagare, host of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the Women in Data Science conference in Stanford, California. They discussed the changes happening in MOOC structure, and how tracking student skill data can help companies hire a more diverse workforce.
[Editor’s note: The following content has been condensed for clarity.]
How has Coursera changed from when it started in 2012?
Glassberg Sands: It’s evolved a lot. We’ve moved from partnering exclusively with universities to recognizing that a lot of the most important education for folks in the labor market is being taught within companies. So, we’ve expanded to including education that’s provided not just by top institutions like Stanford, but also by top institutions that are companies like Amazon and Google.
The second big change is we’ve recognized that while for many learners an individual course or MOOC is sufficient, some learners need access to a full degree — a diploma-bearing credential. We now have 14 degrees live on the platform, including master’s degrees in computer science and data science.
The third major change is that we launched Coursera enterprise, which is about providing learning content through employers and through governments so we can reach a wider swath of individuals who might not be able to afford it themselves.
Could you explain how Coursera uses data science to track individual user preferences and user behavior?
Glassberg Sands: We personalize throughout the learner journey. So, in discovery up-front when you first join the platform, we ask: What’s your career goal? What role are you in today? And then we help you find the right content to close the gap.
As you’re moving through courses, we predict whether or not you need some additional support. So, we identify for each individual what type of human touch they might need, and we serve up recommendations to support staff for who they should reach out to, whether it’s a counselor reaching out to a degree student who hasn’t logged in for a while or a teaching assistant reaching out to a degree student who’s struggling with an assignment. Data really powers all of that: understanding someone’s goals, their backgrounds, the content that’s going to close the gap, as well as understanding where they need additional support and what type of help we can provide.
Tell us about Coursera’s latest data products.
Glassberg Sands: We’ve launched three data products over the last couple of years. The first is predicting when learners are going to need additional nudges and intervening in fully automated ways to get them back on track.
The second is about identifying learners who need human support and serving up really easily interpretable insights to support staff so they can reach out to the right learner with the right help.
Then the third is a little bit different. It’s about once learners are out in the labor market, how can they credibly signal what they know so that they can be rewarded for that learning on the job. And this is a product called skill scoring, where we’re actually measuring what skills each learner has up to what level so I can, for example, compare that to the skills required in my target career or show it to my employer so I can be rewarded for what I know.
That would be really helpful when people are creating resumes, by ranking the level of skills that they have.
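As a purely illustrative aside (not Coursera’s actual models), the first data product described above, predicting which learners need a nudge, can be sketched as a simple logistic-regression classifier over engagement features. Everything below (the features, labels, and coefficients) is synthetic and assumed.

import numpy as np

# Toy sketch only: flag learners who may need a nudge using logistic regression
# fit by gradient descent. Features, labels, and thresholds are assumptions.
rng = np.random.default_rng(0)
n = 1_000
days_since_login = rng.integers(0, 30, n) / 30.0        # scaled to [0, 1]
fraction_completed = rng.random(n)
X = np.column_stack([days_since_login, fraction_completed])

# synthetic ground truth: long absence and low completion raise nudge risk
true_logits = 4.0 * X[:, 0] - 3.0 * X[:, 1] - 0.5
y = (rng.random(n) < 1 / (1 + np.exp(-true_logits))).astype(float)

w, b = np.zeros(2), 0.0
for _ in range(5_000):                                   # plain gradient descent
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / n
    b -= 0.1 * (p - y).mean()

flagged = 1 / (1 + np.exp(-(X @ w + b))) > 0.5
print("learners flagged for an automated nudge:", int(flagged.sum()))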
…
Watch the complete video interview below, and be sure to check out more of SiliconANGLE’s and theCUBE’s coverage of the Women in Data Science conference.
Newsha Ajami, Director of Urban Water Policy, Stanford University sits down with Sonia Tagare for WiDS 2020 at Stanford, CA.
#WiDS2020 #WomenInTech #theCUBE
https://siliconangle.com/2020/03/16/c…
Creating resilient, sustainable water supplies means flipping the management paradigm
Humanity is dependent on water, but modern methods have major flaws. Perhaps today’s technology can help.
For millennia, settlements were centered on water supplies and drought signaled disaster. The Egyptians were the first to manage the critical resource, diverting the Nile flood waters using a series of dams and canals in a major irrigation project that turned a seasonal lake into a water reservoir. The concept was exploited to great success by the Romans; the empire became known for its sophisticated aqueduct system that transported water from rural areas to towns.
What was basically the same idea continued into the twentieth century, and dams and reservoirs are still being built to supply our ever-expanding cities. However, despite being the definitive method for water management for most of human civilization, the top-down model has a major flaw.
“People were not part of the loop. The way that they behaved, their decision-making process, what they use, how they use it, wasn’t necessarily part of the process,” said Newsha Ajami (pictured), director of urban water infrastructure and policy at Stanford University’s Water in the West program. “We assume there’s enough water out there to bring water to people and they can do whatever they want with it.”
As well as Water in the West, Ajami works with Stanford-based National Science Foundation Engineering Research Center’s Re-Inventing the Nation’s Urban Water Infrastructure (ReNUWIt), and is in her second term serving on the San Francisco Bay Regional Water Quality Control Board.
Ajami spoke with Sonia Tagare, host of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the Women in Data Science conference in Stanford, California. They discussed how Ajami is working to bridge the gap between science and policy in water management, building solutions for water resilient cities, and changing the traditional top-down water management model to a more collaborative bottom-up approach.
This week theCUBE spotlights Newsha Ajami in its Women in Tech feature
Girls and boys are equal in STEM
Ajami was born in Tehran, Iran, to a family of engineers. She was encouraged in her love of math and problem solving and recalls spending hours building Legos and playing mathematical games. She credits her mother with being her biggest fan and mentor and is quoted as saying she was “raised gender blind” and taught “to be fearless, open-minded, and resilient.”
Thanks to her family’s recognition and support of her math and science abilities, Ajami attended Amirkabir University of Technology, one of Iran’s top universities. She graduated with a bachelor of science in civil and environmental engineering.
“We are all equal. Our brains are all made the same way. It doesn’t matter what’s on the surface,” is Ajami’s message to those who want to study for a career in math, science, technology or engineering. “I encourage all girls to study hard and not get discouraged. Fail as many times as you can, because failing is an opportunity to become more resilient and learn how to grow,” she said.
After living in Tehran during the Iran-Iraq war, Ajami personally understands how water shortages can affect daily life for a city’s inhabitants. “Demand management and public awareness was a centerpiece in dealing with scarcity,” she said, recalling how the experience inspired her to focus on sustainable resource management.
Ajami moved to the United States to attend graduate school at George Washington University, but soon switched to the University of Arizona’s hydrology and water resources program to study under Soroosh Sorooshian, founding director of the university’s National Science Foundation center on sustainability of semi-arid hydrology and riparian areas. “It was one of the best decisions I made,” she said.
Her time at the University of Arizona inspired a love for public policy and applied interdisciplinary research, and after obtaining her master’s degree, Ajami followed Sorooshian to the University of California at Irvine to pursue a doctorate in civil and environmental engineering. She continued her education with post-doc research at the University of California at Berkeley, in “the impacts of hydrological uncertainty on efficient and sustainable water resources management and planning.”
…
Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s coverage of the Women in Data Science conference:
Daphne Koller, CEO and Founder, insitro sits down with Sonia Tagare at Stanford University for WiDS 2020.
#WiDS2020 #WomenInTech #theCUBE
https://siliconangle.com/2020/03/12/a…
AI works to slash drug development costs as technology and biology join forces to defeat Eroom’s Law
The convergence of previously discrete fields is a hallmark of the digital era. Remember the divide between development and operations teams? That gap vanished into the cloud, as DevOps became the new way of working.
Now technology is becoming incorporated into other disciplines. In the 1990s, quantitative biology took a leap from a descriptive science to gene sequencing, thanks to technology such as microarrays. At the same time, big data was revolutionizing information technology.
“What I think is coming now, 30 years later, is the convergence of those two fields into one field that I like to think of as digital biology,” said Daphne Koller (pictured), founder and chief executive officer of Insitro Inc.
Koller spoke with Sonia Tagare, host of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the Women in Data Science conference in Stanford, California. They discussed how applying machine-learning techniques to traditionally biological research fields — such as drug research — could bring down the costs of medicine.
Applying ML models to drug development
Measuring biology has taken on new levels of detail, fidelity and scale thanks to new technology, according to Koller. Artificial intelligence and machine learning allow scientists to interpret what they are seeing and engineer new solutions that “will have implications in biomaterials, in energy, in the environment, in agriculture, and I think also in human health,” Koller said.
One of the biggest problems in the health field is the negative trend in the number of drugs approved versus dollars spent on research. This is known as Eroom’s Law because it is the opposite of Moore’s Law.
“Despite many important advancements, the costs just keep going up and up and up,” Koller said.
Approach problems with diversity
Machine learning could hold the key to breaking this trend, but it requires a cross-discipline approach, according to Koller. “One needs to really build a culture of people who work together from different disciplines, each bringing their own insights and their own ideas into the mix,” she said.
The team she has created at Insitro is half life scientists and half machine learning and data science experts.
“They start from the very beginning to understand what are the problems that one could solve together: How do you design the experiment? How do you build the model? And how do you drive insights that can help us make better medicines for people?” she said.
Using a data-driven approach, collecting and analyzing huge amounts of data will reveal new hypotheses, according to Koller.
“Hopefully, we’ll be able to create enough data and apply machine learning to address key bottlenecks in the drug discovery and development process,” she said. “[Then] we can bring better drugs to people, and we can do it faster and, hopefully, at much lower cost.”
Watch the complete video interview below, and be sure to check out more of SiliconANGLE’s and theCUBE’s coverage of the Women in Data Science conference.
Cynthia Dwork, Gordon McKay Professor of Computer Science at Harvard University.
Differential privacy is a mathematically rigorous definition of privacy tailored to statistical analysis of large datasets. Differentially private systems simultaneously provide useful statistics to the well-intentioned data analyst and strong protection against arbitrarily powerful adversarial system users — without needing to distinguish between the two. Differentially private systems “don’t care” what the adversary knows, now or in the future. Finally, differentially private systems can rigorously bound and control the cumulative privacy loss that accrues over many interactions with the confidential data. These unique properties, together with the abundance of auxiliary data sources and the ease with which they can be deployed by a privacy adversary, led the US Census Bureau to adopt differential privacy as the disclosure avoidance methodology of the 2020 decennial census.
This talk will motivate the definition of differential privacy, describe some of the techniques to be used in the 2020 census, and explain some of the utility gains over previous methods and the sense in which differential privacy provides a best-possible solution. Finally, the talk will highlight a few of the many remaining challenges.
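As a small concrete companion to the definition, here is a hedged Python sketch of the Laplace mechanism, the textbook way to answer a counting query with epsilon-differential privacy. It is not the Census Bureau’s 2020 disclosure avoidance system, only an illustration of noise addition and of how privacy loss accumulates under basic composition; the data and budget values are assumptions.

import numpy as np

# Laplace mechanism sketch: a count query has sensitivity 1, so adding
# Laplace(1/epsilon) noise satisfies epsilon-differential privacy, and the
# privacy loss adds up across repeated queries under basic composition.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=10_000)   # hypothetical 0/1 attribute per person

def private_count(values, epsilon):
    true_count = int(values.sum())
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)  # sensitivity-1 query
    return true_count + noise

eps_per_query = 0.1
answers = [private_count(data, eps_per_query) for _ in range(5)]
total_budget = 5 * eps_per_query          # cumulative privacy loss (basic composition)

print("true:", int(data.sum()), "noisy answers:", np.round(answers, 1),
      "total epsilon spent:", total_budget)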
Hilary Parker, Data Scientist, Stitchfix
Data is the lifeblood of every organization. Whatever our job title, each of us uses data to get our job done — from observing a running system to improving performance to building a machine-learned model. This talk is about approaches and techniques to collect the most useful data we can, analyze it in a scientific way, and use it most effectively to drive actions and decisions.
Is using data effectively an art or a science? It is both. The “art” helps us decide the “right” way to approach an analysis or an algorithm. The “science” applies statistical rigor to our inferences. But with only the art and the science, we miss something critically important. In this talk, I suggest that, beyond both art and science, the fundamental questions we need to ask of our data should be informed by the field of design and design thinking. Every designer needs to understand their intended user, whether they are designing a physical object, experimenting on a recommendation system, or making a launch decision about a product. That focus on the why — instead of just the how and the what — takes us to the next level.
This talk will leave you with actionable insights about how to apply the lens of design thinking to help you use data more effectively in your everyday job.
Marzyeh Ghassemi, Assistant Professor, University of Toronto
Professor Marzyeh Ghassemi tackles part of this puzzle with machine learning. This talk will cover some of the novel technical opportunities for machine learning in health challenges, and the important progress to be made with careful application to the domain.
Anima Anandkumar, Professor of Computing and Mathematical Sciences at CalTech and Director of Research in Machine Learning, NVIDIA.
Standard deep-learning algorithms are based on a function-fitting approach that does not exploit any domain knowledge or constraints. This makes them unsuitable in applications that have limited data or require safety or stability guarantees, such as robotics. By infusing structure and physics into deep-learning algorithms, we can overcome these limitations. There are several ways to do this. For instance, we use tensorized neural networks to encode multidimensional data and higher-order correlations. We infuse symbolic expressions into deep learning to obtain strong generalization. We utilize spectral normalization of neural networks to guarantee stability and apply it to stable landing of quadrotor drones. These instances demonstrate that building structure into ML algorithms can lead to significant gains.
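One of the techniques mentioned, spectral normalization, is easy to sketch in a few lines of NumPy: estimate the largest singular value of a weight matrix by power iteration and rescale so the layer’s spectral norm (and hence its Lipschitz constant) is at most one. This is a minimal illustration under assumed matrix sizes, not the stability-guaranteed controllers from the talk.

import numpy as np

# Spectral normalization sketch: bound the spectral norm of a linear layer.
def spectral_normalize(W, n_iter=50):
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iter):                 # power iteration on W W^T
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                       # estimated top singular value
    return W / max(sigma, 1.0)              # only shrink, never amplify

W = np.random.default_rng(1).normal(size=(64, 32))   # assumed layer weights
W_sn = spectral_normalize(W)
print(np.linalg.svd(W_sn, compute_uv=False)[0])      # should be about 1.0 or below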
Emma Brunskill, Assistant Professor, Computer Science, Stanford University
There is increasing excitement about reinforcement learning, a subarea of machine learning for enabling an agent to learn to make good decisions. Yet numerous questions and challenges remain for reinforcement learning to help support progress in important high-stakes domains like education, consumer marketing and healthcare. I’ll discuss some recent advances in these areas, and our work towards creating transparent, accountable reinforcement learning approaches that can interact beneficially with people.
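For readers new to the area, the sketch below shows tabular Q-learning on a toy five-state chain, just to make “an agent learning to make good decisions” concrete. The environment, reward, and hyperparameters are assumptions for illustration; they are not drawn from the talk’s education or healthcare applications.

import numpy as np

# Toy tabular Q-learning on a 5-state chain: moving right reaches a rewarding
# goal state. Purely illustrative assumptions throughout.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1   # learning rate, discount, exploration

for episode in range(500):
    s = 0
    for _ in range(50):
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if r == 1.0:
            break

print(Q.argmax(axis=1))  # non-terminal states should prefer action 1 (right)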
Yinglian Xie is the CEO and co-founder of DataVisor, the leading AI and big data analytics company that protects consumer-facing enterprises from a variety of fraud, abuse, and money laundering activities.
She shares her insights into the growing problem of highly sophisticated fraud, and how DataVisor’s innovative technology helps companies fight back. Yinglian founded the AI-based fraud-detection company with an ambitious vision: to stop fraudsters in their tracks and to restore online trust with the help of big data and machine learning. She is set to present DataVisor’s quarterly Fraud Index Report, focusing on how cyber-criminals’ techniques are ever-evolving, from basic attacks to the most sophisticated and highly-organized attacks. She will also speak on fraud detection and share which methods work best at each stage, as well as share her vision of how fraud detection will unfold in the coming decade. Yinglian, who is passionate about helping entrepreneurs succeed, will also share her business advice and success tips to help others launch, operate, and grow their own businesses.
Alicia Carriquiry, Distinguished Professor and President’s Chair in Statistics | Director of CSAFE, Iowa State University
In the US criminal justice system, jurors choose between two competing hypotheses: the suspect is the source of the evidence found at the crime scene or s/he is not. The likelihood ratio framework, which relies on Bayes’ theorem for assessing the probative value of evidence, is difficult to implement in practice when evidence is in the form of an image. Machine learning provides a good alternative for determining whether the evidence supports the proposition that the suspect may have been its source. We illustrate these ideas using information about the surface topography of bullet lands.
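A tiny worked example in Python helps fix the likelihood-ratio idea: the posterior odds that the suspect is the source equal the likelihood ratio times the prior odds. All numbers below are invented assumptions for illustration and have nothing to do with CSAFE’s bullet-land models.

# Toy likelihood-ratio calculation (all probabilities are assumptions).
p_evidence_if_same_source = 0.85       # assumed P(observed match | same source)
p_evidence_if_diff_source = 0.01       # assumed P(observed match | different source)
likelihood_ratio = p_evidence_if_same_source / p_evidence_if_diff_source

prior_odds = 1 / 1000                  # assumed prior odds that the suspect is the source
posterior_odds = likelihood_ratio * prior_odds
posterior_prob = posterior_odds / (1 + posterior_odds)

print(f"LR = {likelihood_ratio:.0f}, posterior P(same source) = {posterior_prob:.3f}")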
Srujana Kaddevarmuth, Senior Manager, Data Science, Accenture & Women in Data Science Ambassador, Bengaluru | @Srujanadev sits down with Lisa Martin at Stanford University for WiDS 2019.
#WiDS2019 #Accenture #theCUBE
https://siliconangle.com/2019/03/11/w…
WiDS Datathon mixes up data science with collaborative teams
If only a data set and some pre-packaged data-analytics software were all it takes to solve real-world problems. The reality is that tools require hands to ply them. And just like a comprehensive data set is better than a limited one, a comprehensive set of skills helps people design better solutions.
“Looking at the problem from different perspectives and collaboration are the keys to be able to be successful in data science,” said Srujana Kaddevarmuth (pictured), data science and analytics executive at Accenture LLP and ambassador for the Women in Machine Learning & Data Science team in Bengaluru (formerly Bangalore).
Take a problem like deforestation from palm-oil plantations. Consider all the factors that might be involved: agriculture, climate, ecology, economics, politics, etc. What are the odds that one random data expert can ask all the right questions, pull together all the necessary data, and derive actionable insight? Probably not great.
This is the thinking behind collaborative data-science projects, like the Women in Data Science, or WiDS, Datathon. This year, it organized several teams to collaborate and use data and satellite imagery to analyze this particular problem.
Kaddevarmuth spoke with Lisa Martin (@LisaMartinTV), host of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the Stanford Women in Data Science event in Stanford, California. They discussed this year’s Datathon and why collaboration results in better outcomes for data scientists.
From clueless to Kaggle code in three weeks
At the WiDS Bengaluru regional event, organizers set up a community workshop. The goal was to form teams to participate in the Datathon. They would submit the fruits of their endeavors to something called Kaggle, a platform for data-science projects and competitions. In India, Kaggle participation is very male heavy “despite that region having amazing female data scientists who are innovators in their space with multiple patents, publications and innovations to their credit,” Kaddevarmuth said.
WiDS teamed mentors with participating teams to work together for three weeks. One team from the engineering division that was brand new to Kaggle learned new concepts, honed skills in deep learning and neural networks, and submitted original code to the Kaggle leaderboard.
“They were not the top-scoring team, but this entire experience of being able to collaborate, look at the problem from different perspectives, and be able to submit the code despite a lot of these challenges — and also navigate the platform in itself — was a decent achievement from my perspective,” Kaddevarmuth concluded.
Watch the complete video interview below, and be sure to check out more of SiliconANGLE’s and theCUBE’s coverage of the Stanford Women in Data Science event.
Madeleine Udell, Assistant Professor, Cornell University, @madeleineudell sits down with Lisa Martin at Stanford University for WiDS 2019.
#WiDS2019 #CornellUniversity #theCUBE
https://siliconangle.com/2019/03/08/t…
This professor is cleaning up tech’s ‘messy data’ problem
Strong data sets are table stakes for any organization today. Data insights can provide the tentpoles for building a strategic roadmap and offer unexpected learnings for businesses to leverage as new market opportunities. But even the most valuable data set can prove worthless if its insights are entangled in the unstructured digital void.
An estimated 80 percent of all data is unstructured, which renders the intel buried in its complex documents and media files inaccessible without an alternative method of analysis. As information floods the tech industry faster than new talent is prepared to make sense of it, the unstructured data challenge is posing a formidable hurdle for businesses in the digital age.
Madeleine Udell (pictured), assistant professor of operations research and information engineering at Cornell University, is educating a new era of technologists to decode this so-called “messy data” with a more effective approach to tech collaboration.
“Oftentimes people only learn about big, messy data when they go to industry,” Udell said. “I’m interested in understanding low dimensional structure in large, messy data sets [to] figure out ways of … making them seem cleaner, smaller and easier to work with.”
Udell spoke with Lisa Martin, host of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the recent Stanford Women in Data Science event at Stanford University.
This week, theCUBE spotlights Madeleine Udell in its Women in Tech feature.
The unstructured data challenge
The rise of messy data can be attributed in large part to the influx of information from a growing number of digital endpoints. Internet of things devices deliver a stream of “messy” data, but the clutter can also come from images, videos, social media, emails, and other data sets not already formatted for simple analysis.
Though more complex and tedious to decipher, these data sources are some of the most highly valued in a market focused on individual user targeting. That gap between ability and potential innovation is what drives Udell’s interest in unstructured data, an area of technology the assistant professor says people entering the tech industry are not adequately prepared for. In her own classes, Udell teaches optimization for machine learning from a messy data perspective.
“[The class] introduces undergraduates to what messy data sets look like, which they often don’t see in their undergraduate curriculum, and ways to wrangle them into forms they could use with other tools they have learned as undergraduates,” she said.
Udell’s interest in messy data was piqued when she met the challenge head on working in the Obama 2012 presidential campaign. She was tasked with analyzing voter information but found the unstructured data sets too cumbersome to yield valuable insight.
“They had hundreds of millions of rows, one for every voter in the United States, and tens of thousands of columns about things that we knew about those voters,” Udell said. “Gender … education level, approximate income, whether or not they had voted in the last elections, and much of the data was missing. How do you even visualize this kind of data set?”
When Udell returned to work on her Ph.D., she was intent on discovering a more efficient method for parsing out value from unstructured data sets. “I wanted to figure out the right way of approaching this, because a lot of people will just sort of hack it,” she said. “I wanted to understand what’s really going on.”
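The voter-file anecdote hints at what “low dimensional structure in large, messy data sets” can buy you. The sketch below is a minimal NumPy illustration, assuming a synthetic rank-2 table with 30 percent of entries missing: iterating a truncated SVD fills in the gaps reasonably well. It is a generic matrix-completion heuristic, not Udell’s generalized low rank models.

import numpy as np

# Matrix-completion sketch: impute missing entries of a low-rank table by
# repeatedly projecting onto rank-2 matrices. All data below is synthetic.
rng = np.random.default_rng(0)
true = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 20))   # rank-2 "table"
mask = rng.random(true.shape) < 0.7                           # 30% of entries missing
X = np.where(mask, true, np.nan)

filled = np.where(mask, X, 0.0)                               # start missing cells at 0
for _ in range(50):
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    low_rank = (U[:, :2] * s[:2]) @ Vt[:2]                    # keep rank 2
    filled = np.where(mask, X, low_rank)                      # re-impute missing cells

err = np.abs(filled - true)[~mask].mean()
print("mean abs error on imputed entries:", round(float(err), 3))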
Making an impact with communication
Udell is as interested in the technical architectures that enable data analysis as she is in supporting organizations through the implementation processes that will allow them to benefit from her work. A comprehensive answer to data management requires both math and communication, and Udell says her broad skill set is part of what has enabled her to make sense of messy data.
“If you want your technical work to have an impact, you need to be able to communicate it to other people,” Udell stated.
The social aspect of her role is crucial to finding solutions that actually address user problems and work within existing processes. “You need to make … sure you’re working on the right problems, which means talking with people to figure out what the right problems are,” she said. “This is … fundamental to my career, talking to people about problems they’re facing that they don’t know how to solve.”
…
Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s coverage of the Stanford Women in Data Science event:
Interview with Ziya Ma, Vice President of Software and Services Group and Director of Big Data Technologies, Intel Corporation
Interview with Daniela Witten, Associate Professor of Statistics and Biostatistics, University of Washington
Interview with Bhavani Thuraisingham, Professor of Computer Science, University of Texas at Dallas
Leda Braga, Chief Executive Officer at Systematica Investments, delivers a Keynote presentation at the WiDS 2018 Conference held at Stanford University.
Objective analysis of relevant data can improve the execution of most businesses. From the simple client feedback form through to production statistics, listening to the data helps. In the investment management industry, by contrast, data analysis IS the business. Investment management is information management and data science is not an aid to decision making, but rather the essence of it.
This talk will explore the reality of investment management, how recent developments in data and AI are shaping the fund management industry, and the challenges of dealing with financial data. In the context of the WiDS forum and its clear focus on diversity, trends such as ethical investing (or socially responsible investing, SRI) will also be discussed.
Jia Li, Head of R&D Cloud AI at Google, presents From Insights to Solutions: An Invitation to the AI Journey at the WiDS 2018 Conference held at Stanford University on March 5, 2018.
AI is a journey that begins with a problem to be solved, confronts the challenges of collecting data and innovating algorithms, and ultimately delivers a solution. As a researcher and Google’s Head of R&D for Cloud AI, Jia Li has experienced this journey at each step. Now, she’s working to open it up to the widest possible audience, helping new industries harness the power of AI to face the challenges that matter most to them.
Bhavani Thuraisingham, Professor of Computer Science at University of Texas at Dallas, presents Integrating Data Science and Cyber Security at the WiDS 2018 Conference held at Stanford University on March 5, 2018.
The collection, storage, manipulation, analysis and retention of massive amounts of data have resulted in serious security and privacy considerations. Various regulations are being proposed to handle big data so that the privacy of individuals is not violated. For example, even if personally identifiable information is removed from the data, an individual can be identified when that data is combined with other data. While collecting massive amounts of data causes security and privacy concerns, the application of big data analytics to cyber security is exploding. For example, an organization can outsource activities such as identity management, intrusion detection and malware analysis to the cloud. The question is, how can the developments in data science techniques be used to solve security problems? Furthermore, how can we ensure that such techniques are secure and adapt to adversarial attacks? This presentation will first describe our research in data science, including stream data analytics and novel class detection, and discuss its applications to insider threat detection. Second, it will discuss the emerging research area of adversarial machine learning. Finally, it will discuss why women should pursue careers in data science.
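To give a feel for the adversarial machine learning theme, here is a hedged toy sketch of a fast-gradient-sign perturbation against a linear logistic classifier: a small, targeted change to the input flips the model’s decision. The weights, input, and epsilon are assumptions for illustration and are unrelated to the presenter’s systems.

import numpy as np

# Toy adversarial-example sketch: perturb an input by epsilon * sign(gradient)
# of the loss with respect to the input. All quantities here are assumptions.
rng = np.random.default_rng(0)
w = rng.normal(size=50)                     # assumed "trained" weights
x = rng.normal(size=50)                     # an input to attack
b = 1.0 - w @ x                             # place x just on the class-1 side

def predict(v):
    return 1 / (1 + np.exp(-(w @ v + b)))   # P(class 1)

p = predict(x)
grad_x = (p - 1.0) * w                      # gradient of log-loss (label 1) w.r.t. x
x_adv = x + 0.05 * np.sign(grad_x)          # fast gradient sign perturbation

print("clean:", round(float(p), 3), "adversarial:", round(float(predict(x_adv)), 3))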
Dawn Woodard, Senior Data Science Manager of Maps at Uber presents Dynamic Pricing and Matching in Ride-Sharing at the WiDS 2018 Conference held at Stanford University on March 5, 2018.
Ride-sharing platforms like Uber, Lyft, Didi Chuxing, and Ola are transforming urban mobility by connecting riders with drivers via the sharing economy. These platforms have achieved explosive growth, in part by dramatically improving the efficiency of matching, and by calibrating the balance of supply and demand through dynamic pricing. The dynamic adjustment of prices ensures a reliable service for riders, and incentivizes drivers to provide rides at peak times and locations. Dynamic pricing is particularly important for ride-sharing, because pricing too low causes pickup ETAs to get very long, which reduces the efficiency of the platform and causes a poor experience for riders and drivers. We review the literature on matching and pricing techniques in ride-sharing. We also discuss how to estimate several key inputs to those algorithms: predictions of demand, supply, and travel time in the road network.
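To illustrate the mechanism in the abstract (and only as a toy, not Uber’s production logic), a dynamic price multiplier can be written as a simple function of the predicted demand-to-supply imbalance: the price rises when requests outstrip available driver-hours, which trims demand, attracts supply, and keeps pickup ETAs bounded. The sensitivity and cap below are assumed parameters.

# Toy surge-pricing rule (illustrative assumptions only).
def surge_multiplier(predicted_demand, available_supply,
                     sensitivity=0.5, cap=3.0):
    """Assumed rule: the price multiplier grows with the demand/supply imbalance."""
    if available_supply <= 0:
        return cap
    imbalance = max(predicted_demand / available_supply - 1.0, 0.0)
    return min(1.0 + sensitivity * imbalance, cap)

for demand, supply in [(80, 100), (120, 100), (300, 100)]:
    print(demand, supply, round(surge_multiplier(demand, supply), 2))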
Risa Wechsler, Associate Professor of Physics at Stanford University and SLAC National Accelerator Laboratory presents A Universe of Data Challenges at the WiDS 2018 Conference held at Stanford University on March 5, 2018.
Latanya Sweeney, Professor of Government and Technology in Residence at Harvard University, delivers a Keynote presentation, Data Science to Save the World, at the WiDS 2018 Conference held at Stanford University on March 5, 2018.
Technology designers are the new policy makers. No one elected them and most people do not know their names, but the arbitrary decisions they make when producing the latest gadgets and online innovations dictate the code by which we conduct our daily lives and govern our countries. As technology progresses, every societal value and every state rule comes up for grabs and will likely be redefined by what technology enables or not. Data science allows us to do experiments to show how it all fits together or falls apart. Come to this talk and see how data science can help save the world.
Daniela Witten, Associate Professor of Statistics and Biostatistics at University of Washington, presents More Data, More (Statistical) Problems at the WiDS 2018 Conference held at Stanford University on March 5, 2018.
By now, virtually every field has become inundated with big data. We have been promised that this data will usher in a new era of previously unimaginable societal and scientific progress. While it is certainly true that more data brings with it incredible opportunities, it is also true that more data can bring new and previously unimaginable statistical challenges. I will talk about some of those statistical challenges, as well as statistical ways to solve them. Examples will be taken from biomedical research.
Mala Anand, EVP, President of Leonardo, Data & Analytics at SAP, presents Healthcare Beyond the Horizon — Going Digital to Improve People’s Lives at the WiDS 2018 Conference held at Stanford University on March 5, 2018.
Never before have there been so many promising breakthrough technologies available – and with it, opportunities to dramatically change the way we live every day. Nowhere is this more evident than in Healthcare, where we see technologies like Analytics, IoT, Machine Learning, Big Data and Blockchain playing significant roles in transforming people’s lives all over the planet. Mala Anand, President of SAP Leonardo, Data & Analytics, will provide some insight on how the healthcare industry is going digital and how far we can possibly go to improve patient outcomes.
Maria Klawe, President of Harvey Mudd College, welcomes attendees to the WiDS 2018 Conference held at Stanford University on March 5, 2018.
Nathalie Henry Riche, Researcher at Microsoft Research, presents Data-Driven Storytelling at the WiDS 2018 Conference held at Stanford University on March 5, 2018.
Data visualization is a powerful medium for making sense of large amounts of data and communicating insights gained from analyses to a general audience. Research in the field of information visualization aims at designing interactive visual interfaces to augment human cognition for exploring and communicating with data.
In this talk, I will present our latest research efforts in the field of information visualization and data-driven storytelling. Stories supported by facts extracted from data analysis (data-driven storytelling) proliferate in many different forms from static infographics shared on social media to dynamic and interactive applications available on leading news media outlets. I will present research shedding light on what makes visual stories compelling and share insights on how to empower people to build these experiences without programming.
Hear about why Margot founded the Global Women in Data Science (WiDS) Conference, at Stanford. The WiDS conference is now at Stanford and 150+ locations worldwide, and is available on livestream and Facebook Live.
The upstream oil & gas industry (i.e. the exploration for and production of hydrocarbons) needs to reap the benefits of new technology to improve efficiency. Making more effective use of increasing amounts of collected data is on the verge of transforming the business.
Transformation through data analytics is equally relevant on both the operational and financial sides of the business.
On the upstream operational side: for decades now, we have been inventing new and increasingly sophisticated tools (both hardware and software) to generate new data types that extend the boundaries of geoscience knowledge, and allow us to understand our hydrocarbon reservoirs in ever increasing detail. Historically, we have processed only a fraction of the data collected, but that is changing. Now, among the most important criteria governing the efficiency of oil and gas companies are not only the hugely increased volume of data collected but also the variety, velocity and veracity of information that can be extracted from that data. That’s data science! Data analytics as a discipline is now increasingly integrated within our upstream workflows in drilling, reservoir characterization and the actual production (extraction) of hydrocarbons in the most economically efficient ways possible. To this end, one goal is the development of an analytics platform that will perform a key role in increasing productivity through the simultaneous optimization of drilling planning and execution, the improvement of asset utilization and the overall reduction of non-productive time.
On the financial side: the oil and gas industry has a long history of being secretive and, as a result, judging the quality and accuracy of non-technical data has proved very difficult. In general, insufficient attention has been paid to addressing these challenges, leading to unnecessary volatility in price movements through inadequate or conflicting data, and this volatility impacts decision-making within companies. In the information age, where markets react instantaneously to a multitude of data sources, it is time to better understand this key driver of our industry. Decision-enabling information is critical to the efficient functioning of an industry that is driven by the signals coming from commercial markets. Understanding the quality and accuracy of that information through data science is a key enabler in filling a major gap currently preventing more effective management of oil and gas company assets.
Digital transformation, implying the transition from desktop to the cloud and mobile devices, easy access to information, new scalable online services and automated industrial workflows, is about to radically change the way we work in any industry (oil and gas, defense, transport, automotive, medicine, telecom, logistics, etc). This is no longer a trend, but a reality clearly demonstrated by the world’s most valuable companies adopting expanded and enhanced data analytics in response to common drivers of operational efficiency, operational safety and accuracy of real-time decision-making. That’s the promise of Big Data, to really understand the systems that make our technological industry. As you begin to understand the interactions of all the constituent components then you can build systems that are better and more effective at addressing the key industry drivers, irrespective of the industry. New technology is increasingly playing a huge new role. Data is the new oil!
Dr. Gottlib-Zeh describes how data science is transforming the oil and gas industry for better planning and efficiency, for both drilling and production.
Dr. Holmes shares a survey of the current challenges in the analyses of heterogeneous biological data. Combining networks, contingency tables and data from multiple omics domains presents the analyst with multiple choices. The result can be an erroneous p-value or a complicated workflow; both can be irreproducible. I will survey some of the recent approaches to this challenge.
Dr. Susan Holmes, Professor of Statistics, describes processes for analyzing large messy microbiome data sets, and the importance of reproducibility.
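One concrete source of the “erroneous p-value” problem Dr. Holmes describes is multiple testing. The sketch below is a hedged illustration with purely synthetic null data (assumed sample sizes and thresholds): thousands of uncorrected tests manufacture significant results out of noise, and a Benjamini-Hochberg correction, coded here directly from its definition, removes them.

import numpy as np
from scipy import stats

# Multiple-testing sketch on pure noise (all settings are assumptions).
rng = np.random.default_rng(0)
n_features, n_samples = 5_000, 20
group_a = rng.normal(size=(n_features, n_samples))   # null data: no real signal
group_b = rng.normal(size=(n_features, n_samples))

pvals = stats.ttest_ind(group_a, group_b, axis=1).pvalue
print("naive 'hits' at p < 0.05:", int((pvals < 0.05).sum()))    # hundreds of false positives

order = np.argsort(pvals)
thresholds = 0.05 * (np.arange(1, n_features + 1) / n_features)  # BH step-up line
passed = pvals[order] <= thresholds
n_discoveries = 0 if not passed.any() else int(np.max(np.where(passed)[0]) + 1)
print("BH discoveries at FDR 0.05:", n_discoveries)              # expected to be about 0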
Healthcare is an area where data science and artificial intelligence have tremendous potential to improve lives and where significant methodological advances are needed to achieve that promise. In this talk, I will highlight clinical needs in which data science can help — accurate diagnosis, long-term disease management, and personalized treatment — and also the hard, interesting methodological challenges — particularly in robust inference and interpretability — that will be part of the solution. I will do so by sharing examples of work from our group, which focuses on learning timeseries and sequential decision-making models for health applications ranging from better understanding autism spectrum disorder to managing patients with HIV or in the ICU.
Dr. Finale Doshi-Velez from Harvard University describes how machine learning is optimizing treatment for HIV patients, and beyond.
Predictive modeling and its variants are at the core of an increasing number of technical advances that touch us in every aspect of our lives. Today, nobody doubts the ability of machines to learn from historical data and predict with far higher accuracy than any human. But real-world applications of machine learning are often a far cry from the well-understood academic assurances of how these algorithms should behave. In this talk I will share some practical lessons from when models had a surprising secret life and did something very different from what I thought I had asked them to do. As the creators of machine learning solutions it is our responsibility to pay attention to the often subtle symptoms and to let our human intuition be the gatekeeper deciding when our models are ready to be released ‘into the wild’.
Claudia Perlich, Chief Scientist at Dstillery, talks about how data scientists need to use a combination of data science and intuition to deliver accurate insights from data sets.
The Human Rights Data Analysis Group (HRDAG) uses methods from statistics and computer science to quantify mass violence. As part of that work, we rely on open source tools, including python and R, for data processing, management, analysis, and visualization. This talk will highlight how we use those methods and tools to estimate how many people have been killed in the ongoing conflict in Syria.
Megan Price, Human Rights Data Analysis Group
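Estimates of this kind are often built on multiple systems estimation, which generalizes classic capture-recapture. As a minimal illustration of the underlying idea, here is the two-list Lincoln-Petersen estimator with invented counts; it is not HRDAG’s actual data or full methodology:

```python
# Hypothetical two-list capture-recapture (Lincoln-Petersen) with invented counts.
# Real multiple systems estimation uses more lists and models dependence between them.
def lincoln_petersen(n1, n2, m):
    """n1, n2: records on each list; m: records matched on both lists."""
    return n1 * n2 / m

n_list_a = 4000      # victims documented by source A (made up)
n_list_b = 3500      # victims documented by source B (made up)
n_both   = 1400      # victims appearing on both lists after record linkage (made up)

estimate = lincoln_petersen(n_list_a, n_list_b, n_both)
documented = n_list_a + n_list_b - n_both
print(f"documented: {documented}, estimated total: {estimate:.0f}")  # ~10,000 in this toy case
```

The gap between the documented count and the estimated total is exactly the kind of undocumented violence such methods try to quantify.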
Lori Sherer, Partner, Bain & Co. and Caitlin Smallwood, VP Science and Algorithms, Netflix
Susan Athey, the Economics of Technology Professor at Stanford Graduate School of Business, has always been interested in the intersection of economics and computer science. As an undergraduate she was a math, computer science and economics triple major. She explains that combining economics, social science, engineering and machine-learning tools allows you to answer questions in a way that wasn’t possible before.
Panel Discussion: Career Paths in Data Science
Moderator: Kara Swisher, Re/code
Panelists:
Jennifer Tour Chayes, Microsoft Research
Aleksandra Korolova, USC
Shubha Nabar, Salesforce
Bin Yu, UC Berkeley
Data-Driven Marketplace Design: Experiments, Machine Learning, and Econometrics | Susan Athey | WiDS 2015
Susan Athey, Stanford Graduate School of Business
Celeste is a data science expert with more than 28 years of service to the Lawrence Livermore National Laboratory’s (LLNL) Computation Directorate.
Google it: The evolution of search | #WiDSConference
by Marlene Den Bleyker | Nov 13, 2015
https://siliconangle.com/2015/11/13/g…
Google, the trailblazer of all search engines, has made a powerful impact over the years in the way users and businesses interact online.
Carrie Grimes, distinguished engineer at Google, has been with the company since its early days, and she caught up with Jeff Frick, host of theCUBE, from the SiliconANGLE Media team, at the Women in Data Science Conference held at Stanford University to explain Google’s processes and the evolution of search.
The new questions
In the past, to extract meaningful data meant using algorithms; however, Grimes stated, “Now we actually have to ask questions, such as how do you make a business decision based on this data.” She believes that blending the computational ability of computer science with the confidence gained from statistics is critical for companies such as Netflix, Amazon or Google to make business-based decisions.
Grimes described how traditionally the goal was to look at customer and user data and derive value, but now it is also essential to focus on the backend, where data science is really important in making business decisions. According to Grimes, some of the considerations are: “How do I use data science to pick the optimal point? How much computational power do I put into indexing a tweet versus the value of going out and getting a whole new set of content that is more static?”
Grimes pointed out that you can’t force the algorithms Google uses to index, to crawl and to find the right meaning of content and keywords. She referenced the amount of content people are creating and the need to decide whether to invest in compute power or in a feature, a cool new rendering, or the ability to understand structured data. “That’s the kind of tradeoff we have internally,” she said.
New data issues to tackle
Grimes has worked for Google since 2003 and has seen the steady need to progress and adapt. She commented on how industry outsiders don’t understand the pressure from users. As search becomes more intelligent, customer expectations become higher.
She recalled that when she began her career at Google, a great deal of effort went into static content, managing scale, and moving data around. Now, personalized recommendations and understanding the nuances of who is searching and what their searches mean are a much larger focus for data scientists than they were 10 years ago.
Whether it’s personalizing recommendations of movies and TV shows or optimizing the streaming of video bits to people’s households, Netflix relies heavily on data science techniques. We believe in continuous learning through predictive modeling and algorithms, experimentation, and principled metric design. This talk will highlight Netflix’s core data science strategies and uses, with particular focus on our successes and challenges in experimenting with personalization algorithms.
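The experimentation piece usually boils down to comparing arms of a test on a chosen metric. Here is a minimal two-proportion test as a generic illustration; the counts are invented and this is not Netflix’s tooling or metric design:

```python
# Hypothetical A/B comparison of a binary engagement metric with invented counts.
from statsmodels.stats.proportion import proportions_ztest

successes = [5300, 5150]     # e.g., members who finished a title, per arm (made up)
trials    = [50000, 50000]   # members in each experience arm (made up)

z, p = proportions_ztest(successes, trials)
print(f"z = {z:.2f}, p = {p:.4f}")   # a small p-value suggests a real difference between arms
```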
Everywhere we turn these days, we find that networks can be used to describe relevant interactions. In the high tech world, we see the Internet, the World Wide Web, mobile phone networks, and a variety of online social networks. In economics, we are increasingly experiencing both the positive and negative effects of a global networked economy. In epidemiology, we find disease spreading over our ever-growing social networks, complicated by mutation of the disease agents. In biomedical research, we are beginning to understand the structure of gene regulatory networks, with the prospect of using this understanding to manage many human diseases. In this talk, I look quite generally at some of the models we are using to describe these networks, processes we are studying on the networks, algorithms we have devised for the networks, and finally, methods we are developing to indirectly infer network structure from measured data. I’ll discuss in some detail particular applications to cancer genomics, applying network algorithms to suggest possible drug targets for certain kinds of cancer.
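To make “models of networks” and “processes on networks” concrete, here is a small sketch, entirely my own and not from the talk, that builds a preferential-attachment graph with NetworkX and runs a crude susceptible-infected spread over it:

```python
# Hypothetical sketch: a Barabasi-Albert network model plus a naive SI contagion process.
import random
import networkx as nx

random.seed(0)
G = nx.barabasi_albert_graph(n=1000, m=3, seed=0)   # preferential-attachment model

infected = {0}                                       # seed the spread at one node
beta = 0.05                                          # per-edge transmission probability (made up)
for step in range(20):
    newly = set()
    for u in infected:
        for v in G.neighbors(u):
            if v not in infected and random.random() < beta:
                newly.add(v)
    infected |= newly
    print(f"step {step:2d}: {len(infected)} infected")
```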
One of the challenges in big data analytics lies in being able to reason collectively about extremely large, heterogeneous, incomplete, and noisy interlinked data. We need data science techniques that can represent and reason effectively with this form of rich and multi-relational graph data. In this talk, I will describe some common inference patterns needed for graph data including: collective classification (predicting missing labels for nodes), link prediction (predicting potential edges), and entity resolution (determining when two nodes refer to the same underlying entity). I will describe some key capabilities required to solve these problems, and finally I will describe a highly scalable open-source probabilistic programming language being developed within my group to solve these challenges.
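Of the inference patterns listed above, link prediction is the easiest to sketch. The baseline below scores candidate edges with a Jaccard heuristic in NetworkX; it is only an illustration of the task, not the probabilistic programming language described in the talk:

```python
# Hypothetical link-prediction baseline: score non-edges by the Jaccard coefficient
# of their endpoints' neighborhoods; the highest-scoring pairs are candidate edges.
import networkx as nx

G = nx.karate_club_graph()                 # small example graph
scores = nx.jaccard_coefficient(G)         # by default, iterates over all non-edges

top = sorted(scores, key=lambda t: t[2], reverse=True)[:5]
for u, v, s in top:
    print(f"predicted edge ({u}, {v}) with score {s:.2f}")
```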