Data Science Storytelling: Quantitative UX Research in Google Cloud with Randy Au (Part 2 of 2)

In this second part of my interview with Randy Au, he discusses the techniques he used to teach himself to code and his approach to programming and data science as a social scientist.

Here is Part 1 of our interview.

Prior to joining Google, he spent a decade as a mixture of a data analyst, data scientist, and data engineer at various startups in New York City and before that, studied Communications. In his newsletter, he discusses data science topics like data collection and data quality from a social science perspective. Outside of work he often engages in far too many hobbies, taken to absurd lengths.

Click here to learn more about the Interview Series this is a part of.

More about Randy:

Data Science Storytelling: Quantitative UX Research in Google Cloud with Randy Au (Part 1 of 2)

Randy Au, a Quantitative UX Researcher at Google, explains how he leverages his backgrounds in communication, statistics, and programming as a quantitative UX researcher in Google Cloud to analyze and improve Cloud Storage products.

Here is Part 2 of our interview.

Prior to joining Google, he spent a decade as a mixture of a data analyst, data scientist, and data engineer at various startups in New York City and before that, studied Communications. In his newsletter, he discusses data science topics like data collection and data quality from a social science perspective. Outside of work he often engages in far too many hobbies, taken to an absurd lengths.

Click here to learn more about the Interview Series.

More about Randy:

User-Centric Thinking in Data Science: Conversation with Anna Wu at Google Cloud (Part 3 of 3)

I interviewed Anna Wu, a UX researcher and data scientist overseeing Google Cloud’s Compute Engine. In this final part of the conversation, we discuss how design thinking may useful within data science and machine learning.

Here is the first interview if you would like to start from scratch, and here is more information about Interview Series that this is a part of.

Here is Part 1 and Part 2 of our interview.

Anna Wu, established leader in building and leading high-performing data teams to drive changes impacting hundreds of millions of users. Currently as a research manager at Google, she leads a team of quantitative UX researchers applying UX methods and large scale analytics to inform Cloud product development. 

Before this recent chapter, Anna had 10+ years practicing UX and data science at top IT companies and research labs as a UX researcher, data scientist, research scientist at Microsoft, IBM Research and Palo Alto Research Center. She got her PhD in HCI from Penn State and master/bachelor degrees from Tsinghua University.

Resources:

User-Centric Thinking in Data Science: Conversation with Anna Wu at Google Cloud (Part 1 of 3)

I interviewed Anna Wu, a UX researcher and data scientist overseeing Google Cloud’s Compute Engine, as the next installment of my Interview Series,. In this first part of our conversatoin, she discusses her journey from mechanical engineering into UX research and data science and the importance of effective storytelling within these two fields.

Here is Part 2 and Part 3 of our interview.

Anna Wu, established leader in building and leading high-performing data teams to drive changes impacting hundreds of millions of users. Currently as a research manager at Google, she leads a team of quantitative UX researchers applying UX methods and large scale analytics to inform Cloud product development. 

Before this recent chapter, Anna had 10+ years practicing UX and data science at top IT companies and research labs as a UX researcher, data scientist, research scientist at Microsoft, IBM Research and Palo Alto Research Center. She got her PhD in HCI from Penn State and master/bachelor degrees from Tsinghua University.

Resources mentioned:

User-Centric Thinking in Data Science: Conversation with Anna Wu at Google Cloud (Part 2 of 3)

In this second part of my interview with Anna Wu, she describes the interconnections between data science and qualitative UX research.

Here is the first interview if you would like to start from scratch, and here is more information about Interview Series that this is a part of.

Here is Part 1 and Part 3 of our interview.

Anna Wu, established leader in building and leading high-performing data teams to drive changes impacting hundreds of millions of users. Currently as a research manager at Google, she leads a team of quantitative UX researchers applying UX methods and large scale analytics to inform Cloud product development. 

Before this recent chapter, Anna had 10+ years practicing UX and data science at top IT companies and research labs as a UX researcher, data scientist, research scientist at Microsoft, IBM Research and Palo Alto Research Center. She got her PhD in HCI from Penn State and master/bachelor degrees from Tsinghua University.

Resources mentioned:

Anti-Corruption Anthropologist in Kazakhstan: Interview with Olga Shiyan (Interview #7 in the Interview Series)

I interviewed Olga Shiyan as part of my Interview Series. In it, she discusses her anti-corruption work in Kazakhstan with Transparency International. In particular, she highlights various projects that have integrated anthropology with data science and statistics. 

Olga Shiyan is the Executive Director of the Transparency International’s chapter in Kazakhstan. She specializes in advocacy, legislation and draft laws, and democratic training programs. For this, she has developed research methods that combine anthropology and data science and statistics. In 2019, the Kazakhstan Geographic Society awarder for a medal for anti-corruption work.

To learn more about Olga, feel free to check out the following:

1. Monitoring the state of corruption in Kazakhstan for 2020, presentation

2. Presentation of the research in the media, speaking on a TV show, talk TV show

3. Monitoring the state of corruption in Kazakhstan for 2019

4. The index of civic participation and influence on lawmaking in Kazakhstan

5. 13 stories about lawmaking in Kazakhstan                                             

6. Development of local self-government in Kazakhstan: analysis of fourth-level budgets

7. Ethno-confessional monitoring. Kazakhstan. 2018.

8. Customs corruption in Kazakhstan: mirror analysis of trade

9. Opportunities for Civil Control in Kazakhstan: Experience of Ethnological Research

10. The thorny path of labor migrants to Russia: the experience of participant observation

11. Anthropological approach to the study of the influence of gift exchange on informal socio-economic relations

12. Anthropological approach in the interdisciplinary study of the phenomenon of corruption

13. Summer anti-corruption school of Transparency Kazakhstan

14. Transparency Kazakhstan School of Investigative Journalism

EPIC Data Scientists + Ethnographers Group

I recently organized a professional group called EPIC Data Scientists + Ethnographers along with a few others who are both data scientists and ethnographers. Our goal is to form a virtual community to discuss ways to incorporate ethnography and data science, just like I strive to do on this website.

If you are interested in working with others on this or simply interested in learning more, feel free to join. Whether you are both a data scientist and ethnographer, only one of them, or neither, we would love to hear your perspective.

Thank you, EPIC, for helping to develop this and giving us a platform.

Photo credit: deepak pal at https://www.flickr.com/photos/158301585@N08/46085930481/

Data Visualization 101: The Most Important Rule for Developing a Graph

I suspect everyone has seen a bad graph, a mess of bars, lines, pie slices, or what have you that you dreaded having to look at. Maybe you have even made one, which you look at today and wonder what on earth you were thinking.

These graphs violate the most basic graph-making rule in data visualization:

A graph is like a sentence, expressing one idea.

This rule applies to all uses of graphs, whether you are a data scientist, data analyst, statistician, or just making graphs for your friends for fun.

In grade school, your grammar teachers likely explained that a sentence, at its most basic, expresses on thought or idea. Graphs are visual sentences: they should state one and only one thought or idea about the data.

When you look at a graph, you should be able to say, in one sentence, what the graph is saying: such as “Group A is greater than Group B,” or “Y at first improved but is now declining.” If you cannot, then you have yourself a run-on graph.

For example, the above graph is trying to say too many statements: trying to depict the immigration patterns of twenty-two different countries over the course of nearly a century. There are likely useful statements in this data, but the representation as one graph prevents a viewer/reader from being able to easily decipher them.

Likewise, this graph shows way too many lens sizes to meaningfully express a single, coherent idea, leaving the reader/viewer struggling to determine which fields to focus on.

Potential Objection #1: But I have more to say about the data than a single statement.

 Great! Then provide more than one graph. Say everything you need to say about the data; just use one graph for each of your statements.

            Don’t fall into the One-Graph-to-Rule-Them-All Fallacy: trying to use one graph to express all your statements about the data that ends up a visual mess of incomprehensibility. Create multiple easy-to-read graphs where each graph demonstrates one of your points at a time. Condensing everything into one graph just prevents your viewers from determining what you have to say at all.

Bar Chart, Chart, Statistics, Analytics, Data Analytics
One-Graph-to-Rule-Them-All Fallacy: Trying to use one graph to express all your thoughts about the data that ends up a visual mess of incomprehensibility
Statistics, Graph, Chart, Data, Information, Growth
Instead, use one graph for each of your points

Potential Objection #2: I want the viewers to interpret the findings for themselves, not just impart my own ideas/conclusions.

Fair point. When presenting/communicating data, there is a time for showing your own insights and a time to open-endedly display the information for your viewers/readers to interpret for themselves. Graphs are tools for the former, and for the latter, use tables. Tables, among other potential uses, convey a wide scope of information for the reader/viewer to interpret on their own.

Remember that first example above about U.S. immigration from various parts of Europe? A table (see below) would convey that information much more easily and allow readers to track whatever places, patterns, or questions they would to learn about. Are you in a situation where you would like to report a large amount of information that your readers can use for their own purposes? Then tables are a much better starting point than graphs.

 Some situations require that I lean towards sharing my insights/analysis and others towards encouraging my readers/viewers to form their own conclusions, but since most situations require a combination of the two, I generally combine graphs and tables. I try, when I can, to put smaller tables in the document or slides themselves and, when I cannot, include full tables in an Appendix.

Potential Objection #3: My main idea/point has multiple subpoints.

            Many sentences have multiple subpoints needed to express the single idea as well, which does not prevent the sentence structure from meaningfully capturing those ideas. The fancy grammar word for such a subpoint is a claus. Even though some sentences are simple and straightforward with only one subject and predicate, many (like this very sentence) require multiple sets of subjects and predicates to express its thought.

            Likewise, some graphical ideas require multiple subordinate or compounded subpoints, and there are types of graphs that allow this. Consider Joint Plots, like the one below. To present the relationships and combined distribution between the two variables adequately, they also display each variable’s individual distributions above and to the right. That way, the viewer can see how both distributions might be influencing the combined distribution. Thus, it displays each variable’s distribution on the side like a subordinate clause.

The darker colors in this graph signify a higher density of data points, showing the combined joint distribution of the variables.

These are advanced graphs to make, since like with multi-part sentences, one must present the subpoints carefully to make clear what the main point is. Multi-part sentences, likewise, require carefulness in how to organize multiple clauses cohesively. I intend to write a post later describing how to develop these multi-part graphs in more detail.

The general rule still applies for these more complicated graphs:

Can you summarize what the graph is saying in one coherent sentence?

If you cannot, do not use/show that graph. Our brains are very good at intuiting whether a sentence carries one thought, so use this to determine whether your graph is effective.

Photo/Graph credit #1: kreatikar at https://pixabay.com/illustrations/statistics-graph-chart-data-3411473/

Photo/Graph credit #2: Linux Screenshots at https://www.flickr.com/photos/xmodulo/23635690633/

Photo/Graph credit #3: Andrew Guyton at https://www.flickr.com/photos/disavian/4435971394/

Photo/Graph credit #4: TymonOziemblewski at https://pixabay.com/illustrations/bar-chart-chart-statistics-1264756/

Photo/Graph credit #5 (the first graph again): kreatikar at https://pixabay.com/illustrations/statistics-graph-chart-data-3411473/

Photo/Graph credit #6: Michael Waskom provides a helpful tutorial that formed the inspiration behind the random graph I created.

Three Key Differences between Data Science and Statistics

woman draw a light bulb in white board

Data science’s popularity has grown in the last few years, and many have confused it with its older, more familiar relative: statistics. As someone who has worked both as a data scientist and as a statistician, I frequently encounter such confusion. This post seeks to clarify some of the key differences between them.

Before I get into their differences, though, let’s define them. Statistics as a discipline refers to the mathematical processes of collecting, organizing, analyzing, and communicating data. Within statistics, I generally define “traditional” statistics as the the statistical processes taught in introductory statistics courses like basic descriptive statistics, hypothesis testing, confidence intervals, and so on: generally what people outside of statistics, especially in the business world, think of when they hear the word “statistics.”

Data science in its most broad sense is the multi-disciplinary science of organizing, processing, and analyzing computational data to solve problems. Although they are similar, data science differs from both statistics and “traditional” statistics:

DifferenceStatistics Data Science
#1 Field of Mathematics Interdisciplinary
#2 Sampled Data Comprehensive Data
#3 Confirming Hypothesis Exploratory Hypotheses

Difference #1: Data Science Is More than a Field of Mathematics

Statistics is a field of mathematics; whereas, data science refers to more than just math. At its simplest, data science centers around the use of computational data to solve problems,[i] which means it includes the mathematics/statistics needed to break down the computational data but also the computer science and engineering thinking necessary to code those algorithms efficiently and effectively, and the business, policy, or other subject-specific “smarts” to develop strategic decision-making based on that analysis.

Thus, statistics forms a crucial component of data science, but data science includes more than just statistics. Statistics, as a field of mathematics, just includes the mathematical processes of analyzing and interpreting data; whereas, data science also includes the algorithmic problem-solving to do the analysis computationally and the art of utilizing that analysis to make decisions to meet the practical needs in the context. Statistics clearly forms a crucial part of the process of data science, but data science generally refers to the entire process of analyzing computational data. On a practical level, many data scientists do not come from a pure statistics background but from a computer science or engineering, leveraging their coding expertise to develop efficient algorithmic systems.

laptop computer on glass-top table

Difference #2: Comprehensive vs Sample Data

In statistical studies, researchers are often unable to analyze the entire population, that is the whole group they are analyzing, so instead they create a smaller, more manageable sample of individuals that they hope represents the population as a whole. Data science projects, however, often involves analyzing big, summative data, encapsulating the entire population.

 The tools of traditional statistics work well for scientific studies, where one must go out and collect data on the topic in question. Because this is generally very expensive and time-consuming, researchers can only collect data on a subset of the wider population most of the time.

Recent developments in computation, including the ability to gather, store, transfer, and process greater computational data, have expanded the type of quantitative research now possible, and data science has developed to address these new types of research. Instead of gathering a carefully chosen sample of the population based on a heavily scrutinized set of variables, many data science projects require finding meaningful insights from the myriads of data already collected about the entire population.

stack of jigsaw puzzle pieces

Difference #3: Exploratory vs Confirming  

Data scientists often seek to build models that do something with the data; whereas, statisticians through their analysis seek to learn something from the data. Data scientists thus often assess their machine learning models based on how effectively they perform a given task, like how well it optimizes a variable, determines the best course of action, correctly identifies features of an image, provides a good recommendation for the user, and so on. To do this, data scientists often compare the effectiveness or accuracy of the many models based on a chosen performance metric(s).

In traditional statistics, the questions often center around using data to understand the research topic based on the findings from a sample. Questions then center around what the sample can say about the wider population and how likely its results would represent or apply to that wider population.

In contrast, machine learning models generally do not seek to explain the research topic but to do something, which can lead to very different research strategy. Data scientists generally try to determine/produce the algorithm with the best performance (given whatever criteria they use to assess how a performance is “better”), testing many models in the process. Statisticians often employ a single model they think represents the context accurately and then draw conclusions based on it.

Thus, data science is often a form of exploratory analysis, experimenting with several models to determine the best one for a task, and statistics confirmatory analysis, seeking to confirm how reasonable it is to conclude a given hypothesis or hypotheses to be true for the wider population.

A lot of scientific research has been theory confirming: a scientist has a model or theory of the world; they design and conduct an experiment to assess this model; then use hypothesis testing to confirm or negate that model based on the results of the experiment. With changes in data availability and computing, the value of exploratory analysis, data mining, and using data to generate hypotheses has increased dramatically (Carmichael 126).

Data science as a discipline has been at the forefront of utilizing increased computing abilities to conduct exploratory work.

person holding gold-colored pocket watch

Conclusion

 A data scientist friend of mine once quipped to me that data science simply is applied computational statistics (c.f. this). There is some truth in this: the mathematics of data science work falls within statistics, since it involves collecting, analyzing, and communicating data, and, with its emphasis and utilization of computational data, would definitely be a part of computational statistics. The mathematics of data science is also very clearly applied: geared towards solving practical problems/needs. Hence, data science and statistics interrelate.

They differ, however, both in their formal definitions and practical understandings. Modern computation and big data technologies have had a major influence on data science. Within statistics, computational statistics also seeks to leverage these resources, but what has become “traditional” statistics does not (yet) incorporate these. I suspect in the next few years or decades, developments in modern computing, data science, and computational statistics will reshape what people consider “traditional” or “standard” statistics to be a bit closer to the data science of today.

   For more details, see the following useful resources:

Ian Carmichael’s and J.S. Marron’s “Data science vs. statistics: two cultures?” in the Japanese Journal of Statistics and Data Science: https://link.springer.com/article/10.1007/s42081-018-0009-3
“Data Scientists Versus Statisticians” at https://opendatascience.com/data-scientists-versus-statisticians/ and https://medium.com/odscjournal/data-scientists-versus-statisticians-8ea146b7a47f
“Differences between Data Science and Statistics” at https://www.educba.com/data-science-vs-statistics/

Photo credit #1: Andrea Piacquadio at https://www.pexels.com/photo/woman-draw-a-light-bulb-in-white-board-3758105/

Photo credit #2: Carlos Muza at https://unsplash.com/photos/hpjSkU2UYSU

Photo credit #3: Hans-Peter Gauster at https://unsplash.com/photos/3y1zF4hIPCg

Photo credit #4: Kendall Lane at https://unsplash.com/photos/yEDhhN5zP4o


[i] Carmichael 118.

Data Visualization 102: The Most Important Rules for Making Data Tables

In a previous post about data visualization in data science and statistics, I discussed what I consider the single most important rule of graphing data. In this post, I am following up to discuss the most important rules for making data tables. I will focus on data tables in reporting/communicating findings to others, as opposed to the many other uses of tables in data science say to store, organize, and mine data.

To summarize, graphs are like sentences, conveying one clear thought to the viewer/reader. Tables, on the other hand, can function more like paragraphs, conveying multiple sentences or thoughts to get an overall idea. Unlike graphs, which often provide one thought, tables can be more exploratory, providing information for the viewer/reader to analyze and draw his or her own conclusions from.

Table Rule #1: Don’t be afraid to provide as much or as little information as you need.

Paragraphs can use multiple sentences to convey a series of thoughts/statements, and tables are no different. One can convey multiple pieces of information that viewers/readers can look through and analyze at his or her own leisure, using the data to answer their own questions, so feel free to take up the space as you need. Several page long tables are fair game and, in many cases, absolutely necessary (although often end up in appendices for readers/viewers needing a more in-depth take).

In my previous data visualization post, I gave this bar chart as an example of trying to say too many statements for a graph:

This is a paragraphs-worth of information, and a table would represent it much better.[i] In a table, the reader/viewer can explore the table values by country and year themselves and answer whatever questions he or she might have. For example, if someone wanted to analyze how a specific country changed overtime, he or she could do so easily with a table, and/or if he or want to analyze compare the immigration ratios between countries of a specific decade, that is possible as well. In the graph above, each country’s subsegment starts in a different place vertically for each decade column, making it hard to compare the sizes visually, and since each decade has dozens of values, that the latter analysis is visually difficult to decipher as well.

But, at the same time, do not be afraid to convey a sentence- or graphs-worth of data into a table, especially when such data is central for what you are saying. Sometimes writers include one-sentence paragraphs when that single thought is crucial, and likewise, a single statement table can have a similar effect. For example, writing a table for a single variable does helps convey that that variable is important:

Gender Some Crucial Result
Male 36%
Female 84%

Now, sometimes in these single statement instances, you might want to use a graph instead of a table (or both), which I discuss in more detail in Rule #3.

Table Rule #2: Keep columns consistent for easy scanning.

I have found that when viewers/readers scan tables, they generally subconsciously assume that all variables in a column are the same: same units and type of value. Changing values of a column between rows can throw off your viewer/reader when he or she looks at it. For example, consider this made-up study data:

  Control Group (n = 100) Experimental Group (n = 100)
Mean Age 45 44
Median Age 43 42
    Male No. (%) 45 (45%) 36 (36%)
    Female No. (%) 55 (55%) 64 (64%)

In this table, the rows each mean different values and/or units. So, for example, going down the control column, the first column is mean age measured in years. The second column switches to median age, a different type of value than mean (although the same unit of years). The final two rows convey the number and percentages of males and females of each: both a different type of value and a different unit (number and percent unlike years). This can be jarring for viewers/readers who often expect columns to be of the same values and units and naturally compare them as if they are similar types of values.

I would recommend transposing it like this, so that the columns represent the similar variables and the rows the two groups:

  Mean Age Median Age (IQR)     Male No. (%)     Female No. (%)
Control Group (n = 100) 45 43 (25, 65) 45 (45%) 55 (55%)
Experimental Group (n = 100) 44 42 (27, 63) 36 (36%) 64 (64%)

Table Rule #3: Don’t be afraid to also use a graph to convey magnitude, proportion, or scale

A table like the gender table in Rule #1 conveys pertinent information numerically, but numbers themselves do not visually show the difference between the values.

Gender Some Crucial Result
Male 36%
Female 84%

Graphs excel at visually depicting the magnitude, proportion, and/or scale of data, so, if in this example, it is important to convey how much greater the “Some Crucial Result” is for females than males, then a basic bar graph allows the reader/viewer to see that the percent is more than double for the females than for the males.

Now, to convey this visual clarity, the graph loses the ability to precisely relate the exact numbers. For example, looking at only this graph, a reader/viewer might be unsure whether the males are at 36%, 37%, or 38%. People have developed many graphing strategies to deal with this (ranging from making the grid lines sharper, writing the exact numbers on top of, next to, or around the segment, among others), but combining the graph and table in instances where one both needs to convey the exact numbers and to convey a sense of their magnitude, proportion, or scale can also work well:

Finally, given that tables can convey multiple statements, feel free to use several graphs to depict the magnitude, proportion, or scale of one table. Do not try to overload a multi-statement table into a single, incomprehensible graph. Break down each statement you are trying to relate with that table and depict each separately in a single graph.

Conclusion

If graphs are sentences, then tables can function more like paragraphs, conveying a large amount of information that make more than one thought or statement. This gives space for your reader/viewer to explore the data and interpret it on their own to answer whatever questions they have.

Photo/Table credit #1: Mika Baumeister at https://unsplash.com/photos/Wpnoqo2plFA

Photo/Table credit #2: Linux Screenshots at https://www.flickr.com/photos/xmodulo/23635690633/


[i] Unfortunately, I do not have the data myself that this chart uses, or I would make a table for it to show what I mean.