In this second part of my interview with Randy Au, he discusses the techniques he used to teach himself to code and his approach to programming and data science as a social scientist.
Prior to joining Google, he spent a decade as a mixture of a data analyst, data scientist, and data engineer at various startups in New York City and before that, studied Communications. In his newsletter, he discusses data science topics like data collection and data quality from a social science perspective. Outside of work he often engages in far too many hobbies, taken to absurd lengths.
Click here to learn more about the Interview Series this is a part of.
Randy Au, a Quantitative UX Researcher at Google, explains how he leverages his backgrounds in communication, statistics, and programming as a quantitative UX researcher in Google Cloud to analyze and improve Cloud Storage products.
Prior to joining Google, he spent a decade as a mixture of a data analyst, data scientist, and data engineer at various startups in New York City and before that, studied Communications. In his newsletter, he discusses data science topics like data collection and data quality from a social science perspective. Outside of work he often engages in far too many hobbies, taken to absurd lengths.
Click here to learn more about the Interview Series.
I interviewed Anna Wu, a UX researcher and data scientist overseeing Google Cloud’s Compute Engine. In this final part of the conversation, we discuss how design thinking may be useful within data science and machine learning.
Here is the first interview if you would like to start from scratch, and here is more information about the Interview Series that this is a part of.
Anna Wu is an established leader in building and leading high-performing data teams to drive changes impacting hundreds of millions of users. Currently a research manager at Google, she leads a team of quantitative UX researchers applying UX methods and large-scale analytics to inform Cloud product development.
Before this recent chapter, Anna had 10+ years practicing UX and data science at top IT companies and research labs as a UX researcher, data scientist, and research scientist at Microsoft, IBM Research, and Palo Alto Research Center. She got her PhD in HCI from Penn State and her master's and bachelor's degrees from Tsinghua University.
I interviewed Anna Wu, a UX researcher and data scientist overseeing Google Cloud’s Compute Engine, as the next installment of my Interview Series. In this first part of our conversation, she discusses her journey from mechanical engineering into UX research and data science and the importance of effective storytelling within these two fields.
Anna Wu is an established leader in building and leading high-performing data teams to drive changes impacting hundreds of millions of users. Currently a research manager at Google, she leads a team of quantitative UX researchers applying UX methods and large-scale analytics to inform Cloud product development.
Before this recent chapter, Anna had 10+ years practicing UX and data science at top IT companies and research labs as a UX researcher, data scientist, and research scientist at Microsoft, IBM Research, and Palo Alto Research Center. She got her PhD in HCI from Penn State and her master's and bachelor's degrees from Tsinghua University.
I interviewed Olga Shiyan as part of my Interview Series. In it, she discusses her anti-corruption work in Kazakhstan with Transparency International. In particular, she highlights various projects that have integrated anthropology with data science and statistics.
Olga Shiyan is the Executive Director of Transparency International’s chapter in Kazakhstan. She specializes in advocacy, legislation and draft laws, and democratic training programs. For this, she has developed research methods that combine anthropology with data science and statistics. In 2019, the Kazakhstan Geographic Society awarded her a medal for anti-corruption work.
To learn more about Olga, feel free to check out the following:
I recently organized a professional group called EPIC Data Scientists + Ethnographers along with a few others who are both data scientists and ethnographers. Our goal is to form a virtual community to discuss ways to incorporate ethnography and data science, just like I strive to do on this website.
If you are interested in working with others on this or simply interested in learning more, feel free to join. Whether you are both a data scientist and ethnographer, only one of them, or neither, we would love to hear your perspective.
Thank you, EPIC, for helping to develop this and giving us a platform.
I suspect everyone has seen a bad graph, a mess of bars, lines, pie slices, or what have you that you dreaded having to look at. Maybe you have even made one, which you look at today and wonder what on earth you were thinking.
These graphs violate the most basic graph-making rule in data visualization:
A graph is like a sentence, expressing one idea.
This rule applies to all uses of graphs, whether you are a data scientist, data analyst, statistician, or just making graphs for your friends for fun.
In grade school, your grammar teachers likely explained that a sentence, at its most basic, expresses one thought or idea. Graphs are visual sentences: they should state one and only one thought or idea about the data.
When you look at a graph, you should be able to say, in one sentence, what the graph is saying: such as “Group A is greater than Group B,” or “Y at first improved but is now declining.” If you cannot, then you have yourself a run-on graph.
For example, the above graph is trying to make too many statements: it tries to depict the immigration patterns of twenty-two different countries over the course of nearly a century. There are likely useful statements in this data, but the representation as one graph prevents a viewer/reader from being able to easily decipher them.
Likewise, this graph shows way too many lens sizes to meaningfully express a single, coherent idea, leaving the reader/viewer struggling to determine which fields to focus on.
Potential Objection #1: But I have more to say about the data than a single statement.
Great! Then provide more than one graph. Say everything you need to say about the data; just use one graph for each of your statements.
Don’t fall into the One-Graph-to-Rule-Them-All Fallacy: trying to use one graph to express all your statements about the data that ends up a visual mess of incomprehensibility. Create multiple easy-to-read graphs where each graph demonstrates one of your points at a time. Condensing everything into one graph just prevents your viewers from determining what you have to say at all.
Potential Objection #2: I want the viewers to interpret the findings for themselves, not just impart my own ideas/conclusions.
Fair point. When presenting/communicating data, there is a time for showing your own insights and a time to open-endedly display the information for your viewers/readers to interpret for themselves. Graphs are tools for the former, and for the latter, use tables. Tables, among other potential uses, convey a wide scope of information for the reader/viewer to interpret on their own.
Remember that first example above about U.S. immigration from various parts of Europe? A table (see below) would convey that information much more easily and allow readers to track whatever places, patterns, or questions they would like to learn about. Are you in a situation where you would like to report a large amount of information that your readers can use for their own purposes? Then tables are a much better starting point than graphs.
Some situations require that I lean towards sharing my insights/analysis and others towards encouraging my readers/viewers to form their own conclusions, but since most situations require a combination of the two, I generally combine graphs and tables. I try, when I can, to put smaller tables in the document or slides themselves and, when I cannot, include full tables in an Appendix.
Potential Objection #3: My main idea/point has multiple subpoints.
Many sentences have multiple subpoints needed to express the single idea as well, which does not prevent the sentence structure from meaningfully capturing those ideas. The fancy grammar word for such a subpoint is a clause. Even though some sentences are simple and straightforward with only one subject and predicate, many (like this very sentence) require multiple sets of subjects and predicates to express their thought.
Likewise, some graphical ideas require multiple subordinate or compounded subpoints, and there are types of graphs that allow this. Consider Joint Plots, like the one below. To present the relationships and combined distribution between the two variables adequately, they also display each variable’s individual distributions above and to the right. That way, the viewer can see how both distributions might be influencing the combined distribution. Thus, it displays each variable’s distribution on the side like a subordinate clause.
These are advanced graphs to make, since like with multi-part sentences, one must present the subpoints carefully to make clear what the main point is. Multi-part sentences, likewise, require carefulness in how to organize multiple clauses cohesively. I intend to write a post later describing how to develop these multi-part graphs in more detail.
The general rule still applies for these more complicated graphs:
Can you summarize what the graph is saying in one coherent sentence?
If you cannot, do not use/show that graph. Our brains are very good at intuiting whether a sentence carries one thought, so use this to determine whether your graph is effective.
Data science’s popularity has grown in the last few years, and many have confused it with its older, more familiar relative: statistics. As someone who has worked both as a data scientist and as a statistician, I frequently encounter such confusion. This post seeks to clarify some of the key differences between them.
Before I get into their differences, though, let’s define them. Statistics as a discipline refers to the mathematical processes of collecting, organizing, analyzing, and communicating data. Within statistics, I generally define “traditional” statistics as the statistical processes taught in introductory statistics courses like basic descriptive statistics, hypothesis testing, confidence intervals, and so on: generally what people outside of statistics, especially in the business world, think of when they hear the word “statistics.”
Data science in its broadest sense is the multi-disciplinary science of organizing, processing, and analyzing computational data to solve problems. Although they are similar, data science differs from both statistics and “traditional” statistics:
| Difference | Statistics | Data Science |
|---|---|---|
| #1 | Field of Mathematics | Interdisciplinary |
| #2 | Sampled Data | Comprehensive Data |
| #3 | Confirming Hypothesis | Exploratory Hypotheses |
Difference #1: Data Science Is More than a Field of Mathematics
Statistics is a field of mathematics, whereas data science refers to more than just math. At its simplest, data science centers around the use of computational data to solve problems,[i] which means it includes the mathematics/statistics needed to break down the computational data but also the computer science and engineering thinking necessary to code those algorithms efficiently and effectively, and the business, policy, or other subject-specific “smarts” to develop strategic decision-making based on that analysis.
Thus, statistics forms a crucial component of data science, but data science includes more than just statistics. Statistics, as a field of mathematics, just includes the mathematical processes of analyzing and interpreting data, whereas data science also includes the algorithmic problem-solving to do the analysis computationally and the art of utilizing that analysis to make decisions that meet the practical needs of the context. Data science, in other words, generally refers to the entire process of analyzing computational data. On a practical level, many data scientists do not come from a pure statistics background but from computer science or engineering, leveraging their coding expertise to develop efficient algorithmic systems.
Difference #2: Comprehensive vs Sample Data
In statistical studies, researchers are often unable to analyze the entire population, that is, the whole group they are studying, so instead they create a smaller, more manageable sample of individuals that they hope represents the population as a whole. Data science projects, however, often involve analyzing big, summative data, encapsulating the entire population.
The tools of traditional statistics work well for scientific studies, where one must go out and collect data on the topic in question. Because this is generally very expensive and time-consuming, researchers can only collect data on a subset of the wider population most of the time.
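As a minimal sketch of this contrast, assuming Python and a synthetic population of ages (the values, the sample size of 200, and all variable names are made up for illustration), the sampled approach estimates the population mean with an interval, while the comprehensive approach simply computes it:

```python
import random
import statistics

# Hypothetical population: ages of 10,000 people (synthetic values, not real data).
rng = random.Random(42)
population = [rng.randint(18, 90) for _ in range(10_000)]

# Sampled data: measure only a manageable random subset of the population...
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)

# ...and express the uncertainty about the population mean (rough 95% interval).
standard_error = statistics.stdev(sample) / len(sample) ** 0.5
interval = (sample_mean - 1.96 * standard_error, sample_mean + 1.96 * standard_error)

# Comprehensive data: every record is already collected, so compute the value directly.
population_mean = statistics.mean(population)

print(f"sample estimate:  {sample_mean:.1f} (95% CI {interval[0]:.1f} to {interval[1]:.1f})")
print(f"population value: {population_mean:.1f}")
```

The interval exists only because the sample stands in for a population that was too expensive to measure; with comprehensive data that inferential step is not needed.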
Recent developments in computation, including the ability to gather, store, transfer, and process greater computational data, have expanded the type of quantitative research now possible, and data science has developed to address these new types of research. Instead of gathering a carefully chosen sample of the population based on a heavily scrutinized set of variables, many data science projects require finding meaningful insights from the myriads of data already collected about the entire population.
Difference #3: Exploratory vs Confirming
Data scientists often seek to build models that do something with the data, whereas statisticians through their analysis seek to learn something from the data. Data scientists thus often assess their machine learning models based on how effectively they perform a given task, like how well they optimize a variable, determine the best course of action, correctly identify features of an image, provide a good recommendation for the user, and so on. To do this, data scientists often compare the effectiveness or accuracy of many models based on one or more chosen performance metrics.
In traditional statistics, the questions often center around using data to understand the research topic based on the findings from a sample. Questions then center around what the sample can say about the wider population and how likely its results would represent or apply to that wider population.
In contrast, machine learning models generally do not seek to explain the research topic but to do something, which can lead to a very different research strategy. Data scientists generally try to determine/produce the algorithm with the best performance (given whatever criteria they use to assess how a performance is “better”), testing many models in the process. Statisticians often employ a single model they think represents the context accurately and then draw conclusions based on it.
Thus, data science is often a form of exploratory analysis, experimenting with several models to determine the best one for a task, and statistics confirmatory analysis, seeking to confirm how reasonable it is to conclude a given hypothesis or hypotheses to be true for the wider population.
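The exploratory workflow can be sketched roughly as follows, assuming Python, a tiny made-up dataset, three hypothetical candidate models (`fit_mean`, `fit_line`, `fit_last_value`), and mean absolute error as the chosen performance metric. None of these choices come from a particular project; they only illustrate trying several models and keeping the best performer:

```python
import statistics

# Hypothetical (x, y) data, split into training and held-out sets.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8)]
held_out = [(6, 12.1), (7, 13.8), (8, 16.3)]

def fit_mean(data):
    """Candidate 1: always predict the average training y."""
    average = statistics.mean([y for _, y in data])
    return lambda x: average

def fit_line(data):
    """Candidate 2: least-squares straight line through the training data."""
    mean_x = statistics.mean([x for x, _ in data])
    mean_y = statistics.mean([y for _, y in data])
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_last_value(data):
    """Candidate 3: always predict the most recent training y."""
    last = data[-1][1]
    return lambda x: last

def mean_absolute_error(model, data):
    """Performance metric: average absolute prediction error."""
    return statistics.mean([abs(model(x) - y) for x, y in data])

# Fit every candidate on the training data and score it on the held-out data.
candidates = {"mean": fit_mean, "line": fit_line, "last value": fit_last_value}
scores = {name: mean_absolute_error(fit(train), held_out)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {score:.2f}")
print("best model:", best)
```

A confirmatory analysis would instead commit to one model up front, such as the straight line, and ask how confidently its conclusions extend to the wider population.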
A lot of scientific research has been theory confirming: a scientist has a model or theory of the world; they design and conduct an experiment to assess this model; then use hypothesis testing to confirm or negate that model based on the results of the experiment. With changes in data availability and computing, the value of exploratory analysis, data mining, and using data to generate hypotheses has increased dramatically (Carmichael 126).
Data science as a discipline has been at the forefront of utilizing increased computing abilities to conduct exploratory work.
Conclusion
A data scientist friend of mine once quipped to me that data science simply is applied computational statistics (cf. this). There is some truth in this: the mathematics of data science work falls within statistics, since it involves collecting, analyzing, and communicating data, and, with its emphasis on and utilization of computational data, would definitely be a part of computational statistics. The mathematics of data science is also very clearly applied: geared towards solving practical problems/needs. Hence, data science and statistics interrelate.
They differ, however, both in their formal definitions and practical understandings. Modern computation and big data technologies have had a major influence on data science. Within statistics, computational statistics also seeks to leverage these resources, but what has become “traditional” statistics does not (yet) incorporate these. I suspect in the next few years or decades, developments in modern computing, data science, and computational statistics will reshape what people consider “traditional” or “standard” statistics to be a bit closer to the data science of today.
For more details, see the following useful resources:
In a previous post about data visualization in data science and statistics, I discussed what I consider the single most important rule of graphing data. In this post, I am following up to discuss the most important rules for making data tables. I will focus on data tables in reporting/communicating findings to others, as opposed to the many other uses of tables in data science, such as storing, organizing, and mining data.
To summarize, graphs are like sentences, conveying one clear thought to the viewer/reader. Tables, on the other hand, can function more like paragraphs, conveying multiple sentences or thoughts to get an overall idea. Unlike graphs, which often provide one thought, tables can be more exploratory, providing information for the viewer/reader to analyze and draw his or her own conclusions from.
Table Rule #1: Don’t be afraid to provide as much or as little information as you need.
Paragraphs can use multiple sentences to convey a series of thoughts/statements, and tables are no different. One can convey multiple pieces of information that viewers/readers can look through and analyze at their own leisure, using the data to answer their own questions, so feel free to take up the space you need. Tables that run several pages are fair game and, in many cases, absolutely necessary (although they often end up in appendices for readers/viewers needing a more in-depth take).
In my previous data visualization post, I gave this bar chart as an example of trying to make too many statements in one graph:
This is a paragraph’s worth of information, and a table would represent it much better.[i] In a table, the reader/viewer can explore the table values by country and year themselves and answer whatever questions he or she might have. For example, if someone wanted to analyze how a specific country changed over time, he or she could do so easily with a table, and if he or she wanted to compare the immigration ratios between countries in a specific decade, that is possible as well. In the graph above, each country’s subsegment starts in a different place vertically for each decade column, making it hard to compare the sizes visually, and since each decade has dozens of values, the latter analysis is visually difficult to decipher as well.
But, at the same time, do not be afraid to put a sentence’s or graph’s worth of data into a table, especially when such data is central to what you are saying. Sometimes writers include one-sentence paragraphs when that single thought is crucial, and likewise, a single-statement table can have a similar effect. For example, writing a table for a single variable helps convey that the variable is important:
| Gender | Some Crucial Result |
|---|---|
| Male | 36% |
| Female | 84% |
Now, sometimes in these single-statement instances, you might want to use a graph instead of a table (or both), which I discuss in more detail in Rule #3.
Table Rule #2: Keep columns consistent for easy scanning.
I have found that when viewers/readers scan tables, they generally subconsciously assume that all variables in a column are the same: same units and type of value. Changing values of a column between rows can throw off your viewer/reader when he or she looks at it. For example, consider this made-up study data:
| | Control Group (n = 100) | Experimental Group (n = 100) |
|---|---|---|
| Mean Age | 45 | 44 |
| Median Age | 43 | 42 |
| Male No. (%) | 45 (45%) | 36 (36%) |
| Female No. (%) | 55 (55%) | 64 (64%) |
In this table, the rows each hold different values and/or units. So, for example, going down the control column, the first row is mean age measured in years. The second row switches to median age, a different type of value than mean (although the same unit of years). The final two rows convey the number and percentage of males and females in each group: both a different type of value and a different unit (number and percent, unlike years). This can be jarring for viewers/readers, who often expect columns to hold the same values and units and naturally compare them as if they were similar types of values.
I would recommend transposing it like this, so that the columns represent the similar variables and the rows the two groups:
| | Mean Age | Median Age (IQR) | Male No. (%) | Female No. (%) |
|---|---|---|---|---|
| Control Group (n = 100) | 45 | 43 (25, 65) | 45 (45%) | 55 (55%) |
| Experimental Group (n = 100) | 44 | 42 (27, 63) | 36 (36%) | 64 (64%) |
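If the table lives in code rather than in a document, the transposition itself is nearly a one-liner. Here is a minimal sketch, assuming Python and the made-up study data from the first version of the table (so without the IQR values); the variable names are my own:

```python
# Made-up study data from above, with one row per statistic (the hard-to-scan layout).
header = ["", "Control Group (n = 100)", "Experimental Group (n = 100)"]
rows = [
    ["Mean Age", "45", "44"],
    ["Median Age", "43", "42"],
    ["Male No. (%)", "45 (45%)", "36 (36%)"],
    ["Female No. (%)", "55 (55%)", "64 (64%)"],
]

# zip(...) pairs up the i-th cell of every row, turning rows into columns,
# so each statistic becomes its own column and each group its own row.
transposed = [list(column) for column in zip(header, *rows)]

# Pad every cell to its column's width so the printed columns line up.
widths = [max(len(row[i]) for row in transposed) for i in range(len(transposed[0]))]
for row in transposed:
    print("  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip())
```

After the transposition, every column holds one type of value in one unit, which is what readers scanning down a column tend to expect.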
Table Rule #3: Don’t be afraid to also use a graph to convey magnitude, proportion, or scale
A table like the gender table in Rule #1 conveys pertinent information numerically, but numbers themselves do not visually show the difference between the values.
| Gender | Some Crucial Result |
|---|---|
| Male | 36% |
| Female | 84% |
Graphs excel at visually depicting the magnitude, proportion, and/or scale of data, so, if in this example, it is important to convey how much greater the “Some Crucial Result” is for females than males, then a basic bar graph allows the reader/viewer to see that the percent is more than double for the females than for the males.
Now, to convey this visual clarity, the graph loses the ability to relate the exact numbers precisely. For example, looking at only this graph, a reader/viewer might be unsure whether the males are at 36%, 37%, or 38%. People have developed many graphing strategies to deal with this (such as making the grid lines sharper or writing the exact numbers on top of, next to, or around the segment), but combining the graph and table also works well in instances where one needs to convey both the exact numbers and a sense of their magnitude, proportion, or scale:
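As a rough text-only sketch of that combination, assuming Python and a hypothetical helper I am calling `render_bar_row` (a stand-in for a real plotting tool, not the figure itself), each row pairs a bar for magnitude with the exact percentage beside it:

```python
# Values from the gender table; "Some Crucial Result" is the placeholder variable name.
results = {"Male": 36, "Female": 84}

def render_bar_row(label, percent, scale=2):
    """Return one text row: the label, a bar of one '#' per `scale` percent, and the exact value."""
    bar = "#" * round(percent / scale)
    return f"{label:<8}{bar} {percent}%"

lines = [render_bar_row(label, percent) for label, percent in results.items()]
print("Some Crucial Result")
print("\n".join(lines))
```

The bar lengths show at a glance that the female value is more than double the male value, while the printed percentages remove any doubt about the exact numbers.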
Finally, given that tables can convey multiple statements, feel free to use several graphs to depict the magnitude, proportion, or scale of one table. Do not try to overload a multi-statement table into a single, incomprehensible graph. Break down each statement you are trying to relate with that table and depict each separately in a single graph.
Conclusion
If graphs are sentences, then tables can function more like paragraphs, conveying a large amount of information that makes more than one thought or statement. This gives space for your reader/viewer to explore the data and interpret it on their own to answer whatever questions they have.