How do we build relatable machine learning models that regular people can understand? This is a presentation about how design principles apply to the development of machine learning systems. Too often in data science, machine learning software is not built with the regular people who will interact with it in mind.
I argue that in order to make machine learning software relatable, we need to use design thinking to intentionally build in mechanisms for users to form their own mental models of how the software works. Failing to include these mechanisms helps cultivate the common perception among users that machine learning is a black box.
I gave three different versions of this talk at Quant UX Con on June 8th, 2022, the Royal Institute of Anthropology’s annual conference on June 10th, 2022, and Google’s AI + Design Tooling Research Symposium on August 5th, 2022.
I hope you find it interesting and feel free to share any thoughts you might have.
Thank you to the conference and talk organizers for making this happen, and I appreciate all the insightful conversations I had about the role of design thinking in building relatable machine learning.
As an anthropologist and data scientist, I often feel caught in the middle of two distinct warring factions. Anthropologists and data scientists have inherited a historic debate between quantitative and qualitative methodologies in social research within modern Western societies. At its core, this debate has centered on the difference between objective, prescriptive, top-down techniques and subjective, situational, flexible, descriptive, bottom-up approaches.[i] In the ensuing conflict, quantitative research has been placed in the top-down faction and qualitative research in the bottom-up faction, to the detriment of understanding both properly.
In my experience on both “sides,” I have seen a tendency among anthropologists to lump together all quantitative social research as prescriptive and top-down and thus miss important subtleties within data science and other quantitative techniques. Machine learning techniques represent a partial shift within the field towards bottom-up, situational, and iterative quantitative analysis, and business anthropologists should explore what data scientists do as a chance to redevelop their relationship with quantitative analysis.
Shifts in Machine Learning
Shifts within machine learning algorithm development give impetus for incorporating quantitative techniques that are local and interpretive. The debate between top-down and bottom-up knowledge production does not need, or at least may no longer need, to divide quantitative and qualitative techniques. Machine learning algorithms “leave open the possibility of situated knowledge production, entangled with narrative,” a clear parallel to qualitative ethnographic techniques.[ii]
At the same time, this shift towards iterative and flexible machine learning techniques is not total within data science: aspects of top-down frameworks remain in its personnel, objectives, habits, strategies, and evaluation criteria. But seeds of bottom-up thinking exist prominently within data science, with the potential to significantly reshape the field and possibly quantitative analysis in general.
As a discipline, data science is in a uniquely formative and adolescent period, still developing its “standard” practices. This leads to significant fluctuations as the data scientist community defines its methodology. The set of practices that we now typically call “traditional” or “standard” statistics, generally taught in introductory statistics courses, developed over several decades in the late nineteenth and early twentieth centuries, especially in Britain.[iii] Connected with recent computer technology, data science is in a similarly formative period right now, developing its standard techniques and ways of thinking. This formative period is a strategic time for anthropologists to encourage bottom-up quantitative techniques.
Conclusion
Business anthropologists could and should be instrumental in helping to develop and innovatively utilize these situational and iterative machine learning techniques. This is a strategic time for business anthropologists to do the following:
Immerse themselves in data science and encourage and cultivate bottom-up quantitative machine learning techniques within data science
Cultivate and incorporate (when applicable) situational and iterative machine learning approaches in their ethnographies
For both, anthropologists should use the strengths of ethnographic and anthropological thinking to help develop bottom-up machine learning that is grounded in and flexible to specific local contexts. Each requires business anthropologists to reexplore their relationship with data science and machine learning instead of treating them as part of an opposing “methodological clan.”[iv]
[i] Nafus, D., & Knox, H. (2018). Ethnography for a Data-Saturated World. Manchester: Manchester University Press, 11-12.
Data science’s popularity has grown in the last few years, and many have confused it with its older, more familiar relative: statistics. As someone who has worked both as a data scientist and as a statistician, I frequently encounter such confusion. This post seeks to clarify some of the key differences between them.
Before I get into their differences, though, let’s define them. Statistics as a discipline refers to the mathematical processes of collecting, organizing, analyzing, and communicating data. Within statistics, I generally define “traditional” statistics as the statistical processes taught in introductory statistics courses, like basic descriptive statistics, hypothesis testing, confidence intervals, and so on: generally what people outside of statistics, especially in the business world, think of when they hear the word “statistics.”
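As a quick illustration of that “traditional” toolkit, here is a minimal sketch using NumPy and SciPy on made-up data: a descriptive statistic, a confidence interval, and a hypothesis test. The numbers are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
data = rng.normal(loc=100.0, scale=15.0, size=40)  # made-up measurements

mean = data.mean()  # basic descriptive statistic

# 95% confidence interval for the population mean (t distribution).
ci = stats.t.interval(0.95, df=len(data) - 1,
                      loc=mean, scale=stats.sem(data))

# One-sample hypothesis test: is the population mean 100?
t_stat, p_value = stats.ttest_1samp(data, popmean=100.0)
```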
Data science, in its broadest sense, is the multi-disciplinary science of organizing, processing, and analyzing computational data to solve problems. Although the two are similar, data science differs from both statistics and “traditional” statistics:
Difference | Statistics | Data Science
#1 | Field of Mathematics | Interdisciplinary
#2 | Sampled Data | Comprehensive Data
#3 | Confirming Hypotheses | Exploratory Hypotheses
Difference #1: Data Science Is More than a Field of Mathematics
Statistics is a field of mathematics, whereas data science refers to more than just math. At its simplest, data science centers around the use of computational data to solve problems,[i] which means it includes the mathematics/statistics needed to break down the computational data, but also the computer science and engineering thinking necessary to code those algorithms efficiently and effectively, and the business, policy, or other subject-specific “smarts” to develop strategic decision-making based on that analysis.
Thus, statistics forms a crucial component of data science, but data science includes more than just statistics. Statistics, as a field of mathematics, includes only the mathematical processes of analyzing and interpreting data, whereas data science refers to the entire process of analyzing computational data: it also includes the algorithmic problem-solving needed to do the analysis computationally and the art of utilizing that analysis to make decisions that meet the practical needs of the context. On a practical level, many data scientists come not from a pure statistics background but from a computer science or engineering one, leveraging their coding expertise to develop efficient algorithmic systems.
Difference #2: Comprehensive vs. Sampled Data
In statistical studies, researchers are often unable to analyze the entire population, that is, the whole group they are studying, so instead they create a smaller, more manageable sample of individuals that they hope represents the population as a whole. Data science projects, however, often involve analyzing big, comprehensive data encapsulating the entire population.
The tools of traditional statistics work well for scientific studies, where one must go out and collect data on the topic in question. Because this is generally very expensive and time-consuming, researchers can usually only collect data on a subset of the wider population.
Recent developments in computation, including the ability to gather, store, transfer, and process greater computational data, have expanded the type of quantitative research now possible, and data science has developed to address these new types of research. Instead of gathering a carefully chosen sample of the population based on a heavily scrutinized set of variables, many data science projects require finding meaningful insights from the myriads of data already collected about the entire population.
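To make the contrast concrete, here is a minimal sketch in Python; the file name and column are hypothetical, purely for illustration. Traditional statistics estimates a quantity from a sample, while a comprehensive-data project can compute it over every record.

```python
import pandas as pd

# Hypothetical dataset containing a record for every customer
# (i.e., the entire population of interest).
population = pd.read_csv("customers.csv")

# Traditional statistics: analyze a manageable random sample and
# treat the result as an estimate, subject to sampling error.
sample = population.sample(n=500, random_state=42)
sample_mean = sample["annual_spend"].mean()

# Comprehensive data: compute the quantity over every record,
# so there is no sample-to-population inference step.
population_mean = population["annual_spend"].mean()
```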
Difference #3: Exploratory vs. Confirming
Data scientists often seek to build models that do something with the data, whereas statisticians seek, through their analysis, to learn something from the data. Data scientists thus often assess their machine learning models based on how effectively they perform a given task: how well they optimize a variable, determine the best course of action, correctly identify features of an image, provide a good recommendation for the user, and so on. To do this, data scientists often compare the effectiveness or accuracy of many models based on a chosen performance metric or metrics.
In traditional statistics, the questions often center around using data to understand the research topic based on the findings from a sample. Those questions concern what the sample can say about the wider population and how likely its results are to represent or apply to that wider population.
In contrast, machine learning models generally do not seek to explain the research topic but to do something, which can lead to a very different research strategy. Data scientists generally try to determine or produce the algorithm with the best performance (given whatever criteria they use to assess which performance is “better”), testing many models in the process. Statisticians often employ a single model they think represents the context accurately and then draw conclusions based on it.
Thus, data science is often a form of exploratory analysis, experimenting with several models to determine the best one for a task, while statistics is a form of confirmatory analysis, seeking to confirm how reasonable it is to conclude that a given hypothesis or hypotheses hold true for the wider population.
A lot of scientific research has been theory-confirming: a scientist has a model or theory of the world; they design and conduct an experiment to assess this model; they then use hypothesis testing to confirm or negate that model based on the results of the experiment. With changes in data availability and computing, the value of exploratory analysis, data mining, and using data to generate hypotheses has increased dramatically (Carmichael 126).
Data science as a discipline has been at the forefront of utilizing increased computing abilities to conduct exploratory work.
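As a rough illustration of the two workflows, here is a sketch using scikit-learn and SciPy on synthetic placeholder data; the candidate models and metric are arbitrary choices for illustration, not a prescription.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Exploratory (data science): try several candidate models and keep
# whichever performs best on a chosen metric.
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_model = max(scores, key=scores.get)

# Confirmatory (traditional statistics): pre-specify one comparison
# and test a hypothesis about the wider population from the sample.
t_stat, p_value = stats.ttest_ind(X[y == 0, 0], X[y == 1, 0])
```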
Conclusion
A data scientist friend of mine once quipped that data science is simply applied computational statistics (cf. this). There is some truth in this: the mathematics of data science falls within statistics, since it involves collecting, analyzing, and communicating data, and, with its emphasis on and utilization of computational data, would definitely be a part of computational statistics. The mathematics of data science is also very clearly applied: geared towards solving practical problems and needs. Hence, data science and statistics interrelate.
They differ, however, both in their formal definitions and practical understandings. Modern computation and big data technologies have had a major influence on data science. Within statistics, computational statistics also seeks to leverage these resources, but what has become “traditional” statistics does not (yet) incorporate them. I suspect that in the next few years or decades, developments in modern computing, data science, and computational statistics will reshape what people consider “traditional” or “standard” statistics to be a bit closer to the data science of today.
A friend and fellow professor, Dr. Eve Pinkser, asked me to give a guest lecture on quantitative text analysis techniques within data science for her Public Health Policy Research Methods class at the University of Illinois at Chicago on April 13th, 2020. Many of her students needed to conduct quantitative textual analysis as part of their doctoral work but struggled to determine what type of quantitative research to employ. She asked me to come in and explain the various data science and machine learning-based textual analysis techniques, since this was outside her area of expertise. The goal of the presentation was to help the PhD students in the class think through the types of data science text analysis techniques that would be helpful for their doctoral research projects.
Multiple people have asked me similar questions about how to use data science to analyze texts quantitatively, so I figured I would post my presentation for anyone interested in learning more. I have found that many people who are new to data science feel paralyzed when trying to navigate the vast array of data science techniques out there, unsure where to start. The presentation provides a basic introduction to the different approaches so that you can determine which to explore in more detail. Hopefully, it will likewise allow you to determine the type or types of text analysis you might need so that you can then look those up in more detail. Textual analysis, as well as the wider field of natural language processing of which it is a part, is a quickly up-and-coming subfield within data science doing important and groundbreaking work.
In a past blog post, I defined and described what machine learning is. I briefly highlighted four instances where machine learning algorithms are useful. This is what I wrote:
Autonomy: To teach computers to do a task without the direct aid/intervention of humans (e.g. autonomous vehicles)
Fluctuation: Help machines adjust when the requirements and data change over time
Intuitive Processing: Conduct or assist in tasks humans do but are unable to explain how computationally/algorithmically (e.g. image recognition)
Big Data: Breaking down data that is too large to handle otherwise
The goal of this blog post is to explain each in more detail.
Case #1: Autonomy
The first major use of machine learning centers around teaching computers to do a task or tasks without the direct aid or intervention of humans. Self-driving vehicles are a high-profile example of this: teaching a vehicle to drive (scanning the road and determining how to respond to what is around it) without the aid of, or with minimal direct oversight from, a human driver.
There are two basic types of tasks that machine learning systems might perform autonomously:
Tasks humans frequently perform
Tasks humans are unable to perform
Self-driving cars exemplify the former: humans drive cars, but self-driving cars would perform all or part of the driving process. Another example would be chatbots and virtual assistants like Alexa, Cortana, and Ok Google, which seek to converse with users independently. Such systems might complete the human activity fully or only partially: for example, some customer service chatbots are designed to determine the customer’s issue but then transfer to a human when the issue reaches a certain complexity.
Humans have also sought to build autonomous machine learning algorithms to perform tasks that humans are unable to perform. Unlike self-driving cars, which conduct an activity many people do, a self-driving rover or submarine might be designed to drive and operate in a world that humans have so far been unable to inhabit, like other planets in our Solar System or the deep ocean. Search engines are another example: Google uses machine learning to help refine search results, which involves analyzing a massive amount of web data, beyond what a human could normally do.
Case #2: Fluctuating Data
Machine learning is also a powerful tool for making sense of and incorporating fluctuating data. Unlike other types of models with fixed processes for how they predict values, machine learning models can learn from current patterns and adjust, both if the patterns fluctuate over time and if new use cases arise. This can be especially helpful when trying to forecast the future, allowing the model to decipher new trends if and when they emerge. For example, when predicting stock prices, machine learning algorithms can learn from new data and pick up changing trends, making the model better at predicting the future.
Of course, humans are notorious for changing over time, so fluctuation is often helpful in models that seek to understand human preferences and behavior. For example, user recommendation systems – like Netflix’s, Hulu’s, or YouTube’s video recommendations – adjust based on usage over time, enabling them to respond to individual and/or collective changes in interests.
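One simple way to handle fluctuating data, sketched below with scikit-learn on synthetic data, is to update a model incrementally as new observations arrive rather than fitting it once and freezing it. This is only one illustrative approach (online learning); periodically retraining on a rolling window of recent data is another.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(seed=0)
model = SGDRegressor(random_state=0)

# Simulate a data stream whose underlying relationship drifts over time.
for day in range(100):
    X_new = rng.normal(size=(50, 3))
    drifting_coefs = np.array([1.0 + 0.05 * day, -0.5, 0.2])
    y_new = X_new @ drifting_coefs + rng.normal(scale=0.1, size=50)

    # partial_fit updates the existing model with today's data instead
    # of refitting from scratch, letting it track the changing trend.
    model.partial_fit(X_new, y_new)
```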
Case #3: Intuitive Processing
Data scientists frequently develop machine learning algorithms to teach computers how to do processes that humans do naturally but are unable to fully explain computationally. For example, popular applications of machine learning center around replicating some aspect of sensory perception: image recognition, sound or speech recognition, and so on. These replicate the process of inputting sensory information (e.g. sight and sound) and processing, classifying, and otherwise making sense of that information. Language processing, like chatbots, forms another example. In these contexts, machine learning algorithms learn a process that humans do intuitively (seeing or hearing stimuli and understanding language) but are unable to fully explain how or why.
Many early forms of machine learning arose out of neurological models of how human brains work. The initial intention of neural nets, for instance, was to model our neurological decision-making process or processes. Much contemporary neurological scholarship has since disputed the accuracy of neural nets in representing how our brains and minds work.[i] But whether or not they represent how human minds work, neural networks have provided a powerful technique for computers to process and classify information and make decisions. Likewise, many machine learning algorithms replicate some activity humans do naturally, even if the way they conduct that human task has little to do with how humans would.
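For a concrete, if toy, example of intuitive processing, here is a sketch that trains a small neural network on scikit-learn's bundled handwritten-digit images; the architecture and settings are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 8x8 images of handwritten digits: recognizing them is effortless
# for people, yet hard to spell out as explicit rules.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The network learns the pixels-to-label mapping from examples
# rather than from hand-coded recognition rules.
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("held-out accuracy:", net.score(X_test, y_test))
```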
Case #4: Big Data
Machine learning is a powerful tool for analyzing data that is too large to break down through conventional computational techniques. Recent computer technologies have increased the possibility of data collection, storage, and processing, a major driver of big data. Machine learning has arisen as a major, if not the major, means of analyzing this big data.
Machine learning algorithms can manage a dizzying array of variables and use them to find insightful patterns (like lasso regression for linear modeling). Many big data cases involve hundreds, thousands, or even tens or hundreds of thousands of input variables, and many machine learning techniques (like best subsets selection, stepwise selection, and lasso regression) process the myriad variables in big data and determine the best ones to use, as sketched below.
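Here is a sketch of that winnowing process using lasso regression on synthetic “wide” data, where only a handful of the candidate variables actually matter; the dataset sizes are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic wide dataset: 2,000 candidate variables, only 10 of
# which actually influence the outcome.
X, y = make_regression(n_samples=500, n_features=2000,
                       n_informative=10, noise=5.0, random_state=0)

# The L1 penalty shrinks uninformative coefficients exactly to zero;
# LassoCV chooses the penalty strength by cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"kept {selected.size} of {X.shape[1]} variables")
```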
Recent developments in computing provide the incredible processing power necessary to do such work (and, debatably, machine learning is currently helping to push computational power forward by providing demand for greater computational abilities). Hand calculations and the computers of several decades ago were often unable to handle the calculations necessary to analyze large datasets: demonstrated, for example, by the fact that computer scientists invented the now-popular neural networks many decades ago, but the method did not gain popularity until recent computer processing made them easy and worthwhile to run.
Tractors and other large-scale agricultural technologies coincided historically with the enlargement of farm property sizes: such machinery not only allowed farmers to manage large tracts of land but also incentivized larger farms economically. Likewise, machine learning algorithms provide the main technological means to analyze big data, both enabling and in turn being incentivized by the rise of big data in the professional world.
Conclusion
Here I have described four major uses of machine learning algorithms. Machine learning has become popular in many industries because of at least one of these functionalities, but of course, these are not its only potential uses. In addition, as we develop machine learning tools, we are constantly inventing more. Given machine learning’s newness compared to many century-old technologies, time will tell all the ways humans will utilize it.
I am pleased to announce that the Annals of Anthropological Practice has accepted my article “Anthropology by Data Science”: https://anthrosource.onlinelibrary.wiley.com/doi/10.1111/napa.12169. In it, I reflect on the relationship anthropologists have cultivated with data science as a discipline and the importance of integrating machine learning techniques into ethnographic practice.
In the spring of 2018, I researched how anthropologists and related social scholars have analyzed data science and machine learning for my Master’s in Anthropology at the University of Memphis. For the project, I assessed the anthropological literature on data science and machine learning to date and explored potential connections between anthropology and data science, based on my perspective as a data scientist and anthropologist. Here is my final report.
Thank you, Dr. Ted Maclin, for your help overseeing and assisting this project.
This is my practicum report with Indicia Consulting. In lieu of a master’s thesis, the University of Memphis Department of Anthropology required that we master’s students conduct a practicum project. For this, we had to partner with an organization and complete a 300+ hour anthropological research project based on the organization’s needs and our skills and interests. My practicum project was Indicia’s EPIC Project with the California Energy Commission (see this link and this link for more details on the EPIC Project). In this report, I outline potential ways to integrate ethnographic/anthropological and data science research in professional settings.
In November 2019, the American Anthropological Association’s Committee for the Anthropology of Science, Technology, and Computing (CASTAC) awarded me the David Hakken Graduate Student Prize for innovative science and technology scholarship.
The Anthropology Department also required that we publicly present our practicum research to the University of Memphis campus. This PowerPoint summarizes my practicum project. If you are not keen to read the 99-page full report, this is a much shorter alternative:
The following is a presentation I gave at the Society for Applied Anthropology’s 2018 annual conference in Philadelphia, PA. In it, I describe how I think anthropologists should understand, analyze, and relate to machine learning and data science.
Below is a talk I gave at the 2019 Memphis Data conference, organized by the University of Memphis to discuss data science research in the Memphian community. In this presentation, I summarize a project I did with Indicia Consulting that integrated data science and ethnography.