As a data scientist and ethnographer, I have worked on many types of research projects. In professional and business settings, I am excited by the enormous growth in both data science and ethnography but have been frustrated by how, despite recent developments that make them more similar, their respective teams seem to be growing apart and competitively against each other.
Within academia, quantitative and qualitative research methods have developed historically as distinct and competing approaches as if one has to choose which direction to take when doing research: departments or individual researchers specialize in one or the other and fight over scarce research funding. One major justification for this division has been the perception that quantitative approaches tend to be prescriptive and top-down compared with qualitative approaches which tend to be to descriptive and bottom-up. That many professional research contexts have inherited this division is unfortunate.
Recent developments in data science draw parallels with qualitative research and if anything, could be a starting point for collaborative intermingling. What has developed as “traditional” statistics taught in introductory statistics courses is generally top-down, assuming that data follows a prescribed, ideal model and asking regimented questions based on that ideal model. Within the development of machine learning been a shift towards models uniquely tailored to the data and context in question, developed and refined iteratively.[i] These trends may show signs of breaking down the top-down nature of traditional statistics work.
If there was ever a time to integrate quantitative data science and qualitative ethnographic research, it is now. In the increasingly important “data economy,” understanding users/consumers is vital to developing strategic business practices. In the business world, both socially-oriented data scientists and ethnographers are experts in understanding users/consumers, but separating them into competing groups only prevents true synthesis of their insights. Integrating the two should not just include combining the respective research teams and their projects but also encouraging researchers to develop expertise in both instead of simply specializing in one or the other. New creative energy could burst forth when we no longer treat these as distinct methodologies or specialties.
[i] Nafus, D., & Knox, H. (2018). Ethnography for a Data-Saturated World. Manchester: Manchester University Press, 11-12.
As an anthropologist and data scientist, I often feel caught in the middle two distinct warring factions. Anthropologists and data scientists inherited a historic debate between quantitative and qualitative methodologies in social research within modern Western societies. At its core, this debate has centered on the difference between objective, prescriptive, top-downtechniques and subjective, sitautional, flexible, descritpive bottom-up approaches.[i] In this ensuing conflict, quantative research has been demarcated into the top-down faction and qualitative research within the bottom-up faction to the detriment of understanding both properly.
In my experience on both “sides,” I have seen a tendency among anthropologists to lump all quantitative social research as proscriptive and top-down and thus miss the important subtleties within data science and other quantitative techniques. Machine learning techniques within the field are a partial shift towards bottom-up, situational and iterative quantitative analysis, and business anthropologists should explore what data scientists do as a chance to redevelop their relationship with quantitative analysis.
Shifts in Machine Learning
Shifts within machine learning algorithm development give impetus for incorporating quantitative techniques that are local and interpretive. The debate between top-down vs. bottom-up knowledge production does not need – or at least may no longer need– to divide quantitative and qualitative techniques. Machine learning algorithms “leave open the possibility of situated knowledge production, entangled with narrative,” a clear parallel to qualitative ethnographic techniques.[ii]
At the same time, this shift towards iterative and flexible machine learning techniques is not total within data science: aspects of top-down frameworks remain, in terms of personnel, objectives, habits, strategies, and evaluation criteria. But, seeds of bottom-up thinking definitely exist prominently within data science, with the potential to significantly reshape data science and possibly quantitative analysis in general.
As a discipline, data science is in a uniquely formative and adolescent period, developing into its “standard” practices. This leads to significant fluctuations as the data scientist community defines its methodology. The set of standard practices that we now typically call “traditional” or “standard” statistics, generally taught in introductory statistics courses, developed over a several decade period in the late nineteenth and early twentieth century, especially in Britain.[iii] Connected with recent computer technology, data science is in a similarly formative period right now – developing its standard techniques and ways of thinking. This formative period is a strategic time for anthropologists to encourage bottom-up quantative techniques.
Conclusion
Business
anthropologists could and should be instrumental in helping to develop and
innovatively utilize these situational and iterative machine learning
techniques. This is a strategic time for business anthropologists to do the
following:
Immerse themselves into data science and encourage and cultivate bottom-up quantative machine learning techniques within data science
Cultivate and incorporate (when applicable) situational and iterative machine learning approaches in its ethnographies
For both, anthropologists should use the strengths of ethnographic and anthropological thinking to help develop bottom-up machine learning that is grounded in flexible to specific local contexts. Each requires business anthropologists to reexplore their relationship with data science and machine learning instead of treating it as part of an opposing “methodological clan.” [iv]
[i] Nafus, D.,
& Knox, H. (2018). Ethnography for a Data-Saturated World. Manchester:
Manchester University Press, 11-12
For my first interview in the Interview Series, I interviewed Astrid Countee. She is a business anthropologist and technologist with a background in anthropology, software engineering, and data science. She currently works as a user researcher at the peer-to-peer distributed company Holo, as a research associate at The Plenary, as an arts and education nonprofit, and as a co-founder of Missing Link Studios which distributes the This Anthro Life podcast.
If the audio does not play on your computer, you can download it here:
For my fourth interview in the Interview Series, I interviewed Emi Harry. This is the first part of three of our conversation. Emi Harry is the co-founder of Naina Tech Inc., a New York-based tech startup that is poised to launch an adaptive learning platform for early childhood education in the U.S. and Nigeria’s underserved communities. As a highly skilled data scientist and social entrepreneur, Harry is also on the board of Alula Learning, an EdTech learning management systems provider, and Manna, a health and nutrition company, both in Nigeria. She has had a diverse professional experience, having worked in the food, oil and gas, entertainment, and fashion industries in Nigeria, as well as the entertainment, non-profit, and education industries in the United States. Currently, she balances her time between working in tech, creative writing, and fashion designing.
Her educational qualifications include B.S. in Mathematics, University of Lagos, Nigeria; Master’s in Social Entrepreneurship, Hult International Business School, San Francisco; M.Sc. in Data Analytics/Science, Fordham University, New York, and is on track to earn a M.Sc. in Computer Science from Pace University New York.
During this first part of our conversation, we discussed the data science company she founded and how she learned data science.
This is the second part of my interview with Emi Harry as part of my Interview Series. In it, she discusses her experiences of racial discrimination in data science as a black woman, how she manages her dual background in data science and fashion, and how she leverages her storytelling and communication skills as a data scientist. If you would like to start at the beginning of my interview with her, click here.
Emi Harry is the co-founder of Naina Tech Inc., a New York-based tech startup that is poised to launch an adaptive learning platform for early childhood education in the U.S. and Nigeria’s underserved communities. As a highly skilled data scientist and social entrepreneur, Harry is also on the board of Alula Learning, an EdTech learning management systems provider, and Manna, a health and nutrition company, both in Nigeria. She has had a diverse professional experience, having worked in the food, oil and gas, entertainment, and fashion industries in Nigeria, as well as the entertainment, non-profit, and education industries in the United States. Currently, she balances her time between working in tech, creative writing, and fashion designing.
Her educational qualifications include B.S. in Mathematics, University of Lagos, Nigeria; Master’s in Social Entrepreneurship, Hult International Business School, San Francisco; M.Sc. in Data Analytics/Science, Fordham University, New York, and is on track to earn a M.Sc. in Computer Science from Pace University New York.
This is the third part of my interview with Emi Harry as part of my Interview Series. In it, she discusses her dual identify as a data scientist and entrepreneur, including how what it takes to be an entrepreneur, her experiences starting a data science company and recommendations she has for any data scientists considering starting their own.
Emi Harry is the co-founder of Naina Tech Inc., a New York-based tech startup that is poised to launch an adaptive learning platform for early childhood education in the U.S. and Nigeria’s underserved communities. As a highly skilled data scientist and social entrepreneur, Harry is also on the board of Alula Learning, an EdTech learning management systems provider, and Manna, a health and nutrition company, both in Nigeria. She has had a diverse professional experience, having worked in the food, oil and gas, entertainment, and fashion industries in Nigeria, as well as the entertainment, non-profit, and education industries in the United States. Currently, she balances her time between working in tech, creative writing, and fashion designing.
Her educational qualifications include B.S. in Mathematics, University of Lagos, Nigeria; Master’s in Social Entrepreneurship, Hult International Business School, San Francisco; M.Sc. in Data Analytics/Science, Fordham University, New York, and is on track to earn a M.Sc. in Computer Science from Pace University New York.
I interviewed Olga Shiyan as part of my Interview Series. In it, she discusses her anti-corruption work in Kazakhstan with Transparency International. In particular, she highlights various projects that have integrated anthropology with data science and statistics.
Olga Shiyan is the Executive Director of the Transparency International’s chapter in Kazakhstan. She specializes in advocacy, legislation and draft laws, and democratic training programs. For this, she has developed research methods that combine anthropology and data science and statistics. In 2019, the Kazakhstan Geographic Society awarder for a medal for anti-corruption work.
To learn more about Olga, feel free to check out the following:
In a previous article, I have discussed the value of integrating data science and ethnography. On LinkedIn, people commented that they were interested and wanted to hear more detail on potential ways to do this. I replied, “I have found explaining how to conduct studies that integrate the two practically is easier to demonstrate through example than abstractly since the details of how to do it vary based on the specific needs of each project.”
In this article, I intend to do exactly that: analyze four innovative projects that in some way integrated data science and ethnography. I hope these will spur your creative juices to help think through how to creatively combine them for whatever project you are working on.
Synopsis:
Project:
How It Integrated Data Science and Ethnography:
Link to Learn More:
No Show Model
Used ethnography to design machine learning software
Used ethnography to understand how users make sense of and behave towards a machine learning system they encounter and how this, in turn, shapes the development of the machine learning algorithm(s)
A medical clinic at a hospital system in New York City asked me to use machine learning to build a show rate predictor in order to inform an improve its scheduling practices. During the initial construction phase, I used ethnography to both understand in more depth understand the scheduling problem the clinic faced and determine an appropriate interface design.
Through an ethnographic inquiry, I discovered the most important question(s) schedulers ask when scheduling their appointments. This was, “Of the people scheduled for a given doctor on a particular day, how many of them are likely to actually show up?” I then built a machine learning model to answer this exact question. My ethnographic inquiry provided me the design requirements for the data science project.
In addition, I used my ethnographic inquiries to design the interface. I observed how schedulers interacted with their current scheduling software, which gave me a sense for what kind of visualizations would work or not work for my app.
This project exemplifies how ethnography can be helpful both in the development stage of a machine learning project to determine machine learning algorithm(s) needs and on the frontend when communicating the algorithm(s) to and assessing its successfulness with its users.
As both an ethnographer and a data scientist, I was able to translate my ethnographic insights seamlessly into machine learning modeling and API specifications and also conducted follow-up ethnographic inquiries to ensure that what I was building would meet their needs.
Project 2: Cybersensitivity Study
I conducted this project with Indicia Consulting. Its goal was to explore potential connections between individuals’ energy consumption and their relationship with new technology. This is an example of using ethnography to explore and determine potential social and cultural patterns in-depth with a few people and then using data science to analyze those patterns across a large population.
We started the project by observing and interviewing about thirty participants, but as the study progressed, we needed to develop a scalable method to analyze the patterns across whole communities, counties, and even states.
Ethnography is a great tool for exploring a phenomenon in-depth and for developing initial patterns, but it is resource-intensive and thus difficult to conduct on a large group of people. It is not practical for saying analyzing thousands of people. Data science, on the other hand, can easily test the validity across an entire population of patterns noticed in smaller ethnographic studies, yet because it often lacks the granularity of ethnography, would often miss intricate patterns.
Ethnography is also great on the back end for determining whether the implemented machine learning models and their resulting insights make sense on the ground. This forms a type of iterative feedback loop, where data science scales up ethnographic insights and ethnography contextualizes data science models.
Thus, ethnography and data science cover each other’s weaknesses well, forming a great methodological duo for projects centered around trying to understand customers, users, colleagues, or other users in-depth.
Project 3: Facebook Newsfeed Folk Theories
In their study, Motahhare Eslami and her team of researchers conducted an ethnographic inquiry into how various Facebook users conceived of how the Facebook Newsfeed selects which posts/stories rise to the top of their feeds. They analyze several different “folk theories” or working theories by everyday people for the criteria this machine learning system uses to select top stories.
How users think the overall system works influences how they respond to the newsfeed. Users who believe, for example, that the algorithm will prioritize the posts of friends for whom they have liked in the past will often intentionally like the posts of their closest friends and family so that they can see more of their posts.
Users’ perspectives on how the Newsfeed algorithm works influences how they respond to it, which, in turn, affects the very data the algorithm learns from and thus how the algorithm develops. This creates a cyclic feedback loop that influences the development of the machine learning algorithmic systems over time.
Their research exemplifies the importance of understanding how people think about, respond to, and more broadly relate with machine learning-based software systems. Ethnographies into people’s interactions with such systems is a crucial way to develop this understanding.
In a way, many machine learning algorithms are very social in nature: they – or at least the overall software system in which they exist – often succeed or fail based on how humans interact with them. In such cases, no matter how technically robust a machine learning algorithm is, if potential users cannot positively and productively relate to it, then it will fail.
Ethnographies into the “social life” of machine learning software systems (by which I mean how they become a part of – or in some cases fail to become a part of – individuals’ lives) helps understand how the algorithm is developing or learning and determine whether they are successful in what we intended them to do. Such ethnographies require not only in-depth expertise in ethnographic methodology but also an in-depth understanding how machine learning algorithms work to in turn understand how social behavior might be influencing their internal development.
Project 4: Thing Ethnography
Elise Giaccardi and her research team have been pioneering the utilization of data science and machine learning to understand and incorporate the perspective of things into ethnographies. With the development of the internet of things (IOT), she suggests that the data from object sensors could provide fresh insights in ethnographies of how humans relate to their environment by helping to describe how these objects relate to each other. She calls this thing ethnography.
This experimental approach exemplifies one way to use machine learning algorithms within ethnographies as social processes/interactions in of themselves. This could be an innovative way to analyze the social role of these IOT objects in daily life within ethnographic studies. If Eslami’s work exemplifies a way to graft ethnographic analysis into the design cycle of machine learning algorithms, Giaccardi’s research illustrates one way to incorporate data science and machine learning analysis into ethnographies.
Conclusion
Here are four examples of innovative projects that involve integrating data science and ethnography to meet their respective goals. I do not intend these to be the complete or exhaustive account of how to integrate these methodologies but as food for thought to spur further creative thinking into how to connect them.
For those who, when they hear the idea of integrating data science and ethnography, ask the reasonable question, “Interesting but what would that look like practically?”, here are four examples of how it could look. Hopefully, they are helpful in developing your own ideas for how to combine them in whatever project you are working on, even if its details are completely different.
I recently integrated ethnography and data science to develop a Show Rate Predictor for an (anonymous) hospital system. Many readers have asked for real-world examples of this integration, and this project demonstrates how ethnography and data science can join to build machine learning-based software that makes sense to users and meets their needs.
Part 1: Scoping out the Project
A particular clinic in the hospital system was experiencing a large number of appointment no-shows, which produced wasted time, frustration, and confusion for both its patients and employees. I was asked to use data science and machine learning to better understand and improve their scheduling.
I started the project by conducting ethnographic research into the clinic to learn more about how scheduling occurs normally, what effect it was having on the clinic, and what driving problems employees saw. In particular, I observed and interviewed scheduling assistants to understand their day-to-day work and their perspectives on no-shows.
One major lesson I learned through all this was that when scheduling an appointment, schedulers are constantly trying to determine how many people to schedule on a given doctor’s shift to ensure the right number of people show up. For example, say 12-14 patients is a good number of patients for Dr. Rodriguez’s (made up name) Wednesday morning shift. When deciding whether to schedule an appointment for the given patient with Dr. Rodriguez on an upcoming Wednesday, the scheduling assistants try to determine, given the appointments currently scheduled then, whether they can expect 12-14 patients to show up. This was often an inexact science. They would often have to schedule 20-25 patients on a particular doctor’s shift to ensure their ideal window of 12-14 patients would actually come that day. This could create the potential for chaos, however, where too many patients arriving on some days and too few on others.
This question – how many appointments can we expect or predict to occur on a given doctor’s shift – became my driving question to answer with machine learning. After checking in with the various stakeholders at the clinic to make sure this was in fact an important and useful question to answer with machine learning, I started building.
Part 2: Building the Model
Now that I had a driving, answerable question, I decided to break it down into two sequential machine learning models:
The first model learned to predict the probability that a given appointment would occur, learning from the history of occurring or no-show appointments.
The second model, using the appointment probabilities from the first model, estimated how many appointments might occur for every doctors’ shift.
The first model combined three streams of data to assess the no-show probability: appointment data (such as how long ago it was scheduled, type of appointment, etc.); patient information, especially past appointment history; and doctor information. I performed extensive feature selection to determine the best subset of variables to use and tested several types of machine learning models before settling on gradient boosting.
The second model used the probabilities in the first model as input data to predict how many patients to expect to come on each doctors’ shift. I settled on a neural network for the model.
Part 3: Building an App
Next, I worked with the software engineers on my team to develop an app to employ these models in real time and communicate the information to schedulers as they scheduled appointments. My ethnographic research was invaluable for developing how to construct the app.
On the back end, the app calculated the probability that all future appointments would occur, updating with new calculations for newly scheduled or edited appointments. Once a week, it would incorporate that week’s new appointment data and shift attendance to each model’s training data and update those models accordingly.
Through my ethnographic research, I observed how schedulers approached scheduling appointments, including what software they used in the process and how they used each. I used that to determine the best ways to communicate that information, periodically showing my ideas to the schedulers to make sure my strategy would be helpful.
I constructed an interface to communicate the information that would complement the current software they used. In addition to displaying the number of patients expected to arrive, if the machine learning algorithm was predicting that a particular shift was underbooked, it would mark the shift in green on the calendar interface; yellow if the shift was projected to have the ideal number of patients, and red if already expected have too many patients. The color-coding allowed easy visualization of the information in the moment: when trying to find an appointment time for a patient, they could easily look for the green shifts or yellow if they had to, but steer clear of the red. When zooming in on a specific shift, each appointment would be color-coded (likely, unlikely, and in the middle) as well based on the probability that it would occur.
Conclusion
This is one example of a projects that integrates data science and ethnography to build a machine learning app. I used ethnography to construct the app’s parameters and framework. It tethered the app in the needs of the schedulers, ensuring that the machine learning modeling I developed was useful to those who would use it. Frequent check-ins before each step in their development also helped confirm that my proposed concept would in fact help meet their needs.
My data science and machine learning expertise helped guide me in the ethnographic process as well. Being an expert in how machine learning worked and what sorts of questions it could answer allowed me to easily synthesize the insights from my ethnographic inquiries into buildable machine learning models. I understood what machine learning was capable (and not capable) of doing, and I could intuitively develop strategic ways to employ machine learning to address issues they were having.
Hence, my dual role as an ethnography and data scientist benefitted the project greatly. My listening skills from ethnography enabled me to uncover the underlying questions/issues schedulers faced, and my data science expertise gave me the technical skills to develop a viable machine learning solution. Without listening patiently through extensive ethnography, I would not have understood the problem sufficiently, but without my data science expertise, I would have been unable to decipher which questions(s) or issue(s) machine learning could realistically address and how.
This exemplifies why a joint expertise in data science and ethnography is invaluable in developing machine learning software. Two different individuals or teams could complete each separately – an ethnographer(s) analyze the users’ needs and a data scientist(s) then determine whether machine learning modeling could help. But this seems unnecessarily disjointed, potentially producing misunderstanding, confusion, and chaos. By adding an additional layer of people, it can easily lead to either the ethnographer(s) uncovering needs way too broad or complex for a machine learning-based solution to help or the data scientist(s) trying to impose their machine learning “solution” to a problem the users do not have.
Developing expertise in both makes it much easier to simultaneously understand the problems or questions in a particular context and build a doable data science solution.
A friend and fellow professor, Dr. Eve Pinkser, asked me to give a guest lecture on quantitative text analysis techniques within data science for her Public Health Policy Research Methods class with the University of Illinois at Chicago on April 13th, 2020. Multiple people have asked me similar questions about how to use data science to analyze texts quantitatively, so I figured I would post my presentation for anyone interested in learning more.
It provides a basic introduction of the different approaches so that you can determine which to explore in more detail. I have found that many people who are new to data science feel paralyzed when trying to navigate through the vast array of data science techniques out there and unsure where to start.
Many of her students needed to conduct quantitative textual analysis as part of their doctoral work but struggled in determining what type of quantitative research to employ. She asked me to come in and explain the various data science and machine learning-based textual analysis techniques, since this was out of her area of expertise. The goal of the presentation was to help the PhD students in the class think through the types of data science quantitative text analysis techniques that would be helpful for their doctoral research projects.
Hopefully, it would likewise allow you to determine the type or types of text analysis you might need so that you can then look those up in more detail. Textual analysis, as well as the wider field of natural language processing within which it is a part of, is a quickly up-and-coming subfield within data science doing important and groundbreaking work.