Four Innovative Projects that Integrated Data Science and Ethnography

In a previous article, I discussed the value of integrating data science and ethnography. On LinkedIn, people commented that they were interested and wanted to hear more detail on potential ways to do this. I replied, “I have found explaining how to conduct studies that integrate the two practically is easier to demonstrate through example than abstractly since the details of how to do it vary based on the specific needs of each project.”

In this article, I intend to do exactly that: analyze four innovative projects that in some way integrated data science and ethnography. I hope these will spark your creativity and help you think through how to combine the two in whatever project you are working on.

Synopsis:

Project | How It Integrated Data Science and Ethnography | Link to Learn More
No Show Model | Used ethnography to design machine learning software | https://ethno-data.com/show-rate-predictor/
Cybersensitivity Study | Used machine learning to scale up the scope of an ethnographic inquiry to a larger population | https://ethno-data.com/masters-practicum-summary/
Facebook Newsfeed Folk Theories | Used ethnography to understand how users make sense of and behave towards a machine learning system they encounter and how this, in turn, shapes the development of the machine learning algorithm(s) | https://dl.acm.org/doi/10.1145/2858036.2858494
Thing Ethnography | Used machine learning to incorporate objects’ interactions into ethnographic research | https://dl.acm.org/doi/10.1145/2901790.2901905 and https://www.semanticscholar.org/paper/Things-Making-Things%3A-An-Ethnography-of-the-Giaccardi-Speed/2db5feac9cc743767fd23aeded3aa555ec8683a4?p2df

Project 1: No Show Model

A medical clinic at a hospital system in New York City asked me to use machine learning to build a show rate predictor in order to inform and improve its scheduling practices. During the initial construction phase, I used ethnography both to understand in more depth the scheduling problem the clinic faced and to determine an appropriate interface design.

Through an ethnographic inquiry, I discovered the most important question schedulers ask when scheduling appointments: “Of the people scheduled for a given doctor on a particular day, how many of them are likely to actually show up?” I then built a machine learning model to answer this exact question. My ethnographic inquiry provided me with the design requirements for the data science project.

In addition, I used my ethnographic inquiries to design the interface. I observed how schedulers interacted with their current scheduling software, which gave me a sense for what kind of visualizations would work or not work for my app.

This project exemplifies how ethnography can be helpful both in the development stage of a machine learning project, to determine what the machine learning algorithm(s) need to do, and on the front end, when communicating the algorithm(s) to users and assessing how successful they are for those users.

As both an ethnographer and a data scientist, I was able to translate my ethnographic insights seamlessly into machine learning modeling and API specifications and to conduct follow-up ethnographic inquiries to ensure that what I was building would meet the clinic’s needs.

Project 2: Cybersensitivity Study

I conducted this project with Indicia Consulting. Its goal was to explore potential connections between individuals’ energy consumption and their relationship with new technology. This is an example of using ethnography to explore and determine potential social and cultural patterns in-depth with a few people and then using data science to analyze those patterns across a large population.

We started the project by observing and interviewing about thirty participants, but as the study progressed, we needed to develop a scalable method to analyze the patterns across whole communities, counties, and even states.

Ethnography is a great tool for exploring a phenomenon in-depth and for developing initial patterns, but it is resource-intensive and thus difficult to conduct on a large group of people. It is not practical for analyzing, say, thousands of people. Data science, on the other hand, can easily test whether patterns noticed in smaller ethnographic studies hold across an entire population, yet because it often lacks the granularity of ethnography, it would often miss intricate patterns.

Ethnography is also great on the back end for determining whether the implemented machine learning models and their resulting insights make sense on the ground. This forms a type of iterative feedback loop, where data science scales up ethnographic insights and ethnography contextualizes data science models.

Thus, ethnography and data science cover each other’s weaknesses well, forming a great methodological duo for projects centered around trying to understand customers, users, colleagues, or other people in-depth.

Project 3: Facebook Newsfeed Folk Theories

In their study, Motahhare Eslami and her team of researchers conducted an ethnographic inquiry into how various Facebook users conceived of how the Facebook Newsfeed selects which posts/stories rise to the top of their feeds. They analyze several different “folk theories” or working theories by everyday people for the criteria this machine learning system uses to select top stories.

How users think the overall system works influences how they respond to the newsfeed. Users who believe, for example, that the algorithm will prioritize the posts of friends whose posts they have liked in the past will often intentionally like the posts of their closest friends and family so that they can see more of their posts.

Users’ perspectives on how the Newsfeed algorithm works influence how they respond to it, which, in turn, affects the very data the algorithm learns from and thus how the algorithm develops. This creates a cyclic feedback loop that influences the development of the machine learning algorithmic systems over time.

Their research exemplifies the importance of understanding how people think about, respond to, and more broadly relate with machine learning-based software systems. Ethnographies of people’s interactions with such systems are a crucial way to develop this understanding.

In a way, many machine learning algorithms are very social in nature: they – or at least the overall software system in which they exist – often succeed or fail based on how humans interact with them. In such cases, no matter how technically robust a machine learning algorithm is, if potential users cannot positively and productively relate to it, then it will fail.

Ethnographies into the “social life” of machine learning software systems (by which I mean how they become a part of – or in some cases fail to become a part of – individuals’ lives) help us understand how the algorithms are developing or learning and determine whether they are succeeding at what we intended them to do. Such ethnographies require not only in-depth expertise in ethnographic methodology but also an in-depth understanding of how machine learning algorithms work, in order to understand how social behavior might be influencing their internal development.

Project 4: Thing Ethnography

Elisa Giaccardi and her research team have been pioneering the use of data science and machine learning to understand and incorporate the perspective of things into ethnographies. With the development of the Internet of Things (IoT), she suggests that the data from object sensors could provide fresh insights in ethnographies of how humans relate to their environment by helping to describe how these objects relate to each other. She calls this thing ethnography.

This experimental approach exemplifies one way to use machine learning algorithms within ethnographies as social processes/interactions in and of themselves. This could be an innovative way to analyze the social role of these IoT objects in daily life within ethnographic studies. If Eslami’s work exemplifies a way to graft ethnographic analysis into the design cycle of machine learning algorithms, Giaccardi’s research illustrates one way to incorporate data science and machine learning analysis into ethnographies.

Conclusion

Here are four examples of innovative projects that involve integrating data science and ethnography to meet their respective goals. I do not intend these to be the complete or exhaustive account of how to integrate these methodologies but as food for thought to spur further creative thinking into how to connect them.

For those who, when they hear the idea of integrating data science and ethnography, ask the reasonable question, “Interesting but what would that look like practically?”, here are four examples of how it could look. Hopefully, they are helpful in developing your own ideas for how to combine them in whatever project you are working on, even if its details are completely different.

Photo credit #1: StartupStockPhotos at https://pixabay.com/photos/startup-meeting-brainstorming-594090/

Photo credit #2: DarkoStojanovic at https://pixabay.com/photos/medical-appointment-doctor-563427/

Photo credit #3: NASA at https://unsplash.com/photos/Q1p7bh3SHj8  

Photo credit #4: Kon Karampelas at https://unsplash.com/photos/HUBofEFQ6CA

Photo credit #5: Pixabay at https://www.pexels.com/photo/app-business-connection-device-221185/  

UX Research and Business Anthropology Are Central within Applied Anthropology

Photo by Ali Pazani on Pexels.com

This is a research paper I wrote for a master’s course on Applied Anthropology at the University of Memphis. The overall master’s program sought to train students in applied anthropology, and the goal of this course was to teach the foundations of what applied anthropology is, in contrast to other types of anthropology.

Even though I found the course interesting, its curriculum lacked the readings and perspectives of applied anthropologists in the business world. As I discuss in the paper, statistically speaking, a significant number of applied anthropologists (including alumni of the University of Memphis’s applied anthropology program) work in the business sector, so excluding them leaves out what might be the largest group of applied anthropologists from their own field. I wrote this essay as a subtle nudge to encourage the course designers to add the works of business anthropologists, particularly UX researchers, into their curriculum.

Due to the lack of resources by applied business anthropologists in the curriculum, I had to assemble my own resources. Other applied anthropologists have told me they have encountered this as well. So, hopefully, in addition to the essay potentially providing helpful analysis of applied business anthropology, its bibliography might also provide a starting collection of business anthropology resources for you to explore.


Three Key Differences between Data Science and Statistics

Data science’s popularity has grown in the last few years, and many have confused it with its older, more familiar relative: statistics. As someone who has worked both as a data scientist and as a statistician, I frequently encounter such confusion. This post seeks to clarify some of the key differences between them.

Before I get into their differences, though, let’s define them. Statistics as a discipline refers to the mathematical processes of collecting, organizing, analyzing, and communicating data. Within statistics, I generally define “traditional” statistics as the statistical processes taught in introductory statistics courses, like basic descriptive statistics, hypothesis testing, confidence intervals, and so on: generally what people outside of statistics, especially in the business world, think of when they hear the word “statistics.”

Data science in its most broad sense is the multi-disciplinary science of organizing, processing, and analyzing computational data to solve problems. Although they are similar, data science differs from both statistics and “traditional” statistics:

Difference | Statistics | Data Science
#1 | Field of Mathematics | Interdisciplinary
#2 | Sampled Data | Comprehensive Data
#3 | Confirming Hypothesis | Exploratory Hypotheses

Difference #1: Data Science Is More than a Field of Mathematics

Statistics is a field of mathematics, whereas data science refers to more than just math. At its simplest, data science centers around the use of computational data to solve problems,[i] which means it includes the mathematics/statistics needed to break down the computational data but also the computer science and engineering thinking necessary to code those algorithms efficiently and effectively, and the business, policy, or other subject-specific “smarts” to develop strategic decision-making based on that analysis.

Thus, statistics forms a crucial component of data science, but data science includes more than just statistics. Statistics, as a field of mathematics, just includes the mathematical processes of analyzing and interpreting data, whereas data science also includes the algorithmic problem-solving to do the analysis computationally and the art of using that analysis to make decisions that meet the practical needs of the context. Data science generally refers to the entire process of analyzing computational data. On a practical level, many data scientists do not come from a pure statistics background but from a computer science or engineering background, leveraging their coding expertise to develop efficient algorithmic systems.

Difference #2: Comprehensive vs Sample Data

In statistical studies, researchers are often unable to analyze the entire population, that is, the whole group they are studying, so instead they create a smaller, more manageable sample of individuals that they hope represents the population as a whole. Data science projects, however, often involve analyzing big, summative data that encapsulates the entire population.

The tools of traditional statistics work well for scientific studies, where one must go out and collect data on the topic in question. Because this is generally very expensive and time-consuming, researchers can only collect data on a subset of the wider population most of the time.

Recent developments in computation, including the ability to gather, store, transfer, and process greater computational data, have expanded the type of quantitative research now possible, and data science has developed to address these new types of research. Instead of gathering a carefully chosen sample of the population based on a heavily scrutinized set of variables, many data science projects require finding meaningful insights from the myriads of data already collected about the entire population.
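As a rough illustration of the contrast, here is a minimal Python sketch. The "population" is simulated (the numbers are made up, not real data), and the sample size of 200 is an arbitrary assumption. The sampled approach has to estimate the mean and report its uncertainty, while the comprehensive approach simply computes it.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical "population": 100,000 simulated values (not real data).
population = [random.gauss(50, 10) for _ in range(100_000)]

# Sampled data: estimate the mean from a small random sample and
# report a 95% confidence interval (normal approximation).
sample = random.sample(population, 200)
sample_mean = statistics.mean(sample)
std_error = statistics.stdev(sample) / math.sqrt(len(sample))
ci_low = sample_mean - 1.96 * std_error
ci_high = sample_mean + 1.96 * std_error

# Comprehensive data: compute the mean directly; no inference is needed.
population_mean = statistics.fmean(population)

print(f"sample estimate: {sample_mean:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"population mean: {population_mean:.2f}")
```

With comprehensive data the hard part shifts from inference to processing: the calculation is trivial, but finding which calculation is meaningful among everything already collected is not.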

Difference #3: Exploratory vs Confirming  

Data scientists often seek to build models that do something with the data, whereas statisticians seek through their analysis to learn something from the data. Data scientists thus often assess their machine learning models based on how effectively they perform a given task, like how well they optimize a variable, determine the best course of action, correctly identify features of an image, provide a good recommendation for the user, and so on. To do this, data scientists often compare the effectiveness or accuracy of many models based on one or more chosen performance metrics.

In traditional statistics, the questions often center around using data to understand the research topic based on the findings from a sample. Questions then center around what the sample can say about the wider population and how likely its results would represent or apply to that wider population.

In contrast, machine learning models generally do not seek to explain the research topic but to do something, which can lead to a very different research strategy. Data scientists generally try to determine/produce the algorithm with the best performance (given whatever criteria they use to assess how a performance is “better”), testing many models in the process. Statisticians often employ a single model they think represents the context accurately and then draw conclusions based on it.

Thus, data science is often a form of exploratory analysis, experimenting with several models to determine the best one for a task, and statistics a form of confirmatory analysis, seeking to confirm how reasonable it is to conclude a given hypothesis or hypotheses to be true for the wider population.
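The exploratory workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is simulated, the three candidate models (a mean baseline, a least-squares line, and a nearest-neighbour lookup) are arbitrary choices, and mean absolute error on held-out data stands in for whatever performance metric a real project would choose.

```python
import random

random.seed(1)

# Hypothetical data: y depends roughly linearly on x, plus noise.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 3.0 * x + 2.0 + random.gauss(0, 1.0)) for x in xs]
train, test = data[:150], data[150:]  # hold out 50 rows for evaluation

def fit_mean(rows):
    """Baseline: always predict the training mean of y."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    """Ordinary least squares with a single predictor."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_nearest(rows):
    """1-nearest-neighbour: predict the y of the closest training x."""
    return lambda x: min(rows, key=lambda r: abs(r[0] - x))[1]

def mean_absolute_error(model, rows):
    """The chosen performance metric: lower is better."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)

# Fit every candidate on the training rows, score each on held-out rows.
candidates = {"mean": fit_mean, "line": fit_line, "nearest": fit_nearest}
scores = {name: mean_absolute_error(fit(train), test)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {score:.2f}")
print("selected model:", best)
```

The point is the shape of the process rather than the particular models: several candidates are tried, and the winner is whichever performs best on the metric, not whichever best explains the data.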

A lot of scientific research has been theory confirming: a scientist has a model or theory of the world; they design and conduct an experiment to assess this model; then use hypothesis testing to confirm or negate that model based on the results of the experiment. With changes in data availability and computing, the value of exploratory analysis, data mining, and using data to generate hypotheses has increased dramatically (Carmichael 126).

Data science as a discipline has been at the forefront of utilizing increased computing abilities to conduct exploratory work.

Conclusion

A data scientist friend of mine once quipped to me that data science simply is applied computational statistics (cf. this). There is some truth in this: the mathematics of data science work falls within statistics, since it involves collecting, analyzing, and communicating data, and, with its emphasis on and use of computational data, would definitely be a part of computational statistics. The mathematics of data science is also very clearly applied: geared towards solving practical problems/needs. Hence, data science and statistics interrelate.

They differ, however, both in their formal definitions and practical understandings. Modern computation and big data technologies have had a major influence on data science. Within statistics, computational statistics also seeks to leverage these resources, but what has become “traditional” statistics does not (yet) incorporate these. I suspect in the next few years or decades, developments in modern computing, data science, and computational statistics will reshape what people consider “traditional” or “standard” statistics to be a bit closer to the data science of today.

For more details, see the following useful resources:

Ian Carmichael’s and J.S. Marron’s “Data science vs. statistics: two cultures?” in the Japanese Journal of Statistics and Data Science: https://link.springer.com/article/10.1007/s42081-018-0009-3
“Data Scientists Versus Statisticians” at https://opendatascience.com/data-scientists-versus-statisticians/ and https://medium.com/odscjournal/data-scientists-versus-statisticians-8ea146b7a47f
“Differences between Data Science and Statistics” at https://www.educba.com/data-science-vs-statistics/

Photo credit #1: Andrea Piacquadio at https://www.pexels.com/photo/woman-draw-a-light-bulb-in-white-board-3758105/

Photo credit #2: Carlos Muza at https://unsplash.com/photos/hpjSkU2UYSU

Photo credit #3: Hans-Peter Gauster at https://unsplash.com/photos/3y1zF4hIPCg

Photo credit #4: Kendall Lane at https://unsplash.com/photos/yEDhhN5zP4o


[i] Carmichael 118.

Using Data Science and Ethnography to Build a Show Rate Predictor

I recently integrated ethnography and data science to develop a Show Rate Predictor for an (anonymous) hospital system. Many readers have asked for real-world examples of this integration, and this project demonstrates how ethnography and data science can join to build machine learning-based software that makes sense to users and meets their needs.

Part 1: Scoping out the Project

A particular clinic in the hospital system was experiencing a large number of appointment no-shows, which produced wasted time, frustration, and confusion for both its patients and employees. I was asked to use data science and machine learning to better understand and improve their scheduling.

I started the project by conducting ethnographic research into the clinic to learn more about how scheduling occurs normally, what effect it was having on the clinic, and what driving problems employees saw. In particular, I observed and interviewed scheduling assistants to understand their day-to-day work and their perspectives on no-shows.

One major lesson I learned through all this was that when scheduling an appointment, schedulers are constantly trying to determine how many people to schedule on a given doctor’s shift to ensure the right number of people show up. For example, say 12-14 patients is a good number of patients for Dr. Rodriguez’s (made up name) Wednesday morning shift. When deciding whether to schedule an appointment for a given patient with Dr. Rodriguez on an upcoming Wednesday, the scheduling assistants try to determine, given the appointments currently scheduled for that shift, whether they can expect 12-14 patients to show up. This was often an inexact science. They would often have to schedule 20-25 patients on a particular doctor’s shift to ensure their ideal window of 12-14 patients would actually come that day. This could create the potential for chaos, however, with too many patients arriving on some days and too few on others.

This question – how many appointments can we expect or predict to occur on a given doctor’s shift – became my driving question to answer with machine learning. After checking in with the various stakeholders at the clinic to make sure this was in fact an important and useful question to answer with machine learning, I started building.

Part 2: Building the Model

Now that I had a driving, answerable question, I decided to break it down into two sequential machine learning models:

  1. The first model learned to predict the probability that a given appointment would occur, learning from the history of occurring or no-show appointments.
  2. The second model, using the appointment probabilities from the first model, estimated how many appointments might occur for each doctor’s shift.

The first model combined three streams of data to assess the no-show probability: appointment data (such as how long ago it was scheduled, type of appointment, etc.); patient information, especially past appointment history; and doctor information. I performed extensive feature selection to determine the best subset of variables to use and tested several types of machine learning models before settling on gradient boosting.

The second model used the probabilities from the first model as input data to predict how many patients to expect on each doctor’s shift. I settled on a neural network for the model.
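To make the hand-off between the two stages concrete, here is a minimal Python sketch. It is not the neural network described above: it swaps in the simplest possible second stage, which treats appointments as independent and combines their show probabilities directly. The probabilities below are made-up placeholders for stage-one output, and the 12-14 window is the illustrative one from Part 1.

```python
def attendance_distribution(show_probs):
    """Return dist where dist[k] = P(exactly k patients show up),
    assuming appointments are independent (a simplifying assumption)."""
    dist = [1.0]  # with zero appointments, zero patients show up
    for p in show_probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1 - p)   # this patient is a no-show
            new[k + 1] += mass * p     # this patient shows up
        dist = new
    return dist

# Hypothetical stage-one output: show probabilities for 22 booked appointments.
show_probs = [0.9, 0.8, 0.75, 0.7, 0.7, 0.65, 0.6, 0.6, 0.6, 0.55, 0.55,
              0.5, 0.5, 0.5, 0.45, 0.45, 0.4, 0.4, 0.35, 0.3, 0.3, 0.25]

# Expected number of arrivals is the sum of the individual probabilities.
expected_shows = sum(show_probs)

# Probability that the shift lands in the ideal window of 12-14 patients.
dist = attendance_distribution(show_probs)
prob_in_window = sum(dist[12:15])

print(f"booked: {len(show_probs)}, expected to show: {expected_shows:.1f}")
print(f"P(12 to 14 patients show up): {prob_in_window:.2f}")
```

A learned second stage can do better than this baseline when no-shows are correlated (bad weather, for instance, affects a whole shift at once), which the independence assumption here ignores.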

Part 3: Building an App

Next, I worked with the software engineers on my team to develop an app to employ these models in real time and communicate the information to schedulers as they scheduled appointments. My ethnographic research was invaluable for developing how to construct the app.

On the back end, the app calculated the probability that each future appointment would occur, updating with new calculations for newly scheduled or edited appointments. Once a week, it would incorporate that week’s new appointment data and shift attendance into each model’s training data and update those models accordingly.

Through my ethnographic research, I observed how schedulers approached scheduling appointments, including what software they used in the process and how they used each. I used that to determine the best ways to communicate that information, periodically showing my ideas to the schedulers to make sure my strategy would be helpful.

I constructed an interface to communicate the information that would complement the current software they used. In addition to displaying the number of patients expected to arrive, the calendar interface color-coded each shift: green if the machine learning algorithm predicted that the shift was underbooked, yellow if the shift was projected to have the ideal number of patients, and red if it was already expected to have too many patients. The color-coding allowed easy visualization of the information in the moment: when trying to find an appointment time for a patient, schedulers could easily look for the green shifts, or yellow if they had to, but steer clear of the red. When zooming in on a specific shift, each appointment would be color-coded as well (likely, unlikely, and in the middle) based on the probability that it would occur.
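The color-coding rule itself is simple enough to sketch in Python. The function names and the numeric thresholds here are assumptions for illustration only: the 12-14 window is the made-up example from Part 1, and the 0.4 and 0.7 probability cut-offs are placeholders, not the values the app used.

```python
def shift_color(expected_shows, ideal_min=12, ideal_max=14):
    """Map a shift's predicted attendance to a calendar color.

    ideal_min and ideal_max are hypothetical per-shift settings.
    """
    if expected_shows < ideal_min:
        return "green"    # underbooked: good place to add a patient
    if expected_shows <= ideal_max:
        return "yellow"   # already at the ideal number of patients
    return "red"          # already expected to have too many patients

def appointment_label(show_prob, low=0.4, high=0.7):
    """Bucket one appointment by its predicted show probability.

    The low and high cut-offs are placeholder values.
    """
    if show_prob >= high:
        return "likely"
    if show_prob < low:
        return "unlikely"
    return "in the middle"

# Hypothetical predictions for three shifts on the calendar.
for shift, expected in [("Mon AM", 9.5), ("Wed AM", 13.0), ("Fri PM", 16.2)]:
    print(shift, shift_color(expected))
```

Keeping the thresholds as parameters matters in practice, since the ideal number of patients differs by doctor and by shift.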

Conclusion

This is one example of a project that integrates data science and ethnography to build a machine learning app. I used ethnography to construct the app’s parameters and framework. It tethered the app to the needs of the schedulers, ensuring that the machine learning modeling I developed was useful to those who would use it. Frequent check-ins before each step in its development also helped confirm that my proposed concept would in fact help meet their needs.

My data science and machine learning expertise helped guide me in the ethnographic process as well. Being an expert in how machine learning worked and what sorts of questions it could answer allowed me to easily synthesize the insights from my ethnographic inquiries into buildable machine learning models. I understood what machine learning was capable (and not capable) of doing, and I could intuitively develop strategic ways to employ machine learning to address issues they were having.

Hence, my dual role as an ethnographer and data scientist benefitted the project greatly. My listening skills from ethnography enabled me to uncover the underlying questions/issues schedulers faced, and my data science expertise gave me the technical skills to develop a viable machine learning solution. Without listening patiently through extensive ethnography, I would not have understood the problem sufficiently, but without my data science expertise, I would have been unable to decipher which question(s) or issue(s) machine learning could realistically address and how.

This exemplifies why joint expertise in data science and ethnography is invaluable in developing machine learning software. Two different individuals or teams could complete each part separately – an ethnographer (or ethnographers) analyzing the users’ needs and a data scientist (or data scientists) then determining whether machine learning modeling could help. But this seems unnecessarily disjointed, potentially producing misunderstanding, confusion, and chaos. Adding an additional layer of people can easily lead either to the ethnographer(s) uncovering needs far too broad or complex for a machine learning-based solution to help with or to the data scientist(s) trying to impose their machine learning “solution” on a problem the users do not have.

Developing expertise in both makes it much easier to simultaneously understand the problems or questions in a particular context and build a doable data science solution.

Photo credit #1: DarkoStojanovic at https://pixabay.com/photos/medical-appointment-doctor-563427/  

Photo credit #2: geralt at https://pixabay.com/illustrations/time-doctor-doctor-s-appointment-481445/

Photo credit #3: Pixabay at https://www.pexels.com/photo/light-road-red-yellow-46287/  

You Know You’re a Business Anthropologist If… (Funny)

You know you’re a business anthropologist if…

  1. You ask at least 500 follow-up questions when your supervisor gives you a project to really understand the full context. 
  2. You have a prepared spiel about how what you studied was different than digging up Mayan artifacts (unless that happened to be what you did).
  3. You constantly ask people how they feel when completing a task or what they think of the process.
  4. You try to reimagine and redesign any object or process that your organization will let you get your hands on.  
  5. You have critiqued every organization that has hired you.
  6. You have the strangest knick-knacks on your desk from around the world.
  7. You take triple the notes anyone else does in a meeting, recording in detail everyone’s statements and body postures.
  8. In regular conversation, you interrogate your colleagues like you’re leading an interview.
  9. You frequent your company’s “watercooler spots” – informal places to gather to hang out. This is where the real work happens.
  10. You rage against top-down procedures and formal hierarchy every time you encounter them.
  11. You have resolved to never use PowerPoint for your presentations.
  12. Any time you hear a French word, your mind immediately goes to the French theorist with the most similar sounding name.

I intend this as a fun little exercise thinking about the quirks and idiosyncrasies of working as an anthropologist in the business world. 

Photo Credit: Toa Heftiba at https://unsplash.com/photos/FV3GConVSss

How Do I Become a Data Scientist? The Four Basic Strategies to Learn Data Science

Aspiring data scientists will frequently ask me for recommendations about the best way to learn data science. Should they try a bootcamp or enroll in an online data science course, or any of the myriad options out there?

In the last several years, we have seen the development of many different types of educational programs that teach data science, ranging from free online tutorials to bootcamps to advanced degrees at universities, and the pandemic seems to have fostered the establishment of even more programs to meet the increased demand for remote learning. Although probably a good thing overall, having more options increases the complexity of deciding which one to do and the potential noise of programs upselling their services.

This article is a high-level survey of the four basic types of data science education programs to help you think about which might work best for you. Without already knowing data science, it can be difficult to assess how effective a program is at teaching it. Hopefully, this article will help break that chicken-and-the-egg conundrum.

These are the four basic ways to learn data science:

  1. Do-it-yourself learning
  2. Online courses
  3. Bootcamps
  4. Master’s degree or other university degree in data science (or related field)

I will discuss them in order from the cheapest to most expensive. I also included two hybrid strategies that combine a few of these that are worth considering as well. This table provides a quick, high-level synopsis of each one:

Option 1: Do-It-Yourself Online

There are tons of free, online data science resources that can either teach data science from scratch or explain just about any data science content you could possibly want to know. These range from tutorials for those who learn by doing, like W3Schools, to videos on YouTube and other sites for audio learners, like Andrew Ng’s YouTube series, to articles for visual learners who enjoy reading, like Towards Data Science. You could scour the internet and teach yourself. This approach has the advantages of being free and perfectly flexible to tailor to your schedule.

But as a former teacher, I have found independent learning is not for everyone. You must be entirely self-motivated and self-structured to teach yourself like this. So, know yourself: are you the type of person who could learn well completely independently like this?

Education programs tend to provide these resources that you might lack if you went it alone:

1) Curriculum Oversight: Data science experts in any education program generally establish some kind of data science curriculum for you that includes the necessary topics in the field. Many people who are new to data science do not know yet what data science concepts and skills are most important to learn about. This can create a chicken and egg problem for self-learners who must learn the field at least a little to know the most important items to learn in the first place. Data science programs help circumvent this by giving you an initial curriculum to started with.

2) Guidance of the Norms of the Field: In addition to teaching the material, education programs implicitly introduce students to data science norms and ways of thinking. Even though there are times to deviate from established custom, these norms are important when first working on teams with fellow data scientists. Sometimes self-learners learn the literal material but do not gather the implicit perspectives that enable their incorporation into the data science community.

3) External Social Accountability: Education programs provide a form of social accountability that subtly encourages you to get the work done. Self-learners must rely almost exclusively on their own self-motivation and self-accountability, which, in my experience, works for some people but not others.

4) Social Resources: Education programs (especially ones that meet either in-person or virtually) provide various people – teachers, students, and in some cases mentees/underlings – with whom you can talk through problems and who can help you discover your weaknesses and shortcomings and determine ways to address them. Minute programming details that beginners easily overlook but experts might easily spot can cause your entire program to fail. To learn independently, you will have to either solve all of these yourself or find data science friends or family who are willing to help you.

5) Certification of Skills: Education programs bestow degrees, grades, and other certifications as external proof that you do, in fact, possess the requisite skills for a data science role. Learning on your own, you must prove to employers by yourself that you have these skills. Developing a portfolio of thought-provoking projects you have done is the best way to demonstrate this.

6) Guidance in Forming Projects: An impressive project works wonders for showcasing your data science skills. In my experience, beginners to data science often do not yet possess the skills to create, complete, and market a thought-provoking yet doable project, and one of the most important roles data science educators can have is helping students think through how to develop one. You must do this yourself when learning alone.

One can overcome each of these deficits. I have found that for people who learn well independently, its cost and flexibility advantages easily outweigh these cons. Thus, the crucial question is, Would this form of independent learning work for you? In my experience, it works for a comparatively small percent of people, but for those it works for, it is a great option.

If you do decide to teach yourself, I would recommend considering the following:

1) Be conscientious about your learning style when crafting your material. For example, if you are a visual learner, then reading online resources would be best, but if you are more of an auditory learner, then I would recommend watching video tutorials/lectures on, say, YouTube.

2) If you have data science friends willing to help you, they can be a great asset, particularly in determining what data science materials to learn, troubleshooting any coding issues you might have, and/or developing a good project(s).

3) People in general learn data science best by doing data science. Avoid the common trap of only reading about data science without getting your hands dirty and experimenting yourself (preferably with unclean, annoying, real-world data, not already trimmed, “textbook perfect” data). Using pristine data to first learn the concepts is fine, but make sure you graduate yourself to practicing with real-life dirty data.
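To make "dirty data" concrete, here is a minimal sketch of the kind of cleanup real-world data tends to need. The rows, field names, and the `clean_row` helper are all made up for illustration; real datasets will have their own quirks.

```python
# Made-up survey rows with typical real-world problems:
# stray whitespace, inconsistent casing, missing values, numbers stored as text.
raw_rows = [
    {"name": "  Alice ", "age": "34", "city": "new york"},
    {"name": "BOB", "age": "", "city": "New York "},
    {"name": "carol", "age": "n/a", "city": "CHICAGO"},
]


def clean_row(row: dict) -> dict:
    """Normalize text fields and convert age to int, or None if unusable."""
    age_text = row["age"].strip()
    return {
        "name": row["name"].strip().title(),
        "age": int(age_text) if age_text.isdigit() else None,
        "city": row["city"].strip().title(),
    }


cleaned = [clean_row(row) for row in raw_rows]
for row in cleaned:
    print(row)
```

Textbook datasets rarely require steps like these, which is why practicing on messier data is worth the annoyance.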

Option 2: Online Course

A variety of online courses exist. Most of them are relatively cheap (usually around $20-$50 a month or $100-$200 per course). For example, at the time of writing this, Udemy has an introductory data science course for a flat rate of $94.99, and Coursera has a course for $19.99 a month (both with prices varying based on discounts and other special deals). Online courses are generally the cheapest of the courses you can enroll in, and because of the length of most, you will probably have to take several levels of courses (introductory to advanced) to learn the field.

Another advantage is that they are flexible: You can learn at your own pace, based on the needs of your schedule. This is really valuable for people who are also working a job and studying on the side, or who have family commitments and/or other obligations complicating their schedules. Keep in mind, though, that because you often pay per month, how many months you take often dictates the final cost. At the end of the day, spending an extra $100 or so to take a few more months to complete the course is still much cheaper than the other course options.
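As a rough sketch of that arithmetic, the snippet below totals a monthly subscription over different completion times. The $19.99 rate is the monthly figure quoted above; the `subscription_cost` function and the month counts are illustrative assumptions, not figures from any provider.

```python
def subscription_cost(monthly_rate: float, months: int) -> float:
    """Total cost of a course billed per month (ignores discounts and taxes)."""
    return round(monthly_rate * months, 2)


MONTHLY_RATE = 19.99  # monthly rate quoted in the text; actual prices vary

# Hypothetical completion times, in months.
for months in (3, 6, 9):
    print(f"{months} months: ${subscription_cost(MONTHLY_RATE, months):.2f}")
```

At this rate, taking nine months instead of three adds roughly $120 to the total.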

On the other hand, like doing it yourself, they tend to lack the social benefits of classroom learning: instructors to ask questions of and to provide external social accountability, and fellow students to work alongside. In my experience, this makes them very challenging for some learners, while others are comparatively unaffected by it.

In addition, many online courses provide more of a cursory summary of data science and lack the complex projects that are necessary both to learn data science and to market yourself to others. Even though there are exceptions, online courses are often better at introducing data science concepts than at exploring them in depth. Many focus on canned problems with already cleaned, ready-to-do data instead of letting you practice on the messy, complex, and often just plain silly data most data scientists actually have to use at their jobs. They also often lack the personnel for one-on-one coaching to mentor each student through portfolio-building projects with complex data.

Thus, online courses tend to provide good, cost-effective introductions to data science, helpful to see whether you like the field (see Hybrid #1 below), but do not generally provide the refined training necessary to become a data scientist. Now, some programs are evolving their courses. Especially as the pandemic increases demand for remote learning, online learning platforms are developing more robust online data science courses. If you choose to learn by taking online courses, I recommend supplementing them with your own projects to get experience practicing data science work and to have something to showcase in job interviews.

Hybrid #1: Use an Online Course to Introduce Data Science (or Programming)

If you are completely new to data science, an online course can provide a low-cost, structured space to get a sense for what the field entails and determine whether it is a good fit for you. I have seen many people enroll in bootcamps or university degree programs costing several thousand dollars only to learn there that they do not like doing data science work. An online course is a much cheaper space to discern that.

You could always explore data science yourself for free to decide whether you like it (see Option 1) instead of taking an online course, but I have found that many people who have never seen data science before do not know what to look up in the field to get started. An introductory online course is not that expensive, and the initial orientation into the major topic areas can be well worth the cost.

There are three basic versions of this approach:

1) If you do not already know a programming language, take an online programming course. I explained in this article why I would recommend Python as the language to learn (with Julia as a close second). If you do not like programming, then you have learned the lesson that you should not become a data scientist, and even if you do not end up in data science, programming is such a valuable skill that having some training in it will only help your occupational prospects in most other related fields.

2) If you do know a programming language, take an introductory data science course. These often provide a high-level overview of data science, especially helpful for people who need to work with data scientists and understand what they are talking about. If you need a math refresher, this is a great option as well.

3) I have seen prospective data scientists take online data analytics courses to prepare them for and determine their potential interest in data science. I would not recommend this, however. Even though data scientists will sometimes treat data analytics as a “diet” or “basic” version of data science, data analytics is a different field requiring different skills. For example, data analytics courses typically do not include rigorous programming. They generally focus on R and SQL if they teach programming at all, which are fine languages for data analytics and statistics but not enough for data science (for which you would want a language like Python). Data analytics and data science also generally emphasize different fields of math: data analytics tends to rely on statistics while data science relies on linear algebra, for example. Thus, what you would learn in those courses would not apply to data science as much as you would think. Now, if you are unsure of whether you would like to become a data scientist or data analyst, then a data analytics course might help you understand and get a feel for data analytics, but I would not use one to assess whether data science is a good fit for you.

Once you complete the online course, if you still think you would enjoy doing data science work, then you can choose any of the options to learn the field in more depth. This may seem like just getting you back to square one, but by taking an introductory programming or data science course, you have levelled yourself up so to speak and are more ready to face the “boss battle” of becoming a data scientist.

Option 3: Data Science Bootcamps

Data science bootcamps have also become popular. They tend to be intensive training programs lasting several weeks to several months (in my experience, often ranging from 2 to 6 months). The traditional pre-pandemic bootcamp was in-person and would often cost around $10,000 to $15,000. Metis’s bootcamp is a good example of what they often look like.

Their biggest pros are that they offer the advantages of classroom education far more cheaply and in much less time than getting a university degree. They are a significant step up in cost from the previous options (see Con 2 below), but they seek to provide a comparable (but less academically advanced and in-depth) scope of knowledge as a master’s degree in data science for a significantly lower price and in a fraction of the time. Even though it can often make their pace feel intense, the good bootcamps tend to mostly succeed at providing this. This makes them a great option for anyone who knows they want to become a data scientist. Finally, unlike the previous options, you get a teacher(s) to ask questions of and to motivate you, and a set of fellow students to struggle through concepts with. The best programs offer occupational coaching and build strong networks in data science communities to help their students find jobs afterwards.

They have some major cons, however:

1) They can feel fast-paced, unloading complex concepts in a short amount of time. Many of my friends who have done bootcamps have reported feeling cognitive whiplash. Expect those weeks/months to be mentally intense and to subsume your life. Data science bootcamps are often 9-5 full-time jobs during that time, and you will likely be too mentally exhausted to work on other things in the evenings or weekend (plus in some cases you will have homework to complete then anyways). A few weeks or months is not terribly long for such an ordeal, but it makes them much less flexible than the previous options. For example, this forces many students to take time from their current jobs to complete the bootcamp and to limit their social, familial, and other obligations as much as they can during their bootcamp. This makes it difficult for anyone unable to take time off work, with busy social or familial lives, or otherwise with a lot going on.

2) At several thousand dollars, they are noticeably more expensive than the previous options (but still much cheaper than universities). Some offer scholarships and other services on a need basis, but even then, the opportunity cost of having to put a job on hold can still be expensive. Given the generally high salaries in the field, landing a data science job would likely make the money back, but it takes a hefty initial investment.

This makes it an especially poor option for anyone thinking about data science but not sure whether they want to do it. $10,000 is a lot to spend to simply learn you do not like the field, and there are many cheaper ways to initially explore the field (see especially Hybrid #1). The cost still might be worth it, however, for anyone who really wants to become a data scientist but does not yet possess key skills and knowledge.

3) At the time of writing this, the Covid-19 pandemic has forced most data science bootcamps to meet remotely anyways, making their services far more similar to the much cheaper online courses. That said, many have sought to simulate the classroom environment virtually, trying to provide some type of social environment, but the classroom environment was a major advantage that made their significant increase in costs over the previous options worthwhile.

4) They tend to exist in large cities (especially tech centers). For example, bootcamps in the United States tend to concentrate in New York City, Los Angeles, Chicago, San Francisco, etc. Prior to the pandemic, anyone not living in those places would have to travel and temporarily reside wherever their chosen bootcamp was, an additional expense.

5) They are often difficult for people who do not know programming and for those who do not know college-level mathematics like linear algebra, calculus, and statistics. If you do not know programming, I would recommend first learning a programming language like Python (for more, see this article I wrote explaining why to learn Python of all languages) through a cheap online course and/or online tutorials. Some data science bootcamps offer a preparatory introductory online course that teaches the prerequisite coding and math skills for those who do not yet have them. These are worth consideration as well, but keep in mind the equivalent online course might be cheaper with roughly the same educational value.

If you decide to do a bootcamp, these criteria are important when researching which bootcamp to choose:

1) Project Orientation: How well do they enable you to practice data science through portfolio-building projects, and how impressive are the projects their alum did? The best data science bootcamps generally teach in a project-oriented fashion.

2) Job-Finding Resources and/or Job Guarantee: What resources or coaching do they give to help you find a job afterwards? Networking, presenting yourself, and interviewing, for example, are important skills for finding a job as a data scientist, and in addition to teaching you the technical curriculum, the best programs tend to find occupational coaches to help specifically with the job-finding process. Also, some programs give a job guarantee: if you do not find a data science job within a certain number of months after graduating, then they refund tuition. This generally shows they take job finding seriously enough to risk their own money on it (although do check the fine print on the guarantee to see the exact terms they are agreeing to).

3) Alum Resources: A surprisingly important detail to consider is how many resources a bootcamp invests in cultivating alumni networks. I was surprised by how receptive to meeting/networking the alum of the online bootcamp I did were, and by how satisfied alum tend to be with the bootcamp. The effort a bootcamp makes to work with and maintain relationships with its alum impacts this significantly. Connectedness with alum can be difficult to assess when researching programs from afar, but asking whether you can speak with alum(s) to learn about their experiences with the program, checking a bootcamp’s alum activity on LinkedIn and other social media websites, and asking about what kind of networking opportunities with alum they facilitate can be great ways to assess how intentional a program is about cultivating relationships.

4) Scholarship Options: Some programs offer full or at least partial scholarships based on need. Clearly, ways to knock down the cost of the bootcamp would be great, especially if a bootcamp seems like an ideal option for you, but the cost seems too daunting.

Hybrid #2: Online Bootcamp

Online bootcamps tend to possess the schedule flexibility of online courses but offer more rigorous, personal (albeit remote) learning, allowing you to combine the best aspects of data science bootcamps and online programs. They are also generally cheaper than traditional bootcamps (yet also more expensive than an online course). Finally, they tend to be a much better option for those who do not live in a major city that happens to have a local data science bootcamp program. The pandemic, if anything, has probably helped produce even more online bootcamp programs, since it has forced data science bootcamps to teach virtually.

I enrolled in Springboard’s online data science bootcamp in 2017, a great example of an online bootcamp. At the time, they cost roughly $1,000 a month (at the time of writing, their standard rate is $1,490 a month, and they state that their program generally takes six months). This is cheaper than traditional bootcamps but still a few thousand dollars, totaling around $10,000 for six months. They had an online curriculum typical of online courses but also provided weekly virtual meetings with an instructor to discuss the material and any issues you were having. Now they seem to include virtual lessons online. This individualized training and remote classroom environment are the main value adds over an online course, and you must assess whether, for you, they would be worth the additional cost. They are self-paced, providing much greater flexibility on when and how often you work than typical bootcamps. They also refunded your money if you did not find a job within six months after completion.

If you choose this option, be aware of the potential pitfalls of both online courses and traditional bootcamps. Just like with online programs, you will need to evaluate whether you are comfortable learning the curriculum by yourself (even though you can meet with a mentor for major issues once a week, you would be doing the bulk of the learning by yourself throughout the week). Like with traditional bootcamps, expect the learning to be mentally intense, and make sure they help you develop portfolio-building projects and provide job-finding resources and training.

Option 4: Master’s Degree or Other University Degree

The final option is to go back to school to get a degree in data science. This is the most expensive and time-consuming option: a master’s degree (a logical choice if you already have a bachelor’s) is generally the shortest, taking two years. But they cost upwards of $100,000. Even if partial or full scholarships decrease that cost, the opportunity cost of spending several years of your life in school is still higher than any of the other options. It can give a resume boost, however, if you know how to leverage it properly, which will likely increase your salary to make up for the initial cost. I would only recommend getting a master’s degree if you already know you love data science (say because you have already been working in the field, preferably if you also have already figured out the specific area of data science you want to do) but want to take your skills, technique, and/or theoretical knowledge of how the models work to the next level.

The best way to refine your data science skills is by doing data science: finding or creating contexts to push you as you practice data science. Graduate schools are not the only potential environment to refine one’s data science skills (e.g., all the previous options could involve that if done well), and even though graduate schools can be great at providing rigor, these other options can be a lot cheaper and more flexible. Finally, at the time of writing this, at least, the demand for data scientists exceeds the number of actual people in the field, and so getting a data science job without an “official” university degree in data science is pretty realistic.

University data science degree programs are relatively new – generally only a few years old. Thus, not all universities have literal data science degrees or departments but instead require that you enroll in a related program like computer science, statistics, or engineering to learn data science. This does not always mean these other programs are bad or unhelpful, but it often means you will have to perform extraneous or semi-extraneous tasks to data science proper in order to complete your degree (in some cases with minimal help from faculty from other fields).

When considering a program, you should make sure they are proactive about teaching professional and not just academic data science skillsets. These are the specific questions I would research to assess how well they might prepare you for non-academic data science jobs:

1) What proportion of their faculty currently work or at least have worked in the industry as a data scientist (or other similar job title)?

2) How well connected is the department with local organizations, and might they be able to leverage these relationships to help you work with these organizations through a work-study program or internship during the program and/or employment afterwards?

3) Will they help you build – or at least give you the flexibility to build – your thesis into an applied data science project that would boost your resume to future employers?

If your chosen program lacks these, I would strongly recommend building resume/portfolio-boosting projects and networking with local data scientists on the side while completing the program. This takes considerable time and energy, so ideally your department would actively help you in this work, instead of requiring that you do it on your own while also completing all their work.

Funding options are something else to consider. Are they willing to fund your degree fully or at least partly? Work-study programs where you work while getting your master’s can be a great way to graduate with no debt and gain resume-building work experiences (although they can make you busy). I benefitted greatly from working as a data scientist while completing my master’s, both because I graduated with no debt and because it allowed me to practice and refine my skills.

Finally, most universities require that you live nearby and attend physically (at least before and likely after the pandemic). Thus, you might have to find a place near you or be willing to relocate for a few years if there is not a data science degree program nearby. If so, you should factor moving expenses into the cost of doing the program.

Conclusion

Learning data science can be an awesome yet daunting prospect, and finding the right strategy for you is complicated, particularly given all the pedagogical, logistical, and financial considerations. Hopefully, this article has helped you think through how to journey forward. 

Photo credit #1: geralt at https://pixabay.com/photos/woman-programming-glasses-reflect-3597101/  

Photo credit #2: Anastase Maragos at  https://unsplash.com/photos/OaFESrP2hhw

Photo credit #3: mohamed_hassan at https://pixabay.com/photos/training-course-3207841/

Photo credit #4: Jukan Tateisi at https://unsplash.com/photos/bJhT_8nbUA0

Photo credit #5: heylagostechie at  https://unsplash.com/photos/IgUR1iX0mqM   

Photo credit #6: Brooke Cagle at https://unsplash.com/photos/WHWYBmtn3_0

Photo credit #7: A_Ginard at https://pixabay.com/photos/architecture-modern-buildings-5084075/

Data Visualization 102: The Most Important Rules for Making Data Tables

In a previous post about data visualization in data science and statistics, I discussed what I consider the single most important rule of graphing data. In this post, I am following up to discuss the most important rules for making data tables. I will focus on data tables in reporting/communicating findings to others, as opposed to the many other uses of tables in data science, say, to store, organize, and mine data.

To summarize, graphs are like sentences, conveying one clear thought to the viewer/reader. Tables, on the other hand, can function more like paragraphs, conveying multiple sentences or thoughts to get an overall idea. Unlike graphs, which often provide one thought, tables can be more exploratory, providing information for the viewer/reader to analyze and draw his or her own conclusions from.

Table Rule #1: Don’t be afraid to provide as much or as little information as you need.

Paragraphs can use multiple sentences to convey a series of thoughts/statements, and tables are no different. One can convey multiple pieces of information that viewers/readers can look through and analyze at their own leisure, using the data to answer their own questions, so feel free to take up as much space as you need. Tables that run several pages long are fair game and, in many cases, absolutely necessary (although they often end up in appendices for readers/viewers needing a more in-depth take).

In my previous data visualization post, I gave this bar chart as an example of trying to say too many statements for a graph:

This is a paragraph’s worth of information, and a table would represent it much better.[i] In a table, the reader/viewer can explore the table values by country and year themselves and answer whatever questions he or she might have. For example, if someone wanted to analyze how a specific country changed over time, he or she could do so easily with a table, and if he or she wanted to compare the immigration ratios between countries in a specific decade, that is possible as well. In the graph above, each country’s subsegment starts in a different place vertically for each decade column, making it hard to compare the sizes visually, and since each decade has dozens of values, the latter analysis is visually difficult to decipher as well.

But, at the same time, do not be afraid to put a sentence’s or graph’s worth of data into a table, especially when such data is central to what you are saying. Sometimes writers include one-sentence paragraphs when that single thought is crucial, and likewise, a single-statement table can have a similar effect. For example, writing a table for a single variable helps convey that that variable is important:

Gender  | Some Crucial Result
Male    | 36%
Female  | 84%

Now, sometimes in these single statement instances, you might want to use a graph instead of a table (or both), which I discuss in more detail in Rule #3.

Table Rule #2: Keep columns consistent for easy scanning.

I have found that when viewers/readers scan tables, they generally subconsciously assume that all variables in a column are the same: same units and type of value. Changing values of a column between rows can throw off your viewer/reader when he or she looks at it. For example, consider this made-up study data:

                | Control Group (n = 100) | Experimental Group (n = 100)
Mean Age        | 45                      | 44
Median Age      | 43                      | 42
Male No. (%)    | 45 (45%)                | 36 (36%)
Female No. (%)  | 55 (55%)                | 64 (64%)

In this table, the rows each mean different values and/or units. So, for example, going down the control column, the first row is mean age measured in years. The second row switches to median age, a different type of value than mean (although the same unit of years). The final two rows convey the number and percentages of males and females of each: both a different type of value and a different unit (number and percent unlike years). This can be jarring for viewers/readers who often expect columns to be of the same values and units and naturally compare them as if they are similar types of values.

I would recommend transposing it like this, so that the columns represent the similar variables and the rows the two groups:

                              | Mean Age | Median Age (IQR) | Male No. (%) | Female No. (%)
Control Group (n = 100)       | 45       | 43 (25, 65)      | 45 (45%)     | 55 (55%)
Experimental Group (n = 100)  | 44       | 42 (27, 63)      | 36 (36%)     | 64 (64%)
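If the table lives in code rather than in a document, the transposition itself is a one-liner. The following is a minimal sketch in plain Python using the values of the first (untransposed) table; the variable names are my own, and the IQR values in the table above are not included because the first table does not contain them.

```python
# Original layout: one row per statistic, one column per group.
# The empty string is the blank top-left corner cell.
header = ["", "Control Group (n = 100)", "Experimental Group (n = 100)"]
rows = [
    ["Mean Age", "45", "44"],
    ["Median Age", "43", "42"],
    ["Male No. (%)", "45 (45%)", "36 (36%)"],
    ["Female No. (%)", "55 (55%)", "64 (64%)"],
]

# zip(*...) swaps rows and columns, so each statistic becomes a column
# and each group becomes a row.
transposed = [list(column) for column in zip(header, *rows)]

for row in transposed:
    print(" | ".join(row))
```

After the swap, every column holds a single kind of value in a single unit, which is what readers tend to assume when they scan down a column.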

Table Rule #3: Don’t be afraid to also use a graph to convey magnitude, proportion, or scale

A table like the gender table in Rule #1 conveys pertinent information numerically, but numbers themselves do not visually show the difference between the values.

Gender  | Some Crucial Result
Male    | 36%
Female  | 84%

Graphs excel at visually depicting the magnitude, proportion, and/or scale of data, so, if in this example, it is important to convey how much greater the “Some Crucial Result” is for females than males, then a basic bar graph allows the reader/viewer to see that the percent is more than double for the females than for the males.

Now, to convey this visual clarity, the graph loses the ability to precisely relate the exact numbers. For example, looking at only this graph, a reader/viewer might be unsure whether the males are at 36%, 37%, or 38%. People have developed many graphing strategies to deal with this (ranging from making the grid lines sharper to writing the exact numbers on top of, next to, or around the segment, among others), but combining the graph and table in instances where one needs both to convey the exact numbers and to convey a sense of their magnitude, proportion, or scale can also work well:
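A plain-text sketch of that combination puts the exact number and a proportional bar on the same row. It uses the percentages from the gender table; the `bar_row` helper and the 50-character bar width are assumptions made for illustration, and a real report would more likely pair a proper chart with a table.

```python
results = {"Male": 36, "Female": 84}  # percentages from the example table


def bar_row(label: str, percent: int, width: int = 50) -> str:
    """Format one row: label, exact percentage, and a proportional text bar."""
    bar = "#" * round(percent / 100 * width)
    return f"{label:<8}{percent:>4}%  {bar}"


for label, percent in results.items():
    print(bar_row(label, percent))
```

The reader gets the exact figure from the number and the sense of scale from the bar length, without having to estimate either one from the other.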

Finally, given that tables can convey multiple statements, feel free to use several graphs to depict the magnitude, proportion, or scale of one table. Do not try to overload a multi-statement table into a single, incomprehensible graph. Break down each statement you are trying to relate with that table and depict each separately in a single graph.

Conclusion

If graphs are sentences, then tables can function more like paragraphs, conveying a large amount of information that makes more than one thought or statement. This gives space for your reader/viewer to explore the data and interpret it on their own to answer whatever questions they have.

Photo/Table credit #1: Mika Baumeister at https://unsplash.com/photos/Wpnoqo2plFA

Photo/Table credit #2: Linux Screenshots at https://www.flickr.com/photos/xmodulo/23635690633/


[i] Unfortunately, I do not have the data myself that this chart uses, or I would make a table for it to show what I mean.

The Best Programming Languages for Data Science and Machine Learning

Newcomers to data science or artificial intelligence frequently ask me the best programming language to learn to build machine learning algorithms. Thus, I wrote this article as a reference for anyone who wants to know the answer to that question. These are what I consider the four most important languages, ranked in terms of usefulness based on both overall popularity within the data science community and my own personal experiences:

Best Programming Languages for Machine Learning:
#1 Choice: Python
#2 Choice: R
#3 Choice: Java
#4 Choice: C/C++

#1 Programming Language: Python

Python is the most popular language to use for machine learning and for three good reasons.

First, its package-based style allows you to utilize efficient machine learning and statistical packages that others have made, preventing you from having to constantly reinvent the wheel for common problems. Many if not most of the best packages (like NumPy, pandas, scikit-learn, etc.) are in Python. This almost allows you to “cheat” when programming machine learning algorithms.
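As a small illustration of what not reinventing the wheel looks like, the sketch below computes a standard deviation by hand and then with one imported call. It uses Python's built-in `statistics` module purely as a stand-in so that the example runs without installing anything; packages like NumPy or scikit-learn follow the same import-and-call pattern on a much larger scale. The sample values are made up.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample data

# Hand-written population standard deviation.
mean = sum(values) / len(values)
manual_std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# The same result from an existing, tested module in a single call.
library_std = statistics.pstdev(values)

print(manual_std, library_std)
```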

Second, Python is a powerful and flexible all-purpose language, so if you are building a machine learning algorithm to do something, then you can easily build the code for the overall product or system in which you will use the algorithm without having to switch languages or software. It supports object-oriented, functional, and procedural programming styles, giving you the flexibility to use whatever style or combination of styles you like best or that fits the specific context.
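To make the three styles concrete, here is a minimal sketch that solves the same toy problem (summing the squares of the even numbers in a list) once in each style. The problem and the `NumberSet` class are made up for illustration.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

# Procedural style: an explicit loop and a running total.
total = 0
for n in numbers:
    if n % 2 == 0:
        total += n * n

# Functional style: compose filter, map, and reduce.
functional_total = reduce(
    lambda acc, square: acc + square,
    map(lambda n: n * n, filter(lambda n: n % 2 == 0, numbers)),
    0,
)


# Object-oriented style: bundle the data and the behavior in a class.
class NumberSet:
    def __init__(self, values):
        self.values = list(values)

    def sum_of_even_squares(self):
        return sum(n * n for n in self.values if n % 2 == 0)


oo_total = NumberSet(numbers).sum_of_even_squares()

print(total, functional_total, oo_total)
```

All three produce the same answer; which one reads best depends on the surrounding code and on personal taste.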

Third, unlike Java or C++, Python does not require elaborate setup before you can run a single line of code. You can easily build more coding infrastructure if you need it, but if you only need to run a simple command or test, you can start immediately.
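
As a rough illustration, the snippet below is a complete Python program: there is no class, main method, or build step to write first. The numbers are hypothetical.

```python
# This file is the whole program; it can be run as-is or typed into a notebook cell.
visits = [12, 7, 9, 15]  # hypothetical daily counts
average_visits = sum(visits) / len(visits)
print(average_visits)
```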

When I program in Python, I personally love using Jupyter Notebook, since its interface allows me both to code and to easily present my code and findings as a report or document. Another data scientist can read and analyze my code and its output side by side. Because of this, I wish more data scientists published their papers and reports in Jupyter Notebook or other notebooks like it.

If you only have time to learn a single programming language for machine learning, I would strongly recommend it be Python. The next three languages, R, Java, and C++, do not match its ease and popularity within data science.

#2 Programming Language: R

R is a popular language among statisticians, specifically tailored for advanced statistical analysis. It includes many well-developed packages for machine learning but is not as popular with data scientists as Python. For example, in Towards Data Science’s survey, 57% of data scientists reported using Python, with 33% prioritizing it, while only 31% reported using R, with 17% prioritizing it. This seems to show that R is a complementary, not primary, language for data science and machine learning. Most R packages have an equivalent in Python (and, to some extent, the other way around). Unlike Python, which is an all-purpose language that can do much more than analyze data and develop machine learning algorithms, R is specifically tailored to statistics and data analysis and cannot do much beyond that. That said, R programmers keep developing new packages for it, steadily expanding what it can do.


#3 Programming Language: Java

Java was once the most popular language around, but Python has dethroned it in the last few years. As an avid Java programmer who programs in Java for fun, I find it heartbreaking to put it so far down the list, but Python is clearly a better language for data science and machine learning. If you are working in an organization or other context that still uses Java for part or all of its software infrastructure, then you may be stuck using it, but most recent developments, particularly in machine learning, have occurred in Python and in R (and a few other languages). Thus, if you use Java, you’ll frequently find yourself needlessly reinventing the wheel.

Plus, one major con of Java is that quick, on-the-go analysis is not practical, since you must write a whole program structure before you can run a single line of code. Java can still be popular in certain contexts where the surrounding applications or software that use the machine learning algorithms are written in Java, which is common in finance, front-end development, and companies that have long used Java-based software.

#4 Programming Languages: C/C++

The same Towards Data Science survey I mentioned above lists C/C++ as the second most popular data science and machine learning language after Python. Java follows it closely, yet I ranked Java third, ahead of C/C++, because I personally find Java to be a better overall language than C or C++. In C or C++, you may frequently find yourself reinventing the wheel (having to develop machine learning algorithms that others have already built in Python), but in some backend systems that have been built in C or C++, like those in engineering and electronics, you do not have much of an option. C++ also shares a problem with Java: it lacks the ability to do quick, on-the-go coding without having to build a whole infrastructure.

Conclusion

For a beginner to the data science scene, learning a single programming language is the most helpful way to enter the field. Use learning a programming language to assess whether data science is for you: if you struggle with programming and do not like it, then developing machine learning algorithms for a living is probably not a good fit for you.

Many groups are trying to develop software that enables machine learning without having to program: DataRobot, Auto-WEKA, RapidMiner, BigML, and AutoML, among many others. The pros and cons and the successes and failures of these tools warrant a separate blog post (one I intend to write eventually). As of now, though, they have not replaced programming languages, either in their practical ability to develop complex machine learning algorithms or in demonstrating that you have the technical computational and programming skills for the field.

Depending on where you work and the type of field or tasks you are doing, you might end up using the language(s) or software your team works with so that you can easily collaborate on projects with them. Some areas of work or tasks favor certain packages and languages. If you demonstrate that you already know a complex programming language like Python (or Java or C++), even if it is not the team’s preferred language, then you will likely show any hiring manager that you can learn their specific language or software.

Photo credit #1: ThisIsEngineering at https://www.pexels.com/photo/woman-coding-on-computer-3861958/

Photo credit #2: Hitesh Choudhary at https://unsplash.com/photos/D9Zow2REm8U

Photo credit #3: thekirbster at https://www.flickr.com/photos/kirbyurner/30491542972/in/photolist-MQRUEh-2g3E1wf-Nsr8q9-HDKJxu-22VkHJU-2bWRXY2/lightbox/ (Yes, even though it is cool looking, this is not my code.)

Photo credit #4: Steinar Engeland at https://unsplash.com/photos/WDf1tEzQ_SY

Photo credit #5: Markus Spiske at https://unsplash.com/photos/jUWw_NEXjDw

In a Helicopter Overlooking the Wildfire: A Data Science Perspective in A Frontline Hospital During the Covid-19 Pandemic

I worked as a data scientist at a hospital in New York City during the worst of the covid-19 pandemic. Over the spring and summer, we became overwhelmed as the city turned into (and left) the global hotspot for covid-19. I have been processing everything that happened since.

The pandemic overwhelmed the entire hospital, particularly my physician colleagues. When I met with them, I could often notice the combined effects of physical and emotional exhaustion in their eyes and voices. Many had just arrived from the ICU, where they had spent several hours fighting to keep their patients alive only to witness many of them die in front of them, and I could sense the emotional toll that was taking.

My experiences of the pandemic as a data scientist differed considerably yet were also exhausting and disturbing in their own way. I spent several months, day in and day out, researching how many of our patients were dying from the pandemic and why: trying to determine what factors contributed to their deaths and what we could do as a hospital to best keep people alive. The patient who had died the night before in front of the doctor I was meeting with became, for me, a single row in an already far-too-large data table of covid-19 fatalities.

I felt like a helicopter pilot overlooking an out-of-control wildfire.[1] In such wildfires, teams of firefighters (aka doctors) position themselves at various strategic locations on the ground to push back the fire there as best they can. They experience the flames and carnage up close and personal. My placement in the helicopter, on the other hand, removed me from ground zero, instead forcing me to see and analyze the fire in its entirety, with its sweeping and massive destruction across the whole forest. My position provided a strategic vantage point from which to determine the best ways to fight it while shielding me from the immediate destruction. Nevertheless, witnessing the vastness of the carnage from the air had its own challenges, stress, and emotional toll.

Being an anthropologist by training, I am accustomed to being “on the ground.” Anthropology is predicated on the idea that to understand a culture or phenomenon, one must understand the everyday experiences of those on the ground amidst it, and my anthropological training has instilled an instinct to go straight to those in the thick of it and talk to them.

Yet, this experience has taught me that that perception is overly simplistic: the so-called “ground” has many layers to it, especially for a complex phenomenon like a pandemic. Being in the helicopter is another way to be in the thick of it just as much as standing before the flames.

Many in the United States have made considerable and commendable efforts to support frontline health workers. Yet, as the pandemic progresses and its societal effects grow in complexity in the coming months, I think we need to broaden our understanding of where the “frontlines” are and of who counts as a “frontline worker” worthy of our support.

In actual battlefields, where the “frontline” metaphor comes from, militaries also set up layered teams to support the logistical needs of ground soldiers, and those support teams must also frequently put themselves in harm’s way in the process. The frontline of this pandemic seems no different.

I think we need to expand our conceptions of what it means to be on the frontlines accordingly. Like anthropology, modern journalism, a key source of pandemic information for many of us, can fall into the trap of overfocusing on the “worst of the worst,” potentially ignoring the broader picture and the diversity of “frontline” experiences. For example, interviewing the busiest medical caregivers in the worst affected hospitals in the most affected places in the world likely does promote viewership, but only telling those stories ignores the experiences and sacrifices of the thousands of others necessary to keep them going.

To be clear, in this blog, I do not personally care about acknowledgement of my own work, nor do I think we should ignore the contributions of these medical professional “ground troops” in any way. Rather, in the spirit of “yes and,” we should extend our understanding of “frontline workers” to acknowledge and celebrate the contributions of many other essential professionals during this crisis, such as those in transportation services, food distribution, and postal work. I related my own experiences as a data scientist because they helped me learn this, not out of any desire for recognition.

This might help us appreciate the complexity of this crisis and its social effects, and the various types of sacrifices people have been making to address it. As it is becoming increasingly clear that this pandemic is not likely to go anywhere anytime soon, appreciating the full extent of both could help us come together to buckle down and fight it.  


[1] This video helped me understand the logistics of fighting wildfires, a fascinating topic in itself: https://www.youtube.com/watch?v=EodxubsO8EI. Feel free to check it out to understand my analogy in more depth.

Photo Credit #1: ReinhardThrainer at https://pixabay.com/photos/fire-forest-helicopter-forest-fire-5457829/

Photo Credit #2: Pixabay at https://www.pexels.com/photo/backlit-breathing-apparatus-danger-dangerous-279979/

Photo Credit #3: Pixabay at https://www.pexels.com/photo/scenic-view-of-rice-paddy-247599/

How to Analyze Texts with Data Science


A friend and fellow professor, Dr. Eve Pinkser, asked me to give a guest lecture on quantitative text analysis techniques within data science for her Public Health Policy Research Methods class at the University of Illinois at Chicago on April 13th, 2020. Multiple people have asked me similar questions about how to use data science to analyze texts quantitatively, so I figured I would post my presentation for anyone interested in learning more.

It provides a basic introduction to the different approaches so that you can determine which to explore in more detail. I have found that many people who are new to data science feel paralyzed when trying to navigate the vast array of data science techniques out there and are unsure where to start.

Many of her students needed to conduct quantitative textual analysis as part of their doctoral work but struggled to determine what type of quantitative research to employ. She asked me to come in and explain the various data science and machine learning-based textual analysis techniques, since this was outside her area of expertise. The goal of the presentation was to help the PhD students in the class think through the types of quantitative text analysis techniques from data science that would be helpful for their doctoral research projects.

Hopefully, it will likewise help you determine the type or types of text analysis you might need so that you can then look those up in more detail. Textual analysis, as well as the wider field of natural language processing of which it is a part, is a quickly up-and-coming subfield within data science doing important and groundbreaking work.

Photo credit: fotografierende at https://www.pexels.com/photo/flat-lay-photography-of-an-open-book-beside-coffee-mug-3278768/