Articles

Data Visualization 102: The Most Important Rules for Making Data Tables

In a previous post about data visualization in data science and statistics, I discussed what I consider the single most important rule of graphing data. In this post, I follow up with the most important rules for making data tables. I will focus on data tables for reporting and communicating findings to others, as opposed to the many other uses of tables in data science, such as storing, organizing, and mining data.

To summarize, graphs are like sentences, conveying one clear thought to the viewer/reader. Tables, on the other hand, can function more like paragraphs, combining multiple sentences or thoughts into an overall idea. Unlike graphs, which typically offer a single thought, tables can be more exploratory, providing information for the viewer/reader to analyze and draw his or her own conclusions from.

Table Rule #1: Don’t be afraid to provide as much or as little information as you need.

Paragraphs can use multiple sentences to convey a series of thoughts/statements, and tables are no different. A table can convey multiple pieces of information that viewers/readers can look through and analyze at their own leisure, using the data to answer their own questions, so feel free to take up as much space as you need. Tables that run several pages are fair game and, in many cases, absolutely necessary (although they often end up in appendices for readers/viewers who want a more in-depth take).

In my previous data visualization post, I gave this bar chart as an example of trying to say too many statements for a graph:

This is a paragraph’s worth of information, and a table would represent it much better.[i] With a table, the reader/viewer can explore the values by country and year and answer whatever questions he or she might have. For example, someone who wanted to analyze how a specific country changed over time could do so easily with a table, and someone who wanted to compare the immigration ratios between countries within a specific decade could do that as well. In the graph above, each country’s subsegment starts at a different vertical position in each decade’s column, making the sizes hard to compare visually, and since each decade contains dozens of values, the latter analysis is visually difficult to decipher as well.
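To make this concrete, here is a minimal pandas sketch of how such a table could be built from country-by-decade records. The numbers below are made up purely for illustration, since I do not have the original chart’s data (see the footnote below):

```python
import pandas as pd

# Hypothetical immigration records in "long" format: one row per country-decade pair.
df = pd.DataFrame({
    "country": ["Ireland", "Ireland", "Germany", "Germany", "Italy", "Italy"],
    "decade":  [1880, 1890, 1880, 1890, 1880, 1890],
    "share":   [0.15, 0.11, 0.28, 0.20, 0.05, 0.16],
})

# Pivot into one row per country and one column per decade, so a reader can
# scan a single country across time or compare countries within a decade.
table = df.pivot_table(index="country", columns="decade", values="share")
print(table)
```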

But, at the same time, do not be afraid to put a mere sentence’s or graph’s worth of data into a table, especially when that data is central to what you are saying. Writers sometimes use a one-sentence paragraph when a single thought is crucial, and likewise, a single-statement table can have a similar effect. For example, giving a single variable its own table helps convey that the variable is important:

Gender | Some Crucial Result
Male   | 36%
Female | 84%

Now, sometimes in these single statement instances, you might want to use a graph instead of a table (or both), which I discuss in more detail in Rule #3.

Table Rule #2: Keep columns consistent for easy scanning.

I have found that when viewers/readers scan tables, they generally, subconsciously, assume that all values in a column are alike: same units and same type of value. Changing what a column means from row to row can throw off your viewer/reader. For example, consider this made-up study data:

                 | Control Group (n = 100) | Experimental Group (n = 100)
Mean Age         | 45                      | 44
Median Age       | 43                      | 42
Male No. (%)     | 45 (45%)                | 36 (36%)
Female No. (%)   | 55 (55%)                | 64 (64%)

In this table, each row holds a different type of value and/or unit. Going down the control column, the first row is mean age, measured in years. The second row switches to median age, a different type of value from the mean (although in the same unit of years). The final two rows convey the number and percentage of males and females in each group: both a different type of value and different units (counts and percents rather than years). This can be jarring for viewers/readers, who often expect a column to hold the same kind of value in the same unit and naturally compare entries as if it did.

I would recommend transposing it like this, so that the columns hold similar variables and the rows represent the two groups:

                             | Mean Age | Median Age (IQR) | Male No. (%) | Female No. (%)
Control Group (n = 100)      | 45       | 43 (25, 65)      | 45 (45%)     | 55 (55%)
Experimental Group (n = 100) | 44       | 42 (27, 63)      | 36 (36%)     | 64 (64%)
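If you build your tables in pandas, the transposition itself is a one-liner. Here is a minimal sketch, assuming the summary statistics have already been computed and formatted as strings:

```python
import pandas as pd

# The original layout: one column per group, with mixed value types down the rows.
summary = pd.DataFrame(
    {
        "Control Group (n = 100)": ["45", "43 (25, 65)", "45 (45%)", "55 (55%)"],
        "Experimental Group (n = 100)": ["44", "42 (27, 63)", "36 (36%)", "64 (64%)"],
    },
    index=["Mean Age", "Median Age (IQR)", "Male No. (%)", "Female No. (%)"],
)

# Transposing swaps rows and columns: each column now holds one consistent
# kind of value, and each row describes one group.
print(summary.T)
```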

Table Rule #3: Don’t be afraid to also use a graph to convey magnitude, proportion, or scale.

A table like the gender table in Rule #1 conveys pertinent information numerically, but numbers themselves do not visually show the difference between the values.

Gender | Some Crucial Result
Male   | 36%
Female | 84%

Graphs excel at visually depicting the magnitude, proportion, and/or scale of data. If, in this example, it is important to convey how much greater the “Some Crucial Result” is for females than for males, then a basic bar graph lets the reader/viewer see that the percentage for females is more than double that for males.

Now, to gain this visual clarity, the graph gives up the ability to relate the exact numbers precisely. For example, looking only at this graph, a reader/viewer might be unsure whether the males are at 36%, 37%, or 38%. People have developed many graphing strategies to deal with this (making the grid lines sharper or writing the exact numbers on top of, next to, or around each segment, among others), but in instances where one needs both the exact numbers and a sense of their magnitude, proportion, or scale, combining the graph and the table can also work well.
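For instance, here is a minimal matplotlib sketch of the “write the exact numbers on the bars” strategy, using the made-up percentages from the gender table above:

```python
import matplotlib.pyplot as plt

# Made-up values from the gender table above.
groups = ["Male", "Female"]
values = [36, 84]

fig, ax = plt.subplots()
bars = ax.bar(groups, values)
ax.set_ylabel("Some Crucial Result (%)")

# Label each bar with its exact value, so the graph conveys both
# the visual magnitude and the precise numbers.
ax.bar_label(bars, fmt="%d%%")
plt.show()
```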

Finally, given that tables can convey multiple statements, feel free to use several graphs to depict the magnitude, proportion, or scale of a single table. Do not try to overload a multi-statement table into one incomprehensible graph. Break down each statement you are trying to relate with that table and depict each in its own graph.

Conclusion

If graphs are sentences, then tables can function more like paragraphs, conveying a larger amount of information that makes up more than one thought or statement. This gives your reader/viewer space to explore the data and interpret it on their own, answering whatever questions they have.

Photo/Table credit #1: Mika Baumeister at https://unsplash.com/photos/Wpnoqo2plFA

Photo/Table credit #2: Linux Screenshots at https://www.flickr.com/photos/xmodulo/23635690633/


[i] Unfortunately, I do not have the data myself that this chart uses, or I would make a table for it to show what I mean.

The Best Programming Languages for Data Science and Machine Learning

[Photo: woman coding on computer]

Newcomers to data science or artificial intelligence frequently ask me the best programming language to learn for building machine learning algorithms, so I wrote this article as a reference for anyone who wants the answer to that question. These are what I consider the four most important languages, ranked in terms of usefulness based on both overall popularity within the data science community and my own personal experience:

Best Programming Languages for Machine Learning:
#1 Choice: Python
#2 Choice: R
#3 Choice: Java
#4 Choice: C/C++

#1 Programming Language: Python

Python is the most popular language for machine learning, for three good reasons.

First, its package-based style allows you to utilize efficient machine learning and statistical packages that others have made, preventing you from having to constantly reinvent the wheel for common problems. Many if not most of the best packages (like NumPy, pandas, and scikit-learn) are in Python. This almost allows you to “cheat” when programming machine learning algorithms.
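As a minimal sketch of what that “cheating” looks like, a few lines of scikit-learn (using one of its bundled demo datasets) give you a trained and evaluated model without implementing any algorithm yourself:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a bundled demo dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Someone else already wrote (and optimized) the algorithm; we just call it.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```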

Second, Python is a powerful and flexible all-purpose language, so if you are building a machine learning algorithm, you can also build the code for the wider product or system that will use the algorithm without switching languages or software. Python supports object-oriented, functional, and procedure-oriented programming styles, giving you the flexibility to use whatever style, or combination of styles, you like best or that fits the specific context.

Third, unlike a language like Java or C++, Python does not require elaborate setup to run a single line of code. You can build out a full coding infrastructure if you need to, but if you only need to run a simple command or test, you can start immediately.
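For example, a throwaway check like this runs as-is, with nothing but the standard library and no class or main-method scaffolding:

```python
# A quick ad-hoc calculation: no project setup, no boilerplate.
import statistics

readings = [4.2, 5.1, 3.8, 4.9]
print(statistics.mean(readings), statistics.stdev(readings))
```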

When I program in Python, I personally love using Jupyter Notebook, since its interface allows me both to code and to easily present my code and findings as a report or document. Another data scientist can read and analyze my code and its output side by side. I wish more data scientists published their papers and reports in Jupyter Notebook or other notebooks like it for this reason.

If you have time to learn a single programming language for machine learning, I would strongly recommend it be Python. The next three, R, Java, and C/C++, do not match its ease and popularity within data science.

#2 Programming Language: R

R is a popular language among statisticians, specifically tailored for advanced statistical analysis. It includes many well-developed packages for machine learning but is not as popular with data scientists as Python. For example, in Towards Data Science’s survey, 57% of data scientists reported using Python, with 33% prioritizing it, while only 31% reported using R, with 17% prioritizing it. This suggests that R is a complementary, not primary, language for data science and machine learning. Most R packages have their equivalent in Python (and, to some extent, the other way around). Unlike Python, an all-purpose language that can do wonders beyond analyzing data and developing machine learning algorithms, R is specifically tailored to statistics and data analysis and cannot do much beyond that. That said, R programmers are developing more and more packages for it, steadily expanding what it can do.

[Photo: source code screenshot]

#3 Programming Language: Java

Java was once the most popular language around, but Python has dethroned it in the last few years. As an avid Java programmer who programs in Java for fun, it breaks my heart to put it so far down the list, but Python is clearly the better language for data science and machine learning. If you are working in an organization or other context that still uses Java for part or all of its software infrastructure, you may be stuck using it, but most recent developments, particularly in machine learning, have occurred in Python and R (and a few other languages). Thus, if you use Java, you will frequently find yourself unnecessarily reinventing the wheel.

Plus, one major con of Java is that quick, on-the-go analysis is not possible, since you must write a whole class structure before you can run a single line of code. Java remains popular in certain contexts where the surrounding applications/software that utilize the machine learning algorithms are in Java, which is common in finance, front-end development, and companies with long-standing Java-based software.

#4 Programming Language: C/C++

The same Towards Data Science survey I mentioned above lists C/C++ as the second most popular data science and machine learning languages after Python. Java follows them closely, yet I ranked Java third instead of C/C++ because I personally find Java to be a better overall language than C or C++. In C or C++, you may frequently find yourself reinventing the wheel – developing machine learning algorithms that others have already built in Python – but in some backend systems built in C or C++, as in engineering and electronics, you do not have much of an option. C++ also shares Java’s problem of not allowing quick, on-the-go coding without building a whole infrastructure first.

Conclusion

For a beginner to the data science scene, learning a single programming language is the most helpful way to enter the field. Use learning a programming language to assess whether data science is for you: if you struggle with and do not like programming, then developing machine learning algorithms for a living is probably not a good fit for you.

Many groups are trying to develop software that enables machine learning without programming: DataRobot, Auto-WEKA, RapidMiner, BigML, and AutoML, among many others. The pros and cons and successes and failures of these tools warrant a separate blog post (one I intend to write eventually). As of now, though, they have not replaced programming languages, either in the practical ability to develop complex machine learning algorithms or in demonstrating that you have the technical computational/programming skills for the field.

Depending on where you work and the type of field or tasks you take on, you might end up using the language(s) or software your team works with so that you can easily collaborate on projects, and some areas of work may prefer certain packages and languages. But if you demonstrate that you already know a complex programming language like Python (or Java or C++), even if it is not your would-be team’s preferred language, you will likely convince any hiring manager that you can learn their specific language or software.

Photo credit #1: ThisIsEngineering at https://www.pexels.com/photo/woman-coding-on-computer-3861958/

Photo credit #2: Hitesh Choudhary at https://unsplash.com/photos/D9Zow2REm8U

Photo credit #3: thekirbster at https://www.flickr.com/photos/kirbyurner/30491542972/in/photolist-MQRUEh-2g3E1wf-Nsr8q9-HDKJxu-22VkHJU-2bWRXY2/lightbox/ (Yes, even though it is cool looking, this is not my code.)

Photo credit #4: Steinar Engeland at https://unsplash.com/photos/WDf1tEzQ_SY

Photo credit #5: Markus Spiske at https://unsplash.com/photos/jUWw_NEXjDw

In a Helicopter Overlooking the Wildfire: A Data Science Perspective in a Frontline Hospital During the Covid-19 Pandemic

I worked as a data scientist at a hospital in New York City during the worst of the covid-19 pandemic. Over the spring and summer, we became overwhelmed as the city turned into (and then ceased to be) the global hotspot for covid-19. I have been processing everything that happened ever since.

The pandemic overwhelmed the entire hospital, particularly my physician colleagues. When I met with them, I could often see the combined effects of physical and emotional exhaustion in their eyes and voices. Many had just arrived from the ICU, where they had spent several hours fighting to keep their patients alive only to watch many of them die, and I could sense the emotional toll that was taking.

My experience of the pandemic as a data scientist differed considerably yet was also exhausting and disturbing in its own way. I spent several months, day in and day out, researching how many of our patients were dying from the pandemic and why: trying to determine what factors contributed to their deaths and what we could do as a hospital to best keep people alive. The patient who died the night before in front of the doctor I was meeting with became, for me, a single row in an already way-too-large data table of covid-19 fatalities.

I felt like a helicopter pilot overlooking an out-of-control wildfire.[1] In such wildfires, teams of firefighters (aka doctors) position themselves at strategic locations on the ground to push back the fire as best they can. They experience the flames and carnage up close and personal. My placement in the helicopter, on the other hand, removed me from ground zero, forcing me to see and analyze the fire in its entirety, its sweeping and massive destruction across the whole forest. The air offered a strategic vantage point for determining the best ways to fight the fire and shielded me from the immediate destruction. Nevertheless, witnessing the vastness of the carnage from the air had its own challenges, stress, and emotional toll.

As an anthropologist by training, I am accustomed to being “on the ground.” Anthropology is predicated on the idea that to understand a culture or phenomenon, one must understand the everyday experiences of those on the ground amidst it, and my anthropological training has instilled an instinct to go straight to those in the thick of it and talk with them.

Yet, this experience has taught me that that perception is overly simplistic: the so-called “ground” has many layers to it, especially for a complex phenomenon like a pandemic. Being in the helicopter is another way to be in the thick of it just as much as standing before the flames.

Many in the United States have made considerable and commendable efforts to support frontline health workers. Yet, as the pandemic progresses and its societal effects grow in complexity in the coming months, I think we need to broaden our understanding of where the “frontlines” are and who counts as a “frontline worker” worthy of our support.

On the actual battlefields from which the “frontline” metaphor comes, militaries set up layered teams to support the logistical needs of ground soldiers, teams that must also frequently put themselves in harm’s way in the process. The frontline of this pandemic seems no different.

I think we need to expand our conception of what it means to be on the frontlines accordingly. Like anthropology, modern journalism, a key source of pandemic information for many of us, can fall into the trap of overfocusing on the “worst of the worst,” potentially ignoring the broader picture and the diversity of “frontline” experiences. For example, interviewing the busiest medical caregivers in the worst-affected hospitals in the most-affected places in the world likely promotes viewership, but telling only those stories ignores the experiences and sacrifices of the thousands of others necessary to keep them going.

To be clear, in this blog I am not seeking acknowledgement of my own work, nor do I think we should ignore the contributions of these medical professional “ground troops” in any way. Rather, in the spirit of “yes, and,” we should extend our understanding of “frontline workers” to acknowledge and celebrate the contributions of many other essential professionals during this crisis: transportation services, food distribution, postal workers, and so on. I related my own experiences as a data scientist because they helped me learn this, not out of any desire for recognition.

This might help us appreciate the complexity of this crisis and its social effects, and the various types of sacrifices people have been making to address it. As it is becoming increasingly clear that this pandemic is not likely to go anywhere anytime soon, appreciating the full extent of both could help us come together to buckle down and fight it.  


[1] This video helped me understand the logistics of fighting wildfires, a fascinating topic in itself: https://www.youtube.com/watch?v=EodxubsO8EI. Feel free to check it out to understand my analogy in more depth.

Photo Credit #1: ReinhardThrainer at https://pixabay.com/photos/fire-forest-helicopter-forest-fire-5457829/

Photo Credit #2: Pixabay at https://www.pexels.com/photo/backlit-breathing-apparatus-danger-dangerous-279979/

Photo Credit #3: Pixabay at https://www.pexels.com/photo/scenic-view-of-rice-paddy-247599/

How to Analyze Texts with Data Science

[Photo: flat lay photography of an open book beside coffee mug]

A friend and fellow professor, Dr. Eve Pinkser, asked me to give a guest lecture on quantitative text analysis techniques within data science for her Public Health Policy Research Methods class at the University of Illinois at Chicago on April 13th, 2020. Multiple people have asked me similar questions about how to use data science to analyze texts quantitatively, so I figured I would post my presentation for anyone interested in learning more.

It provides a basic introduction to the different approaches so that you can determine which to explore in more detail. I have found that many people who are new to data science feel paralyzed when trying to navigate the vast array of data science techniques out there, unsure where to start.

Many of her students needed to conduct quantitative textual analysis as part of their doctoral work but struggled to determine what type of quantitative research to employ. She asked me to come in and explain the various data science and machine learning-based textual analysis techniques, since this was outside her area of expertise. The goal of the presentation was to help the PhD students in the class think through which data science text analysis techniques would be helpful for their doctoral research projects.

Hopefully, it will likewise allow you to determine the type or types of text analysis you might need so that you can then look them up in more detail. Textual analysis, along with the wider field of natural language processing of which it is a part, is an up-and-coming subfield within data science doing important and groundbreaking work.
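To give a flavor of where such work often starts, here is a minimal sketch (not from the lecture itself) of one of the most common first steps in quantitative text analysis: turning raw documents into TF-IDF vectors that downstream models can work with. The toy documents are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few toy documents standing in for a real corpus.
docs = [
    "Public health policy shapes community outcomes.",
    "Policy documents can be analyzed quantitatively.",
    "Community interviews produce rich text data.",
]

# Weight each word by how distinctive it is to each document (TF-IDF),
# dropping common English stop words along the way.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

print(vectorizer.get_feature_names_out())
print(tfidf.shape)
```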

Photo credit: fotografierende at https://www.pexels.com/photo/flat-lay-photography-of-an-open-book-beside-coffee-mug-3278768/

Three Situations When Ethnography Is Useful in a Professional Setting

This is a follow-up to my previous article, “What Is Ethnography,” outlining ways ethnography is useful in professional settings.

To recap, I defined ethnography as a research approach that seeks “to understand the lived experiences of a particular culture, setting, group, or other context by some combination of being with those in that context (also called participant-observation), interviewing or talking with them, and analyzing what is produced in that context.”

Ethnography is a powerful tool, developed by anthropologists and other social scientists over the course of several decades. Here are three types of situations in professional settings where I have found ethnography to be especially powerful:

1. To see the given product and/or people in action
2. When brainstorming about a design
3. To understand how people navigate complex, patchwork processes

Situation #1: To See the Given Product and/or People in Action

Ethnography allows you to witness people in action: using your product or service, engaging in the type of activity you are interested in, or navigating whatever other situation you want to study.

Many other social science research methods involve creating an artificial environment in which to observe how participants act or think. Focus groups, for example, assemble potential customers or users into a room, forming a synthetic space to discuss the product or service in question, and in many experimental settings, researchers create a simulated environment to control for and analyze the variables or factors they are interested in.

Ethnography, on the other hand, centers around observing and understanding how people navigate real-world settings. Through it, you can get a sense for how people conduct the activity for which you are designing a product or service and/or how people actually use your product or service.

For example, if you want to understand how people use GPS apps to get around, you can see how people use the app “in the wild”: when rushing through heavy traffic to get to a meeting or while lost in the middle of who knows where. Instead of hearing their processed thoughts in a focus group setting or trying to simulate the environment, you can witness the tumultuousness yourself and develop a sense for how to build a product that helps people in those exact situations.

Situation #2: When Brainstorming about a New Product Design

Ethnography is especially useful during the early stages of designing a product or service, or during a major redesign. Ethnography helps you scope out the needs of your potential customers and how they approach meeting said needs. Thus, it helps you determine how to build a product or service that addresses those needs in a way that would make sense for your users.

During these initial stages of product design, ethnography helps determine the questions you should be asking. Many have a tendency during these stages to construct designs based on their own perception of people’s needs and desires and miss what customers or users do in fact need and desire. Through ethnography, you ground your strategy in the customers’ own mindsets and experiences.

The brainstorming stages of product development also require a lot of flexibility and adaptability: as one determines what the product or service should become, one must be open to multiple potential avenues. Ethnography is a powerful tool for navigating such ambiguity. It centers you on the users, their experiences and mindsets, and the context in which they might use the product or service, providing tools to ask open-ended questions and to generate new and helpful ideas for what to build.

Situation #3: To Understand How People Navigate Complex, Patchwork Processes

At a past company, I analyzed how customer service representatives used various software systems when talking with customers. Over the years, the company had designed and bought various software programs, each performing a set of functions with unique abilities, limitations, and quirks. Over time, this created a complex web of interlocking apps, databases, and interfaces, which customer service representatives had to navigate while monitoring customers’ accounts. Other employees described the whole scene as the “Wild West”: each customer service representative had to invent his or her own way of using these software systems while on the phone with a (in many cases disgruntled) customer.

Many companies end up with such patchwork systems – whether of software, departments or teams, physical infrastructure, or something else entirely – built by stacking several iterations of development over time until they become a hydra of complexity that employees must figure out how to navigate to get their work done.

Ethnography is a powerful tool for making sense of such processes. Instead of relying on official policies for how various actions and procedures are supposed to work, ethnography helps you understand and make sense of the unofficial and informal strategies people use to do what they need. Through this, you can get a sense for how the patchwork system really works, which is necessary for developing ways to improve or build upon such patchwork processes.

In the customer service research project, my task was to develop strategies to improve the technology customer service representatives used as they talked with customers. Seeing how representatives used the software through ethnographic research helped me understand and focus the analysis on their day-to-day needs and struggles.

Conclusion

Ethnography is a powerful tool, and the business world and other professional settings have increasingly realized this (cf. this and this). I have provided three circumstances where I have personally found ethnography to be invaluable. Ethnography allows you to experience what is happening on the ground and, through that, to shape and inform the research questions you ask and the recommendations or products you build for people in those contexts.

Photo credit #1: DariusSankowski at https://pixabay.com/photos/navigation-car-drive-road-gps-1048294/

Photo credit #2: AbsolutVision at https://unsplash.com/photos/82TpEld0_e4

Photo credit #3: Tony Wan at https://unsplash.com/photos/NSXmh14ccRU

The Five Most Common Data Science Interview Questions and How to Prepare for Them

Interviewing for a data science role can be a daunting task, especially for those new to the field. I have lost count of the number of data science interviews I have had over the years, but here are the five most common questions I have encountered and strategies for preparing for each. Prepping for these questions is a great opportunity to develop your story thesis, the most important part of any data science interview.

Most Common Data Science Questions:
1) Tell me about yourself.
2) Describe a data science project you have worked on.
3) What kind of experience do you have with messy data?
4) What programming languages and software have you used?
5) What are you looking for in a job?

Question 1: Tell me about yourself.

This is probably hands down the most common interview question across all industries and fields, not just data science, so the fact that it is also the most commonly asked question in data science interviews may not seem that surprising. A good answer is crucial for establishing a favorable first impression and for laying out the main story, or thesis, of who you are, which you will come back to throughout the interview.

In data science interviews, I emphasize my passion for using data science tools to help organizations solve complex problems that were previously vexing. If you are unsure what your thesis is, I designed this activity to help people decipher it. Here is an example of how I would describe myself:

“I fell in love with data science because I enjoy helping organizations solve complex problems. In my past roles, I have used my combined data science and social science skills to explore and build solutions for complicated problems for which the typical ways of doing things within the organization have not worked. I am energized by the intellectual stimulation of breaking down complex problems and using data science to develop innovative yet useful solutions. What kind of problems do you guys have that have led you to need a data scientist like me?”

Your self-description should tell the story of who you are in a way that demonstrates how you would be a natural fit for the role and helpful to the organization. This is your interview thesis: if you lay it out well, every other question you answer will simply involve fleshing out one (or a combination) of the three basic parts of your self-story: 1) who you are, 2) how your identity makes you a natural fit for the role, and 3) how this would benefit the organization.

Here are four important observations about how I told my story:

  1. I emphasized who I was – an innovator developing unique solutions to complex problems – while showing how my innovator identity naturally connects with data science and could be helpful for the organization. You might not consider yourself an “innovator” per se, but the trick is to figure out who you are based on what energizes and impassions you and then show how performing the data science role you are applying for is a natural fit for who you are.
  2. I told the story with normal words, not technical jargon. I have found that many, if not most, of my interviews, especially the first-round interviews, are with employees without technical expertise, and since you often do not know the level of technical expertise of the interviewer, it is better to err on the non-technical side.  
  3. I kept my story positive, only mentioning what I like to do. Sometimes people instinctively try to illustrate what they want by describing things they do not like to do: e.g., “At my last job, I learned I do not like doing Y, so I am seeking to do X instead” or “I am doing Y, and I hate it. I want out.” I would describe these aspects of my story later if the interviewer asks, but I would stick with the positive at first, only mentioning what I want to do.
  4. I used strong, subjective, even emotional phrases like “fell in love with,” “passionate about,” and “energized by.” At first glance, these phrases might seem overly informal, but I have found they help interviewers remember me. Do not overdo it, but being more vivid and personable generally helps rather than hurts your interview chances for data science positions.

Question 2: Describe a data science project you have worked on.

This is the second most common question I have encountered, so make sure you come prepared with an exemplar project to showcase. Interviewers may ask you a lot of questions about your project, so I would recommend choosing a project that you did an amazing job on, really knocked out of the park, and are proud of. Unless there are disclosure issues, post your work on GitHub, a blog, LinkedIn, or somewhere else online, and include a link to it in your job application.

How you explain the project will vary considerably depending on your interviewer’s degree of expertise. I generally start with a non-technical, high-level explanation and provide the technical details if the interviewer(s) prompts me with follow-up questions. This gives the interviewers the freedom to choose the level of technicality they would like in their follow-up. A data scientist interviewer worth his or her salt will quickly steer the conversation into the more technical aspects of your project that he or she wants to learn about, but even then, starting non-technical demonstrates that you know how to communicate your work effectively to non-technical audiences.

When describing your project, you are effectively telling the story of the project, and most project stories have the following core components:

  • Who: You are probably the story’s protagonist (it is your interview after all, so naturally pick a project or part of a project where you were the primary driver), but there are likely multiple important side characters that you will need to set up, like who commissioned the project, who it was for, who the data was about, and so on.
  • What: The problem, need, or question your project sought to address generally forms the “conflict” of the project story, so be sure to explain what led to the problem, need, or question (in stories, called the inciting incident).
  • When and Where: The timeframe and setting/context in which the project took place (e.g., the organization you were working with or the class for which you did the project). How long you had to complete the project can also be important to establish.
  • How: What you did to solve the problem. If you tried several approaches before discovering what worked, the how includes both your methodological story and your final solution (this is the rising and falling action of how you brought the project home). This is the meat of your story. You will want technical and non-technical descriptions of the how:
    • Technical How: Generally, the two core parts of a technical description are the model you used (and any others you tried, if applicable) and how you determined the features/variables you selected. Another important part might be how you cleaned and/or gathered the data.
    • Non-technical How: I have found that non-technical audiences usually do not glean much from either the model I ended up using or my feature selection procedure. Instead, I explain what type of functionality I ensured the model had to solve the problem I had just set up: for example, “I built a model that calculated the probability of X phenomenon based on data sources A, B, and C, testing various types of models to determine which would do this best, and then discerned which variables among those datasets were the best to use.” For a non-technical audience, that is generally enough. The core components for them are what goes into the model (the data), what result the model produces from it, and how that informs the problem, need, or question driving the project.

Finally, in your how explanation, make sure you slip in whatever programming languages and software you used: Python, R, SQL, Azure, etc.

  • Why: This is your explanation of why you chose the approach(es) you did for your how. Just like with the how, you will need technical and non-technical explanations of the why.

Make sure your non-technical explanation of the why aligns with your non-technical how. I commonly see data scientists go over a non-technical individual’s head by providing a technical why for a non-technical how. In particular, I would not explain the metric or criteria used to compare models or to decide the feature selection procedure in my non-technical explanation, since these will likely lose a non-technical person. If my non-technical how description focused on what data the model used and what it did with it, then my non-technical why focuses on why building a model to do that mattered and how it helped others and/or myself in the real world.

  • What happened: This is the result of the project. Did you succeed or fail (or land somewhere in between)? Was it useful for whoever you built it for? Were you able to conduct any follow-up analysis after deployment? Maybe most importantly, what did you learn from the experience? In narrative terminology, this is the resolution. The more you can quantitatively measure the outcomes, the better.

These are the basic components of a project story. Here is the most common project I use, and when reading through it, feel free to analyze how I present each component of the story. I wrote this blog for a general audience, so I provided my non-technical how and why.

Question 3: What kind of experience do you have with messy data?

Interviewers ask me this question surprisingly frequently. They usually preface it by explaining that their organization has a lot of messy data that its future data scientist will need to clean and process. This is a great opportunity to showcase your comfort with data science and its common issues.

I typically answer with something like this:

“Yes, I have had to organize and clean messy data all the time. That’s par for the course in data science: the running joke among data scientists is that 90% of any data science project is data cleaning, and 10% actually doing anything with it. At least you guys are honest about the fact that your data is messy. When I worked as a consultant, for example, I talked with many organizations about potential data science projects, and if they said their data was clean and ready to go, chances are they were lying either to themselves or to me about how messy and haphazard their data really was. The fact that you are upfront about the messiness of your data tells me that you guys as an organization are realistically assessing where you are and what you need.”

This answer not only establishes that I have handled messy data before but also normalizes the problem as one an expert (like myself) can resolve, and it compliments them for being up front. Answering this question confidently and positively has put me at the top of the list as the front-runner candidate in some interviews. Giving a good answer to it is a perfect opportunity to endear yourself to your interviewer.

Question 4: What programming languages and/or software have you used?

Even though a technical interviewer might ask this as well, I have encountered this question most frequently among non-technical interviewers. In my experience, fellow data scientist interviewers have more insider ways of deciphering whether you do in fact know data science, but for non-technical interviewers, this question is their initial way to probe that. Sometimes they will cling to a laundry list of software and/or languages to determine whether you are qualified.

Now, I believe that having experience with the exact combination of software that the data science team you would be joining uses is generally not that important a criterion for job success. For a good data scientist, learning another software system or programming language once you already know several is not that difficult a task. But the question is completely natural and reasonable from their side, so you will have to answer it.

If they ask open-endedly what software and languages you have used, walk through the ones you know, perhaps starting with those you use most often. I generally start by mentioning Python, since it is not only my favorite language for data science (see this article) but also conveys that I am familiar with programming in general.

More often, though, they might ask whether you have used X software before, often going down a list they have in front of them. I would never recommend lying by claiming experience with a software you have never used, but I would recommend recasting a “No” by naming an equivalent software that you have worked with. Here is an example:

“No, I have not used Julia, but that is because I prefer using Python for what others might use Julia for. Python is an equivalently powerful programming language, and the data science teams I worked on happened to prefer it over Julia.”

This not only casts the “No” in a more positive light, but it also shows that you are familiar with the software he or she just mentioned and confident that you can match your would-be team.

Question 5: What are you looking for in a job?

Most often, this is the last major question interviewers ask me, but I have gotten it at the beginning as well. They probably save it until the end because it transitions easily into the next part of the interview: either them describing the role or you asking whatever questions you have.

If you did a good job laying out your thesis story in the first question, then here you simply restate it from a different angle. You already laid the groundwork; you are just bringing it home at this point. If they ask me this at the beginning of the interview, before the “Tell me about yourself” question, then I use it to tell my thesis story from this angle first.

Here is my typical answer:

“Like I said, I am energized by figuring out how to help organizations solve complex data science problems. Over the years, I have found that two concrete things in an organization help me with this. First, I thrive in stimulating work environments where I am given the space and resources to think creatively through problems. Second, I also need to be able to work with people from a variety of backgrounds and disciplines from whom I can learn and develop innovative approaches to the problem at hand. You guys seem to provide both. [I then conclude by explaining why they seem to provide both, based on what I learned about the organization during the interview, or, if we have not had a chance to talk about this yet, I ask about it.]”

Notice that the first sentence references my answer to the “Tell me about yourself” question. If they ask this question before I have given that spiel, I spend about 30 seconds to a minute providing a condensed self-introduction and then continue with the rest of the answer.

Conclusion

These are the five most common data science interview questions I have encountered and how to prepare for them. I have found that when data scientists give advice on how to prepare for job interviews, they often focus on preparing for highly technical, factual questions (e.g., here and here). Even though having a solid data science foundation can be important, refining your overall story thesis – who you are, what you are passionate about doing, and how that relates to this job – is far more important to advance through the interview process.

I have found that humans, even supposedly “nerdy” data scientists, tend to connect with people and stories, so if you can hook them there, they generally remember you better and are more likely to hire you. When you have a compelling story, every other question will naturally fall into place as an intuitive further clarification of that overall story.  

Photo Credit #1: Work With Island at https://unsplash.com/photos/FX2QA0TMEYg

Photo Credit #2: Free-Photos at https://pixabay.com/photos/glasses-reading-glasses-spectacles-1246611/

Photo Credit #3: geralt at https://pixabay.com/illustrations/questions-font-who-what-how-why-2245264/

Photo Credit #4: Darwin Vegher at https://unsplash.com/photos/W_ZYCEUapF0

Photo Credit #5: geralt at https://pixabay.com/illustrations/software-program-cd-dvd-disc-pack-417880/

Photo Credit #6: jenoliver777 at https://pixabay.com/photos/horses-dogs-groundwork-blaze-2888749/

Anthropologist in I.T. (Comic, Funny)

Here’s a fun little comic about some of my experiences working as an anthropologist in I.T. It’s actually a blast.

I wrote this comic for the University of Memphis Anthropology Department, which featured it in its Fall 2018 newsletter.

Thank you, Rusty Haner, for illustrating the panels.

Data Science Interview Comic

Here’s a comic I wrote about the travails of interviewing as a data scientist. Thank you, Emi Harry, for producing the panels.

The Job Hunt (Part 1): Introduction

For the last few months, I have been considering a mini-blog series on the job hunt, and (at the time I am writing this) the economic downturn resulting from the coronavirus pandemic has made a discussion on finding a job even more relevant.

In this mini-blog series, I will focus on the following topic areas in the job hunt:

  1. Job Hunt Preparations: Self-care, goal setting, daily habits, vocational reflection, etc.
  2. Learning about potential opportunities: networking, job searching, etc.
  3. Marketing yourself for employers: writing a resume/CV and cover letter, building a portfolio, etc.
  4. Developing your own skill sets: Navigating whether to develop your skills and which resources to use

This blog may be useful to you if you are in the following situations:

  1. Recently left or about to leave a job (for whatever reason) and are looking for another
  2. Have been discontented at your current job (again, for whatever reason) and have decided to look elsewhere
  3. Recently graduated from school or some other kind of training and are seeking to enter the workforce
  4. Finding gigs is a regular part of your work

(Final Note: Even though my blog focuses on the integration of data science and anthropology, I intend this mini-series’ advice for just about any industry. Data science and anthropology are where I have the most experience, though, so I might implicitly be biased toward strategies that work in those fields.)

Photo/Graph credit: MIH83 at https://pixabay.com/illustrations/job-application-job-search-1344744/

The Job Hunt (Part 2): Preparing Yourself

[Photo: woman holding black flag]

This is the first full post in my Job Hunt mini-series. When embarking on finding a new job, preparing yourself is the most important first step, so this first set of posts will focus on the initial work necessary to launch yourself forward.

Prepare yourself physically, financially, and logistically, but more importantly mentally and emotionally, for what you are about to undertake. The job hunt is often an adventure, so readying yourself is crucial.

Here are three basic ways to prepare yourself:

Give yourself time to process what you’re leaving/left.
Take stock mentally, emotionally, and materially for the long haul.
Be patient and know what you can control.

Give yourself time to process what you’re leaving/left.

Frequently, when people are looking for a job, they have recently ended or would like to end some prior situation: maybe something happened causing them to resign or be let go, or they are stressed for whatever reason at their current position and thus seeking something else. In such situations, make sure you explicitly take time to process and heal from whatever you may be coming out of.

How to do so might depend on both who you are and what you have encountered. Maybe you need to replenish yourself from burnout, emotionally and/or mentally process what happened, or reassess who you are. This will take time, and the fact that it does is in no way a negative reflection on you.

[Photo: balance and meditation]

Be conscientious about developing meaningful practices that will rejuvenate you and help you process what happened. Journal, take up a hobby, talk with friends or family, or do whatever helps you. After a day of exertion, our bodies physically need to sleep at night to rebuild the muscle tissue for a new day of adventures. Our emotional and mental faculties often work similarly: taking the time to slow down and process what happened will allow you to move forward in pursuit of your next occupational adventure.

Take stock for the long haul.

Be prepared mentally, emotionally, and materially for the long haul. I frequently hear that it takes six months on average to find a job. At the time of writing, the economy is bad, and it could take longer. Materially and financially, take stock of the resources you have and plan for how you can get by for a while.

Prepare yourself mentally and emotionally for a long trek to the next job. Try to resist the urge to appease yourself with the potentially false promise of a quick turnaround, and ignore any swindlers trying to sell you the same.

Telling yourself that it could take months of grueling work to find a job will help you in the long run. It’ll be much easier on you to have a shorter-than-expected job hunt than to have your high hopes for a quick out crushed.

The job search is (almost always) a long and arduous process. Be ready for that.

To me personally, the job hunt feels like a tunnel: you hope/sense that there is a light at the end of it, when you will find the next gig, but you do not know when it will come. That amazing job offer could come tomorrow or several months from now.


There is generally light at the end of the tunnel, but that doesn’t mean the experience isn’t difficult. Having unrealistic expectations will only make the dark times ahead feel all the darker.

Be patient and know what you can control.

The ancient Stoic philosophers emphasized not holding yourself accountable for what is beyond your control, and I have found that the job hunt can necessitate its own version of stoicism. You can do a lot to better your application, find the right job, and connect with the right people, and these efforts are important.

But there is always much about the process that you cannot control. You cannot fully control where employers on the other side are coming from and what decisions they make: how and where they look for candidates, what they think of you and whether they value you, or even whether organizations/companies have an open position in the first place.

One pro of online applications is that we are more equipped than ever to apply to positions around the world, but one con is that thousands of applications go unread. Job prospects can come and go based on wider structural, societal factors – like a failing economy – or the successes and failures of a specific organization. You can do everything right in an application and still fail for reasons outside of your control.


You can and should strive to make your application as strong as possible: both presenting yourself in the best possible light and searching within your means for the best job openings for you. But, as in Reinhold Niebuhr’s famous prayer, possessing the wisdom to know what you can and cannot change is crucial. This requires being patient with yourself if and when you fail so that you can pick yourself back up and try again.

Photo/Graph credit #1: Engin Akyurt at https://www.pexels.com/photo/woman-holding-black-flag-1571734/

Photo/Graph credit #2: realworkhard at https://pixabay.com/photos/balance-meditation-meditate-silent-110850/

Photo/Graph credit #3: Free Photos at https://pixabay.com/photos/person-man-male-worker-inside-731151/

Photo/Graph credit #4: Flazingo Photos at https://www.flickr.com/photos/124247024@N07/14110060693/