Aspiring data scientists frequently ask me for recommendations about the best way to learn data science. Should they try a bootcamp, enroll in an online data science course, or pursue one of the myriad other options out there?
In the last several years, we have seen the development of many different types of educational programs that teach data science, ranging from free online tutorials to bootcamps to advanced degrees at universities, and the pandemic seems to have fostered the establishment of even more programs to meet the increased demand for remote learning. Although probably a good thing overall, having more options increases the complexity of deciding which one to do and the potential noise of programs upselling their services.
This article is a high-level survey of the four basic types of data science education programs to help you think about which might work best for you. Without already knowing data science, it can be difficult to assess how effective a program is at teaching it. Hopefully, this article will help break that chicken-and-egg conundrum.
These are the four basic ways to learn data science:
Do-it-yourself learning
Online courses
Bootcamps
Master’s degree or other university degree in data science (or related field)
I will discuss them in order from cheapest to most expensive. I also included two hybrid strategies that combine a few of these and are worth considering as well.
Option 1: Do-It-Yourself Online
There are tons of free, online data science resources that can either teach data science from scratch or explain just about any data science content you could possibly want to know. These range from tutorials for those who learn by doing (like W3Schools), to videos on YouTube and other sites for audio learners (like Andrew Ng’s YouTube series), to articles for visual learners who enjoy reading (like Towards Data Science). You could scour the internet and teach yourself. This approach has the pros of being free and perfectly flexible to tailor to your schedule.
But as a former teacher, I have found independent learning is not for everyone. You must be entirely self-motivated and self-structured to teach yourself like this. So, know yourself: are you the type of person who could learn well completely independently like this?
Education programs tend to provide these resources that you might lack if you go it alone:
1) Curriculum Oversight: Data science experts in any education program generally establish some kind of data science curriculum for you that includes the necessary topics in the field. Many people who are new to data science do not yet know which data science concepts and skills are most important to learn. This can create a chicken-and-egg problem for self-learners, who must learn the field at least a little to know the most important items to learn in the first place. Data science programs help circumvent this by giving you an initial curriculum to start with.
2) Guidance of the Norms of the Field: In addition to teaching the material, education programs implicitly introduce students to data science norms and ways of thinking. Even though there are times to deviate from established custom, these norms are important when first working on teams with fellow data scientists. Sometimes self-learners learn the literal material but do not gather the implicit perspectives that enable their incorporation into the data science community.
3) External Social Accountability: Education programs provide a form of social accountability that subtly encourages you to get the work done. Self-learners must rely almost exclusively on their own self-motivation and self-accountability, which, in my experience, works for some people but not others.
4) Social Resources: Education programs (especially ones that meet either in-person or virtually) provide various people – teachers, students, and in some cases mentees/underlings – with whom one can talk through problems, who can help you discover your weaknesses and shortcomings, and who can help determine ways to address them. Minute programming details that beginners easily overlook, but experts readily spot, can cause your entire program to fail. To learn independently, you will have to either solve all of these yourself or find data science friends or family who are willing to help you.
5) Certification of Skills: Education programs bestow degrees, grades, and other certifications as external proof that you do, in fact, possess the requisite skills for a data science role. Learning on your own, you must prove to employers by yourself that you have these skills. Developing a portfolio of thought-provoking projects you have done is the best way to demonstrate this.
6) Guidance in Forming Projects: An impressive project works wonders for showcasing your data science skills. In my experience, beginners to data science often do not yet possess the skills to create, complete, and market a thought-provoking yet doable project, and one of the most important roles data science educators can have is helping students think through how to develop one. You must do this yourself when learning alone.
One can overcome each of these deficits. I have found that for people who learn well independently, the cost and flexibility advantages of self-teaching easily outweigh these cons. Thus, the crucial question is: would this form of independent learning work for you? In my experience, it works for a comparatively small percentage of people, but for those it works for, it is a great option.
If you do decide to teach yourself, I would recommend considering the following:
1) Be conscientious about your learning style when curating your materials. For example, if you are a visual learner, then reading online resources would be best, but if you are more of an auditory learner, then I would recommend watching video tutorials/lectures on, say, YouTube.
2) If you have data science friends willing to help you, they can be a great asset, particularly in determining what data science materials to learn, troubleshooting any coding issues you might have, and/or developing good projects.
3) People in general learn data science best by doing data science. Avoid the common trap of only reading about data science without getting your hands dirty and experimenting yourself (preferably with unclean, annoying, real-world data, not already trimmed, “textbook perfect” data). Using pristine data to first learn the concepts is fine, but make sure you graduate yourself to practicing with real-life dirty data.
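To make that concrete, here is a minimal sketch of the kind of hands-on cleaning practice I mean, using pandas; the file name and columns are hypothetical stand-ins for whatever messy dataset you find:

```python
import pandas as pd

# Hypothetical messy file and columns, purely for illustration
df = pd.read_csv("survey_responses.csv")

# Standardize inconsistently entered category labels ("ny", " NY ", "Ny")
df["state"] = df["state"].str.strip().str.upper()

# Coerce a numeric column polluted with stray text like "N/A" or "unknown";
# unparseable entries become NaN instead of crashing the analysis
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Surface how much data the cleaning exposed as unusable, then decide what to do
print(df["age"].isna().mean())  # fraction of rows with unusable ages
df = df.dropna(subset=["age"])
```

Wrestling with decisions like that last line (drop the rows? impute them? investigate why they are missing?) is exactly the experience that pristine textbook data never gives you.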
Option 2: Online Course
A variety of online courses exist. Most of them are relatively cheap (usually around $20-$50 a month or $100-$200 per course). For example, at the time of writing, Udemy has an introductory data science course for a flat rate of $94.99, and Coursera a course for $19.99 a month (both with prices varying based on discounts and other special deals). Online courses are generally the cheapest type of course you can enroll in, though given the length of most, you will probably have to take several levels of courses (introductory through advanced) to learn the field.
Another advantage is that they are flexible: you can learn at your own pace, based on the needs of your schedule. This is really valuable for people who are also working a job and studying on the side, have family commitments, and/or have other obligations complicating their schedules. Keep in mind, though, that because you often pay per month, how many months you take often dictates the final cost. At the end of the day, spending an extra $100 or so to take a few more months to complete the course is still much cheaper than the other course options.
On the other hand, like doing it yourself, they tend to lack the social benefits of classroom learning: instructors to ask questions of and to provide external social accountability, and fellow students to work alongside. In my experience, this makes them very challenging for some learners, though others are not as affected by it.
In addition, many online courses provide more of a cursory summary of data science and lack the complex projects that are necessary both to learn data science and to market yourself to others. Even though there are exceptions, online courses are often good at introducing data science concepts rather than providing an in-depth exploration. Many focus on canned problems with already cleaned, ready-to-use data instead of letting you practice on the messy, complex, and often just plain silly data most data scientists actually have to use at their jobs. They also often lack the personnel for one-on-one coaching to mentor each student through portfolio-building projects with complex data.
Thus, online courses tend to provide good, cost-effective introductions to data science, helpful for seeing whether you like the field (see Hybrid #1 below), but they do not generally provide the refined training necessary to become a data scientist. Now, some programs are evolving their courses. Especially as the pandemic increases demand for remote learning, online learning platforms are developing more robust online data science courses. If you choose to learn by taking online courses, I recommend supplementing them with your own projects to gain experience practicing data science work and to have something to showcase in job interviews.
Hybrid #1: Use an Online Course to Introduce Data Science (or Programming)
If you are completely new to data science, an online course can provide a low-cost, structured space to get a sense for what the field entails and determine whether it is a good fit for you. I have seen many people enroll in multi-thousand-dollar bootcamps or university degree programs only to learn there that they do not like doing data science work. An online course is a much cheaper space to discern that.
You could always explore data science yourself for free to decide whether you like it (see Option 1) instead of taking an online course, but I have found that many people who have never seen data science before do not know what to look up in the field to get started. An introductory online course is not that expensive, and the initial orientation into the major topic areas can be well worth the cost.
There are three basic versions of this approach:
1) If you do not already know a programming language, take an online programming course. I explained in this article why I would recommend Python as the language to learn (with Julia as a close second). If you do not like programming, then you have learned the lesson that you should not become a data scientist, and even if you do not end up in data science, programming is such a valuable skill that having some training in it will only help your occupational prospects in most other related fields.
2) If you do know a programming language, take an introductory data science course. These often provide a high-level overview of data science, especially helpful for people who need to work with data scientists and understand what they are talking about. If you need a math refresher, this is a great option as well.
3) I have seen prospective data scientists take online data analytics courses to prepare them for and determine their potential interest in data science. I would not recommend this, however. Even though data scientists will sometimes treat data analytics as a “diet” or “basic” version of data science, data analytics is a different field requiring different skills. For example, data analytics courses typically do not include rigorous programming. They generally focus on R and SQL if they teach programming at all, which are fine languages for data analytics and statistics but not enough for data science (for which you would want a language like Python). Data analytics and data science also generally emphasize different fields of math: data analytics tends to rely on statistics while data science relies more on linear algebra, for example. Thus, what you would learn in those courses would not apply to data science as much as you might think. Now, if you are unsure whether you would like to become a data scientist or a data analyst, then a data analytics course might help you understand and get a feel for data analytics, but I would not use one to assess whether data science is a good fit for you.
Once you complete the online course, if you still think you would enjoy doing data science work, then you can choose any of the options to learn the field in more depth. This may seem like just getting you back to square one, but by taking an introductory programming or data science course, you have levelled yourself up, so to speak, and are more ready to face the “boss battle” of becoming a data scientist.
Option 3: Data Science Bootcamps
Data science bootcamps have also become popular. They tend to be intensive training programs lasting several weeks to several months (in my experience, often ranging from 2 to 6 months). The traditional pre-pandemic bootcamp was in-person and would often cost around $10,000 to $15,000. Metis’s bootcamp is a good example of what they often look like.
Their biggest pros are that they offer the advantages of classroom education far more cheaply and in much less time than getting a university degree. They are a significant step up in cost from the previous options (see Con 2 below), but they seek to provide a comparable (if less academically advanced and in-depth) scope of knowledge to a master’s degree in data science for a significantly lower price and in a fraction of the time. Even though this can often make their pace feel intense, the good bootcamps tend to mostly succeed at providing it. This makes them a great option for anyone who knows they want to become a data scientist. Finally, unlike the previous options, you get a teacher (or teachers) to ask questions of and to motivate you, and a set of fellow students to struggle through concepts with. The best programs also offer occupational coaching and build strong networks in data science communities to help their students find jobs afterwards.
They have some major cons, however:
1) They can feel fast-paced, unloading complex concepts in a short amount of time. Many of my friends who have done bootcamps have reported feeling cognitive whiplash. Expect those weeks/months to be mentally intense and to subsume your life. Data science bootcamps are often 9-5 full-time jobs during that time, and you will likely be too mentally exhausted to work on other things in the evenings or on weekends (plus, in some cases, you will have homework to complete then anyway). A few weeks or months is not terribly long for such an ordeal, but it makes bootcamps much less flexible than the previous options. For example, this forces many students to take time off from their current jobs to complete the bootcamp and to limit their social, familial, and other obligations as much as they can during it. This makes bootcamps difficult for anyone unable to take time off work, with busy social or familial lives, or otherwise with a lot going on.
2) At several thousand dollars, they are noticeably more expensive than the previous options (but still much cheaper than universities). Some offer scholarships and other services on a need basis, but even then, the opportunity cost of having to put a job on hold can still be expensive. Given data scientists’ generally high salaries, landing a data science job would likely make the money back, but it takes a hefty initial investment.
This makes them an especially poor option for anyone thinking about data science but not sure whether they want to do it. $10,000 is a lot to spend simply to learn that you do not like the field, and there are many cheaper ways to initially explore it (see especially Hybrid #1). The cost still might be worth it, however, for anyone who really wants to become a data scientist but does not yet possess key skills and knowledge.
3) At the time of writing, the Covid-19 pandemic has forced most data science bootcamps to meet remotely anyway, making their services far more similar to the much cheaper online courses. Many have sought to simulate the classroom environment virtually, trying to provide some type of social space, but the in-person classroom was a major advantage that made their significant increase in cost over the previous options worthwhile.
4) They tend to exist in large cities (especially tech centers). For example, bootcamps in the United States tend to concentrate in New York City, Los Angeles, Chicago, San Francisco, etc. Prior to the pandemic, anyone not living in those places would have to travel to and temporarily reside wherever their chosen bootcamp was, an additional expense.
5) They are often difficult for people who do not know programming and for those who do not know college-level mathematics like linear algebra, calculus, and statistics. If you do not know programming, I would recommend learning a programming language like Python (for more, see this article I wrote explaining why to learn Python of all languages) through a cheap online course and/or online tutorials first. Some data science bootcamps offer a preparatory introductory online course that teaches the prerequisite coding and math skills for those who lack them. These are worth considering as well, but keep in mind that an equivalent online course might be cheaper with roughly the same educational value.
If you decide to do a bootcamp, these criteria are important when researching which bootcamp to choose:
1) Project Orientation: How well do they enable you to practice data science through portfolio-building projects, and how impressive are the projects their alumni completed? The best data science bootcamps generally teach in a project-oriented fashion.
2) Job-Finding Resources and/or Job Guarantee: What resources or coaching do they give to help you find a job afterwards? Help with networking, presenting yourself, and interviewing, for example, is important for finding a job as a data scientist, and in addition to teaching the technical curriculum, the best programs tend to provide occupational coaches to help specifically with the job-finding process. Also, some programs give a job guarantee: if you do not find a data science job within a certain number of months of graduating, they refund your tuition. This generally shows they take job finding seriously enough to risk their own money on it (although do check the fine print to see the exact terms they are agreeing to).
3) Alumni Resources: A surprisingly important detail to consider is how much a bootcamp invests in cultivating its alumni network. I was surprised by how receptive to meeting and networking the alumni of the online bootcamp I attended were, and how satisfied alumni tend to be with the bootcamp. The effort a bootcamp makes to work with and maintain relationships with its alumni impacts this significantly. Connectedness with alumni can be difficult to assess when researching programs from afar, but asking whether you can speak with alumni to learn about their experiences with the program, checking a bootcamp’s alumni activity on LinkedIn and other social media websites, and asking what kinds of alumni networking opportunities they facilitate can be great ways to assess how intentional a program is about cultivating relationships.
4) Scholarship Options: Some programs offer full or at least partial scholarships based on need. Clearly, ways to knock down the cost of a bootcamp are worth pursuing, especially if a bootcamp seems like an ideal option for you but the cost seems too daunting.
Hybrid #2: Online Bootcamp
Online bootcamps tend to possess the schedule flexibility of online courses but offer more rigorous, personal (albeit remote) learning, allowing you to combine the best aspects of data science bootcamps and online programs. They are also generally cheaper than traditional bootcamps (yet more expensive than an online course). Finally, they tend to be a much better option for those who do not live in a major city that happens to have a local data science bootcamp. The pandemic, if anything, has probably helped produce even more online bootcamp programs, since it has forced data science bootcamps to teach virtually.
I enrolled in Springboard’s online data science bootcamp in 2017, a great example of an online bootcamp. At the time, it cost roughly $1,000 a month (at the time of writing, their standard rate is $1,490 a month, and they state their program generally takes six months). This is cheaper than traditional bootcamps but still substantial, totaling roughly $9,000 for six months at the current rate. They had an online curriculum typical of online courses but also provided weekly virtual meetings with an instructor to discuss the material and any issues you were having. Now they seem to include virtual lessons online as well. This individualized training and remote classroom environment are the main value adds over an online course, and you must assess whether, for you, they would be worth the additional cost. The program was self-paced, providing much greater flexibility in when and how often you work than typical bootcamps. They also refunded your money if you did not find a job within six months of completion.
If you choose this option, be aware of the potential pitfalls of both online courses and traditional bootcamps. Just like with online programs, you will need to evaluate whether you are comfortable learning the curriculum by yourself (even though you can meet with a mentor for major issues once a week, you would be doing the bulk of the learning by yourself throughout the week). Like with traditional bootcamps, expect the learning to be mentally intense, and make sure they help you develop portfolio-building projects and provide job-finding resources and training.
Option 4: Master’s Degree or Other University Degree
The final option is to go back to school to get a degree in data science. This is the most expensive and time-consuming option: a master’s degree (a logical choice if you already have a bachelor’s) is generally the shortest path, taking two years, but can cost upwards of $100,000. Even if partial or full scholarships decrease that cost, the opportunity cost of spending several years of your life in school is still higher than for any of the other options. It can give a resume boost, however, if you know how to leverage it properly, which will likely increase your salary enough to make up for the initial cost. I would only recommend getting a master’s degree if you already know you love data science (say, because you have already been working in the field, preferably having also figured out the specific area of data science you want to do) but want to take your skills, technique, and/or theoretical knowledge of how the models work to the next level.
The best way to refine your data science skills is by doing data science: finding or creating contexts that push you as you practice data science. Graduate school is not the only environment in which to refine one’s data science skills (all the previous options could involve that if done well), and even though graduate schools can be great at providing rigor, these other options can be a lot cheaper and more flexible. Finally, at the time of writing at least, the demand for data scientists exceeds the supply of people in the field, so getting a data science job without an “official” university degree in data science is quite realistic.
University data science degree programs are relatively new – generally only a few years old. Thus, not all universities have literal data science degrees or departments but instead require that you enroll in a related program like computer science, statistics, or engineering to learn data science. This does not always mean these other programs are bad or unhelpful, but it often means you will have to perform tasks extraneous or semi-extraneous to data science proper in order to complete your degree (in some cases with minimal help from faculty in other fields).
When considering a program, you should make sure they are proactive about teaching professional and not just academic data science skillsets. These are the specific questions I would research to assess how well they might prepare you for non-academic data science jobs:
1) What proportion of their faculty currently work, or at least have worked, in industry as data scientists (or under a similar job title)?
2) How well connected is the department with local organizations, and might they be able to leverage these relationships to help you work with these organizations through a work-study program or internship during the program and/or employment afterwards?
3) Will they help you build – or at least give you the flexibility to build – your thesis into an applied data science project that would boost your resume with future employers?
If your chosen program lacks these, I would strongly recommend building resume/portfolio-boosting projects and networking with local data scientists on the side while completing the program. This takes considerable time and energy, so ideally your department would actively help you in this work, instead of requiring that you do it on your own while also completing all their work.
Funding options are something else to consider. Are they willing to fund your degree fully or at least partly? Work-study programs where you work while getting your master’s can be a great way to graduate with no debt and gain resume-building work experience (although they can keep you busy). I benefitted greatly from working as a data scientist while completing my master’s, both because I graduated with no debt and because it allowed me to practice and refine my skills.
Finally, most universities require that you live nearby and attend physically (at least before, and likely after, the pandemic). Thus, you might have to find a program near you or be willing to relocate for a few years if there is no data science degree program nearby. If so, you should factor moving expenses into the cost of doing the program.
Conclusion
Learning data science can be an awesome yet daunting prospect, and finding the right strategy for you is complicated, particularly given all the pedagogical, logistical, and financial considerations. Hopefully, this article has helped you think through how to journey forward.
In a previous post about data visualization in data science and statistics, I discussed what I consider the single most important rule of graphing data. In this post, I am following up to discuss the most important rules for making data tables. I will focus on data tables in reporting/communicating findings to others, as opposed to the many other uses of tables in data science, say, to store, organize, and mine data.
To summarize, graphs are like sentences, conveying one clear thought to the viewer/reader. Tables, on the other hand, can function more like paragraphs, conveying multiple sentences or thoughts to get an overall idea across. Unlike graphs, which often provide one thought, tables can be more exploratory, providing information for the viewer/reader to analyze and draw his or her own conclusions from.
Table Rule #1: Don’t be afraid to provide as much or as little information as you need.
Paragraphs can use multiple sentences to convey a series of thoughts/statements, and tables are no different. One can convey multiple pieces of information that viewers/readers can look through and analyze at their own leisure, using the data to answer their own questions, so feel free to take up the space you need. Several-page-long tables are fair game and, in many cases, absolutely necessary (although they often end up in appendices for readers/viewers needing a more in-depth take).
In my previous data visualization post, I gave this bar chart as an example of trying to say too many statements in a graph:
This is a paragraph’s-worth of information, and a table would represent it much better.[i] In a table, the reader/viewer can explore the values by country and year themselves and answer whatever questions they might have. For example, if someone wanted to analyze how a specific country changed over time, they could do so easily with a table, and if they wanted to compare the immigration ratios between countries in a specific decade, that is possible as well. In the graph above, each country’s subsegment starts in a different place vertically for each decade column, making it hard to compare the sizes visually, and since each decade has dozens of values, the latter analysis is visually difficult to decipher as well.
But, at the same time, do not be afraid to convey a sentence’s- or graph’s-worth of data in a table, especially when such data is central to what you are saying. Sometimes writers include one-sentence paragraphs when that single thought is crucial, and likewise, a single-statement table can have a similar effect. For example, a table for a single variable helps convey that that variable is important:
Gender  | Some Crucial Result
Male    | 36%
Female  | 84%
Now, sometimes in these single-statement instances, you might want to use a graph instead of a table (or both), which I discuss in more detail in Rule #3.
Table Rule #2: Keep columns consistent for easy scanning.
I have found that when viewers/readers scan tables, they generally subconsciously assume that all values in a column are the same: same units and same type of value. Changing the values of a column between rows can throw off your viewer/reader when he or she looks at it. For example, consider this made-up study data:
               | Control Group (n = 100) | Experimental Group (n = 100)
Mean Age       | 45                      | 44
Median Age     | 43                      | 42
Male No. (%)   | 45 (45%)                | 36 (36%)
Female No. (%) | 55 (55%)                | 64 (64%)
In this table, the rows each convey different types of values and/or units. So, for example, going down the control column, the first row is mean age measured in years. The second row switches to median age, a different type of value than the mean (although in the same unit of years). The final two rows convey the number and percentage of males and females in each group: both a different type of value and different units (counts and percentages rather than years). This can be jarring for viewers/readers, who often expect columns to contain the same kinds of values and units and naturally compare them as if they were similar types of values.
I would recommend transposing it like this, so that the columns represent the similar variables and the rows the two groups:
                             | Mean Age | Median Age (IQR) | Male No. (%) | Female No. (%)
Control Group (n = 100)      | 45       | 43 (25, 65)      | 45 (45%)     | 55 (55%)
Experimental Group (n = 100) | 44       | 42 (27, 63)      | 36 (36%)     | 64 (64%)
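If you build your summary tables programmatically, this fix is a one-liner. Here is a minimal pandas sketch, with the summary values hard-coded from the example above:

```python
import pandas as pd

# The jarring layout: each row is a different type of value and/or unit
summary = pd.DataFrame(
    {
        "Control Group (n = 100)": ["45", "43 (25, 65)", "45 (45%)", "55 (55%)"],
        "Experimental Group (n = 100)": ["44", "42 (27, 63)", "36 (36%)", "64 (64%)"],
    },
    index=["Mean Age", "Median Age (IQR)", "Male No. (%)", "Female No. (%)"],
)

# One transpose fixes it: each column now holds one kind of value,
# and each row is one group
print(summary.T)
```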
Table Rule #3: Don’t be afraid to also use a graph to convey magnitude, proportion, or scale
A table like the gender table in Rule #1 conveys pertinent information numerically, but the numbers themselves do not visually show the difference between the values:
Gender  | Some Crucial Result
Male    | 36%
Female  | 84%
Graphs excel at visually depicting the magnitude, proportion, and/or scale of data, so if, in this example, it is important to convey how much greater the “Some Crucial Result” is for females than males, then a basic bar graph allows the reader/viewer to see that the percentage for females is more than double that for males.
Now, to gain this visual clarity, the graph loses the ability to precisely relate the exact numbers. For example, looking at only this graph, a reader/viewer might be unsure whether the males are at 36%, 37%, or 38%. People have developed many graphing strategies to deal with this (making the grid lines sharper, or writing the exact numbers on top of, next to, or around each segment, among others), but combining the graph and table in instances where one needs both to convey the exact numbers and to give a sense of their magnitude, proportion, or scale can also work well:
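As a rough illustration of that combined approach, here is a minimal matplotlib sketch that draws the bar graph and writes the exact percentages on the bars, using the values from the gender table above:

```python
import matplotlib.pyplot as plt

genders = ["Male", "Female"]
results = [36, 84]  # "Some Crucial Result" percentages from the table above

fig, ax = plt.subplots()
bars = ax.bar(genders, results)
ax.set_ylabel("Some Crucial Result (%)")
ax.set_ylim(0, 100)

# Write the exact number on top of each bar so the reader gets both
# the visual magnitude and the precise value
for bar, value in zip(bars, results):
    ax.text(bar.get_x() + bar.get_width() / 2, value + 2, f"{value}%", ha="center")

plt.show()
```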
Finally, given that tables can convey multiple statements, feel free to use several graphs to depict the magnitude, proportion, or scale of one table. Do not try to overload a multi-statement table into a single, incomprehensible graph. Break down each statement you are trying to relate with that table and depict each separately in its own graph.
Conclusion
If graphs are sentences, then tables can function more like paragraphs, conveying a larger amount of information that makes up more than one thought or statement. This gives space for your reader/viewer to explore the data and interpret it on their own to answer whatever questions they have.
Newcomers to data science or artificial intelligence frequently ask me the best programming language to learn to build machine learning algorithms. Thus, I wrote this article as a reference for anyone who wants to know the answer to that question. These are what I consider the four most important languages, ranked in terms of usefulness based on both overall popularity within the data science community and my own personal experiences:
Best Programming Languages for Machine Learning:
#1 Choice: Python
#2 Choice: R
#3 Choice: Java
#4 Choice: C/C++
#1 Programming Language: Python
Python is the most popular language to use for machine learning, for three good reasons.
First, its package-based style allows you to utilize efficient machine learning and statistical packages that others have made, preventing you from having to constantly reinvent the wheel for common problems. Many if not most of the best packages (like NumPy, pandas, scikit-learn, etc.) are in Python. This almost allows you to “cheat” when programming machine learning algorithms.
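As a quick illustration of that package leverage, here is a minimal sketch that trains and evaluates a working classifier in a dozen lines, with scikit-learn (on one of its bundled toy datasets) doing all of the heavy lifting:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a bundled toy dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The entire "machine learning" part: the package does the heavy lifting
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```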
Second, Python is a powerful and flexible all-purpose language, so if you are building a machine learning algorithm to do something, you can easily build the code for the overall product or system in which you will use the algorithm without having to switch languages or software. It supports object-oriented, functional, and procedure-oriented programming styles, giving the programmer flexibility in how to code and allowing you to use whatever style, or combination of styles, you like best or best fits the specific context.
Third, unlike a language like Java or C++, Python does not require elaborate setup to run a single line of code. Even though you can easily build out the coding infrastructure if you need to, if you only need to run a simple command or test, you can start immediately.
When I program in Python, I personally love using Jupyter Notebook, since its interface allows me both to code and to easily show my code and findings as a report or document. Another data scientist can read and analyze my code and its output at the same time. I wish more data scientists published their papers and reports in Jupyter Notebook or other notebooks like it for this reason.
If you have time to learn a single programming language for machine learning, I would strongly recommend it be Python. The next three languages, R, Java, and C++, do not match its ease and popularity within data science.
#2 Programming Language: R
R is a popular language among statisticians, specifically tailored for advanced statistical analysis. It includes many well-developed packages for machine learning but is not as popular with data scientists as Python. For example, in Towards Data Science’s survey, 57% of data scientists reported using Python, with 33% prioritizing it, while only 31% reported using R, with 17% prioritizing it. This suggests that R is a complementary, not primary, language for data science and machine learning. Most R packages have their equivalent in Python (and to some extent the other way around). Unlike Python, an all-purpose language that can do wonders beyond analyzing data and developing machine learning algorithms, R is specifically tailored to statistics and data analysis and cannot do much beyond that. That said, R programmers are developing more and more packages for it, steadily expanding what it can do.
#3 Programming Language: Java
Java was once the most popular language around, but Python has dethroned it in the last few years. As an avid Java programmer who programs in Java for fun, it breaks my heart to put it so far down the list, but Python is clearly a better language for data science and machine learning. If you are working in an organization or other context that still uses Java for part or all of its software infrastructure, then you may be stuck using it, but most recent developments, particularly in machine learning, have occurred in Python and in R (and a few other languages). Thus, if you use Java, you will frequently find yourself having to reinvent the wheel unnecessarily.
Plus, one major con of Java is that conducting quick, on-the-go analysis is not possible, since you must write a whole class with a main method before you can run a single line of code. Java remains popular in certain contexts where the surrounding applications/software that utilize the machine learning algorithms are in Java, common in finance, front-end development, and companies that have long used Java-based software.
#4 Programming Language: C/C++
The same Towards Data Science survey I mentioned above lists C/C++ as the second most popular data science and machine learning language after Python. Java follows them closely, yet I ranked Java third and not C/C++ because I personally find Java to be a better overall language than C or C++. In C or C++, you may frequently find yourself reinventing the wheel – having to develop machine learning algorithms that others have already built in Python – but in some backend systems that have been built in C or C++, as in engineering and electronics, you do not have much of an option. C++ also shares the same problem as Java: it lacks the ability to do quick, on-the-go coding without having to build a whole infrastructure.
Conclusion
For a beginner to the data science scene, learning a single programming language is the most helpful way to enter the field. Use learning a programming language to assess whether data science is for you: if you struggle with and do not like programming, then developing machine learning algorithms for a living is probably not a good fit for you.
Many groups are trying to develop software that enables machine learning without having to program: DataRobot, Auto-WEKA, RapidMiner, BigML, and AutoML, among many others. The pros and cons and successes and failures of these tools warrant a separate blog post of their own (one I intend to write eventually). As of now, though, they have not replaced programming languages, either in the practical ability to develop complex machine learning algorithms or in demonstrating that you have the technical computational/programming skills for the field.
Depending on where you work and the type of field/tasks you are doing, you might end up using the language(s) or software your team works with so that you can easily collaborate on projects, and some areas of work or types of tasks might favor certain packages and languages. But if you demonstrate that you already know a complex programming language like Python (or Java or C++), even if it is not the preferred language of their team, you will likely convince a hiring manager that you can learn their specific language or software.
I worked as a data scientist at a hospital in New York City during the worst of the covid-19 pandemic. Over the spring and summer, we became overwhelmed as the city turned into (and left) the global hotspot for covid-19. I have been processing everything that happened since.
The pandemic overwhelmed the entire hospital, particularly my physician colleagues. When I met with them, I could often notice the combined effects of physical and emotional exhaustion in their eyes and voices. Many had just arrived from the ICU, where they had spent several hours fighting to keep their patients alive only to witness many of them die in front of them, and I could sense the emotional toll that was taking.
My experiences of the pandemic as a data scientist differed considerably yet were also exhausting and disturbing in their own way. I spent several months day-in and day-out researching how many of our patients were dying from the pandemic and why: trying to determine what factors contributed to their deaths and what we could do as a hospital to best keep people alive. The patient who died the night before in front of the doctor I was meeting with became, for me, a single row in an already way-too-large data table of covid-19 fatalities.
I felt like a helicopter pilot overlooking an out-of-control wildfire.[1] In such wildfires, teams of firefighters (aka doctors) position themselves at various strategic locations on the ground to push back the fire there as best they can. They experience the flames and carnage up close and personal. My placement in the helicopter, on the other hand, removed me from ground zero, instead forcing me to see and analyze the fire in its entirety and its sweeping, massive destruction across the whole forest. My position offered a strategic vantage point for determining the best ways to fight the fire, shielding me from the immediate destruction. Nevertheless, witnessing the vastness of the carnage from the air had its own challenges, stress, and emotional toll.
Being an anthropologist by training, I am accustomed to being “on the ground.” Anthropology is predicated on the idea that to understand a culture or phenomenon, one must understand the everyday experiences of those on the ground amidst it, and my anthropological training has instilled an instinct to go straight to and talk with those in the thick of it.
Yet, this experience has taught me that that perception is overly simplistic: the so-called “ground” has many layers to it, especially for a complex phenomenon like a pandemic. Being in the helicopter is another way to be in the thick of it just as much as standing before the flames.
Many in the United States have made considerable and commendable efforts to support frontline health workers. Yet, as the pandemic progresses and its societal effects grow in complexity in the coming months, I think we need to broaden our understanding of where the “frontlines” are and who counts as a “frontline worker” worthy of our support.
On actual battlefields, where the “frontline” metaphor comes from, militaries also set up layered teams to support the logistical needs of ground soldiers, and these support teams must frequently put themselves in harm’s way in the process. The frontline of this pandemic seems no different.
I think we need to expand our conception of what it means to be on the frontlines accordingly. Like anthropology, modern journalism, a key source of pandemic information for many of us, can fall into the trap of overfocusing on the “worst of the worst,” potentially ignoring the broader picture and the diversity of “frontline” experiences. For example, interviewing the busiest medical caregivers in the worst-affected hospitals in the most-affected places in the world likely does promote viewership, but only telling those stories ignores the experiences and sacrifices of the thousands of others necessary to keep them going.
To be clear, in this blog post, I do not personally care about acknowledgement of my own work, nor do I think we should ignore the contributions of these medical professional “ground troops” in any way. Rather, in the spirit of “yes and,” we should extend our understanding of “frontline workers” to acknowledge and celebrate the contributions of many other essential professionals during this crisis, such as those in transportation services, food distribution, postal work, etc. I related my own experiences as a data scientist because they helped me learn this, not out of any desire for recognition.
This might help us appreciate the complexity of this crisis and its social effects, and the various types of sacrifices people have been making to address it. As it is becoming increasingly clear that this pandemic is not likely to go anywhere anytime soon, appreciating the full extent of both could help us come together to buckle down and fight it.
[1] This video helped me understand the logistics of fighting wildfires, a fascinating topic in itself: https://www.youtube.com/watch?v=EodxubsO8EI. Feel free to check it out to understand my analogy in more depth.
A friend and fellow professor, Dr. Eve Pinkser, asked me to give a guest lecture on quantitative text analysis techniques within data science for her Public Health Policy Research Methods class with the University of Illinois at Chicago on April 13th, 2020. Multiple people have asked me similar questions about how to use data science to analyze texts quantitatively, so I figured I would post my presentation for anyone interested in learning more.
It provides a basic introduction to the different approaches so that you can determine which to explore in more detail. I have found that many people who are new to data science feel paralyzed when trying to navigate the vast array of data science techniques out there and are unsure where to start.
Many of her students needed to conduct quantitative textual analysis as part of their doctoral work but struggled in determining what type of quantitative research to employ. She asked me to come in and explain the various data science and machine learning-based textual analysis techniques, since this was out of her area of expertise. The goal of the presentation was to help the PhD students in the class think through the types of data science quantitative text analysis techniques that would be helpful for their doctoral research projects.
Hopefully, it will likewise allow you to determine the type or types of text analysis you might need so that you can then look them up in more detail. Textual analysis, as well as the wider field of natural language processing of which it is a part, is a rapidly growing subfield within data science doing important and groundbreaking work.
This is a follow-up to my previous article, “What Is Ethnography,” outlining ways ethnography is useful in professional settings.
To recap, I defined ethnography as a research approach that seeks “to understand the lived experiences of a particular culture, setting, group, or other context by some combination of being with those in that context (also called participant-observation), interviewing or talking with them, and analyzing what is produced in that context.”
Ethnography is a powerful tool, developed by anthropologists and other social scientists over the course of several decades. Here are three types of situations in professional settings where I have found ethnography to be especially powerful:
1. To see the given product and/or people in action
2. When brainstorming about a design
3. To understand how people navigate complex, patchwork processes
Situation #1: To See the Given Product and/or People in Action
Ethnography allows you to witness people in action: using your product or service, engaging in the type of activity you are interested in, or navigating whatever other situation you are interested in studying.
Many other social science research methods involve creating an artificial environment in which to observe how participants act or think. Focus groups, for example, involve assembling potential customers or users into a room, forming a synthetic space to discuss the product or service in question, and in many experimental settings, researchers create a simulated environment to control for and analyze the variables or factors they are interested in.
Ethnography, on the other hand, centers around observing and understanding how people navigate real-world settings. Through it, you can get a sense for how people conduct the activity for which you are designing a product or service and/or how people actually use your product or service.
For example, if you want to understand how people use GPS apps to get around, you can see how people use the app “in the wild”: when rushing through heavy traffic to get to a meeting or while lost in the middle of who knows where. Instead of hearing their processed thoughts in a focus group setting or trying to simulate the environment, you can witness the tumultuousness yourself and develop a sense for how to build a product that helps people in those exact situations.
Situation #2: When Brainstorming about a New Product Design
Ethnography is especially useful during the early stages of designing a product or service, or during a major redesign. Ethnography helps you scope out the needs of your potential customers and how they approach meeting said needs. Thus, it helps you determine how to build a product or service that addresses those needs in a way that would make sense for your users.
During such initial stages of product design, ethnography helps determine the questions you should be asking. Many have a tendency during these initial stages to construct designs based on their own perception of people’s needs and desires and to miss what customers or users do in fact need and desire. Through ethnography, you ground your strategy in the customers’ own mindsets and experiences.
The brainstorming stages of product development also require a lot of flexibility and adaptability: as one determines what the product or service should become, one must be open to multiple potential avenues. Ethnography is a powerful tool for navigating such ambiguity. It centers you on the users, their experiences and mindsets, and the context in which they might use the product or service, providing tools to ask open-ended questions and to generate new and helpful ideas for what to build.
Situation #3: To Understand How People Navigate Complex, Patchwork Processes
At a past company, I analyzed how customer service representatives regularly used various software systems when talking with customers. Over the years, the company had designed and bought various software programs, each performing a set of functions and with unique abilities, limitations, and quirks. Over time, this created a complex web of interlocking apps, databases, and interfaces, which customer service representatives had to navigate when performing their job of monitoring customers’ accounts. Other employees described the whole scene as the “Wild West”: each customer service representative had to create their own way of using these software systems while on the phone with a (in many cases disgruntled) customer.
Many companies end up with such patchwork systems – whether of software, of departments or teams, of physical infrastructure, or of something else entirely – built by stacking several iterations of development over time until they become a hydra of complexity that employees must figure out how to navigate to get their work done.
Ethnography is a powerful tool for making sense of such processes. Instead of relying on official policies for how various actions and procedures are supposed to be conducted, ethnography helps you understand and make sense of the unofficial and informal strategies people use to do what they need. Through this, you can get a sense for how the patchwork system really works, which is necessary for developing ways to improve or build upon such patchwork processes.
In the customer service research project, my task was to develop strategies to improve the technology customer service representatives used as they talked with customers. Seeing how representatives used the software through ethnographic research helped me understand and focus the analysis on their day-to-day needs and struggles.
Conclusion
Ethnography is a powerful tool, and the business world and other professional settings have been increasingly realizing this (cf. this and this). I have provided three circumstances where I have personally found ethnography to be invaluable. Ethnography allows you to experience what is happening on the ground and, through that, to shape and inform the research questions we ask and the recommendations or products we build for people in those contexts.
Interviewing for a data science role can be a daunting task, especially for those new to the field. I have lost count of the number of data science interviews I have had over the years, but here are the four most common questions I have encountered, with strategies for preparing for each. Prepping for these questions is a great opportunity to develop your story thesis, the most important part of any data science interview.
Most Common Data Science Questions:
1) Tell me about yourself.
2) Describe a data science project you have worked on.
3) What kind of experience do you have with messy data?
4) What programming languages and software have you used?
Question 1: Tell me about yourself.
This is probably hands down the most common interview question across all industries and fields, not just data science, so the fact that it is also the most commonly asked question in data science interviews may not seem that surprising. A good answer is crucial for establishing a favorable first impression and laying out the main story or thesis of who you are, which you will come back to throughout the interview.
In data science interviews, I emphasize my passion for using data science tools to help organizations solve complex problems that were previously vexing. If you are unsure what your thesis is, I designed this activity to help people decipher it. Here is an example of how I would describe myself:
“I fell in love with data science because I enjoy helping organizations solve complex problems. In my past roles, I have used my combined data science and social science skills to explore and build solutions for complicated problems for which the typical ways of doing things within the organization have not worked. I am energized by the intellectual stimulation of breaking down complex problems and using data science to develop innovative yet useful solutions. What kinds of problems do you guys have that have led you to need a data scientist like me?”
Your self-description should tell the story of who you are in a way that demonstrates how you would be a natural fit for the role and helpful to the organization. It is your interview thesis: if you lay it out well, then every other question you answer will simply involve fleshing out one (or a combination) of the three basic parts of your self-story: 1) who you are, 2) how your identity makes you a natural fit for the role, and 3) how this would benefit the organization.
Here are four other important observations to note about how I told my story:
I emphasized who I was – an innovator developing unique solutions to complex problems – while showing that my innovator identity naturally connects with data science and could be helpful for the organization. You might not consider yourself an “innovator” per se, but the trick is to figure out who you are based on what energizes and impassions you and then show how performing the data science role you are applying for is a natural fit for who you are.
I told the story with normal words, not technical jargon. I have found that many, if not most, of my interviews, especially the first-round interviews, are with employees without technical expertise, and since you often do not know the level of technical expertise of the interviewer, it is better to err on the non-technical side.
I kept my story positive, only mentioning what I like to do. Sometimes people instinctively try to illustrate what they want by describing things they do not like to do: e.g., “At my last job, I learned I do not like doing Y, so I am seeking to do X instead” or “I am doing Y, and I hate it. I want out.” I would describe these aspects of my story later if the interviewer asks, but I would stick with the positive at first: only mentioning what I want to do.
I used strong, subjective, even emotional phrases like “fell in love with,” “passionate about,” and “energized by.” At first glance, these phrases might seem overly informal, but I have found they help interviewers remember me. Do not overdo it, but being more vivid and personable generally helps rather than hurts your interview chances for data science positions.
Question 2: Describe a data science project you have worked on.
This is the second most common question I have encountered, so make sure you come prepared with an exemplar project to showcase. They may ask you a lot of questions about your project, so I would recommend choosing a project you did an amazing job on, really knocked out of the park, and are proud of. Unless there are disclosure issues, post your work on GitHub, a blog, LinkedIn, or somewhere else online, and include a link to it in your job application.
How to explain the project varies considerably depending on your interviewer’s degree of expertise. I generally start with a non-technical, high-level explanation and provide the technical details only if the interviewer prompts me with follow-up questions. This gives interviewers the freedom to choose the level of technical depth they would like in their follow-up. A data scientist interviewer worth his or her salt will quickly steer the conversation into the more technical aspects of your project, but even then, starting non-technical demonstrates that you know how to communicate your work effectively to non-technical audiences as well.
When describing your project, you are effectively telling the story of the project, and most project stories have the following core components:
Who: You are probably the story’s protagonist (it is your interview after all, so naturally pick a project or part of a project where you were the primary driver), but there are likely multiple important side characters that you will need to set up, like who commissioned the project, who it was for, who the data was about, and so on.
What: The problem, need, or question your project sought to address generally forms the “conflict” of the project story, so be sure to explain what led to it (in narrative terms, the inciting incident).
When and Where: The timeframe and setting/context in which the project took place (e.g., the organization you were working with or the class the project was for). How long you had to complete the project can also be important to establish.
How: What you did to solve the problem. If you tried several approaches before discovering what worked, the how includes both your methodological story and your final solution (this is the rising and falling action of how you overcame the problem). This is the meat of your story. You will want both technical and non-technical descriptions of the how:
Technical How: Generally, the two core parts of a technical description are the model you used (and any others you tried, if applicable) and how you selected your features/variables. Another important part might be how you cleaned and/or gleaned the data. (For the kind of workflow this might cover, see the sketch after this list.)
Non-technical How: I have found that non-technical audiences usually do not glean much from either the model I ended up using or my feature-selection procedure. Instead, I explain what functionality the model needed to solve the problem I had just set up: for example, “I built a model that calculated the probability of X phenomenon based on data sources A, B, and C, testing various types of models to determine which would do this best, and then discerned which variables among those datasets were the best to use.” For a non-technical audience, that is generally enough. The core components for them are what goes into the model (the data), what result the model produces from it, and how that informed the problem, need, or question driving the project.
Finally, in your how explanation, make sure you slip in whatever programming languages and software you used: Python, R, SQL, Azure, etc.
Why: This is your explanation of why you chose the approach(es) you did for your how. Just like with the how, you will need technical and non-technical explanations of the why.
Make sure your non-technical explanation of the why aligns with your non-technical how. I commonly see data scientists go over a non-technical person’s head by providing a technical why for their non-technical how. In particular, I would not explain the metrics or criteria used to compare models or select features in a non-technical explanation, since these will likely lose a non-technical person. If my non-technical how focused on what data the model used and what it did with that data, then my non-technical why focuses on why building a model to do that mattered and how it helped others and/or myself in the real world.
What happened: This is the result of the project. Did you succeed or fail (or land somewhere in between)? Was it useful for whoever you built it for? Were you able to conduct any follow-up analysis after deployment? Perhaps most importantly, what did you learn from the experience? In narrative terminology, this is the resolution. The more quantitatively you can measure the outcomes, the better.
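To make the technical how more concrete, here is a minimal sketch of the kind of workflow it might cover: selecting features and comparing candidate models on a labeled tabular dataset. The file name, column names, and model choices here are hypothetical placeholders for illustration, not a prescription for your project.

```python
# A minimal sketch of a "technical how": feature selection plus model
# comparison on a hypothetical labeled dataset. The file and column
# names ("project_data.csv", "target") are made-up placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("project_data.csv")
X, y = df.drop(columns=["target"]), df["target"]

# Feature selection: keep the 10 features most associated with the target.
selector = SelectKBest(f_classif, k=10).fit(X, y)
X_selected = X.loc[:, selector.get_support()]

# Model comparison: score each candidate with 5-fold cross-validation.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X_selected, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

In an interview, each of these steps maps onto a sentence of your story: which features you kept and why, which models you tried, and how you judged the winner.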
These are the basic components of a project story. Here is the most common project I use, and as you read through it, feel free to analyze how I present each component of the story. I wrote that blog post for a general audience, so I provided my non-technical how and why.
Question 3: What kind of experience do you have with messy data?
Interviewers ask me this question surprisingly often. They usually preface it by explaining that their organization has a lot of messy data that would require cleaning and processing from their future data scientist. This is a great opportunity to showcase your comfort with the day-to-day realities of data science work.
I typically answer with something like this:
“Yes, I organize and clean messy data all the time. That’s par for the course in data science: the running joke among data scientists is that 90% of any data science project is data cleaning and 10% is actually doing anything with it. At least you guys are honest about the fact that your data is messy. When I worked as a consultant, for example, I talked with many organizations about potential data science projects, and if they said their data was clean and ready to go, chances are they were lying either to themselves or to me about how messy and haphazard it really was. The fact that you are upfront about the messiness of your data tells me that you as an organization are realistically assessing where you are and what you need.”
This answer not only establishes that I have handled messy data before but also normalizes the problem as a routine part of the field that an expert (like me) can resolve, and it compliments the interviewers for being up front. Answering this question confidently and positively has made me the front-runner candidate in some interviews. A good answer here is a perfect opportunity to endear yourself to your interviewer.
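If you want something concrete to point to when this question comes up, here is a minimal sketch of the kind of routine cleaning steps I am describing. The file name and column names (“raw_export.csv”, “signup_date”, and so on) are hypothetical placeholders, not from a real project.

```python
# A minimal sketch of routine cleaning steps on a hypothetical messy
# dataset; every file and column name here is a made-up placeholder.
import pandas as pd

df = pd.read_csv("raw_export.csv")

# Normalize column names: strip whitespace, lowercase, underscore spaces.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Coerce types: malformed dates and numbers become NaT/NaN, not crashes.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Drop exact duplicates and rows missing the key identifier.
df = df.drop_duplicates().dropna(subset=["customer_id"])

# Standardize a categorical field with inconsistent spellings.
df["state"] = df["state"].str.strip().str.upper().replace({"CALIF.": "CA"})
```

Being able to rattle off steps like these, along with the judgment calls behind them (such as when to coerce bad values versus drop the rows), is what makes the answer above land as expertise rather than bravado.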
Question 4: What programming languages and/or software have you used?
Even though a technical interviewer might ask this as well, I have encountered this question most frequently from non-technical interviewers. In my experience, fellow data scientists have more insider ways of deciphering whether you do in fact know data science, but for non-technical interviewers, this question is their initial probe. Sometimes they will work through a laundry list of software and/or languages to determine whether you are qualified.
Now, I believe that having experience with the exact combination of software tools that the data science team you would be joining uses is generally not that important a criterion for job success. For a good data scientist, learning another software system or programming language once you already know dozens is not that difficult a task. But the question is completely natural and reasonable from their side, so you will have to answer it.
If they ask open-endedly what software and languages you have used, list the ones you know, perhaps starting with those you use most often. I generally start by mentioning Python, since not only is it my favorite language for data science (see this article), but it also conveys that I am familiar with programming in general.
More often, though, they will ask whether you have used X software before, often running down a list they have in front of them. I would never recommend lying by claiming experience with software you have never used, but I would recommend recasting a “No” by naming an equivalent tool you have worked with. Here is an example:
“No, I have not used Julia, but that is because I prefer Python for the tasks others might use Julia for. Python is a comparably powerful language, and the data science teams I have worked on happened to prefer it over Julia.”
This casts the “No” in a more positive light, and it shows that you know what the software they mentioned is for and are confident you could pick up whatever your would-be team uses.
Question 5: What are you looking for in a job?
Most often, this is the last major question interviewers ask me, though I have gotten it at the beginning as well. They probably save it until the end because it transitions easily into the next part of the interview: either their describing the role or your asking any questions you have.
If you did a good job laying out your thesis story in the first question, then here you simply restate it from a different angle. You already laid the groundwork, and now you are just bringing it home. If they ask this at the beginning of the interview, before the “Tell me about yourself” question, then I use it to tell my thesis story from this angle first.
Here is my typical answer:
“Like I said, I am energized by figuring out how to help organizations solve complex data science problems. Over the years, I have found that two concrete things in an organization help me do this. First, I thrive in stimulating work environments where I am given the space and resources to think creatively through problems. Second, I need to work with people from a variety of backgrounds and disciplines, from whom I can learn and with whom I can develop innovative approaches to the problem at hand. You guys seem to provide both. [I then conclude by explaining why they seem to provide both, based on what I have learned about the organization during the interview, or, if we have not had a chance to discuss this yet, I ask about it.]”
Notice that the first sentence references my answer to the “Tell me about yourself” question. If they ask this question before I have given that spiel, I spend 30 to 60 seconds providing a condensed self-introduction and then continue with the rest of the answer.
Conclusion
These are the five most common data science interview questions I have encountered and how to prepare for them. I have found that when data scientists give advice on preparing for job interviews, they often focus on highly technical, factual questions (e.g., here and here). Even though a solid data science foundation can be important, refining your overall story thesis – who you are, what you are passionate about doing, and how that relates to this job – matters far more for advancing through the interview process.
I have found that humans, even supposedly “nerdy” data scientists, tend to connect with people and stories, so if you can hook them there, they generally remember you better and are more likely to hire you. When you have a compelling story, every other question will naturally fall into place as an intuitive further clarification of that overall story.