How Do I Become a Data Scientist? The Four Basic Strategies to Learn Data Science

Aspiring data scientists will frequently ask me for recommendations about the best way to learn data science. Should they try a bootcamp or enroll in an online data science course, or any of the myriad options out there?

In the last several years, we have seen the development of many different types of educational programs that teach data science, ranging from free online tutorials to bootcamps to advanced degrees at universities, and the pandemic seems to have fostered the establishment of even more programs to meet the increased demand for remote learning. Although this is probably a good thing overall, having more options makes it harder to decide which one to pursue and adds noise from programs upselling their services.

This article is a high-level survey of the four basic types of data science education programs to help you think about which might work best for you. Without already knowing data science, it can be difficult to assess how effective a program is at teaching it. Hopefully, this article will help break that chicken-and-the-egg conundrum.

These are the four basic ways to learn data science:

  1. Do-it-yourself learning
  2. Online courses
  3. Bootcamps
  4. Master’s degree or other university degree in data science (or related field)

I will discuss them in order from the cheapest to the most expensive. I have also included two hybrid strategies worth considering, each of which combines a few of these. This table provides a quick, high-level synopsis of each one:

Option 1: Do-It-Yourself Online

There are tons of free, online data science resources that can either teach data science from scratch or explain just about any data science content you could possibly want to know. These range from tutorials for those who learn by doing (like W3Schools), to videos on YouTube and other sites for auditory learners (like Andrew Ng’s YouTube series), to articles for visual learners who enjoy reading (like Towards Data Science). You could scour the internet and teach yourself. This approach has the pros of being free and perfectly flexible to tailor to your schedule.

But as a former teacher, I have found independent learning is not for everyone. You must be entirely self-motivated and self-structured to teach yourself like this. So, know yourself: are you the type of person who could learn well completely independently like this?

Education programs tend to provide these resources that you might lack if you went it alone:

1) Curriculum Oversight: Data science experts in any education program generally establish some kind of data science curriculum for you that includes the necessary topics in the field. Many people who are new to data science do not yet know which data science concepts and skills are most important to learn. This can create a chicken-and-egg problem for self-learners, who must learn the field at least a little to know the most important items to learn in the first place. Data science programs help circumvent this by giving you an initial curriculum to get started with.

2) Guidance on the Norms of the Field: In addition to teaching the material, education programs implicitly introduce students to data science norms and ways of thinking. Even though there are times to deviate from established custom, these norms are important when first working on teams with fellow data scientists. Sometimes self-learners learn the literal material but do not gather the implicit perspectives that enable their incorporation into the data science community.

3) External Social Accountability: Education programs provide a form of social accountability that subtly encourages you to get the work done. Self-learners must rely almost exclusively on their own self-motivation and self-accountability, which, in my experience, works for some people but not others.

4) Social Resources: Education programs (especially ones that meet either in-person or virtually) provide various people (teachers, students, and in some cases mentees/underlings) with whom you can talk through problems and who can help you discover your weaknesses and shortcomings and determine ways to address them. Minute programming details that beginners easily overlook, but that experts might easily spot, can cause your entire program to fail. To learn independently, you will have to either solve all of these yourself or find data science friends or family who are willing to help you.

5) Certification of Skills: Education programs bestow degrees, grades, and other certifications as external proof that you do, in fact, possess the requisite skills for a data science role. Learning on your own, you must prove to employers that you have these skills by yourself. Developing a portfolio of thought-provoking projects you have done is the best way to demonstrate this.

6) Guidance in Forming Projects: An impressive project works wonders for showcasing your data science skills. In my experience, beginners to data science often do not yet possess the skills to create, complete, and market a thought-provoking yet doable project, and one of the most important roles data science educators can have is helping students think through how to develop one. You must do this yourself when learning alone.

One can overcome each of these deficits. I have found that for people who learn well independently, the cost and flexibility advantages of teaching yourself easily outweigh these cons. Thus, the crucial question is: would this form of independent learning work for you? In my experience, it works for a comparatively small percentage of people, but for those people, it is a great option.

If you do decide to teach yourself, I would recommend considering the following:

1) Be conscientious about your learning style when crafting your material. For example, if you are a visual learner, then reading online resources would be best, but if you are more of an auditory learner, then I would recommend watching video tutorials/lectures on, say, YouTube.

2) If you have data science friends willing to help you, they can be a great asset, particularly in determining what data science materials to learn, troubleshooting any coding issues you might have, and/or developing a good project(s).

3) People in general learn data science best by doing data science. Avoid the common trap of only reading about data science without getting your hands dirty and experimenting yourself (preferably with unclean, annoying, real-world data, not already trimmed, “textbook perfect” data). Using pristine data to first learn the concepts is fine, but make sure you graduate yourself to practicing with real-life dirty data.
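
As a small illustration of what practicing on dirty data can involve, here is a minimal sketch using only the Python standard library. The sample rows, the column names, and the `clean_rows` helper are all invented for this example; real-world data will have its own problems, and you would more likely reach for a package like pandas.

```python
import csv
import io
from statistics import mean

# Hypothetical messy export: inconsistent casing, stray whitespace,
# a missing value, a non-numeric entry, and a duplicated person.
RAW = """name,age,city
 Alice ,34,new york
Bob,,Chicago
alice,34,New York
Carol,forty,chicago
Dan,29, Chicago
"""

def clean_rows(raw_text):
    """Return de-duplicated rows with normalized text and integer ages."""
    cleaned, seen = [], set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        age_text = row["age"].strip()
        # Keep the row, but mark the age as unknown when it is not a number.
        age = int(age_text) if age_text.isdigit() else None
        key = (name, age, city)
        if key in seen:
            continue  # skip rows that are duplicates after normalization
        seen.add(key)
        cleaned.append({"name": name, "age": age, "city": city})
    return cleaned

rows = clean_rows(RAW)
known_ages = [r["age"] for r in rows if r["age"] is not None]
print(rows)
print("mean of known ages:", mean(known_ages))
```

Even this toy case forces decisions that textbook data hides: whether to drop or keep rows with unknown ages, and what counts as a duplicate.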

Option 2: Online Course

A variety of online courses exist. Most of them are relatively cheap (usually around $20-$50 a month or $100-$200 per course). For example, at the time of writing this, Udemy has an introductory data science course for a flat rate of $94.99, and Coursera has a course for $19.99 a month (both with prices varying based on discounts and other special deals). Online courses are generally the cheapest courses you can enroll in, but because most are short, you will probably have to take several levels of courses (introductory to advanced) to learn the field.

Another advantage is that they are flexible: You can learn at your own pace, based on the needs of your schedule. This is really valuable for people who are working a job and studying on the side, or who have family commitments and/or other obligations complicating their schedules. Keep in mind, though, that because you often pay per month, how many months you take often dictates the final cost. At the end of the day, spending an extra $100 or so to take a few more months to complete the course is still much cheaper than the other course options.
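
To make the per-month arithmetic concrete, here is a rough sketch that compares the two example prices quoted above. It assumes the two courses are otherwise comparable and ignores discounts, so treat it as an illustration of the calculation rather than a recommendation; the function and variable names are my own.

```python
import math

FLAT_COURSE_PRICE = 94.99      # flat-rate course price quoted above
MONTHLY_SUBSCRIPTION = 19.99   # per-month price quoted above

def subscription_cost(months):
    """Total paid for a monthly subscription over the given number of months."""
    return round(MONTHLY_SUBSCRIPTION * months, 2)

# First whole month at which the subscription total exceeds the flat price.
break_even_month = math.floor(FLAT_COURSE_PRICE / MONTHLY_SUBSCRIPTION) + 1

for months in (2, 4, 6):
    print(months, "months of subscription:", subscription_cost(months))
print("subscription costs more than the flat rate from month", break_even_month)
```

With these two prices, a learner who expects to need five or more months would pay less with the flat-rate course, while a faster learner would pay less with the subscription.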

On the other hand, like doing it yourself, online courses tend to lack the social benefits of classroom learning: instructors to answer questions and provide external social accountability, and fellow students to work alongside. In my experience, this makes online courses very challenging for some learners, while others are comparatively unaffected by it.

In addition, many online courses provide more of a cursory summary of data science and lack the complex projects that are necessary both to learn data science and to market yourself to others. Even though there are exceptions, online courses are often better at introducing data science concepts than at exploring them in depth. Many focus on canned problems with already cleaned, ready-to-use data instead of letting you practice on the messy, complex, and often just plain silly data most data scientists actually have to use at their jobs. They also often lack the personnel for one-on-one coaching to mentor each student through portfolio-building projects with complex data.

Thus, online courses tend to provide good, cost-effective introductions to data science, helpful for seeing whether you like the field (see Hybrid #1 below), but do not generally provide the refined training necessary to become a data scientist. That said, some programs are evolving their courses. Especially as the pandemic increases demand for remote learning, online learning platforms are developing more robust online data science courses. If you choose to learn by taking online courses, I recommend supplementing them with your own projects, both to get experience practicing data science work and to have something to showcase in job interviews.

Hybrid #1: Use an Online Course to Introduce Data Science (or Programming)

If you are completely new to data science, an online course can provide a low-cost, structured space to get a sense for what the field entails and determine whether it is a good fit for you. I have seen many people enroll in bootcamps or university degree programs costing several thousand dollars, only to learn there that they do not like doing data science work. An online course is a much cheaper space to discern that.

You could always explore data science yourself for free to decide whether you like it (see Option 1) instead of taking an online course, but I have found that many people who have never seen data science before do not know what to look up in the field to get started. An introductory online course is not that expensive, and the initial orientation into the major topic areas can be well worth the cost.

There are three basic versions of this approach:

1) If you do not already know a programming language, take an online programming course. I explained in this article why I would recommend Python as the language to learn (with Julia as a close second). If you do not like programming, then you have learned the lesson that you should not become a data scientist, and even if you do not end up in data science, programming is such a valuable skill that having some training in it will only help your occupational prospects in most other related fields.

2) If you do know a programming language, take an introductory data science course. These often provide a high-level overview of data science, especially helpful for people who need to work with data scientists and understand what they are talking about. If you need a math refresher, this is a great option as well.

3) I have seen prospective data scientists take online data analytics courses to prepare them for and determine their potential interest in data science. I would not recommend this, however. Even though data scientists will sometimes treat data analytics as a “diet” or “basic” version of data science, data analytics is a different field requiring different skills. For example, data analytics courses typically do not include rigorous programming. They generally focus on R and SQL if they teach programming at all, which are fine languages for data analytics and statistics but not enough for data science (for which you would want a language like Python). Data analytics and data science also generally emphasize different fields of math: data analytics tends to rely on statistics, while data science tends to rely on linear algebra, for example. Thus, what you would learn in those courses would not apply to data science as much as you would think. Now, if you are unsure of whether you would like to become a data scientist or a data analyst, then a data analytics course might help you understand and get a feel for data analytics, but I would not use one to assess whether data science is a good fit for you.

Once you complete the online course, if you still think you would enjoy doing data science work, then you can choose any of the options to learn the field in more depth. This may seem like just getting you back to square one, but by taking an introductory programming or data science course, you have levelled yourself up so to speak and are more ready to face the “boss battle” of becoming a data scientist.

Option 3: Data Science Bootcamps

Data science bootcamps have also become popular. They tend to be intensive training programs lasting from several weeks to several months (in my experience, often 2 to 6 months). The traditional pre-pandemic bootcamp was in-person and would often cost around $10,000 to $15,000. Metis’s bootcamp is a good example of what they often look like.

Their biggest pros are that they offer the advantages of classroom education far more cheaply and in much less time than getting a university degree. They are a significant step up in cost from the previous options (see Con 2 below), but they seek to provide a scope of knowledge comparable to that of a master’s degree in data science (though less academically advanced and in-depth) for a significantly lower price and in a fraction of the time. Even though it can often make their pace feel intense, the good bootcamps tend to mostly succeed at providing this. This makes them a great option for anyone who knows they want to become a data scientist. Finally, unlike the previous options, you get a teacher(s) to answer your questions and motivate you, and a set of fellow students to struggle through concepts with. The best programs offer occupational coaching and build strong networks in data science communities to help their students find jobs afterwards.

They have some major cons, however:

1) They can feel fast-paced, unloading complex concepts in a short amount of time. Many of my friends who have done bootcamps have reported feeling cognitive whiplash. Expect those weeks/months to be mentally intense and to subsume your life. Data science bootcamps are often 9-5 full-time jobs during that time, and you will likely be too mentally exhausted to work on other things in the evenings or on weekends (plus in some cases you will have homework to complete then anyway). A few weeks or months is not terribly long for such an ordeal, but it makes them much less flexible than the previous options. For example, this forces many students to take time off from their current jobs to complete the bootcamp and to limit their social, familial, and other obligations as much as they can during their bootcamp. This makes it difficult for anyone unable to take time off work, with busy social or familial lives, or otherwise with a lot going on.

2) At several thousand dollars, they are noticeably more expensive than the previous options (but still much cheaper than universities). Some offer scholarships and other services on a need basis, but even then, the opportunity cost of having to put a job on hold can still be expensive. Given the generally high salaries in the field, landing a data science job would likely make the money back, but it takes a hefty initial investment.

This makes it an especially poor option for anyone thinking about data science but not sure whether they want to do it. $10,000 is a lot to spend to simply learn you do not like the field, and there are many cheaper ways to initially explore the field (see especially Hybrid #1). The cost still might be worth it, however, for anyone who really wants to become a data scientist but does not yet possess key skills and knowledge.

3) At the time of writing this, the Covid-19 pandemic has forced most data science bootcamps to meet remotely anyway, making their services far more similar to the much cheaper online courses. Many have sought to simulate the classroom environment virtually, trying to provide some type of social environment, but the in-person classroom environment was a major advantage that made their significant increase in cost over the previous options worthwhile.

4) They tend to exist in large cities (especially tech centers). For example, bootcamps in the United States tend to concentrate in New York City, Los Angeles, Chicago, San Francisco, etc. Prior to the pandemic, anyone not living in those places would have to travel and temporarily reside wherever their chosen bootcamp was, an additional expense.

5) They are often difficult for people who do not know programming and for those who do not know college-level mathematics like linear algebra, calculus, and statistics. If you do not know programming, I would recommend first learning a programming language like Python (for more see this article I wrote explaining why to learn Python of all languages) through a cheap online course and/or online tutorials. Some data science bootcamps offer a preparatory online course that teaches the prerequisite coding and math skills for those who do not have them. These are worth consideration as well, but keep in mind that an equivalent online course might be cheaper with roughly the same educational value.

If you decide to do a bootcamp, these criteria are important when researching which bootcamp to choose:

1) Project Orientation: How well do they enable you to practice data science through portfolio-building projects, and how impressive are the projects their alumni did? The best data science bootcamps generally teach in a project-oriented fashion.

2) Job-Finding Resources and/or Job Guarantee: What resources or coaching do they give to help you find a job afterwards? Networking, presenting yourself, and interviewing, for example, are important skills for finding a job as a data scientist, and in addition to teaching you the technical curriculum, the best programs tend to find occupational coaches to help specifically with the job-finding process. Also, some programs give a job guarantee: if you do not find a data science job within a certain number of months after graduating, then they refund tuition. This generally shows they take job finding seriously enough to risk their own money on it (although do check the fine print on the guarantee to see the exact terms they are agreeing to).

3) Alum Resources: A surprisingly important detail to consider is how many resources a bootcamp invests in cultivating alumni networks. I was surprised by how receptive to meeting/networking the alumni of the online bootcamp I did were, and by how satisfied alumni tend to be with the bootcamp. The effort a bootcamp makes to work with and maintain relationships with its alumni impacts this significantly. Connectedness with alumni can be difficult to assess when researching programs from afar, but asking whether you can speak with alumni to learn about their experiences with the program, checking a bootcamp’s alumni activity on LinkedIn and other social media websites, and asking about what kind of networking opportunities with alumni they facilitate can be great ways to assess how intentional a program is about cultivating relationships.

4) Scholarship Options: Some programs offer full or at least partial scholarships based on need. Clearly, ways to knock down the cost of the bootcamp would be great, especially if a bootcamp seems like an ideal option for you, but the cost seems too daunting.

Hybrid #2: Online Bootcamp

Online bootcamps tend to possess the schedule flexibility of online courses but offer more rigorous, personal (albeit remote) learning, allowing you to combine the best aspects of data science bootcamps and online programs. They are also generally cheaper than traditional bootcamps (yet also more expensive than an online course). Finally, they tend to be a much better option for those who do not live in a major city that happens to have a local data science bootcamp program. The pandemic, if anything, has probably helped produce even more online bootcamp programs, since it has forced data science bootcamps to teach virtually.

I enrolled in Springboard’s online data science bootcamp in 2017, a great example of an online bootcamp. At the time, it cost roughly $1,000 a month (at the time of writing, their standard rate is $1,490 a month, and they state that their program generally takes six months). This is cheaper than traditional bootcamps but still several thousand dollars, totaling roughly $9,000 for six months at the current rate. They had an online curriculum typical of online courses but also provided weekly virtual meetings with an instructor to discuss the material and any issues you are having. Now they seem to include virtual lessons online. This individualized training and remote classroom environment are the main value adds over an online course, and you must assess whether, for you, they would be worth the additional cost. They are self-paced, providing much greater flexibility on when and how often you work than typical bootcamps. They also refunded your money if you did not find a job within six months of completion.

If you choose this option, be aware of the potential pitfalls of both online courses and traditional bootcamps. Just like with online programs, you will need to evaluate whether you are comfortable learning the curriculum by yourself (even though you can meet with a mentor about major issues once a week, you would be doing the bulk of the learning by yourself throughout the week). Like with traditional bootcamps, expect the learning to be mentally intense, and make sure they help you develop portfolio-building projects and provide job-finding resources and training.

Option 4: Master’s Degree or Other University Degree

The final option is to go back to school to get a degree in data science. This is the most expensive and time-consuming option: a master’s degree (a logical choice if you already have a bachelor’s) is generally the shortest, taking two years, but such degrees can cost upwards of $100,000. Even if partial or full scholarships decrease that cost, the opportunity cost of spending several years of your life in school is still higher than any of the other options. It can give a resume boost, however, if you know how to leverage it properly, which will likely increase your salary to make up for the initial cost. I would only recommend getting a master’s degree if you already know you love data science (say because you have already been working in the field, preferably if you have also already figured out the specific area of data science you want to do) but want to take your skills, technique, and/or theoretical knowledge of how the models work to the next level.

The best way to refine your data science skills is by doing data science: finding or creating contexts to push you as you practice data science. Graduate schools are not the only potential environment to refine one’s data science skills (e.g., all the previous options could involve that if done well), and even though graduate schools can be great at providing rigor, these other options can be a lot cheaper and more flexible. Finally, at the time of writing this, at least, the demand for data scientists exceeds the number of actual people in the field, and so getting a data science job without an “official” university degree in data science is pretty realistic.

University data science degree programs are relatively new, generally only a few years old. Thus, not all universities have literal data science degrees or departments but instead require that you enroll in a related program like computer science, statistics, or engineering to learn data science. This does not always mean these other programs are bad or unhelpful, but it often means you will have to perform tasks that are extraneous or semi-extraneous to data science proper in order to complete your degree (in some cases with minimal help from faculty from other fields).

When considering a program, you should make sure they are proactive about teaching professional and not just academic data science skillsets. These are the specific questions I would research to assess how well they might prepare you for non-academic data science jobs:

1) What proportion of their faculty currently work or at least have worked in the industry as a data scientist (or other similar job title)?

2) How well connected is the department with local organizations, and might they be able to leverage these relationships to help you work with these organizations through a work-study program or internship during the program and/or employment afterwards?

3) Will they help you build, or at least give you the flexibility to build, your thesis into an applied data science project that would boost your resume for future employers?

If your chosen program lacks these, I would strongly recommend building resume/portfolio-boosting projects and networking with local data scientists on the side while completing the program. This takes considerable time and energy, so ideally your department would actively help you in this work, instead of requiring that you do it on your own while also completing all their work.

Funding options are something else to consider. Are they willing to fund your degree fully or at least partly? Work-study programs where you work while getting your master’s can be a great way to graduate with no debt and gain resume-building work experience (although they can make you busy). I benefitted greatly from working as a data scientist while completing my master’s, both because I graduated with no debt and because it allowed me to practice and refine my skills.

Finally, most universities require that you live nearby and attend physically (at least before and likely after the pandemic). Thus, you might have to find a place near you or be willing to relocate for a few years if there is not a data science degree program nearby. If so, you should factor moving expenses into the cost of doing the program.

Conclusion

Learning data science can be an awesome yet daunting prospect, and finding the right strategy for you is complicated, particularly given all the pedagogical, logistical, and financial considerations. Hopefully, this article has helped you think through how to journey forward. 

Photo credit #1: geralt at https://pixabay.com/photos/woman-programming-glasses-reflect-3597101/  

Photo credit #2: Anastase Maragos at  https://unsplash.com/photos/OaFESrP2hhw

Photo credit #3: mohamed_hassan at https://pixabay.com/photos/training-course-3207841/

Photo credit #4: Jukan Tateisi at https://unsplash.com/photos/bJhT_8nbUA0

Photo credit #5: heylagostechie at  https://unsplash.com/photos/IgUR1iX0mqM   

Photo credit #6: Brooke Cagle at https://unsplash.com/photos/WHWYBmtn3_0

Photo credit #7: A_Ginard at https://pixabay.com/photos/architecture-modern-buildings-5084075/

The Best Programming Languages for Data Science and Machine Learning

Newcomers to data science or artificial intelligence frequently ask me which programming language is best to learn for building machine learning algorithms. Thus, I wrote this article as a reference for anyone who wants to know the answer to that question. These are what I consider the four most important languages, ranked in terms of usefulness based on both overall popularity within the data science community and my own personal experiences:

Best Programming Languages for Machine Learning:
#1 Choice: Python
#2 Choice: R
#3 Choice: Java
#4 Choice: C/C++

#1 Programming Language: Python

Python is the most popular language to use for machine learning and for three good reasons.

First, its package-based style allows you to utilize efficient machine learning and statistical packages that others have made, preventing you from having to constantly reinvent the wheel for common problems. Many if not most of the best packages (like NumPy, pandas, scikit-learn, etc.) are in Python. This almost allows you to “cheat” when programming machine learning algorithms.

Second, Python is a powerful and flexible all-purpose language, so if you are building a machine learning algorithm to do something, then you can easily build the code for the overall product or system in which you will use the algorithm without having to switch languages or software. It supports object-oriented, functional, and procedural programming styles, allowing you to use whatever style or combination of styles you like best or that fits the specific context.
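
As a minimal sketch of that stylistic flexibility, here is the same small task (averaging a list of scores) written three ways. The data and the names `average_procedural`, `average_functional`, and `ScoreBook` are invented for illustration.

```python
from functools import reduce

scores = [72, 88, 95, 61]

# Procedural style: an explicit loop with a running total.
def average_procedural(values):
    total = 0
    for value in values:
        total += value
    return total / len(values)

# Functional style: fold the list with reduce, with no mutable state.
def average_functional(values):
    return reduce(lambda acc, value: acc + value, values, 0) / len(values)

# Object-oriented style: the data and its behavior bundled in a class.
class ScoreBook:
    def __init__(self, values):
        self.values = list(values)

    def average(self):
        return sum(self.values) / len(self.values)

print(average_procedural(scores))
print(average_functional(scores))
print(ScoreBook(scores).average())
```

All three produce the same result, so the choice between them comes down to what reads best in the surrounding code.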

Third, unlike a language like Java or C++, Python does not require elaborate setup to program a single line of code. Even though you can easily build the coding infrastructure if you need to, if you only need to run a simple command or test, you can start immediately.
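
For instance, a quick check like the one below is a complete Python program as written. This is only a sketch with made-up temperature readings, using the standard library `statistics` module so that nothing needs to be installed.

```python
from statistics import mean, stdev

# A quick, throwaway check: no class, main method, or build step required.
temperatures = [21.5, 22.0, 19.5, 23.0]
print("mean:", mean(temperatures))
print("sample standard deviation:", round(stdev(temperatures), 2))
```

You can paste these lines into an interactive Python session or a notebook cell and see the results right away.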

When I program in Python, I personally love using Jupyter Notebook, since its interface allows me both to code and to easily show my code and findings as a report or document. Another data scientist can read and analyze my code and its output at the same time. Because of this, I personally wish more data scientists published their papers and reports in Jupyter Notebook or other notebooks like it.

If you have time to learn a single programming language for machine learning, I would strongly recommend it be Python. The next three languages, R, Java, and C++, do not match its ease and popularity within data science.

#2 Programming Language: R

R is a popular language for statisticians, a programming language that is specifically tailored for advanced statistical analysis. It includes many well-developed packages for machine learning but is not as popular with data scientists as Python. For example, in Towards Data Science’s survey, 57% of data scientists reported using Python, with 33% prioritizing it, and only 31% reported using R, with 17% prioritizing it. This seems to show that R is a complementary, not primary, language for data science and machine learning. Most R packages have their equivalent in Python (and to some extent the other way around). Unlike Python, which is an all-purpose language able to do much more than analyze data and develop machine learning algorithms, R is specifically tailored to statistics and data analysis and is not able to do much beyond that. That said, R programmers are developing more and more packages for it, steadily expanding what it can do.

#3 Programming Language: Java

Java was once the most popular language around, but Python has dethroned it in the last few years. As an avid Java programmer who programs in Java for fun, it breaks my heart to put it so far down the list, but Python is clearly a better language for data science and machine learning. If you are working in an organization or other context that still uses Java for part or all of its software infrastructure, then you may be stuck using it, but most recent developments, particularly in machine learning, have occurred in Python and in R (and a few other languages). Thus, if you use Java, you’ll frequently find yourself having to unnecessarily reinvent the wheel.

Plus, one major con of Java is that conducting quick, on-the-go analysis is not practical, since one must write a whole coding structure before one can run a single line of code. Java can be popular in certain contexts where the surrounding applications/software that use the machine learning algorithms are in Java, which is common in finance, front-end development, and companies that have long been using Java-based software.

#4 Programming Languages: C/C++

The same Towards Data Science survey I mentioned above lists C/C++ as the second most popular data science and machine learning language after Python. Java follows them closely, yet I ranked Java third, ahead of C/C++, because I personally find Java to be a better overall language than C or C++. In C or C++, you may frequently find yourself reinventing the wheel, having to develop machine learning algorithms that others have already built in Python, but in some backend systems that have been built in C or C++, as in engineering and electronics, you do not have much of an option. C++ also shares a problem with Java: it lacks the ability to do quick, on-the-go coding without first building a whole infrastructure.

Conclusion

For a beginner to the data science scene, learning a single programming language is the most helpful way to enter the field. Use learning a programming language to assess whether data science is for you: if you struggle and do not like programming, then developing machine learning algorithms for a living is probably not a good fit for you.

Many groups are trying to develop software that enables machine learning without having to program: DataRobot, Auto-WEKA, RapidMiner, BigML, and AutoML, among many others. The pros and cons and successes and failures of these tools warrant a separate blog post (one I intend to write eventually). As of now, though, they have not replaced programming languages either in the practical ability to develop complex machine learning algorithms or in demonstrating that you have the technical computational and programming skills for the field.

Depending on where you work or the type of field or tasks you are doing, you might end up using the language(s) or software your team works with so that you can easily collaborate on projects with them. Some areas of work or tasks might favor certain packages and languages. If you demonstrate that you already know a complex programming language like Python (or Java or C++), even if that is not the preferred language of their team, then you will likely demonstrate to any hiring manager that you can learn their specific language or software.

Photo credit #1: ThisIsEngineering at https://www.pexels.com/photo/woman-coding-on-computer-3861958/

Photo credit #2: Hitesh Choudhary at https://unsplash.com/photos/D9Zow2REm8U

Photo credit #3: thekirbster at https://www.flickr.com/photos/kirbyurner/30491542972/in/photolist-MQRUEh-2g3E1wf-Nsr8q9-HDKJxu-22VkHJU-2bWRXY2/lightbox/ (Yes, even though it is cool looking, this is not my code.)

Photo credit #4: Steinar Engeland at https://unsplash.com/photos/WDf1tEzQ_SY

Photo credit #5: Markus Spiske at https://unsplash.com/photos/jUWw_NEXjDw

The Five Most Common Data Science Interview Questions and How to Prepare for Them

Interviewing for a data science role can be a daunting task, especially for those new to the field. I have lost count of the number of data science interviews I have had over the years, but here are the five most common questions I have encountered and strategies for preparing for each. Prepping for these questions is a great opportunity to develop your story thesis, the most important part of any data science interview.

Most Common Data Science Questions:
1) Tell me about yourself.
2) Describe a data science project you have worked on.
3) What kind of experience do you have with messy data?
4) What programming languages and/or software have you used?
5) What are you looking for in a job?

Question 1: Tell me about yourself.

This is probably hands down the most common interview question across all industries and fields, not just data science, so the fact that it is one of the most commonly asked questions in data science interviews may not seem that surprising. A good answer is crucial to establish a favorable first impression and to lay out your main story or thesis of who you are, which you will come back to throughout the interview.

In data science interviews, I emphasize my passion for using data science tools to help organizations solve complex problems that were previously vexing. If you are unsure what your thesis is, I designed this activity to help people decipher it. Here is an example of how I would describe myself:

“I fell in love with data science because I enjoy helping organizations solve complex problems. In my past roles, I have used my combined data science and social science skills to explore and build solutions for complicated problems for which the typical ways of doing things within the organization have not worked. I am energized by the intellectual stimulation of breaking down complex problems and using data science to develop innovative yet useful solutions. What kind of problems do you guys have that have led you to look for a data scientist like me?”

Your self-description should tell the story of who you are in a way that demonstrates how you would be a natural fit for the role and helpful to the organization. As your interview thesis, if you laid it out well, then every other question you answer will simply involve fleshing out one (or a combination) of those three basic parts of your self-story: 1) Who you are, 2) How your identity makes you a natural fit for the role, and 3) How this would benefit the organization.

Here are four other important observations to note about how I told my story:

  1. I emphasized who I was – an innovator developing unique solutions to complex problems – while showing my innovator identity naturally connects with data science and could be helpful for the organization. You might not consider yourself an “innovator” per se, but the trick is to figure out who you are based on what energizes and impassions you and then show how performing the data science role you are applying for is a natural fit for who you are.
  2. I told the story with normal words, not technical jargon. I have found that many, if not most, of my interviews, especially the first-round interviews, are with employees without technical expertise, and since you often do not know the level of technical expertise of the interviewer, it is better to err on the non-technical side.  
  3. I kept my story positive, only mentioning what I like to do. Sometimes people instinctively try to illustrate what they want by describing things they do not like to do: e.g. “At my last job, I learned I do not like doing Y, so I am seeking to do X instead” or “I am doing Y, and I hate it. I want out.” I would describe these aspects of my story later if the interviewer asks, but I would stick with the positive at first: only mentioning what I want to do.
  4. I used strong, subjective, even emotional phrases like “fell in love with,” “passionate about,” and “energized by.” At first glance, these phrases might seem overly informal, but I have found they help interviewers remember me. Do not overdo it, but being more vivid and personable generally helps rather than hurts your interview chances for data science positions.

Question 2: Describe a data science project you have worked on.

This is the second most common question I encountered, so make sure you come prepared with an exemplar project to showcase. They may ask you a lot of questions about your project, so I would recommend choosing a project on which you did an amazing job, one where you really knocked it out of the park and that you are proud of. Unless there are disclosure issues, post your work on GitHub, a blog, LinkedIn, or somewhere else online, and include a link to it in your job application.

How to explain the project will vary considerably depending on your interviewer’s degree of expertise. I generally start with a non-technical, high level explanation and provide the technical details if the interviewer(s) prompts me to with follow-up questions. This gives the interviewers the freedom to choose the level of technical expertise they would like in their follow-up. A data scientist interviewer worth his or her salt will quickly steer the conversation into more technical aspects of your project that he or she wants to learn more about, but even then, starting non-technical demonstrates that you know how to effectively communicate your work to non-technical audiences as well.

When describing your project, you are effectively telling the story of the project, and most project stories have the following core components:

  • Who: You are probably the story’s protagonist (it is your interview after all, so naturally pick a project or part of a project where you were the primary driver), but there are likely multiple important side characters that you will need to set up, like who commissioned the project, who it was for, who the data was about, and so on.
  • What: The problem, need, or question your project sought to address generally forms the “conflict” of the project story, so be sure to explain what led to the problem, need, or question (in stories, called the inciting incident).
  • When and Where: The timeframe and the setting or context in which the project took place (e.g., the organization you were working with or a class for which you did the project). How long you had to complete the project can also be important to establish.
  • How: What you did to solve the problem. If you tried a lot of approaches before discovering what works, the how includes both your methodological story and your final solution (that is part of the rising and falling action for how you overcame the problem). This is the meat of your story. You will want technical and non-technical descriptions of the how:
    • Technical How: Generally, the core two parts of a technical description are the model you used (and any you tried if applicable) and how you determined the features/variables you selected. Another important part might be how you cleaned and/or gleaned the data. 
    • Non-technical How: I have found that non-technical audiences usually do not glean much from either the model I ended up using or my feature selection procedure. Instead, I explain what type of functionality I ensured the model had to solve the problem I had just set up: for example, “I built a model that calculated the probability of X phenomenon based on data sources A, B, and C, testing various types of models to determine which would do this best, and then discerned which variables among those datasets were the best to use.” For a non-technical audience, that is generally enough. The core components for them are what goes into the model (the data), what result the model produced from it, and how that informed the problem, need, or question driving the project.

Finally, in your how explanation, make sure you slip in whatever programming languages and software you used: Python, R, SQL, Azure, etc.

  • Why: This is your explanation of why you chose the approach(es) you did for your how. Just like with the how, you will need both a technical and a non-technical explanation of the why.

Make sure your non-technical explanation of why aligns with your non-technical how. I commonly see data scientists make the mistake of going over a non-technical individual’s head by trying to provide a technical why explanation for their non-technical how. In particular, I would not explain the metric or criteria I used to compare models or decide the feature selection procedure in my non-technical explanation, since these will likely lose a non-technical person. If my non-technical how description focused on what data the model used and what it did with it, then my non-technical why focuses on why building a model to do that mattered and how it helped others and/or myself in the real world.

  • What happened: This is the result of the project. Did you succeed or fail (or somewhere in between)? Was it useful for whoever you built it for? Were you able to conduct any follow-up analysis after deployment? Maybe most importantly, what did you learn from the experience? In narrative terminology, this is the resolution. The more you can quantitatively measure any outcomes the better. 

These are the basic components of a project story. Here is the most common project I use, and when reading through it, feel free to analyze how I present each component of the story. I wrote this blog for a general audience, so I provided my non-technical how and why.

Question 3: What kind of experience do you have with messy data?

Interviewers ask me this question surprisingly frequently. They usually preface the question by explaining that their organization has a lot of messy data that their future data scientist would need to clean and process. This is a great opportunity to showcase your comfort with data science and data science issues.

I typically answer something like this:

“Yes, I have had to organize and clean messy data all the time. That’s par for the course in data science: the running joke among data scientists is that 90% of any data science project is data cleaning, and 10% actually doing anything with it. At least you guys are honest about the fact that your data is messy. When I worked as a consultant, for example, I talked with many organizations about potential data science projects, and if they said their data was clean and ready to go, chances are they were lying either to themselves or to me about how messy and haphazard their data really was. The fact that you are upfront about the messiness of your data tells me that you guys as an organization are realistically assessing where you are and what you need.”

This answer not only establishes that I have handled messy data before but also normalizes the problem in the field as resolvable by an expert (like myself) and compliments them for being up front. Answering this question confidently and positively has put me at the top of the list as the front-runner candidate in some interviews. Giving a good answer to it is a perfect opportunity to endear yourself to your interviewer.

Question 4: What programming languages and/or software have you used?

Even though a technical interviewer might ask this as well, I have encountered this question most frequently among non-technical interviewers. In my experience, fellow data scientist interviewers have more insider ways of deciphering whether you do in fact know data science, but for non-technical interviewers, this question is their initial way to probe that. Sometimes, they will cling to a laundry list of software and/or languages to determine whether you are qualified.

Now, I believe that having experience using the exact combination of software that the data science team you would be joining uses is generally not that important a criterion for job success. For a good data scientist, learning another software system or programming language once you know dozens is not that difficult a task. But their question is completely natural and reasonable coming from their side, so you will have to answer it.

If they ask open-endedly what software and languages you have used, list the ones you have used, maybe starting with the ones you use most often. I generally start by mentioning Python, since not only is it my favorite language for data science (see this article), but it also conveys that I am familiar with programming in general.

More often, though, they might ask whether you have used X software before, often asking whether you have used each software on a list they have in front of them. I would never recommend lying by claiming that you have experience with a software you have never used, but I would recommend recasting a “No” by providing an equivalent software to it that you have worked with. Here is an example:

“No, I have not used Julia, but that is because I prefer using Python for what others might use Julia for. Python is an equivalent high-functioning programming language in complexity, and the data science teams I worked on happened to prefer it over Julia.”

This not only conveys the “No” in a more positive light, but it also shows that you are familiar with the software he or she just mentioned and confident that you could use it to match your would-be team.

Question 5: What are you looking for in a job?

Most often, this is the last major question interviewers ask me, but I have gotten it at the beginning as well. They probably save it until the end, because the question transitions very easily to the next part of the interview: either them describing the role or you providing any questions you have.

If you did a good job laying out your thesis story in the first question, then here you simply restate it from a different angle. You already laid the groundwork, and you are just bringing it home at this point. If they ask me this at the beginning of the interview before the “Tell me about yourself” question, then I use this question to retell my thesis story from this new angle.

Here is my typical answer:

“Like I said, I am energized by figuring out how to help organizations solve complex data science problems. Over the years, I have found that two concrete things in an organization help me with this. First, I thrive in stimulating work environments where I am given the space and resources to think creatively through problems. Second, I also need to be able to work with people from a variety of backgrounds and disciplines from whom I can learn and with whom I can develop innovative approaches to the problem at hand. You guys seem to provide both. [I then conclude by explaining why they seem to provide both based on what I learned about the organization during the interview, or if we have not had a chance to talk about them yet, I ask about these within the organization.]”

Notice that the first sentence references my self-explanation answer to the “Tell me about yourself” question. If they ask this question before I have given that spiel, I spend about 30 seconds to a minute providing a condensed self-introduction and then continue with the rest of the answer.

Conclusion

These are the five most common data science interview questions I have encountered and how to prepare for them. I have found that when data scientists give advice on how to prepare for job interviews, they often focus on preparing for highly technical, factual questions (e.g., here and here). Even though having a solid data science foundation can be important, refining your overall story thesis – who you are, what you are passionate about doing, and how that relates to this job – is far more important to advance through the interview process.

I have found that humans, even supposedly “nerdy” data scientists, tend to connect with people and stories, so if you can hook them there, they generally remember you better and are more likely to hire you. When you have a compelling story, every other question will naturally fall into place as an intuitive further clarification of that overall story.  

Photo Credit #1: Work With Island at https://unsplash.com/photos/FX2QA0TMEYg

Photo Credit #2: Free-Photos at https://pixabay.com/photos/glasses-reading-glasses-spectacles-1246611/

Photo Credit #3: geralt at https://pixabay.com/illustrations/questions-font-who-what-how-why-2245264/

Photo Credit #4: Darwin Vegher at https://unsplash.com/photos/W_ZYCEUapF0

Photo Credit #5: geralt at https://pixabay.com/illustrations/software-program-cd-dvd-disc-pack-417880/

Photo Credit #6: jenoliver777 at https://pixabay.com/photos/horses-dogs-groundwork-blaze-2888749/