Data Science Storytelling: Quantitative UX Research in Google Cloud with Randy Au (Part 2 of 2)

In this second part of my interview with Randy Au, he discusses the techniques he used to teach himself to code and his approach to programming and data science as a social scientist.

Here is Part 1 of our interview.

Prior to joining Google, he spent a decade as a mixture of a data analyst, data scientist, and data engineer at various startups in New York City and before that, studied Communications. In his newsletter, he discusses data science topics like data collection and data quality from a social science perspective. Outside of work he often engages in far too many hobbies, taken to absurd lengths.

Click here to learn more about the Interview Series this is a part of.


Data Science Storytelling: Quantitative UX Research in Google Cloud with Randy Au (Part 1 of 2)

Randy Au, a Quantitative UX Researcher at Google, explains how he leverages his backgrounds in communication, statistics, and programming as a quantitative UX researcher in Google Cloud to analyze and improve Cloud Storage products.

Here is Part 2 of our interview.

Prior to joining Google, he spent a decade as a mixture of a data analyst, data scientist, and data engineer at various startups in New York City and before that, studied Communications. In his newsletter, he discusses data science topics like data collection and data quality from a social science perspective. Outside of work he often engages in far too many hobbies, taken to absurd lengths.

Click here to learn more about the Interview Series.


Applying Computational Ethnography and Statistics to Vapor Wave: Interview with Tanner Greene (Part 2 of 2)

Here is the second of two parts of my conversation with Tanner Greene. He discusses his strategies for transitioning from graduate school to UX research and his recommendations for fellow students seeking to do the same.

Here is Part 1 of our interview.

Tanner Greene is a UX Researcher and Ph.D. Candidate at the University of Virginia, where he’s finishing a dissertation on the history of vaporwave, a music genre created on social media platforms. Tanner’s interests straddle math and the humanities, spanning digital cultures, user metadata, and a long-dormant statistics ability he wants to revive. In his spare time, Tanner enjoys writing about music, playing video games, and dreaming about learning SQL.

Resources We Referenced:

For more context on my interview series in general, click here.

Applying Computational Ethnography and Statistics to Vapor Wave: Interview with Tanner Greene (Part 1 of 2)

For the next installment in my Interview Series, I interviewed Tanner Greene. He recently received his doctorate from the University of Virginia for his research on the digital music genre vapor wave. He primarily used qualitative methods but also taught himself Python in order to incorporate quantitative textual analysis into his project. His work is a good example of how to integrate qualitative digital ethnographic techniques with quantitative natural language processing.

In this first part, he discusses why he decided to study the vapor wave community and his experiences learning Python for statistical analysis.

Here is Part 2 of our interview.

Tanner’s interests straddle math and the humanities, spanning digital cultures, user metadata, and a long-dormant statistics ability he wants to revive. In his spare time, Tanner enjoys writing about music, playing video games, and dreaming about learning SQL.

Resources We Referenced:

For more context on my interview series in general, click here.

The Best Programming Languages for Data Science and Machine Learning


Newcomers to data science or artificial intelligence frequently ask me which programming language is best to learn for building machine learning algorithms. I wrote this article as a reference for anyone who wants an answer to that question. These are what I consider the four most important languages, ranked in terms of usefulness based on both overall popularity within the data science community and my own personal experiences:

Best Programming Languages for Machine Learning:
#1 Choice: Python
#2 Choice: R
#3 Choice: Java
#4 Choice: C/C++

#1 Programming Language: Python

Python is the most popular language to use for machine learning and for three good reasons.

First, its package-based style allows you to use efficient machine learning and statistical packages that others have made, so you do not have to constantly reinvent the wheel for common problems. Many if not most of the best packages (like NumPy, pandas, scikit-learn, etc.) are in Python. This almost allows you to “cheat” when programming machine learning algorithms.
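To make the "reinventing the wheel" point concrete, here is a minimal sketch that computes a sample standard deviation twice: once by hand and once with an existing library call. It uses Python's built-in `statistics` module as a stand-in so that it runs without installing anything; packages like NumPy or pandas would play the same role in a real project. The `scores` data and the helper name `sample_stdev_by_hand` are made up for illustration.

```python
import math
import statistics

# Made-up sample data, purely for illustration.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


def sample_stdev_by_hand(values):
    """Sample standard deviation written from scratch (hypothetical helper)."""
    mean = sum(values) / len(values)
    squared_diffs = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_diffs) / (len(values) - 1))


# Reinventing the wheel: several lines of code to write, test, and maintain.
by_hand = sample_stdev_by_hand(scores)

# Reusing a library: one call that someone else already wrote and tested.
from_library = statistics.stdev(scores)

# Both approaches agree (up to floating-point rounding).
print(math.isclose(by_hand, from_library))
```

The hand-written version is not wrong, but every such helper is code you would have to verify yourself, which is the cost that well-tested packages remove.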

Second, Python is a powerful and flexible all-purpose language, so if you are building a machine learning algorithm, you can easily write the code for the larger product or system that will use the algorithm without having to switch languages or software. It supports object-oriented, functional, and procedural programming styles, so you can use whichever style, or combination of styles, you like best or that fits the specific context.
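As a rough illustration of that flexibility, the sketch below solves one small task (summing the values above a threshold) in procedural, functional, and object-oriented styles. The data, the function names, and the `ThresholdSummer` class are all hypothetical and exist only for this example.

```python
from functools import reduce

# Made-up values, purely for illustration.
readings = [3, 8, 12, 5, 20]


# Procedural style: explicit loop and a running total.
def total_above_procedural(values, threshold):
    total = 0
    for v in values:
        if v > threshold:
            total += v
    return total


# Functional style: compose filter and reduce, with no mutable state.
def total_above_functional(values, threshold):
    kept = filter(lambda v: v > threshold, values)
    return reduce(lambda acc, v: acc + v, kept, 0)


# Object-oriented style: the threshold lives on an object as state.
class ThresholdSummer:
    def __init__(self, threshold):
        self.threshold = threshold

    def total(self, values):
        return sum(v for v in values if v > self.threshold)


results = (
    total_above_procedural(readings, 6),
    total_above_functional(readings, 6),
    ThresholdSummer(6).total(readings),
)
print(results)  # all three styles give the same answer
```

Which style reads best depends on context: the loop is easiest for beginners, the functional version composes well in pipelines, and the class is handy when the same configuration is reused across many calls.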

Third, unlike a language like Java or C++, Python does not require elaborate setup to program a single line of code. Even though you can easily build the coding infrastructure if you need to, if you only need to run a simple command or test, you can start immediately.

When I program in Python, I personally love using Jupyter Notebook, since its interface allows me both to code and to easily present my code and findings as a report or document. Another data scientist can read and analyze my code and its output side by side. Because of this, I wish more data scientists published their papers and reports in Jupyter Notebook or similar notebooks.

If you have time to learn a single programming language for machine learning, I would strongly recommend it be Python. The next three languages, R, Java, and C++, do not match its ease and popularity within data science.

#2 Programming Language: R

R is a popular language among statisticians, specifically tailored for advanced statistical analysis. It includes many well-developed packages for machine learning but is not as popular with data scientists as Python. For example, in Towards Data Science’s survey, 57% of data scientists reported using Python, with 33% prioritizing it, while only 31% reported using R, with 17% prioritizing it. This suggests that R is a complementary, not primary, language for data science and machine learning. Most R packages have an equivalent in Python (and to some extent the other way around). Unlike Python, an all-purpose language that can do much more than analyze data and develop machine learning algorithms, R is tailored to statistics and data analysis and cannot do much beyond that. That said, R programmers keep developing new packages that extend what it can do.


#3 Programming Language: Java

Java was once the most popular language around, but Python has dethroned it in the last few years. As an avid Java programmer who programs in Java for fun, it breaks my heart to put it so far down the list, but Python is clearly a better language for data science and machine learning. If you are working in an organization or other context that still uses Java for part or all of its software infrastructure, then you may be stuck using it, but most recent developments, particularly in machine learning, have occurred in Python and in R (and a few other languages). Thus, if you use Java, you’ll frequently find yourself reinventing the wheel unnecessarily.

Plus, one major con of Java is that quick, on-the-go analysis is not practical, since you must write a whole program structure before you can run a single line of code. Java can still be popular in contexts where the surrounding applications and software that use the machine learning algorithms are written in Java, which is common in finance, front-end development, and companies that have long used Java-based software.

#4 Programming Languages: C/C++

The same Towards Data Science survey I mentioned above lists C/C++ as the second most popular data science and machine learning language after Python. Java follows closely behind, yet I ranked Java third, ahead of C/C++, because I personally find Java to be a better overall language than C or C++. In C or C++, you may frequently find yourself reinventing the wheel (having to develop machine learning algorithms that others have already built in Python), but in some backend systems that have been built in C or C++, such as those in engineering and electronics, you do not have much of an option. C++ also shares a problem with Java: you cannot do quick, on-the-go coding without first building a whole infrastructure.

Conclusion

For a beginner to the data science scene, learning a single programming language is the most helpful way to enter the field. Use learning a programming language to assess whether data science is for you: if you struggle with programming and do not like it, then developing machine learning algorithms for a living is probably not a good fit for you.

Many groups are trying to develop software that enables machine learning without programming: DataRobot, Auto-WEKA, RapidMiner, BigML, and AutoML, among many others. The pros and cons and successes and failures of these tools warrant a separate blog post (one I intend to write eventually). As of now, though, they have not replaced programming languages, either in the practical ability to develop complex machine learning algorithms or in demonstrating that you have the technical programming skills for the field.

Depending on where you work and the type of tasks you do, you might end up using the language(s) or software your team works with so that you can easily collaborate on projects. Some areas of work or tasks may favor certain packages and languages. Still, if you demonstrate that you already know a complex programming language like Python (or Java or C++), even if it is not your prospective team’s preferred language, you will likely show any hiring manager that you can learn their specific language or software.

Photo credit #1: ThisIsEngineering at https://www.pexels.com/photo/woman-coding-on-computer-3861958/

Photo credit #2: Hitesh Choudhary at https://unsplash.com/photos/D9Zow2REm8U

Photo credit #3: thekirbster at https://www.flickr.com/photos/kirbyurner/30491542972/in/photolist-MQRUEh-2g3E1wf-Nsr8q9-HDKJxu-22VkHJU-2bWRXY2/lightbox/ (Yes, even though it is cool looking, this is not my code.)

Photo credit #4: Steinar Engeland at https://unsplash.com/photos/WDf1tEzQ_SY

Photo credit #5: Markus Spiske at https://unsplash.com/photos/jUWw_NEXjDw

The Stages of Learning a New Data or Programming Skill

Many people have admirably sought to learn data science, data analytics, a programming language, or some other data or programming skill in order to develop themselves professionally and/or seek a new career path. Excitingly, learning such skills has become significantly easier to do online. But this online learning can also foster unrealistic understandings of what learning one of these skills entails, since it can remove prospective learners from the physical community of experts who help introduce prospective learners to the expectations of that field.

The goal of this article is to help rectify that by explaining the basic steps typically needed to develop mastery of a new data or programming skill. This will hopefully set high-level expectations for what learning the skill would entail and also help you choose the right course or set of courses to ensure you complete all three stages.

By data skill, I mean any data field like data science, data analytics, or data engineering, or any specific skill or practice within a data field that someone might seek to learn, and by programming skill I mean the skills necessary to learn and code in a programming language.

These are the three basic learning stages to master any of these topics:

Stage 1: Grasp the basic concepts of the topic
Stage 2: Complete a guided project
Stage 3: Complete a self-directed project

Stage 1: Grasping Basic Concepts

Grasping basic concepts entails learning the relevant vocabulary, syntax, and key approaches. Often programs teach each concept distinctly, one at a time. For example, when learning a new programming language, you might learn the major commands and syntax rules, and for data science, you might learn about each of the most prominent machine learning models one at a time.

This is different from applying the concepts widely, and at this stage, you may not be able to handle mixing all the concepts together in a complex problem yet (that’s Stage 2). Programs often teach the material at this point sequentially (even though that can be difficult for nonlinear learners).

For example, W3Schools provides grounded Stage 1 teaching for most programming languages and data science skills. They provide sequential exercises working through the basic syntax components of a new language, ever so slightly increasing in complexity along the way.

Completing only the first stage, however, does not amount to full mastery of the topic. After practicing each piece one at a time, you must also transition into Stage 2, where you start to learn how to combine them to solve a more complex problem.

Stage 2: Guided Project

Here you practice putting all the pieces together through one or more guided projects. A guided project is a model for how the components fit together in an actual project. I liken these to building a Lego kit: following step-by-step instructions to build a cool model (instead of building your own object from scratch, which is Stage 3). The instructions hold your hand through the project's completion to illustrate what putting all of the isolated skills and concepts together in a complicated project would entail.

Stage 3: Independent Project

In the third stage, you bring everything you have learned together to complete a project on your own. Unlike in Stage 2, when your instructors held your hand, you now have the freedom to struggle, which is necessary to learn. You are developing the skills involved in forming and carrying out a project on your own.

At the same time, you are learning what it looks like to implement those skills “in the wild” of a real-life project. In the previous stages, instructors often coddle their students, providing cleaned, ready-to-use example problems like those you might find in a textbook, which is necessary for learning the basic concepts. Like a Lego kit, the components of the project have been groomed to make what you are producing. In Stage 3, you often start to experience the types of messiness common in real-world projects, when you have to find the pieces you need and/or figure out how to make do with the ones you have.

For example, among data science learners, this stage is when students first learn to deal with the complexities of finding the right data for their problem; determining the best questions for a given dataset; and/or cleaning inconsistent data. Beforehand, most examples probably had already cleaned data that matched the specific task they were built for.
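To give a flavor of what "cleaning inconsistent data" can mean in practice, here is a small sketch using only Python's standard library. The rows, the field names (`city`, `signups`), and the helper `clean_row` are invented for this example; real datasets are messier and would usually be handled with a package like pandas.

```python
# Made-up raw records with the kinds of inconsistencies real data often has:
# stray whitespace, mixed capitalization, thousands separators, missing values.
raw_rows = [
    {"city": " New York ", "signups": "1,200"},
    {"city": "new york", "signups": "300"},
    {"city": "Boston", "signups": "n/a"},
    {"city": "BOSTON", "signups": "45"},
]


def clean_row(row):
    """Normalize one record (hypothetical helper for this sketch)."""
    city = row["city"].strip().title()
    digits = row["signups"].replace(",", "")
    # Treat anything that is not a plain whole number as missing.
    signups = int(digits) if digits.isdigit() else None
    return {"city": city, "signups": signups}


cleaned = [clean_row(row) for row in raw_rows]

# Aggregate signups per city, skipping rows with missing values.
totals = {}
for row in cleaned:
    if row["signups"] is not None:
        totals[row["city"]] = totals.get(row["city"], 0) + row["signups"]

print(totals)
```

Textbook examples typically start from something like the `cleaned` list; a self-directed project usually starts from something like `raw_rows`, and deciding how to treat a value like "n/a" is part of the work.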

A certain amount of trial by fire is often needed to learn how to develop your own project. Your instructor(s) might take a little more of a backseat role during this process, looking over what you have done, answering any questions you might have, and nudging you when necessary. In my experience, exploring strategies yourself is the best way to learn Stage 3. Hopefully, at the end of it all, you will produce a nifty project that you can show prospective employers or whoever else you might wish to impress.

Conclusion

These are the three most common stages to develop initial mastery of a new data or programming skill or field. Now, they are the skill levels generally necessary to learn the new skill, but there are plenty of further levels of learning after you complete these. For example, grasping basic data science concepts, completing a guided project, and learning how to conduct your own self-directed data science project would be enough to make you a new inductee into the data science community, but you would still be a newbie data scientist. It is only the tip of the iceberg for what you can learn and how you would grow as a data scientist.

Now, despite calling them stages, not everyone learns them in sequential order, especially given the variety of extenuating circumstances and learning styles. For example, some might complete all three stages for a specific subset of skills in the field they are learning, and then go back to Stage 1 for another subset. Most education programs will include all three stages, more or less in order.

Some education programs, however, might completely lack one or two stages or provide insufficient resources for them. Assessing whether a program adequately includes all three can be an effective way to determine how good it is at teaching and whether it is worth your money and/or time. When choosing to learn a new skill, I would recommend a program or combination of programs that includes all three. If a program you want to do or are currently completing lacks one or two of these stages, you can try to find another (hopefully free) way to complete that stage yourself online. For example, online courses and tutorials very frequently fail to provide Stage 3 (and in some cases, Stage 2), so after you complete one, I would recommend finding a project to work on.

Finally, when you encounter difficulty learning, it might be because you need to go back to a previous stage. For example, when many learners move to Stage 2, they must periodically swing back into Stage 1 to review a few core concepts when they see those concepts applied in a new way. Similarly, when completing a project in Stage 3, there is nothing wrong with reviewing Stage 2 or even Stage 1 materials.

Be careful, though, because it is easy to misdiagnose the problem. Learning anything can be frustrating. Sometimes your difficulties are not rooted in a need to review or relearn past material; you simply need to push through the new material until you start to get it. In those cases, some students retreat to material in which they feel safe and confident instead of challenging themselves. Even then, however, like rocking a car by shifting into reverse and then into drive to get over a bump, briefly going backwards can help launch you forward over the hurdle. What matters most is to know yourself (your learning tendencies and how you typically respond) and to check in as much as you can with instructors and/or experts in the field who have been there and done that, so they can help you determine the best ways to overcome whatever challenge you are having.

Photo credit #1: Jukan Tateisi at https://unsplash.com/photos/bJhT_8nbUA0

Photo credit #2: qimono at https://pixabay.com/illustrations/cog-wheels-gear-wheel-machine-2125178/

Photo credit #3: Bonneval Sebastien at https://unsplash.com/photos/lG-6_ox_UXE

Photo credit #4: Holly Mandarich at https://unsplash.com/photos/UVyOfX3v0Ls

Photo credit #5: George Bakos at https://unsplash.com/photos/VDAzcZyjun8