For Beginners to Data Science and Machine Learning

Data Visualization 101: The Most Important Rule for Developing a Graph

I suspect everyone has seen a bad graph, a mess of bars, lines, pie slices, or what have you that you dreaded having to look at. Maybe you have even made one, which you look at today and wonder what on earth you were thinking.

These graphs violate the most basic graph-making rule in data visualization:

A graph is like a sentence, expressing one idea.

This rule applies to all uses of graphs, whether you are a data scientist, data analyst, statistician, or just making graphs for your friends for fun.

In grade school, your grammar teachers likely explained that a sentence, at its most basic, expresses on thought or idea. Graphs are visual sentences: they should state one and only one thought or idea about the data.

When you look at a graph, you should be able to say, in one sentence, what the graph is saying: such as “Group A is greater than Group B,” or “Y at first improved but is now declining.” If you cannot, then you have yourself a run-on graph.

For example, the above graph is trying to say too many statements: trying to depict the immigration patterns of twenty-two different countries over the course of nearly a century. There are likely useful statements in this data, but the representation as one graph prevents a viewer/reader from being able to easily decipher them.

Likewise, this graph shows way too many lens sizes to meaningfully express a single, coherent idea, leaving the reader/viewer struggling to determine which fields to focus on.

Potential Objection #1: But I have more to say about the data than a single statement.

Great! Then provide more than one graph. Say everything you need to say about the data; just use one graph for each of your statements.

Don’t fall into the One-Graph-to-Rule-Them-All Fallacy: trying to use one graph to express all your statements about the data that ends up a visual mess of incomprehensibility. Create multiple easy-to-read graphs where each graph demonstrates one of your points at a time. Condensing everything into one graph just prevents your viewers from determining what you have to say at all.

Bar Chart, Chart, Statistics, Analytics, Data Analytics — One-Graph-to-Rule-Them-All Fallacy: Trying to use one graph to express all your thoughts about the data that ends up a visual mess of incomprehensibility

Statistics, Graph, Chart, Data, Information, Growth — Instead, use one graph for each of your points

Potential Objection #2: I want the viewers to interpret the findings for themselves, not just impart my own ideas/conclusions.

Fair point. When presenting/communicating data, there is a time for showing your own insights and a time to open-endedly display the information for your viewers/readers to interpret for themselves. Graphs are tools for the former, and for the latter, use tables. Tables, among other potential uses, convey a wide scope of information for the reader/viewer to interpret on their own.

Remember that first example above about U.S. immigration from various parts of Europe? A table (see below) would convey that information much more easily and allow readers to track whatever places, patterns, or questions they would to learn about. Are you in a situation where you would like to report a large amount of information that your readers can use for their own purposes? Then tables are a much better starting point than graphs.

Some situations require that I lean towards sharing my insights/analysis and others towards encouraging my readers/viewers to form their own conclusions, but since most situations require a combination of the two, I generally combine graphs and tables. I try, when I can, to put smaller tables in the document or slides themselves and, when I cannot, include full tables in an Appendix.

Potential Objection #3: My main idea/point has multiple subpoints.

Many sentences have multiple subpoints needed to express the single idea as well, which does not prevent the sentence structure from meaningfully capturing those ideas. The fancy grammar word for such a subpoint is a claus. Even though some sentences are simple and straightforward with only one subject and predicate, many (like this very sentence) require multiple sets of subjects and predicates to express its thought.

Likewise, some graphical ideas require multiple subordinate or compounded subpoints, and there are types of graphs that allow this. Consider Joint Plots, like the one below. To present the relationships and combined distribution between the two variables adequately, they also display each variable’s individual distributions above and to the right. That way, the viewer can see how both distributions might be influencing the combined distribution. Thus, it displays each variable’s distribution on the side like a subordinate clause.

The darker colors in this graph signify a higher density of data points, showing the combined joint distribution of the variables.

These are advanced graphs to make, since like with multi-part sentences, one must present the subpoints carefully to make clear what the main point is. Multi-part sentences, likewise, require carefulness in how to organize multiple clauses cohesively. I intend to write a post later describing how to develop these multi-part graphs in more detail.

The general rule still applies for these more complicated graphs:

Can you summarize what the graph is saying in one coherent sentence?

If you cannot, do not use/show that graph. Our brains are very good at intuiting whether a sentence carries one thought, so use this to determine whether your graph is effective.

Photo/Graph credit #1: kreatikar at https://pixabay.com/illustrations/statistics-graph-chart-data-3411473/

Photo/Graph credit #2: Linux Screenshots at https://www.flickr.com/photos/xmodulo/23635690633/

Photo/Graph credit #3: Andrew Guyton at https://www.flickr.com/photos/disavian/4435971394/

Photo/Graph credit #4: TymonOziemblewski at https://pixabay.com/illustrations/bar-chart-chart-statistics-1264756/

Photo/Graph credit #5 (the first graph again): kreatikar at https://pixabay.com/illustrations/statistics-graph-chart-data-3411473/

Photo/Graph credit #6: Michael Waskom provides a helpful tutorial that formed the inspiration behind the random graph I created.

Three Key Differences between Data Science and Statistics

Data science’s popularity has grown in the last few years, and many have confused it with its older, more familiar relative: statistics. As someone who has worked both as a data scientist and as a statistician, I frequently encounter such confusion. This post seeks to clarify some of the key differences between them.

Before I get into their differences, though, let’s define them. Statistics as a discipline refers to the mathematical processes of collecting, organizing, analyzing, and communicating data. Within statistics, I generally define “traditional” statistics as the the statistical processes taught in introductory statistics courses like basic descriptive statistics, hypothesis testing, confidence intervals, and so on: generally what people outside of statistics, especially in the business world, think of when they hear the word “statistics.”

Data science in its most broad sense is the multi-disciplinary science of organizing, processing, and analyzing computational data to solve problems. Although they are similar, data science differs from both statistics and “traditional” statistics:

Difference	Statistics	Data Science
#1	Field of Mathematics	Interdisciplinary
#2	Sampled Data	Comprehensive Data
#3	Confirming Hypothesis	Exploratory Hypotheses

Difference #1: Data Science Is More than a Field of Mathematics

Statistics is a field of mathematics; whereas, data science refers to more than just math. At its simplest, data science centers around the use of computational data to solve problems,[i] which means it includes the mathematics/statistics needed to break down the computational data but also the computer science and engineering thinking necessary to code those algorithms efficiently and effectively, and the business, policy, or other subject-specific “smarts” to develop strategic decision-making based on that analysis.

Thus, statistics forms a crucial component of data science, but data science includes more than just statistics. Statistics, as a field of mathematics, just includes the mathematical processes of analyzing and interpreting data; whereas, data science also includes the algorithmic problem-solving to do the analysis computationally and the art of utilizing that analysis to make decisions to meet the practical needs in the context. Statistics clearly forms a crucial part of the process of data science, but data science generally refers to the entire process of analyzing computational data. On a practical level, many data scientists do not come from a pure statistics background but from a computer science or engineering, leveraging their coding expertise to develop efficient algorithmic systems.

Difference #2: Comprehensive vs Sample Data

In statistical studies, researchers are often unable to analyze the entire population, that is the whole group they are analyzing, so instead they create a smaller, more manageable sample of individuals that they hope represents the population as a whole. Data science projects, however, often involves analyzing big, summative data, encapsulating the entire population.

The tools of traditional statistics work well for scientific studies, where one must go out and collect data on the topic in question. Because this is generally very expensive and time-consuming, researchers can only collect data on a subset of the wider population most of the time.

Recent developments in computation, including the ability to gather, store, transfer, and process greater computational data, have expanded the type of quantitative research now possible, and data science has developed to address these new types of research. Instead of gathering a carefully chosen sample of the population based on a heavily scrutinized set of variables, many data science projects require finding meaningful insights from the myriads of data already collected about the entire population.

Difference #3: Exploratory vs Confirming

Data scientists often seek to build models that do something with the data; whereas, statisticians through their analysis seek to learn something from the data. Data scientists thus often assess their machine learning models based on how effectively they perform a given task, like how well it optimizes a variable, determines the best course of action, correctly identifies features of an image, provides a good recommendation for the user, and so on. To do this, data scientists often compare the effectiveness or accuracy of the many models based on a chosen performance metric(s).

In traditional statistics, the questions often center around using data to understand the research topic based on the findings from a sample. Questions then center around what the sample can say about the wider population and how likely its results would represent or apply to that wider population.

In contrast, machine learning models generally do not seek to explain the research topic but to do something, which can lead to very different research strategy. Data scientists generally try to determine/produce the algorithm with the best performance (given whatever criteria they use to assess how a performance is “better”), testing many models in the process. Statisticians often employ a single model they think represents the context accurately and then draw conclusions based on it.

Thus, data science is often a form of exploratory analysis, experimenting with several models to determine the best one for a task, and statistics confirmatory analysis, seeking to confirm how reasonable it is to conclude a given hypothesis or hypotheses to be true for the wider population.

A lot of scientific research has been theory confirming: a scientist has a model or theory of the world; they design and conduct an experiment to assess this model; then use hypothesis testing to confirm or negate that model based on the results of the experiment. With changes in data availability and computing, the value of exploratory analysis, data mining, and using data to generate hypotheses has increased dramatically (Carmichael 126).

Data science as a discipline has been at the forefront of utilizing increased computing abilities to conduct exploratory work.

person holding gold-colored pocket watch

Conclusion

A data scientist friend of mine once quipped to me that data science simply is applied computational statistics (c.f. this). There is some truth in this: the mathematics of data science work falls within statistics, since it involves collecting, analyzing, and communicating data, and, with its emphasis and utilization of computational data, would definitely be a part of computational statistics. The mathematics of data science is also very clearly applied: geared towards solving practical problems/needs. Hence, data science and statistics interrelate.

They differ, however, both in their formal definitions and practical understandings. Modern computation and big data technologies have had a major influence on data science. Within statistics, computational statistics also seeks to leverage these resources, but what has become “traditional” statistics does not (yet) incorporate these. I suspect in the next few years or decades, developments in modern computing, data science, and computational statistics will reshape what people consider “traditional” or “standard” statistics to be a bit closer to the data science of today.

For more details, see the following useful resources:

Ian Carmichael’s and J.S. Marron’s “Data science vs. statistics: two cultures?” in the Japanese Journal of Statistics and Data Science: https://link.springer.com/article/10.1007/s42081-018-0009-3

“Data Scientists Versus Statisticians” at https://opendatascience.com/data-scientists-versus-statisticians/ and https://medium.com/odscjournal/data-scientists-versus-statisticians-8ea146b7a47f

“Differences between Data Science and Statistics” at https://www.educba.com/data-science-vs-statistics/

Photo credit #1: Andrea Piacquadio at https://www.pexels.com/photo/woman-draw-a-light-bulb-in-white-board-3758105/

Photo credit #2: Carlos Muza at https://unsplash.com/photos/hpjSkU2UYSU

Photo credit #3: Hans-Peter Gauster at https://unsplash.com/photos/3y1zF4hIPCg

Photo credit #4: Kendall Lane at https://unsplash.com/photos/yEDhhN5zP4o

[i] Carmichael 118.

Data Visualization 102: The Most Important Rules for Making Data Tables

In a previous post about data visualization in data science and statistics, I discussed what I consider the single most important rule of graphing data. In this post, I am following up to discuss the most important rules for making data tables. I will focus on data tables in reporting/communicating findings to others, as opposed to the many other uses of tables in data science say to store, organize, and mine data.

To summarize, graphs are like sentences, conveying one clear thought to the viewer/reader. Tables, on the other hand, can function more like paragraphs, conveying multiple sentences or thoughts to get an overall idea. Unlike graphs, which often provide one thought, tables can be more exploratory, providing information for the viewer/reader to analyze and draw his or her own conclusions from.

Table Rule #1: Don’t be afraid to provide as much or as little information as you need.

Paragraphs can use multiple sentences to convey a series of thoughts/statements, and tables are no different. One can convey multiple pieces of information that viewers/readers can look through and analyze at his or her own leisure, using the data to answer their own questions, so feel free to take up the space as you need. Several page long tables are fair game and, in many cases, absolutely necessary (although often end up in appendices for readers/viewers needing a more in-depth take).

In my previous data visualization post, I gave this bar chart as an example of trying to say too many statements for a graph:

This is a paragraphs-worth of information, and a table would represent it much better.[i] In a table, the reader/viewer can explore the table values by country and year themselves and answer whatever questions he or she might have. For example, if someone wanted to analyze how a specific country changed overtime, he or she could do so easily with a table, and/or if he or want to analyze compare the immigration ratios between countries of a specific decade, that is possible as well. In the graph above, each country’s subsegment starts in a different place vertically for each decade column, making it hard to compare the sizes visually, and since each decade has dozens of values, that the latter analysis is visually difficult to decipher as well.

But, at the same time, do not be afraid to convey a sentence- or graphs-worth of data into a table, especially when such data is central for what you are saying. Sometimes writers include one-sentence paragraphs when that single thought is crucial, and likewise, a single statement table can have a similar effect. For example, writing a table for a single variable does helps convey that that variable is important:

Gender	Some Crucial Result
Male	36%
Female	84%

Now, sometimes in these single statement instances, you might want to use a graph instead of a table (or both), which I discuss in more detail in Rule #3.

Table Rule #2: Keep columns consistent for easy scanning.

I have found that when viewers/readers scan tables, they generally subconsciously assume that all variables in a column are the same: same units and type of value. Changing values of a column between rows can throw off your viewer/reader when he or she looks at it. For example, consider this made-up study data:

	Control Group (n = 100)	Experimental Group (n = 100)
Mean Age	45	44
Median Age	43	42
Male No. (%)	45 (45%)	36 (36%)
Female No. (%)	55 (55%)	64 (64%)

In this table, the rows each mean different values and/or units. So, for example, going down the control column, the first column is mean age measured in years. The second column switches to median age, a different type of value than mean (although the same unit of years). The final two rows convey the number and percentages of males and females of each: both a different type of value and a different unit (number and percent unlike years). This can be jarring for viewers/readers who often expect columns to be of the same values and units and naturally compare them as if they are similar types of values.

I would recommend transposing it like this, so that the columns represent the similar variables and the rows the two groups:

	Mean Age	Median Age (IQR)	Male No. (%)	Female No. (%)
Control Group (n = 100)	45	43 (25, 65)	45 (45%)	55 (55%)
Experimental Group (n = 100)	44	42 (27, 63)	36 (36%)	64 (64%)

Table Rule #3: Don’t be afraid to also use a graph to convey magnitude, proportion, or scale

A table like the gender table in Rule #1 conveys pertinent information numerically, but numbers themselves do not visually show the difference between the values.

Gender	Some Crucial Result
Male	36%
Female	84%

Graphs excel at visually depicting the magnitude, proportion, and/or scale of data, so, if in this example, it is important to convey how much greater the “Some Crucial Result” is for females than males, then a basic bar graph allows the reader/viewer to see that the percent is more than double for the females than for the males.

Now, to convey this visual clarity, the graph loses the ability to precisely relate the exact numbers. For example, looking at only this graph, a reader/viewer might be unsure whether the males are at 36%, 37%, or 38%. People have developed many graphing strategies to deal with this (ranging from making the grid lines sharper, writing the exact numbers on top of, next to, or around the segment, among others), but combining the graph and table in instances where one both needs to convey the exact numbers and to convey a sense of their magnitude, proportion, or scale can also work well:

Finally, given that tables can convey multiple statements, feel free to use several graphs to depict the magnitude, proportion, or scale of one table. Do not try to overload a multi-statement table into a single, incomprehensible graph. Break down each statement you are trying to relate with that table and depict each separately in a single graph.

Conclusion

If graphs are sentences, then tables can function more like paragraphs, conveying a large amount of information that make more than one thought or statement. This gives space for your reader/viewer to explore the data and interpret it on their own to answer whatever questions they have.

Photo/Table credit #1: Mika Baumeister at https://unsplash.com/photos/Wpnoqo2plFA

Photo/Table credit #2: Linux Screenshots at https://www.flickr.com/photos/xmodulo/23635690633/

[i] Unfortunately, I do not have the data myself that this chart uses, or I would make a table for it to show what I mean.

The Best Programming Languages for Data Science and Machine Learning

Newcomers to data science or artificial intelligence frequently ask me the best programming language to learn to build machine learning algorithms. Thus, I wrote this article as a reference for anyone who wants to know the answer to that question. These are what I consider the three most important languages, ranked in terms of usefulness based on both overall popularity within the data science community and my own personal experiences:

Best Programming Languages for Machine Learning:

#1 Choice: Python

#2 Choice: R

#3 Choice: Java

#4 Choice: C/C++

#1 Programming Language: Python

Python is the most popular language to use for machine learning and for three good reasons.

First, it’s package-based style allows you to utilize efficient machine learning and statistical packages that others have made, preventing you from having to constantly reinvent the wheel for common problems. Many if not most of the best packages (like NumPy, pandas, scikit learn, etc.) are in Python. This almost allows you to “cheat” when programing machine learning algorithms.

Second, Python is a powerful and flexible all-purpose language, so if you are building a machine learning algorithm to do something, then you can easily build the code for the other overall product or system in which you will use the algorithm without having to switch languages or softwares. It supports object-oriented, functional, and procedure-oriented programming styles, giving the programmer flexibility in how to code, allowing you to use whatever style or combination of various styles you like best or fits the specific context.

Third, unlike a language like Java or C++, Python does not require elaborate setup to program a single line of code. Even though you can easily build the coding infrastructure if you need to, if you only need to run a simple command or test, you can start immediately.

When I program in Python, I personally love using Jupyter Notebook, since its interface allows me to both code and to easily show my code and findings as a report or document. Another data scientist can simultaneously read and analyze my code and its output at the same time. I personally wish more data scientists published their papers and reports in Jupyter Notebook or other notebooks like it because of this.

If you have time to learn a single programming language for machine learning, I would strongly recommend it be Python. The next three languages, R, Java, and C++, do not match its ease and popularity within data science.

#2 Programming Language: R

R is a popular language for statisticians, a programming language that is specifically tailored for advanced statistical analysis. It includes many well-developed packages for machine learning but is not as popular with data scientists as Python. For example, in Towards Data Science’s survey, 57% of data scientists reported using Python, with 33% prioritizing it, and only 31% reported using R, with 17% prioritizing it. This seems to show that R is a complementary, not primary language for data science and machine learning. Most R packages have their equivalent in Python (and to some extent the other way around). Unlike Python, which is an all-purpose language, able to do other wonders other than analyzing data and developing machine learning algorithms, R is specifically tailored to statistics and data analysis, not able to do much beyond that. Saying this, though, R programmers are increasingly developing more and more packages for it, allowing it to do more and more.

#3 Programming Language: Java

Java was once the most popular language around, but Python has dethroned it in the last few years. As an avid Java programmer who programs in Java for fun, it breaks my heart to put it so far down the list, but Python is clearly a better language for data science and machine learning. If you are working in an organization or other context that still uses Java for part or all of its software infrastructure, then you may be stuck using it, but most recent developments, particularly in machine learning, have occurred in Python and in R (and a few other languages). Thus, if you use Java, you’ll frequently find yourself having to unnecessarily reinventing the wheel.

Plus, one major con of Java is that conducting quick, on-the-go analysis is not possible, since one must write a whole coding system before one can do a single line of code. Java can be popular in certain contexts, where the surrounding applications/software that utilize the machine learning algorithms are in Java, common in finance, front-end development, and companies that have been using Java-based software.

#4 Programming Languages: C/C++

The same Towards Data Science survey I mentioned above lists C/C++ as the second most popular data science and machine learning language after Python. Java follows them closely, yet I included Java and not C/C++ as third because I personally find Java to be a better overall language than C or C++. In C or C++, you may frequently find yourself reinventing the wheel – having to develop machine learning algorithms that others have already built in Python – but in some backend systems that have been built C or C++ like in engineering and electronics, you do not have much of an option. C++ has a similar problem with Java as well: lacking the ability to do quick on-the-go coding without having to build a whole infrastructure.

Conclusion

For a beginner to the data science scene, learning a single programming is the most helpful way to enter the field. Use learning a programming language to assess whether data science is for you: if you struggle and do not like programming, then developing machine learning algorithms for a living is probably not a good fit for you.

Many groups are trying to develop softwares that enable machine learning without having to program: DataRobot, Auto-WEKA, RapidMiner, BigML, and AutoML, among many others. The pros and cons and successes and failures of these softwares warrants a separate blog post to itself (one I intend to write eventually). As of now though, these have not replaced programming languages in either practical ability to develop complex machine learning algorithms and in demonstrating that you have the technical computational/programming skills for the field.

For a beginner to the data science scene, learning a single programming is the most helpful way to enter the field. Use learning a programming language to assess whether data science is for you: if you struggle and do not like programming, then data science where you would be developing machine learning algorithms for a living is probably not a good fit for you. Depending on where you work or type of field/tasks you are doing, you might end up using the language(s) or software(s) your team works with so that you can easily work jointly on projects with them. For some areas of work or tasks might prefer certain packages and languages. If you demonstrate that you can already know a complex programming language like Python (or Java or C++), even if that is not the preferred language of their team, then you will likely demonstrate to any hiring manager that you can learn their specific language or software.

Photo credit #1: ThisIsEngineering at https://www.pexels.com/photo/woman-coding-on-computer-3861958/

Photo credit #2: Hitesh Choudhary at https://unsplash.com/photos/D9Zow2REm8U

Photo credit #3: thekirbster at https://www.flickr.com/photos/kirbyurner/30491542972/in/photolist-MQRUEh-2g3E1wf-Nsr8q9-HDKJxu-22VkHJU-2bWRXY2/lightbox/ (Yes, even though it is cool looking, this is not my code.)

Photo credit #4: Steinar Engeland at https://unsplash.com/photos/WDf1tEzQ_SY

Photo credit #5: Markus Spiske at https://unsplash.com/photos/jUWw_NEXjDw

How to Analyze Texts with Data Science

flat lay photography of an open book beside coffee mug

A friend and fellow professor, Dr. Eve Pinkser, asked me to give a guest lecture on quantitative text analysis techniques within data science for her Public Health Policy Research Methods class with the University of Illinois at Chicago on April 13^th, 2020. Multiple people have asked me similar questions about how to use data science to analyze texts quantitatively, so I figured I would post my presentation for anyone interested in learning more.

It provides a basic introduction of the different approaches so that you can determine which to explore in more detail. I have found that many people who are new to data science feel paralyzed when trying to navigate through the vast array of data science techniques out there and unsure where to start.

Many of her students needed to conduct quantitative textual analysis as part of their doctoral work but struggled in determining what type of quantitative research to employ. She asked me to come in and explain the various data science and machine learning-based textual analysis techniques, since this was out of her area of expertise. The goal of the presentation was to help the PhD students in the class think through the types of data science quantitative text analysis techniques that would be helpful for their doctoral research projects.

Hopefully, it would likewise allow you to determine the type or types of text analysis you might need so that you can then look those up in more detail. Textual analysis, as well as the wider field of natural language processing within which it is a part of, is a quickly up-and-coming subfield within data science doing important and groundbreaking work.

Download [707.13 KB]

Photo credit: fotografierende at https://www.pexels.com/photo/flat-lay-photography-of-an-open-book-beside-coffee-mug-3278768/

Data Science and the Myth of the “Math Person”

“Data science is doable,” a fellow attendee of the EPIC’s 2018 conference in Honolulu would exclaim like a mantra. The conference was for business ethnographers and UX researchers interested in understanding and integrating data science and machine learning into their research. She was specifically trying to address a tendency she has noticed– which I have seen as well: qualitative researchers and other so-called “non-math people” frequently believe that data science is far too technical for them. This seems ultimately rooted in cultural myths about math and math-related fields like computer science, engineering, and now data science, and in a similar vein as her statement, my goal in this essay is to discuss these attitudes and show that data science, like math, is relatable and doable if you treat it as such.

The “Math Person”

In the United States, many possess an implied image of a “math person:” a person supposedly naturally gifted at mathematics. And many who do not see themselves as fitting that image simply decry that math simply isn’t for them. The idea that some people are inherently able and unable to do math is false, however, and prevents people from trying to become good at the discipline, even if they might enjoy and/or excel at it.

Most skills in life, including mathematical skills, are like muscles: you do not innately possess or lack that skill, but rather your skill develops as you practice and refine that activity. Anybody can develop a skill if they practice it enough.

Scholars in anthropology, sociology, psychology, and education have documented how math is implicitly and explicitly portrayed as something some people can do and some cannot do, especially in math classes in grade school. Starting in early childhood, we implicitly and sometimes explicitly learn the idea that some people are naturally gifted at math but for others, math is simply not their thing. Some internalize that they are gifted at math and thus take the time to practice enough to develop and refine their mathematical skills; while others internalize that they cannot do math and thus their mathematical abilities become stagnant. But this is simply not true.

Anyone can learn and do math if he or she practices math and cultivates mathematical thinking. If you do not cultivate your math muscle, then well it will become underdeveloped and, then, yes, math becomes harder to do. Thus, as a cruel irony someone internalizing that he or she cannot do math can turn into a self-fulfilling prophecy: he or she gives up on developing mathematical skills, which leads to its further underdevelopment.

Similarly, we cultivate another false myth that people skilled in mathematics (or math-related fields like computer science, engineering, and data science) in general do not possess strong social and interpersonal communication skills. The root for this stereotype lies in how we think of mathematical and logical thinking than actual characteristics of mathematicians, computer scientists, or engineers. Social scientists who have studied the social skills of mathematicians, computer scientists, and engineers have found no discernable difference in social and interpersonal communication skills with the rest of the world.

Quantitative and Qualitative Specialties

Anyone can learn and do math if he or she practices math and cultivates mathematical thinking.

The belief that some people are just inherently good at math and that such people do not possess strong social and interpersonal communication skills contributes to the division between quantitative and qualitative social research, in both academic and professional contexts. These attitudes help cultivate the false idea that quantitative research and qualitative research are distinct skill sets for different types of people: that supposedly quantitative research can only be done “math people” and qualitative research by “people people.” They suddenly become separate specialties, even though social research by its very nature involves both. Such a split unnecessarily stifles authentic and holistic understanding of people and society.

In professional and business research contexts, both qualitative and quantitative researchers should work with each other and eventually through that process, slowly learn each other’s skills. If done well, this would incentivize researchers to cultivate both mathematical/quantitative, and interpersonal/qualitative research skills.

It would reward professional researchers who develop both skillsets and leverage them in their research, instead of encouraging researchers to specialize in one or the other. It could also encourage universities to require in-depth training of both to train their students to become future workers, instead of requiring that students choose among disciplines that promote one track over the other.

Working together is only the first step, however, whose success hinges on whether it ultimately leads to the integration of these supposedly separate skillsets. Frequently, when qualitative and quantitative research teams work together, they work mostly independently – qualitative researchers on the qualitative aspect of the project and quantitative researchers on the quantitative aspects of the project – thus reinforcing the supposed distinction between them. Instead, such collaboration should involve qualitative researchers developing quantitative research skills by practicing such methods and quantitative researchers similarly developing qualitative skills.

Conclusion

Anyone can develop mathematics and data science skills if they practice at it. The same goes with the interpersonal skills necessary for ethnographic and other qualitative research. Depicting them as separate specialties – even if they come together to do each of their specialized parts in a single research projects – functions stifles their integration as a singular set of tools for an individual and reinforces the false myths we have been teaching ourselves that data science is for math, programming, or engineering people and that ethnography is for “people people.” This separation stifles holistic and authentic social research, which inevitably involves qualitative and quantitative approaches.

Photo credit #1: Andrea Piacquadio at https://www.pexels.com/photo/woman-holding-books-3768126/

Photo credit #2: Antoine Dautry at https://unsplash.com/photos/_zsL306fDck

Photo credit #3: Mike Lawrence at https://www.flickr.com/photos/157270154@N05/28172146158/ and http://www.creditdebitpro.com/

Photo credit #4: Ryan Jacobson at https://unsplash.com/photos/rOYhgmDIOg8

When Is Machine Learning Useful?

In a past blog post, I defined and described what machine learning is. I briefly highlighted four instances where machine learning algorithms are useful. This is what I wrote:

Autonomy: To teach computers to do a task without the direct aid/intervention of humans (e.g. autonomous vehicles)
Fluctuation: Help machines adjust when the requirements and data change over time
Intuitive Processing: Conduct or assist in tasks humans do but are unable to explain how computationally/algorithmically (e.g. image recognition)
Big Data: Breaking down data that is too large to handle otherwise

The goal of this blog post is to explain each in more detail.

Case #1: Autonomy

The first major use of machine learning centers around teaching computers to do a task or tasks without the direct aid or intervention of humans. Self-driving vehicles are a high-profile example of this: teaching a vehicle to drive (scanning the road and determining how to respond to what is around it) without the aid of or with minimal direct oversight from a human driver.

There are two types basic types of tasks that machine learning systems might perform autonomously:

Tasks humans frequently perform
Tasks humans are unable to perform.

Self-driving cars exemplify the former: humans drive cars, but self-driving cars would perform all or part of the driving process. Another example would be chatbots and virtual assistants like Alexa, Cortana, and Ok Google, which seek to converse with users independently. Such tasks might completely or partially complete the human activity: for example, some customer service chatbots are designed to determine the customer’s issue but then to transfer to a human when the issue has a certain complexity.

Humans have also sought to build autonomous machine learning algorithms to perform tasks that humans are unable to perform. Unlike self-driving cars, which conduct an activity many people do, people might also design a self-driving rover or submarine to drive and operate in a world that humans have so far been unable to inhabit, like other planets in our Solar System or the deep ocean. Search engines are another example: Google uses machine learning to help refine search results, which involves analyzing a massive amount of web data beyond what a human could normally do.

Case #2: Fluctuating Data

Business, Success, Curve, Hand, Draw, Present, Trend

Machine learning is also powerful tool for making sense of and incorporating fluctuating data. Unlike other types of models with fixed processes for how it predicts its values, machine learning models can learn from current patterns and adjust both if the patterns fluctuate overtime or if new use cases arise. This can be especially helpful when trying to forecast the future, allowing the model to decipher new trends if and when they emerge. For example, when predicting stock prices, machine learning algorithms can learn from new data and pick up changing trends to make the model better at predicting the future.

Of course, humans are notorious for changing overtime, so fluctuation is often helpful in models that seek to understand human preferences and behavior. For example, user recommendations – like Netflix’s, Hulu’s, or YouTube’s video recommendation systems – adjust based on the usage overtime, enabling them to respond to individual and/or collective changes in interests.

Case #3: Intuitive Processing

Flat, Recognition, Facial, Face, Woman, System

Data scientist frequently develop machine learning algorithms to teach computers how to do processes that humans do naturally but for which we are unable to fully explain how computationally. For example, popular applications of machine learning center around replicating some aspect of sensory perception: image recognition, sound or speech recognition, etc. These replicate the process of inputting sensory information (e.g. sight and sound) and processing, classifying, and otherwise making sense of that information. Language processing, like chatbots, form another example of this. In these contexts, machine learning algorithms learn a process that humans can do intuitively (see or hear stimuli and understand language) but are unable to fully explain how or why.

Many early forms of machine learning arose out of neurological models of how human brains work. The initial intention of neural nets, for instance, were to model our neurological decision-making process or processes. Now, much contemporary neurological scholarship since has disproven the accuracy of neural nets in representing how our brains and minds work.[i] But, whether they represent how human minds work at all, neural networks have provided a powerful technique for computers to use to process and classify information and make decisions. Likewise, many machine learning algorithms replicate some activity humans do naturally, even if the way they conduct that human task has little to do with how humans would.

Case #4: Big Data

Technology, 5G, Aerial, Abstract Background

Machine learning is a powerful tool when analyzing data that is too large to break down through conventional computational techniques. Recent computer technologies have increased the possibility of data collection, storage, and processing, a major driver in big data. Machine learning has arisen as a major, if not the major, means of analyzing this big data.

Machine learning algorithms can manage a dizzying array of variables and use them to find insightful patterns (like lasso regression for linear modeling). Many big data cases involve hundreds, thousands, and maybe even tens or hundreds of thousands of input variables, and many machine learning techniques (like best subsets selection, stepwise selection, and lasso regression) process the myriads of variables in big data and determine the best ones to use.

Recent developments computing provides the incredible processing power necessary to do such work (and debatably, machine learning is currently helping to push computational power and provide a demand for greater computational abilities). Hand-calculations and computers several decades ago were often unable to handle the calculations necessary to analyze large information: demonstrated, for example, by the fact that computer scientists invented the now popular neural networks many decades ago, but they did not gain popularity as a method until recent computer processing made them easy and worthwhile to run.

Tractors and other large-scale agricultural techniques coincided historically with the enlargement of farm property sizes, where the such machinery not only allowed farmers to manage large tracks of land but also incentivized larger farms economically. Likewise, machine learning algorithms provide the main technological means to analyze big data, both enabling and in turn incentivized by rise of big data in the professional world.

Conclusion

Here I have described four major uses of machine learning algorithms. Machine learning has become popular in many industries because of at least one of these functionalities, but of course, they are not the only potential current uses. In addition, as we develop machine learning tools, we are constantly inventing more. Given machine learning’s newness compared to many other century-old technologies, time will tell all the ways humans utilize it.

Photo credit #1: Mike MacKenzie at https://www.flickr.com/photos/mikemacmarketing/30212411048/

Photo credit #2: julientromeur at https://pixabay.com/illustrations/car-automobile-3d-self-driving-4343635/

Photo credit #3: geralt at https://pixabay.com/illustrations/business-success-curve-hand-draw-1989130/

Photo credit #4: geralt at https://pixabay.com/illustrations/flat-recognition-facial-face-woman-3252983/

Photo credit #5: mohamed_hassan at https://pixabay.com/illustrations/technology-5g-aerial-4816658/

[i] See Richard, Nagyfi. The differences between Artificial and Biological Neural Networks. 4 September 2018. https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7; and Tcheang, Lili. Are Artificial Neural Networks like the Human Brain? And does it matter? 7 November 2018. https://medium.com/digital-catapult/are-artificial-neural-networks-like-the-human-brain-and-does-it-matter-3add0f029273.

The Stages of Learning a New Data or Programming Skill

Many people have admirably sought to learn data science, data analytics, a programming language, or some other data or programming skill in order to develop themselves professionally and/or seek a new career path. Excitingly, learning such skills has become significantly easier to do online. But this online learning can also foster unrealistic understandings of what learning one of these skills entails, since it can remove prospective learners from the physical community of experts who help introduce prospective learners to the expectations of that field.

The goal of this article is to help rectify that by explaining the basic steps typically needed to develop a mastery of a new data or programming skill. This will hopefully help inform high-level expectations for learning the skill would entail but also help you choose the right courses or set of courses to ensure you develop all three stages.

By data skill, I mean any data field like data science, data analytics, or data engineering, or any specific skill or practice within a data field that someone might seek to learn, and by programming skill I mean the skills necessary to learn and code in a programming language.

These are the three basic learning stages to master any of these topics:

Stage 1: Grasp the basic concepts of the topic

Stage 2: Complete a guided project

Stage 3: Complete a self-directed project

Stage 1: Grasping Basic Concepts

Grasping basic concepts entails learning the relevant vocabulary, syntax, and key approaches. Often programs teach each concept distinctly, one at a time. For example, when learning a new programming language, you might learn the major commands and syntax rules, and for data science, you might learn about each of the most prominent machine learning models one at a time.

This is different from applying the concepts widely, and at this stage, you may not be able to handle mixing all the concepts together in a complex problem yet (that’s Stage 2). Programs often teach the material at this point sequentially (even though that can be difficult for nonlinear learners).

For example, W3Schools provides grounded Stage 1 teaching for most programming languages and data science skills. They provide sequential exercises working through the basic syntax components of a new language, ever so slightly increasing in complexity along the way.

Now, only performing the first stage does not entail a full mastery of topic. After practicing each piece one at a time, you must also transition into Stage 2 where you start to learn how to combine them when completing a more complex problem.

Stage 2: Guided Project

Here you practice putting all the pieces together through a guided project(s). This guided project is a model for how each of the components fit together in an actual project. I liken these to building a Lego kit: following step-by-step instructions to build a cool model (instead of building your own object from scratch, which is Stage 3). They hold your hand through its completion to illustrate what putting all of the isolated skills and concepts together during a complicated project would entail.

Stage 3: Independent Project

In the third stage, you bring everything have learned together to complete a project on your own. Unlike in Stage 2, when they held your hand, you now have the freedom to struggle, which is necessary to learn. You are developing the skills involved in forming and carrying out a project on your own.

At the same time, you are learning what it looks like to implement those skills “in the wild” of a real-life project. In the previous stages, instructors often coddle their students: providing cleaned and perfectly ready-to-do example problems that you might find in a textbook, necessary to learn the basic concepts. Like a Lego kit, the components of the project have been groomed to make what you are producing. In Stage 3, you often start to experience the types of messiness common in real-world projects, when you have to find the pieces you need and/or figure out how to make do with the ones what you have.

For example, among data science learners, this stage is when students first learn to deal with the complexities of finding the right data for their problem; determining the best questions for a given dataset; and/or cleaning inconsistent data. Beforehand, most examples probably had already cleaned data that matched the specific task they were built for.

A certain amount of trial by fire is often needed to learn how to develop your own project. Your instructor(s) might take a little more of a backseat role during this process, looking over what you have done, answering any questions you might have, and nudging you when necessary. In my experience, exploring strategies yourself is the best way to learn Stage 3. Hopefully, at the end of it all, you will produce a nifty project that you can show prospective employers or whoever else you might wish to impress.

Conclusion

These are the three most common stages to develop initial mastery of a new data or programming skill or field. Now, they are the skill levels generally necessary to learn the new skill, but there are plenty of further levels of learning after you complete these. For example, grasping basic data science concepts, completing a guided project, and learning how to conduct your own self-directed data science project would be enough to make you a new inductee into the data science community, but you would still be a newbie data scientist. It is only the tip of the iceberg for what you can learn and how you would grow as a data scientist.

Now, despite calling them stages, not everyone learns them in sequential order, especially given the variety of extenuating circumstances and learning styles. For example, some might complete all three stages for a specific subset of skills in the field they are learning, and then go back to Stage 1 for another subset. Most education programs will include all three stages, more or less in order.

Some education programs, however, might completely lack or provide insufficient resources for one or two stages. Assessing whether a program adequately includes all three can be an effective way to determine how good they are at teaching and whether they are worth your money and/or time. When choosing to learn a new skill, I would recommend a program or combination of programs that includes all three. If a program you want to do or are currently completing lacks one or two of these stages, you can try to find another (hopefully free) way to complete that stage yourself online. For example, online courses and tutorials very frequently fail to provide Stage 3 (and in some cases, Stage 2), so after you complete one, I would recommend finding a project to work on.

Finally, when you are encountering a difficulty learning, it might be because you need to go backwards to a previous stage. For example, when many learners move to Stage 2, they must periodically swing back into Stage 1 to review a few core concepts when they see those concepts applied in a new way. Similarly, when completing a project in Stage 3, there is nothing wrong with reviewing Stage 2 or even Stage 1 materials.

Now, be careful because you can falsely attribute this. Learning anything can be frustrating. Sometimes the difficulties you are having are not rooted in the need to review or relearn past material, but you simply need to push through with the new material until you start to get it. In those cases, some students revert backwards into a set of material in which they can feel safe and confident instead of challenging themselves. Even in those cases, however, like rocking a car by going into reverse and then drive to get over a bump, quickly going backwards can help launch you forward over the hurdle. In such cases, what is most important is to know yourself – your learning tendencies and how you typically respond – and check in as much as you can with instructors and/or experts in the field who have been there and done that to help you determine the best ways to overcome whatever challenge you are having.

Photo credit #1: Jukan Tateisi at https://unsplash.com/photos/bJhT_8nbUA0

Photo credit #2: qimono at https://pixabay.com/illustrations/cog-wheels-gear-wheel-machine-2125178/

Photo credit #3: Bonneval Sebastien at https://unsplash.com/photos/lG-6_ox_UXE

Photo credit #4: Holly Mandarich at https://unsplash.com/photos/UVyOfX3v0Ls

Photo credit #5: George Bakos at https://unsplash.com/photos/VDAzcZyjun8

What Is Data Science and Machine Learning? A Short Guide for the Unsure

What is data science, and what is machine learning? This is a short overview for someone who has never heard of either.

What Is Data Science?

In the abstract, data science is an interdisciplinary field that seeks to use algorithms to organize, process, and analyze data. It represents a shift towards using computer programing, specifically machine learning algorithms, and other, related computational tools to process and analyze data.

By 2008, companies starting using the term data scientists to refer to a growing group of professionals utilizing advanced computing to organize and analyze large datasets,[i] and thus from the get-go, the practical needs of professional contexts have shaped the field. Data science combines strands from computer science, mathematics (particularly statistics and linear algebra), engineering, the social sciences, and several other fields to address specific real-world data problems.

On a practical level, I consider a data scientist someone who helps develop machine learning algorithms to analyze data. Machine learning algorithms form the central techniques/tools around what constitutes data science. For me personally, if it does not involve machine learning, it is not data science.

What Is Machine Learning?

Machine learning is a complex term: What to say that a machine “learns”? Overtime data scientists have provided many intricate definitions of machine learning, but its most basic, machine learning algorithms are algorithms that adapt/modify how their approach to a task based on new data/information overtime.

Herbert Simon provides a commonly used technical definition: “Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the task or tasks drawn from the same population more efficiently and more effectively the next time.”[ii] As this definition implies, machine learning algorithms adapt by iteratively testing its performance against the same or similar data. Data scientists (and others) have developed several types of machine learning algorithms, including decision tree modeling, neural networks, logistic regression, collaborative filtering, support vector machines, cluster analysis, and reinforcement learning among others.

Data scientists generally split machine learning algorithms into two categories: supervised and unsupervised learning. Both involve training the algorithm to complete a given task but differ on how they test the algorithm’s performance. In supervised learning, the developer(s) provide a clear set of answers as a basis for whether the prediction is correct; while for unsupervised learning, whether the algorithm’s performance is much more open-ended. I liken the difference to be like the exams teachers gave us in school: some tests, like multiple choice exams, have clear, right and wrong answers or solutions, but other exams, like essays, are open-ended with qualitative means of determining goodness. Just like the nature of the curriculum determines the best type of exam, which type of learning to performs depends on the project context and nature of the data.

Here are four instances where machine learning algorithms are useful in these types of tasks:

Autonomy: To teach computers to do a task without the direct aid/intervention of humans (e.g. autonomous vehicles)
Fluctuation: Help machines adjust when the requirements or data change over time
Intuitive Processing: Conduct (or assist in) tasks humans do naturally but are unable to explain how computationally/algorithmically (e.g. image recognition)
Big Data: Breaking down data that is too large to handle otherwise

Machine learning algorithms have proven to be a very powerful set of tools. See this article for a more detailed discussion of when machine learning is useful.

[i] Berkeley School of Information. (2019). What is Data Science? Retrieved from https://datascience.berkeley.edu/about/what-is-data-science/.

[ii] Simon in Kononenko, I., & Kukar, M. (2007). Machine Learning and Data Mining. Elsevier: Philadelphia.

Photo credit #1: Frank V at https://unsplash.com/photos/zbLW0FG8XU8

Photo credit #2: Brett Jordan at https://unsplash.com/photos/HzOclMmYryc

Potential Objection #1: But I have more to say about the data than a single statement.

Potential Objection #2: I want the viewers to interpret the findings for themselves, not just impart my own ideas/conclusions.

Potential Objection #3: My main idea/point has multiple subpoints.

Share this:

Difference #1: Data Science Is More than a Field of Mathematics

Difference #2: Comprehensive vs Sample Data

Difference #3: Exploratory vs Confirming

Conclusion

Share this:

Table Rule #1: Don’t be afraid to provide as much or as little information as you need.

Table Rule #2: Keep columns consistent for easy scanning.

Conclusion

Share this:

#1 Programming Language: Python

#2 Programming Language: R

#3 Programming Language: Java

#4 Programming Languages: C/C++

Conclusion

Share this:

Share this:

The “Math Person”

Quantitative and Qualitative Specialties

Conclusion

Share this:

Case #1: Autonomy

Case #2: Fluctuating Data

Case #3: Intuitive Processing

Case #4: Big Data

Conclusion

Share this:

Stage 1: Grasping Basic Concepts

Stage 2: Guided Project

Stage 3: Independent Project

Conclusion

Share this:

What Is Data Science?

What Is Machine Learning?

Share this: