I interviewed Olga Shiyan as part of my Interview Series. In it, she discusses her anti-corruption work in Kazakhstan with Transparency International. In particular, she highlights various projects that have integrated anthropology with data science and statistics.
Olga Shiyan is the Executive Director of Transparency International’s chapter in Kazakhstan. She specializes in advocacy, legislation and draft laws, and democratic training programs. For this work, she has developed research methods that combine anthropology with data science and statistics. In 2019, the Kazakhstan Geographic Society awarded her a medal for her anti-corruption work.
To learn more about Olga, feel free to check out the following:
For Part 8 in my Interview Series, I interviewed Scarleth Herrera, a digital anthropologist and founder of Orez Anthropological Research. In it, we discuss her experience starting her own digital anthropology research company, transitioning into artificial intelligence-related work, and conducting anthropological research outside of academia.
Scarleth lives in South Florida. Her Orez Anthropological Research is a non-profit dedicated to exploring and advancing digital anthropology research. She is also a Research Scholar at the Ronin Institute in New Jersey. Her current research focuses on the implications artificial intelligence may have for society in general and for low-income communities in particular, and she is also passionate about issues facing immigrant communities in the United States.
Earlier this week, Matt Artz, Astrid Countee, and I ran a workshop at the American Anthropological Association’s 2020 annual conference entitled “Breaking into Tech.” We discussed strategies for anthropologists interested in working in the tech world.
Here is the presentation for anyone who might find it useful but could not attend:
I suspect everyone has seen a bad graph, a mess of bars, lines, pie slices, or what have you that you dreaded having to look at. Maybe you have even made one, which you look at today and wonder what on earth you were thinking.
These graphs violate the most basic graph-making rule in data visualization:
A graph is like a sentence, expressing one idea.
This rule applies to all uses of graphs, whether you are a data scientist, data analyst, statistician, or just making graphs for your friends for fun.
In grade school, your grammar teachers likely explained that a sentence, at its most basic, expresses one thought or idea. Graphs are visual sentences: they should state one and only one thought or idea about the data.
When you look at a graph, you should be able to say, in one sentence, what the graph is saying: such as “Group A is greater than Group B,” or “Y at first improved but is now declining.” If you cannot, then you have yourself a run-on graph.
For example, the above graph is trying to say too much at once: it depicts the immigration patterns of twenty-two different countries over the course of nearly a century. There are likely useful statements in this data, but cramming everything into one graph prevents a viewer/reader from easily deciphering them.
Likewise, this graph shows way too many lens sizes to meaningfully express a single, coherent idea, leaving the reader/viewer struggling to determine which fields to focus on.
Potential Objection #1: But I have more to say about the data than a single statement.
Great! Then provide more than one graph. Say everything you need to say about the data; just use one graph for each of your statements.
Don’t fall into the One-Graph-to-Rule-Them-All Fallacy: trying to use one graph to express all your statements about the data that ends up a visual mess of incomprehensibility. Create multiple easy-to-read graphs where each graph demonstrates one of your points at a time. Condensing everything into one graph just prevents your viewers from determining what you have to say at all.
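One way to do this in practice is with small multiples: one single-statement graph per group instead of one chart with everything stacked together. Here is a minimal matplotlib sketch; the country names and numbers are made-up placeholders, not the actual immigration figures:

```python
import matplotlib.pyplot as plt

# Hypothetical immigration counts (thousands) per decade; illustrative only
decades = ["1880", "1890", "1900", "1910"]
data = {
    "Country A": [655, 388, 339, 146],
    "Country B": [307, 652, 2046, 1110],
    "Country C": [1453, 505, 341, 144],
}

# One small, single-statement graph per country instead of one crowded chart
fig, axes = plt.subplots(1, len(data), figsize=(12, 3), sharey=True)
for ax, (country, counts) in zip(axes, data.items()):
    ax.bar(decades, counts)
    ax.set_title(country)
axes[0].set_ylabel("Immigrants (thousands)")
plt.tight_layout()
plt.show()
```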
Potential Objection #2: I want the viewers to interpret the findings for themselves, not just impart my own ideas/conclusions.
Fair point. When presenting/communicating data, there is a time for showing your own insights and a time to open-endedly display the information for your viewers/readers to interpret for themselves. Graphs are tools for the former, and for the latter, use tables. Tables, among other potential uses, convey a wide scope of information for the reader/viewer to interpret on their own.
Remember that first example above about U.S. immigration from various parts of Europe? A table (see below) would convey that information much more easily and allow readers to track whatever places, patterns, or questions they would like to learn about. Are you in a situation where you would like to report a large amount of information that your readers can use for their own purposes? Then tables are a much better starting point than graphs.
Some situations require that I lean towards sharing my insights/analysis and others towards encouraging my readers/viewers to form their own conclusions, but since most situations require a combination of the two, I generally combine graphs and tables. I try, when I can, to put smaller tables in the document or slides themselves and, when I cannot, include full tables in an Appendix.
Potential Objection #3: My main idea/point has multiple subpoints.
Many sentences have multiple subpoints needed to express a single idea as well, which does not prevent the sentence structure from meaningfully capturing those ideas. The fancy grammar word for such a subpoint is a clause. Even though some sentences are simple and straightforward with only one subject and predicate, many (like this very sentence) require multiple sets of subjects and predicates to express their thought.
Likewise, some graphical ideas require multiple subordinate or compounded subpoints, and there are types of graphs that allow this. Consider joint plots, like the one below. To adequately present the relationship between two variables and their combined distribution, they also display each variable’s individual distribution above and to the right. That way, the viewer can see how each variable’s distribution might be influencing the combined distribution: each marginal plot functions like a subordinate clause.
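For anyone who wants to try one, here is a minimal sketch using seaborn’s jointplot with its built-in tips example dataset (any two numeric variables of your own would work the same way):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# The central panel states the main point (the relationship between the two
# variables); the marginal plots act like subordinate clauses, showing each
# variable's individual distribution.
tips = sns.load_dataset("tips")  # seaborn's built-in example dataset
sns.jointplot(data=tips, x="total_bill", y="tip", kind="scatter")
plt.show()
```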
These are advanced graphs to make: just as multi-part sentences require care in organizing their clauses cohesively, these graphs require care in presenting the subpoints so the main point stays clear. I intend to write a later post describing how to develop such multi-part graphs in more detail.
The general rule still applies for these more complicated graphs:
Can you summarize what the graph is saying in one coherent sentence?
If you cannot, do not use/show that graph. Our brains are very good at intuiting whether a sentence carries one thought, so use this to determine whether your graph is effective.
In a previous article, I discussed the value of integrating data science and ethnography. On LinkedIn, people commented that they were interested and wanted to hear more detail on potential ways to do this. I replied that it is easier to demonstrate how to conduct studies integrating the two through examples than abstractly, since the details of how to do it vary based on the specific needs of each project.
In this article, I intend to do exactly that: analyze four innovative projects that in some way integrated data science and ethnography. I hope these will get your creative juices flowing and help you think through how to combine the two creatively in whatever project you are working on.
Synopsis:

Project: No Show Model

How It Integrated Data Science and Ethnography:
Used ethnography to design machine learning software
Used ethnography to understand how users make sense of and behave towards a machine learning system they encounter and how this, in turn, shapes the development of the machine learning algorithm(s)
A medical clinic at a hospital system in New York City asked me to use machine learning to build a show rate predictor in order to inform and improve its scheduling practices. During the initial construction phase, I used ethnography both to understand the scheduling problem the clinic faced in more depth and to determine an appropriate interface design.
Through an ethnographic inquiry, I discovered the most important question schedulers ask when scheduling appointments: “Of the people scheduled for a given doctor on a particular day, how many of them are likely to actually show up?” I then built a machine learning model to answer this exact question. My ethnographic inquiry provided the design requirements for the data science project.
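I cannot share the clinic’s actual code, but a minimal sketch of the general approach might look like the following. The file name and columns (appointments.csv, doctor_id, date, lead_days, prior_no_shows, showed_up) are hypothetical stand-ins, not the clinic’s real schema:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical appointment history; column names are illustrative
df = pd.read_csv("appointments.csv")

features = ["lead_days", "prior_no_shows"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["showed_up"], test_size=0.2, random_state=42
)

# Model each appointment's probability of showing up (showed_up: 1 = showed)
model = LogisticRegression().fit(X_train, y_train)

# Answer the schedulers' actual question: of the people scheduled for a
# given doctor on a particular day, how many are likely to show up?
test = df.loc[X_test.index].copy()
test["p_show"] = model.predict_proba(X_test)[:, 1]
expected_shows = test.groupby(["doctor_id", "date"])["p_show"].sum()
print(expected_shows.head())
```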
In addition, I used my ethnographic inquiries to design the interface. I observed how schedulers interacted with their current scheduling software, which gave me a sense for what kind of visualizations would work or not work for my app.
This project exemplifies how ethnography can be helpful both in the development stage of a machine learning project, to determine what the algorithm(s) must do, and on the frontend, when communicating the algorithm(s) to users and assessing its success with them.
As both an ethnographer and a data scientist, I was able to translate my ethnographic insights seamlessly into machine learning modeling and API specifications and to conduct follow-up ethnographic inquiries to ensure that what I was building would meet the clinic’s needs.
Project 2: Cybersensitivity Study
I conducted this project with Indicia Consulting. Its goal was to explore potential connections between individuals’ energy consumption and their relationship with new technology. This is an example of using ethnography to explore and determine potential social and cultural patterns in-depth with a few people and then using data science to analyze those patterns across a large population.
We started the project by observing and interviewing about thirty participants, but as the study progressed, we needed to develop a scalable method to analyze the patterns across whole communities, counties, and even states.
Ethnography is a great tool for exploring a phenomenon in-depth and for developing initial patterns, but it is resource-intensive and thus difficult to conduct with a large group of people. It is not practical for analyzing, say, thousands of people. Data science, on the other hand, can easily test patterns noticed in smaller ethnographic studies for validity across an entire population, yet because it often lacks ethnography’s granularity, it can miss intricate patterns.
Ethnography is also great on the back end for determining whether the implemented machine learning models and their resulting insights make sense on the ground. This forms a type of iterative feedback loop, where data science scales up ethnographic insights and ethnography contextualizes data science models.
Thus, ethnography and data science cover each other’s weaknesses well, forming a great methodological duo for projects centered around trying to understand customers, users, colleagues, or other users in-depth.
Project 3: Facebook Newsfeed Folk Theories
In their study, Motahhare Eslami and her team of researchers conducted an ethnographic inquiry into how various Facebook users conceive of how the Facebook Newsfeed selects which posts/stories rise to the top of their feeds. They analyze several different “folk theories”: working theories held by everyday people about the criteria this machine learning system uses to select top stories.
How users think the overall system works influences how they respond to the newsfeed. Users who believe, for example, that the algorithm prioritizes the posts of friends whose posts they have liked in the past will often intentionally like the posts of their closest friends and family so that they can see more of their posts.
Users’ perspectives on how the Newsfeed algorithm works influences how they respond to it, which, in turn, affects the very data the algorithm learns from and thus how the algorithm develops. This creates a cyclic feedback loop that influences the development of the machine learning algorithmic systems over time.
Their research exemplifies the importance of understanding how people think about, respond to, and more broadly relate with machine learning-based software systems. Ethnographies of people’s interactions with such systems are a crucial way to develop this understanding.
In a way, many machine learning algorithms are very social in nature: they – or at least the overall software system in which they exist – often succeed or fail based on how humans interact with them. In such cases, no matter how technically robust a machine learning algorithm is, if potential users cannot positively and productively relate to it, then it will fail.
Ethnographies of the “social life” of machine learning software systems (by which I mean how they become a part of, or in some cases fail to become a part of, individuals’ lives) help us understand how the algorithm is developing or learning and determine whether it is doing what we intended. Such ethnographies require not only in-depth expertise in ethnographic methodology but also an in-depth understanding of how machine learning algorithms work, in order to grasp how social behavior might be influencing their internal development.
Project 4: Thing Ethnography
Elise Giaccardi and her research team have been pioneering the use of data science and machine learning to understand and incorporate the perspective of things into ethnographies. With the development of the internet of things (IoT), she suggests that data from object sensors could provide fresh insights in ethnographies of how humans relate to their environment by helping to describe how these objects relate to each other. She calls this thing ethnography.
This experimental approach exemplifies one way to use machine learning algorithms within ethnographies to analyze social processes/interactions in and of themselves. It could be an innovative way to study the social role of IoT objects in daily life within ethnographic studies. If Eslami’s work exemplifies grafting ethnographic analysis onto the design cycle of machine learning algorithms, Giaccardi’s research illustrates one way to incorporate data science and machine learning analysis into ethnographies.
Conclusion
Here are four examples of innovative projects that integrate data science and ethnography to meet their respective goals. I do not intend these to be a complete or exhaustive account of how to integrate the two methodologies but food for thought to spur further creative thinking about how to connect them.
For those who, when they hear the idea of integrating data science and ethnography, ask the reasonable question, “Interesting but what would that look like practically?”, here are four examples of how it could look. Hopefully, they are helpful in developing your own ideas for how to combine them in whatever project you are working on, even if its details are completely different.
Data science’s popularity has grown in the last few years, and many have confused it with its older, more familiar relative: statistics. As someone who has worked both as a data scientist and as a statistician, I frequently encounter such confusion. This post seeks to clarify some of the key differences between them.
Before I get into their differences, though, let’s define them. Statistics as a discipline refers to the mathematical processes of collecting, organizing, analyzing, and communicating data. Within statistics, I generally define “traditional” statistics as the statistical processes taught in introductory statistics courses, like basic descriptive statistics, hypothesis testing, and confidence intervals: generally what people outside of statistics, especially in the business world, think of when they hear the word “statistics.”
Data science in its most broad sense is the multi-disciplinary science of organizing, processing, and analyzing computational data to solve problems. Although they are similar, data science differs from both statistics and “traditional” statistics:
| Difference | Statistics | Data Science |
| --- | --- | --- |
| #1 | Field of Mathematics | Interdisciplinary |
| #2 | Sampled Data | Comprehensive Data |
| #3 | Confirming Hypotheses | Exploratory Hypotheses |
Difference #1: Data Science Is More than a Field of Mathematics
Statistics is a field of mathematics, whereas data science refers to more than just math. At its simplest, data science centers around the use of computational data to solve problems,[i] which means it includes not only the mathematics/statistics needed to break down the computational data but also the computer science and engineering thinking necessary to code those algorithms efficiently and effectively, and the business, policy, or other subject-specific “smarts” to develop strategic decision-making based on that analysis.
Thus, statistics forms a crucial component of data science, but data science includes more than just statistics. Statistics, as a field of mathematics, includes just the mathematical processes of analyzing and interpreting data, whereas data science also includes the algorithmic problem-solving needed to conduct the analysis computationally and the art of using that analysis to make decisions that meet the practical needs of the context. In short, statistics forms one part of data science, while data science refers to the entire process of analyzing computational data. On a practical level, many data scientists do not come from a pure statistics background but from a computer science or engineering one, leveraging their coding expertise to develop efficient algorithmic systems.
Difference #2: Comprehensive vs Sample Data
In statistical studies, researchers are often unable to analyze the entire population, that is, the whole group they are analyzing, so instead they create a smaller, more manageable sample of individuals that they hope represents the population as a whole. Data science projects, however, often involve analyzing big, comprehensive data encapsulating the entire population.
The tools of traditional statistics work well for scientific studies, where one must go out and collect data on the topic in question. Because this is generally very expensive and time-consuming, researchers can only collect data on a subset of the wider population most of the time.
Recent developments in computation, including the ability to gather, store, transfer, and process greater computational data, have expanded the type of quantitative research now possible, and data science has developed to address these new types of research. Instead of gathering a carefully chosen sample of the population based on a heavily scrutinized set of variables, many data science projects require finding meaningful insights from the myriads of data already collected about the entire population.
Difference #3: Exploratory vs Confirming
Data scientists often seek to build models that do something with the data, whereas statisticians seek, through their analysis, to learn something from the data. Data scientists thus often assess their machine learning models based on how effectively they perform a given task: how well they optimize a variable, determine the best course of action, correctly identify features of an image, provide good recommendations for the user, and so on. To do this, data scientists often compare the effectiveness or accuracy of many models based on one or more chosen performance metrics.
In traditional statistics, the questions center on using data to understand the research topic based on findings from a sample: what can the sample say about the wider population, and how likely are its results to represent or apply to that wider population?
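As a minimal illustration of this confirmatory style, here is a sketch using scipy on simulated samples (the data is made up for demonstration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical samples drawn from a wider population
sample_a = rng.normal(loc=50, scale=10, size=100)
sample_b = rng.normal(loc=53, scale=10, size=100)

# Confirmatory question: does the observed difference between the samples
# likely reflect a real difference in the wider population?
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```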
In contrast, machine learning models generally do not seek to explain the research topic but to do something, which can lead to a very different research strategy. Data scientists generally try to determine or produce the algorithm with the best performance (given whatever criteria they use to assess which performance is “better”), testing many models in the process. Statisticians often employ a single model they think represents the context accurately and then draw conclusions based on it.
Thus, data science is often a form of exploratory analysis, experimenting with several models to determine the best one for a task, while statistics is a form of confirmatory analysis, seeking to confirm how reasonable it is to conclude that a given hypothesis or hypotheses hold true for the wider population.
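By contrast, a typical exploratory, model-comparison workflow might look like this minimal scikit-learn sketch, trying several candidate models and keeping whichever scores best on the chosen metric:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Try several candidate models rather than committing to one up front
candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```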
A lot of scientific research has been theory confirming: a scientist has a model or theory of the world; they design and conduct an experiment to assess this model; then use hypothesis testing to confirm or negate that model based on the results of the experiment. With changes in data availability and computing, the value of exploratory analysis, data mining, and using data to generate hypotheses has increased dramatically (Carmichael 126).
Data science as a discipline has been at the forefront of utilizing increased computing abilities to conduct exploratory work.
Conclusion
A data scientist friend of mine once quipped to me that data science simply is applied computational statistics (cf. this). There is some truth in this: the mathematics of data science falls within statistics, since it involves collecting, analyzing, and communicating data, and, with its emphasis on computational data, it would definitely be a part of computational statistics. The mathematics of data science is also very clearly applied: geared towards solving practical problems/needs. Hence, data science and statistics interrelate.
They differ, however, both in their formal definitions and practical understandings. Modern computation and big data technologies have had a major influence on data science. Within statistics, computational statistics also seeks to leverage these resources, but what has become “traditional” statistics does not (yet) incorporate these. I suspect in the next few years or decades, developments in modern computing, data science, and computational statistics will reshape what people consider “traditional” or “standard” statistics to be a bit closer to the data science of today.
For more details, see the following useful resources:
In a previous post about data visualization in data science and statistics, I discussed what I consider the single most important rule of graphing data. In this post, I am following up to discuss the most important rules for making data tables. I will focus on data tables in reporting/communicating findings to others, as opposed to the many other uses of tables in data science, say, to store, organize, and mine data.
To summarize, graphs are like sentences, conveying one clear thought to the viewer/reader. Tables, on the other hand, can function more like paragraphs, conveying multiple sentences or thoughts that together form an overall idea. Unlike graphs, which often provide one thought, tables can be more exploratory, providing information for the viewer/reader to analyze and draw his or her own conclusions from.
Table Rule #1: Don’t be afraid to provide as much or as little information as you need.
Paragraphs can use multiple sentences to convey a series of thoughts/statements, and tables are no different. One can convey multiple pieces of information that viewers/readers can look through and analyze at their own leisure, using the data to answer their own questions, so feel free to take up the space you need. Several-page-long tables are fair game and, in many cases, absolutely necessary (although they often end up in appendices for readers/viewers needing a more in-depth take).
In my previous data visualization post, I gave this bar chart as an example of a graph trying to say too many statements:
This is a paragraph’s worth of information, and a table would represent it much better.[i] In a table, the reader/viewer can explore the values by country and year and answer whatever questions he or she might have. For example, if someone wanted to analyze how a specific country changed over time, he or she could do so easily with a table, and if he or she wanted to compare the immigration ratios between countries for a specific decade, that is possible as well. In the graph above, each country’s subsegment starts in a different place vertically in each decade column, making it hard to compare the sizes visually, and since each decade has dozens of values, the latter analysis is visually difficult to decipher as well.
But, at the same time, do not be afraid to put a sentence’s or graph’s worth of data into a table, especially when such data is central to what you are saying. Sometimes writers include one-sentence paragraphs when that single thought is crucial, and likewise, a single-statement table can have a similar effect. For example, making a table for a single variable helps convey that the variable is important:
| Gender | Some Crucial Result |
| --- | --- |
| Male | 36% |
| Female | 84% |
Now, sometimes in these single-statement instances, you might want to use a graph instead of a table (or both), which I discuss in more detail in Rule #3.
Table Rule #2: Keep columns consistent for easy scanning.
I have found that when viewers/readers scan tables, they generally subconsciously assume that all variables in a column are the same: same units and type of value. Changing the values of a column between rows can throw off your viewer/reader when he or she looks at it. For example, consider this made-up study data:
| | Control Group (n = 100) | Experimental Group (n = 100) |
| --- | --- | --- |
| Mean Age | 45 | 44 |
| Median Age | 43 | 42 |
| Male No. (%) | 45 (45%) | 36 (36%) |
| Female No. (%) | 55 (55%) | 64 (64%) |
In this table, the rows each contain different values and/or units. Going down the control column, the first row is mean age, measured in years. The second row switches to median age, a different type of value than the mean (although with the same unit of years). The final two rows convey the number and percentage of males and females in each group: both a different type of value and different units (counts and percents rather than years). This can be jarring for viewers/readers, who often expect columns to contain the same values and units and naturally compare them as if they were similar types of values.
I would recommend transposing it like this, so that the columns represent the similar variables and the rows the two groups:
| | Mean Age | Median Age (IQR) | Male No. (%) | Female No. (%) |
| --- | --- | --- | --- | --- |
| Control Group (n = 100) | 45 | 43 (25, 65) | 45 (45%) | 55 (55%) |
| Experimental Group (n = 100) | 44 | 42 (27, 63) | 36 (36%) | 64 (64%) |
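As a side note for pandas users: if your summary table already lives in a DataFrame, this transposition is one line. A minimal sketch with the numbers from the table above:

```python
import pandas as pd

# The summary table with measures as rows and groups as columns
summary = pd.DataFrame(
    {
        "Control Group (n = 100)": ["45", "43 (25, 65)", "45 (45%)", "55 (55%)"],
        "Experimental Group (n = 100)": ["44", "42 (27, 63)", "36 (36%)", "64 (64%)"],
    },
    index=["Mean Age", "Median Age (IQR)", "Male No. (%)", "Female No. (%)"],
)

# Transposing puts like values in the same column, so each column now has
# consistent units and value types
print(summary.T)
```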
Table Rule #3: Don’t be afraid to also use a graph to convey magnitude, proportion, or scale
A table like the gender table in Rule #1 conveys pertinent information numerically, but the numbers themselves do not visually show the difference between the values.
| Gender | Some Crucial Result |
| --- | --- |
| Male | 36% |
| Female | 84% |
Graphs excel at visually depicting the magnitude, proportion, and/or scale of data, so if, in this example, it is important to convey how much greater the “Some Crucial Result” is for females than males, then a basic bar graph allows the reader/viewer to see that the percent is more than double for the females.
Now, to achieve this visual clarity, the graph loses the ability to relate the exact numbers precisely. For example, looking at only this graph, a reader/viewer might be unsure whether the males are at 36%, 37%, or 38%. People have developed many graphing strategies to deal with this (making the grid lines sharper or writing the exact numbers on top of, next to, or around each bar, among others), but combining the graph and table in instances where one needs both to convey the exact numbers and to give a sense of their magnitude, proportion, or scale can also work well:
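For example, here is a minimal matplotlib sketch that writes the exact percentages from the gender table on top of the bars (bar_label requires matplotlib 3.4 or later):

```python
import matplotlib.pyplot as plt

genders = ["Male", "Female"]
results = [36, 84]  # the "Some Crucial Result" percentages from the table

fig, ax = plt.subplots()
bars = ax.bar(genders, results)
ax.set_ylabel("Some Crucial Result (%)")

# Label each bar with its exact value so the graph conveys both magnitude
# and the precise numbers
ax.bar_label(bars, fmt="%d%%")
plt.show()
```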
Finally, given that tables can convey multiple statements, feel free to use several graphs to depict the magnitude, proportion, or scale of one table. Do not try to overload a multi-statement table into a single, incomprehensible graph. Break down each statement you are trying to relate with that table and depict each separately in its own graph.
Conclusion
If graphs are sentences, then tables can function more like paragraphs, conveying a larger amount of information that makes more than one thought or statement. This gives your reader/viewer space to explore the data and interpret it on their own to answer whatever questions they have.
Newcomers to data science or artificial intelligence frequently ask me the best programming language to learn for building machine learning algorithms. I wrote this article as a reference for anyone who wants the answer to that question. These are what I consider the four most important languages, ranked in terms of usefulness based on both overall popularity within the data science community and my own personal experiences:
Best Programming Languages for Machine Learning:
#1 Choice: Python
#2 Choice: R
#3 Choice: Java
#4 Choice: C/C++
#1 Programming Language: Python
Python is the most popular language for machine learning, and for three good reasons.
First, its package-based style allows you to utilize efficient machine learning and statistical packages that others have made, preventing you from constantly reinventing the wheel for common problems. Many if not most of the best packages (like NumPy, pandas, and scikit-learn) are in Python. This almost allows you to “cheat” when programming machine learning algorithms.
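For instance, here is a minimal sketch of how few lines a trained model takes when you lean on scikit-learn’s built-in datasets and estimators:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# The heavy lifting (data structures, algorithms, metrics) comes from packages
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier().fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```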
Second, Python is a powerful and flexible all-purpose language, so if you are building a machine learning algorithm to do something, you can easily write the code for the overall product or system in which the algorithm will live without having to switch languages or software. It supports object-oriented, functional, and procedure-oriented programming styles, giving the programmer the flexibility to use whatever style, or combination of styles, they like best or fits the specific context.
Third, unlike a language like Java or C++, Python does not require elaborate setup to program a single line of code. Even though you can easily build the coding infrastructure if you need to, if you only need to run a simple command or test, you can start immediately.
When I program in Python, I personally love using Jupyter Notebook, since its interface allows me both to code and to easily present my code and findings as a report or document. Another data scientist can read and analyze my code and its output at the same time. I wish more data scientists published their papers and reports in Jupyter Notebook or other notebooks like it for this reason.
If you have time to learn a single programming language for machine learning, I would strongly recommend it be Python. The next three languages, R, Java, and C++, do not match its ease and popularity within data science.
#2 Programming Language: R
R is a popular language among statisticians: a programming language specifically tailored for advanced statistical analysis. It includes many well-developed packages for machine learning but is not as popular with data scientists as Python. For example, in Towards Data Science’s survey, 57% of data scientists reported using Python, with 33% prioritizing it, while only 31% reported using R, with 17% prioritizing it. This suggests that R is a complementary, not primary, language for data science and machine learning. Most R packages have their equivalent in Python (and, to some extent, the other way around). Unlike Python, an all-purpose language able to do wonders beyond analyzing data and developing machine learning algorithms, R is specifically tailored to statistics and data analysis and cannot do much beyond that. That said, R programmers are developing more and more packages for it, allowing it to do more and more.
#3 Programming Language: Java
Java was once the most popular language around, but Python has dethroned it in the last few years. As an avid Java programmer who programs in Java for fun, it breaks my heart to put it so far down the list, but Python is clearly a better language for data science and machine learning. If you are working in an organization or other context that still uses Java for part or all of its software infrastructure, then you may be stuck using it, but most recent developments, particularly in machine learning, have occurred in Python and R (and a few other languages). Thus, if you use Java, you will frequently find yourself unnecessarily reinventing the wheel.
Plus, one major con of Java is that quick, on-the-go analysis is not possible, since one must write a whole coding scaffold before running a single line of code. Java remains popular in certain contexts where the surrounding applications/software that utilize the machine learning algorithms are written in Java: common in finance, front-end development, and companies with established Java-based software.
#4 Programming Language: C/C++
The same Towards Data Science survey I mentioned above lists C/C++ as the second most popular data science and machine learning language after Python. Java follows closely, yet I ranked Java third and not C/C++ because I personally find Java to be a better overall language than C or C++. In C or C++, you may frequently find yourself reinventing the wheel, having to develop machine learning algorithms that others have already built in Python, but in some backend systems built in C or C++, as in engineering and electronics, you do not have much of an option. C++ also shares Java’s problem of lacking the ability to do quick, on-the-go coding without building a whole infrastructure first.
Conclusion
For a beginner to the data science scene, learning a single programming language is the most helpful way to enter the field. Use learning a programming language to assess whether data science is for you: if you struggle with and do not like programming, then developing machine learning algorithms for a living is probably not a good fit.
Many groups are trying to develop software that enables machine learning without programming: DataRobot, Auto-WEKA, RapidMiner, BigML, and AutoML, among many others. The pros and cons and successes and failures of these tools warrant a separate blog post of their own (one I intend to write eventually). As of now, though, they have not replaced programming languages, either in the practical ability to develop complex machine learning algorithms or in demonstrating that you have the technical computational/programming skills for the field.
Depending on where you work or the type of field/tasks you are doing, you might end up using the language(s) or software your team works with so that you can easily collaborate on projects, and some areas of work or types of tasks might favor certain packages and languages. But if you demonstrate that you already know a complex programming language like Python (or Java or C++), even if it is not the preferred language of the team, you will likely show any hiring manager that you can learn their specific language or software.
In a past blog post, I defined and described what machine learning is. I briefly highlighted four instances where machine learning algorithms are useful. This is what I wrote:

Autonomy: To teach computers to do a task without the direct aid/intervention of humans (e.g. autonomous vehicles)

Fluctuation: Help machines adjust when the requirements and data change over time

Intuitive Processing: Conduct or assist in tasks humans do but are unable to explain how computationally/algorithmically (e.g. image recognition)

Big Data: Breaking down data that is too large to handle otherwise

The goal of this blog post is to explain each in more detail.
Case #1: Autonomy
The first major use of machine learning centers around teaching computers to do a task or tasks without the direct aid or intervention of humans. Self-driving vehicles are a high-profile example of this: teaching a vehicle to drive (scanning the road and determining how to respond to what is around it) without the aid of, or with minimal direct oversight from, a human driver.
There are two basic types of tasks that machine learning systems might perform autonomously:
Tasks humans frequently perform
Tasks humans are unable to perform
Self-driving cars exemplify the former: humans drive cars, but self-driving cars would perform all or part of the driving process. Another example would be chatbots and virtual assistants like Alexa, Cortana, and Ok Google, which seek to converse with users independently. Such tasks might completely or partially take over the human activity: for example, some customer service chatbots are designed to determine the customer’s issue but then transfer them to a human when the issue reaches a certain complexity.
Humans have also sought to build autonomous machine learning algorithms to perform tasks that humans are unable to perform. Unlike self-driving cars, which conduct an activity many people do, people might also design a self-driving rover or submarine to drive and operate in a world that humans have so far been unable to inhabit, like other planets in our Solar System or the deep ocean. Search engines are another example: Google uses machine learning to help refine search results, which involves analyzing a massive amount of web data beyond what a human could normally do.
Case #2: Fluctuating Data
Machine learning is also a powerful tool for making sense of and incorporating fluctuating data. Unlike other types of models with fixed processes for how they predict their values, machine learning models can learn from current patterns and adjust both if the patterns fluctuate over time and if new use cases arise. This can be especially helpful when trying to forecast the future, allowing the model to decipher new trends if and when they emerge. For example, when predicting stock prices, machine learning algorithms can learn from new data and pick up changing trends to make the model better at predicting the future.
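As a minimal sketch of this kind of adaptability, here is an example using scikit-learn’s SGDRegressor, whose partial_fit method updates the model on each new batch of (simulated, made-up) data without retraining from scratch:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(random_state=0)

# Simulate data arriving in batches whose underlying trend drifts over time
slope = 2.0
for batch in range(10):
    X = rng.uniform(0, 1, size=(100, 1))
    y = slope * X.ravel() + rng.normal(scale=0.1, size=100)
    model.partial_fit(X, y)  # update on the new batch; no full retrain
    slope += 0.5  # the underlying pattern fluctuates between batches

print(f"Learned coefficient after drift: {model.coef_[0]:.2f}")
```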
Of course, humans are notorious for changing over time, so fluctuation is often helpful in models that seek to understand human preferences and behavior. For example, user recommendation systems, like Netflix’s, Hulu’s, or YouTube’s video recommendations, adjust based on usage over time, enabling them to respond to individual and/or collective changes in interests.
Case #3: Intuitive Processing
Data scientists frequently develop machine learning algorithms to teach computers how to do processes that humans do naturally but that we are unable to explain fully in computational terms. For example, popular applications of machine learning center around replicating some aspect of sensory perception: image recognition, sound or speech recognition, and so on. These replicate the process of inputting sensory information (e.g. sight and sound) and processing, classifying, and otherwise making sense of that information. Language processing, like chatbots, forms another example. In these contexts, machine learning algorithms learn a process that humans can do intuitively (see or hear stimuli and understand language) but are unable to fully explain how or why.
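As a minimal illustration, here is a sketch using scikit-learn’s built-in digits dataset: nobody writes down explicit rules for telling a handwritten 3 from an 8, but a model can learn the mapping from pixel values to digit labels:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 8x8 grayscale images of handwritten digits, flattened to 64 pixel values
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model learns the pixels-to-digit mapping we cannot state explicitly
classifier = SVC().fit(X_train, y_train)
print(f"Digit recognition accuracy: {classifier.score(X_test, y_test):.2f}")
```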
Many early forms of machine learning arose out of neurological models of how human brains work. The initial intention of neural nets, for instance, was to model our neurological decision-making process or processes. Much contemporary neurological scholarship has since disproven the accuracy of neural nets in representing how our brains and minds work.[i] But, whether or not they represent how human minds work, neural networks have provided a powerful technique for computers to process and classify information and make decisions. Likewise, many machine learning algorithms replicate some activity humans do naturally, even if the way they conduct that human task has little to do with how humans would.
Case #4: Big Data
Machine learning is a powerful tool for analyzing data that is too large to break down through conventional computational techniques. Recent computer technologies have increased the possibilities for data collection, storage, and processing, a major driver of big data. Machine learning has arisen as a major, if not the major, means of analyzing this big data.
Machine learning algorithms can manage a dizzying array of variables and use them to find insightful patterns. Many big data cases involve hundreds, thousands, or even tens or hundreds of thousands of input variables, and many machine learning techniques (like best subsets selection, stepwise selection, and lasso regression) process the myriads of variables in big data and determine the best ones to use.
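As a minimal sketch of this kind of variable selection, here is an example using scikit-learn’s Lasso on synthetic data with far more candidate variables than truly informative ones:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic "big data" setup: 1,000 candidate inputs, only 10 of which matter
X, y = make_regression(n_samples=500, n_features=1000, n_informative=10,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Lasso shrinks the coefficients of uninformative variables to exactly zero,
# effectively selecting the useful ones
lasso = Lasso(alpha=1.0, max_iter=10000).fit(X, y)
kept = np.count_nonzero(lasso.coef_)
print(f"Variables kept: {kept} of {X.shape[1]}")
```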
Recent developments in computing provide the incredible processing power necessary to do such work (and, debatably, machine learning is currently helping to push computational power forward by providing a demand for greater computational abilities). Hand calculations and the computers of several decades ago were often unable to handle the computations necessary to analyze large amounts of information. This is demonstrated, for example, by the fact that computer scientists invented the now-popular neural networks many decades ago, but they did not gain popularity as a method until recent computer processing made them easy and worthwhile to run.
Tractors and other large-scale agricultural technologies coincided historically with the enlargement of farm property sizes: such machinery not only allowed farmers to manage large tracts of land but also incentivized larger farms economically. Likewise, machine learning algorithms provide the main technological means to analyze big data, both enabling and in turn incentivized by the rise of big data in the professional world.
Conclusion
Here I have described four major uses of machine learning algorithms. Machine learning has become popular in many industries because of at least one of these functionalities, but of course, they are not the only potential uses. In addition, as we develop machine learning tools, we are constantly inventing more. Given machine learning’s newness compared to many century-old technologies, time will tell all the ways humans end up utilizing it.