Data Science Archives - Page 4 of 6

Data Scientist, Entrepreneur, and Artist: Interview with Emi Harry Part 3 of 3 (Interview #6 in the Interview Series)

This is the third part of my interview with Emi Harry as part of my Interview Series. In it, she discusses her dual identify as a data scientist and entrepreneur, including how what it takes to be an entrepreneur, her experiences starting a data science company and recommendations she has for any data scientists considering starting their own.

Emi Harry is the co-founder of Naina Tech Inc., a New York-based tech startup that is poised to launch an adaptive learning platform for early childhood education in the U.S. and Nigeria’s underserved communities. As a highly skilled data scientist and social entrepreneur, Harry is also on the board of Alula Learning, an EdTech learning management systems provider, and Manna, a health and nutrition company, both in Nigeria. She has had a diverse professional experience, having worked in the food, oil and gas, entertainment, and fashion industries in Nigeria, as well as the entertainment, non-profit, and education industries in the United States. Currently, she balances her time between working in tech, creative writing, and fashion designing.

Her educational qualifications include B.S. in Mathematics, University of Lagos, Nigeria; Master’s in Social Entrepreneurship, Hult International Business School, San Francisco; M.Sc. in Data Analytics/Science, Fordham University, New York, and is on track to earn a M.Sc. in Computer Science from Pace University New York.

Links to the first two parts of the interview:

To learn more about Emi Harry, check these out:

EPIC Data Scientists + Ethnographers Group

I recently organized a professional group called EPIC Data Scientists + Ethnographers along with a few others who are both data scientists and ethnographers. Our goal is to form a virtual community to discuss ways to incorporate ethnography and data science, just like I strive to do on this website.

If you are interested in working with others on this or simply interested in learning more, feel free to join. Whether you are both a data scientist and ethnographer, only one of them, or neither, we would love to hear your perspective.

Thank you, EPIC, for helping to develop this and giving us a platform.

Photo credit: deepak pal at https://www.flickr.com/photos/158301585@N08/46085930481/

Breaking into Tech: A Career Workshop

code projected over woman — Photo by ThisIsEngineering on Pexels.com

Earlier this week, Matt Artz, Astrid Countee, and I ran a workshop at the American Anthropological Association’s 2020 annual conference entitled “Breaking into Tech.” We discussed strategies for anthropologists interested in working in the tech world.

Here is the presentation for anyone who might find it useful but could not attend:

Download [6.86 MB]

Thank you, Astrid and Matt, for your help in developing and running this workshop.

Data Visualization 101: The Most Important Rule for Developing a Graph

I suspect everyone has seen a bad graph, a mess of bars, lines, pie slices, or what have you that you dreaded having to look at. Maybe you have even made one, which you look at today and wonder what on earth you were thinking.

These graphs violate the most basic graph-making rule in data visualization:

A graph is like a sentence, expressing one idea.

This rule applies to all uses of graphs, whether you are a data scientist, data analyst, statistician, or just making graphs for your friends for fun.

In grade school, your grammar teachers likely explained that a sentence, at its most basic, expresses on thought or idea. Graphs are visual sentences: they should state one and only one thought or idea about the data.

When you look at a graph, you should be able to say, in one sentence, what the graph is saying: such as “Group A is greater than Group B,” or “Y at first improved but is now declining.” If you cannot, then you have yourself a run-on graph.

For example, the above graph is trying to say too many statements: trying to depict the immigration patterns of twenty-two different countries over the course of nearly a century. There are likely useful statements in this data, but the representation as one graph prevents a viewer/reader from being able to easily decipher them.

Likewise, this graph shows way too many lens sizes to meaningfully express a single, coherent idea, leaving the reader/viewer struggling to determine which fields to focus on.

Potential Objection #1: But I have more to say about the data than a single statement.

Great! Then provide more than one graph. Say everything you need to say about the data; just use one graph for each of your statements.

Don’t fall into the One-Graph-to-Rule-Them-All Fallacy: trying to use one graph to express all your statements about the data that ends up a visual mess of incomprehensibility. Create multiple easy-to-read graphs where each graph demonstrates one of your points at a time. Condensing everything into one graph just prevents your viewers from determining what you have to say at all.

Bar Chart, Chart, Statistics, Analytics, Data Analytics — One-Graph-to-Rule-Them-All Fallacy: Trying to use one graph to express all your thoughts about the data that ends up a visual mess of incomprehensibility

Statistics, Graph, Chart, Data, Information, Growth — Instead, use one graph for each of your points

Potential Objection #2: I want the viewers to interpret the findings for themselves, not just impart my own ideas/conclusions.

Fair point. When presenting/communicating data, there is a time for showing your own insights and a time to open-endedly display the information for your viewers/readers to interpret for themselves. Graphs are tools for the former, and for the latter, use tables. Tables, among other potential uses, convey a wide scope of information for the reader/viewer to interpret on their own.

Remember that first example above about U.S. immigration from various parts of Europe? A table (see below) would convey that information much more easily and allow readers to track whatever places, patterns, or questions they would to learn about. Are you in a situation where you would like to report a large amount of information that your readers can use for their own purposes? Then tables are a much better starting point than graphs.

Some situations require that I lean towards sharing my insights/analysis and others towards encouraging my readers/viewers to form their own conclusions, but since most situations require a combination of the two, I generally combine graphs and tables. I try, when I can, to put smaller tables in the document or slides themselves and, when I cannot, include full tables in an Appendix.

Potential Objection #3: My main idea/point has multiple subpoints.

Many sentences have multiple subpoints needed to express the single idea as well, which does not prevent the sentence structure from meaningfully capturing those ideas. The fancy grammar word for such a subpoint is a claus. Even though some sentences are simple and straightforward with only one subject and predicate, many (like this very sentence) require multiple sets of subjects and predicates to express its thought.

Likewise, some graphical ideas require multiple subordinate or compounded subpoints, and there are types of graphs that allow this. Consider Joint Plots, like the one below. To present the relationships and combined distribution between the two variables adequately, they also display each variable’s individual distributions above and to the right. That way, the viewer can see how both distributions might be influencing the combined distribution. Thus, it displays each variable’s distribution on the side like a subordinate clause.

The darker colors in this graph signify a higher density of data points, showing the combined joint distribution of the variables.

These are advanced graphs to make, since like with multi-part sentences, one must present the subpoints carefully to make clear what the main point is. Multi-part sentences, likewise, require carefulness in how to organize multiple clauses cohesively. I intend to write a post later describing how to develop these multi-part graphs in more detail.

The general rule still applies for these more complicated graphs:

Can you summarize what the graph is saying in one coherent sentence?

If you cannot, do not use/show that graph. Our brains are very good at intuiting whether a sentence carries one thought, so use this to determine whether your graph is effective.

Photo/Graph credit #1: kreatikar at https://pixabay.com/illustrations/statistics-graph-chart-data-3411473/

Photo/Graph credit #2: Linux Screenshots at https://www.flickr.com/photos/xmodulo/23635690633/

Photo/Graph credit #3: Andrew Guyton at https://www.flickr.com/photos/disavian/4435971394/

Photo/Graph credit #4: TymonOziemblewski at https://pixabay.com/illustrations/bar-chart-chart-statistics-1264756/

Photo/Graph credit #5 (the first graph again): kreatikar at https://pixabay.com/illustrations/statistics-graph-chart-data-3411473/

Photo/Graph credit #6: Michael Waskom provides a helpful tutorial that formed the inspiration behind the random graph I created.

Resources on Integrating Data Science and Ethnography

Here is a list of resources about integrating data science and ethnography. Even though it is an up and coming field without a consistent list of publications, several fascinating and insightful resources do exist.

If there are any resources about integrating data science and ethnography that you have found useful, feel free to share them as well.

General Overviews:

Curran, John. “Big Data or ‘Big Ethnographic Data’? Positioning Big Data within the Ethnographic Space.” EPIC (2013). (Found here: https://www.epicpeople.org/big-data-or-big-ethnographic-data-positioning-big-data-within-the-ethnographic-space/)
Patel, Neal. “For a Ruthless Criticism of Everything Existing: Rebellion Against the Quantitative-Qualitative Divide.” EPIC (2013): 43-60.
Nick Seaver. “Bastard Algebra.” Boellstorff, Tom and Bill Maurer. Data, Now Bigger and Better. Chicago: Prickly Paradigm Press, 2015. 27-46.
Slobin, Adrian and Todd Cherkasky. “Ethnography in the Age of Analytics.” EPIC (2010).
Nafus, Dawn and Tye Rattenbury. Data Science and Ethnography: What’s Our Common Ground, and Why Does It Matter? 7 3 2018. <https://www.epicpeople.org/data-science-and-ethnography/>.
Nick Seaver. “The nice thing about context is that everyone has it.” Media, Culture & Society (2015).

Books:

Nafus, Dawn and Hannah Knox. Ethnography for a Data-Saturated World. Manchester: Manchester Univeristy Press, 2018.
Boellstorff, Tom and Bill Maurer. Data, Now Bigger and Better! Chicago: Prickly Paradigm Press, 2015.
Mackenzie, Adrian. Machine Learners: Archaeology of a Data Practice. Cambridge: The MIT Press, 2017.

Examples and Case Studies:

“Autonomous Drive: Teaching Cars Human Behaviour” by Melissa Cefkin on the Youtube Channel DrivingTheNation: https://www.youtube.com/watch?v=6koKuDegHAM
Eslami, Motahhare, et al. “First I “like” it, then I hide it: Folk Theories of Social Feeds.” Curation and Algorithms (2016).
Giaccardi, Elisa, Chris Speed and Neil Rubens. “Things Making Things: An Ethnography of the Impossible.” (2014).

Elish, M. “The Stakes of Uncertainty: Developing and Integrating Machine Learning in Clinical Care.” EPIC (2018).
Madsen, Matte My, Anders Blok and Morten Axel Pedersen. “Transversal collaboration: an ethnography in/of computational social science.” Nafus, Dawn. Ethnography for a Data-saturated World. Manchester: Manchester Univeristy Press, 2018.
Thomas, Suzanne, Dawn Nafus and Jamie Sherman. “Algorithms as fetish: Faith and possibility in algorithmic work.” Big Data & Society (2018): 1-11.

Articles and Blog Posts:

“An Engineering Anthropologist: Why tech companies need to hire software developers with ethnographic skills” by Astrid Countee: http://ethnographymatters.net/blog/2016/06/22/an-engineering-anthropologist-why-tech-companies-need-to-hire-software-developers-with-ethnographic-skills/
“Cross-disciplinary Insights Teams: Integrating Data Scientists and User Researchers at Spotify” by Sara Belt and Peter Gilks: https://www.epicpeople.org/cross-disciplinary-insights-teams-integrating-data-scientists-and-user-researchers-at-spotify/
“Data is a stakeholder” by Schaun Wheeler: https://towardsdatascience.com/data-is-a-stakeholder-31bfdb650af0
“Why Big Data Needs Thick Data” by Tricia Wang: https://medium.com/ethnography-matters/why-big-data-needs-thick-data-b4b3e75e3d7

My Own Articles on This Website:

Podcasts and Lectures:

“Computational Anthropology: Quali-quantitative Analyses of Attention Economies during the Covid-19 Lockdown” by Morten Axel Pedersen: https://www.material.city/recordings/mortenaxelpedersen
“Human-Driven Machine Learning with Saleema Amershi”: https://datastori.es/115-human-driven-machine-learning-with-saleema-amershi/#t=29:00.204
“Welcome to Dataworld, by Alexander Taylor”: https://player.fm/series/camthropod/episode-13-welcome-to-dataworld-by-alex-taylor
“Machine Learning for Artists with Gene Kogan”: https://datastori.es/114-machine-learning-for-artists-with-gene-kogan/#t=34:28.738

Ethical Considerations:

“Caroline Sinders on Ethical Product Design for Machine Learning”: https://design.blog/2017/03/23/caroline-sinders-on-ethical-product-design-for-machine-learning/
“The Trouble with Bias” by Kate Crawford: https://www.youtube.com/watch?v=fMym_BKWQzk
“Justice for ‘Data Janitors’” by Lilly Irani: http://www.publicbooks.org/justice-for-data-janitors/
Elish, Madeleine. “Moral Crumple Zones: Cautionary Tales in Human-Robot Interaction.” Engaging Science, Technology, and Society (2019).
boyd, danah and Kate Crawford. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication, & Society (2012): 662-679.

Four Innovative Projects that Integrated Data Science and Ethnography

In a previous article, I have discussed the value of integrating data science and ethnography. On LinkedIn, people commented that they were interested and wanted to hear more detail on potential ways to do this. I replied, “I have found explaining how to conduct studies that integrate the two practically is easier to demonstrate through example than abstractly since the details of how to do it vary based on the specific needs of each project.”

In this article, I intend to do exactly that: analyze four innovative projects that in some way integrated data science and ethnography. I hope these will spur your creative juices to help think through how to creatively combine them for whatever project you are working on.

Synopsis:

Project:	How It Integrated Data Science and Ethnography:	Link to Learn More:
No Show Model	Used ethnography to design machine learning software	https://ethno-data.com/show-rate-predictor/
Cybersensitivity Study	Used machine learning to scale up the scope of an ethnographic inquiry to a larger population	https://ethno-data.com/masters-practicum-summary/
Facebook Newsfeed Folk Theories	Used ethnography to understand how users make sense of and behave towards a machine learning system they encounter and how this, in turn, shapes the development of the machine learning algorithm(s)	https://dl.acm.org/doi/10.1145/2858036.2858494
Thing Ethnography	Used machine learning to incorporate objects’ interactions into ethnographic research	https://dl.acm.org/doi/10.1145/2901790.2901905 and https://www.semanticscholar.org/paper/Things-Making-Things%3A-An-Ethnography-of-the-Giaccardi-Speed/2db5feac9cc743767fd23aeded3aa555ec8683a4?p2df

Project 1: No Show Model

A medical clinic at a hospital system in New York City asked me to use machine learning to build a show rate predictor in order to inform an improve its scheduling practices. During the initial construction phase, I used ethnography to both understand in more depth understand the scheduling problem the clinic faced and determine an appropriate interface design.

Through an ethnographic inquiry, I discovered the most important question(s) schedulers ask when scheduling their appointments. This was, “Of the people scheduled for a given doctor on a particular day, how many of them are likely to actually show up?” I then built a machine learning model to answer this exact question. My ethnographic inquiry provided me the design requirements for the data science project.

In addition, I used my ethnographic inquiries to design the interface. I observed how schedulers interacted with their current scheduling software, which gave me a sense for what kind of visualizations would work or not work for my app.

This project exemplifies how ethnography can be helpful both in the development stage of a machine learning project to determine machine learning algorithm(s) needs and on the frontend when communicating the algorithm(s) to and assessing its successfulness with its users.

As both an ethnographer and a data scientist, I was able to translate my ethnographic insights seamlessly into machine learning modeling and API specifications and also conducted follow-up ethnographic inquiries to ensure that what I was building would meet their needs.

Project 2: Cybersensitivity Study

I conducted this project with Indicia Consulting. Its goal was to explore potential connections between individuals’ energy consumption and their relationship with new technology. This is an example of using ethnography to explore and determine potential social and cultural patterns in-depth with a few people and then using data science to analyze those patterns across a large population.

We started the project by observing and interviewing about thirty participants, but as the study progressed, we needed to develop a scalable method to analyze the patterns across whole communities, counties, and even states.

Ethnography is a great tool for exploring a phenomenon in-depth and for developing initial patterns, but it is resource-intensive and thus difficult to conduct on a large group of people. It is not practical for saying analyzing thousands of people. Data science, on the other hand, can easily test the validity across an entire population of patterns noticed in smaller ethnographic studies, yet because it often lacks the granularity of ethnography, would often miss intricate patterns.

Ethnography is also great on the back end for determining whether the implemented machine learning models and their resulting insights make sense on the ground. This forms a type of iterative feedback loop, where data science scales up ethnographic insights and ethnography contextualizes data science models.

Thus, ethnography and data science cover each other’s weaknesses well, forming a great methodological duo for projects centered around trying to understand customers, users, colleagues, or other users in-depth.

Project 3: Facebook Newsfeed Folk Theories

In their study, Motahhare Eslami and her team of researchers conducted an ethnographic inquiry into how various Facebook users conceived of how the Facebook Newsfeed selects which posts/stories rise to the top of their feeds. They analyze several different “folk theories” or working theories by everyday people for the criteria this machine learning system uses to select top stories.

How users think the overall system works influences how they respond to the newsfeed. Users who believe, for example, that the algorithm will prioritize the posts of friends for whom they have liked in the past will often intentionally like the posts of their closest friends and family so that they can see more of their posts.

Users’ perspectives on how the Newsfeed algorithm works influences how they respond to it, which, in turn, affects the very data the algorithm learns from and thus how the algorithm develops. This creates a cyclic feedback loop that influences the development of the machine learning algorithmic systems over time.

Their research exemplifies the importance of understanding how people think about, respond to, and more broadly relate with machine learning-based software systems. Ethnographies into people’s interactions with such systems is a crucial way to develop this understanding.

In a way, many machine learning algorithms are very social in nature: they – or at least the overall software system in which they exist – often succeed or fail based on how humans interact with them. In such cases, no matter how technically robust a machine learning algorithm is, if potential users cannot positively and productively relate to it, then it will fail.

Ethnographies into the “social life” of machine learning software systems (by which I mean how they become a part of – or in some cases fail to become a part of – individuals’ lives) helps understand how the algorithm is developing or learning and determine whether they are successful in what we intended them to do. Such ethnographies require not only in-depth expertise in ethnographic methodology but also an in-depth understanding how machine learning algorithms work to in turn understand how social behavior might be influencing their internal development.

Project 4: Thing Ethnography

Elise Giaccardi and her research team have been pioneering the utilization of data science and machine learning to understand and incorporate the perspective of things into ethnographies. With the development of the internet of things (IOT), she suggests that the data from object sensors could provide fresh insights in ethnographies of how humans relate to their environment by helping to describe how these objects relate to each other. She calls this thing ethnography.

This experimental approach exemplifies one way to use machine learning algorithms within ethnographies as social processes/interactions in of themselves. This could be an innovative way to analyze the social role of these IOT objects in daily life within ethnographic studies. If Eslami’s work exemplifies a way to graft ethnographic analysis into the design cycle of machine learning algorithms, Giaccardi’s research illustrates one way to incorporate data science and machine learning analysis into ethnographies.

Conclusion

Here are four examples of innovative projects that involve integrating data science and ethnography to meet their respective goals. I do not intend these to be the complete or exhaustive account of how to integrate these methodologies but as food for thought to spur further creative thinking into how to connect them.

For those who, when they hear the idea of integrating data science and ethnography, ask the reasonable question, “Interesting but what would that look like practically?”, here are four examples of how it could look. Hopefully, they are helpful in developing your own ideas for how to combine them in whatever project you are working on, even if its details are completely different.

Photo credit #1: StartupStockPhotos at https://pixabay.com/photos/startup-meeting-brainstorming-594090/

Photo credit #2: DarkoStojanovicat at https://pixabay.com/photos/medical-appointment-doctor-563427/

Photo credit #3: NASA at https://unsplash.com/photos/Q1p7bh3SHj8

Photo credit #4: Kon Karampelas at https://unsplash.com/photos/HUBofEFQ6CA

Photo credit #5: Pixabay at https://www.pexels.com/photo/app-business-connection-device-221185/

Using Data Science and Ethnography to Build a Show Rate Predictor

I recently integrated ethnography and data science to develop a Show Rate Predictor for an (anonymous) hospital system. Many readers have asked for real-world examples of this integration, and this project demonstrates how ethnography and data science can join to build machine learning-based software that makes sense to users and meets their needs.

Part 1: Scoping out the Project

A particular clinic in the hospital system was experiencing a large number of appointment no-shows, which produced wasted time, frustration, and confusion for both its patients and employees. I was asked to use data science and machine learning to better understand and improve their scheduling.

I started the project by conducting ethnographic research into the clinic to learn more about how scheduling occurs normally, what effect it was having on the clinic, and what driving problems employees saw. In particular, I observed and interviewed scheduling assistants to understand their day-to-day work and their perspectives on no-shows.

One major lesson I learned through all this was that when scheduling an appointment, schedulers are constantly trying to determine how many people to schedule on a given doctor’s shift to ensure the right number of people show up. For example, say 12-14 patients is a good number of patients for Dr. Rodriguez’s (made up name) Wednesday morning shift. When deciding whether to schedule an appointment for the given patient with Dr. Rodriguez on an upcoming Wednesday, the scheduling assistants try to determine, given the appointments currently scheduled then, whether they can expect 12-14 patients to show up. This was often an inexact science. They would often have to schedule 20-25 patients on a particular doctor’s shift to ensure their ideal window of 12-14 patients would actually come that day. This could create the potential for chaos, however, where too many patients arriving on some days and too few on others.

This question – how many appointments can we expect or predict to occur on a given doctor’s shift – became my driving question to answer with machine learning. After checking in with the various stakeholders at the clinic to make sure this was in fact an important and useful question to answer with machine learning, I started building.

Part 2: Building the Model

Now that I had a driving, answerable question, I decided to break it down into two sequential machine learning models:

The first model learned to predict the probability that a given appointment would occur, learning from the history of occurring or no-show appointments.
The second model, using the appointment probabilities from the first model, estimated how many appointments might occur for every doctors’ shift.

The first model combined three streams of data to assess the no-show probability: appointment data (such as how long ago it was scheduled, type of appointment, etc.); patient information, especially past appointment history; and doctor information. I performed extensive feature selection to determine the best subset of variables to use and tested several types of machine learning models before settling on gradient boosting.

The second model used the probabilities in the first model as input data to predict how many patients to expect to come on each doctors’ shift. I settled on a neural network for the model.

Part 3: Building an App

Next, I worked with the software engineers on my team to develop an app to employ these models in real time and communicate the information to schedulers as they scheduled appointments. My ethnographic research was invaluable for developing how to construct the app.

On the back end, the app calculated the probability that all future appointments would occur, updating with new calculations for newly scheduled or edited appointments. Once a week, it would incorporate that week’s new appointment data and shift attendance to each model’s training data and update those models accordingly.

Through my ethnographic research, I observed how schedulers approached scheduling appointments, including what software they used in the process and how they used each. I used that to determine the best ways to communicate that information, periodically showing my ideas to the schedulers to make sure my strategy would be helpful.

I constructed an interface to communicate the information that would complement the current software they used. In addition to displaying the number of patients expected to arrive, if the machine learning algorithm was predicting that a particular shift was underbooked, it would mark the shift in green on the calendar interface; yellow if the shift was projected to have the ideal number of patients, and red if already expected have too many patients. The color-coding allowed easy visualization of the information in the moment: when trying to find an appointment time for a patient, they could easily look for the green shifts or yellow if they had to, but steer clear of the red. When zooming in on a specific shift, each appointment would be color-coded (likely, unlikely, and in the middle) as well based on the probability that it would occur.

Conclusion

This is one example of a projects that integrates data science and ethnography to build a machine learning app. I used ethnography to construct the app’s parameters and framework. It tethered the app in the needs of the schedulers, ensuring that the machine learning modeling I developed was useful to those who would use it. Frequent check-ins before each step in their development also helped confirm that my proposed concept would in fact help meet their needs.

My data science and machine learning expertise helped guide me in the ethnographic process as well. Being an expert in how machine learning worked and what sorts of questions it could answer allowed me to easily synthesize the insights from my ethnographic inquiries into buildable machine learning models. I understood what machine learning was capable (and not capable) of doing, and I could intuitively develop strategic ways to employ machine learning to address issues they were having.

Hence, my dual role as an ethnography and data scientist benefitted the project greatly. My listening skills from ethnography enabled me to uncover the underlying questions/issues schedulers faced, and my data science expertise gave me the technical skills to develop a viable machine learning solution. Without listening patiently through extensive ethnography, I would not have understood the problem sufficiently, but without my data science expertise, I would have been unable to decipher which questions(s) or issue(s) machine learning could realistically address and how.

This exemplifies why a joint expertise in data science and ethnography is invaluable in developing machine learning software. Two different individuals or teams could complete each separately – an ethnographer(s) analyze the users’ needs and a data scientist(s) then determine whether machine learning modeling could help. But this seems unnecessarily disjointed, potentially producing misunderstanding, confusion, and chaos. By adding an additional layer of people, it can easily lead to either the ethnographer(s) uncovering needs way too broad or complex for a machine learning-based solution to help or the data scientist(s) trying to impose their machine learning “solution” to a problem the users do not have.

Developing expertise in both makes it much easier to simultaneously understand the problems or questions in a particular context and build a doable data science solution.

Photo credit #1: DarkoStojanovic at https://pixabay.com/photos/medical-appointment-doctor-563427/

Photo credit #2: geralt at https://pixabay.com/illustrations/time-doctor-doctor-s-appointment-481445/

Photo credit #3: Pixabay at https://www.pexels.com/photo/light-road-red-yellow-46287/

How Do I Become a Data Scientist? The Four Basic Strategies to Learn Data Science

Aspiring data scientists will frequently ask me for recommendations about the best way to learn data science. Should they try a bootcamp or enroll in an online data science course, or any of the myriad options out there?

In the last several years, we have seen the development of many different types of educational programs that teach data science, ranging from free online tutorials to bootcamps to advanced degrees at universities, and the pandemic has seemed to have fostered the establishment of even more programs to meet the increased demand for remote learning. Although probably overall a good thing, having more options increases the complexity of deciding which one to do and the potential noise of programs upselling their services.

This article is a high-level survey of the four basic types of data science education programs to help you think about which might work best for you. Without already knowing data science, it can be difficult to assess how effective a program is at teaching it. Hopefully, this article will help break that chicken-and-the-egg conundrum.

These are the four basic ways to learn data science:

Do-it-yourself learning
Online courses
Bootcamps
Master’s degree or other university degree in data science (or related field)

I will discuss them in order from the cheapest to most expensive. I also included two hybrid strategies that combine a few of these that are worth considering as well. This table provides a quick, high-level synopsis of each one:

Option 1: Do-It-Yourself Online

There are tons of free, online data science resources that can either teach data science from scratch or explain just about any data science content you could possibly want to know. These range from tutorials for those who learn by doing like W3Schools, videos on YouTube and other sites for audio learners like Andrew Ng’s YouTube series, articles for visual learners who enjoy reading like Towards Data Science. You could scour the internet and teach yourself. It has the pros of being free and perfectly flexible to tailor to your schedule.

But as a former teacher, I have found independent learning is not for everyone. You must be entirely self-motivated and self-structured to teach yourself like this. So, know yourself: are you the type of person who could learn well completely independently like this?

Education programs tend to provide these resources that you might lack if you went it alone:

1) Curriculum Oversight: Data science experts in any education program generally establish some kind of data science curriculum for you that includes the necessary topics in the field. Many people who are new to data science do not know yet what data science concepts and skills are most important to learn about. This can create a chicken and egg problem for self-learners who must learn the field at least a little to know the most important items to learn in the first place. Data science programs help circumvent this by giving you an initial curriculum to started with.

2) Guidance of the Norms of the Field: In addition to the teaching the material, education programs implicitly introduce students to data science norms and ways of thinking. Even though there are times to deviate from the established custom, they are important when first working on teams with fellow data scientists. Sometimes self-learners learn the literal material but do not gather the implicit perspectives that enables their incorporation into the data science community.

3) External Social Accountability: Education programs provide a form of social accountability that subtly encourages you to get the work done. Self-learners must rely almost exclusively on their own self-motivation and self-accountability, which, in my experience, works for some people but not others.

4) Social Resources: Education programs (especially ones that meet either in-person or virtually) provide various people – teachers, students, and in some cases mentees/underlings – with whom one can talk through problems with, help you discover your weaknesses and shortcomings, and determine ways to address them. Minute programming details that are easily overlooked by beginners, but experts might easily spot can cause your entire program to fail. To learn independently, you will have to either solve all of these yourself or find data science friends or family who are willing to help you.

5) Certification of Skills: Education programs bestow degrees, grades, and other certifications as external proof that you do, in fact, possess the requisite skills in a data science role. Learning on your own, you must prove that you have these skills to employers by yourself. Developing a portfolio of thought-provoking projects, you have done is the best way to demonstrate this.

6) Guidance in Forming Projects: An impressive project works wonders for showcasing your data science skills. In my experience, beginners to data science often do not yet possess the skills to create, complete, and market a thought-provoking yet doable project, and one of the most important roles data science educators can have is helping students think through how to develop one. You must do this yourself when learning alone.

One can overcome each of these deficits. I have found that for people who learn well independently, its cost and flexibility advantages easily outweigh these cons. Thus, the crucial question is, Would this form of independent learning work for you? In my experience, it works for a comparatively small percent of people, but for those it works for, it is a great option.

If you do decide to teach yourself, I would recommend considering the following:

1) Be conscientious about your learning style when crafting your material. For example, if you are a visual learner, then reading online material resources would be best, but if you are more of an auditory learner, then I would recommend watching video tutorials/lectures on say YouTube.

2) If you have data science friends willing to help you, they can be a great asset, particularly in determining what data science materials to learn, troubleshooting any coding issues you might have, and/or developing a good project(s).

3) People in general learn data science best by doing data science. Avoid the common trap of only reading about data science without getting your hands dirty and experimenting yourself (preferably with unclean, annoying, real-world data, not already trimmed, “textbook perfect” data). Using pristine data to first learn the concepts is fine, but make sure you graduate yourself to practicing with real-life dirty data.

Option 2: Online Course

A variety of online courses exist. Most of them are relatively cheap (usually around $20-$50 a month or $100-$200 per course). For example, at the time of writing this, Udemy has an introductory data science course for a flat rate of $94.99, and Coursera a course for $19.99 a month (both with prices varying based on discounts and other special deals). Online courses are generally the cheapest of the courses you can enroll in, and because of the length of most, you will probably have to take several levels of courses (introductory to advanced) to learn the field.

Another advantage is that they are flexible: You can learn at your own pace, based on the needs of your schedule. This is really valuable for people who also working a job and studying on the side, with family commitments, and/or other obligations complicating their schedules. Keep in mind, though, that because you often pay per month, how many months you take often dictates the final cost. At the end of the day, spending an extra $100 or so to take a few more months to complete the course is still much cheaper than the other course options.

On the other hand, however, like doing it yourself, they tend to lack the social benefits of classroom learning: instructors to ask questions to and provide external social accountability, and fellow students to work alongside. In my experience, this makes it a very challenging for some learners, but others are not as comparatively affected by it.

In addition, many online courses provide more of a cursory summary of data science and lack the complex projects that are both necessary to learn data science and to market yourself to others. Even though there are exceptions, online courses are often good at introducing data science concepts rather than an in-depth exploration. Many focus on canned problems with already cleaned, ready-to-do data instead of letting you practice on the messy, complex, and often just plain silly data most data scientists actually have to use at their jobs. They also often lack the personnel for one-on-one coaching to mentor each student through portfolio-building projects with complex data.

Thus, online courses tend to provide good, cost-effective introductions to data science, helpful to see whether you like the field (see Hybrid #1 below), but do not generally provide the refined training necessary to become a data scientist. Now, some programs are evolving their courses. Especially as the pandemic increases demand for remote learning, online learning platforms are developing more robust online data science courses. If you choose to learn by taking online courses, I recommend supplementing it with your own projects to get experience practicing data science work and showcase in job interviews.

Hybrid #1: Use an Online course to Introduce Data Science (or Programming)

If you are completely new to data science, an online course can provide a low-cost, structured space to get a sense for what the field entails and determine whether it is a good fit for you. I have seen many people enroll in several thousand-dollar bootcamps or university degree programs only to learn there that they do not like doing data science work. An online course is a much cheaper space to discern that.

You could always explore data science yourself for free to decide whether you like it (see Option 1) instead of taking an online course, but I have found that many people who have never seen data science before do not know what to look up in the field to get started. An introductory online course is not that expensive, and the initial orientation into the major topic areas can be well worth the cost.

There are three basic versions of this approach:

1) If you do not already know a programming language, take an online programming course. I explained in this article why I would recommend Python as the language to learn (with Julia as a close second). If you do not like programming, then you have learned the lesson that you should not become a data scientist, and even if you do not end up in data science, programming is such a valuable skill that having some training in it will only help your occupational prospects in most other related fields.

2) If you do know a programming language, take an introductory data science course. These often provide a high-level overview of data science, especially helpful for people who need to work with data scientists and understand what they are talking about. If you need a math refresher, this is a great option as well.

3) I have seen prospective data scientists take online data analytics courses to prepare them for and determine their potential interest in data science. I would not recommend this, however. Even though data scientists will sometimes treat data analytics as a “diet” or “basic” version of data science, data analytics is different field requiring different skills. For example, data analytics courses typically do not include the rigorous programming. They generally focus on R and SQL if they teach programming at all, which are fine languages for data analytics and statistics but not enough for data science (for which you would want a language like Python). Data analytics and data science also generally emphasize different fields of math: data analytics tends to rely on statistics while data science on linear algebra, for example. Thus, what you would learn in those courses would not apply to data science as much as you would think. Now, if you are unsure of whether you would like to become a data scientist or data analyst, then a data analytics course might help you understand and get a feel for data analytics, but I would not use them to assess whether data science is a good fit for you.

Once you complete the online course, if you still think you would enjoy doing data science work, then you can choose any of the options to learn the field in more depth. This may seem like just getting you back to square one, but by taking an introductory programming or data science course, you have levelled yourself up so to speak and are more ready to face the “boss battle” of becoming a data scientist.

Option 3: Data Science Bootcamps

Data science bootcamps have also become popular. They tend to be several weeks long (in my experience often ranging from 2 to 6 months) intensive training programs. The traditional pre-pandemic bootcamp was in-person and would often cost around $10,000 to $15,000. Metis’s bootcamp is a good example of what they often look like.

Their biggest pros are that they offer the advantages of classroom education far more cheaply and in much less time than getting a university degree. They are a significant step-up cost-wise than the previous options (see Con 2 below), but they seek to provide a comparable (but less academically advanced and in-depth) scope of knowledge as a master’s degree in data science for a significantly lower price and in a fraction of the time. Even though it can often make their pace feel intense, the good bootcamps tend to mostly succeed at providing this. This makes them a great option for anyone who knows they want to become a data scientist. Finally, unlike the previous options, you get a teacher(s) to ask questions to and motivate you, and a set of fellow students to struggle through concepts with. The best programs offer the occupational coaching and build strong networks in data science communities to help their students find jobs afterwards.

They have some major cons, however:

1) They can feel fast-paced, unloading complex concepts in a short amount of time. Many of my friends who have done bootcamps have reported feeling cognitive whiplash. Expect those weeks/months to be mentally intense and to subsume your life. Data science bootcamps are often 9-5 full-time jobs during that time, and you will likely be too mentally exhausted to work on other things in the evenings or weekend (plus in some cases you will have homework to complete then anyways). A few weeks or months is not terribly long for such an ordeal, but it makes them much less flexible than the previous options. For example, this forces many students to take time from their current jobs to complete the bootcamp and to limit their social, familial, and other obligations as much as they can during their bootcamp. This makes it difficult for anyone unable to take time off work, with busy social or familial lives, or otherwise with a lot going on.

2) At several thousand dollars, they are clearly noticeably more expensive of the than the previous options (but still much cheaper than universities). Some offer scholarships and other services on a need-basis, but even then, the opportunity cost of having to put a job on hold can still be expensive. Given their general high salaries, landing a data science job would likely make the money back, but it takes a hefty initial investment.

This makes it an especially poor option for anyone thinking about data science but not sure whether they want to do it. $10,000 is a lot to spend to simply learn you do not like the field, and there are many cheaper ways to initially explore the field (see especially Hybrid #1). The cost still might be worth it, however, for anyone who really wants to become a data scientist but does not yet possess key skills and knowledge.

3) At the time of writing this, the Covid-19 pandemic has forced most data science bootcamps to meet remotely anyways, making their services far more similar to the much cheaper online courses. That said, many have sought to simulate the classroom environment virtually, trying to provide some type of social environment, but the classroom environment was a major advantage that made their significant increase in costs over the previous options worthwhile.

4) They tend to exist in large cities (especially tech centers). For example, bootcamps in the United States tend to concentrate in New York City, Los Angeles, Chicago, San Francisco, etc. Prior to the pandemic, anyone not living in those places would have to travel and temporarily reside in wherever their chosen bootcamp was, an additional expense.

5) They are often difficult for people who do not know programming and for those who do not know college-level mathematics like linear algebra, calculus, and statistics. If you do not know programming, I would recommend learning a programming language like Python (for more see this article I wrote explaining why to learn Python of all languages) through either a cheap online course and/or online tutorials first. Some data science bootcamps offer a preparatory introduction online course that teaches the prerequisite coding and math skills for those who do not understand it. They are worth consideration as well, but keep in mind the equivalent online course might be cheaper with roughly the same educational value.

If you decide to do a bootcamp, these criteria are important when researching which bootcamp to choose:

1) Project Orientation: How well do they enable you to practice data science through portfolio-building projects, and how impressive are the projects its alum did? The best data science bootcamps are generally teach in a project-oriented fashion.

2) Job-Finding Resources and/or Job Guarantee: What resources or coaching do they give to help you find a job afterwards? Help networking, presenting yourself, and interviewing, for example, are important skills to finding a job as a data scientist, and in addition to teaching you technical curriculum, the best programs tend to find occupational coaches to help specifically with the job-finding process. Also, some programs give a job guarantee: if you do not find a data science job after a certain number of months after graduating then they refund tuition. This generally shows they take job finding important enough to risk their own money on it (although do check at the fine print on the guarantee to see the exact terms they are agreeing to).

3) Alum Resources: A surprisingly import detail to consider is how much resources a bootcamp invests in cultivating alumni networks. I was surprised by how receptive to meeting/networking alum of the online bootcamp I did, and how satisfied alum tend to be with the bootcamp. The effort a bootcamp makes to work with and maintain relationships with its alum impact this significantly. Connectedness with alum can be difficult to assess when researching programs from afar, but asking whether you can speak with alum(s) to learn about their experiences with the program, checking a bootcamp’s alum activity on LinkedIn and other social media websites, and asking about what kind of networking opportunities with alum they facilitate can be great ways to assess how intentional a program is about cultivating relationships.

4) Scholarship Options: Some programs offer full or at least partial scholarships based on need. Clearly, ways to knock down the cost of the bootcamp would be great, especially if a bootcamp seems like an ideal option for you, but the cost seems too daunting.

Hybrid #2: Online Bootcamp

Online bootcamps tend to possess the schedule flexibility of online courses but offer more rigorous, personal (albeit remote) learning, allowing you to combine the best of aspects of data science bootcamps and online programs. They are also generally cheaper than traditional bootcamps (yet also more expensive than an online course). Finally, they tend to be a much better option for those who do not live in a major city that happens to have a local data science bootcamp program. The pandemic, if anything, has probably helped produce even more online bootcamp programs, since it has forced data science bootcamps to teach virtually.

I enrolled in Springboard’s online data science bootcamp in 2017, a great example of an online bootcamp. At the time, they cost roughly $1,000 a month (at the time of writing their standard rate is $1,490 a month and state their program generally takes six months). This is cheaper than traditional bootcamps but still a few totaling around $10,000 for six months. They had online curriculum typical of online courses but also provided weekly virtual meetings with an instructor to discuss the material and any issues you are having. Now they seem to include virtual lessons online. This individualized training and remote classroom environment are the main value adds over an online course, and you must assess whether, for you, they would be worth the additional cost. They are self-paced, providing much greater flexibility on when and how often you work than typical bootcamps. They also refunded your money if you did not find a job in six months after completion.

If you choose this option, be aware of the potential pitfalls of both online courses and traditional bootcamps. Just like with online programs, you will need to evaluate whether you are comfortable learning the curriculum by yourself (even you can meet with a mentor for major issues once a week, you would be doing the bulk learning by yourself throughout the week). Like with traditional bootcamps, expect the learning to be mentally intense and make sure they help you develop portfolio-building projects and provide job-finding resources and training.

Option 4: Master’s Degree or Other University Degree

The final option is to go back to school to get a degree in data science. This is the most expensive and time-consuming option: a master’s degree (a logical choice if you already have a bachelor’s) is generally the shortest, taking two years. But they cost upwards of $100,000. Even if partial or full scholarships decrease that cost, the opportunity cost of spending several years of your life in school is still higher than any of the other options. It can give a resume boost, however, if you know how to leverage it properly, which will likely increase your salary to make up for the initial cost. I would only recommend getting a master’s degree if you already know you love data science (say because you have already been working in the field, preferably if you also have already figured out the specific area of data science you want to do) but want to take your skills, technique, and/or theoretical knowledge of how the models work to the next level.

The best way to refine your data science skills is by doing data science: finding or creating contexts to push you as you practice data science. Graduate schools are not the only potential environment to refine one’s data science skills (e.g., all the previous options could involve that if done well), and even though graduate schools can be great at providing rigor, these other options can be a lot cheaper and more flexible. Finally, at the time of writing this, at least, the demand for data scientists exceeds the number of actual people in the field, and so getting a data science job without an “official” university degree in data science is pretty realistic.

University data science degree programs are relatively new – generally only a few years old. Thus, not all universities have literal data science degrees or departments but instead require that you enroll in a related program like computer science, statistics, or engineering to learn data science. This does not always mean these other programs are bad or unhelpful, but it often means you will have to perform extraneous or semi-extraneous tasks to data science proper in order to complete your degree (in some cases with minimal help from faculty from other fields).

When considering a program, you should make sure they are proactive about teaching professional and not just academic data science skillsets. These are the specific questions I would research to assess how well they might prepare you for non-academic data science jobs:

1) What proportion of their faculty currently work or at least have worked in the industry as a data scientist (or other similar job title)?

2) How well connected is the department with local organizations, and might they be able to leverage these relationships to help you work with these organizations through a work-study program or internship during the program and/or employment afterwards?

3) Will they help you build – or at least give you the flexibility to build – one’s thesis into an applied data science project that would boost your resume to future employers?

If your chosen program lacks these, I would strongly recommend building resume/portfolio-boosting projects and networking with local data scientists on the side while completing the program. This takes considerable time and energy, so ideally your department would actively help you in this work, instead of requiring that you do it on your own while also completing all their work.

Funding options is something else to consider. Are they willing to fund your degree fully or at least partly? Work-study programs where you work while getting your master’s can be a great way to graduate with no debt and gain resume-building work experiences (although they can make you busy). I benefitted greatly from working as a data scientist while completing my master’s, both because I graduated with no debt and because it allowed me to practice and refine my skills.

Finally, most universities require that you live nearby and attend physically (at least before and likely after the pandemic). Thus, you might have to find a place near you or be willing to relocate for a few years if there is not a data science degree program nearby. If so, you should factor moving expenses into the cost of doing the program.

Conclusion

Learning data science can be an awesome yet daunting prospect, and finding the right strategy for you is complicated, particularly given all the pedagogical, logistical, and financial considerations. Hopefully, this article has helped you think through how to journey forward.

Photo credit #1: geralt at https://pixabay.com/photos/woman-programming-glasses-reflect-3597101/

Photo credit #2: Anastase Maragos at https://unsplash.com/photos/OaFESrP2hhw

Photo credit #3: mohamed_hassan at https://pixabay.com/photos/training-course-3207841/

Photo credit #4: Jukan Tateisi at https://unsplash.com/photos/bJhT_8nbUA0

Photo credit #5: heylagostechie at https://unsplash.com/photos/IgUR1iX0mqM

Photo credit #6: Brooke Cagle at https://unsplash.com/photos/WHWYBmtn3_0

Photo credit #7: A_Ginard at https://pixabay.com/photos/architecture-modern-buildings-5084075/

Data Visualization 102: The Most Important Rules for Making Data Tables

In a previous post about data visualization in data science and statistics, I discussed what I consider the single most important rule of graphing data. In this post, I am following up to discuss the most important rules for making data tables. I will focus on data tables in reporting/communicating findings to others, as opposed to the many other uses of tables in data science say to store, organize, and mine data.

To summarize, graphs are like sentences, conveying one clear thought to the viewer/reader. Tables, on the other hand, can function more like paragraphs, conveying multiple sentences or thoughts to get an overall idea. Unlike graphs, which often provide one thought, tables can be more exploratory, providing information for the viewer/reader to analyze and draw his or her own conclusions from.

Table Rule #1: Don’t be afraid to provide as much or as little information as you need.

Paragraphs can use multiple sentences to convey a series of thoughts/statements, and tables are no different. One can convey multiple pieces of information that viewers/readers can look through and analyze at his or her own leisure, using the data to answer their own questions, so feel free to take up the space as you need. Several page long tables are fair game and, in many cases, absolutely necessary (although often end up in appendices for readers/viewers needing a more in-depth take).

In my previous data visualization post, I gave this bar chart as an example of trying to say too many statements for a graph:

This is a paragraphs-worth of information, and a table would represent it much better.[i] In a table, the reader/viewer can explore the table values by country and year themselves and answer whatever questions he or she might have. For example, if someone wanted to analyze how a specific country changed overtime, he or she could do so easily with a table, and/or if he or want to analyze compare the immigration ratios between countries of a specific decade, that is possible as well. In the graph above, each country’s subsegment starts in a different place vertically for each decade column, making it hard to compare the sizes visually, and since each decade has dozens of values, that the latter analysis is visually difficult to decipher as well.

But, at the same time, do not be afraid to convey a sentence- or graphs-worth of data into a table, especially when such data is central for what you are saying. Sometimes writers include one-sentence paragraphs when that single thought is crucial, and likewise, a single statement table can have a similar effect. For example, writing a table for a single variable does helps convey that that variable is important:

Gender	Some Crucial Result
Male	36%
Female	84%

Now, sometimes in these single statement instances, you might want to use a graph instead of a table (or both), which I discuss in more detail in Rule #3.

Table Rule #2: Keep columns consistent for easy scanning.

I have found that when viewers/readers scan tables, they generally subconsciously assume that all variables in a column are the same: same units and type of value. Changing values of a column between rows can throw off your viewer/reader when he or she looks at it. For example, consider this made-up study data:

	Control Group (n = 100)	Experimental Group (n = 100)
Mean Age	45	44
Median Age	43	42
Male No. (%)	45 (45%)	36 (36%)
Female No. (%)	55 (55%)	64 (64%)

In this table, the rows each mean different values and/or units. So, for example, going down the control column, the first column is mean age measured in years. The second column switches to median age, a different type of value than mean (although the same unit of years). The final two rows convey the number and percentages of males and females of each: both a different type of value and a different unit (number and percent unlike years). This can be jarring for viewers/readers who often expect columns to be of the same values and units and naturally compare them as if they are similar types of values.

I would recommend transposing it like this, so that the columns represent the similar variables and the rows the two groups:

	Mean Age	Median Age (IQR)	Male No. (%)	Female No. (%)
Control Group (n = 100)	45	43 (25, 65)	45 (45%)	55 (55%)
Experimental Group (n = 100)	44	42 (27, 63)	36 (36%)	64 (64%)

Table Rule #3: Don’t be afraid to also use a graph to convey magnitude, proportion, or scale

A table like the gender table in Rule #1 conveys pertinent information numerically, but numbers themselves do not visually show the difference between the values.

Gender	Some Crucial Result
Male	36%
Female	84%

Graphs excel at visually depicting the magnitude, proportion, and/or scale of data, so, if in this example, it is important to convey how much greater the “Some Crucial Result” is for females than males, then a basic bar graph allows the reader/viewer to see that the percent is more than double for the females than for the males.

Now, to convey this visual clarity, the graph loses the ability to precisely relate the exact numbers. For example, looking at only this graph, a reader/viewer might be unsure whether the males are at 36%, 37%, or 38%. People have developed many graphing strategies to deal with this (ranging from making the grid lines sharper, writing the exact numbers on top of, next to, or around the segment, among others), but combining the graph and table in instances where one both needs to convey the exact numbers and to convey a sense of their magnitude, proportion, or scale can also work well:

Finally, given that tables can convey multiple statements, feel free to use several graphs to depict the magnitude, proportion, or scale of one table. Do not try to overload a multi-statement table into a single, incomprehensible graph. Break down each statement you are trying to relate with that table and depict each separately in a single graph.

Conclusion

If graphs are sentences, then tables can function more like paragraphs, conveying a large amount of information that make more than one thought or statement. This gives space for your reader/viewer to explore the data and interpret it on their own to answer whatever questions they have.

Photo/Table credit #1: Mika Baumeister at https://unsplash.com/photos/Wpnoqo2plFA

Photo/Table credit #2: Linux Screenshots at https://www.flickr.com/photos/xmodulo/23635690633/

[i] Unfortunately, I do not have the data myself that this chart uses, or I would make a table for it to show what I mean.

The Best Programming Languages for Data Science and Machine Learning

Newcomers to data science or artificial intelligence frequently ask me the best programming language to learn to build machine learning algorithms. Thus, I wrote this article as a reference for anyone who wants to know the answer to that question. These are what I consider the three most important languages, ranked in terms of usefulness based on both overall popularity within the data science community and my own personal experiences:

Best Programming Languages for Machine Learning:

#1 Choice: Python

#2 Choice: R

#3 Choice: Java

#4 Choice: C/C++

#1 Programming Language: Python

Python is the most popular language to use for machine learning and for three good reasons.

First, it’s package-based style allows you to utilize efficient machine learning and statistical packages that others have made, preventing you from having to constantly reinvent the wheel for common problems. Many if not most of the best packages (like NumPy, pandas, scikit learn, etc.) are in Python. This almost allows you to “cheat” when programing machine learning algorithms.

Second, Python is a powerful and flexible all-purpose language, so if you are building a machine learning algorithm to do something, then you can easily build the code for the other overall product or system in which you will use the algorithm without having to switch languages or softwares. It supports object-oriented, functional, and procedure-oriented programming styles, giving the programmer flexibility in how to code, allowing you to use whatever style or combination of various styles you like best or fits the specific context.

Third, unlike a language like Java or C++, Python does not require elaborate setup to program a single line of code. Even though you can easily build the coding infrastructure if you need to, if you only need to run a simple command or test, you can start immediately.

When I program in Python, I personally love using Jupyter Notebook, since its interface allows me to both code and to easily show my code and findings as a report or document. Another data scientist can simultaneously read and analyze my code and its output at the same time. I personally wish more data scientists published their papers and reports in Jupyter Notebook or other notebooks like it because of this.

If you have time to learn a single programming language for machine learning, I would strongly recommend it be Python. The next three languages, R, Java, and C++, do not match its ease and popularity within data science.

#2 Programming Language: R

R is a popular language for statisticians, a programming language that is specifically tailored for advanced statistical analysis. It includes many well-developed packages for machine learning but is not as popular with data scientists as Python. For example, in Towards Data Science’s survey, 57% of data scientists reported using Python, with 33% prioritizing it, and only 31% reported using R, with 17% prioritizing it. This seems to show that R is a complementary, not primary language for data science and machine learning. Most R packages have their equivalent in Python (and to some extent the other way around). Unlike Python, which is an all-purpose language, able to do other wonders other than analyzing data and developing machine learning algorithms, R is specifically tailored to statistics and data analysis, not able to do much beyond that. Saying this, though, R programmers are increasingly developing more and more packages for it, allowing it to do more and more.

#3 Programming Language: Java

Java was once the most popular language around, but Python has dethroned it in the last few years. As an avid Java programmer who programs in Java for fun, it breaks my heart to put it so far down the list, but Python is clearly a better language for data science and machine learning. If you are working in an organization or other context that still uses Java for part or all of its software infrastructure, then you may be stuck using it, but most recent developments, particularly in machine learning, have occurred in Python and in R (and a few other languages). Thus, if you use Java, you’ll frequently find yourself having to unnecessarily reinventing the wheel.

Plus, one major con of Java is that conducting quick, on-the-go analysis is not possible, since one must write a whole coding system before one can do a single line of code. Java can be popular in certain contexts, where the surrounding applications/software that utilize the machine learning algorithms are in Java, common in finance, front-end development, and companies that have been using Java-based software.

#4 Programming Languages: C/C++

The same Towards Data Science survey I mentioned above lists C/C++ as the second most popular data science and machine learning language after Python. Java follows them closely, yet I included Java and not C/C++ as third because I personally find Java to be a better overall language than C or C++. In C or C++, you may frequently find yourself reinventing the wheel – having to develop machine learning algorithms that others have already built in Python – but in some backend systems that have been built C or C++ like in engineering and electronics, you do not have much of an option. C++ has a similar problem with Java as well: lacking the ability to do quick on-the-go coding without having to build a whole infrastructure.

Conclusion

For a beginner to the data science scene, learning a single programming is the most helpful way to enter the field. Use learning a programming language to assess whether data science is for you: if you struggle and do not like programming, then developing machine learning algorithms for a living is probably not a good fit for you.

Many groups are trying to develop softwares that enable machine learning without having to program: DataRobot, Auto-WEKA, RapidMiner, BigML, and AutoML, among many others. The pros and cons and successes and failures of these softwares warrants a separate blog post to itself (one I intend to write eventually). As of now though, these have not replaced programming languages in either practical ability to develop complex machine learning algorithms and in demonstrating that you have the technical computational/programming skills for the field.

For a beginner to the data science scene, learning a single programming is the most helpful way to enter the field. Use learning a programming language to assess whether data science is for you: if you struggle and do not like programming, then data science where you would be developing machine learning algorithms for a living is probably not a good fit for you. Depending on where you work or type of field/tasks you are doing, you might end up using the language(s) or software(s) your team works with so that you can easily work jointly on projects with them. For some areas of work or tasks might prefer certain packages and languages. If you demonstrate that you can already know a complex programming language like Python (or Java or C++), even if that is not the preferred language of their team, then you will likely demonstrate to any hiring manager that you can learn their specific language or software.

Photo credit #1: ThisIsEngineering at https://www.pexels.com/photo/woman-coding-on-computer-3861958/

Photo credit #2: Hitesh Choudhary at https://unsplash.com/photos/D9Zow2REm8U

Photo credit #3: thekirbster at https://www.flickr.com/photos/kirbyurner/30491542972/in/photolist-MQRUEh-2g3E1wf-Nsr8q9-HDKJxu-22VkHJU-2bWRXY2/lightbox/ (Yes, even though it is cool looking, this is not my code.)

Photo credit #4: Steinar Engeland at https://unsplash.com/photos/WDf1tEzQ_SY

Photo credit #5: Markus Spiske at https://unsplash.com/photos/jUWw_NEXjDw

Share this:

Share this:

Share this:

Potential Objection #1: But I have more to say about the data than a single statement.

Potential Objection #2: I want the viewers to interpret the findings for themselves, not just impart my own ideas/conclusions.

Potential Objection #3: My main idea/point has multiple subpoints.

Share this:

Share this:

Synopsis:

Project 1: No Show Model

Project 2: Cybersensitivity Study

Project 3: Facebook Newsfeed Folk Theories

Project 4: Thing Ethnography

Conclusion

Share this:

Part 1: Scoping out the Project

Part 2: Building the Model

Part 3: Building an App

Conclusion

Share this:

Option 1: Do-It-Yourself Online

Option 2: Online Course

Hybrid #1: Use an Online course to Introduce Data Science (or Programming)

Option 3: Data Science Bootcamps

Hybrid #2: Online Bootcamp

Option 4: Master’s Degree or Other University Degree

Conclusion

Share this:

Table Rule #1: Don’t be afraid to provide as much or as little information as you need.

Table Rule #2: Keep columns consistent for easy scanning.

Conclusion

Share this:

#1 Programming Language: Python

#2 Programming Language: R

#3 Programming Language: Java

#4 Programming Languages: C/C++

Conclusion

Share this: