Why Python Is the Best Language to Use for Data Science

February 03, 2022 in Python Articles

Written by Gunnar Kleemann


While there are multiple languages that data scientists can use, Python has essential advantages for data science. Each language has its specific history, purpose, strengths, and style. The choice of language should be informed by the type of work being done now and in the future. In this post, I make the (somewhat opinionated) case for why Python is the best choice as the primary language for data science coding shops.

In a word, Python gives data scientists leverage. I detail the features of the language that convey this benefit later in this post. However, this XKCD comic makes the point well and is only a slight exaggeration.

[XKCD comic: "Python"]

Python is used for data science because it gives unmatched leverage to collect data, process data, and deliver insights from data. Furthermore, code written in the "Pythonic" style is clear, readable, and easily shared. This encourages the iteration and exploration that lead to insights.

Python is a One-Stop Shop for Data Science Tasks

There are numerous languages that can be used to perform data science and advanced analytics, so what makes Python so well suited to the data scientist? To answer this, think about what a data scientist does. Data science is the general practice of taking data and using advanced analytics to derive insight. The generality comes from the varied data sources and knowledge domains in which we operate. As the saying goes, a data scientist is part developer, part statistician, and part data engineer.

 

[Venn diagram: the data scientist as part developer, part statistician, and part data engineer]

The generality and the technical requirements mean that the data scientist must be prepared to perform every step of the analytics process. They should be able to move from data to insight, and to work in unfamiliar domains. The data scientist needs a tool that can do all pipeline steps well. Other tools like SPSS, SAS, and R are highly refined for statistics, but they suffer from deficiencies at other stages of analysis. Python is not refined for a single task, but it does all tasks well. The only constant in data science is that we are tasked with every step of analysis in all domains. Python makes this possible.

Python Gets Out of the Way so You Can Get to Analysis

Python is lean and powerful; it gets you right to the analysis stage. Other languages, like C and Java, demand more ceremony: in C, you must specify memory allocation yourself, and in Java and Scala, the programmer needs to track object types carefully. In Python, many of these details are handled in the background; the language abstracts the user away from the gory details of the underlying computational backend. This level of abstraction does come with risks, such as memory allocation that is less than optimal or an object that ends up the wrong type, but these issues can be fixed when they become essential. For the data scientist, who is mainly focused on analyzing data, this is a good trade-off.
Python lets the data scientist write lean and readable code that directly connects data to insight.
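
As a minimal sketch of that trade-off (the numbers here are invented), notice that nothing below declares a type or allocates memory:

# no declarations or memory management; Python infers types at runtime
views = [120, 85, 301, 42]
mean_views = sum(views) / len(views)

# the same name can later hold a different type; flexible, but the kind
# of error that C or Java would catch at compile time
mean_views = f"mean views: {mean_views:.1f}"
print(mean_views)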

Python is a Developer Language

One of the most apparent advantages of Python is that it is a developer language. This is not to say it is the best developer language, but any functionality a developer normally expects is available and robust. Python can handle most data types and database connections. Python even ships with SQLite, which can be used without a separate database server and deployed as part of your program. Most API services have a Python client, and if there is no client, you can use the requests library to call the API endpoints directly.
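
As a minimal sketch of both claims (the table, values, and endpoint here are purely illustrative), the standard library plus requests gets you a database and an API call in a dozen lines:

import sqlite3

import requests

# SQLite ships with Python; no separate database server is needed
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE posts (language TEXT, views INTEGER)')
conn.execute("INSERT INTO posts VALUES ('python', 120)")
print(conn.execute('SELECT * FROM posts').fetchall())

# no client library? call the endpoint directly with requests
resp = requests.get('https://api.stackexchange.com/2.3/info',
                    params={'site': 'stackoverflow'})
print(resp.json())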

In Python, dependency issues are well-handled by package managers like conda and pip. In addition, virtual environments and conda environments let the user maintain multiple versions of Python, each with a different suite of dependencies. Setting up multiple versions of R is much clumsier and more error-prone.
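
For example (the environment name and version pins here are hypothetical), isolating an older Python for a legacy project is a couple of terminal commands:

    $ conda create -n legacy_model python=3.8 pandas=1.3
    $ conda activate legacy_model
    $ python --version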

Python has well-supported packages for delivering analytical endpoints and results to stakeholders and customers. Since data science is about communicating insights derived from data, these packages are major levers. Interactive graphics and dashboards can be made with HoloViews (Panel) and Plotly (Dash), and web apps can be built on the Flask and Django frameworks.
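
As a sketch of how little code a delivery endpoint takes (the route and payload below are invented for illustration), a Flask service can expose an analytical result in about ten lines:

from flask import Flask, jsonify

app = Flask(__name__)

# hypothetical endpoint serving a precomputed metric to stakeholders
@app.route('/metrics/questions-per-day')
def questions_per_day():
    return jsonify({'language': 'python', 'questions_per_day': 812})

if __name__ == '__main__':
    app.run(port=5000)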

Other features make Python a natural fit for developer workflows. Packages like PySpark and Dask allow scalable analytics and cluster management. The main point, however, is that Python has momentum in developer communities; if a developer-facing process or package lacks a Python implementation, that is such a disadvantage that one will be produced in short order.
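
To give a flavor of that scaling (the file pattern and column names are hypothetical), Dask mirrors the pandas API across many partitions:

import dask.dataframe as dd

# lazily reads a directory of CSVs as one logical dataframe
df = dd.read_csv('logs/events-*.csv')

# the same groupby syntax as pandas; .compute() triggers the parallel run
daily_means = df.groupby('date')['duration'].mean().compute()
print(daily_means.head())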

Python Offers Excellent Data Visualization Tools

Python facilitates storytelling with an increasingly diverse set of data visualization tools. We can use matplotlib for fine-grained control of plots. matplotlib.rcParams ("rc" stands for runtime configuration) is a dictionary-like object that allows us to override the default plot aesthetics. This is particularly useful for specifying report or company-themed aesthetics. Try this code to see the variety of plotting preferences you can customize:

import matplotlib

matplotlib.rcParams
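
As a sketch of a company-themed setup (the specific values here are arbitrary), a few assignments restyle every subsequent plot:

import matplotlib

# override the defaults once; every later plot inherits the house style
matplotlib.rcParams['figure.figsize'] = (8, 4)
matplotlib.rcParams['axes.titlesize'] = 14
matplotlib.rcParams['font.family'] = 'serif'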

A collection of more stylized and powerful data visualization packages is built on top of the matplotlib scaffolding. For easy and aesthetically pleasing plots, we use Seaborn and plotnine (a Python port of R's ggplot2). These packages abstract away the detailed control afforded by matplotlib and give great results in a few lines of code. For interactive web-based visualizations, we have Bokeh and Plotly. Plotly goes beyond web-based visuals with Dash, which allows the user to make public or private dashboards. With a license, dashboards can be seamlessly deployed on the Plotly servers, further reducing the overhead associated with web app deployment and security.
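
To see how few "a few lines" can be, here is a minimal Seaborn example using the tips demo dataset that ships with the library:

import seaborn as sns

tips = sns.load_dataset('tips')  # small demo dataset bundled with seaborn
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='day')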

There are many other data visualization packages, but the HoloViews family of tools is worth discussing. HoloViews is part of a family of tools built by Anaconda. It includes datashader, which renders huge datasets as shaded rasters instead of individual points; this specifically addresses big data and can be used to make some stunning visuals. The same family introduced Panel, a dashboarding framework. The data visualization capacity of Python is formidable and still developing rapidly.
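
A minimal datashader sketch (with synthetic data standing in for a real big dataset) looks like this:

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# ten million random points; far too many for an ordinary scatter plot
df = pd.DataFrame({'x': np.random.standard_normal(10_000_000),
                   'y': np.random.standard_normal(10_000_000)})

canvas = ds.Canvas(plot_width=800, plot_height=400)
agg = canvas.points(df, 'x', 'y')  # bin the points onto the canvas
img = tf.shade(agg)                # shade each bin by point density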

Python Demo

Because it is easy to argue a case for anything in words, the careful reader might demand more. What is the proof that Python gives you the leverage to do important data science with minimal fuss?

Let me show you...

Throughout this article, I have made the point that Python just works. Dream up a data science application, and with a few lines of clear and readable code, we can create an analysis that moves your team forward.

Let's demonstrate how easily we can go from a simple question to a data-backed analysis. The first question that comes to mind: How popular is Python relative to other data science tools? Is this popularity changing year over year?

To measure this, we can look at a coding forum and ask how many questions are posted for each language, and what percentage of those questions are answered. This piece of data science intelligence can be gathered quickly with a few lines of Python. We might use it to validate already published insights, but we can also change the code interactively and look at the question from new angles as we think of them.

Jupyter notebooks let us annotate the code in Markdown, so a notebook can read much more like a paper than like raw code. R has similar R Markdown (.Rmd) files, but these must be knit, and knitting fails unless all of the code runs. This leads to time lost troubleshooting and slower iteration.

The use of Conda environments makes the installation of totally new packages easy and quick. Conda environments make spinning up new projects very dynamic; they avoid costly dependency incompatibilities and can be simply removed when we are done with them.

Let's start with the Stack Overflow API, accessed through the StackAPI package: https://stackapi.readthedocs.io/en/latest/

You will need to have Anaconda and pip installed for this. Any code preceded by a '$' is run in the terminal; the rest of the code is executed in Jupyter Notebook cells.

  1. Make a conda environment to keep these dependencies away from other projects. We will require Jupyter, pandas, and ipykernel, so let's build them into the initial environment.

    $ conda create -n codecom_api jupyter pandas ipykernel

  2. Activate the new environment.

    $ conda activate codecom_api

  3. Install the Stack Overflow dependencies.

    $ pip install stackapi

  4. Attach the environment kernel to Jupyter Notebook.

    $ python -m ipykernel install --user --name=codecom_api

  5. Open Jupyter Notebook:

    $ jupyter notebook

    Then create a new notebook by selecting the codecom_api kernel on the right.


  6. Import these packages (write this in a Jupyter Notebook code cell).

    from stackapi import StackAPI
    from datetime import datetime
    import pandas as pd
    import seaborn as sns


  7. Set up the API query and the dataframe.

    SITE = StackAPI('stackoverflow')
    SITE.max_pages = 10

    results = pd.DataFrame(columns=['year', 'language', 'answered', 'views', 'answers'])

  8. Grab a single-day sample of posts for each year and load it into the DataFrame.

    Note: APIs will temporarily stop responding if you make too many calls, so run this only a few times, or reduce the number of languages while testing.

    # get the data from the Stack Overflow API
    for year in range(2012, 2023):
        for language in ['C', 'python', 'java', 'R']:
            post = SITE.fetch('questions', fromdate=datetime(year, 1, 1),
                              todate=datetime(year, 1, 2), tagged=language)

            # collect the data in the dataframe, one row per question
            # (pandas removed DataFrame.append, so we build rows with pd.concat)
            for item in post['items']:
                row = pd.DataFrame({'year': [year],
                                    'language': [language],
                                    'answered': [item['is_answered']],
                                    'views': [item['view_count']],
                                    'answers': [item['answer_count']]})
                results = pd.concat([results, row], ignore_index=True)


  9. Summarize the results and plot them using seaborn.

    # summarize answer rate, views, and answers per language and year
    summ = results.groupby(['language', 'year']).agg(
        {'answered': 'mean',
         'views': ['min', 'max', 'mean', 'size'],
         'answers': ['min', 'max', 'mean']})

    # the ('views', 'size') column counts the questions sampled per language-year
    plot_df = summ[('views', 'size')].rename('questions').reset_index()

    p = sns.lineplot(data=plot_df, x='year', y='questions', style='language')
    p.set(xlabel="year sampled", ylabel="questions asked / day",
          title="Stack Overflow questions by language")


    Stack Overflow questions by language chart.

In summary, we had a question that could have very important implications for a business: Should we transition our data science to Python? We then answered that question with a few lines of Python code. I want to pause to emphasize that this goes to the heart of the power of data science: we moved from a speculative discussion point to actionable, primary data analysis in a single turn of very lean code. Now that we have identified a data source and rendered it into a visualization, we can easily add to it and generate more nuanced analysis. Questions lead to answers and to better questions, and Python gets out of the way of that cycle.

Accelebrate offers in-person or live, online Python Data Science training for your team of 3 or more.


Written by Gunnar Kleemann


Dr. Gunnar Kleemann runs a small, friendly data science shop, Austin Capital Data. Gunnar has over 25 years of experience teaching a broad array of STEM fields, acting as a teacher and advisor to students at institutions including the Princeton University Genomics Institute, Barnard College, Albert Einstein College of Medicine, the University of Nebraska-Lincoln, K2, Data Society, the Princeton Review, and of course Accelebrate. He has been a lecturer in UC Berkeley's Master of Information and Data Science (MIDS) program since 2016.

Gunnar is primarily interested in making the benefits of data science more broadly accessible, since he believes that data science skills will be a core differentiator in the future. To this end, he regularly presents his results at international conferences, most recently at All Things Open 2021. Gunnar has published research on physiological and behavioral genomics in prominent international journals, including Cell, Genetics, and the Journal of Neuroscience.

