Pandas-monium: Analyzing Data with the Python Pandas Package (Part 2 of 2)

January 22, 2019 in Python Articles

Written by Ra Inta


In Part 1, we explored some of the great power offered by the pandas framework. The whole process, from importing data through exploring, cleaning, and basic wrangling, is a simple matter. Time series, handling of missing data, and subsequent visualization of the results can be performed in a single line!

But what about more powerful tools, such as grouping, aggregation, joins, advanced string methods and other functions? Are they just as simple to carry out? And where are you most likely to see a UFO? All this—and more—in Part 2!  

Aggregation: Rate of UFO Reports Over Time

We have a nice longitudinal dataset that spans from 1400 to 2018 A.D. How does the rate of UFO reports change over time?

We will have to aggregate over year and count the number of occurrences. We do this aggregation with the .groupby() operation:

ufo_df.groupby('year')['posted'].count().reset_index().plot(x='year', y='posted', xlim=[1940.0, 2018.0], legend=False)
Increase in UFO Reports Over Time
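As an aside, since we only need a count per year, a roughly equivalent one-liner uses .value_counts() instead of .groupby() (a sketch of an alternative, not the approach used above):

# Count reports per year, put the years back in chronological order, and plot
ufo_df['year'].value_counts().sort_index().plot(xlim=[1940.0, 2018.0], legend=False)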

What happened in the mid-2000's? Does the introduction of so-called smart phones (while my own phone seems to keep getting dumber) have anything to do with the rate of sightings?

The first smart phone was introduced around the middle of 2007 (more or less):

ufo_df['smartphone_epoch'] = ['pre-cellphone' if x < 2007 else 'post-cellphone' for x in ufo_df['year']]

ufo_df[ufo_df['smartphone_epoch'] == 'post-cellphone']['month'].plot.hist(bins=12, color='blue', alpha=0.5, label="post-cellphone", legend=True)
ufo_df[ufo_df['smartphone_epoch'] == 'pre-cellphone']['month'].plot.hist(bins=12, color='red', alpha=0.5, label="pre-cellphone", legend=True)

Note: again, because the plot is really a matplotlib object, you could also have obtained the legend by calling it 'directly':

plt.legend() 

Effect of Technology on Reports by Month

Recall that the median report year was 2009, so there are more post-2007 reports in total. Interestingly, the reports from 2007 onwards peak in July, rather than in the well-defined June peak for the pre-2007 reports. There also appears to be a relatively heightened rate of reports in January. Any thoughts on why that would be? This is not a rhetorical question!
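For reference, the median report year can be confirmed directly:

ufo_df['year'].median()  # 2009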

What is the distribution of reported UFO shapes?

One of the columns displaying a great deal of variability is the 'shape' field. How are these shapes distributed?

ufo_df.groupby('shape')['posted'].count().sort_values(ascending=False).head(20).plot.bar()

Reported UFO Shape

By far the most common shape is 'Light'. Cigar-shaped objects were way down (12th) on the list, which surprised me, as that is the descriptor that always screamed 'technical UFO description' to me as a kid.

What cities saw the most UFOs?

Often in the EDA phase, it becomes clear that further cleaning is required. This is certainly the case here!

Recall we noted that 227 city entries were missing. This hasn't been a problem until now, because we didn't care about the city names. We need to decide how to handle these missing data.
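Before deciding, it is worth re-checking how much is actually missing (a quick sketch):

ufo_df['city'].isnull().sum()  # the 227 missing city entries noted in Part 1

The two obvious options are to drop those rows (.dropna()) or to fill them with a placeholder such as 'Unknown' (.fillna()); we will drop them shortly.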

In addition, there are quite a few other tasks required to clean up the city names. For example, there are many entries that attempt to clarify or condition with commentary in parentheses:

ufo_df.dropna()[ufo_df['city'].dropna().str.contains(r"\)$")][['city', 'date_time', 'year', 'month']].sample(5)

                                       city            date_time  year  month
95593                 Henderson (Las Vegas)   2001-03-0 19:45:00  2001      3
104680                     Salluit (Canada)   1995-12-0 23:10:00  1995     12
83712                    Calhoun (north of)   2003-11-2 11:30:00  2003     11
51673   North Kenai (Daniels Lake east end)   2010-08-0 00:30:00  2010      8
52326                 Hatchet Lake (Canada)   2010-06-1 21:18:00  2010      6

Note the use of vectorized string methods and string matching using regular expressions (regex) — the "\)$" means "look for a ')' at the very end of the string". More regexes to come!

There are:

ufo_df.dropna()[ufo_df['city'].dropna().str.contains(r"\)$")].count()[0]

...8,673 of these observations!

This includes the following mysterious gem:

((town name temporarily deleted)), OK

This is the kind of thing we came here for right here!

Note the double use of .dropna() to prevent data misalignment — this is enforced by pandas, so you'll get an error if data threatens to become misaligned.
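Incidentally, if the double .dropna() feels clumsy, the vectorized string methods accept an na= argument that achieves the same thing in one pass (a sketch of an alternative):

# Treat missing city names as non-matches so the boolean mask stays aligned with ufo_df
ufo_df[ufo_df['city'].str.contains(r"\)$", na=False)].sample(5)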

Speaking of which, we'll get rid of those no-good city NaNs. The .dropna() method has some helpful parameters to specify rows/columns and in-place filtering:

ufo_df.dropna(subset=['city'], inplace=True)

Perhaps we could split the city strings by " (" and retain the 0th element:

ufo_df['city'] = ufo_df['city'].str.split(sep=" (")[0]

Unfortunately pandas' str.split() method does not (yet) seem to have the rich support of the built-in str.split() method.
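That said, we can get there with the pandas API by escaping the parenthesis (multi-character patterns are treated as regular expressions) and using the .str accessor to pick out the first piece. A sketch, not the approach taken below:

# Split each city name on " (" and keep only the part before it
ufo_df['city'].str.split(r"\s*\(").str[0]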

While we are at it, there are a few entries that have non-standard capitalization (such as 'AmArillo'). We enforce standard title case with str.title():

ufo_df['city'] = [x.title().split(sep=" (")[0] for x in ufo_df['city']]

There are still 20 cities with parentheses, and at least one containing a '{'. I was hoping it wouldn't come to this...

...but we're going to have to wade back, deeper into the murky waters of regular expressions (regex):

import re

Remove anything following a ( or {:

ufo_df['city'] = [re.split(r"\s*[\(\{]", x.title())[0] for x in ufo_df['city']]

To translate: \s is whitespace, and * means zero or more occurrences of the immediately preceding character. Characters inside the [] are treated as alternatives (a logical OR), or ranges. The ( and { each need to be escaped with a \ because they're otherwise meaningful to the regex interpreter. So "\s*[\(\{]" roughly translates to 'take any potential whitespace, followed directly by a ( or a {'. In this case, we use this as a delimiter to split the string, and then we discard anything after this (complicated) delimiter.
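To see the delimiter in action on one of the city strings we sampled earlier:

re.split(r"\s*[\(\{]", "North Kenai (Daniels Lake east end)")[0]  # 'North Kenai'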

You have probably worked out by now that regular expressions are (relatively) straight-forward to write, but often painful to read!

Check that nothing went awry (as often happens when prototyping regexes):

ufo_df.dropna()[ufo_df['city'].dropna().str.contains(r"\)$|\(")]

There are only two entries with (these types of) entry errors left. We can manually input them:

ufo_df.loc[115268, 'city'] = 'Aliquippa'  # The largest 'city' in Beaver County, PA
ufo_df.loc[25404, 'city'] = 'Cedar Rapids'

So, how do the cities look now? Let's look at city names containing the string "New York":

ufo_df[ufo_df['city'].str.contains("New York")]['city'].unique()
  
array(['New York City', 'New York', 'New York Mills', 'West New York',
       'New York State Thruway / Catskill', 'New York City, Manhattan',
       'New York City/Far Rockaway', 'New York City/Staten Island',
       'New York City/Philadelphia', 'New York/Philadelphia',
       'New York City/Central Park', 'New York/San Francisco',
       'Central New York', 'East New York', 'New York Worlds Fair'], dtype=object)

Still some way to go.

There are plenty of locations that are between main population centers, often separated by a '/' or a '&'. In most cases, the larger settlement is on the left of the slash or ampersand. Regex to the rescue again!

ufo_df['city'] = [re.sub(r"\s*/.*", "", x.title()) for x in ufo_df['city']]
ufo_df['city'] = [re.sub(r"\s*&.*", "", x.title()) for x in ufo_df['city']]
ufo_df['city'] = [re.sub(r"^[Bb]etween", ",", x.title()) for x in ufo_df['city']]

CHECK. Always check when transforming data for the first time.

ufo_df[ufo_df['city'].str.contains("New York")]['city'].unique()

Much better. However, there are a lot of cities with helpful-but-unhelpful directional modifiers, such as "East Los Angeles". Similarly, there are some suffixes, such as in "San Francisco Area" that could be purged. There are as many ways to do this as there are to skin a Xenomorph.

I approached this by defining two functions to perform these cleaning operations:

def purgeDirectionalModifiers(df, city_name):
    """Strip off directional prefixes North, South etc. from matching city names in DataFrame df"""
    direction_list = ["North", "N.", "East", "E.", "South", "S.", "West", "W."]
    direction_regex = re.compile(' |'.join(direction_list) + ' ')
    match_idx = df['city'].str.contains(city_name)
    df.loc[match_idx, 'city'] = [re.sub(direction_regex, "", x) for x in df.loc[match_idx, 'city']]

def purgeCitySuffix(df, city_name):
    """Strip specifiers for city, county etc. from matching city names in DataFrame df"""
    suffix_list = ["City", "County", "Area", "Bay", "Airport", "D.C.", "Dc", ","]
    suffix_regex = re.compile(' ' + '$| '.join(suffix_list) + "$")
    match_idx = df['city'].str.contains(city_name)
    df.loc[match_idx, 'city'] = [re.sub(suffix_regex, "", x) for x in df.loc[match_idx, 'city']]

I then obtained a list of cities that have the worst of these problems7, and cleaned them:

cities_to_clean = ["Sacramento", "Seattle", "Milwaukee", "Baltimore", "Las Vegas", 
   "Boston", "San Francisco", "Washington", "Chicago", "Los Angeles", "New York"]
 
for city in cities_to_clean: 
   purgeDirectionalModifiers(ufo_df, city) 
   purgeCitySuffix(ufo_df, city)

How many cities are there now in this data set?

len(ufo_df['city'].unique())  # 18,366

The US cities aren't uniquely specified. We need to add their US state:

ufo_df['city'] = ufo_df['city'] + ', ' + ufo_df['state']

All right. Let's look at how many reports came from each city. Taking the top ten:

ufo_df.groupby('city')['posted'].count().sort_values(ascending=False).head(10).plot.barh()

plt.gca().invert_yaxis()
UFO Reports by City

It certainly makes sense that New York, as the most populous US city, would have the most UFO reports. However, Seattle seems to be punching well above its weight. That explains the Space Needle!

5: Prepare to be boarded! Joining datasets

How about the observations per capita?

In the words of the great data scientist, Sherlock Holmes: "Data! Data! Data! I can't make bricks without clay!" We will need some data from another source. The US Census Bureau has data available for the 770 most populous cities: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml

city_pop = pd.read_csv("PEP_2017_PEPANNRSIP.US12A_with_ann.csv", encoding='latin-1')

However, there are far too many columns for our needs. We'll rename the census-specific labels of the columns we want to keep and drop the rest:

city_pop = city_pop.rename(index=str, columns={'GC_RANK.rank-label': 'rank', 'GC_RANK.display-label.1': 'city_state', 'respop72017': 'pop'})

city_pop = city_pop[['rank', 'city_state', 'pop']]

Or we could have used the .drop() method as above; for completeness, a sketch of the equivalent call appears below (an illustration only, since we have already selected the columns we want).
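# Keep only the renamed columns by dropping every other census column
city_pop = city_pop.drop(columns=[c for c in city_pop.columns if c not in ['rank', 'city_state', 'pop']])

Further cleaning: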

city_pop['city_state'] = city_pop['city_state'].str.replace(r' \(balance\)', '', regex=True)  # This term means something to the Census Bureau...

There are still a few (eight) cities with hyphens or official county designations or other weirdness. Let's impute them manually, otherwise we'd miss out on some major cities!

city_pop.loc['23', 'city_state'] = "Nashville city, Tennessee" 
city_pop.loc['59', 'city_state'] = "Lexington city, Kentucky" 
city_pop.loc['88', 'city_state'] = "Winston city, North Carolina" 
city_pop.loc['121', 'city_state'] = "Augusta city, Georgia" 
city_pop.loc['171', 'city_state'] = "Macon County, Georgia" 
city_pop.loc['219', 'city_state'] = "Athens County, Georgia" 
city_pop.loc['28', 'city_state'] = "Louisville city, Kentucky" 
city_pop.loc['55', 'city_state'] = "Honolulu city, Hawaii"

The convention is largely "X city, Y" where X is the name of the city and Y is the full name of the state. So we could have split by the " city, " delimiter.

Unfortunately, there are towns, cities, villages and counties in the mix. Take one step back — your doctor recommends another dose of regex!

city_delimiter = ' [Cc]ity, '
city_delimiter += '| town, '  # Note the pipe operator '|', used as logical OR
city_delimiter += '| [Cc]ounty, '
city_delimiter += '| village, '
city_delimiter += '| municipality, '

Now we can split the cities and states according to this (horrendous, just don't think about it) delimiter:

city_pop['city'] = [re.split(city_delimiter, x)[0] for x in city_pop['city_state']]
city_pop['state'] = [re.split(city_delimiter, x)[-1] for x in city_pop['city_state']]

A fun aside: how many cities have the same name in the US?

city_pop.groupby('city').count().sort_values('state', ascending=False)['state'].head(5)

city
Springfield	 5
Lakewood	 4
Albany		 3
Bloomington	 2
Charleston	 2

There are five Springfields! Matt Groening must have known this when he created The Simpsons.

There are 36 major cities in the US with duplicated names. We'll need to retain the state information to disambiguate. This is why people have to specify the US state. You don't want to get on the plane to Portland, only to realize you're heading to the complete opposite coast!
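That count of 36 falls straight out of the census table (a quick sketch):

# Mark every city whose name appears more than once and count them
city_pop['city'].duplicated(keep=False).sum()  # 36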

Again, make sure the data imported correctly and that it looks like it makes sense. In the biz, we call this a 'sanity check'.

city_pop.sample(n=5)

rank	 city_state                        pop    city               state
412 413  Westland city, Michigan           81747  Westland           Michigan
569 570  Blaine city, Minnesota            64557  Blaine             Minnesota
552 553  North Little Rock city, Arkansas  65911  North Little Rock  Arkansas
550 551  Laguna Niguel city, California	   66334  Laguna Niguel      California
506 507  Canton city, Ohio                 70909  Canton             Ohio

Let's check the ten most populous cities:

city_pop.sort_values('pop', ascending=False).head(10).plot.barh(x='city', y='pop', legend=False)
  
plt.gca().invert_yaxis()
10 Most Populous US Cities

Unfortunately, the good people at the Census Bureau write out their US states in long form. But we just want their abbreviations. I created a text file with all the states' names and abbreviations, 'state_abbrev.txt':

state_ref = pd.read_csv("state_abbrev.txt")
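The file itself is just a two-column CSV mapping full state names to their postal abbreviations; its first few lines would look something like this (an assumed layout, chosen to match the 'state' and 'abbreviation' columns used below):

state,abbreviation
Alabama,AL
Alaska,AK
Arizona,AZ
...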

Let's add the abbreviation references to the city_pop DataFrame using the .merge() method:

city_pop = city_pop.merge(state_ref, on='state', how='left')

Combine the city and state into one column to make merge syntax easier (i.e. we only have to use a single key for each8):

city_pop['city_abbrev'] = city_pop['city'] + ', ' + city_pop['abbreviation']

Ugh. But we have too many columns. We only needed two!

city_pop = city_pop[['city_abbrev', 'pop']]

Finally, merge the city_pop references into the ufo_df DataFrame:

ufo_merged = ufo_df.merge(city_pop, left_on='city', right_on='city_abbrev', how='left')

Note: by doing this, we are automatically resigning ourselves to analyzing US-only data; the following transformations make no sense if there is no associated US state.

How much data will this discard?

ufo_df['state'].isnull().sum()/len(ufo_df)

Hence 7.2% of the entries have no defined state. By inspection, most of these are from outside of the US. This could be a follow-up investigation.
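A quick way to eyeball these non-US entries (a sketch):

# Sample a few city names from reports with no US state attached
ufo_df[ufo_df['state'].isnull()]['city'].sample(5)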

Also...

city_pop['pop'].min()

...cities with populations of less than 47,929 will be omitted.

Get rid of the resulting city NaNs:

ufo_merged.dropna(subset=['city'], inplace=True)

Finally, we can answer the question 'what are the cities with highest recorded UFO sightings per capita?'

ufo_per_capita = ufo_merged.groupby('city')['posted'].count()
ufo_per_capita = pd.DataFrame(ufo_per_capita)
ufo_per_capita = ufo_per_capita.join(city_pop.set_index('city_abbrev'), how='left')
ufo_per_capita.dropna(subset=['pop'], inplace=True)
ufo_per_capita['obs_per_1000'] = 1000*ufo_per_capita['posted']/ufo_per_capita['pop']
ufo_per_capita.sort_values('obs_per_1000', ascending=False)['obs_per_1000'].head(10).plot.barh(legend=False)

plt.gca().invert_yaxis()

UFO Reports per Capita by City

Wow! Who would have thought that Tinley Park, Illinois would have the most UFO reports per capita?

The second, Sarasota, FL, isn't even close. Personally, I was expecting somewhere like Santa Fe, New Mexico, or Portland, Oregon, to have the highest reports per capita (Roswell, NM didn't quite have enough population to be included).

What are the states with the most observations?

ufo_merged.groupby('state')['posted'].count().sort_values(ascending=False).head(10)

state observations
CA    13235
FL    6312
WA    5767
TX    4833
NY    4580
AZ    3909
PA    3784
IL    3521
OH    3470
MI    2922

It's reasonable that California has twice the reports of Florida, having almost twice the population. But Washington state, with a little less than half of California's reports, has only around one seventh of its population...

Once again, Washington is punching well above its weight in terms of UFO reportage. I always thought Jimi Hendrix was out of this world!

Other investigations

There are plenty of avenues for exploration of this dataset using pandas.

For example, there are over 720 potential 'HOAX' flags in the summary pages. Adding this label to the observations may make a neat machine learning exercise (although it's unlikely we have enough information in the reports alone to make any useful prediction).
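As a starting point, the flag could be pulled out of the report text with another vectorized string match. A sketch, assuming the free-text field is called 'summary' (check your column names):

# Mark reports whose text contains a NUFORC 'HOAX' annotation
ufo_df['hoax'] = ufo_df['summary'].str.contains('HOAX', case=False, na=False)
ufo_df['hoax'].sum()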

There are numerous famous UFO incidents that may warrant an in-depth study, such as the 'Phoenix Lights' in 1997, or the 'Chicago O'Hare Lights' in 2006. Unfortunately, the mysterious airship event in Aurora, Texas of 1897 has only a single entry.

Conclusion

Hopefully this tutorial has demonstrated some of the power and ease of application of the pandas module.

There is so much functionality that we could not cover all the features of pandas here. For instance, we have not showcased transitioning between wide (unstacked) and long data formats, or some of the powerful time-series methods, such as periodic analysis or window functions. But we have hopefully shown that data analysis doesn't always have to be difficult. Given the right dataset, it can even be fun!

Until next time — see you on the seventh rock from 40 Eridani A, Earthling!


6 More technically, it's interacting with the graphical back-end of your system to display the image.

7 For more details, I refer you to the full code for this tutorial.

8 The on keyword in the .merge() method can easily take a list for primary and secondary keys, but we want to retain the US state information for plotting purposes too.




Ra Inta

Ra is originally from New Zealand, has a PhD in physics, is a data scientist, and has taught for Accelebrate in the US and in Africa. His specialties are R Programming and Python.
  

