Kaggle Competition: US Baby Name Trend

Dataset Introduction

​

Data.gov publishes a dataset of the names given to babies born in the US from the 1880s to the present. It is a good source for tracking and analyzing name trends. Note that, for privacy, a name is included for a given year only if at least 5 babies received it that year.

​

Questions to be Answered

  • What is the overall trend in baby names? What about some specific names?

  • How does the diversity of baby names change over time? Are names becoming more diverse or more concentrated on a few common ones?

  • How does the first letter of baby names change? Is there a significant shift in the preference for names that start with certain letters?

  • What is the trend for one specific name, or for names similar to it? Are they becoming more popular? If so, are they becoming more popular for girls or for boys?

​

Content

​​

Step1: Prepare Data​

Step2: Analyze Name Popularity Change over Time

Step3: Analyze the Diversity of Names

Step4: The Trend of First Letter? 

Step5: Name Change Sex? 

​

​

Tools&Techniques

 

Python + Jupyter Notebook

Step1: Prepare Data

​

First of all, I assembled all the datasets downloaded from US Data.gov into one file: 
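A minimal sketch of this assembling step, not necessarily the original code. It assumes each download is a headerless file of name,sex,births rows, one file per year. The column names, the assemble helper and the in-memory demo files are illustrative assumptions.

```python
import io

import pandas as pd


def assemble(sources):
    """Stack per-year tables into one frame, tagging each row with its year.

    `sources` maps a year to a path or file-like object that holds
    headerless name,sex,births rows (an assumed layout).
    """
    frames = []
    for year, src in sorted(sources.items()):
        frame = pd.read_csv(src, names=["name", "sex", "births"])
        frame["year"] = year
        frames.append(frame)
    # ignore_index gives the combined frame a clean 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Tiny in-memory stand-ins for two yearly files (illustrative numbers).
demo_sources = {
    1880: io.StringIO("Mary,F,7065\nJohn,M,9655\n"),
    1881: io.StringIO("Mary,F,6919\nJohn,M,8769\n"),
}
names = assemble(demo_sources)
```

With real downloads, the dictionary values would be file paths instead of StringIO objects, and names.to_csv(...) would write the combined file.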

The next step is to explore the data and get an overall view of it. Pandas methods such as df.pivot_table() and df.groupby() are very helpful here. In this case, I aggregated the baby names data with both methods to show the number of babies born at the year and sex level. Visualization can be useful at this stage, especially when the data has obvious patterns. That will not be true of every dataset, but it never hurts to make a few plots and take a look.
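Both aggregation routes can be sketched as follows. The small frame and its column names (name, sex, births, year) are assumptions made for illustration, not the real data.

```python
import pandas as pd

# Illustrative rows only; the real table has one row per name, sex and year.
names = pd.DataFrame({
    "name": ["Mary", "John", "Mary", "John", "Anna"],
    "sex": ["F", "M", "F", "M", "F"],
    "births": [70, 96, 69, 87, 26],
    "year": [1880, 1880, 1881, 1881, 1881],
})

# Route 1: pivot_table puts years on the rows and sexes on the columns.
by_pivot = names.pivot_table(
    values="births", index="year", columns="sex", aggfunc="sum"
)

# Route 2: groupby + unstack builds the same year-by-sex table.
by_group = names.groupby(["year", "sex"])["births"].sum().unstack("sex")
```

Either table can then be passed to .plot() (which needs matplotlib) to draw total births per year for each sex.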

 

According to our dataset, the number of newborns has increased overall, with some ups and downs along the way. The number of baby boys was consistently lower than the number of baby girls until around the 1940s; since then, more boys than girls have been recorded each year, and that trend continues today.

Step2: Analyze Name Popularity Change over Time

I am interested in how the popularity of a name changes over time. First, I kept only the top 1000 names; to me, the best way to see a trend is through plotting. I picked four common names: James, Will, Jasper and Robert:
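A rough sketch of this step, assuming the top 1000 is taken within each year and sex and that each name gets its own subplot. The helper names, column names and demo numbers are hypothetical, and the demo uses n=2 instead of 1000 to stay small.

```python
import pandas as pd


def top_n(names, n=1000):
    """Keep the n most frequent names within each (year, sex) group."""
    ranked = names.sort_values("births", ascending=False)
    return ranked.groupby(["year", "sex"]).head(n)


def trend_table(top, picks):
    """Births per year (rows) for each picked name (columns)."""
    table = top.pivot_table(
        values="births", index="year", columns="name", aggfunc="sum"
    )
    # reindex keeps the requested order and tolerates missing names (NaN).
    return table.reindex(columns=picks)


def plot_trends(table):
    """One subplot per name, each with its own y-axis (needs matplotlib)."""
    import matplotlib.pyplot as plt

    table.plot(subplots=True, figsize=(10, 8), title="Births per year")
    plt.show()


# Illustrative numbers only.
names = pd.DataFrame({
    "name": ["James", "Robert", "Jasper"] * 2,
    "sex": ["M"] * 6,
    "births": [100, 80, 5, 90, 70, 10],
    "year": [2000] * 3 + [2001] * 3,
})
top = top_n(names, n=2)
trends = trend_table(top, ["James", "Robert"])
```

Calling plot_trends(trends) draws the panels; it is left out of the top-level code so the sketch runs without a display.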

One thing worth noticing is that the plot above shows how each name trends on its own over time, but it does not give an overall comparison between the names at a given point in time. To do that:
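One way to get that comparison, sketched under the assumption that the names are drawn on a single shared axis so their scales are directly comparable. The trends table, the helper names and the numbers are illustrative stand-ins.

```python
import pandas as pd

# Stand-in for a year-by-name table of births (illustrative numbers).
trends = pd.DataFrame(
    {"James": [100, 90], "Robert": [80, 70], "Jasper": [5, 10]},
    index=pd.Index([2000, 2001], name="year"),
)


def compare_at(table, year):
    """Rank the selected names by births in a single year."""
    return table.loc[year].sort_values(ascending=False)


def plot_shared_axis(table):
    """Draw every name on the same axes (needs matplotlib)."""
    import matplotlib.pyplot as plt

    ax = table.plot(figsize=(10, 6), title="Births per year, shared axis")
    ax.set_ylabel("births")
    plt.show()


snapshot = compare_at(trends, 2001)
```

Because all lines share one y-axis, a name that is rising quickly can still be seen sitting well below the long-established ones.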

This plot tells us not only that the name "Jasper" is becoming more popular, but also that it is still used far less than James and Robert.

Step3: Analyze the Diversity of Names

We all know the literal meaning of the word diversity, but how can we convert that concept into a metric we can measure? This is how I define it here: among the top 1000 names for boys and girls, if the proportion of each name is relatively low, the births are more spread out across distinct names, which indicates higher diversity. With that in mind, this is how I implemented it in Python:
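The original implementation is not shown here, so the following is only one possible reading of that definition: compute each name's proportion of births within its year and sex, then sum the proportions of the top n names. A lower total means births are less concentrated in the top names, which is read as higher diversity. The function names, column names and demo numbers are assumptions.

```python
import pandas as pd


def add_prop(names):
    """Add each name's share of births within its (year, sex) group."""
    out = names.copy()
    totals = out.groupby(["year", "sex"])["births"].transform("sum")
    out["prop"] = out["births"] / totals
    return out


def top_share(names, n=1000):
    """Fraction of births covered by the top n names, per year and sex."""
    with_prop = add_prop(names)
    ranked = with_prop.sort_values("births", ascending=False)
    top = ranked.groupby(["year", "sex"]).head(n)
    return top.pivot_table(
        values="prop", index="year", columns="sex", aggfunc="sum"
    )


# Illustrative numbers; n=2 stands in for 1000 on this tiny frame.
names = pd.DataFrame({
    "name": ["A", "B", "C", "X", "Y", "Z"],
    "sex": ["F", "F", "F", "M", "M", "M"],
    "births": [50, 30, 20, 80, 15, 5],
    "year": [2000] * 6,
})
share = top_share(names, n=2)
```

In this toy frame the top two girl names cover a smaller fraction of births than the top two boy names, so the girl names would count as more diverse under this reading.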

The results indicate that girl names have always been more diverse than boy names and they have only become more so over time. 

Step4: The Trend of First Letter? 

After extracting the first letter from each baby name and counting how often each first letter occurs, I scaled the table by calculating each letter's proportion of the count within each year. Visualizing the result, we can clearly see the trend changing:
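These steps can be sketched as below, assuming births are summed per first letter and year and then divided by each year's total. The column names and demo numbers are illustrative.

```python
import pandas as pd

# Illustrative numbers only.
names = pd.DataFrame({
    "name": ["Anna", "Amy", "Ben", "Anna", "Ben", "Cora"],
    "sex": ["F", "F", "M", "F", "M", "F"],
    "births": [30, 20, 50, 10, 60, 30],
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
})

# Extract the first letter of every name into its own column.
with_letter = names.assign(first_letter=names["name"].str[0])

# Births per first letter (rows) and year (columns); absent letters become 0.
counts = (
    with_letter.groupby(["first_letter", "year"])["births"]
    .sum()
    .unstack("year", fill_value=0)
)

# Scale each year's column so that it sums to 1.
props = counts / counts.sum()
```

Plotting props.T (years on the x-axis, one line per letter) is one way to look at how the preference for each letter moves over time.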

Step5: Name Change Sex? 

Another fun trend to look at is how some names switch sexes in popularity; the example used here is names containing "lesl". The procedure is very similar to the previous steps: we first calculate the frequency of such names grouped by gender and year, and then visualize the result to see the trend clearly:
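A sketch of that procedure on made-up numbers. The case-insensitive substring match and the final normalization to per-year shares are assumptions added here to make the two sexes easier to compare, and the function and column names are hypothetical.

```python
import pandas as pd


def counts_by_sex(names, fragment):
    """Births per year and sex for names containing `fragment`."""
    mask = names["name"].str.lower().str.contains(fragment, regex=False)
    return names[mask].pivot_table(
        values="births", index="year", columns="sex",
        aggfunc="sum", fill_value=0,
    )


# Illustrative numbers only.
names = pd.DataFrame({
    "name": ["Leslie", "Lesley", "Mary", "Leslie", "Lesly", "Leslie"],
    "sex": ["M", "F", "F", "F", "F", "M"],
    "births": [90, 10, 500, 70, 20, 10],
    "year": [1930, 1930, 1930, 2000, 2000, 2000],
})
counts = counts_by_sex(names, "lesl")

# Optional: turn counts into each sex's share within a year.
shares = counts.div(counts.sum(axis=1), axis=0)
```

Calling shares.plot() (with matplotlib installed) would show the two lines crossing if the name family moves from one sex to the other.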

Clearly, there is an upward trend in baby girls' names containing "lesl"; before 1940, those names were definitely more favored for boys.

Crystal Wang @ 2017
