Capstone Project

The following is a case study I completed to apply skills like data cleaning, data processing, and data visualization that I learned from taking the Google Data Analytics Professional Certificate.

I had three options: selecting one of two prompts or creating my own adventure with a dataset of my choosing. For my first major project, I chose the Bellabeat prompt and utilized the R programming language for data cleaning and analysis. I learned so much!

By the way, Bellabeat is a real fitness company and their products are pretty sleek as far as fitness trackers go.

→ I built this webpage using HTML and CSS. View the repo here.
Objective

I have been asked (fictionally!) to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights I discover will then help guide marketing strategy for the company. I will present my analysis to the Bellabeat executive team along with my high-level recommendations for Bellabeat’s marketing strategy.


Ask

This part is where a fictional senior leader, Sršen, asks me to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants me to select one Bellabeat product to apply these insights to in my presentation. These questions will guide my analysis:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

Business Task

Identify new growth opportunities for Bellabeat’s Leaf product to inform marketing strategy by identifying trends in how consumers use their (non-Bellabeat) smart devices.


Prepare

Data Sources

TL;DR

In this part I describe the data and assess its quality and trustworthiness: where it came from, where it lives, etc. Google uses the "Does the data ROCCC?" framework to help guide and assess it. Overall, the dataset is full of holes. If I were actually using this data, I'd have many questions for the stakeholders about the project and data integrity. I would most definitely want more data than the datasets contain! Read below for the nitty-gritty, picky observations I had.

Dataset and Licensing

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius). It’s open source and the data is updated annually. The last update was 8 months ago.

→ FitBit Fitness Tracker Data

Data Collection

This Kaggle data set contains personal fitness tracker data from 35 Fitbit users over a two-month period from 03.12.2016 to 05.12.2016. Thirty-five eligible Fitbit users consented to the submission of quantitative personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Data Organization (filtering and sorting)

The dataset contains primary, structured data in 29 CSV files: 11 CSV files for the first month of data (3.12.16 - 4.11.16) and the other 18 CSV files for the second month (4.12.16 - 5.12.16). The dataset is in long format, with multiple entries per user Id. I used Google Sheets to preview and inspect the datasets.

Does this data ROCCC?

Reliable

It’s inconclusive whether the sample size represents the overall population; at 35 users it only just clears the commonly cited minimum of 30 observations for the Central Limit Theorem (CLT) to apply. Much of the data is incomplete for each user, and not all data sets contain data for all 35 users. Using a sample size calculator, such as SurveyMonkey's, would help determine the ideal sample size needed to accurately reflect the greater population of smart device users. Additionally, the metadata on the datasets lacks details about the confidence level and margin of error for the sample population.
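For illustration, here is a rough sketch (in R) of the standard formula such sample size calculators use. The 95% confidence level, 5% margin of error, and p = 0.5 are my assumed inputs for the sketch, not values from the dataset.

```r
# Standard sample-size formula: n = z^2 * p * (1 - p) / e^2
# Assumed inputs (not from the dataset): 95% confidence, 5% margin of error,
# and p = 0.5 for maximum variability.
z <- qnorm(0.975)  # z-score for 95% confidence (~1.96)
p <- 0.5           # assumed population proportion
e <- 0.05          # margin of error
n <- ceiling(z^2 * p * (1 - p) / e^2)
n  # 385
```

For a large population this works out to roughly 385 respondents, an order of magnitude more than the 35 users available here.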

Other Considerations
  • The timeframe for data collection was limited to 2 months, which may give a limited view of device usage. This could be a problem since device usage may vary with seasonality and with user adoption of a new technology or tool.
  • The Calories field doesn't indicate whether it represents calories burned or calories consumed. For the purpose of this study, we'll assume it's calories burned.
  • The data points are reliable as they were tracked and recorded by the devices.
  • The data does not specify the units of measurement for most variables involving distance or intensity. For example, we cannot determine whether distance is measured in miles, kilometers, feet, or another unit. Additionally, the method used to assess and categorize activity levels—ranging from sedentary to very active or moderately active—remains unclear.
  • It is unclear whether the distance recorded in "logged activities" is a manual input or a feature of the FitBit device.

Original

The data is an original data source.

Comprehensive

The sample size is small and lacks essential demographic information, which is crucial for addressing the business task. For instance, gender is particularly important for Bellabeat products, as they are marketed specifically to women. Ideally, the dataset should include gender information to help Bellabeat identify trends among smart device users based on gender. Additionally, the FitBit dataset does not specify the type of fitness tracker from which the data originates—whether it is a wearable device, an app, etc. This report will provide a preliminary analysis.

Current

This data is nearly a decade old. Since it was collected, technology and fitness trackers have evolved significantly, so making recommendations for 2024 based on old data may lead to poor outcomes, though it could be interesting to compare this study with data from more recent years to determine if these same trends still exist today.

Cited

The source was cited and vetted.

Privacy, Security, and Accessibility

The original dataset is stored in this folder in GDrive. Stakeholders have permission to view the dataset.

The cleaned dataset is stored in this folder in GDrive. Stakeholders have permission to view and comment on the dataset.


Process

Documentation of Data Cleaning and Manipulation

Renaming Datasets

I renamed each dataset I will use to reflect the timeframe it represents, allowing for easier merging in R. The datasets I will be using for this Case Study are as follows:

  • dailyActivity_3.12.16-4.11.16.csv
  • dailyActivity_4.12.16-5.12.16.csv
  • hourlyIntensities_3.12.16-4.11.16.csv
  • hourlyIntensities_4.12.16-5.12.16.csv
  • hourlyCalories_3.12.16-4.11.16.csv
  • hourlyCalories_4.12.16-5.12.16.csv
  • hourlySteps_3.12.16-4.11.16.csv
  • hourlySteps_4.12.16-5.12.16.csv
    Installing Packages in R

    I installed the following packages to clean and analyze the data.

    {R}
    install.packages(c("tidyverse", "skimr", "janitor", "lubridate", "here", "outliers"))
    library(tidyverse)
    library(skimr)
    library(janitor)
    library(lubridate)

    From here, I realized that trying to clean and combine hourly data and daily data in one script would be a mess, so I split the datasets into two different R scripts for ease of analysis. The first set of scripts is for Hourly Activity; the second set is for Daily Activity.

    Hourly Activity Datasets

    Uploading and Combining Datasets

    There are six total datasets for hourly activity that need to be combined. The final merge combined all of the data by Id and ActivityHour columns.

    {R}

    hourly_intensities_1 <- read_csv("hourlyIntensities_3.12.16-4.11.16.csv")
    head(hourly_intensities_1)
    colnames(hourly_intensities_1)

    hourly_intensities_2 <- read_csv("hourlyIntensities_4.12.16-5.12.16.csv")
    head(hourly_intensities_2)
    colnames(hourly_intensities_2)

    hourly_intensity <- rbind(hourly_intensities_1, hourly_intensities_2)
    glimpse(hourly_intensity)

    {R}

    hourly_steps_1 <- read_csv("hourlySteps_3.12.16-4.11.16.csv")
    head(hourly_steps_1)
    colnames(hourly_steps_1)

    hourly_steps_2 <- read_csv("hourlySteps_4.12.16-5.12.16.csv")
    head(hourly_steps_2)
    colnames(hourly_steps_2)

    hourly_steps <- rbind(hourly_steps_1, hourly_steps_2)
    glimpse(hourly_steps)

    {R}

    hourly_calories_1 <- read_csv("hourlyCalories_3.12.16-4.11.16.csv")
    head(hourly_calories_1)
    colnames(hourly_calories_1)

    hourly_calories_2 <- read_csv("hourlyCalories_4.12.16-5.12.16.csv")
    head(hourly_calories_2)
    colnames(hourly_calories_2)

    hourly_calories <- rbind(hourly_calories_1, hourly_calories_2)
    glimpse(hourly_calories)

    {R}
    hourly_activity_df <- merge(hourly_intensity, hourly_steps, by = c("Id", "ActivityHour"))
    glimpse(hourly_activity_df)

    The {R} script below shows the result from the final merge above with hourly_calories by Id and ActivityHour into one final dataframe, hourly_activity_final. This dataframe contains all of the hourly activity from all six imported files!

    Perhaps there is a more efficient way to do this?

    {R}
    hourly_activity_final <- merge(hourly_activity_df, hourly_calories, by = c("Id", "ActivityHour"))
    glimpse(hourly_activity_final)
    str(hourly_activity_final)
    summary(hourly_activity_final)
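    To answer my own question above: one possibly more efficient route, sketched here and untested against this exact data, is a small helper that merges any list of data frames on Id and ActivityHour in a single pass with base R's Reduce().

```r
# A sketch of an alternative: merge a whole list of data frames on the
# same keys in one pass instead of chaining merge() calls by hand.
merge_hourly <- function(dfs) {
  Reduce(function(x, y) merge(x, y, by = c("Id", "ActivityHour")), dfs)
}

# e.g. hourly_activity_final <- merge_hourly(
#        list(hourly_intensity, hourly_steps, hourly_calories))
```

    The same idea scales to any number of hourly files without adding an intermediate data frame per merge.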

    Viewing Structure and Summarizing Data

    I performed the following functions to check that the data was combined and merged properly.

    • Viewed the structure with str() to see what data types are in each column.
    • Used summary() to get an overview of the data sets.
    • Used glimpse() to preview the data.
    Console
    > summary(hourly_activity_final)
           Id             ActivityHour       TotalIntensity  AverageIntensity   StepTotal
     Min.   :1.504e+09   Length:47233       Min.   :  0.0   Min.   :0.00000   Min.   :    0.0
     1st Qu.:2.320e+09   Class :character   1st Qu.:  0.0   1st Qu.:0.00000   1st Qu.:    0.0
     Median :4.559e+09   Mode  :character   Median :  2.0   Median :0.03333   Median :   17.0
     Mean   :4.868e+09                      Mean   : 11.3   Mean   :0.18836   Mean   :  300.1
     3rd Qu.:6.962e+09                      3rd Qu.: 15.0   3rd Qu.:0.25000   3rd Qu.:  317.0
     Max.   :8.878e+09                      Max.   :180.0   Max.   :3.00000   Max.   :10565.0
        Calories
     Min.   : 42.00
     1st Qu.: 62.00
     Median : 80.00
     Mean   : 95.43
     3rd Qu.:105.00
     Max.   :948.00

    Removing Duplicates and Trimming Whitespace

    First I trimmed the whitespace, then I checked for duplicates. There were 35 user Ids and 1225 duplicates found. I then removed the duplicates.

    {R}

    # trimming leading/trailing spaces
    hourly_activity_final <- hourly_activity_final %>%
      mutate(across(everything(), trimws))

    # checking for duplicates
    n_unique(hourly_activity_final$Id)      # 35 unique Ids
    sum(duplicated(hourly_activity_final))  # 1225 duplicates found

    # removing fully duplicated rows and dropping rows with missing values
    hourly_activity_final <- hourly_activity_final %>%
      distinct() %>%
      drop_na()

    sum(duplicated(hourly_activity_final))  # confirming the duplicates are gone

    Cleaning Column Names

    {R}

    # cleaning column names to snake_case
    hourly_activity_final <- hourly_activity_final %>%
      clean_names()

    head(hourly_activity_final)
    colnames(hourly_activity_final)

    Converting Data Types in Columns

    Because trimws() coerced every column to character, here I convert the numeric columns back to numbers so I can make calculations based on time of day.

    {R}

    # converting the intensity, steps, and calories columns back to numeric
    hourly_activity_final$average_intensity <- as.numeric(hourly_activity_final$average_intensity)
    hourly_activity_final$total_intensity   <- as.numeric(hourly_activity_final$total_intensity)
    hourly_activity_final$step_total        <- as.numeric(hourly_activity_final$step_total)
    hourly_activity_final$calories          <- as.numeric(hourly_activity_final$calories)

    glimpse(hourly_activity_final)

    Here I convert the time column so I can make calculations based on time of day. I chose to translate each hour into a two-digit number, e.g., 13:00 = 13.

    {R}

    # extracting the hour as the first two characters of the 'time' column
    hourly_activity_final_cleaned <- hourly_activity_final %>%
      mutate(hour = substr(time, 1, 2))

    # converting the 'hour' column to numeric
    hourly_activity_final_cleaned$hour <- as.numeric(hourly_activity_final_cleaned$hour)

    print(hourly_activity_final_cleaned)  # checking the result
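    An alternative sketch, assuming the original timestamps keep Fitbit's "month/day/year hour:minute:second AM/PM" format (an assumption I have not verified against the raw CSVs): base R's strptime() can parse the timestamp and expose the hour directly, skipping the substring step.

```r
# Parse a sample timestamp (format assumed, not verified against the raw CSVs)
x <- strptime("4/12/2016 1:00:00 PM", format = "%m/%d/%Y %I:%M:%S %p")
x$hour  # 13
```

    lubridate offers similar parsers if a tidyverse-flavored version is preferred.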

    Hiding Columns

    Hiding the activity_hour column (a date-time column) to reduce visual clutter in the data frame.

    {R}

    hourly_activity_final_cleaned <- hourly_activity_final_cleaned %>% select(-activity_hour)

    glimpse(hourly_activity_final_cleaned)

    head(hourly_activity_final_cleaned)

    Daily Activity Datasets


    Uploading and Combining Datasets

    There are two total datasets for daily activity that need to be uploaded and combined.

    {R}

    # importing the daily activity datasets
    daily_activity_1 <- read_csv("dailyActivity_3.12.16-4.11.16.csv")
    head(daily_activity_1)
    colnames(daily_activity_1)

    daily_activity_2 <- read_csv("dailyActivity_4.12.16-5.12.16.csv")
    head(daily_activity_2)
    colnames(daily_activity_2)

    # combining the two data sets for daily activity into daily_activity
    daily_activity <- rbind(daily_activity_1, daily_activity_2)
    glimpse(daily_activity)
    str(daily_activity)
    summary(daily_activity)

    Removing Duplicates and Trimming Whitespace

    First I trimmed the whitespace, then I checked for duplicates. No duplicates found!

    {R}

    # trimming leading/trailing spaces
    daily_activity <- daily_activity %>%
      mutate(across(everything(), trimws))

    # checking for duplicates
    n_unique(daily_activity$Id)
    sum(duplicated(daily_activity))  # no duplicates found!

    Cleaning Column Names and Hiding Columns

    I hid the unnecessary columns "logged_activities_distance" and "tracker_distance" to reduce clutter in the dataframes.

    {R}

    # formatting and cleaning column names
    daily_activity <- daily_activity %>%
      clean_names()

    head(daily_activity)
    colnames(daily_activity)

    # hiding the unnecessary columns logged_activities_distance and tracker_distance
    daily_activity <- daily_activity %>%
      select(-logged_activities_distance, -tracker_distance)

    Converting Datatypes in Columns

    I changed the date column from a character format to a date format and changed most of the other columns to numbers, with the exception of id.

    {R}

    # changing activity_date from a character data type to a date
    daily_activity$activity_date <- as.Date(daily_activity$activity_date, format = "%m/%d/%Y")

    # converting the measurement columns to numeric
    daily_activity <- daily_activity %>%
      mutate(across(c(total_steps, total_distance, very_active_distance,
                      moderately_active_distance, light_active_distance,
                      sedentary_active_distance, very_active_minutes,
                      fairly_active_minutes, lightly_active_minutes,
                      sedentary_minutes, calories), as.numeric))

    str(daily_activity)  # checking that the new formatting worked!

    Creating New Columns

    I created new columns: "day_type" to label each day as a weekday or weekend, "day_of_week" to label each day of the week, and "week_number" to number each week from 1-9.

    {R}

    # creating a new column labeling weekdays vs. weekends
    daily_activity_cleaned <- daily_activity %>%
      mutate(day_type = ifelse(wday(activity_date) %in% c(1, 7), "Weekend", "Weekday"))

    # creating a new column labeling each day of the week
    daily_activity_cleaned$day_of_week <- wday(daily_activity_cleaned$activity_date,
                                               label = TRUE, abbr = FALSE)
    glimpse(daily_activity_cleaned)

    # creating a new column numbering each week of the data collected (1-9)

    # date range
    start_date <- ymd("2016-03-12")
    end_date   <- ymd("2016-05-12")

    # sequence of dates
    date_range <- seq.Date(start_date, end_date, by = "day")

    # week number
    daily_activity_cleaned$week_number <- ceiling(
      (as.numeric(difftime(daily_activity_cleaned$activity_date, start_date, units = "days")) + 1) / 7
    )

    glimpse(daily_activity_cleaned)  # checking that the new formatting worked!
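    As a quick sanity check of the week arithmetic (a standalone sketch; week_of is a hypothetical helper, not part of the cleaning script), the endpoints of the collection period land in weeks 1 and 9, matching the 1-9 range described above.

```r
# Same arithmetic as the week_number column, wrapped in a helper for testing
week_of <- function(d) ceiling((as.numeric(d - as.Date("2016-03-12")) + 1) / 7)

week_of(as.Date("2016-03-12"))  # 1  (first day of collection)
week_of(as.Date("2016-05-12"))  # 9  (last day of collection)
```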


    Analyze

    Hourly Activity Analysis

    Average Intensity by Hour

    Grouping by hour so that, for example, all of the Ids exercising at 1 PM are pooled together, then calculating the average of each metric at that specific hour.

    {R}

    hourly_summary <- hourly_activity_final_cleaned %>%
      group_by(hour) %>%  # group by hour
      summarize(
        avg_intensity_by_hour = mean(total_intensity, na.rm = TRUE),
        avg_steps_by_hour     = mean(step_total, na.rm = TRUE),
        avg_calories_by_hour  = mean(calories, na.rm = TRUE)  # used in the calorie scatterplot below
      )

    print(hourly_summary)

    glimpse(hourly_summary)

    Grouping Average Intensity by Time of Day

    Assigned labels to ranges of hours during the day like morning, afternoon, evening, and night.

    {R}

    hourly_summary <- hourly_summary %>%
      mutate(
        time_of_day = case_when(
          hour >= 5  & hour < 12 ~ "Morning (5:00 AM - 11:59 AM)",
          hour >= 12 & hour < 17 ~ "Afternoon (12:00 PM - 4:59 PM)",
          hour >= 17 & hour < 21 ~ "Evening (5:00 PM - 8:59 PM)",
          TRUE                   ~ "Night (9:00 PM - 4:59 AM)"
        )
      )

    Summary of Analysis for Hourly Activity

    hourly_summary_table

    Calculations

    • Created new columns for date and time
    • Assigned labels to each time of day (night, morning, afternoon, evening)
    • Calculated the average intensity by hour for all users
    • Calculated the average steps by hour for all users

    These calculations will help us visualize and determine if there is a specific time of day and/or hour when activity is higher for users.


    Daily Activity Analysis

    Daily Activity Usage Summary by id

    Finding the average days used by Id and assigning each Id to a usage category from Low to High. Low usage is 1-20 days, Moderate is 21-40 days, and High is 41-63 days.

    {R}

    daily_usage_by_id_summary <- daily_activity_cleaned %>%
      group_by(id) %>%
      summarize(days_used = n()) %>%
      mutate(usage = case_when(
        days_used >= 1  & days_used <= 20 ~ "Low (1-20 days)",
        days_used >= 21 & days_used <= 40 ~ "Moderate (21-40 days)",
        days_used >= 41 & days_used <= 63 ~ "High (41-63 days)"
      ))

    Grouping the Average Days Used by Usage Category

    Finding the average days used within each usage category from Low to High. Low usage is 1-20 days, Moderate is 21-40 days, and High is 41-63 days.

    {R}

    # finding the average days used by usage category
    daily_avg_usage_summary <- daily_usage_by_id_summary %>%
      group_by(usage) %>%
      summarize(avg_days_used = mean(days_used, na.rm = TRUE)) %>%
      mutate(percentage = (avg_days_used / sum(avg_days_used)) * 100)

    # ordering the usage levels from Low to High
    daily_avg_usage_summary$usage <- factor(daily_avg_usage_summary$usage,
                                            levels = c("Low (1-20 days)",
                                                       "Moderate (21-40 days)",
                                                       "High (41-63 days)"))
    str(daily_avg_usage_summary)

    # average percentage of days used during the data collection period (63 days)
    daily_activity_percent_usage_summary <- daily_avg_usage_summary %>%
      group_by(usage) %>%
      summarize(percent_used = round((avg_days_used / 63) * 100, 0))

    Summarizing Weekly Averages by Distance Type

    Very Active Distance | Moderately Active Distance | Lightly Active Distance

    {R}

    week_distance_type_summary <- daily_activity_cleaned %>%
      group_by(week_number) %>%
      summarize(
        avg_very_active_distance       = mean(very_active_distance, na.rm = TRUE),
        avg_moderately_active_distance = mean(moderately_active_distance, na.rm = TRUE),
        avg_lightly_active_distance    = mean(light_active_distance, na.rm = TRUE)
      )

    # converting the data into long format for creating a chart
    week_distance_type_summary_long <- week_distance_type_summary %>%
      pivot_longer(cols = starts_with("avg"),
                   names_to = "activity_level",
                   values_to = "average_distance")

    Average Total Distance by Week

    Finding the average distance by the week number from 1 to 9.

    {R}

    week_count_distance_summary <- daily_activity_cleaned %>%
      group_by(week_number) %>%
      summarize(weekly_distance = mean(total_distance, na.rm = TRUE))

    {R}

    # average total distance by day of the week
    weekday_distance_summary <- daily_activity_cleaned %>%
      group_by(day_of_week) %>%
      summarize(day_of_week_distance = mean(total_distance, na.rm = TRUE))

    Summary of Analysis for Daily Activity

    daily_activity_summary_tables

    Calculations

    • Grouped users by days used and put them in newly created categories for usage level from low to high
    • Summarized the number of days used by each usage level.
    • Calculated the weekly averages by distance type.
    • Calculated the average total distance by the day of the week.
    • Calculated the average total distance by week count.

    These calculations will help us determine (1) the average intensity of activity for most users and (2) whether there is a specific day of the week, or week within the data collection period, when user activity (as measured by distance) is higher.



    Share

    Hourly Activity Visualizations

    heat-map-of_intensity-level-by-hour

    The chart above reveals that users are most active between 8am and 8pm with the highest activity hours being 12:00pm and from 5:00pm - 8:00pm.

    {R}

    ggplot(hourly_summary, aes(x = hour, y = avg_intensity_by_hour, fill = avg_intensity_by_hour)) +
      geom_col() +
      scale_fill_gradient(low = "green", high = "red") +
      theme_minimal() +
      labs(title = "Average Intensity by Hour", x = "Hour", y = "Average Intensity")


    avg-intensity-by-time-of-day_column_chart

    The chart above further shows that intensity is highest in the evening, followed closely by the afternoon. I decided to group hours of the day by time of day (e.g., "morning") to further illustrate how activity varies throughout the day.

    {R}

    ggplot(hourly_summary, aes(x = hour, y = avg_intensity_by_hour, fill = time_of_day)) +
      geom_col() +
      scale_fill_manual(values = c("Morning (5:00 AM - 11:59 AM)"   = "#caf0f8",
                                   "Afternoon (12:00 PM - 4:59 PM)" = "#00b4d8",
                                   "Evening (5:00 PM - 8:59 PM)"    = "#023e8a",
                                   "Night (9:00 PM - 4:59 AM)"      = "#03045e")) +
      theme_minimal() +
      labs(title = "Average Intensity by Time of Day", y = "Average Intensity",
           x = "Hour", fill = "Time of Day")


    avg-intensity-by-avg-calories-per-hour-scatterplot

    The scatterplot above shows that there is a positive correlation between intensity and calories burned in an hour. It seems obvious but confirms that with more intensity, more calories are burned.

    {R}

    ggplot(hourly_summary, aes(x = avg_calories_by_hour, y = avg_intensity_by_hour)) +
      geom_point(color = "#48cae4", size = 5, alpha = 0.7) +
      geom_smooth(method = "lm", color = "#e76f51") +  # trendline
      labs(title = "Average Intensity by Average Calories per Hour",
           x = "Average Calories", y = "Average Intensity") +  # adding labels
      theme_minimal() +
      theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
            axis.title = element_text(size = 14),
            axis.text  = element_text(size = 12))  # adjusting sizes of labels


    The scatterplot above shows that there is a positive correlation between intensity and steps each hour. It seems obvious but confirms that with more steps per hour, there is more intensity.

    {R}

    ggplot(hourly_summary, aes(x = avg_steps_by_hour, y = avg_intensity_by_hour)) +
      geom_point(color = "#48cae4", size = 5, alpha = 0.7) +
      geom_smooth(method = "lm", color = "#e76f51") +  # trendline
      labs(title = "Average Intensity by Average Steps per Hour",
           x = "Average Steps", y = "Average Intensity") +  # adding labels
      theme_minimal() +
      theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
            axis.title = element_text(size = 14),
            axis.text  = element_text(size = 12))  # adjusting sizes of labels

    Daily Activity Visualizations

    avg-days_used_pie_chart

    The pie chart above shows the distribution of average days used across three defined usage levels. It shows that out of the 35 unique users, about half fell into the low or moderate usage categories.

    {R}

    ggplot(daily_avg_usage_summary, aes(x = "", y = avg_days_used, fill = usage)) +
      geom_bar(stat = "identity", width = 1) +  # bar chart to be converted into a pie
      scale_fill_manual(values = c("Low (1-20 days)"       = "#fde9ee",
                                   "Moderate (21-40 days)" = "#f3c985",
                                   "High (41-63 days)"     = "#fd8f77")) +
      coord_polar("y", start = 0) +  # convert to polar coordinates for a pie chart
      theme_minimal() +
      theme(axis.text.x = element_blank(),    # remove x axis text
            axis.text.y = element_blank(),    # remove y axis text
            axis.ticks  = element_blank(),    # remove axis ticks
            panel.grid  = element_blank()) +  # remove grid lines
      geom_label(aes(label = paste0(round(avg_days_used, 1))),  # add value labels
                 position = position_stack(vjust = 0.4),        # position labels inside the slices
                 color = "white", size = 5, fontface = "bold",  # text color and size
                 label.padding = unit(0.5, "lines"),            # add padding around the text
                 fill = "black", alpha = 0.6) +
      labs(title = "Breakdown of Average Usage by Level",
           subtitle = "Low, Moderate, and High Usage Spanning 1–63 Days",
           fill = "Usage Levels", x = "", y = "")


    avg-total-distance-by-weekday-col-chart

    The chart above shows that Monday, Wednesday, and Saturday are the days with the highest distances logged throughout the week.

    {R}

    ggplot(weekday_distance_summary, aes(x = day_of_week, y = day_of_week_distance)) +
      geom_col(fill = "#fd8f77") +
      theme_minimal() +
      labs(title = "Average Total Distance by Weekday",
           fill = "Weekday", x = "Weekday", y = "Distance (miles)")


    avg-total-distance-by-week-and-distance-type-grouped-col-chart

    The chart above shows that over the 9-week period, most of the average distance falls into the lightly active category, and there is less activity overall in the first three weeks.

    {R}

    ggplot(week_distance_type_summary_long,
           aes(x = factor(week_number), y = average_distance, fill = activity_level)) +
      geom_bar(stat = "identity", position = "dodge") +
      scale_fill_manual(values = c("avg_lightly_active_distance"    = "#f3c985",
                                   "avg_moderately_active_distance" = "#fd8f77",
                                   "avg_very_active_distance"       = "#ec3672"),
                        labels = c("avg_lightly_active_distance"    = "Light Active Distance",
                                   "avg_moderately_active_distance" = "Moderate Active Distance",
                                   "avg_very_active_distance"       = "Very Active Distance")) +
      labs(title = "Average Distance by Distance Type and Week",
           x = "Week Number", y = "Average Distance (miles)", fill = "Distance Type") +
      theme_minimal()


    avg-total-distance-by-week-col-chart

    The chart above shows that over a period of 9 weeks, the first three weeks have less distance logged.

    {R}

    ggplot(week_count_distance_summary, aes(x = week_number, y = weekly_distance, fill = weekly_distance)) +
      geom_col() +
      scale_fill_gradient(low = "#fde9ee", high = "#ec3672") +
      theme_minimal() +
      labs(title = "Average Total Distance by Week",
           fill = "Weekly Distance", x = "Week Number", y = "Average Distance (miles)") +
      scale_x_continuous(breaks = seq(min(week_count_distance_summary$week_number),
                                      max(week_count_distance_summary$week_number), by = 1))



    Act

    Top High-Level Recommendations Based on My Analysis

    Fitness Onboarding and Habit Formation

    The data shows that activity is lower overall during the first three weeks of usage. This suggests that new users wore their devices less during the first three weeks, either while building a habit of wearing them or while starting a new exercise regime. I recommend creating reminders or goals for new users, as a sort of “fitness onboarding,” so that new users develop a habit of wearing their devices and thereby use them more from the start. Fitness onboarding can help users form habits, especially if they are new to wearing such devices.
    I suggest features like:

  • Weekly progress updates
  • Tips for building a routine with wearing the device
  • Rewards for Streaks

    Since roughly half of the users fall into the Low to Moderate use categories, I’d create rewards for being on a streak of using the device. Gamifying the experience with rewards for streaks is a way to enhance engagement with the device and increase usage.
    I suggest rewards like:

  • Badges for achieving milestones
  • Social sharing features to encourage friendly competition
  • Discounts for users who reach a specific streak milestone

    Goal Setting for Running or Walking Events

    Most users are “lightly active” based on distance, with “very active distance” being the second-largest category. Since distance is measured by steps, I’d create goal-setting features and training plans by activity level for users training for running or walking events like marathons, 5Ks, etc. That most users are lightly active suggests aspirational goal-setting features could motivate them to increase their activity levels.

    Customizable Calorie Burning Goals

    From the two scatterplots we can see that as intensity increases, so do steps and calories burned. I’d offer a way for users to pick their calorie-burning goal based on either intensity or steps. This will help users burn calories in a way they prefer. Giving users flexibility to align goals with their preferences (intensity vs. steps) adds personalization, which is valuable for user engagement.
    I suggest features like:

  • Showing how specific activities contribute to calorie burn.
  • Reminding users to go on a walk if intensity is too low between morning and evening.