Bike Share Data Analysis - Capstone Project

benzifoo
Aug 2, 2021
7 min read

Updated: Sep 1, 2021

This is my first capstone project in data analytics. The analysis was done in R and compiled with R Markdown.

Data analysis matters in every business I can think of: it helps an organisation understand the problems it faces and explore its data in meaningful ways. Data on its own is merely facts and figures, but it gives decisions a firmer basis than gut instinct. Data analysis organises, structures, interprets and presents the data as useful insights with meaningful context. Those insights then feed into decision-making and actions aimed at improving the organisation's productivity, efficiency, and profitability.

Please download the R Markdown file for the code details.

In this hypothetical scenario, I’m a junior data analyst on the marketing analyst team of a bike-share company. Before preparing, cleaning, visualizing and analysing the available data, I first need to identify the right problem and an objective that matters for the decision-making process.

Problem

The business has hit a bottleneck: revenue has not grown much. The company wants to understand how casual riders and annual members use the bikes differently.

Objective

The director of the marketing department wants to maximize the number of annual memberships, using insights from the data to guide decisions that increase annual subscriptions.

Business Task

Analyze how user behaviour differs between casual riders and annual members. Increase the market share of the organization by converting casual riders to annual subscription members.

Discover Datasets

I downloaded the 2020 data as structured .csv files provided by the company. The files have sound data integrity and contain key information about riding patterns, ready for further analysis.

There are 10 dataset files, each with 13 attributes:

Quarter 1 (January to March) 2020: 426,887 entries
April 2020: 84,776 entries
May 2020: 200,274 entries
June 2020: 343,005 entries
July 2020: 551,480 entries
August 2020: 622,361 entries
September 2020: 532,958 entries
October 2020: 388,653 entries
November 2020: 259,716 entries
December 2020: 131,573 entries

To make the analysis easier, I concatenated these datasets into one data frame. It keeps the same 13 attributes (columns) and holds all the 2020 bike-sharing data, 3,541,683 entries (rows) in total.
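A minimal sketch of the concatenation step in base R. The three tiny data frames are made-up stand-ins for the real files, and the column names ride_id and member_casual are assumptions for illustration; with the real data, each element of the list would come from read.csv().

```r
# Toy stand-ins for the downloaded files (hypothetical values and column names).
q1  <- data.frame(ride_id = c("A1", "A2"),
                  member_casual = c("member", "casual"))
apr <- data.frame(ride_id = "B1",
                  member_casual = "member")
may <- data.frame(ride_id = c("C1", "C2", "C3"),
                  member_casual = c("casual", "member", "member"))

# With real files this list could be built with lapply(file_paths, read.csv).
monthly <- list(q1, apr, may)

# rbind() stacks data frames that share the same column names.
rides <- do.call(rbind, monthly)

# Sanity check: the combined row count equals the sum of the parts.
nrow(rides)
sum(sapply(monthly, nrow))
```

Comparing the two counts at the end is a cheap way to confirm that no file was skipped or read twice.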

Here are some of the steps taken for the analysis:

Data Cleaning

Remove duplicated entries.
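As a rough illustration, duplicates can be dropped in base R with duplicated(). The ride_id key column and the toy rows are assumptions, not taken from the real files.

```r
# Toy data with one repeated ride (hypothetical values and column names).
rides <- data.frame(
  ride_id       = c("A1", "A2", "A2", "A3"),
  member_casual = c("member", "casual", "casual", "member")
)

# duplicated() flags every repeat of a ride_id after its first occurrence,
# so negating it keeps exactly one row per ride.
rides_clean <- rides[!duplicated(rides$ride_id), ]

nrow(rides_clean)
```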

Parsing Datetime Data

Restructured the start and end datetime values:

Date: Year/Month/Day
Time: Hour/Minute/Second
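A small sketch of the parsing step in base R. The column names started_at and ended_at, the input format and the UTC time zone are assumptions; the real files may differ.

```r
# Toy data with datetimes stored as text (hypothetical values and column names).
rides <- data.frame(
  started_at = c("2020-08-01 14:05:30", "2020-08-02 09:15:00"),
  ended_at   = c("2020-08-01 14:35:30", "2020-08-02 09:27:30"),
  stringsAsFactors = FALSE
)

# Convert the text to proper datetime objects.
fmt <- "%Y-%m-%d %H:%M:%S"
rides$started_at <- as.POSIXct(rides$started_at, format = fmt, tz = "UTC")
rides$ended_at   <- as.POSIXct(rides$ended_at,   format = fmt, tz = "UTC")

# Split the start datetime into a date part and a time part.
rides$start_date <- format(rides$started_at, "%Y/%m/%d")
rides$start_time <- format(rides$started_at, "%H:%M:%S")

rides
```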

Data Manipulation

To refine the analysis further, I added the following columns:

ride_time_m

A new column holding the total ride time in minutes, for both members and casuals.

year_month

The year and month combined into one column.

weekday

The day of the week makes it easier to spot weekly patterns, such as whether people ride more on weekdays or at weekends.

start_hour

The hour of the day may also be useful for intraday analysis.
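The four columns above could be derived along these lines in base R. This is only a sketch: started_at and ended_at are assumed column names, and the two rows are made up.

```r
# Toy data (hypothetical values and column names).
rides <- data.frame(
  started_at = as.POSIXct(c("2020-08-01 14:05:30", "2020-08-03 09:15:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2020-08-01 14:35:30", "2020-08-03 09:27:30"), tz = "UTC")
)

# ride_time_m: ride duration in minutes.
rides$ride_time_m <- as.numeric(
  difftime(rides$ended_at, rides$started_at, units = "mins")
)

# year_month: year and month in a single column.
rides$year_month <- format(rides$started_at, "%Y-%m")

# weekday: looked up from the numeric day of week (0 = Sunday),
# which avoids depending on the machine's locale for day names.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
rides$weekday <- day_names[as.POSIXlt(rides$started_at)$wday + 1]

# start_hour: hour of the day the ride started (0 to 23).
rides$start_hour <- as.integer(format(rides$started_at, "%H"))

rides
```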

Here we can see the new columns that were created: ride_time_m, year_month, weekday and start_hour, along with the reorganized datetime. With a dataset this large, a common technique is to calculate a sample size and analyse a sample instead. For example, a population of 3,541,683 with a 99% confidence level and a 5% margin of error gives a sample size of 666. If we were using Excel for the analysis, sampling would be a must. However, R can process this amount of data, so I skipped this step and analysed the full dataset for more precise results.
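The sample size figure can be reproduced with Cochran's formula plus a finite population correction. This is a sketch under assumptions: the helper name sample_size is made up, and the formula, the conservative proportion of 0.5 and the rounded z-score of 2.58 are choices that happen to give 666, not necessarily the calculator used for the figure above.

```r
# Cochran's formula with a finite population correction.
# N = population size, z = z-score for the confidence level,
# e = margin of error, p = assumed proportion (0.5 is the most conservative).
sample_size <- function(N, z, e, p = 0.5) {
  n0 <- z^2 * p * (1 - p) / e^2   # sample size for an infinite population
  n  <- n0 / (1 + (n0 - 1) / N)   # adjust for the finite population
  ceiling(n)                      # round up to a whole observation
}

sample_size(3541683, z = 2.58, e = 0.05)          # rounded z-score: 666
sample_size(3541683, z = qnorm(0.995), e = 0.05)  # exact z-score: 664
```

The small gap between the two results comes only from rounding the z-score; either way, a few hundred rows would be enough at this confidence level and margin of error.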

I'm using R for this capstone project for two main reasons: the large dataset, and to gain experience with the language. SQL can also handle this amount of data; that will be my next project.

Analyze

The data exploration consists of building a profile of annual members and how they differ from casual riders.

Casual riders account for 38.6% of the data points and annual subscription members about 61.4%, which is not that bad at first glance.
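A split like this can be computed with table() and prop.table(). In the sketch below, the vector is toy data built to mirror the reported percentages, and the name member_casual is an assumption.

```r
# Toy data: 614 member rides and 386 casual rides (made up to mirror the split above).
member_casual <- c(rep("member", 614), rep("casual", 386))

# Count rides per rider type, then convert the counts to percentages.
counts <- table(member_casual)
share  <- round(100 * prop.table(counts), 1)

share
```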

Analysis Based On Plots

As we can see in this bar chart, members make up a bigger proportion than casual riders.

Monthly data distribution (Statistics)

Monthly data distribution (Chart)

The busiest month was August 2020, which accounted for 17.7% of the whole dataset. In every month of 2020 there were more annual subscription members than casual riders, which could reflect returning members. The distribution looks cyclical, which would usually be correlated with the seasons. However, Malaysia has summer-like weather all year round, so weather was ruled out as a factor.

Weekday Distribution (Statistics)

Weekday Distribution (Chart)

As we can see, the highest volume is on Saturday and Sunday, with Saturday ahead of Sunday. Even though members have more data points overall, casuals have more on Sunday. Member rides show a significant rise from Monday to Saturday.
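A weekday breakdown like this is essentially a cross-tabulation of weekday against rider type. A minimal sketch with made-up rows, where member_casual is an assumed column name:

```r
# Toy data (hypothetical values and column names).
rides <- data.frame(
  weekday       = c("Saturday", "Saturday", "Sunday", "Monday", "Sunday", "Saturday"),
  member_casual = c("casual", "member", "casual", "member", "casual", "member")
)

# Order the days Monday to Sunday instead of alphabetically.
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
rides$weekday <- factor(rides$weekday, levels = day_levels)

# Count rides for every weekday and rider type combination.
counts <- table(rides$weekday, rides$member_casual)

# Express each cell as a percentage of all rides.
pct_of_total <- round(100 * prop.table(counts), 1)

counts
pct_of_total
```

Turning weekday into an ordered factor also keeps the days in calendar order on any chart built from the same data.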

Hour In A Day (Statistics)

Hour In A Day (Chart)

There's a bigger volume of riders in the afternoon.
We have more members during the morning, mainly between 1 pm and 11 pm, and more casuals between 11 pm and 12 pm.

This chart can be expanded by weekday

Distribution By Hour In A Day Divided By Weekday

There's a clear difference between the midweek and weekends. Let's generate charts for these two configurations.

Distribution By Hour Of The Day In The Midweek

The two plots differ in some key ways:

While the weekends have a smooth flow of data points, the midweek has a steeper flow of data.
The raw count of data points doesn't mean much, since each plot represents a different number of days.
There's a big increase in data points in the midweek between 1 pm and 3 pm, then it falls a bit. Another big increase runs from 7 pm to 1 am.
During the weekend we have a bigger flow of casuals between 2 pm and 11 pm.

It's fundamental to ask who the riders using the bikes at these times of day are. We can assume some factors; one is that members may be people who use the bikes for daily routine activities, like going to and from work. Let's check which types of bikes users prefer.

Type Of Bikes (Statistics)

Type Of Bikes (Chart)

Docked bikes have the biggest volume of rides compared to electric bikes and classic bikes. However, this may be because the company has more docked bikes than other types. Members have a stronger preference for classic bikes than casuals, 68% more.

Other Variables

Now let's look at some variables and get summary statistics from the dataset.

ride_time_m

The minimum and maximum values would distort the analysis. Common sense tells us they probably come from malfunctioning stations returning bad dates.

Identify Outliers

We can see the difference between the 0% and 100% quantiles is 185,495 minutes. The difference between the 5% and 95% quantiles is 68 minutes. Hence, we need to use a subset of the dataset without outliers. The subset will contain 95% of the dataset.
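One way to build such a subset is to filter on quantile cut-offs. The sketch below assumes the cut-offs are the 5th and 95th percentiles mentioned above and uses made-up ride times; trimming both tails at these points keeps roughly the middle 90% of rows.

```r
# Toy ride times in minutes, with one negative value and one huge value
# standing in for bad records (all values are made up).
ride_time_m <- c(-5, 1, 3, 5, 8, 10, 12, 15, 18, 20,
                 22, 25, 30, 35, 40, 45, 50, 60, 75, 5000)

# The 5th and 95th percentiles are used as cut-offs (an assumption).
bounds <- quantile(ride_time_m, probs = c(0.05, 0.95))

# Keep only the rides that fall between the two cut-offs.
keep    <- ride_time_m >= bounds[1] & ride_time_m <= bounds[2]
trimmed <- ride_time_m[keep]

range(ride_time_m)  # spread before trimming
range(trimmed)      # spread after trimming
```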

Removing Outliers

Multi-Variable Exploration (ride_time_m)

One of the first ways to look at the interaction between the other columns and ride_time_m is a box plot, with subplots based on the casual_members column, along with the summarized data.

Riding Time (Casual vs Member)

As we can see, casuals have longer riding times than members, probably because casual riders usually ride for leisure. Their interquartile range is also wider than that of members, consistent with the median statistics seen earlier (Casual = 19.4 and Member = 11.7).
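The median and interquartile range per rider type can be computed with tapply(). This is a sketch on made-up ride times, with member_casual as an assumed column name, so the numbers will not match the figures quoted above.

```r
# Toy data (hypothetical values and column names).
rides <- data.frame(
  member_casual = c(rep("casual", 5), rep("member", 5)),
  ride_time_m   = c(10, 15, 20, 25, 40,  5, 8, 12, 14, 20)
)

# Median ride time per rider type.
med <- tapply(rides$ride_time_m, rides$member_casual, median)

# Interquartile range (75th percentile minus 25th) per rider type.
iqr <- tapply(rides$ride_time_m, rides$member_casual, IQR)

med
iqr
```

The same grouping can be drawn with boxplot(ride_time_m ~ member_casual, data = rides), which shows the median and interquartile range as the line and the box for each rider type.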

Let’s identify more insights after plotting with weekday.

Riding Time Of The Week (Casual vs Member)

Members' riding time stays roughly unchanged during the weekdays, and the boxplot shows a significant rise at the weekends. Casuals' riding time follows a curve, decreasing from Monday to Friday and increasing from Friday to Sunday.

Now we can start finding out the type of bikes during the week.

Type Of Bikes Riding Time (Casual vs Member)

Electric bikes and classic bikes both have shorter riding times than docked bikes. Perhaps docked bikes are mostly used for leisure.

Trends

Data shows there are more members than casuals.
Data shows there are differences in the riding flow between the members and casuals riders in a week.
Data shows a significant rise in the third quarter of 2020. This needs further investigation.
The motivations of members and casuals are different. For example, based on the schedule data, members usually ride for work and casuals for leisure.
The example above leads to the question of why members have shorter riding times than casuals.
Members tend to prefer classic bikes.
Volume increases from the afternoon.
The biggest volume in a week is typically on the weekends.

Types of activities

Work
Workout
Leisure

Conclusion:

Although the data shows there are more members than casuals, we have to look at the casuals' trends. Casuals usually show a significant increase on the weekends, so I would say casuals ride during their idle time, for either workout or leisure. The marketing team needs to target casuals by identifying where they live. Allocating bikes near their housing areas so that they can ride to work would help convert casuals into members.

Next, we know that members ride mostly from Monday to Friday for work, which means their offices are not too far from the bike stations. These locations need to be well maintained, with adequate bikes standing by, to avoid members cancelling their annual subscriptions. Otherwise, if members keep cancelling while the organization is trying to convert casuals into members, the time and resources spent on making the organization profitable would be wasted. The maintenance team should note that the peak time is 2 pm to 11 pm, so it can schedule station maintenance in the morning.

Members tend to prefer classic bikes based on the data shown. Hence, the organization should acquire more classic bikes to enhance the users' riding experience and help the organisation thrive.
