As I discussed in my first blog post, I began an ultralearning project where I dedicated time every single day for a year to learning data science. I used this time to build basic technical skills by designing SQL databases, learning how to scrape the web, developing deep learning/ML algorithms in Python, and deploying an interactive machine learning web app called CupUp. I also used this time to learn how to read textbooks more efficiently, get a sense of basic graphic design principles, and write about my work to make it publishable (blog publishable at least, hello!). More important than all of that, I used this time to generate ~366 days of data. Regardless of what I learned on any given day, I always created a Google Calendar event to keep track of the exact amount of time I spent learning, typically down to the exact minute I started and stopped.
Figure 1: My Google Calendar for a sample week in January 2021
While I understand that the quantity of time spent learning does not necessarily indicate the quality of learning, I usually stopped the clock whenever I found myself losing focus or direction. Therefore, let's assume that more time spent learning on a given day does indicate a better learning experience.
Since my time was tracked using Google Calendar, I needed to find a way to convert this data into a more accessible format for analysis. I ended up using an online application called GTimeReport to convert my Google Calendar data from the past year into an Excel file, which made it much easier to analyze my time spent learning. Whether or not GTimeReport is a trustworthy app is anyone's guess, but they haven't done me wrong yet (use them at your own risk)!
With newly granted access to my calendar data, I created and cleaned a Pandas dataframe using the code below. Here, "Text" represents the title of the calendar event. Since every event I made for this project had the same "Data Science Learning" title, I was able to lowercase the titles and filter the dataframe down to only those calendar events containing the 'data science l' substring.
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
import random
style.use('ggplot')
# Read in our calendar excel file
ul = pd.read_excel("CalendarData_FINAL.xlsx")
# Focus on columns we want to use for analysis
# "Text" represents the title of the event
ul = ul[['Text', 'Weekday', 'Start time', 'End time', 'Description']]
ul = ul[ul['Text'].notna()]
ul['Text'] = ul['Text'].str.lower()
# All ultralearning events have exactly the same title, which
# makes filtering for data science learning events easy!
ul = ul[ul.Text.str.contains("data science l")]
# Specify the times as datetime objects
ul['Start time'] = pd.to_datetime(ul['Start time'], format='%Y-%m-%d %H:%M:%S')
ul['End time'] = pd.to_datetime(ul['End time'], format='%Y-%m-%d %H:%M:%S')
# Pick out the times we are interested in
ul['Date'] = ul['Start time'].dt.floor('d')
ul['Year'] = ul['Start time'].dt.year
ul['Month'] = ul['Start time'].dt.month
ul['Day'] = ul['Start time'].dt.day
ul['Hour'] = ul['Start time'].dt.hour
# Calculate the time spent learning in minutes
ul['Duration'] = ul['End time'] - ul['Start time']
ul['Duration'] = ul['Duration'].dt.seconds / 60
ul.head()
Figure 2: Output of the first 5 rows of the ul (ultralearning) dataframe
Some days I broke up learning into different events. To get the total amount of time spent learning for each calendar day, I grouped the dataframe by date and applied the sum function.
# Group event data by Date and sum their values.
# Some days I break learning into multiple events.
# This ensures that the time spent learning on a
# given day is the sum of the time spent in all events
# made on that day.
by_day = ul.groupby('Date')[['Duration']].sum()
# This line of code helps fill in any missing days with 0s.
# Retrieved from: https://stackoverflow.com/questions/31867660/python-pandas-making-date-index-continuous
by_day = by_day.reindex(pd.date_range(min(by_day.index),
max(by_day.index),
freq='D')).fillna(0)
by_day.head()
Figure 3: Output of the first 5 rows of the by_day dataframe
Now let’s take a high-level look at how the data science learning process has gone over the course of the project. First, we will print out some overall stats.
# Print out some stats about the time spent learning
print(f"Ultralearning Project Stats (Minutes)\n")
print(f"Total time spent learning: {round(by_day['Duration'].sum(), 2):,}")
print(f"Max time spent learning: {round(by_day['Duration'].max(), 2)}")
print(f"Average time spent learning: {round(by_day['Duration'].mean(), 2)}")
print(f"Median time spent learning: {round(by_day['Duration'].median(), 2)}")
Which outputs the following ultralearning project stats (in minutes):
- Total time spent learning: 36,362.63 (~606 hours)
- Max time spent learning: 517.98 (~8.6 hours)
- Average time spent learning: 98.54
- Median time spent learning: 90.00
Since the average and median time spent learning are relatively close to one another, it would seem that my time spent learning every day remained pretty consistent. Let's plot the learning duration over time to get a high-level overview of how time was allocated to the project. Note that in each of the graphs below, the horizontal black line represents my 90-minute goal benchmark.
# Plot our duration as a line chart
by_day['Duration'].plot()
# Add necessary plot and axis titles to the chart
plt.title('Daily Data Science Learning')
plt.ylabel('Minutes Spent Learning')
# Set the yticks to be multiples of 60 to
# easily identify different hours spent learning
plt.yticks(list(range(0, 550, 60)))
# Add the horizontal black line showing the
# daily target learning time
plt.axhline(90, c='black', alpha=0.3)
plt.tight_layout()
plt.savefig(r'DailyDataScienceLearning.png', dpi=300)
Figure 4: Complete overview of the time spent learning data science every day for 12 months.
There are a few trends I can immediately spot. First, the duration generally oscillates around the 90-minute marker line, indicating that if I learned less on one day, I tended to make it up the next day. There are also about 10 days that get really close to zero, mostly concentrated around holidays (Halloween, Thanksgiving, Christmas, New Year's, etc.). At these points I was more interested in spending time with friends and family than with code (no hard feelings, Python). There is also a strong drop-off around March, when I moved across the country. Around January, the project starts to center around the 60-minute mark, which is around the time I started my first full-time data science job. The average time spent learning is then hoisted back up to the ~160-minute mark at the start of a "Data Science Fundamentals" course I enrolled in on Mondays and Wednesdays in early June.
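If you want to verify which days dipped toward zero, a quick check like the one below (a minimal sketch using the by_day dataframe built earlier) lists the lowest-duration days so you can compare them against the calendar:
# Print the ten days with the least time spent learning;
# most of these should land on or near major holidays
print(by_day['Duration'].nsmallest(10))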
Overall we can see a lot of jumping around my goal. How often did I actually meet or exceed my goal time of 90 minutes a day? We can figure this out using the code below:
# Fraction of days below 90 minutes.
# Note that we use the per-day totals (by_day) here,
# since the goal is measured per day, not per event.
below_thresh = len(by_day[by_day['Duration'] < 90]) / len(by_day)
# Fraction of days at exactly 90 minutes
at_thresh = len(by_day[by_day['Duration'] == 90]) / len(by_day)
# Fraction of days exceeding 90 minutes
above_thresh = len(by_day[by_day['Duration'] > 90]) / len(by_day)
labels = ["Below Goal", "Met Goal", "Exceeded Goal"]
sizes = [below_thresh, at_thresh, above_thresh]
colors = ["#0083a0", "#515052", "darkred"]
# Figure below accomplished using
# https://stackoverflow.com/questions/45771474/matplotlib-make-center-circle-transparent
# https://stackoverflow.com/questions/21572870/matplotlib-percent-label-position-in-pie-chart
fig, ax = plt.subplots()
wedges, text, autotext = ax.pie(sizes, colors=colors, labels=labels,
autopct='%1.1f%%', pctdistance=.425)
plt.setp(wedges, width=0.3)
ax.set_aspect("equal")
# The produced png will have a transparent background
plt.savefig("GoalsMet.png", dpi=300, transparent=True)
plt.show()
Figure 5: Time spent learning divided into times below, at, and exceeding my daily 90 minute goal.
It looks like time spent above and below my goal balanced out almost exactly. This is likely because I knew I wanted to average 90 minutes a day for this project and deliberately learned more or less to stay around that goal. Overall, I pretty much always managed to find some time every day to learn. That is the key to this whole project. Whether I learned for four hours or forty minutes, I still had a portion of every single day reserved for learning and exploring data science.
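To sanity-check that consistency claim, a short sketch like this one (building on the by_day dataframe, with zero-duration days standing in for missed days) counts the days with no recorded learning and finds the longest unbroken streak:
# Count calendar days with zero recorded learning
zero_days = (by_day['Duration'] == 0).sum()
print(f"Days with no recorded learning: {zero_days}")
# Find the longest unbroken streak of learning days.
# Each run of consecutive nonzero days gets its own group id;
# the largest group sum is the longest streak.
learned = by_day['Duration'] > 0
streak_id = (learned != learned.shift()).cumsum()
print(f"Longest streak: {learned.groupby(streak_id).sum().max()} days")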
How did learning fluctuate from day to day? Were peaks always followed by troughs in learning? Figure 6 shows how each day's learning time changed relative to the day before. Here, a value of 0 means that I spent as much time learning on a given day as I did the day before.
# Calculate duration spent learning by day
by_days = ul.groupby('Date')[['Duration']].sum()
# Calculate differences (both absolute and percent) in
# durations spent learning between days
by_days['Change From Yesterday'] = by_days['Duration'].diff()
by_days['Pct Change from Yesterday'] = by_days['Change From Yesterday'] / by_days['Duration'].shift()
# Plot a histogram of the change since yesterday
by_days['Change From Yesterday'].hist(bins=75)
plt.title('Daily Change in Learning Duration')
plt.xlabel('Duration Change from Yesterday (Minutes)')
plt.xticks(list(range(-450, 451, 90)))
plt.ylabel('Number of Days')
plt.savefig(r'MinutesDailyChange.png', dpi=300)
Figure 6: Histogram of the daily change in minutes spent learning.
Here we can see a nearly perfect normal distribution balanced around 0. This tells me that on most days, it's a pretty random shot whether I am going to learn for hours or just a few minutes. It also suggests that I am good about making up for less active days: if I learned for only 60 minutes one day, I made up for it by learning for 120 minutes the next. Since the distribution of change is centered around 0, on average I am likely to spend about as much time learning today as I did yesterday. Therefore, if I wanted to increase the amount of time I spend learning every day, a slow and gradual increase in time dedicated to learning would be a good way to build momentum and get used to longer learning durations.
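For a rough numerical check on how centered that distribution is, something like the snippet below (a small sketch reusing the by_days dataframe from above) prints the mean and standard deviation of the daily change:
# A mean near 0 supports the idea that, on average, today's
# learning time matches yesterday's
change = by_days['Change From Yesterday']
print(f"Mean change: {change.mean():.2f} minutes")
print(f"Std of change: {change.std():.2f} minutes")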
Next, I decided to look at how the day of the week affected my learning time (shown below in Figure 7).
# Make a copy of the dataframe to group by weekday
# and convert the date index into a datetime series object
by_weekday = by_day.reset_index().copy()
by_weekday.rename({'index':'Date'}, axis=1, inplace=True)
by_weekday['Date'] = pd.to_datetime(by_weekday['Date'])
# Group by the mean of the weekday, where Monday is 0 and Sunday is 6
weekday_mean = by_weekday.groupby(by_weekday['Date'].dt.weekday)[['Duration']].mean()
weekdays = ['Monday','Tuesday','Wednesday','Thursday','Friday', 'Saturday', 'Sunday']
weekday_mean['Weekday'] = weekdays
weekday_mean
# Color most bars grey, but highlight the highest
# bar with a dark red
colors = ["#515052" for i in range(len(weekdays))]
colors[2] = "darkred"
# Plot the average time spent learning by week
plt.bar(weekday_mean['Weekday'], height=weekday_mean['Duration'], color=colors)
plt.xticks(rotation=30)
plt.yticks([15, 30, 45, 60, 75, 90, 105, 120])
# Add chart labels
plt.ylabel('Average Minutes Learned')
plt.title('Time Spent Learning vs Day of the Week')
# Add the 90 minute goal marker
plt.axhline(90, c='black', alpha=0.5)
plt.tight_layout()
plt.savefig(r'Weekdays.png', dpi=300)
Figure 7: The average number of minutes spent learning for each day of the week.
Interestingly, there is a significant dip on Fridays (usually spent with friends or [insert relatable streaming platform]) and Sundays. Wednesday clearly stands out as my best day for learning data science, likely because of that MW course!
Finally, I decided to look at how the time I started learning affected the total time I would spend learning (shown below in Figure 8). Keep in mind that on most days I wake up around 6 am and go to sleep around 10 pm.
# Group the ultralearning dataframe (ul) by hour, then join
# the total duration and event count together to find the
# average time spent learning based on starting hour
learning_by_start_time = ul.groupby('Hour')[['Duration']].sum()
num_times_started = ul.groupby('Hour').count().Weekday
learning_by_start_time = learning_by_start_time.join(num_times_started)
learning_by_start_time.rename({'Weekday': 'Count'}, axis=1, inplace=True)
learning_by_start_time['AvgDuration'] = learning_by_start_time['Duration'] / learning_by_start_time['Count']
# Color the bars: grey by default, dark red for the best
# starting hour, and cool blue for the shortest ones
start_times = sorted(ul['Hour'].unique())
colors = ["#515052" for i in range(len(start_times))]
colors[5] = "darkred"
colors[0] = "#0083a0"   # Cool blue
colors[3] = "#0083a0"   # Cool blue
colors[-3] = "#0083a0"  # Cool blue
colors[-10] = "#0083a0" # Cool blue
# Plot a bar graph of the average duration by hour started
# (the colors list must be built before plotting)
plt.bar(learning_by_start_time.index, height=learning_by_start_time['AvgDuration'], color=colors)
# Add additional chart labels and info
plt.xticks(start_times)
plt.title('Starting Hour vs Average Duration Spent Learning')
plt.xlabel('Hour Started')
plt.ylabel('Average Minutes Learned')
plt.axhline(90, c='black')
plt.savefig(r'StartingHours.png', dpi=300)
Figure 8: Average amount of time spent learning based on starting time.
Figure 8 seems to tell me that learning in the mid-morning (2-3 hours after waking up) is the best way to increase the time spent learning. It could also mean that when I have the opportunity to spend the morning learning data science (e.g., weekends when I do not normally have other plans), I can dedicate more time to learning. The shortest times spent learning (marked in blue) fall at almost exact thirds of the day. The two shortest average durations (5 am and 11 pm) are times when I am normally sleeping, and learning is probably lackluster due to sleep deprivation. The middle hour, 1 pm, seems to indicate a general lull in enthusiasm after lunch.
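Rather than eyeballing the chart, the weakest starting hours can be pulled straight from the grouped dataframe (a quick sketch using the learning_by_start_time dataframe from above):
# The three starting hours with the lowest average duration;
# these should match the blue bars in Figure 8
print(learning_by_start_time['AvgDuration'].nsmallest(3))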
Overall project results:
- Between July 2020 and August 2021 I spent a total of 606 hours learning about data science
- I spent an average of 98.5 minutes per day learning data science
- I missed 3 days total (which I made up at the end of the project)
- The most time I spent learning in one day was 8.6 hours, on July 11th, 2021, the night before my presentation of GaoGetter
My primary takeaways from this analysis are:
- Start the project early in the day (but not too early). Starting 2-3 hours after waking up is when I seem to build the most momentum.
- To break out of a rut, a slow and steady increase in daily learning time is the best way to work back up to longer sessions.
- Fridays are my weakest learning days. This is probably because the weekend is just around the corner. Understanding this, it is probably a smart idea to budget less learning time for Fridays and more for Saturdays.
I am proud to have completed this self-directed project over the course of a year. This is the most consistent I have ever been about teaching myself something, and I have learned so much about my personal learning style and habits. To cap off the project, I decided to stylize the learning over time as a bar chart. The bar chart reminded me of a city skyline, so I made the following picture to represent my year of learning:
x,y = range(len(by_day)), by_day.Duration
palate = ["#003049","#515052","#ffff00","#fcbf49","#eae2b7"]
# Make our plot
fig, ax = plt.subplots(ncols=1, figsize=(5.25, 2.5))
# Add a horizontal line
#ax.axhline(y=1.5, color='black', alpha=0.5, linewidth=1,
# zorder=2)
# Removing those pesky borders
ax.axis('off')
# Add the buildings
ax.bar(x, y, color=palate[1], zorder=1)
# Add lights to the buildings
max_lights_per_building = 200
light_size = 0.025
light_x, light_y = [], []
for pillar_x, pillar_y in zip(x, y):
    # Only add lights on days when I actually
    # spent time learning data science
    if pillar_y:
        num_lights = random.randint(0, max_lights_per_building)
        light_y += list(np.random.uniform(light_size, pillar_y, size=num_lights))
        light_x += [pillar_x for i in range(num_lights)]
light_bottom = np.array(light_y) - light_size
ax.bar(light_x, np.array(light_y)-light_bottom, bottom=light_bottom,
color=palate[2], zorder=2)
# Randomly add stars to the sky
num_stars = len(x) # Number of days of data science
star_x = np.random.uniform(0, len(x), size=num_stars)
star_y = np.random.uniform(low=2, high=max(y)*1.5, size=num_stars)
# Clip so that no star is given a negative size
normal_distribution_sizes = np.random.normal(loc=0.5, scale=0.25, size=len(star_x)).clip(min=0.05)
ax.scatter(star_x, star_y,
s=normal_distribution_sizes,
c=palate[2], edgecolors='none', marker='*')
# Add a moon to the sky
#ax.scatter(int(len(x)/2), max(y)-1.45,
# s=100, c='white', zorder=2)
# Add moon border
#ax.scatter(int(len(x)/2), max(y)-1.45,
# s=130, c=palate[1], zorder=1)
plt.savefig(r'ULTrackerBar_FINAL_Dimmin.png',
            facecolor=palate[0], dpi=300)
print(f"Days of Data Science: {len(x)}")
Figure 9: Stylized depiction of my year spent learning data science
While this analysis is custom-tailored to my lifestyle, habits, and experiences, you can apply the same techniques to your own projects (all the code you see above is provided for you here). It all starts with solid data collection on a metric related to your project (whether that is time dedicated to the project, weight lifted, pages written, etc.). I would encourage you to start by keeping track of one metric in your life over a period of one week. How does that metric change over time? Has that metric really been correlated with the success of your project? And most importantly, has tracking that metric helped you stay focused on that project?
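If you want a minimal starting point, the sketch below shows one way to do it (the file name my_metric_log.csv and the Date/Metric column names are just placeholders); log one number per day to a CSV and the rest is the same analysis you have already seen:
import pandas as pd
import matplotlib.pyplot as plt
# Hypothetical log: one row per day with a single metric,
# e.g. minutes spent on your project
log = pd.read_csv("my_metric_log.csv", parse_dates=["Date"])
log = log.set_index("Date").sort_index()
# Fill any skipped days with 0 so the timeline is continuous
log = log.reindex(pd.date_range(log.index.min(),
                                log.index.max(),
                                freq='D')).fillna(0)
# Plot the metric over time, with a line marking your goal
log['Metric'].plot()
plt.axhline(90, c='black', alpha=0.3)  # swap in your own goal
plt.title('My Tracked Metric Over Time')
plt.show()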
Reach out to me and tell me what you find, I’d love to hear about it! 🙂