Inspiration

Hashy Birthday!

By BillyDataHayes Sept. 4, 2022, 2:57 a.m. Comments

Supposedly if you have 100 people in a room then two of them were born on the same day of the year (likely not the exact same day unless it was a high school reunion or something). This is sometimes called the “Birthday Paradox” or the “Birthday Problem”.

I didn’t quite believe it at first so I decided to write a script to test it out. We can quickly test this idea by picking a random number from 0 to 364 to count as a random birthday, where 0 can represent January 1st and 364 can represent December 31st. If we randomly generate 100 numbers from 0 to 364, we can simulate 100 birthdays. If we decide to randomly generate those numbers as an array we can simply check for duplicate values to determine whether two people shared a birthday. Running this script a thousand or so times should give us a rough idea of how frequently a match would occur.

simulations = 1_000
num_people = 100
had_matching_birthdays = []
for _ in range(simulations):
    # Simulate 100 birthdays
    birthdays = np.random.randint(0, 365, num_people) # Simulate 100 birthdays
    # Check if any of those birthdays are unique 
    unique_birthdays = np.unique(birthdays)
    birthdays_matched = len(birthdays) != len(unique_birthdays)
    had_matching_birthdays.append(birthdays_matched)

# Print the number of simulations where there was a matching birthday
print(sum(had_matching_birthdays))

Or if you want the fancy one-liner

Sum([len(np.unique(np.random.randint(0, 365, 100))) != 100 for _ in range(1_000)])

I ran this simulation one million times and every single time the room of 100 people had a matching birthday. This got my attention and I wanted to gain a more intuitive understanding of why this happened. I started by simplifying the problem and thinking of a network of two individuals.

Assuming that people are equally likely to be born on any given day, the probability that these two people share the same birthday is 1 / 365 (we’ll factor in leap years later). Therefore, we can think of this link between our two people as a 1/365 chance they share a birthday.
Two’s a crowd but three’s a party. If we add another person into the network we get

Now we can see that each node has two links. Person 1 can share a birthday with either Person 2 or Person 3. The probability of each link still remains the same, with each link representing a 1/365 chance of sharing a birthday. In total that means there are 3 opportunities for a shared birthday. As we add more and more people the number of links grows:

We can see that four people make 6 links and five people make 10 links. If we take the difference between the number of links at each step we can spot the following pattern:

People	Links	Link Difference
2	1	NaN
3	3	+2
4	6	+3
5	10	+4

Each time we add a new person the increase in the number of links goes up by one. Stated another way, the next person we add will mean an additional (people + links) number of links. Therefore I would expect six people to have 5+10 = 15 links, seven people to have 6+15 = 21 links, etc. We can generalize this phenomenon with the following equation:

num_links = (n*(n-1)) / 2

Which happens to tell us the number of links for any n number of people. This is a well-known equation in graph theory for obtaining the number of edges in a densely connected network of n nodes. A densely connected network is one where every node is connected to every other node.

Great! Now that we know how many links we have per person in the room we can start to calculate the probability that any of those links actually generates a matching birthday. With 1 link it’s easy (1/365), with 3 links it requires a bit more effort. We could simply multiply them together as (1/365)^n, but that would mean that as the number of people increases our probability of finding a matching birthday decreases which wouldn’t make sense. If we multiplied the probability that they didn’t share a birthday together (which we would expect to decrease over time) then subtract that product from 1 we could find the probability that those n people shared a birthday. Expressed as an equation this looks like

(1 - (prob_not_shared)**(num_links))

Where prob_not_shared in this case would be (1 – (1 / 365)) = 364 / 365.

Finally, putting all this together we find that the probability of at least two people in a room of 100 sharing a birthday is

1 – (364/365)^((100*(100-1)) / 2) = 0.9999987347684783

Or about 99.9999%, if anyone wants to place a bet against me on those odds I would love to take them up on it! By applying this equation from 2 to 366 people we can see just how fast the odds shift in favor of a match:

When the number of people reaches 21, there is about a 50/50 chance that two of them will share a birthday. Once the number of people reaches 366 we are guaranteed to have a match within our 365 day calendar. That is, unless we include leap years.

If we include leap years that number would need to be 367 (accounting for February 29th as the 366th day). We can also account for leap years in our equation by setting the probability of each link to be

((364*3) + 365) / ((365*3) + 366)

where we have 3 years where the probability of a match is 364 / 365 and one leap year where the probability of a match is 365 / 366. This adds a 0.000188% chance of a matching birthday for each link which is negligible for solving this problem (hence why it looks like the exact same graph):

It’s also interesting to imagine what we could do if we threw some data into the mix. This is a lower limit on the probability of matching birthdays because it assumes that people are equally likely to be born on any given day (except for February 29th). However, I’m sure that there are some months where babies are more likely to be born than others. For instance in the United States babies are supposedly more likely to be born in August and July than other months, meaning that there would be an even higher chance of two Americans in a room sharing a birthday.

The speed at which these birthday matches occur is nothing new to computer scientists. They encountered this type of problem years ago when trying to come up with information-scrambling algorithms for a technique called hashing. Hashing can be useful when you’re working with sensitive information that you want to scramble so it remains uniquely identifiable to you while remaining nonsense to others. A simple hashing algorithm might be to let every letter of the alphabet A-Z represent a number and summing the results. For instance if we let A = 1, B = 2, C = 3, etc. and we hashed the string “DIMMIN” we would get

“DIMMIN” = 4 + 9 + 13 + 13 + 9 + 14 = 62

This is makes it easy to go from “DIMMIN” to 62 but impossible to go from 62 to “DIMMIN”. This particular algorithm also breaks down pretty quickly. We would get a matching output with the reversed string “NIMMID”:

“NIMMID” = 14 + 9 + 13 + 13 + 9 + 4 = 62.

This is referred to a hash collision because two different inputs “collide” by giving the same hashed output. Just like how increasing the number of people in the room increases the opportunities for a matching birthday, increasing the number of inputs to a hashing function increases the number of opportunities for a hash collision.

Contrary to our birthday matching example, hash collisions are typically undesirable. One of the useful features of hashing is in the context of cryptography to store sensitive information. Passwords are typically hashed when stored to protect them because users typically implement the same password all over the place. Hopefully, when you log into a website your password input is hashed and checked against the hash that your password is stored as. If a hacker gets access to a bunch of hashed passwords, they would then need to go through the extra step of trying to identify which hashing algorithm the site employed and putting different passwords through that algorithm to check for matches (very computationally expensive).

Of course, if a website used my hashing algorithm and my password was “DIMMIN” then hash collisions would cause a pretty significant security flaw. Anyone could log into my account with any password that hashed to 62 whether that be “DIMMIN”, “NIMMID”, or even just “A” 62 times. So while it may be impossible to derive my password "DIMMIN" from 62, hash collisions prevent this from being an effective password hashing algorithm. Thankfully there are better engineered hashing algorithms out there than mine. If you’re interested in the applications of hashing algorithms I would suggest looking into cryptography as well as the proof-of-work mechanism that is employed by the bitcoin blockchain.

If you’re interested in other applications of hash collisions then we’re on the same wavelength! Taking the idea of hash collisions to other problems and adding some data about their likelihoods opens up some interesting ideas. For instance, what is the probability that:

• Any two people that you work with live in the same neighborhood? Could they start carpooling to work together?
• Any two people from the same high school will end up working at the same company? Could it be worth searching through your company org chart to reach out to fellow alumni?
• Any two people have matching PIN numbers or passwords? Could you be using the exact same passwords as somebody else you know?

I’ll leave these questions for you to think about. In the meantime I have to get ready for my (potentially our) birthday coming up on September 15th. Yes I accept gifts from adoring fans.

Tags:

Inspiration

Written by BillyDataHayes

I am a data scientist that likes to learn. In my spare time, I can usually be found working on projects that build on concepts that I study in math and statistics. I enjoy discovering anything and everything data related - from neural networks to the neurons of the people who implement them.

Previous Post GaoGetter Next Post Mini Martial Artists

DIMMiN

Hashy Birthday!

Tags:

Written by BillyDataHayes

Spending Waaaaaaay Too Much Time at Bars

Sports Betting with Python - Part 1 Project Introduction

Mini Martial Artists