Monday, February 28, 2022

Feelings about Set Similarity

The first tiny program I worked on for Code Doodle was a list difference calculator. It takes two lists and tells you which elements are on both lists or how similar the lists they are. 

I made this mostly to back up my joke that every time you watch any World Cup coverage an English sportscaster will talk about 1966 and how well the English team has done against whoever they're playing since then, despite the fact that ... none of the players are the same. Similarly the record of two NHL teams must be discussed even when there are more players who are playing for the other team than played for the same team after some period of time.  I thought having a program which backs up how ridiculous these types of statements are might be entertaining. As it happens, I also ran into a problem at work that needed me to find which people were on two lists, so it's not just a joke, but it's still mostly a joke.

All this program does is calculate the union, intersection and set difference for the two lists and then calculate how similar the two lists are. This is a bit messy in Java, due to the limitations of the Java Collections API (certainly the Python version of this program would be tiny), but nothing particularly hard to do. It was a fun project, so I wrote it up, wrote up some basic tests, and it seemed good enough.

Later on, when I was adding in the ability to read from a file, I put in two lists that were different than my original lists and thought I must have made a mistake. Even though I thought the two lists should be 50% similar, the program told me that they were 33% similar. 

So I went and checked through my code, my tests and my math. All of those were correct, as for as I could see, so I did what naturally comes next, I asked Google. After a few journeys through Stack Overflow and Wikipedia I managed to get something of an answer. I'm not weird, math is weird. 


Well, I am weird and math is weird.


Different Types of Set Similarity


My intuition when I was writing the program said to me that the similarity of two sets would be the size of the intersection of those sets, divided by the size of the union. If there are 10 total things and 3 of them are on both lists then the two lists should are 30% similar.

The problem came for me looking at two sets like this:

{a, b, c, d} and {c, d, e, f}

The intersection {c, d} has 2 elements and the union has 6 elements {a, b, c, d, e, f}, so by my reckoning above, the two sets are 33% similar. When I came back to it, my intuition said since each list has half unique elements and half common they should be 50% similar. So I had two "intuitive" results, both of which seemed to be mathematically correct, but which "felt" right in different contexts.

I should probably have done some reading first, but this wasn't a reading first kind of project. Once I did start reading I was at least pleased to discover that I'd done my math right.


Jaccard Coefficient


For my first calculation of similarity, I had stumbled upon the Jaccard coefficient (or the Jaccard index), which is the ratio of the size of the intersection to the size of the union.


A diagram of letters a - f with a - d grouped in a pink circle and d - f grouped in a blue circle.

Mathematical Notation: A = {a, b, c, d}, |A| = 4, B = {c, d, e, f}, |B| = 4

Mathematical Notation: Intersection of A and B = {c ,d}, Size of Intersection of A and B = 2, Union of A and B = {a, b, c, d, e ,f}, Size of Union of A and B = 6


Mathematical Notation: Jaccard coeefficient is the size of the Intersection of A and B divided by the size of the Union of A and B. This is equal to 2 divided by 6, which simplifies to 1 divided by three or 0.33



The Jaccard Coefficient has you figure out  the percentage of elements that are in both sets out of the all of the possible elements. This makes sense for a lot of sense for a lot of applications, for example if you want to see how similar two photos are for example you can see how many of the pixels are the same and the more similar pixels the more similar the photos should be. Because it focuses on on the intersection compared to all elements it makes the most sense when you're considering the whole space of the elements, rather than a specific set of those elements.

The problem here is that it doesn't suit my primary goal, which is to be able say that 22% of the 2014 English Men's football team returned to play in the 2018 English Men's football team. Instead the Jaccard coefficient tells me me that 5 out of the 44 people who played football for England in 2014 and 2018 played in both.So while it's an accurate measure of similarity for a lot of purposes, it doesn't tell us as much about the overlap between the two sets we're looking at.


Sørensen–Dice coefficient 

I learned all that about the Jaccard Similarity while I started searching for the "right answer". Did I do the math wrong? No. Did I build a meaningless measure? Not that either. Thankfully after that I found the Wikipedia entry that introduced the Sørensen–Dice coefficient. For Sørensen–Dice, instead of looking at the elements in the intersection compared to all the elements, we're looking at the size of the intersection compared to the average size of the two sets.


Mathematical Notation: Sørensen–Dice coefficient is = 2 times the size of the intersection of A and B divided by sum of the sizes of A and B. This is equal to 2 times 2 divided by 4 plus 4, which resolves to 4 divided by 8, which simplifies to 1 divided by two or 0.5



Since we're now saying that 2 out of the 4 elements in either of the sets are the same, we get much closer to that feeling I was looking for. Now if we have 5 out of the 22 members on the 2014 English Men's football team repeat as 5 out of the 22 members on the 2018 English Men's football team, then the two teams are 22 % similar, as opposed to the 12% Jaccard would suggest. 

For our example we now have 2 of the 4 in each set so a 50% similarity as opposed to the 33% similarity from the Jaccard coefficient and I think this produces a more intuitive measure for the kind of application I'm intending here. 


Szymkiwicz-Simpson Coefficient

While I was at it, I also discovered the Szymkiwicz-Simpson coefficient, which is also called the Overlap coefficient. This is almost the same as the Sørensen–Dice coefficient, but instead of - effectively - looking at the average size, we're comparing the size of the intersection to the size of the smaller set. This reframes the question a little bit into how much of the smaller set appears in the larger set.


Mathematical Notation: The Symkiwicz-Simpson coefficient is equal to the size of the intersection of A and B divided by the minimum of the size of A and the size of B, this resolves to 2 divided by the minimum of 4 and 4, which resolves to 2 divided by 4 which simplifies to 1 divided by 2 or 0.5


The big difference between Szymkiwicz-Simpson and Sørensen–Dice is that for sets with a large size difference the Sørensen–Dice will be higher. So the focus (as the name Overlap implies) is really focused on the overlap vs the overall similarity. Since for most of the data I'm playing with, the two sets are pretty similar in size there's really no effective difference between the two, but I'm sure in different applications, the difference might be really helpful. (For example: Is it more remarkable that two people were both in classes of 100, or more remarkable that they were in a class of 100 and a class of 10?)


If we take the same basic set of six elements and slightly reorganize the two sets we're comparing we can see real differences in the outcomes.


A diagram showing the letters a - f with one set containing {a, b, c} and the other set containing {b, c, d, e,f}

Mathematical Notation: Set A contains elements a, b and c, the size of set A is 3, Set B contains elements b, c, d, e, f, The size of Set B is 5.

Mathematical Notation: The intersection of Sets A and B is b and c, The size of the intersection of sets A and B is 2, The Union of sets A and B contain a, b, c, d, e, f, the size of the union is 6.

Mathematical Notation: The SS coefficient = size of the intersection of A and B divided by the minimum of the size of A and the size of B, this resolves to 2 divided by the minimum of 3 and 5, which resolves to 2 divided by three or 0.67, the SD coefficient is equal to 2 times the size of the intersection of A and B divided by the sum of the size of A and and the size of B, this resolves to 2 divided by 8, which simplifies to 0.5, the J equals the size of the intersection divided by the size of the union of A and B, which resolves to 2 divided by 6, which simplifies to 1 divided by 3 or 0.33



Feelings about Set Similarity

In short this is a silly project. It is entirely based out of me having listened to too many sports-casters say dumb statistical things. (Don't get me started on the "Data Science" of restricting events to such a ridiculous extent that you can find the most shots made by a left-handed player in the odd minutes of the second half of a game.)

I'm sure in *most* applications of similarity measures, there's something which will indicate what the right measure is for the application. Jaccard and Sørensen where both working in ecology and populations (and I think Szymkiwicz was as well) and so depending on what question they were looking at, the both make sense for comparing different populations. In computer vision, Jaccard makes a lot of sense for checking if two images are the same, but Szymkiwicz might make more sense to see if a smaller picture is part of a bigger one.

For my purposes, the primary answer is feel. Does it feel like the 2014 English Mens football team is more similar to the 2018 or the 2010? So even while math might leave me to lean on Jaccard, the other two feel like a better fit.

I actually ran a twitter poll on the topic (and thank you to all 5 people who answered ... it was a weird poll).



The results of the poll showed a split and I can't say anything at all other than to suggest that depending on your background and your purpose the way these measure "feel" is different. So when you're writing a program or analyzing data, it's worth thinking about how someone looking at your output is going to feel when the look at your results. Are they going to align with how they think about the problem and how they think about the data?

For my list difference calculator, I decided to avoid making a decision and added the options to calculate any of these three different similarity measures.






Tuesday, February 01, 2022

Games of January 2022


Recently, I've been too worried about finishing games and trying to focus on what I play. I'm trying now to play as much and as broadly as I can. I've played 14 games in January and I feel really good about it. Honestly, this month has felt like one of the months I've been happiest and excited about the games I play. It's been great to pick up whatever game feels like it best fits the moment. I've found a new space in my heart for Mario Golf, space for Puyo Puyo Tetris 2 and a bunch of other games too.

I wanted to talk a bit Super Mario Party, even though it isn't in my top five below. My partner and I played a game one Saturday night and we really enjoyed ourselves. I originally picked up Super Mario Party to defend against the threat of niblings who might wonder why Uncle TJ has so many video games that they can't play, and then it's basically sat on the shelf ever since. We often end up playing single player games beside each other and I thought it might be fun to actually do something together. Super Mario Party filled that gap for us pretty well (and really it's the first video game we've played together since we gave up on Animal Crossing: Amiibo Festival). I see why people have enjoyed Mario Party All Stars more, I'm really glad we played.

I played a little bit of Legend of Mana on the Switch, which I'm still not sure how I feel about exactly. I'll play more, but I'm reminded of how disappointed I was when I first played this. It has a very "unstory" story and the combat feels very odd - although now having played the intermediary step of Trials of Mana it makes more sense. It looks beautiful, but I'm a little unsure about the decision to upres the backgrounds, but leave the character sprites original.



Another game I wanted to talk about was Actraiser: Renaissance. I didn't get in to the original Actraiser on SNES - despite loving Quintet's other games - because as far as I could see it wasn't a great city builder and I didn't like the idea of the platforming parts. Unfortunately, Renaissance seems to also not be a good city builder and I thought the platforming was awful. I picked it up because I watched ProtonJon play a chunk of it, since he thought it had too much city building in it, it might appeal. It didn't. Also, unlike Puyo Puyo Tetris 2 - which we'll discuss momentarily, this game makes incredibly loquacious NPCs an absolute frustration and nightmare. 

Or perhaps I mean, "Oh, dearest, most venerable Lord, the people of this land which you so lovingly watch over, love you and and are so pleased to venerate you at this alter which you placed in a most advantageous position in our settlement and while so venerating, beg your assistance, oh Lord, to dispatch the vile monsters which do so plague our settlement, which you so lovingly..." *cough, cough, cough* I'm sorry, I don't know what came over me there. I'm glad I picked this up in the hopes that it brings Illusion of Gaia and the Turbo trilogy to light, but I can't say it's "good".





My top five games (by play time) for January were:
  1. Hollow Knight - Dan Floyd from Playframe, loves this game and came to love this game over the course of his Let's Play. I'd been thinking I should at least try it out and honestly it's been a great part of my new play to play more games. It's been a bit of a struggle, but I went back and restarted after hitting a bit of a wall and I'm really impressed with how much better I am than when I first started. This is an outstanding Metroidvania and I am really enjoying it.


  2. Legend of Zelda: Skyward Sword HD - So, Skyward Sword was the very first game I really wrote about on the this blog. I was feeling silly having bought it again for the switch, since my Wii U is *right there*, but honestly I'm glad I picked it up again. It's a lot of fun and it's pretty and relaxing. (Hollow knight has been pretty in its own way, but in no way has it been relaxing.) I've always liked Skyward Sword - in fact it was my favourite Zelda when it came out - and honestly the HD remake has felt great. It has a few flaws, and the motion controls are not ... seamless, but it's a good time.


  3. Puyo Puyo Tetris 2 - I'm pretty good at Tetris. I'm terrible at Puyo Puyo. I'm in love with this game. The original idea seemed a little dumb to me and I'd been avoiding this and the original Puyo Puyo Tetris since I don't really engage with multiplayer games that much. That was dumb and I am so happy I bought this game. There's way more than enough single player content for me here and the game actually does a reasonable job of teaching you, if for some reason you're only good at one of Tetris or Puyo Puyo.


    But the game play isn't what I'm here to talk about. This game has a dumb story. It's a game with 3 - 5 minutes of cut scene dialog before and after each level. The characters talk *a lot*. I'm here for it. I don't know why, but between the writing team and the localization team I'm grinning from ear to ear between levels because it's all so profoundly dumb. Nothing makes sense, no one makes sense and I am damned well here for it.


  4. Mario Golf: Super Rush - I wasn't too happy with this game when I wrote up my end of the year thoughts. Pretty good, but lacking content and soul. I was wrong.


    I was wrong about the content anyway, they've introduced a lot of new courses and a few new modes and honestly there's a lot to really enjoy about this game. They also fixed their online mode so that it's not misserable to play - although the netcode is still a little bit janky. I've really enjoyed Mario Golf this month and it's been the perfect thing to pick up for 20 minutes. It's *still* a little short on soul, but honestly I don't mind any more.


  5. Eastward - I've played about four and a half hours of Easward. I *feel* like I've played 20. I don't really know how to describe Eastward. It's an action RPG set in the post apocalypse (I think). It's an homage to 90s games and a pastiche of Earthbound. It's very upbeat while leaving my heart clenched that terrible things are going to happen. It's beautiful pixel art of an ugly world. I'm excited to keep playing and I'm really glad my current outlook on games makes it easier to digest each bit as I play.


Here's my total play time chart for January:



And here's a chart of how much I've played over the month:




The Books I Read - November 2024

November was a bit weird. The Hands of the Emperor is long, but excedingly good. I'm continuing to find Anna Lee Huber a very engagin...