Okay, let’s talk about this Mackenzie McDonald prediction thing I was messing around with today. It was a bit of a rollercoaster, not gonna lie.

First, I just grabbed some data. I mean, scraped it, really. Match stats, player rankings, you name it. Cleaned that mess up – you wouldn’t believe the garbage you find sometimes. Then I started thinking, what’s even important here? Is it just win-loss record? Head-to-head? I was all over the place.
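Just so that isn't totally hand-wavy, the cleanup pass looked something like this. The file name and column names are made up for the example; my real scrape was uglier:

```python
import pandas as pd

# Load the scraped match data (hypothetical file and column names)
matches = pd.read_csv("mcdonald_matches.csv")

# Duplicate rows sneak in when you scrape from more than one place
matches = matches.drop_duplicates()

# Rankings sometimes come through as "N/A" or "-", so coerce them to numbers
matches["player_rank"] = pd.to_numeric(matches["player_rank"], errors="coerce")
matches["opponent_rank"] = pd.to_numeric(matches["opponent_rank"], errors="coerce")

# Can't do much without the result and both rankings
matches = matches.dropna(subset=["won", "player_rank", "opponent_rank"])
```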
Then I thought, “Hey, why not try a simple logistic regression?” Seemed like a decent starting point. I mean, I ain’t no data scientist, just a guy who likes to tinker. So, I threw all that cleaned data into a pandas DataFrame, did some feature engineering – you know, calculated some ratios, differences in rankings, that sort of jazz. Felt pretty good about it.
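The feature engineering was roughly this kind of thing, continuing from the cleaned-up DataFrame above. The serve and break-point column names are placeholders, not my exact ones:

```python
# Ranking gap and ratio between McDonald and the opponent
matches["rank_diff"] = matches["opponent_rank"] - matches["player_rank"]
matches["rank_ratio"] = matches["player_rank"] / matches["opponent_rank"]

# A couple of serve-related ratios from the raw match stats
matches["first_serve_ratio"] = matches["first_serve_won"] / matches["first_serve_total"]
matches["bp_save_ratio"] = matches["bp_saved"] / matches["bp_faced"].replace(0, 1)
```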
Split the data into training and testing sets, like you’re supposed to. Fit the logistic regression model on the training data. Easy peasy, right? Wrong. The predictions were… well, let’s just say they weren’t great. My accuracy was hovering around 60%, which is basically a coin flip with extra steps (and probably no better than just picking whoever’s ranked higher).
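For the curious, the pipeline was about as vanilla as scikit-learn gets. This sketch continues from the features above; the ~60% is just what I saw on my split:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# "won" is 1 if McDonald won the match; feature names are the placeholders from above
feature_cols = ["rank_diff", "rank_ratio", "first_serve_ratio", "bp_save_ratio"]
X = matches[feature_cols].fillna(0)
y = matches["won"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```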
- Frustration Level 1: Accuracy sucks.
- Possible Solution: More data? Different model?
So, I tried more data. Went back, scraped some more from different sources. Added some more features – how often a player wins on a specific surface, their recent form (wins in the last X matches), stuff like that. Re-ran the model. Still not much better. Maybe 62% accuracy. Ugh.
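The surface and form features were computed along these lines. The important bit is the shift(1), so a match’s own result never leaks into its own feature (this assumes a date column and the placeholder names from earlier):

```python
# Sort chronologically so "past" actually means past
matches = matches.sort_values("date")

# Win rate on the current match's surface, using only earlier matches
matches["surface_win_rate"] = (
    matches.groupby("surface")["won"]
    .transform(lambda s: s.shift(1).expanding().mean())
)

# Recent form: win rate over the previous 10 matches
matches["recent_form"] = matches["won"].shift(1).rolling(10, min_periods=1).mean()

# Add the new columns, then rebuild X, re-split, and re-fit the same way as before
feature_cols += ["surface_win_rate", "recent_form"]
```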
Then I was like, “Okay, maybe logistic regression is just too simple.” I started playing around with other models. Tried a Support Vector Machine (SVM). That took forever to train, and the results were even worse. Like, significantly worse. Threw that in the trash pretty quick.
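If you want to try it anyway, this is roughly what I ran, wrapped in a scaler because SVMs fall apart on unscaled features (which, in hindsight, may be part of why mine did so badly):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize first; RBF SVMs are slow and flaky on raw, unscaled features
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))
```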

Next, I gave a random forest a shot. That seemed promising at first. Got the accuracy up to maybe 68%. Still not amazing, but definitely an improvement. I spent a bunch of time tuning the hyperparameters – the number of trees, the maximum depth, all that stuff. It helped a little, but nothing groundbreaking.
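The tuning was a plain grid search over the knobs I mentioned, nothing clever. Something like this, with the grid values as illustrative placeholders rather than exactly what I swept:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small grid over the number of trees and the maximum depth
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```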
- Frustration Level 2: Models aren’t performing as expected.
- Possible Solution: Feature selection, different model architecture.
I even thought about trying some neural networks, but honestly, I just didn’t have the time or energy to deal with all that. I mean, I’m not trying to build Skynet here, just predict a tennis match.
In the end, I went back to the random forest and focused on feature selection. Used some techniques to figure out which features were actually contributing to the predictions and which were just noise. Turns out, some of the features I thought were important were actually hurting the model.
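In my case “some techniques” mostly meant leaning on the forest’s own importance scores and cutting the weak features. A sketch of that, continuing from the grid search above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rf = search.best_estimator_

# Rank features by the forest's impurity-based importances
importances = pd.Series(rf.feature_importances_, index=feature_cols).sort_values()
print(importances)

# Keep only features above the median importance, then refit
selector = SelectFromModel(rf, threshold="median", prefit=True)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

rf_small = RandomForestClassifier(random_state=42, **search.best_params_)
rf_small.fit(X_train_sel, y_train)
print("pruned accuracy:", rf_small.score(X_test_sel, y_test))
```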
After a bunch of tweaking and pruning, I finally got the accuracy up to around 72%. Still not perfect, but good enough for a day’s work. I mean, I wouldn’t bet my life savings on it, but it’s a decent starting point.
The key takeaways? Data cleaning is crucial. Feature engineering is important, but less is often more. And don’t be afraid to try different models, but sometimes going back to basics and refining your features is the best approach.

Maybe tomorrow I’ll try that neural network thing. Or maybe I’ll just watch some tennis.