MatchOutcomeAI
A data-driven approach to predicting football match outcomes using machine learning. This project combines several algorithms to forecast game results, providing insights for sports bettors, team performance analysts, and football enthusiasts.
Getting Started
Prerequisites: Python 3.x
Usage
Your own API key from API-Football must be added to api_data_scraper.py.

git clone https://github.com/ratloop/MatchOutcomeAI
cd MatchOutcomeAI
pip install -r requirements.txt
python main.py
Introduction
Problem Statement
Predicting football match outcomes has become increasingly popular due to the rising interest in sports betting and the desire to improve team performance. Accurate predictions can benefit various stakeholders, including fans, coaches, analysts, and bookmakers. However, predicting match results remains a complex task due to numerous factors that influence a game's outcome, such as team form, player performance, and historical data. This project aims to develop a machine learning model capable of providing accurate and logical football match outcome predictions, comparable to those of popular bookmakers.
Objectives
Starting with the primary goal - developing a predictive model capable of generating probabilities that align with the offerings of established bookmakers - the model's performance was both robust and reliable. By leveraging the Gradient Boosting algorithm, it offered predictions that were not only on par with commercial bookmakers but also grounded in statistical accuracy. This accomplishment underscores the effectiveness of the chosen approach and validates the model's utility.
Secondly, the model achieved a test accuracy of over 50%, a clear indicator of its success. This figure was more than a mere benchmark; it represented a significant achievement in the challenging field of sports prediction, where the complexity and unpredictability of football matches make precise forecasting a daunting task. Surpassing this threshold supports the model's credibility and its practical value for users looking to better understand football match outcomes.

In addition to the model's predictive power, the application's interface was also a success. The command-line interface gave users an intuitive and straightforward way to make predictions and review past predictions, ensuring the application was not only accurate but also user-friendly. Clear, organised menu options further enhanced the user experience, allowing seamless navigation through the application's various features.
Lastly, the model was designed with an eye towards continuous improvement. By incorporating new data and refining existing features, the system ensures that its predictive power does not stagnate but improves over time. This approach to iterative improvement is essential for staying relevant and accurate in the fast-paced and ever-changing world of football.
System Design
Data Scraper
For this predictive model, data was gathered using a purpose-built data scraper. The scraper was designed to fetch data from a specified API (Application Programming Interface), which provides a comprehensive collection of football match data for the Premier League.

The data scraper works by making GET requests to the API's endpoints for each football season. These requests retrieve data in JSON format which includes various details about each match, such as the teams involved, the final score, the date of the match, and other match-specific statistics.

The decision to use this particular API was driven by its extensive dataset that offers a wide range of features beneficial for the predictive model. The API is reliable, consistently updated, and provides granular data which is critical for the analysis.
Available statistics:
- Shots on Goal
- Shots off Goal
- Shots inside box
- Shots outside box
- Total Shots
- Blocked Shots
- Fouls
- Corner Kicks
- Offsides
- Ball Possession
- Yellow Cards
- Red Cards
- Goalkeeper Saves
- Total passes
- Passes accurate
- Passes %
The data scraping process was facilitated using Python, a language renowned for its robust data manipulation capabilities and extensive libraries. Libraries such as 'requests' and 'json' were employed for HTTP requests and JSON data manipulation respectively. The decision to utilize Python was driven by its ability to streamline the data collection process, allowing us to amass a substantial dataset in a time-efficient manner.
Upon retrieval, the data is stored in its raw JSON format for each football season, earmarking it for further processing. This method of storage retains an unaltered copy of the original data from the API, enabling us to revisit or troubleshoot the data if necessary. Moreover, it ensures that the data scraping stage needs to be executed only once, thereby conserving resources for subsequent phases of the project.
The API used was API-Football.
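The scraping loop described above can be sketched as follows. The base URL, the fixtures endpoint, the header name, and the league ID are assumptions based on API-Football's public v3 interface, not values taken from this project's source.

```python
import json
import requests  # HTTP client used for the GET requests

API_KEY = "YOUR_API_KEY"  # set your own API-Football key here
BASE_URL = "https://v3.football.api-sports.io"  # assumed v3 endpoint
PREMIER_LEAGUE_ID = 39  # assumed league identifier for the Premier League

def fetch_season(season: int) -> dict:
    """Fetch all Premier League fixtures for one season and return the raw JSON."""
    response = requests.get(
        f"{BASE_URL}/fixtures",
        headers={"x-apisports-key": API_KEY},
        params={"league": PREMIER_LEAGUE_ID, "season": season},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

def save_raw(season: int, payload: dict) -> None:
    """Persist the unaltered API response so scraping only needs to run once."""
    with open(f"raw_{season}.json", "w") as f:
        json.dump(payload, f)
```

Storing the untouched JSON per season mirrors the approach above: the raw copy can always be revisited without spending further API calls.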
Data Engineering
Data Combination: Since I have data for each Premier League season in separate CSV files, I combine all these datasets into one single CSV file. This allows us to build a model that considers historical data from multiple seasons, thereby improving its predictive capability.
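A minimal sketch of the combination step, assuming the per-season files follow a `season_*.csv` naming pattern (the actual file names in the project may differ):

```python
import glob
import pandas as pd

def combine_seasons(pattern: str = "season_*.csv") -> pd.DataFrame:
    """Concatenate the per-season CSV files into one historical dataset."""
    files = sorted(glob.glob(pattern))  # chronological if names sort by season
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True)
```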
Feature Creation: The API provides us with a wide range of features, such as goals scored, shots on target, and possession percentage. I utilize these features and create some of my own, such as form from the past 5 games. The addition of these engineered features aims to capture more complex patterns and relationships that can potentially improve the accuracy of the predictive model.
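The five-game form feature can be engineered with a rolling window, for example as below. The column names ('date', 'team', 'points') are illustrative assumptions, not the project's actual schema:

```python
import pandas as pd

def add_form(df: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Add a 'form' column: points a team earned over its previous `window` matches."""
    df = df.sort_values("date").copy()
    df["form"] = (
        df.groupby("team")["points"]
          # shift(1) excludes the current match, so form only uses past games
          .transform(lambda s: s.shift(1).rolling(window, min_periods=1).sum())
          .fillna(0)  # a team's first match has no history
    )
    return df
```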
Data Cleaning: The final step in the data engineering process is cleaning the data. This involves dealing with missing or inconsistent data and ensuring that the dataset is reliable and accurate. It's an important step as the quality of the data used to train the model significantly influences the model's performance.
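A sketch of typical cleaning operations for this kind of data, again with assumed column names ('possession', 'result'): percentage strings from the API become numeric, missing numeric statistics are imputed, and rows missing essential fields are dropped.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning pass over the combined match dataset."""
    df = df.copy()
    # API-style "55%" strings become numeric so models can use them
    df["possession"] = df["possession"].str.rstrip("%").astype(float)
    # fill missing numeric statistics with the column median
    numeric = df.select_dtypes("number").columns
    df[numeric] = df[numeric].fillna(df[numeric].median())
    # drop any rows still missing the match result
    return df.dropna(subset=["result"])
```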
The data engineering process was primarily implemented using Python due to its powerful data manipulation libraries like pandas and NumPy. This choice of language ensured a smooth and efficient data processing phase, providing us with a clean, organized dataset ready for visualisation, analysis, and model training.
Data Visualisation
Having a clear, visual understanding of the data is a fundamental part of this project. This step aids in interpreting the dataset, revealing potential patterns, trends, and correlations among different variables. It provides valuable insights before I dive into building the machine learning models.
Several data visualisation strategies were employed to explore the relationships within the features:
Correlation Matrix: I created a correlation matrix to identify the interdependence between different features in my dataset. This step is crucial as it helps determine which variables exhibit strong positive or negative correlations. This information is valuable during the phase of feature selection for model training.

This heatmap presents a visual depiction of the correlations among a range of home-team statistics such as shots on target, total shots, fouls, corners, offsides, ball possession, yellow cards, red cards, goalkeeper saves, attempted passes, and successful passes. The strength and direction of each correlation are indicated by the depth of the colour and the size of the squares. This illustration provides valuable insight into the interplay and potential influence of these factors within a game.
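The correlation matrix behind the heatmap reduces to a single pandas call; a minimal sketch with illustrative column names follows. The commented seaborn line shows one way such a matrix is typically rendered as a heatmap.

```python
import pandas as pd

def correlation_matrix(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Pairwise Pearson correlations between the chosen match statistics."""
    corr = df[columns].corr()
    # A heatmap can then be drawn with, e.g.:
    #   sns.heatmap(corr, annot=True, cmap="coolwarm")
    return corr
```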
Scatter Plots: I used scatter plots to illustrate the relationships between different pairs of variables. These plots can emphasize correlations, reveal trends, and help spot any outliers or anomalies in the data.

This scatterplot matrix visualizes the pairwise relationships of shots on target, total shots, fouls, corners, offsides, possession, yellow cards, red cards, goalkeeper saves, attempted passes, and successful passes for the home team. Each plot in the matrix shows the relationship between a pair of these attributes, which allows for a detailed exploration of potential correlations or patterns. This comprehensive view provides critical insights for understanding the multifaceted dynamics of the game.
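A scatterplot matrix like the one described can be generated in a few lines; this sketch uses pandas' built-in scatter_matrix as a stand-in for the seaborn pairplot the project may have used, with the headless Agg backend so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render straight to a file
import pandas as pd
from pandas.plotting import scatter_matrix

def save_scatter_matrix(df: pd.DataFrame, columns: list[str], path: str) -> None:
    """Render pairwise scatter plots of the chosen statistics and save them."""
    axes = scatter_matrix(df[columns], figsize=(8, 8), diagonal="hist")
    axes[0][0].get_figure().savefig(path)
```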
I leveraged libraries such as matplotlib and seaborn for data visualisation, which offer an extensive set of tools for creating informative and aesthetically pleasing statistical graphics. With these visual insights, I was better equipped to interpret the data and make more informed decisions during the model training phase.
Model Training
Logistic Regression: I chose to start with Logistic Regression due to its simplicity and power. Known as a linear classifier, Logistic Regression calculates the probability of a binary outcome based on input features. It then classifies the data accordingly. Despite its simplicity, it can model complex relationships when the input features are nonlinearly transformed or when interaction terms are included. The model was trained using maximum likelihood estimation to find the best fitting model to the data.
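The training step can be illustrated with scikit-learn on synthetic data standing in for the engineered match features; the feature meanings in the comments are assumptions, not the project's actual inputs. Scikit-learn's LogisticRegression fits by maximising the likelihood, as described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real feature matrix: two features per match
# (e.g. home-team form, shot difference); label 1 = home win, 0 = otherwise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic, linearly separable rule

model = LogisticRegression(max_iter=1000)  # fitted via maximum likelihood
model.fit(X, y)
train_accuracy = model.score(X, y)
probabilities = model.predict_proba(X[:5])  # per-class probabilities
```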
Support Vector Machine (SVM): Next, I employed the Support Vector Machine model. SVM is a versatile machine learning model typically used for classification and regression analysis. Its strength lies in its ability to handle high dimensional data and create complex decision boundaries, even in cases where the number of dimensions exceeds the number of samples. It works by constructing a hyperplane in a high-dimensional space that distinctly classifies the data points.
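A comparable SVM sketch, again on synthetic data: an RBF kernel lets the classifier learn the kind of complex, non-linear decision boundary described above. The data and feature count here are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-feature dataset with a non-linear (circular) class boundary
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

clf = SVC(kernel="rbf")  # RBF kernel builds the separating hyperplane
clf.fit(X, y)            # in an implicit high-dimensional feature space
train_accuracy = clf.score(X, y)
```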
K-Nearest Neighbours (KNN):
