ScoreCast: A Tool for Predicting Football Game Outcomes in Minor Leagues


This web application provides predictions for informational purposes only and should not be considered financial or betting advice. The accuracy of the predictions is not guaranteed, and we are not liable for any losses incurred from using the tool. It is intended to assist developers in the football industry.

ScoreCast is an open-source web application developed for predicting football game outcomes in six minor football leagues: Serie A Brazil, Serie B Brazil, Primera Division Argentina, J1 League Japan, Eliteserien Norway, and Veikkausliiga Finland. The tool assists football enthusiasts and bettors in making informed decisions and also serves as a gateway for exploring football analytics. Drawing inspiration from a Dataquest video, ScoreCast sets itself apart by consolidating six leagues into a single platform and providing instant predictions. With simplicity and user-friendliness as its pillars, ScoreCast aims to be a useful companion for those venturing into minor league football betting.

The Goal Behind ScoreCast

Combining my interest in sports analytics and software development, I set out to create ScoreCast. I scraped data from six minor football leagues and developed an open-source web application that predicts the outcomes of their games. Since there aren't many predictors available for smaller leagues, ScoreCast offers insights that help users make informed betting choices. ScoreCast covers the following six football leagues:

Campeonato Brasileiro Série A (Brazil)

Campeonato Brasileiro Série B (Brazil)

Primera División de Argentina (Argentina)

J1 League (Japan)

Eliteserien (Norway)

Veikkausliiga (Finland)

Web Scraping with Beautiful Soup

To gather the data for ScoreCast, I used the web scraping tool Beautiful Soup to extract information from the FBref website. Using classic modules, I retrieved comprehensive data from six minor football leagues: Serie A Brazil, Serie B Brazil, Primera Division Argentina, J1 League Japan, Eliteserien Norway, and Veikkausliiga Finland. The result was a collection of six CSV files, each containing over 3000 rows of match details such as date, time, competition, round, venue, result, goals for (gf), goals against (ga), and much more. Below, I present the CSV data from Serie A Brazil organized into a structured dataframe.
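The scraping step can be sketched roughly as follows. This is a minimal illustration and not the project's actual scraper: the HTML snippet is hand-written sample data standing in for a downloaded page, and the names `SAMPLE_HTML`, `parse_match_table`, and `rows_to_csv` are hypothetical. It only shows how Beautiful Soup can turn an HTML table into rows that are then written out as CSV.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hand-written sample HTML standing in for a downloaded match table
# (an assumption for illustration; the real page layout may differ).
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>date</th><th>venue</th><th>result</th><th>gf</th><th>ga</th><th>opponent</th></tr>
  </thead>
  <tbody>
    <tr><td>2023-04-15</td><td>Home</td><td>W</td><td>2</td><td>1</td><td>Team B</td></tr>
    <tr><td>2023-04-22</td><td>Away</td><td>D</td><td>0</td><td>0</td><td>Team C</td></tr>
  </tbody>
</table>
"""


def parse_match_table(html):
    """Parse the first HTML table into a list of {column: value} dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append(dict(zip(headers, cells)))
    return rows


def rows_to_csv(rows):
    """Serialize the parsed rows to CSV text (in memory, for the example)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


matches = parse_match_table(SAMPLE_HTML)
csv_text = rows_to_csv(matches)
print(csv_text)
```

In a real run, the HTML would come from an HTTP request to the source site and the CSV text would be written to one file per league.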

Exploring and Cleaning the Data

During the exploratory phase of the project, I encountered missing values across multiple columns. For instance, in the Serie A Brazil CSV, the columns 'date', 'time', 'comp', 'round', 'day', 'venue', 'result', 'gf', 'ga', and 'opponent' had 158 missing values each. I handled these missing values with the cleaning and imputation steps described below.

In the data cleaning process, I applied a series of steps to refine the dataset and prepare it for analysis in ScoreCast. First, I converted the 'date' column into a datetime format and ensured the 'time' column was in string format. To handle missing values, I implemented several strategies. For instance, when encountering missing values in the 'venue' and 'opponent' columns, I filled them with the most frequent values found in the dataset. Similarly, I addressed missing values in the 'formation' column, either dropping it entirely if all values were missing or filling it with the most common formation.

Moving forward, I handled missing values in the 'result' column, filling them with the most frequent outcome recorded. To handle the 'poss' (possession) column, I either dropped it if all values were missing or filled the missing values with the mean possession value. Additionally, for the 'gf' (goals for) and 'ga' (goals against) columns, I converted the values to numeric data and filled any missing values with the respective mean goal counts.

Furthermore, I handled missing values in the 'referee' column, filling them with the most frequent referee's name from the dataset. For columns related to shots, goals, and penalties, including 'gls', 'sh', 'sot', 'sot%', 'g/sh', 'g/sot', 'pk', and 'pkatt', I filled the missing values with their respective means.

Converting the 'gf' and 'ga' columns to numeric data also allows for more efficient computations. After these steps, the dataset was ready for further analysis and modeling, laying the groundwork for ScoreCast's predictions.
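The cleaning rules above can be sketched with pandas as follows. This is a simplified illustration on a tiny, made-up dataframe, not the project's actual cleaning script; the function name `clean_matches` and the sample values are assumptions, and only a subset of the columns is shown.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with one fully empty row, mimicking the missing values
# described above (values are illustrative only).
raw = pd.DataFrame({
    "date": ["2023-04-15", "2023-04-22", None, "2023-05-06"],
    "time": ["16:00", "18:30", None, "21:00"],
    "venue": ["Home", "Away", None, "Home"],
    "result": ["W", "D", None, "W"],
    "gf": ["2", "0", None, "3"],
    "ga": ["1", "0", None, "1"],
    "poss": [55.0, 48.0, np.nan, 62.0],
    "formation": [None, None, None, None],
})


def clean_matches(df):
    """Apply the cleaning rules described in the text to a copy of df."""
    df = df.copy()

    # Dates become datetimes; times are kept as strings.
    df["date"] = pd.to_datetime(df["date"])
    df["time"] = df["time"].astype("string")

    # Categorical columns: fill gaps with the most frequent value.
    for col in ["venue", "result"]:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Columns that may be entirely empty: drop them, otherwise impute.
    if df["formation"].isna().all():
        df = df.drop(columns="formation")
    else:
        df["formation"] = df["formation"].fillna(df["formation"].mode().iloc[0])

    if df["poss"].isna().all():
        df = df.drop(columns="poss")
    else:
        df["poss"] = df["poss"].fillna(df["poss"].mean())

    # Goal columns: convert to numbers, then fill gaps with the column mean.
    for col in ["gf", "ga"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
        df[col] = df[col].fillna(df[col].mean())

    return df


cleaned = clean_matches(raw)
print(cleaned)
```

The same mode and mean patterns extend to the other columns mentioned above, such as 'opponent', 'referee', and the shot and penalty statistics.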

Modeling and Extracting Predictions

Using the popular RandomForestClassifier algorithm, we aimed to build a robust model that forecasts football match outcomes of the six minor football leagues.

With the help of the Python library pandas, we loaded the cleaned CSV files, and for each match, we engineered new features such as 'venue_code', 'opp_code', 'hour', and 'day_code'. These features were essential for training our model, as they encode the opponent, the match venue, and the time and day of the match.
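One plausible way to derive these four features with pandas is sketched below. The exact encoding used in ScoreCast is not shown in this post, so the category codes and the `add_features` helper are assumptions for illustration, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows (illustrative only).
matches = pd.DataFrame({
    "date": pd.to_datetime(["2023-07-15", "2023-07-16", "2023-07-19"]),
    "time": ["16:00", "18:30", "21:00"],
    "venue": ["Home", "Away", "Home"],
    "opponent": ["Team B", "Team C", "Team B"],
})


def add_features(df):
    """Add numeric features that a tree-based model can consume."""
    df = df.copy()
    # Category codes map each distinct label to an integer (sorted by label).
    df["venue_code"] = df["venue"].astype("category").cat.codes
    df["opp_code"] = df["opponent"].astype("category").cat.codes
    # Kick-off hour taken from the "HH:MM" time string.
    df["hour"] = df["time"].str.split(":").str[0].astype(int)
    # Day of the week as an integer, Monday = 0 ... Sunday = 6.
    df["day_code"] = df["date"].dt.dayofweek
    return df


features = add_features(matches)
print(features[["venue_code", "opp_code", "hour", "day_code"]])
```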

One of the challenges we faced was handling the missing data present in the CSV files. For example, in the Serie A Brazil dataset, there were missing values for 'date', 'time', 'comp', 'round', 'day', 'venue', 'result', 'gf', 'ga', and 'opponent'. To address this, we combined several techniques: filling missing values with the most common ones, computing rolling averages, and replacing NaN values with the rolling average. We also mapped certain values using a custom mapping, which allowed us to deal with missing data effectively.
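The rolling-average idea can be sketched like this. It is an illustration under stated assumptions rather than the project's code: the window of three previous matches, the 'team' column, and the helper name `fill_with_rolling_mean` are all hypothetical, and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample: two teams, with one missing 'gf' value each (illustrative only).
matches = pd.DataFrame({
    "team": ["A", "A", "A", "A", "A", "B", "B"],
    "date": pd.to_datetime([
        "2023-04-01", "2023-04-08", "2023-04-15", "2023-04-22", "2023-04-29",
        "2023-04-02", "2023-04-09",
    ]),
    "gf": [1.0, 2.0, 3.0, np.nan, 4.0, 0.0, np.nan],
})


def fill_with_rolling_mean(df, col, window=3):
    """Fill NaNs in `col` with the mean of the team's previous `window` matches."""
    df = df.sort_values(["team", "date"]).copy()
    # closed="left" excludes the current match, so only earlier games are used.
    rolling = df.groupby("team")[col].transform(
        lambda s: s.rolling(window, closed="left", min_periods=1).mean()
    )
    df[f"{col}_rolling"] = rolling
    df[col] = df[col].fillna(rolling)
    return df


filled = fill_with_rolling_mean(matches, "gf")
print(filled)
```

Excluding the current match from the window matters for prediction: a feature built from the match being predicted would leak the outcome into the model.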

Below, we showcase the processed dataframe resulting from the above process. Please note that certain columns, such as “npxg/sh,” are dropped during this process, as they do not contribute to the training process.

To make our predictions, we divided the dataset into training and testing sets. The training data spanned up to July 19, 2023, while the testing data included matches from July 23, 2023, and beyond. Using the RandomForestClassifier algorithm from the scikit-learn library, we trained our model on the training data and then made predictions on the testing data.

We evaluated the model's performance using accuracy as the metric. After fine-tuning and data preprocessing, our model achieved an accuracy of 73%.
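The date-based split and the training step can be sketched as follows. The data here is randomly generated purely so that the example runs, which means the accuracy it prints is meaningless and unrelated to the 73% reported above. The binary target, the predictor list, and the hyperparameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the cleaned match data (NOT real results).
rng = np.random.default_rng(0)
n_matches = 200
dates = pd.date_range("2023-04-01", periods=n_matches, freq="D")
matches = pd.DataFrame({
    "date": dates,
    "venue_code": rng.integers(0, 2, n_matches),
    "opp_code": rng.integers(0, 20, n_matches),
    "hour": rng.integers(12, 22, n_matches),
    "day_code": dates.dayofweek,
})
# Hypothetical binary target: 1 = win, 0 = draw or loss.
matches["target"] = rng.integers(0, 2, n_matches)

predictors = ["venue_code", "opp_code", "hour", "day_code"]

# Split by date so the model is only evaluated on matches it has not seen.
train = matches[matches["date"] <= pd.Timestamp("2023-07-19")]
test = matches[matches["date"] >= pd.Timestamp("2023-07-23")]

# Hyperparameters are illustrative assumptions, not the tuned values.
model = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)
model.fit(train[predictors], train["target"])

predictions = model.predict(test[predictors])
accuracy = accuracy_score(test["target"], predictions)
print(f"accuracy on the synthetic test set: {accuracy:.2f}")
```

Splitting on the date rather than at random keeps future matches out of the training data, which matches how the predictions are used in practice.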

Upon completing the modeling process, we generated predictions for each match in the six minor football leagues. The predictions were organized into a CSV file, with the columns 'Date', 'Team A', 'Team B', 'Prediction for Team A', and 'Prediction for Team B'. This concise format allowed us to present the match outcomes and the corresponding winning probabilities for each team.

Here are the results of the modeling and training presented in dataframe format:

However, it is essential to note that our model may encounter cases where both teams are predicted to win (W/W) or lose (L/L), or when the outcome is a draw for both teams (D/D). We advise users to exercise caution and consider seeking professional advice when encountering such predictions in betting.
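Such cases can be flagged automatically before the predictions are shown. The helper below is a hypothetical addition and is not part of the published app; it assumes the prediction columns are named as described above and uses made-up rows.

```python
PREDICTION_COLUMNS = ("Prediction for Team A", "Prediction for Team B")


def flag_same_outcome(rows):
    """Return the rows where both teams received the same predicted outcome."""
    col_a, col_b = PREDICTION_COLUMNS
    return [row for row in rows if row[col_a] == row[col_b]]


# Made-up prediction rows (illustrative only).
sample_rows = [
    {"Date": "2023-08-05", "Team A": "Team 1", "Team B": "Team 2",
     "Prediction for Team A": "W", "Prediction for Team B": "L"},
    {"Date": "2023-08-05", "Team A": "Team 3", "Team B": "Team 4",
     "Prediction for Team A": "W", "Prediction for Team B": "W"},
    {"Date": "2023-08-06", "Team A": "Team 5", "Team B": "Team 6",
     "Prediction for Team A": "L", "Prediction for Team B": "L"},
    {"Date": "2023-08-06", "Team A": "Team 7", "Team B": "Team 8",
     "Prediction for Team A": "D", "Prediction for Team B": "D"},
]

flagged = flag_same_outcome(sample_rows)
for row in flagged:
    print(row["Date"], row["Team A"], "vs", row["Team B"],
          row[PREDICTION_COLUMNS[0]] + "/" + row[PREDICTION_COLUMNS[1]])
```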

Developing the Web Application

For the development of the ScoreCast web application, I used the Flask framework and deployed it on the Heroku platform. The app allows users to access predictions for football games from the six minor leagues mentioned above: Serie A Brazil, Serie B Brazil, Primera Division Argentina, J1 League Japan, Eliteserien Norway, and Veikkausliiga Finland.

I set up the Flask app and created different routes to handle requests for each league's predictions. The app follows a simple structure with an HTML template for the home page ('index.html') and separate templates for each league's predictions: 'br_a.html' for Serie A Brazil, 'br_b.html' for Serie B Brazil, 'arg.html' for Primera Division Argentina, 'jpn.html' for J1 League Japan, 'norw.html' for Eliteserien Norway, and 'fin.html' for Veikkausliiga Finland.

For each league's route, I read the corresponding CSV file into a pandas DataFrame and performed some preprocessing. One of the key features of the web app is the flexibility it offers to developers: they can define the time frame for which predictions are extracted. By default, the app extracts predictions for matches from '2023-07-30' to '2023-08-31'. Developers can set any other time range directly in the code.
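A single league route with the date filter might look roughly like this. It is a self-contained sketch, not the deployed code: the route path `/br_a`, the helper `load_predictions`, and the inline CSV and template are assumptions, and an inline template string stands in for the 'br_a.html' file so that the example runs on its own.

```python
import io

import pandas as pd
from flask import Flask, render_template_string

app = Flask(__name__)

# Made-up predictions standing in for a league's CSV file (illustrative only).
SAMPLE_CSV = """Date,Team A,Team B,Prediction for Team A,Prediction for Team B
2023-07-25,Team 1,Team 2,W,L
2023-08-05,Team 3,Team 4,D,D
2023-09-02,Team 5,Team 6,L,W
"""

# Default time frame; change these two values to extract a different range.
DEFAULT_START = "2023-07-30"
DEFAULT_END = "2023-08-31"

# Inline stand-in for a template file such as 'br_a.html'.
PAGE = """
<table>
{% for row in rows %}
  <tr>
    <td>{{ row['Date'] }}</td>
    <td>{{ row['Team A'] }}</td>
    <td>{{ row['Team B'] }}</td>
    <td>{{ row['Prediction for Team A'] }}</td>
    <td>{{ row['Prediction for Team B'] }}</td>
  </tr>
{% endfor %}
</table>
"""


def load_predictions(csv_source, start=DEFAULT_START, end=DEFAULT_END):
    """Read a predictions CSV and keep only matches within [start, end]."""
    df = pd.read_csv(csv_source, parse_dates=["Date"])
    in_range = df["Date"].between(pd.Timestamp(start), pd.Timestamp(end))
    df = df.loc[in_range].sort_values("Date").copy()
    # Render dates as plain strings in the page.
    df["Date"] = df["Date"].dt.strftime("%Y-%m-%d")
    return df


@app.route("/br_a")
def br_a():
    predictions = load_predictions(io.StringIO(SAMPLE_CSV))
    return render_template_string(PAGE, rows=predictions.to_dict("records"))
```

With real data, `load_predictions` would receive the path of the league's CSV file, and the route would call `render_template` with the league's own template.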

The web app has been deployed on Heroku and is accessible through the following domain: https://score-cast-3a6cb8fe5c50.herokuapp.com/. With this user-friendly web app, we can conveniently access and explore predictions for those six minor leagues.

Below we present the final web application deployed on Heroku.

Conclusions

In conclusion, ScoreCast is a web application designed to provide football match predictions based on historical data from six minor football leagues: Serie A Brazil, Serie B Brazil, Primera Division Argentina, J1 League Japan, Eliteserien Norway, and Veikkausliiga Finland. ScoreCast gathers its data through web scraping and generates predictions with a Random Forest machine learning model. However, it is crucial to remember that these predictions are for informational purposes only and should not be used as financial or betting advice.

Heroku: https://score-cast-3a6cb8fe5c50.herokuapp.com/

GitHub: https://github.com/Costasgk/ScoreCast

Future Work

Several developments are planned for ScoreCast. We will focus on improving the prediction model's accuracy through advanced machine learning techniques and algorithm fine-tuning. Additionally, we aim to expand our data sources, optimize the data processing pipelines, and explore other prediction models. The user interface will also be refined to offer a more intuitive experience. These initiatives are intended to keep the tool accurate and reliable for users in the football industry.

References

ScoreCast GitHub Repository: https://github.com/Costasgk/ScoreCast
The ScoreCast GitHub repository contains the source code and files for the football prediction web application. Users and developers can access and explore the codebase to understand the implementation details and contribute to the project.

FBref: Football Data and Statistics: https://fbref.com/en/comps
FBref is a reliable source for comprehensive football data and statistics, providing valuable information on various leagues and competitions. It served as a key data source for the ScoreCast application, enabling the extraction of essential match details and performance metrics.

YouTube Video by Dataquest: https://www.youtube.com/watch?v=Nt7WJa2iu0s&ab_channel=Dataquest
This video provided insightful guidance and inspiration during the development process of the ScoreCast application.

About Author

Costasgk
