Abstract

At the time of writing this abstract, the master's dissertation has not yet been completed. The expected completion date is May-June 2025. The full text of the work, as well as materials on the topic, can be obtained from the author or their supervisor after that date.

Introduction
1. Problem Statement
2. Review of Studies and Developments on Similar or Related Topics
2.1 "Passenger Flow Forecasting and Early Warning System for Regional Buses Based on Machine Learning"
2.2 "Comparative Testing of ARIMA and LSTM Models for Passenger Flow Forecasting Tasks"
2.3 "The Impact of Weather Conditions on Bus Operation on Urban Routes"
2.4 "Current Methods for Passenger Flow Forecasting"
2.5 "Experimental Study of Passenger Route Choice Probability"
3. Proposed Methods for Solving the Problem
References

Introduction

Forecasting passenger flow in public transportation is one of the key tasks for effective management and planning of transport systems, and it matters from both a scientific and a practical perspective. From a scientific standpoint, it is a complex task that requires advanced data analysis methods and machine learning algorithms to process large volumes of information and uncover hidden patterns. From a practical perspective, accurate passenger flow forecasts make it possible to optimize public transport schedules, reduce operational costs, enhance passenger comfort, and ease the load on the city's transport infrastructure.

Currently, there are many studies dedicated to forecasting passenger flow using various approaches, including statistical methods, regression models, time series methods, and machine learning algorithms. However, most existing solutions focus on short-term forecasting (for a day or a few days ahead) and rarely consider data over extended periods, limiting their applicability for long-term planning. Furthermore, not all approaches sufficiently account for external factors, such as weather conditions or urban events, which can also reduce the accuracy of forecasts.

Despite significant advances in passenger flow forecasting, several key issues remain unresolved:

Model adaptation to long-term data - "How can we effectively use passenger flow data from a past period to improve forecast accuracy?";
Integration of external factors - "How can we better account for the impact of external factors, such as weather and urban events, on passenger flow?";
Algorithm selection and tuning - "Which machine learning algorithms and their settings yield the best results for specific cities and routes?"

The goal of this article is to explore algorithms for building a machine learning model for forecasting passenger flow in public transportation, as well as to examine the stages of its development. The model should consider temporal and external factors to improve forecast accuracy for the next six months. The main research tasks include:

Analysis of existing passenger flow forecasting methods and identification of the most suitable algorithms for the given task;
Consideration of the stages of model creation capable of processing large volumes of data and accounting for various factors affecting passenger flow;
Study of ways to assess the model's accuracy and methods for its improvement;
Development of recommendations for applying the model in real-world conditions to improve the management of urban transport systems.

1. Problem Statement

The problem of forecasting passenger flow is a time series forecasting task: future values are predicted from historical data. It is treated here as a typical regression problem, in which a numerical value (the number of passengers) must be predicted from previous observations.

Input data:

Historical passenger flow data:
1. Format: data structure of the type [
  {"date": "2023-01-01", "passengers": 1250, "day_of_week": "Sunday", "holiday": true, "temperature": -2, "precipitation": "Snow", "event": "None"},
  {"date": "2023-01-02", "passengers": 1300, "day_of_week": "Monday", "holiday": false, "temperature": 0, "precipitation": "None", "event": "None"},
  {"date": "2023-01-03", "passengers": 1275, "day_of_week": "Tuesday", "holiday": false, "temperature": 1, "precipitation": "Rain", "event": "None"},
  ... ]
2. Observation period: at least 6 months;
3. Observation interval: daily.
Additional data:
1. Calendar data (day of the week, holidays);
2. Weather data (temperature, precipitation);
3. Event data (events in the city, public events).

Historical passenger flow data is essential for training the model. It forms a time series in which each value is the number of passengers on a specific day. The observation period must be at least 6 months so that the model can account for seasonal and long-term trends.

Calendar data - the day of the week, holidays, weekends, and other calendar features can significantly influence passenger flow. For example, on weekdays, passenger flow may be higher due to work and study-related trips, whereas on weekends and holidays, passenger numbers may decrease.

Weather data - weather conditions also influence passenger behavior. Data on temperature, precipitation, wind, and other weather conditions are included in the model to account for their impact on passenger numbers.

Event data - various events in the city (concerts, sports events, etc.) can attract large numbers of passengers. This data is also considered for more accurate forecasting.
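As a minimal sketch of how records in the format shown above might be turned into numeric model inputs, the example below parses two of the sample records and encodes them. The function name `encode_record` and the chosen encodings (weekday index, binary flags for precipitation and events) are illustrative assumptions, not part of the described system.

```python
import json
from datetime import date

# Two records copied from the input format described above.
RAW = """[
  {"date": "2023-01-01", "passengers": 1250, "day_of_week": "Sunday", "holiday": true,
   "temperature": -2, "precipitation": "Snow", "event": "None"},
  {"date": "2023-01-02", "passengers": 1300, "day_of_week": "Monday", "holiday": false,
   "temperature": 0, "precipitation": "None", "event": "None"}
]"""


def encode_record(rec):
    """Map one daily record to a numeric feature dict (hypothetical encoding)."""
    d = date.fromisoformat(rec["date"])
    return {
        "weekday": d.weekday(),                               # Monday = 0 ... Sunday = 6
        "is_holiday": int(rec["holiday"]),
        "temperature": float(rec["temperature"]),
        "has_precipitation": int(rec["precipitation"] != "None"),
        "has_event": int(rec["event"] != "None"),
    }


records = json.loads(RAW)
features = [encode_record(r) for r in records]   # model inputs
targets = [r["passengers"] for r in records]     # values to predict
print(features[0], targets[0])
```

A real pipeline would likely keep the precipitation type and event category as categorical features instead of collapsing them to flags; the binary form is used here only to keep the sketch short.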

Output data - passenger flow forecast:

Format: an array with the date and predicted number of passengers for each specific day;
Forecast period: 6 months;
Forecast interval: daily.

The passenger flow forecast - the model should output the predicted number of passengers for each day during the next 6 months. This data will be used for route optimization, resource planning, and improving passenger service quality.
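The output format can be illustrated with a deliberately simple baseline that predicts the historical mean for each day of the week. This is a sketch under stated assumptions, not the proposed model: the name `weekday_mean_forecast`, the synthetic history, and the approximation of "6 months" as 182 days are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

HORIZON_DAYS = 182  # assumption: "6 months" approximated as 26 weeks


def weekday_mean_forecast(history, horizon=HORIZON_DAYS):
    """history: chronological list of (date, passengers).
    Returns one {"date", "passengers"} dict per future day."""
    by_weekday = defaultdict(list)
    for day, count in history:
        by_weekday[day.weekday()].append(count)
    overall = sum(count for _, count in history) / len(history)
    last_day = history[-1][0]
    forecast = []
    for step in range(1, horizon + 1):
        day = last_day + timedelta(days=step)
        values = by_weekday.get(day.weekday())
        # Fall back to the overall mean if a weekday never occurred in the history.
        predicted = sum(values) / len(values) if values else overall
        forecast.append({"date": day.isoformat(), "passengers": round(predicted)})
    return forecast


# Synthetic four-week history starting on Sunday, 2023-01-01.
history = [(date(2023, 1, 1) + timedelta(days=i), 1200 + 50 * (i % 7)) for i in range(28)]
forecast = weekday_mean_forecast(history)
print(len(forecast), forecast[0])
```

A baseline of this kind is also useful later as a reference point: a machine learning model is only worth deploying if it beats it on held-out data.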

The following limitations can be identified:

Data quality - it is essential that the data is accurate and complete. Missing or erroneous data can negatively affect the accuracy of forecasts;
Data volume - sufficient historical data is required to train the model. A minimum observation period of 6 months is established;
Seasonality and trends - the model should be able to account for seasonal fluctuations and long-term trends in the data.

Disruptive factors:

Unexpected events - accidents, strikes, emergencies can drastically change passenger flow and may not be predictable;
Route changes - sudden changes in public transport routes or schedules can also affect the forecast accuracy;
External factors - economic, political, and social changes can influence passenger flow and should be taken into account when developing the model.

The development of a passenger flow forecasting model using machine learning algorithms will:

Improve forecast accuracy. Accurate passenger flow forecasts will allow for better planning of public transport operations.
Optimize resources. Timely distribution of resources (vehicles, personnel) based on forecasts will increase the efficiency of the transport system.
Improve passenger service. Providing more accurate information about schedules and vehicle availability will enhance the quality of passenger service.

The formalized problem statement allows for a clear definition of the research goal, data types, and the methodology to be used for solving the passenger flow forecasting task. Expected results and limitations are also described, which contributes to a clear understanding of the problem and approaches to solving it. Forecasting for the next six months presents a complex challenge that requires consideration of numerous factors and nuances, making the use of machine learning algorithms especially relevant and promising.

2. Review of Studies and Developments on Similar or Related Topics

2.1. "Passenger Flow Forecasting and Early Warning System for Regional Buses Based on Machine Learning"

The article discusses short-term passenger flow forecasting for regional bus terminals based on integrated circuit (IC) card data from bus stations and proposes an early warning model for regional bus passenger flow [1].

First, bus stations are consolidated into virtual regional bus stations. Then, short-term passenger flow forecasting for the regional bus stations is carried out using a machine learning (ML) method, the support vector machine (SVM). Based on this, an early warning model for regional bus passenger flow is developed by analyzing the capacity of regional bus terminals.

The results show that the accuracy of short-term passenger flow forecasting can be improved by replacing actual bus stations with virtual regional bus stations, as the passenger flow at regional bus stations is more stable than at individual bus stations.

Accurate forecasting and early warning of regional bus passenger flow enable urban bus dispatchers to maintain efficient control over the urban public transport system, especially during special and large-scale events.

2.2. "Comparative Testing of ARIMA and LSTM Models for Passenger Flow Forecasting Tasks"

Another article titled "Comparative Testing of ARIMA and LSTM Models for Passenger Flow Forecasting" [2] can also be considered. The introduction emphasizes the importance of passenger flow forecasting for optimizing public transport schedules. The authors highlight the need for effective forecasting methods based on historical data.

The ARIMA model is used for time series analysis and is an extension of the ARMA model that is applicable to non-stationary time series.

The authors describe the process of assessing time series stationarity, testing for unit roots, and transforming the series into a stationary state by taking differences.
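The differencing step mentioned above can be sketched as follows. The series is synthetic and the helper name `difference` is hypothetical; the unit-root test itself is not implemented here. In ARIMA(p, d, q) notation, d is the number of times this operation is applied.

```python
def difference(series, lag=1):
    """Return the differenced series y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Synthetic series: linear trend (+3 per step) plus an alternating +5 / -5 pattern.
series = [100 + 3 * t + (5 if t % 2 == 0 else -5) for t in range(10)]
first_diff = difference(series)

print(series)      # the level keeps growing, so the mean is not constant
print(first_diff)  # the trend is gone; values alternate around a fixed level
```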

LSTM is a type of recurrent neural network proposed in 1997. The main components of LSTM include the memory cell, input, output, and forget gates.

The article presents the network structure and equations describing the operation of gates and the memory cell.

For modeling, passenger flow data from one stop over 12 days was used. The data was divided into training and test sets. Forecasts obtained using the ARIMA and LSTM models were compared with actual passenger flow values.

The graphs presented in the article show that both models successfully predicted the trend of passenger flow changes. The greatest deviations were observed during peak times and transition periods.

The article reports the following errors for the two models:

ARIMA: mean squared error (MSE) 6.2484, root mean squared error (RMSE) 2.4996;
LSTM: MSE 3.8764, RMSE 1.9668.

Thus, the LSTM model demonstrated higher forecasting accuracy compared to ARIMA.

In conclusion, the authors emphasize the importance of considering passenger flow when optimizing public transport schedules. The article demonstrates that both methods - ARIMA and LSTM - have high forecasting accuracy, but LSTM yields better results. The main advantage of ARIMA is computational speed, while LSTM provides more accurate forecasts.

2.3. "The Impact of Weather Conditions on Bus Operation on Urban Routes"

This article is dedicated to studying the impact of various weather conditions on the operation of urban buses [3]. The authors emphasize the relevance of this topic in light of the need to improve the efficiency and safety of transport services in changing climatic conditions. Special attention is paid to the impact of precipitation, air temperature, and road surface conditions on bus speed and adherence to schedules. The study included the following stages:

Collection and processing of weather data and bus movement parameters;
Statistical analysis of the data to identify dependencies between weather conditions and changes in bus operation;
Modeling and forecasting the impact of weather conditions on bus movement using the obtained data.

The main findings of the article include the following aspects:

Precipitation. Rain and snow significantly reduce the average speed of buses. In heavy precipitation conditions, delays on the route increase on average by 10-15%;
Temperature. Low temperatures lead to reduced tire grip on the road, which also reduces the average speed and increases travel time;
Road surface conditions. The presence of ice and snow on the roads requires more cautious driving, leading to additional delays.

Graphs and tables in the article illustrate how bus movement parameters change depending on various weather conditions. For example, the average speed decreases by 20% in heavy snow conditions compared to dry weather.

The authors conclude that it is necessary to consider weather conditions when planning schedules and optimizing bus routes. The implementation of modern weather forecasting technologies and adaptive traffic management systems can significantly improve the efficiency of public transport.

2.4. "Current Methods for Passenger Flow Forecasting"

The author of the article discusses modern methods for forecasting passenger flows in various types of transportation [4]. Passenger flow forecasting is a key element for effective planning and management of transportation systems.

The main forecasting methods include:

Extrapolation models, including the moving average method and exponential smoothing; one example is the grey model applied to rail transport;
Regression models, which estimate relationships between passenger flows and various factors and are applicable to long-term forecasts;
Gravity models, which are based on the arrival-departure balance between transportation hubs.
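The two extrapolation methods named in the list above can be sketched in a few lines. The function names are hypothetical, and the last two passenger counts are made-up illustrative values (the first three come from the sample records in Section 1).

```python
def moving_average(series, window):
    """Trailing mean over the last `window` values; one output per full window."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s[0] = y[0], s[t] = alpha*y[t] + (1 - alpha)*s[t-1]."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


flow = [1250, 1300, 1275, 1310, 1290]
print(moving_average(flow, window=3))
print(exponential_smoothing(flow, alpha=0.5))
```

A larger `window` or a smaller `alpha` smooths more strongly but reacts more slowly to genuine changes in passenger flow.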

New approaches include:

Machine learning and deep learning. The use of neural networks and other machine learning methods to detect complex dependencies in data;
Hybrid models. Combinations of traditional methods and modern approaches, increasing forecast accuracy.

Passenger flow forecasting is used to optimize schedules and manage resources, improve passenger satisfaction by reducing waiting times, and reduce operational costs through efficient resource utilization.

The author emphasizes the importance of continuous and systematic forecasting of passenger flows, selecting models based on feasibility and resources, as well as the use of new technologies and data to improve transportation systems.

2.5. "Experimental Study of Passenger Route Choice Probability"

This article explores the factors influencing passengers' choice of routes in urban transport [5]. The authors analyze the criteria that passengers consider when selecting a route and how these criteria can be used to optimize the transportation system.

The study included several stages:

Surveys of passengers to identify their preferences and factors influencing route selection;
Use of various models to predict the probability of a passenger choosing a specific route;
Data processing using statistical methods to identify the significance of various factors.

Main results:

Travel time, transfer convenience, service intervals, and fare were found to be the main factors influencing route choice;
The majority of passengers choose the route that minimizes their total travel time, even if it includes transfers;
Temporary factors, such as weather and road conditions, also significantly affect route choice.

The authors emphasize the importance of a comprehensive approach to studying passenger behavior and using the collected data to improve service quality and transportation system efficiency. Forecasting and modeling passengers' route choices can significantly increase user satisfaction and optimize urban transport operations.

3. Proposed Methods for Solving the Problem

The model will be built using passenger flow data, which includes the following parameters:

Date and time;
Number of passengers;
Route (starting and ending stops);
External factors (weather, city events, holidays, etc.)

Data can be collected from various sources, including transport companies and meteorological services. Before training the model, data preprocessing should be performed: data cleaning, normalization (scaling all numerical values to the same range to improve the convergence of learning algorithms), and integration of external factors (adding data about weather conditions, holidays, and city events).
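The normalization step described above can be sketched as min-max scaling. The helper name `min_max_scale` is hypothetical, and mapping to the range [0, 1] is an assumption, since the text does not fix a specific range.

```python
def min_max_scale(values):
    """Scale values linearly to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


temperatures = [-2, 0, 1]          # values from the sample records in Section 1
passengers = [1250, 1300, 1275]
print(min_max_scale(temperatures))
print(min_max_scale(passengers))
```

In practice the minimum and maximum should be computed on the training data only and then reused for validation and test data, so that no information from the evaluation period leaks into training.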

To adjust and improve the quality of the dataset, the following algorithms and methods can be used:

Missing value imputation methods: KNN (k-nearest neighbors) and mean value for time series;
Time series analysis: smoothing (e.g., moving average method) and decomposition of time series to detect trends and seasonal components [6];
Feature engineering: creating additional features such as day of the week, month, season, holidays, and weather conditions.
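Two of the steps listed above can be sketched with the standard library alone. The imputation shown is a simplified mean-based variant (the mean of the nearest known neighbors in time), not KNN, and both function names and the chosen features are illustrative assumptions. The sketch assumes the series contains at least one known value.

```python
from datetime import date


def fill_gaps_with_neighbor_mean(values):
    """Replace None with the mean of the nearest known values on each side
    (or with the single known neighbor at the edges of the series)."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            prev = next((values[j] for j in range(i - 1, -1, -1)
                         if values[j] is not None), None)
            nxt = next((values[j] for j in range(i + 1, len(values))
                        if values[j] is not None), None)
            known = [x for x in (prev, nxt) if x is not None]
            filled[i] = sum(known) / len(known)
    return filled


def calendar_features(day):
    """Derive simple calendar features from a date."""
    seasons = ("winter", "spring", "summer", "autumn")
    return {
        "weekday": day.weekday(),                 # Monday = 0 ... Sunday = 6
        "month": day.month,
        "is_weekend": int(day.weekday() >= 5),
        "season": seasons[(day.month % 12) // 3],  # Dec-Feb counted as winter
    }


print(fill_gaps_with_neighbor_mean([1250, None, 1275]))
print(calendar_features(date(2023, 1, 1)))
```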

For forecasting passenger flow, the following machine learning algorithms can be considered.

Linear Regression

Linear regression is a data analysis method that predicts the value of an unknown variable from the value of another, related and known variable [7]. It models the unknown (dependent) variable and the known (independent) variable mathematically as a linear equation (Figure 1). It is a simple model suitable for basic time series analysis.

Figure 1 - Linear Regression
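A minimal sketch of fitting such a linear equation by ordinary least squares is given below, using the day index as the single independent variable. The data are synthetic and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


days = [0, 1, 2, 3, 4]
flow = [1000, 1010, 1020, 1030, 1040]   # synthetic: exactly +10 passengers per day
slope, intercept = fit_line(days, flow)
next_day_forecast = slope * 5 + intercept
print(slope, intercept, next_day_forecast)
```

With real passenger data a straight line captures only the overall trend; weekly and seasonal patterns require additional features or a different model.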

Recurrent Neural Networks (RNN)

Recurrent neural networks are networks with cycles, which makes them well suited to processing sequences. Training an RNN (Figure 2) is similar to training a regular neural network: the backpropagation algorithm is used, with a slight modification. Since the same parameters are shared across all time steps, the gradient at each output depends not only on the calculations of the current step but also on previous time steps. For example, to compute the gradient for the fourth element in the sequence, the error has to be propagated back through the 3 preceding steps and the gradients summed. This algorithm is called backpropagation through time (BPTT). Because they capture temporal dependencies, these networks are useful for time series forecasting tasks.

RNNs introduce memory into artificial neural networks, but this memory is short-term. At each step, the information in memory is mixed with new data, so that it is completely overwritten after several iterations [8].

Figure 2 - Recurrent Neural Network
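The parameter sharing across time steps can be seen in a forward pass of a basic (Elman-style) recurrent layer. This is a sketch only: the weights are random, the dimensions are arbitrary, the names are hypothetical, and backpropagation through time is not implemented.

```python
import numpy as np


def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Compute h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h) for each step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        # The same W_xh, W_hh and b_h are reused at every time step.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
hidden, features, steps = 4, 3, 5
W_xh = rng.normal(scale=0.5, size=(hidden, features))
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))
b_h = np.zeros(hidden)
sequence = rng.normal(size=(steps, features))   # e.g. 5 days with 3 features each

states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(states.shape)
```

Because each hidden state is recomputed from the previous one through `tanh`, older inputs are progressively mixed away, which is the short-term memory effect described above.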

Long Short-Term Memory (LSTM)

LSTM modules (Figure 3) were developed to solve the problem of long-term dependencies: they can retain information over both short and long time intervals. This is achieved by the architecture: no activation function is applied within the recurrent components of an LSTM, so the stored value does not blur over time, which helps avoid vanishing gradients during backpropagation through time.

LSTM blocks include three or four "gates" that regulate the flow of information into and out of memory. The gates are implemented with the logistic function, which returns values in the range [0; 1]. The result is multiplied by the corresponding data stream, allowing information to be passed into or out of memory partially or fully. Figure 3 shows the following:

x_t - the input vector;
c_t - the state vector;
h_t - the output vector

Three main gates:

f_t - forget gate, controlling how much of the stored memory value is kept;
i_t - input gate, controlling the addition of new values to memory;
o_t - output gate, controlling the degree to which memory values are used for calculating the activation function output of the block

Figure 3 - Example of LSTM Block

In summary, this is an improved version of RNN that can remember long-term dependencies, which is especially important for long-term forecasting.
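One step of an LSTM block can be sketched as follows, using the symbols x_t, c_t, h_t, f_t, i_t and o_t introduced above. The sketch follows the commonly used formulation with a `tanh` candidate value (named `g_t` here); whether Figure 3 depicts exactly this variant is an assumption, and the weights are random placeholders.

```python
import numpy as np


def sigmoid(z):
    """Logistic function with values in (0, 1), used for all gates."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b are dicts keyed by 'f', 'i', 'o', 'g'."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    g_t = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate memory content
    c_t = f_t * c_prev + i_t * g_t   # state update: c_prev is only scaled, not squashed
    h_t = o_t * np.tanh(c_t)         # output vector
    return h_t, c_t


rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = {k: rng.normal(scale=0.5, size=(n_hidden, n_in)) for k in "fiog"}
U = {k: rng.normal(scale=0.5, size=(n_hidden, n_hidden)) for k in "fiog"}
b = {k: np.zeros(n_hidden) for k in "fiog"}

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):    # a sequence of 5 input vectors
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The line computing `c_t` is where the architecture differs from a plain RNN: the previous state passes through multiplied only by the forget gate, so a gate value close to 1 preserves it almost unchanged over many steps.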

Gradient Boosting (XGBoost, LightGBM)

Gradient boosting is a machine learning technique for classification and regression tasks that builds a prediction model in the form of an ensemble of weak predictive models, usually decision trees.

XGBoost is based on the gradient boosting algorithm for decision trees.

LightGBM is a framework developed by Microsoft that provides an efficient implementation of the gradient boosting algorithm [9]. Its main advantage lies in modifications to the learning algorithm that significantly speed up training and, in many cases, produce more effective models.
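The core idea of gradient boosting for regression can be sketched from scratch with one-split decision trees (stumps) on a single feature. This is a teaching sketch under simplifying assumptions, not the XGBoost or LightGBM algorithm: those libraries use full trees, regularization and much faster split search. All names and the synthetic data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for squared loss: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is simply the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                       for x, p in zip(xs, predictions)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict_boosting(model, x):
    """Sum the base value and the scaled contributions of all stumps."""
    total = sum(left if x <= threshold else right
                for threshold, left, right in model["stumps"])
    return model["base"] + model["learning_rate"] * total


xs = list(range(14))                              # day index over two weeks
ys = [1300 if x % 7 < 5 else 900 for x in xs]     # synthetic weekday vs. weekend level
model = fit_boosting(xs, ys)

mean_y = sum(ys) / len(ys)
mse_baseline = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - predict_boosting(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_baseline, mse_boosted)
```

The comparison is made on the training data only, to show that each round reduces the remaining error; it says nothing about accuracy on unseen days.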

To improve prediction accuracy, ensemble methods that combine several models can also be explored:

Stacking - combining different models (e.g., LSTM and XGBoost) to obtain the final prediction;
Bagging - averaging the predictions of multiple models of the same type, each trained on a random resample of the data, to reduce variance and improve prediction stability.
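Bagging can be sketched with a very simple base model: the mean passenger count per weekday, trained on bootstrap resamples. The base model, the function names and the synthetic data are illustrative assumptions; stacking is not covered by this sketch.

```python
import random


def fit_mean_by_weekday(sample):
    """Base model: mean passenger count for each weekday index 0-6."""
    sums, counts = [0.0] * 7, [0] * 7
    for weekday, count in sample:
        sums[weekday] += count
        counts[weekday] += 1
    overall = sum(sums) / sum(counts)
    # Weekdays missing from the resample fall back to the overall mean.
    return [sums[d] / counts[d] if counts[d] else overall for d in range(7)]


def bagged_predict(data, weekday, n_models=25, seed=0):
    """Train each base model on a bootstrap resample and average their predictions."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]   # sampling with replacement
        predictions.append(fit_mean_by_weekday(sample)[weekday])
    return sum(predictions) / len(predictions)


# Synthetic data: (weekday index, passengers) for 8 weeks with a small alternating offset.
data = [(i % 7, 1200 + 40 * (i % 7) + (15 if i % 2 else -15)) for i in range(56)]
print(bagged_predict(data, weekday=0))
```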

The choice of the primary algorithm for model building will be made based on a comparison of the results from testing the above algorithms on a test dataset.

The adequacy of the model can be assessed with several metrics: mean absolute error (MAE), mean squared error (MSE), and the coefficient of determination (R²). Using these metrics together gives a more comprehensive evaluation than any one of them alone.
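These metrics can be computed as in the sketch below, which also returns the root mean squared error used in the comparison in Section 2.2. The function name and the sample values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and the coefficient of determination R^2."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mse = ss_res / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


actual = [1250, 1300, 1275, 1310]
predicted = [1240, 1310, 1275, 1300]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are expressed in passengers per day, which makes them easy to interpret, while R² shows how much of the variation in passenger flow the model explains relative to always predicting the mean.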

After model evaluation, it is necessary to analyze errors - identifying systematic errors will help understand where the model deviates the most and make corresponding adjustments.
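One simple way to look for systematic errors is to average the signed error per day of the week. The sketch below uses synthetic values in which the forecast overestimates weekends; the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def mean_error_by_weekday(days, actual, predicted):
    """Average signed error (actual - predicted) per weekday; Monday = 0."""
    grouped = defaultdict(list)
    for day, a, p in zip(days, actual, predicted):
        grouped[day.weekday()].append(a - p)
    return {wd: sum(errs) / len(errs) for wd, errs in sorted(grouped.items())}


days = [date(2023, 1, 2) + timedelta(days=i) for i in range(14)]   # starts on a Monday
actual = [1300] * 14
predicted = [1300 if d.weekday() < 5 else 1400 for d in days]      # weekend overestimate
bias = mean_error_by_weekday(days, actual, predicted)
print(bias)
```

A consistently negative or positive value for a group points to a missing or poorly encoded feature; the same grouping can be applied to holidays, weather categories or event days.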

After successful testing and evaluation, the model can be integrated into the city transport management system. This will enable automatic passenger flow forecasting and real-time schedule adjustments.

References

Forecast and Early Warning of Regional Bus Passenger Flow Based on Machine Learning URL: https://onlinelibrary.wiley.com/doi/10.1155/2020/6625435 (access date: 28.06.2024).
Yakimov M.A., Operailo K.V., Novikova E.N. Comparative Testing of ARIMA and LSTM Models in Passenger Flow Forecasting Tasks // Symbol of Science. 2022. No.6-2. URL: https://cyberleninka.ru/article/n/sravnitelnoe-testirovanie-modeley-arima-i-ltsm-v-zadachah-prognozirovaniya-passazhiropotoka (access date: 28.06.2024).
Omonov B.Sh., Yuldoshev D.F., Shomirzaev E.K. Influence of Weather Conditions on Bus Movement in Urban Routes // Economy and Society. 2023. No.2 (105). URL: https://cyberleninka.ru/article/n/vliyanie-pogodnyh-usloviy-na-rezhim-dvizheniya-avtobusov-na-gorodskih-marshrutah (access date: 28.06.2024).
Current Methods of Passenger Flow Forecasting URL: https://irts.su/2022/02/16/current-forecasting-methods/ (access date: 20.06.2024).
Nefedov N.A., Albert Avua J. Experimental Study of Passenger Route Choice Probability // VEPHT. 2014. No.3 (68). URL: https://cyberleninka.ru/article/n/eksperimentalnoe-issledovanie-veroyatnosti-vybora-passazhirom-marshruta-sledovaniya (access date: 21.06.2024).
Voronina V.V., Theory and Practice of Machine Learning / V.V. Voronina, A.V. Mikheev. - Ulyanovsk: UlSTU, 2017. - pp. 13-106. URL: https://lib.laop.ulstu.ru/venec/disk/2017/191.pdf (access date: 25.06.2024).
What is Linear Regression? URL: https://aws.amazon.com/ru/what-is/linear-regression/ (access date: 25.06.2024).
Long Short-Term Memory URL: https://neerc.ifmo.ru/wiki/index.php?title=Долгая_краткосрочная_память (access date: 28.06.2024).
LightGBM (Light Gradient Boosting Machine) URL: https://www.geeksforgeeks.org/lightgbm-light-gradient-boosting-machine/ (access date: 30.06.2024).
Introduction to Machine Learning URL: https://habr.com/ru/articles/448892/ (access date: 30.06.2024).

Savenkova Valeria Olegovna

Faculty of Information Systems and Technologies

Department of Automated Control Systems

Specialization: "Information Systems and Technologies in Engineering and Business"

Public Transport Passenger Flow Prediction Using Machine Learning Algorithms

Scientific Advisor: Ph.D., Associate Professor of ACS Department, Savkova Elena Osipovna