Tracebase

Temporal matrix factorization for sparse traffic time series forecasting.

Generate Convert Improve

Install / Use

/learn @xinychen/Tracebase

About this skill

Quality Score

0/100

README

tracebase

<h6 align="center">Made by Xinyu Chen • :globe_with_meridians: <a href="https://xinychen.github.io">https://xinychen.github.io</a></h6>

Forecasting on high-dimensional and sparse Uber movement speed data of urban road networks with temporal matrix factorization techniques.

Uber movement project provides data and tools for cities to more deeply understand and address urban transportation problems and challenges. Uber movement speed data measure hourly street speeds across a city (e.g., New York City, Seattle, and London) to enable data-driven city planning and decision making. These data are indeed multivariate time series with N road segments and T time steps (hours), and are featured as high-dimensional, sparse, and nonstationary. To overcome the challenge created by these complicated data behaviors, we propose a temporal matrix factorization framework for multivariate time series forecasting on high-dimensional and sparse Uber movement speed data.

Uber movement project is not available now!

Data Processing

A detailed introduction to the analysis of missing data problem in Uber movement speed data is available on Medium.

Download Movement Speed Data

Open the download page of Uber movement project. Take an example of New York City, please try to open NYC Uber movement speed data.
Set the product as speeds and one specific time period.
Download data and save it on your computer.

<a href="https://movement.uber.com/explore/new_york/speeds"> <img align="middle" src="graphics/nyc_movement_heatmap.png" alt="drawing" height="270" hspace="20"> </a> <a href="https://movement.uber.com/explore/seattle/speeds"> <img align="middle" src="graphics/seattle_movement_heatmap.png" alt="drawing" height="270" hspace="20"> </a> Figure 1: Uber movement speed heatmaps of New York City (left panel) and Seattle (right panel), USA.

Extract Road Segments

Please download movement-speeds-hourly-new-york-2019-1.csv (movement speed data file of New York City in January 2019).

import pandas as pd
import numpy as np

data = pd.read_csv('movement-speeds-hourly-new-york-2019-1.csv')
road = data.drop_duplicates(['osm_way_id', 'osm_start_node_id', 'osm_end_node_id'])
road = road.drop(['year', 'month', 'day', 'hour', 'utc_timestamp', 'segment_id', 'start_junction_id', 
                  'end_junction_id', 'speed_mph_mean', 'speed_mph_stddev'], axis = 1)
road.to_csv('road.csv')

In New York City, Uber movement project covers 98,210 road segments.

Construct Speed Matrix

This process is time-consuming.

import numpy as np
import pandas as pd

month = 1
data = pd.read_csv('movement-speeds-hourly-new-york-2019-{}.csv'.format(month))
road = pd.read_csv('road.csv')
tensor = np.zeros((road.shape[0], max(data.day.values), 24))
k = 0
for i in range(road.shape[0]):
    temp = data[(data['osm_way_id'] == road.osm_way_id.iloc[i]) 
                & (data['osm_start_node_id'] == road.osm_start_node_id.iloc[i]) 
                & (data['osm_end_node_id'] == road.osm_end_node_id.iloc[i])]
    for j in range(temp.shape[0]):
        tensor[k, temp.day.iloc[j] - 1, temp.hour.iloc[j]] = temp.speed_mph_mean.iloc[j]
    k += 1
    if (k % 1000) == 0:
        print(k)
mat = tensor.reshape([road.shape[0], max(data.day.values) * 24])
np.savez_compressed('hourly_speed_mat_2019_{}.npz'.format(month), mat)

del data, tensor

The matrix's row corresponds to one specific road segment, while the column corresponds to one specific hour.

Use the Prepared Dataset

NYC Uber Movement

In this repository, we prepare the dataset and place it at the folder datasets/NYC-movement-data-set:

hourly_speed_mat_2019_1.npz (91 MB): data is of size 98,210 x 744 with 23,228,581 positive speed observations.
hourly_speed_mat_2019_2.npz (85.2 MB): data is of size 98,210 x 672 with 21,912,460 positive speed observations.
hourly_speed_mat_2019_3.npz (38.1 MB): data is of size 98,210 x 264 with 10,026,045 positive speed observations.

Note that to make the data as small as possible, we only maintain the data during the first 11 days of March 2019, and save it as hourly_speed_mat_2019_3.npz. You can use the following code to drop the unnecessary portion when preprocessing the raw data.

month = 3
data = pd.read_csv('movement-speeds-hourly-new-york-2019-{}.csv'.format(month))
road = pd.read_csv('road.csv')
i = data[(data.day > 11)].index
data = data.drop(i)

Seattle Uber Movement

You can also consider to use the prepared Seattle Uber movement speed data at the folder datasets/Seattle-movement-data-set:

hourly_speed_mat_2019_1.npz (26.4MB)
hourly_speed_mat_2019_2.npz (25.2MB)
hourly_speed_mat_2019_3.npz (31.6MB)

Data Analysis

If you want to investigate the missing data problem in Uber movement speed data, please prepare the data in the whole year of 2019 by yourself through the above codes. You can also skip this part and check out our documentation for multivariate time series forecasting on NYC Uber movement speed dataset in the next part.

Analyze Missing Rates

## Build a speed matrix for the whole year of 2019 in NYC
mat = np.load('hourly_speed_mat_2019_1.npz')['arr_0']
for month in range(2, 13):
    mat = np.append(mat, np.load('hourly_speed_mat_2019_{}.npz'.format(month))['arr_0'], axis = 1)

## Calculate missing rates
print('The missing ratte of speed matrix is:')
print(len(np.where(mat == 0)[0]) / (mat.shape[0] * mat.shape[1]))

N, T = mat.shape
sample_rate = np.zeros(T)
for t in range(T):
    pos = np.where(mat[:, t] == 0)
    sample_rate[t] = len(pos[0]) / N
sample_rate = sample_rate[: 52 * 7 * 24].reshape([52, 24 * 7])
whole_rate = np.mean(sample_rate, axis = 0)

Draw Missing Rates

rate = len(np.where(mat == 0)[0]) / (mat.shape[0] * mat.shape[1])
print(rate)

import matplotlib.pyplot as plt

plt.rcParams['font.size'] = 12
fig = plt.figure(figsize = (8, 2))
ax = fig.add_subplot(1, 1, 1)
plt.plot(whole_rate, color = 'red', linewidth = 1.8)
upper = whole_rate + np.std(sample_rate, axis = 0)
lower = whole_rate - np.std(sample_rate, axis = 0)
x_bound = np.append(np.append(np.append(np.array([0, 0]), np.arange(0, 7 * 24)), 
                              np.array([7 * 24 - 1, 7 * 24 - 1])), np.arange(7 * 24 - 1, -1, -1))
y_bound = np.append(np.append(np.append(np.array([upper[0], lower[0]]), lower), 
                              np.array([lower[-1], upper[-1]])), np.flip(upper))
plt.fill(x_bound, y_bound, color = 'red', alpha = 0.2)
plt.axhline(y = rate, color = 'gray', alpha = 0.5, linestyle='dashed')
plt.xticks(np.arange(0, 24 * 7 + 1, 1 * 24))
plt.xlabel('Time (hour)')
plt.ylabel('Missing rate')
plt.grid(axis = 'both', linestyle = 'dashed', linewidth = 0.1, color = 'gray')
ax.tick_params(direction = 'in')
ax.set_xlim([-1, 7 * 24])
# ax.set_ylim([0.6, 1])
plt.show()
# fig.savefig('NYC_missing_rate_stat.pdf', bbox_inches = 'tight')

<img align="middle" src="graphics/NYC_missing_rate_stat.png" alt="drawing" width="370"> <img align="middle" src="graphics/Seattle_missing_rate_stat.png" alt="drawing" width="370"> Figure 2: The missing rates of Uber movement speed data aggregated per week over the whole year of 2019. The red curve shows the aggregated missing rates in all 52 weeks. The red area shows the standard deviation of missing rates in each hour over 52 weeks. The 168 time steps refer to 168 hours of Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, and Monday. (Left panel) The dataset has 98,210 road segments, and the whole missing rate is 64.43%. (Right panel) The dataset has 63,490 road segments, and the whole missing rate is 84.95%.

Analyze Observation Rate of Road Segments

import numpy as np

mat = np.load('hourly_speed_mat_2019_1.npz')['arr_0']
for month in range(2, 13):
    mat = np.append(mat, np.load('hourly_speed_mat_2019_{}.npz'.format(month))['arr_0'], axis = 1)
ratio = np.sum(mat > 0, axis = 1) / (365 * 24)

Print observation rate results:

for threshold in 0.1 * np.arange(1, 10):
    print('Observation rate > {0:.2f}'.format(threshold))
    print(np.sum(ratio > threshold))
    print(np.sum(ratio > threshold) / ratio.shape[0])
    print()

Documentation

Problem Definition

In this research, we aim at simultaneously handling the following emerging issues in real-world time series datasets: 1) High-dimensionality (i.e., large $N$): data is of large scale with thousands of multivariate variables. 2) Sparsity and missing values: data is incomplete with missing values, and sometime only a small fraction of data is observed due to the data collection mechanism. 3) Nonstationarity: real-world time series often show strong seasonality and trend. For instance, the Uber movement speed dataset registers traffic speed data from thousands of road segments with strong daily and weekly periodic patterns. And due to insufficient sampling and limited penetration of ridesharing vehicles, we only have access to a small fraction of observed va

Related Skills

node-connect

354.3k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

112.3k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

354.3k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

354.3k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。