
JobFailurePredictionGoogleTraces2019

Learning to predict failures is an important step toward improving the reliability of cloud computing systems: accurate predictions make it possible to avoid failure incidents and to reduce the system's cost overhead. Breakthroughs in machine learning, combined with the huge volumes of data generated by cloud platforms, have created an opportunity to predict when a system or piece of hardware will malfunction or fail, and statistical analysis of workload data from cloud providers yields insights that can be used to improve system reliability. This research examines job and task usage data from the large "Google Cluster Workload Traces 2019" dataset, applying several resampling techniques, namely Random Under-Sampling, Random Over-Sampling, and the Synthetic Minority Over-sampling Technique (SMOTE), to handle the imbalanced dataset. Job failure prediction is then performed on both the imbalanced and the balanced datasets using traditional machine learning algorithms (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and Extreme Gradient Boosting classifiers) and deep learning algorithms (Long Short-Term Memory, LSTM, and Gated Recurrent Unit, GRU). The imbalanced and balanced settings are compared in terms of model accuracy, error rate, sensitivity, F-measure, and precision. The results show that the Extreme Gradient Boosting and Gradient Boosting classifiers are the best-performing algorithms both with and without imbalance-handling techniques, and that SMOTE is the best method for handling the imbalanced data. The LSTM and GRU deep learning models are not the best in terms of accuracy, but based on the ROC curve they outperform the XGBoost and Gradient Boosting classifiers.

Install / Use

/learn @Vu5e/JobFailurePredictionGoogleTraces2019
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Job Failure Prediction In Cloud With Imbalanced Dataset Handling Techniques

Introduction

Proposed Method

This section provides a brief introduction to the theoretical framework of this study and presents the imbalanced-classification methods and machine learning techniques used to design the failure prediction models. The conceptual map of the research methodology in this paper is summarised in Figure 3.1.
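As context for the resampling side of the framework, the following is a minimal sketch of random over-sampling, the simplest of the three resampling techniques used in this study, written with plain pandas on a toy table. The column names and counts are invented for illustration and are not the real trace schema.

```python
import pandas as pd

# Toy imbalanced job table: 6 finished jobs (label 0) vs 2 failed jobs (label 1).
# Column names are illustrative only.
df = pd.DataFrame({"cpu": range(8), "failed": [0] * 6 + [1] * 2})

minority = df[df["failed"] == 1]
n_extra = (df["failed"] == 0).sum() - len(minority)   # 6 - 2 = 4 extra copies needed
# Random over-sampling: duplicate minority rows with replacement until both
# classes have the same number of rows.
balanced = pd.concat(
    [df, minority.sample(n=n_extra, replace=True, random_state=0)],
    ignore_index=True,
)
counts = balanced["failed"].value_counts()
```

Random under-sampling instead drops majority rows, and SMOTE synthesises new minority samples by interpolating between neighbours rather than duplicating rows.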

Data Understanding

Data understanding is important because analysing the raw data reveals significant features and information. The scope of data understanding in this study covers mainly the jobs and tasks in the Google Cluster Workload Traces 2019 dataset that contribute to job and task failures, as outlined below:

  • The trace covers eight cells, each running for several days, with each trace describing usage from a single Borg cell.
  • The CollectionEvents table and InstanceEvents table describe the life cycle of collections and instances, respectively.
  • Resource usage is measured using Linux containers for resource isolation and usage accounting.

First, it is important to understand that there are eight cells, each running for several days, and that each trace describes usage over those days from a single Borg cell. The traces of jobs and tasks are described below:

  • A trace is made up of several tables, each indexed by a primary key that usually includes a timestamp.
  • Information about individual machines is provided by the cell management system and stored in its own table.
  • Each record carries a timestamp, recorded as a 64-bit integer in microseconds measured from 600 seconds before the beginning of the trace period; a unique 64-bit identifier is assigned to every job and machine.
  • Tasks are identified by the job ID of their job together with a 0-based index within the job.
  • The workload consists of six (6) separate tables: job events, task events, machine events, machine attributes, machine constraints, and alloc sets.
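As a small illustration of the timestamp convention above (assuming the 600-second offset described), a record's offset into the trace period can be computed like this; the helper name is invented for illustration:

```python
# Trace timestamps are 64-bit integers in microseconds, measured from a point
# 600 seconds before the trace period begins, so the trace itself starts at
# 600 s = 600_000_000 microseconds.
TRACE_START_US = 600 * 1_000_000

def offset_into_trace_s(timestamp_us: int) -> float:
    """Seconds elapsed since the start of the trace period for one record."""
    return (timestamp_us - TRACE_START_US) / 1_000_000
```

For example, a record stamped 660,000,000 microseconds falls 60 seconds into the trace.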

Next are the CollectionEvents and InstanceEvents tables, which portray the life cycle of collections and instances. The features of CollectionEvents and InstanceEvents are mainly used to determine which features contribute to job failure. CollectionEvents and InstanceEvents are described below:

  • All the tasks within a job usually execute the same binary with the same options and resource request.
  • A common pattern is to run masters (controllers) and workers in separate jobs (e.g., in MapReduce and similar systems).
  • A worker job that has a master job as its parent is automatically terminated when the parent exits, even if its workers are still running; a job can have multiple child jobs but only one parent.
  • Another pattern is to run jobs in a pipeline: if job A states that it should run after job B, then job A will only be scheduled (made READY) after job B finishes successfully.
  • A job can list multiple jobs that it should run after, and will only be scheduled after all of those jobs finish successfully.
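The run-after scheduling rule above can be illustrated with a small hypothetical check; the job structure and field names here are invented for illustration, not the trace schema:

```python
def is_ready(job: dict, finished: set) -> bool:
    """A job is scheduled (made READY) only after every job it lists as a
    run-after dependency has finished successfully."""
    return all(dep in finished for dep in job.get("run_after", []))

# Job A declares it must run after both B and C.
job_a = {"name": "A", "run_after": ["B", "C"]}
```

With only B finished, `is_ready(job_a, {"B"})` is false; once both B and C have finished successfully, the job becomes READY.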

Finally, resource usage is measured using Linux containers for resource isolation and usage accounting. The resource usage data will mainly be used to calculate the power consumption of each cell, as explained below in Section 3.8, Power Consumption. Each task runs within its own container and may create multiple processes in that container.

  • Alloc instances are also associated with a container, inside which task containers nest.
  • Usage values are reported from a series of non-overlapping measurement windows for each instance.
  • The windows are typically 5 minutes long, although they may be shorter if the instance starts, stops, or is updated within that period.
  • The measurements may extend for up to tens of seconds after an instance is terminated, or (rarely) for a few minutes.
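The non-overlapping measurement windows can then be aggregated per instance, sketched here with pandas on invented sample values (the column names are assumptions, not the exact trace schema):

```python
import pandas as pd

# Each row stands for one 5-minute (300 s = 300_000_000 us) measurement window
# of one instance; the numbers are made up for illustration.
usage = pd.DataFrame({
    "instance": [1, 1, 2],
    "window_start_us": [0, 300_000_000, 0],  # non-overlapping window starts
    "avg_cpu": [0.20, 0.40, 0.10],
})

# Mean CPU usage per instance across its measurement windows.
per_instance = usage.groupby("instance")["avg_cpu"].mean()
```

The same grouping pattern extends to memory or any other per-window usage metric.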

Downloading the Dataset

The data are retrieved from Google Cloud Storage. The total size of the compressed trace is approximately 2.4 TB. Since the dataset is large, a sample of it will be chosen, and the data will be processed using Jupyter Notebook + Dask; further details are given in Section 3.3, Data Preparation. The dataset can be downloaded directly from Google Cloud Platform over HTTP using urllib.request, the Python standard library module for opening URLs, which is used here to download the files from the common storage. urllib.request defines functions and classes that facilitate opening URLs, given either a string URL or a Request object. Figure 3.2 shows the process of extracting the data directly over HTTP using urllib.request. Information about the dataset is available at https://github.com/google/cluster-data/blob/master/ClusterData2019.md. The set of machines in the trace is packed into physical racks and connected by a high-bandwidth cluster network.
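A minimal sketch of the download step with urllib.request follows. The bucket URL and shard-naming pattern are assumptions modelled on the ClusterData2019 documentation; the actual object paths should be taken from that page.

```python
import os
import urllib.request

# Assumed public bucket URL for one of the eight cells ("a"); treat this and
# the shard-naming pattern below as illustrative, not authoritative.
BASE_URL = "https://storage.googleapis.com/clusterdata_2019_a"

def trace_url(table: str, shard: int) -> str:
    """Build the HTTPS URL for one gzipped JSON shard of a trace table."""
    return f"{BASE_URL}/{table}-{shard:012d}.json.gz"

def download_shard(table: str, shard: int, dest_dir: str = "data") -> str:
    """Fetch one shard over HTTP with urllib.request and save it locally."""
    os.makedirs(dest_dir, exist_ok=True)
    url = trace_url(table, shard)
    dest = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, dest)  # plain HTTP(S) GET, no auth required
    return dest
```

For example, `download_shard("collection_events", 0)` would fetch the first collection_events shard into a local `data/` folder.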

Data Extraction

The whole dataset has a compressed size of 133 GB. The JSON files are extracted from the files downloaded from the Google website over HTTP. A new data folder is created to store the downloaded data, which needs to be converted from the common-storage format into a usable form. From the downloaded files, "7-Zip" is used to extract the JSON job and task files from the gzip archives (.gz). After that, using the schemas provided, it is easy to query the required files out of the five schemas provided by Google Cloud.
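As a programmatic alternative to 7-Zip, Python's built-in gzip module can decompress the .gz shards directly; the trace ships as gzipped newline-delimited JSON. The sample record below is invented for illustration, not real trace data.

```python
import gzip
import io
import json

# Stand-in for one downloaded .gz shard: a gzipped newline-delimited JSON blob.
# The record content here is illustrative only.
raw = gzip.compress(b'{"collection_id": 1, "type": 0}\n')

# Open the gzip stream in text mode and parse each line as one JSON record.
with gzip.open(io.BytesIO(raw), "rt") as fh:
    records = [json.loads(line) for line in fh]
```

In practice `io.BytesIO(raw)` would be replaced by the path of a downloaded shard, e.g. `gzip.open("data/collection_events-000000000000.json.gz", "rt")`.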

Data Preparation

Once the files have been downloaded and extracted, the data is ready to be loaded into local storage for the modification process. PowerBI is used because of the large size of the trace files being handled. Python could be used, but since the dataset is in JSON format, converting it to CSV would take a long time and mostly caused crashes on the workstation used for this study. The solution found was PowerBI, which can handle the large dataset provided by the Google traces.

The conversion is applied to "collection_events" and "instance_events", and the same process is also done for "instance_usage". The timestamp in each dataset is stored in character form, so to analyse the data, the timestamp needs to be converted to a numeric value. The converted timestamp becomes a microsecond number; the purpose of converting the timestamp from the 'char' data type to a microsecond number is that calculating and visualizing failures is easier on a numeric timestamp than on character forms. Figure 3.3 shows the overall view and the features of the Google Cluster Workload Traces 2019 dataset in PowerBI.
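The timestamp conversion described above can be sketched with pandas; the study performs this step in PowerBI, and the column name "time" is an assumption for illustration:

```python
import pandas as pd

# Timestamps arrive as character data; convert them to 64-bit integers
# (microseconds) so they can be used in calculations and plots.
df = pd.DataFrame({"time": ["0", "600000000", "1200000000"]})
df["time"] = pd.to_numeric(df["time"])   # 'char' -> integer microseconds
df["time_s"] = df["time"] / 1_000_000    # derived seconds column, for plotting
```

After conversion, arithmetic such as window lengths or offsets from the trace start works directly on the numeric column.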

As "PowerBI" has a limit of 150,000 rows when exporting a file to CSV, "DAX Studio" can be used to overcome this limitation. The dataset currently has 2,065,730 rows, and with "DAX Studio" it can easily be exported as a CSV file. Figure 3.4 shows the process of exporting the dataset in "DAX Studio".

Dataset Append

Append Dataset Method

After changing the file type to CSV, based on Figure 3.5, the dataset first needs to be appended across collection_events, instance_usage, and instance_events. The dataset is appended first because of the large size of the Google Cluster Workload Traces 2019 dataset. Using Dask + Jupyter Notebook, the large dataset can be handled easily through lazy computation. The figure below shows the code that appends the partitions, computes the value of each feature, and writes the appended result out as a CSV file.

The collection_events partitions are read first, since the dataset is split into partitions. After that, as shown in Figure 3.7, the collection_events partitions are appended using the concat method. In Python, by contrast, the append function adds a single item to the end of a list; a new list is not returned, but the existing list is modified in place.

View on GitHub
GitHub Stars: 16
Category: Education
Updated: 3 months ago
Forks: 0

Languages

Jupyter Notebook

Security Score

72/100

Audited on Dec 22, 2025

No findings