Two-Stage Hierarchical Cattle Disease Classification Pipeline

Overview

This project implements a hierarchical machine learning pipeline for the classification of cattle diseases based on clinical symptom scores. The system operates in two distinct stages:

Stage 1 (Category Classification): Uses Logistic Regression to classify each sample's symptom scores into a broad disease category (e.g., Respiratory, Digestive, Infectious).
Stage 2 (Specific Disease Identification): Uses a Random Forest classifier trained for the predicted category to identify the specific disease.

This hierarchical approach splits the 26-disease problem into a 6-category decision followed by a smaller within-category decision. It is designed to improve model interpretability and to handle multi-class disease diagnosis more effectively than a flat classification model.
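
As an illustration of how the two stages fit together, here is a minimal sketch using scikit-learn. It is not the code from run_hierarchical_classification.py: the class name TwoStageClassifier, the hyperparameters, the synthetic data, and the disease names disease_A to disease_D are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


class TwoStageClassifier:
    """Hypothetical wrapper: Stage 1 picks a category, Stage 2 picks a disease."""

    def __init__(self, random_state=0):
        self.random_state = random_state
        self.stage1 = LogisticRegression(max_iter=1000)
        self.stage2 = {}  # category name -> fitted RandomForestClassifier

    def fit(self, X, categories, diseases):
        X = np.asarray(X)
        categories = np.asarray(categories)
        diseases = np.asarray(diseases)
        # Stage 1: one model over all samples, predicting the broad category.
        self.stage1.fit(X, categories)
        # Stage 2: one forest per category, trained only on that category's rows.
        for cat in np.unique(categories):
            mask = categories == cat
            forest = RandomForestClassifier(random_state=self.random_state)
            forest.fit(X[mask], diseases[mask])
            self.stage2[str(cat)] = forest
        return self

    def predict(self, X):
        X = np.asarray(X)
        pred_cat = self.stage1.predict(X)
        pred_dis = np.empty(len(X), dtype=object)
        # Route each sample to the forest of its predicted category.
        for cat in np.unique(pred_cat):
            mask = pred_cat == cat
            pred_dis[mask] = self.stage2[str(cat)].predict(X[mask])
        return pred_cat, pred_dis


# Synthetic stand-in data (assumption): 2 categories, 2 placeholder diseases each.
rng = np.random.default_rng(0)
centers = {
    ("Respiratory", "disease_A"): [3, 0, 0, 0, 1],
    ("Respiratory", "disease_B"): [3, 0, 0, 0, 3],
    ("Digestive", "disease_C"): [0, 3, 0, 0, 1],
    ("Digestive", "disease_D"): [0, 3, 0, 0, 3],
}
X_parts, cats, dis = [], [], []
for (cat, name), center in centers.items():
    X_parts.append(rng.normal(loc=center, scale=0.3, size=(30, 5)))
    cats += [cat] * 30
    dis += [name] * 30
X = np.vstack(X_parts)

model = TwoStageClassifier().fit(X, cats, dis)
pred_cat, pred_dis = model.predict(X)
```

One consequence of this design is that a Stage 1 mistake cannot be corrected in Stage 2, because each forest only knows the diseases of its own category.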

Project Structure

hierarchical_cattle_disease_classification.ipynb: The main Jupyter notebook containing data exploration, model training, and performance evaluation.
run_hierarchical_classification.py: A production-ready Python script that executes the complete two-stage training and validation pipeline.
validated_cattlex_dataset.csv: The validated dataset containing symptom scores and disease labels.
notebook_script.txt: A text version of the notebook logic.
IEEE_single_column_high_clarity.pdf: Supporting documentation/technical paper.
stage1_confusion_matrix.png: Visual representation of the Stage 1 classification performance.

Dataset Details

The dataset, validated_cattlex_dataset.csv, contains approximately 2,044 samples.

Features

The models use five aggregated clinical symptom scores:

respiratory_score
digestive_score
mobility_score
skin_score
systemic_score

Targets

Stage 1: disease_category (6 unique categories)
Stage 2: disease_name (26 unique diseases)
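
The feature and target columns listed above could be separated with pandas roughly as follows. This sketch assumes the CSV column names match the names in this README exactly; the helper split_features_targets is a hypothetical name introduced for the example.

```python
import pandas as pd

# Column names taken from this README; assumed to match the CSV header.
FEATURES = [
    "respiratory_score",
    "digestive_score",
    "mobility_score",
    "skin_score",
    "systemic_score",
]
CATEGORY_TARGET = "disease_category"
DISEASE_TARGET = "disease_name"


def split_features_targets(df):
    """Return (X, y_category, y_disease), failing early if a column is missing."""
    required = FEATURES + [CATEGORY_TARGET, DISEASE_TARGET]
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return df[FEATURES], df[CATEGORY_TARGET], df[DISEASE_TARGET]


# Usage, assuming the CSV is in the working directory:
# df = pd.read_csv("validated_cattlex_dataset.csv")
# X, y_category, y_disease = split_features_targets(df)
```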

Installation & Setup

Ensure you have Python 3 installed. You can install the required dependencies using pip:

pip install pandas numpy scikit-learn matplotlib seaborn

How to Run

Using the Notebook

Open hierarchical_cattle_disease_classification.ipynb in VS Code or Jupyter Lab and run all cells to see the full analysis, training process, and visualizations.

Using the Script

To run the automated pipeline from the terminal, execute:

python run_hierarchical_classification.py

This will:

Load the dataset.
Train and evaluate the Stage 1 Logistic Regression model using Stratified K-Fold.
Train and evaluate Stage 2 Random Forest models for each category.
Output classification reports and performance metrics.
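
The evaluation steps above can be sketched with scikit-learn's StratifiedKFold and cross_val_predict. The function names, the fold count of 5, the seed, and the macro averaging of F1 are assumptions, not values taken from the script. Note that this sketch scores Stage 2 on the true category labels, so it measures each forest in isolation rather than the end-to-end pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def evaluate_stage1(X, y_category, n_splits=5, seed=42):
    """Cross-validated macro F1 and text report for the category model."""
    X, y_category = np.asarray(X), np.asarray(y_category)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Each sample is predicted by a model that did not see it during training.
    y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y_category, cv=cv)
    report = classification_report(y_category, y_pred, zero_division=0)
    return f1_score(y_category, y_pred, average="macro"), report


def evaluate_stage2(X, y_category, y_disease, n_splits=5, seed=42):
    """Cross-validated macro F1 per category, using the true category labels."""
    X, y_category, y_disease = (np.asarray(a) for a in (X, y_category, y_disease))
    scores = {}
    for cat in np.unique(y_category):
        mask = y_category == cat
        _, counts = np.unique(y_disease[mask], return_counts=True)
        # Skip categories that cannot be stratified into n_splits folds.
        if len(counts) < 2 or counts.min() < n_splits:
            continue
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        forest = RandomForestClassifier(random_state=seed)
        y_pred = cross_val_predict(forest, X[mask], y_disease[mask], cv=cv)
        scores[str(cat)] = f1_score(y_disease[mask], y_pred, average="macro")
    return scores
```

Stratification keeps the class proportions roughly equal in every fold, which matters when some diseases have far fewer samples than others.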

Performance

The pipeline uses Stratified K-Fold Cross-Validation to ensure robust performance metrics, focusing on F1-score to balance precision and recall across all disease classes.
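
To make the metric concrete, the following sketch computes a macro-averaged F1 by hand. Macro averaging is an assumption here, since the averaging mode is not stated above; the function name macro_f1 and the labels "A" and "B" are illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, where F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Class "A": TP=1, FP=0, FN=1, so F1 = 2/3. Class "B": TP=2, FP=1, FN=0, so F1 = 4/5.
score = macro_f1(["A", "A", "B", "B"], ["A", "B", "B", "B"])
print(round(score, 4))  # 0.7333
```

Because every class counts equally in the mean, a rare disease that is classified poorly lowers the score as much as a common one would.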
