
JchemoData.jl

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

The Julia package JchemoData is a repository containing datasets (chemometrics and others) in various formats (JLD2, CSV, etc.). Some of these datasets are used in the examples provided in Jchemo.jl and JchemoDemo.

The JLD2 datasets are listed and described below. New datasets are regularly added (check the commits).

<span style="color:green"> Installation </span>

In order to install JchemoData, run

pkg> add https://github.com/mlesnoff/JchemoData.jl.git

<span style="color:green"> Use </span>

A JLD2 dataset can be loaded as follows (in REPL):

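using Jchemo, JchemoData   # Jchemo provides pnames
using JLD2                 # provides the @load macro
path_jdat = dirname(dirname(pathof(JchemoData)))   # root folder of the installed package
db = joinpath(path_jdat, "data", "cassav.jld2")
@load db dat               # binds the stored object to the variable dat
pnames(dat) # print the names of the objects contained in dat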
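
To see which JLD2 files are present before loading one, the `data` folder can be scanned with Base functions only. The helper below is a minimal sketch: the name `list_jld2` is hypothetical (it is not part of JchemoData), and it only assumes the `data/<name>.jld2` layout used in the example above.

```julia
# Hypothetical helper (not part of JchemoData): list the dataset names
# found as *.jld2 files in the "data" folder under a package root.
function list_jld2(root::AbstractString)
    datadir = joinpath(root, "data")
    isdir(datadir) || return String[]          # no data folder: nothing to list
    files = filter(f -> endswith(f, ".jld2"), readdir(datadir))
    return sort([splitext(f)[1] for f in files])   # strip the extension
end
```

With the package installed, `list_jld2(dirname(dirname(pathof(JchemoData))))` would return the names that can be substituted for `cassav` in the path above.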

<span style="color:green"> Available datasets </span>

<span style="color:green"> JLD2 datasets </span>

cassav

NIRS data on cassava roots (2009-2013; South America). FOSS NiRSystem Instruments 400-2498 nm (step = 2 nm). This is an extract of the dataset used in Lesnoff et al. 2020.

Response variable:

  • TBC concentration (beta-carotene pigment).

Sources:

  • Harvest Plus Challenge Program, ICRAF, Colombia.

References:

  • Davrieux, F., Dufour, D., Dardenne, P., Belalcazar, J., Pizarro, M., Luna, J., Londoño, L., Jaramillo, A., Sanchez, T., Morante, N., Calle, F., Becerra Lopez-Lavalle, L., Ceballos, H., 2016. LOCAL regression algorithm improves near infrared spectroscopy predictions when the target constituent evolves in breeding populations. Journal of Near Infrared Spectroscopy 24, 109. https://doi.org/10.1255/jnirs.1213
  • Lesnoff, M., Metz, M., Roger, J.-M., 2020. Comparison of locally weighted PLS strategies for regression and discrimination on agronomic NIR data. Journal of Chemometrics n/a, e3209. https://doi.org/10.1002/cem.3209

cereal

Dataset 'cereal' in Filzmoser 2023. For 15 cereals an X and Y data set, measured on the same objects, is available (Filzmoser 2023). The X data are 145 infrared spectra, and the Y data are 6 chemical/technical properties (Heating value, C, H, N, Starch, Ash). The cereals come from 5 groups: B=Barley, M=Maize, R=Rye, T=Triticale, W=Wheat. The data set can be used for PLS2.

Sources:

  • Package chemometrics; Multivariate Statistical Analysis in Chemometrics, 2023, Filzmoser P.

References:

  • K. Varmuza and P. Filzmoser: Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press, Boca Raton, FL, 2009.

challenge2018

NIRS data (protein content of forages and feed) used in the challenge of the congress Chemometrics2018 (Paris, January 2018). The original data contain errors (duplicates). The data provided in JchemoData have been corrected (duplicates have been removed) and documented with new descriptors (type of plant material).

challenge2021

NIRS data (reflectance) used in the challenge of the e-congress Chemometrics2021 (February 2021). A data description is available here.

corn

Eigenvector Corn data

This data set consists of 80 samples of corn measured on 3 different NIR spectrometers. The wavelength range is 1100-2498 nm at 2 nm intervals (700 channels). The moisture, oil, protein and starch values for each of the samples are also included. A number of NBS glass standards were also measured on each instrument. The data was originally taken at Cargill.

  • m5spec: [80x700 dataset] Spectra on instrument m5
  • mp5spec: [80x700 dataset] Spectra on instrument mp5
  • mp6spec: [80x700 dataset] Spectra on instrument mp6
  • propvals: [80x4 dataset] Property values for samples
  • m5nbs: [3x700 dataset] NBS glass stds on m5
  • mp5nbs: [4x700 dataset] NBS glass stds on mp5
  • mp6nbs: [4x700 dataset] NBS glass stds on mp6
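
The wavelength axis quoted above can be rebuilt as a range, which also confirms the channel count. This is a small sketch using only the figures stated in the description; the variable names are illustrative and do not come from the dataset.

```julia
# Wavelength axis stated in the description: 1100-2498 nm in 2 nm steps.
wl = 1100:2:2498
nchannels = length(wl)      # 700, matching the 80x700 spectra tables

# Illustrative column labels for a spectra table with one column per channel
wlnames = string.(wl)
```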

fermentation

Dataset 'NIR' in Filzmoser 2023. Liebmann et al. 2009 (Filzmoser et al. 2012) provided this dataset where 166 alcoholic fermentation mashes of different feedstock (rye, wheat, and corn) were analyzed. The response variables are the concentrations of glucose and ethanol (in grams per liter) in substrates from the bioethanol processes. The 235 predictor variables contain the first derivatives of near infrared spectroscopy (NIR) absorbance values at 1115–2285 nm, measured in liquid samples.

Sources:

  • Package chemometrics; Multivariate Statistical Analysis in Chemometrics 2023, Filzmoser P.

References:

  • B. Liebmann, A. Friedl, and K. Varmuza. Determination of glucose and ethanol in bioethanol production by near infrared spectroscopy and chemometrics. Anal. Chim. Acta, 642:171-178, 2009.
  • Filzmoser, P., Gschwandtner, M., Todorov, V., 2012. Review of sparse methods in regression and classification with application to chemometrics. Journal of Chemometrics 26, 42–51. https://doi.org/10.1002/cem.1418

flour_splus6

Dataset reported in the Splus6 Manual, p. 629 (example used to compute estimated marginal means, a.k.a. 'LS-means').

Incomplete 3-factor design without replication (n = 26 obs.):

  • 3 fat levels
  • 3 surfactant levels
  • 4 flour types
  • and response variable 'y'
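
The sense in which the design is incomplete follows from the level counts above. The sketch below only does that arithmetic; the names `levels`, `ncells` and `nempty` are illustrative, and the empty-cell count relies on "without replication" meaning at most one observation per cell.

```julia
# Level counts taken from the list above
levels = (fat = 3, surfactant = 3, flour = 4)

ncells = prod(values(levels))   # 36 cells in the full factorial design
nobs = 26                       # observations actually available
nempty = ncells - nobs          # 10 factor combinations with no observation
```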

forages2

NIRS data on dried and ground mixed forages (N = 485): stems, leaves etc. collected mainly in tropical African areas. FOSS NiRSystem Instruments 1100-2498 nm (step = 2 nm). Because the data are private, the provided spectra have been preprocessed: standard normal variate (SNV) and Savitzky-Golay (deriv = 2) transformation.
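
For readers unfamiliar with SNV, the transformation centers and scales each spectrum (row) by its own mean and standard deviation. The function below is a minimal sketch written with the Statistics standard library only; `snv_rows` is a hypothetical name, it is independent of any Jchemo function, and it is not necessarily the exact code used to prepare this dataset.

```julia
using Statistics

# Standard normal variate: each row (spectrum) is centered on its own mean
# and divided by its own standard deviation.
function snv_rows(X::AbstractMatrix{<:Real})
    mu = mean(X; dims = 2)
    sd = std(X; dims = 2)
    return (X .- mu) ./ sd
end

# Toy example with two "spectra" of three channels each
X = [1.0 2.0 3.0; 10.0 20.0 60.0]
Z = snv_rows(X)   # every row of Z now has mean 0 and standard deviation 1
```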

Response variables:

  • DM: dry matter content
  • NDF: fiber content
  • typ: Type of forage

Sources:

grapes

Varieties of wine grapes, to be discriminated by means of NIR and visible spectrometry. The spectra were measured in transmission on berries separated from the bunch, in laboratory conditions, with a ZEISS MMS1 spectrometer. The wavelengths ranged from 310 to 1100 nm. These data were collected within the framework of a project aiming at characterizing the sugar content and the acidity of wine grapes by NIR spectrometry. Thus, the berries were selected to span a great heterogeneity of maturity. Spectra were acquired by batches of 50 individuals. Each batch contained individuals of the same variety. The experimentation related to 3 varieties: carignan (crg), grenache blanc (grb) and grenache noir (grn). Only crg and grb varieties were measured on different batches, at various dates. For crg and grb varieties, the training set and the test set are different batches, whereas for the grn variety, a batch of spectra was cut randomly in two equal parts. Thus, the calibration and test sets consisted each of N = 125 individuals described by p = 256 variables.
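
The random cut of the grn batch into two equal parts can be illustrated as follows. This is a sketch only: `split_in_two` is a hypothetical helper, the seed is arbitrary, and the batch size of 50 is taken from the description above; the indices actually used by the authors are not reproduced here.

```julia
using Random

# Randomly split the indices 1:n into two halves (the second half gets the
# extra index when n is odd).
function split_in_two(n::Integer; rng::AbstractRNG = Random.default_rng())
    perm = randperm(rng, n)
    h = n ÷ 2
    return perm[1:h], perm[(h + 1):end]
end

# One batch of 50 spectra cut into calibration and test halves
idx_cal, idx_test = split_in_two(50; rng = MersenneTwister(1))
```

With two full batches of 50 for crg and grb plus 25 grn spectra, each set would reach the N = 125 stated above.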

References:

  • Roger JM, Palagos B, Guillaume S, Bellon-Maurel V. Discriminating from highly multivariate data by Focal Eigen Function discriminant analysis; application to NIR spectra. Chemometrics and Intelligent Laboratory Systems. 2005;79(1):31-41. doi:10.1016/j.chemolab.2005.03.006.

grapevariety

Visible-NIR spectra collected (with Labspec ASD) on N = 432 fresh leaves of three wine grape varieties to be discriminated. For confidentiality, the spectra have been anonymized and preprocessed with a Savitzky-Golay transformation (first derivative). A gap observed in the spectra at 1000 nm has been removed before the preprocessing.
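
As background on the preprocessing mentioned above, a Savitzky-Golay derivative fits a low-degree polynomial in a sliding window and takes the derivative of that polynomial at the window center. The sketch below uses only the LinearAlgebra standard library; the function names, the window half-width and the polynomial degree are assumptions chosen for illustration (the settings used for this dataset are not stated), and the derivative is expressed per sample index.

```julia
using LinearAlgebra

# Savitzky-Golay weights for the `deriv`-th derivative at the window center.
# Half-width m gives a window of 2m + 1 points; `degree` is the polynomial degree.
function savgol_weights(m::Integer, degree::Integer, deriv::Integer)
    z = collect(-m:m)
    A = [float(zi)^j for zi in z, j in 0:degree]   # Vandermonde matrix of the window
    C = pinv(A)                                    # row k+1 gives the k-th polynomial coefficient
    return factorial(deriv) .* C[deriv + 1, :]
end

# Apply the weights to the interior points of a signal (the m points at
# each edge are dropped in this simple version).
function savgol_apply(y::AbstractVector{<:Real}, m::Integer, degree::Integer, deriv::Integer)
    w = savgol_weights(m, degree, deriv)
    n = length(y)
    return [dot(w, y[(i - m):(i + m)]) for i in (m + 1):(n - m)]
end

y = [2.0 * x + 1.0 for x in 1:20]   # straight line with slope 2
d1 = savgol_apply(y, 3, 2, 1)       # first derivative, 7-point window, degree 2
```

On this straight line the estimated first derivative is the slope at every interior point, which is a quick sanity check for the weights.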

Sources:

  • M. Ecarnot, Inrae, UMR Agap, Montpellier, France.

ham

Sensory evaluation of eight American dry-cured ham products, performed by a panel of trained assessors.

References:

  • M.D. Guardia, A.P. Aguiar, A. Claret, J. Arnau & L. Guerrero (2010). Sensory characterization of dry-cured ham using free-choice profiling. Food Quality and Preference, 21(1), 148-155. doi: 10.1016/j.foodqual.2009.08.014
  • Tchandao Mangamana, E., Cariou, V., Vigneau, E., Glèlè Kakaï, R.L., Qannari, E.M., 2019. Unsupervised multiblock data analysis: A unified approach and extensions. Chemometrics and Intelligent Laboratory Systems 194, 103856. https://doi.org/10.1016/j.chemolab.2019.103856

iris

Fisher's or Anderson's iris dataset gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

References:

  • Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.
  • Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2–5.
  • Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole. (has iris3 as iris.)

linnerud

Linnerud data (Tenenhaus 1998, Table 1, p.15).

Two tables of measures on 20 humans:

  • X = 3 variables of physical exercise.
  • Y = 3 variables of body condition.

References:

  • Tenenhaus, M., 1998. La régression PLS: théorie et pratique. Editions Technip, Paris.

mango_anderson

NIRS data from mango.

References:

  • Anderson, N.T., Walsh, K.B., Flynn, J.R., Walsh, J.P., 2021. Achieving robustness across season, location and cultivar for a NIRS model for intact mango fruit dry matter content. II. Local PLS and nonlinear models. Postharves
