The data contained in ML_TRANSITION.tar.gz are organized in 5 folders and 2 compressed tarballs:
1) DECOMPOSITION_COEFFS.tar.gz:
Compressed tarball containing the fitted coefficients for each molecule (training, test, and out-of-sample sets) included in this work. The suffix of each file name identifies the density it describes (ground state, hole/particle, or transition):
A) *.dm_fit.dat : Ground state density.
B) *.st1_dm_hole_fit.dat : n(pi)* hole density.
C) *.st1_dm_part_fit.dat : n(pi)* particle density.
D) *.st2_dm_hole_fit.dat : pi(pi)* hole density.
E) *.st2_dm_part_fit.dat : pi(pi)* particle density.
F) *.st2_transition_fit.dat : pi(pi)* transition density.
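As a minimal sketch of how the suffix convention above could be used in a script, the Python snippet below maps a file name to the density it holds. The function name density_label and the example file names are hypothetical; only the suffixes come from this README. The internal layout of the .dat files is not described here, so the snippet does not attempt to read them.

```python
from typing import Optional

# Suffix -> density, copied from the list in this README.
SUFFIX_TO_DENSITY = {
    ".dm_fit.dat": "ground state density",
    ".st1_dm_hole_fit.dat": "n(pi)* hole density",
    ".st1_dm_part_fit.dat": "n(pi)* particle density",
    ".st2_dm_hole_fit.dat": "pi(pi)* hole density",
    ".st2_dm_part_fit.dat": "pi(pi)* particle density",
    ".st2_transition_fit.dat": "pi(pi)* transition density",
}


def density_label(file_name: str) -> Optional[str]:
    """Return the density described by a coefficient file, or None if unknown."""
    for suffix, label in SUFFIX_TO_DENSITY.items():
        if file_name.endswith(suffix):
            return label
    return None
```

For example, density_label("some_molecule.st2_transition_fit.dat") returns "pi(pi)* transition density", where "some_molecule" stands in for an actual molecule name.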
2) EXCEL_DATA: contains two Excel spreadsheets, one for the n(pi)* state and one for the pi(pi)* state. The spreadsheets contain the ab initio results grouped into training and test sets for the four points of the learning curves (that is, with 325, 650, 975 and 1300 molecules in the training set), together with the predicted values for the test and out-of-sample sets. An interactive version of these data is available at http://vfbc21.herokuapp.com.
3) FRAGMENTS: contains the definition of the fragments (R3, R4, Azo, R1, R2) for each molecule. The fragments themselves are defined in the SI of the paper.
A) Each line lists the atom numbers that belong to one fragment. The corresponding structure is the file with the same name in the GEOMETRIES folder.
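The fragment files could be read along the following lines. This is only a sketch under several assumptions that this README does not confirm: that each file has exactly five lines, that the lines follow the order R3, R4, Azo, R1, R2 given above, and that atom numbers are integers separated by spaces or commas. The function name parse_fragments is hypothetical. Check a few files against the SI before relying on it.

```python
import re
from typing import Dict, List

# ASSUMPTION: line order matches the order in which the README names the fragments.
FRAGMENT_NAMES = ["R3", "R4", "Azo", "R1", "R2"]


def parse_fragments(text: str) -> Dict[str, List[int]]:
    """Map each fragment name to the atom numbers found on its line."""
    lines = text.splitlines()
    if len(lines) != len(FRAGMENT_NAMES):
        raise ValueError(
            f"expected {len(FRAGMENT_NAMES)} lines, found {len(lines)}"
        )
    # ASSUMPTION: atom numbers are plain integers; any separator is tolerated.
    return {
        name: [int(token) for token in re.findall(r"\d+", line)]
        for name, line in zip(FRAGMENT_NAMES, lines)
    }
```

Whether the atom numbers start at 0 or 1 is not stated here, so the values are returned unchanged.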
4) GEOMETRIES: contains the 3427 azo-dye structures that form the training, test and out-of-sample sets. The geometries are in XYZ format.
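A minimal reader for the geometry files might look like the sketch below. It assumes the conventional XYZ layout (first line: number of atoms; second line: free-text comment; then one "element x y z" line per atom), which this README does not spell out. The function name parse_xyz is hypothetical.

```python
from typing import List, Tuple

Atom = Tuple[str, float, float, float]


def parse_xyz(text: str) -> Tuple[str, List[Atom]]:
    """Parse the text of one XYZ file into (comment, list of atoms)."""
    lines = text.splitlines()
    n_atoms = int(lines[0].split()[0])          # line 1: atom count
    comment = lines[1] if len(lines) > 1 else ""  # line 2: comment
    atoms: List[Atom] = []
    for line in lines[2:2 + n_atoms]:
        # Keep only the first four columns; extra columns are ignored.
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms
```

Typical use would be parse_xyz(open(path).read()), with path pointing at one file in the GEOMETRIES folder.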
5) INDICES: contains four files listing the names of the molecules in the training and test sets (separate and joint) and in the out-of-sample set.
6) PRED_COEFFICIENTS.tar.gz:
Compressed tarball containing the predicted coefficients for each molecule (test and out-of-sample sets) included in this work. The folder contains 6 subfolders:
A) GS : Ground state density.
B) H_S1 : n(pi)* hole density.
C) H_S2 : pi(pi)* hole density.
D) P_S1 : n(pi)* particle density.
E) P_S2 : pi(pi)* particle density.
F) TD_S2 : pi(pi)* transition density.
Each subfolder contains two additional folders:
1) OOS : out-of-sample predictions.
2) TEST_SET : test set predictions.
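The two-level layout described above (field subfolder, then OOS or TEST_SET) can be captured in a small path helper. This is a sketch only: the function name prediction_dir is hypothetical, and the name of the top-level folder produced by unpacking the archive is an assumption passed in by the caller.

```python
from pathlib import Path

# Subfolder names as listed in this README.
FIELDS = ("GS", "H_S1", "H_S2", "P_S1", "P_S2", "TD_S2")
SUBSETS = ("OOS", "TEST_SET")


def prediction_dir(root: str, field: str, subset: str) -> Path:
    """Build <root>/<field>/<subset>, rejecting names not listed in the README."""
    if field not in FIELDS:
        raise ValueError(f"unknown field {field!r}; expected one of {FIELDS}")
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    return Path(root) / field / subset
```

For instance, prediction_dir("PRED_COEFFICIENTS", "TD_S2", "OOS") points at the out-of-sample predictions for the pi(pi)* transition density, assuming the archive unpacks into a folder named PRED_COEFFICIENTS.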
7) REGRESSION_WEIGHTS: contains the ML regression weights obtained after training the model for each target field. The folder contains 6 subfolders:
A) GS : Weights for ground state density.
B) H_S1 : Weights for n(pi)* hole density.
C) H_S2 : Weights for pi(pi)* hole density.
D) P_S1 : Weights for n(pi)* particle density.
E) P_S2 : Weights for pi(pi)* particle density.
F) TD_S2 : Weights for pi(pi)* transition density.
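The two nested archives listed above (DECOMPOSITION_COEFFS.tar.gz and PRED_COEFFICIENTS.tar.gz) have to be unpacked before their contents can be used. A minimal sketch with Python's standard tarfile module follows; the function name extract_inner_archives is hypothetical, and root is assumed to be the folder obtained by unpacking ML_TRANSITION.tar.gz.

```python
import tarfile
from pathlib import Path
from typing import List

# Archive names as listed in this README.
INNER_ARCHIVES = ("DECOMPOSITION_COEFFS.tar.gz", "PRED_COEFFICIENTS.tar.gz")


def extract_inner_archives(root: str) -> List[str]:
    """Unpack the nested tarballs found in root; return the names extracted."""
    extracted = []
    for name in INNER_ARCHIVES:
        archive = Path(root) / name
        if not archive.is_file():
            continue  # skip archives that are not present
        with tarfile.open(archive, "r:gz") as tar:
            if hasattr(tarfile, "data_filter"):
                # Newer Python versions: use the safer "data" extraction filter.
                tar.extractall(path=root, filter="data")
            else:
                # Older Python versions: only do this for archives you trust.
                tar.extractall(path=root)
        extracted.append(name)
    return extracted
```

The archives are extracted next to where they are found, so the existing folders are left untouched.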