15 target sets, 7761 actives, and 382674 inactives from high-confidence PubChem BioAssay data

Data Content

For each PubChem target set, the following ready-to-use input files are provided:
  • PDBid_protein.mol2 (protein x-ray structure, MOL2 file format)
  • PDBid_ligand.mol2 (ligand x-ray structure, MOL2 file format)
  • active.smi (true active compounds: SMILES string, PubChem SID)
  • inactive.smi (true inactive compounds: SMILES string, PubChem SID)
  • active_T.smi (active compounds for training: SMILES string, PubChem SID)
  • active_V.smi (active compounds for validation: SMILES string, PubChem SID)
  • inactive_T.smi (inactive compounds for training: SMILES string, PubChem SID)
  • inactive_V.smi (inactive compounds for validation: SMILES string, PubChem SID)
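
The .smi files listed above can be read with a few lines of Python. The sketch below is illustrative only and is not part of the data set distribution. It assumes each line holds a SMILES string followed by a PubChem SID, separated by whitespace; adjust the split if your copy uses another delimiter. The function names parse_smi, load_smi, and label_records are hypothetical helpers introduced here.

```python
from pathlib import Path
from typing import List, Tuple


def parse_smi(text: str) -> List[Tuple[str, str]]:
    """Parse .smi content into (SMILES, SID) pairs.

    Assumption: one compound per line, SMILES first and PubChem SID
    second, separated by whitespace.
    """
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        records.append((fields[0], fields[1]))
    return records


def load_smi(path) -> List[Tuple[str, str]]:
    """Read one .smi file from disk and parse it."""
    return parse_smi(Path(path).read_text())


def label_records(actives, inactives) -> List[Tuple[str, str, int]]:
    """Merge both classes into (SMILES, SID, label) rows.

    Label 1 marks an active compound, label 0 an inactive one.
    """
    rows = [(smi, sid, 1) for smi, sid in actives]
    rows += [(smi, sid, 0) for smi, sid in inactives]
    return rows
```

For one target set directory, a training table could then be built with label_records(load_smi("active_T.smi"), load_smi("inactive_T.smi")), and the validation table likewise from active_V.smi and inactive_V.smi.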


Please cite

Viet-Khoa Tran-Nguyen and Didier Rognan
LIT-PCBA: An unbiased data set for machine learning and virtual screening
J. Chem. Inf. Model. 2020, 60 (9), 4263–4273