PubChem Download Facility Help          

This document describes how to use the download facility built into PubChem, which lets you create a file containing a set of structures and/or bioassays from the results of an Entrez search or an ID list. Note that at most 500,000 compound or substance records can be retrieved this way (50,000 for images); if you need larger sets of records, use the PubChem FTP site, which contains the entire database in the same formats available here. All download requests are kept private; without your unique 64-bit key, nobody else can see what records you have requested.


 

Compound/Substance Download

Search PubChem Substance or Compound → download chemical structures from the search results page:

1) Perform a search in the PubChem Substance or PubChem Compound database.

2) Press the Structure Download icon, which appears in the row of "Tools" buttons near the top of PubChem Compound and PubChem Substance search results pages.

3) This will take you to a format selection page with two menus. The first menu chooses the data format:

Text ASN.1: ASN.1 is PubChem's native data format; this is the record as it exists in our database. The text flavor of ASN.1 is both computer and human readable (to some extent), but is not an industry-supported type of ASN.1 data - so ASN.1 parsing libraries other than NCBI's may not be able to read it. The ASN.1 specification for this data is available on the PubChem FTP site.

Binary ASN.1: This is industry-standard ASN.1 in binary (non-human readable) format. Any ASN.1 parsing library should be able to read this format. The ASN.1 specification for this data is available on the PubChem FTP site.

XML: This data is exactly equivalent to the ASN.1 data, but is in standard XML format. The XML schema for this data is available on the PubChem FTP site.

SDF: This data is in standard SDF format, converted from the original ASN.1. A full description of all the SD tags used to present PubChem records is available on the PubChem FTP site.

Image / Small Image: Retrieves the images used in PubChem web pages, in large (currently 300x300) or small (100x100) size. The data is always returned in PNG format, stored as SID/CID-numbered files in a ZIP archive, regardless of compression selection below.

SMILES: Retrieves the isomeric SMILES description of the records. The format is a text file, where each line contains SID/CID - [tab] - SMILES string.

InChI: Retrieves the InChI description of the records (see http://www.iupac.org/inchi). The format is a text file, where each line contains SID/CID - [tab] - InChI string.
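
Because the SMILES and InChI downloads share the same line layout (an ID, a tab, then the string), one small reader can handle both. The following is a minimal sketch in Python, not an official PubChem tool; the function name parse_id_string_lines and the sample lines are illustrative assumptions.

```python
from typing import Dict, Iterable


def parse_id_string_lines(lines: Iterable[str]) -> Dict[int, str]:
    """Map each SID/CID to the SMILES or InChI string on the same line."""
    records = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only: the ID comes first, the string second.
        id_text, _, value = line.partition("\t")
        records[int(id_text)] = value
    return records


# Illustrative sample lines in the "ID [tab] string" layout described above.
sample = ["2244\tCC(=O)OC1=CC=CC=C1C(=O)O\n", "962\tO\n"]
print(parse_id_string_lines(sample))
```

The function accepts any iterable of lines, so an open text file can be passed to it directly.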

A second menu lets you choose the compression for the resulting data file:

GZip: This is the default and recommended compression, as it is recognized by most modern decompression applications. Information on GZip (.gz) is available at www.gzip.org.

BZip2: These files are slightly smaller, but the format is not as widely used and takes a little longer to decompress. Information on BZip2 (.bz2) is available at www.bzip.org.

None: No compression.
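
If you process downloaded files with a script, the compression choice determines how the file must be opened. The sketch below uses Python's standard gzip and bz2 modules and picks the decompressor from the file extension; the helper name open_download is hypothetical, and the sketch assumes a text format (SDF, XML, SMILES, InChI, or text ASN.1) rather than binary ASN.1 or the image ZIP archive.

```python
import bz2
import gzip
from pathlib import Path


def open_download(path):
    """Open a downloaded text file, choosing the decompressor by suffix."""
    suffix = Path(path).suffix.lower()
    if suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    if suffix == ".bz2":
        return bz2.open(path, "rt", encoding="utf-8")
    # No recognized compression suffix: treat the file as plain text.
    return open(path, "r", encoding="utf-8")
```

The returned object can be iterated line by line, so large files do not need to be decompressed to disk first.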

If 3D coordinates for the records are desired, then select the "Use 3D" checkbox. Any records that do not have 3D information will be omitted from the download. Substances with deposited 3D coordinates will always be returned in 3D. Note that this option affects only the following download types:

Substance images: When the deposited form of the substance has 3D coordinates supplied by the depositor, this will change the images to a 3D rendering of these coordinates.

Compound images: When the compound has computed 3D coordinates, this will change the images to a 3D rendering of these coordinates.

Compound records: When the compound has computed 3D coordinates and the requested format is one that includes coordinate information, this will change the coordinates to 3D. Multiple conformers of each CID may be requested, though not all compounds may have that many conformers available.

4) Press the download button to begin the download process. Because the records are being retrieved directly from the PubChem database, it is necessary to queue download requests in order to prevent server overload. You will see a series of self-refreshing pages during this process. In particular, the Queue status shows what's happening:

Waiting: There are requests in the queue already, and this job is waiting for its turn.

Running: This job's turn has come, and the download file is being prepared.

Done: The request has been completed.

You do not have to keep your browser open on this page the entire time; you can bookmark this status page and come back to it later to check your request's progress, anytime within 24 hours of the initial request.

5) When the download is finished, your file should start transferring automatically. You can also download by FTP from the given URL link - either directly through your browser or with any FTP client. Your file will remain on the FTP site for at least a week.
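
The queue states described in step 4 (Waiting, Running, Done) lend themselves to a simple polling loop if you track a request from a script. The sketch below is only an outline of that loop: how the status is actually obtained is not covered in this document, so the get_status callable is a hypothetical placeholder you would have to supply, and the polling interval and limit are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until_done(get_status: Callable[[], str],
                    poll_seconds: float = 10.0,
                    max_polls: int = 360) -> bool:
    """Poll until the status is "Done"; return False if the polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status == "Done":
            return True
        if status not in ("Waiting", "Running"):
            raise ValueError(f"unexpected queue status: {status!r}")
        time.sleep(poll_seconds)  # be gentle on the server between checks
    return False


# Demonstration with canned statuses instead of a real request.
canned = iter(["Waiting", "Running", "Done"])
print(wait_until_done(lambda: next(canned), poll_seconds=0))
```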


Direct download:
input a list of chemicals (CIDs or SIDs) → download chemical structures via PubChem Download Service

It is now possible to download PubChem Compound or Substance data directly without going through Entrez:

1) Navigate to the PubChem Download Service web page (http://pubchem.ncbi.nlm.nih.gov/pc_fetch)

2) Select a database: PubChem Compound or PubChem Substance

3) Supply a list of IDs. These should be SIDs for PubChem Substance or CIDs for PubChem Compound, and you can either type or paste them directly into the web page form or upload a local file of IDs. The IDs may be integers separated by any combination of white space, commas, or semicolons. You may also choose from a list of prior Entrez searches, if available, but note that the history item selection must match the database selection. (These input options will not appear in the web form if you reached the PubChem Download Service by pressing the download icon on a PubChem Compound or PubChem Substance search results page. In that case, the PubChem Download Service automatically uses the CIDs or SIDs from your search results as input and displays only the output options: format, compression type, and 3D coordinates.) The rest of the download operation then proceeds as described above.
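
Before pasting or uploading an ID list, it can help to check that it follows the separator rules given in step 3. The following Python sketch splits text on any run of white space, commas, or semicolons and converts each piece to an integer; the function name parse_id_list is an illustrative assumption, and the check runs locally rather than reproducing whatever validation the service itself performs.

```python
import re
from typing import List


def parse_id_list(text: str) -> List[int]:
    """Split on runs of white space, commas, or semicolons; return integers."""
    tokens = [token for token in re.split(r"[\s,;]+", text) if token]
    # int() raises ValueError on anything that is not an integer,
    # which flags malformed entries before the list is submitted.
    return [int(token) for token in tokens]


print(parse_id_list("2244, 3059;962\n 5793 "))
```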

PubChem Power User Gateway (PUG)

The save job button that appears at the bottom of a PubChem Download Service page produces an XML data structure that may be used with PUG, or as a model for constructing PUG download requests. See http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html for more information on accessing PubChem through PUG.

BioAssay Download

Download BioAssay descriptions and corresponding data sets or subsets

Download assays from PubChem BioAssay search results
When you perform a search in the PubChem BioAssay database, you can click on the download icon in the "Tools" section of the BioAssay search results page, as illustrated below, to export the retrieved records. This opens the PubChem PCAssay Download Service, which allows you to select the format in which you would like to save the assay descriptions and/or data. If desired, you can choose to download only the subset of data from each bioassay that contains the results for a specified set of PubChem Substance IDs (SIDs), by uploading the SID list on the PubChem PCAssay Download Service page. At most 1000 bioassay records can be downloaded at a time.
Sample PubChem BioAssay search results page from a search for G6PD. The Tools row at the top of the page shows the download icon.


PubChem PCAssay Download Service:
The PubChem PCAssay Download Service can be accessed by clicking on the download icon of a PubChem BioAssay search results page, as described above, or by opening the service directly at http://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi. At most 1000 bioassay records can be downloaded at a time.

PubChem PCAssay Download Service home page, where you can input a list of assay IDs and download the corresponding descriptions and/or data sets.

If you access the service directly, it allows you to input a list of Assay IDs (AIDs) and download the corresponding assay descriptions and/or data. The IDs may be integers separated by any combination of white space or commas.

If you accessed the Download Service from a PubChem BioAssay search results page, the list of AIDs you retrieved will be imported automatically (and will appear as a "query_key" in the URL of the download service page).

You can also choose to input a list of PubChem Substance IDs (SIDs) if you would like to download only the subset of data from each bioassay that contains the results for the specified SIDs.
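
Since at most 1000 bioassay records can be downloaded at a time, a longer AID list has to be split across several requests. The sketch below shows one way to cut a list into batches of that size in Python; the names batch_aids and MAX_ASSAYS_PER_REQUEST are illustrative assumptions, and submitting each batch is left to you.

```python
from typing import Iterator, List, Sequence

MAX_ASSAYS_PER_REQUEST = 1000  # per-request limit stated in this document


def batch_aids(aids: Sequence[int],
               size: int = MAX_ASSAYS_PER_REQUEST) -> Iterator[List[int]]:
    """Yield consecutive slices of the AID list, each at most `size` long."""
    for start in range(0, len(aids), size):
        yield list(aids[start:start + size])


# Example: 2501 placeholder AIDs need three separate requests.
for batch in batch_aids(list(range(1, 2502))):
    print(len(batch))
```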

The Download Service provides the result as a single compressed file in "zip" format. Depending on the request, the result file may contain multiple bioassay records; in that case, each bioassay record is stored as a separate file within the zip archive, and each file name is prefixed with the respective bioassay identifier (AID). This download service supports the following formats:

ASN.1(Binary): Compressed and binary ASN.1 data containing bioassay descriptions and results.
XML: Compressed XML data containing bioassay descriptions and results.
XML(Description Only): Compressed XML data containing only bioassay descriptions.
CSV: Compressed CSV formatted data containing only bioassay results (no descriptions).

The ASN.1 object specification file, "pubchem.asn", and the XML object schema file, "pubchem.xsd", are available at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/.
Detailed information for the column definition in the CSV file can be found at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/CSV/README
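
When a result archive holds several bioassay records, the AID prefix in each file name can be used to sort the files by assay. The following Python sketch groups archive members by that prefix using the standard zipfile module. The exact file naming is not spelled out here, so the pattern below (leading digits, optionally preceded by "AID" or "AID_") is an assumption, as is the function name members_by_aid; adjust the pattern to match the files you actually receive.

```python
import re
import zipfile
from typing import Dict, List

# Assumption: the AID is the leading run of digits in the file name,
# optionally preceded by "AID" or "AID_".
AID_PREFIX = re.compile(r"^(?:AID_?)?(\d+)", re.IGNORECASE)


def members_by_aid(archive) -> Dict[int, List[str]]:
    """Group the member names of a zip archive by their AID prefix."""
    groups: Dict[int, List[str]] = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            base = name.rsplit("/", 1)[-1]  # ignore any folder part
            match = AID_PREFIX.match(base)
            if match:
                groups.setdefault(int(match.group(1)), []).append(name)
    return groups
```

The archive argument may be a path or an open binary file object, as accepted by zipfile.ZipFile.
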

Download bioactivity summary for a single chemical

To view/download a bioactivity summary for a single chemical, simply append the chemical's PubChem Compound identifier (CID) or PubChem Substance identifier (SID) to the appropriate URL below:

CID:

http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?cid=myCID
for example:
http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?cid=3059

SID:

http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?sid=mySID
for example:
http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?sid=855794

The resulting page will contain a tabular display of all bioassays that tested the chemical, sorted by bioactivity outcomes and endpoints. Use the Download CSV button to download a comma separated value (CSV) file of the data.
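
For scripted use, the URL patterns above can be assembled programmatically. This Python sketch builds the summary URL for a CID or SID with the standard urllib.parse module; the function name bioactivity_summary_url is an illustrative assumption, and the base URL is the one given above.

```python
from urllib.parse import urlencode

ASSAY_SUMMARY_URL = "http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi"


def bioactivity_summary_url(identifier: int, id_type: str = "cid") -> str:
    """Build the bioactivity summary URL for one CID or SID."""
    if id_type not in ("cid", "sid"):
        raise ValueError("id_type must be 'cid' or 'sid'")
    # int() rejects non-numeric identifiers before they reach the URL.
    return f"{ASSAY_SUMMARY_URL}?{urlencode({id_type: int(identifier)})}"


print(bioactivity_summary_url(3059))
print(bioactivity_summary_url(855794, id_type="sid"))
```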

Revised 22 June 2011