FTP Service

The PMC FTP Service may be used to download the source files for any article in the PMC Open Access Subset, to associate PMC articles with identifiers such as PubMed IDs, DOIs, Manuscript IDs, and ISSNs, and as a source for data mining. The contents of the PMC FTP site are normally updated once a day. If you have questions or comments about the PMC FTP site, please write to oai@ncbi.nlm.nih.gov.

1. Source Files from the PMC Open Access Subset

This FTP service may be used to download the source files for any article in the PMC Open Access Subset. The source files for an article may include:

  1. A .nxml file, which is XML for the full text of the article, encoded in the NLM Journal Archiving and Interchange DTD.
  2. A PDF file of the article.
  3. Image files from the article, and graphics for display versions of mathematical equations or chemical schemes.
  4. Supplementary data, such as background research data or videos.

The URL to access the FTP site is ftp://ftp.ncbi.nlm.nih.gov/pub/pmc

All the source files for an article are packaged in a single .tar.gz file. The FTP site has a two-level-deep folder (directory) structure. Folder names are randomly generated and the .tar.gz file for an article is randomly assigned to a second-level folder.
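As an illustration of the packaging described above, the following Python sketch builds the URL of one article package from its two-level folder path and lists the files inside a downloaded .tar.gz. The function names (package_url, fetch_package, list_package_members) are hypothetical helpers written for this example, not part of any PMC tooling, and the sketch assumes the server still accepts anonymous FTP requests at the URL given above.

```python
import io
import tarfile
import urllib.request

# Base URL of the PMC FTP site, as stated in this document.
BASE_URL = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc"


def package_url(relative_path: str) -> str:
    """Return the full URL for a package path such as '08/dc/name.tar.gz'."""
    return f"{BASE_URL}/{relative_path.lstrip('/')}"


def fetch_package(relative_path: str) -> bytes:
    """Download one article package and return its raw bytes (needs network access)."""
    with urllib.request.urlopen(package_url(relative_path)) as response:
        return response.read()


def list_package_members(tar_bytes: bytes) -> list[str]:
    """List the file names stored in a gzip-compressed tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as archive:
        return archive.getnames()
```

A typical use would be list_package_members(fetch_package("08/dc/PLoS_Biol-2-9-490026.tar.gz")), which should return names such as the .nxml file, the PDF, and any images, depending on what the package contains.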

2. Finding Data in the File Lists

The files file_list.txt and file_list.csv each provide a single index of the articles available from PMC, in two different formats.

file_list.txt

Location: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.txt

Contents:

  • The fully qualified name of the .tar.gz file for an article.
  • The article citation:
    • journal title abbreviation
    • publication date
    • volume
    • issue
  • The PMC accession number:
    • a unique, persistent ID for the article that can be used to display it in PMC

Sample:

  • 08/dc/PLoS_Biol-2-9-490026.tar.gz PLoS Biol. 2004 Sep; 2(9):e259 PMC490026
  • 65/25/Nucleic_Acids_Res-33-7-1079965.tar.gz Nucleic Acids Res. 2005; 33(7):2129-2140 PMC1079965
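A line of file_list.txt can be split into its three parts with a few lines of Python. This is a minimal sketch that assumes only what the samples show: the first whitespace-delimited token is the package path, the last is the PMC accession number, and everything in between is the citation. The exact delimiter used in the real file is not stated here, so the sketch splits on any whitespace and rejoins the citation with single spaces. FileListEntry and parse_txt_line are names invented for this example.

```python
from typing import NamedTuple


class FileListEntry(NamedTuple):
    path: str        # location of the .tar.gz file, relative to the FTP root
    citation: str    # journal abbreviation, date, volume, issue
    accession: str   # PMC accession number, e.g. "PMC490026"


def parse_txt_line(line: str) -> FileListEntry:
    """Split one file_list.txt line into path, citation, and accession number."""
    tokens = line.split()
    if len(tokens) < 3 or not tokens[-1].startswith("PMC"):
        raise ValueError(f"unexpected file_list.txt line: {line!r}")
    # Runs of whitespace inside the citation are collapsed to single spaces.
    return FileListEntry(tokens[0], " ".join(tokens[1:-1]), tokens[-1])
```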

file_list.csv

Location: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv

Contents: Fields are the same as above, except separated by commas, and with the addition of a trailing timestamp indicating the last update to the article in PMC.

Sample:

  • 8d/2f/Int_J_Health_Geogr_2003_Sep_25_2_7.tar.gz,Int J Health Geogr. 2003 Sep 25; 2:7,PMC222916,2011-05-06 15:30:22
  • 1b/85/Retrovirology_2005_Feb_7_2_7.tar.gz,Retrovirology. 2005 Feb 7; 2:7,PMC549042,2011-05-06 15:35:55
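The CSV variant can be read with Python's standard csv module. The sketch below assumes the four-field layout shown in the samples (path, citation, accession number, timestamp) and a timestamp format of YYYY-MM-DD HH:MM:SS. Rows that do not fit this layout, such as a possible header line, are skipped rather than treated as errors. parse_csv_lines is a hypothetical helper name.

```python
import csv
from datetime import datetime


def parse_csv_lines(lines):
    """Yield one dict per file_list.csv row that matches the sampled layout."""
    for row in csv.reader(lines):
        if len(row) != 4:
            continue  # not the four-field layout shown in the samples
        path, citation, accession, stamp = row
        try:
            last_updated = datetime.strptime(stamp.strip(), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # e.g. a header row or an unexpected timestamp format
        yield {
            "path": path,
            "citation": citation,
            "accession": accession,
            "last_updated": last_updated,
        }
```

Because the function accepts any iterable of strings, it can be fed an open text file directly, which avoids loading the whole index into memory.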

3. Displaying an Article in PMC

Insert the accession number (accid) into the PubMed Central URL:
http://www.ncbi.nlm.nih.gov/pmc/articles/<accid>/

For example:

To display PLoS Biol. 2004 Sep; 2(9):e259.
Accession number: PMC490026 (From the sample entries above)
Use: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC490026/
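The same substitution can be done in code. This small Python sketch fills the accession number into the URL pattern given above. The check that an accession number is "PMC" followed by digits is an assumption based on the samples in this document, and article_url is a name chosen for this example.

```python
import re


def article_url(accid: str) -> str:
    """Build the PMC display URL for an accession number such as 'PMC490026'."""
    # Assumption: accession numbers look like "PMC" followed by digits.
    if not re.fullmatch(r"PMC\d+", accid):
        raise ValueError(f"not a PMC accession number: {accid!r}")
    return f"http://www.ncbi.nlm.nih.gov/pmc/articles/{accid}/"
```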

To find an article from PMC on the FTP site:

Copy the PMC accession number from the PubMed Central URL by highlighting it in your browser's address bar and copying it to the clipboard. Open the file file_list.txt, use the “find on this page” function (Ctrl+F), and paste (Ctrl+V) the accession number into the “find on this page” dialog box.
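For scripted lookups, the manual search can be approximated in Python. The sketch below pulls the accession number out of a PubMed Central URL and then scans the lines of file_list.txt for an entry whose last token equals it. Matching the whole token, rather than a substring, avoids a hit on a longer accession number that merely starts with the same digits. Both function names are hypothetical, and the line layout is assumed to match the file_list.txt samples.

```python
import re
from typing import Iterable, Optional


def accession_from_url(url: str) -> Optional[str]:
    """Extract an accession number such as 'PMC490026' from a PMC article URL."""
    match = re.search(r"/pmc/articles/(PMC\d+)", url)
    return match.group(1) if match else None


def find_package_path(lines: Iterable[str], accession: str) -> Optional[str]:
    """Return the .tar.gz path listed for the given accession number, if any."""
    for line in lines:
        tokens = line.split()
        # The accession number is the last token on each line; the path is the first.
        if len(tokens) >= 2 and tokens[-1] == accession:
            return tokens[0]
    return None
```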

4. Obtaining DOIs and PubMed IDs for Articles in PMC

Use PMC-ids.csv.gz to associate PMC articles with a PMC accession number, a PubMed ID, and the corresponding DOI.

PMC-ids.csv.gz is a gzip-compressed, comma-separated file with the following fields:

  • Journal Title
  • ISSN
  • Electronic ISSN
  • Publication Year
  • Volume
  • Issue
  • Page
  • DOI (if available)
  • PMC accession number
  • PubMed ID (if available)
  • Manuscript ID (if available)
  • Release Date (Mmm DD YYYY or live)

Format:

Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMC accession number,PMID,Manuscript Id,Release Date

Sample entries:
  • Mol Biol Cell,1059-1524,1939-4586,2000,11,6,2019, ,PMC14900,10848626, ,live
  • J Neurosci,0270-6474,1529-2401,2005,25,24,5740,10.1523/JNEUROSCI.0913-05.2005,PMC1201448,15958740,NIHMS3372,live
  • Cancer Res,0008-5472,1538-7445,2007,67,17,8022,10.1158/0008-5472.CAN-06-3749,PMC1986634,17804713,NIHMS25090,Sep 1 2008
  • Proc Natl Acad Sci U S A,0027-8424,1091-6490,2007,104,43,17075,10.1073/pnas.0707060104,PMC2040460,17940018, ,live
  • Cell Host Microbe,1931-3128,1934-6069,2007,2,6,404,10.1016/j.chom.2007.09.014,PMC2184509,18078692,NIHMS36164,live
  • Proc Natl Acad Sci U S A,0027-8424,1091-6490,2008,105,21,7382,10.1073/pnas.0711174105,PMC2396716,18495922, ,Nov 27 2008
  • PLoS Med,1549-1277,1549-1676,2008, ,Immediate Access,e168,10.1371/journal.pmed.0050168,PMC2494565,18684010, ,live

Note:

  • If any information is not available, the corresponding field contains a single blank space.
  • Articles that show a Release Date are under embargo and not yet available on the PMC public site.
  • When embargoed articles are released to the PMC public site, the Release Date field value changes to "live".
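The notes above translate into a short reader for PMC-ids.csv.gz. This Python sketch decompresses the file, maps each row onto the twelve fields listed earlier, turns blank fields into None, and flags a record as embargoed when its Release Date is anything other than "live". The short field names, the UTF-8 encoding, and the function name read_pmc_ids are assumptions made for this example. Rows whose accession field is not "PMC" followed by digits, such as a header line, are skipped.

```python
import csv
import gzip
import re

# Short names for the twelve columns, in the order given in this document.
FIELDS = [
    "journal", "issn", "eissn", "year", "volume", "issue", "page",
    "doi", "pmcid", "pmid", "manuscript_id", "release_date",
]


def read_pmc_ids(source):
    """Yield one dict per article from a PMC-ids.csv.gz path or binary file object."""
    # Assumption: the file is UTF-8 encoded text once decompressed.
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as text:
        for row in csv.reader(text):
            if len(row) != len(FIELDS):
                continue
            # Unavailable values appear as a blank space; store them as None.
            record = {name: (value.strip() or None) for name, value in zip(FIELDS, row)}
            if record["pmcid"] is None or not re.fullmatch(r"PMC\d+", record["pmcid"]):
                continue  # header line or malformed row
            release = record["release_date"]
            # A date in this field means the article is still under embargo.
            record["embargoed"] = release is not None and release != "live"
            yield record
```

Filtering on record["embargoed"] then separates articles already on the PMC public site from those that are not yet available.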

5. XML for Data Mining via FTP

The files below contain XML (and only XML) files for ALL PMC open access articles. These files were created for users who need PMC XML for data mining and processing purposes, but do not need PDFs, images, or supplementary data.

6. Suggested FTP Client Configuration

After a series of experiments using FTP clients with NCBI's FTP server, we have found that the configuration of an FTP client can seriously affect performance.

NCBI recommends setting the TCP buffer size to 32Mb.

For more information on FTP configuration, please see the US Department of Energy's Guide to Bulk Data Transfer over a WAN.

Last updated: Thu, 08 Sep 2011