Harvester Documentation
  
  
  
  
Overview
The jOAI harvester is used to retrieve metadata records from remote OAI data providers and save them to the local file system, one record per file. In addition, harvested records may be packaged into zip archives that can be downloaded and opened through the harvester's web-based interface. The harvester can be configured to harvest automatically at regular intervals, effectively maintaining a mirror of the remote repository on the local file system.
The jOAI harvester supports OAI protocol versions 1.1 and 2.0, including data providers that use resumption tokens for flow control, selective harvesting by date or set, gzip response compression, and other protocol features.
See the Harvester FAQ for additional information.
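
As an illustration of selective harvesting and resumption-token flow control, the following Python sketch builds OAI-PMH 2.0 ListRecords request URLs. The function name list_records_url and its parameters are hypothetical and are not part of jOAI; the sketch only shows how such requests are formed under the OAI-PMH 2.0 rules, not how the jOAI harvester is implemented.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix=None, set_spec=None,
                     from_date=None, until_date=None, resumption_token=None):
    """Build an OAI-PMH 2.0 ListRecords request URL (hypothetical helper)."""
    if resumption_token is not None:
        # In OAI-PMH 2.0 the resumptionToken is an exclusive argument:
        # a follow-up request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        if metadata_prefix is None:
            raise ValueError("metadata_prefix is required for the first request")
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        # Optional selective-harvesting arguments: date range and set.
        if from_date is not None:
            params["from"] = from_date
        if until_date is not None:
            params["until"] = until_date
        if set_spec is not None:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)
```

A first request might be built with list_records_url(base_url, "oai_dc", from_date="2020-01-01"). If the response contains a resumption token, the next request passes only that token, and the cycle repeats until the provider returns no further token.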
 
 
Harvester setup
1. Install the jOAI software in a servlet container such as Apache Tomcat.
  See INSTALL.txt for installation instructions. If you are reading this page, this step has most likely been completed.
 
  2. Complete the harvester setup. Add a new harvest and provide the following:
  - Enter a repository name (required)
 
  - Provide a repository base URL that starts with http:// (required)
 
  - Include a setSpec (optional)
 
  - Provide the metadata format being harvested (required) 
 
  - Indicate whether the harvest should occur at regular intervals (optional)
 
  - Indicate where metadata files should be saved (required)
 
  - Indicate how metadata files are saved (split by set or not)
 
 
The repository name is a name that describes the data provider being harvested. The harvester status table is organized as an alphabetical listing of repository names.
  The base URL is the access point of a data provider (a web address that starts with http://).
  The harvested metadata format can be any metadata format, as long as it matches a metadata format offered by the provider being harvested. Use the OAI ListMetadataFormats request to find the metadata formats available at the provider. A ListMetadataFormats request looks like:
  http://some.provider.org/base/url?verb=ListMetadataFormats
  That is, concatenate [base URL] + [?verb=ListMetadataFormats].
  The ListMetadataFormats request returns an XML document in which each metadataPrefix element names an available metadata format.
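
The request and response described above can be sketched in Python as follows. The helper names and the sample response are hypothetical, and the sketch assumes an OAI-PMH 2.0 provider (it matches elements in the 2.0 XML namespace only); it is not code from jOAI.

```python
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical, abbreviated response used only for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>adn</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def list_metadata_formats_url(base_url):
    """Concatenate [base URL] + [?verb=ListMetadataFormats]."""
    return base_url + "?verb=ListMetadataFormats"


def metadata_prefixes(response_xml):
    """Return the text of every metadataPrefix element, in document order."""
    root = ET.fromstring(response_xml)
    return [el.text.strip()
            for el in root.iter(OAI_NS + "metadataPrefix")
            if el.text]


print(list_metadata_formats_url("http://some.provider.org/base/url"))
print(metadata_prefixes(SAMPLE_RESPONSE))
```

Any of the returned prefixes could then be entered as the metadata format in the harvester setup.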
   
  Harvest automatically at regular intervals means that a time interval (days/hours/minutes/seconds) can be specified that tells the jOAI harvester when and how often to perform an automatic harvest that checks for new and updated records.
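
To make the interval idea concrete, here is a minimal Python sketch that combines days, hours, minutes, and seconds into one interval and decides whether the next harvest is due. The function names are hypothetical, and this is only an illustration of interval-based scheduling; it is not how jOAI schedules harvests internally.

```python
from datetime import datetime, timedelta


def harvest_interval(days=0, hours=0, minutes=0, seconds=0):
    """Combine the interval fields into a single timedelta."""
    interval = timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
    if interval <= timedelta(0):
        raise ValueError("the harvest interval must be positive")
    return interval


def next_harvest_time(last_harvest, interval):
    """Time at which the next automatic harvest should start."""
    return last_harvest + interval


def is_harvest_due(last_harvest, interval, now):
    """True once the interval has elapsed since the last harvest."""
    return now >= next_harvest_time(last_harvest, interval)
```

For example, an interval of 1 day and 12 hours after a harvest at midnight on 1 January would make the next harvest due at noon on 2 January.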
   
  Saving files at the default harvest location means metadata files are saved to a directory within the OAI application context, generally of the form "~oai/WEB-INF/harvested_records/". To view the full path of this default directory, click the save files help button (the question mark).
   
  Saving files to a non-default harvest location means metadata files are saved either to a user-specified location, for which the full directory path is provided, or to a recently used location.
   
  If a SetSpec is specified, metadata files are saved as a group. If a SetSpec is not specified, metadata files can be saved into one large group (the 'do not split by set' option) or into many groups (the 'split by set' option), depending on how the provider being harvested is organized. The default save option is 'do not split by set'.
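
The difference between the two save options can be pictured with a short Python sketch. Everything here is hypothetical: the function name, the group name "all_records", and the simplification that each record carries at most one setSpec are assumptions made for illustration, and the sketch says nothing about the directory layout jOAI actually uses.

```python
from collections import defaultdict


def group_records(records, split_by_set=False, default_group="all_records"):
    """Group (identifier, set_spec) pairs the way the two save options suggest.

    With split_by_set=False every record lands in one group. With
    split_by_set=True records are grouped by their setSpec; records
    without a setSpec fall back to the default group.
    """
    groups = defaultdict(list)
    for identifier, set_spec in records:
        if split_by_set and set_spec:
            groups[set_spec].append(identifier)
        else:
            groups[default_group].append(identifier)
    return dict(groups)


# Hypothetical records: (OAI identifier, setSpec or None).
records = [("oai:example:1", "geology"),
           ("oai:example:2", "oceans"),
           ("oai:example:3", None)]
print(group_records(records))                     # one large group
print(group_records(records, split_by_set=True))  # one group per set
```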
    
 
Harvest test files
Conduct a test harvest by completing the harvester setup described above, using the following information:
  - Repository name: DLESE
 
  - Repository base URL: http://dlese.org/oai/provider
 
  - Metadata format: adn
 
 
Leave all other fields blank and save the entry.
 
  On the Harvester Setup and Status page, click 'View harvest history' to see the harvest being performed. Click 'Refresh page' to see the number of metadata files increase. The entire harvest may take several minutes to complete.
 
  The test harvest is successful if the metadata files can be viewed by one of these methods from the Harvester Setup and Status page:
 
  - Locate and go to the 'Harvested to' directory on the server and view the files.
 
  
  - If zipping of files was enabled, under 'Download zipped harvest', click on 'Most recent'. Save the zip file to your Desktop, unzip it, and view the harvested records.
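
As an alternative to unzipping by hand, the contents of a downloaded archive can be listed with a few lines of Python. This is a sketch under assumptions: the function name is hypothetical, and it assumes the harvested records inside the archive are files ending in .xml, which this page does not state.

```python
import zipfile


def list_harvested_records(zip_source):
    """Return the sorted names of the .xml entries in a zip archive.

    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(name for name in archive.namelist()
                      if name.lower().endswith(".xml"))
```

Calling list_harvested_records with the path of the saved zip file returns the record file names; a non-empty list suggests the test harvest wrote records.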
 
 
   
 
Registered data providers 
The Open Archives Initiative maintains a list  of registered data providers that can be harvested. 
 
The Java Harvester API 
The jOAI code base includes a Harvester API that may be used in Java programs to harvest from OAI data providers. The API is part of the DLESETools.jar Java library, found in the $tomcat/webapps/oai/WEB-INF/lib/ directory of the jOAI installation. See the Harvester Javadoc for details. Use of the API assumes familiarity with the Java programming language.
 
Harvest, validate and transform from the command line 
Linux shell scripts included in the jOAI distribution allow you to perform OAI harvests, XML validation, and XML transformations from the command line.
  - To install: See the instructions provided in the README and script files located in the jOAI installation at $tomcat/webapps/oai/WEB-INF/bin/. Once installed, the scripts do not require the jOAI web application in order to be used.
 
harvest - This script performs harvests from OAI data providers and saves the harvested records as individual files on disk. It accepts options to harvest by date range or set, and to vary how the metadata is written to files. It is a wrapper around the Java Harvester API mentioned above.
validate - This script performs schema validation on a single XML file or batch validation on a directory of files, outputting a summary report of the results. 
transform - This script performs an XSL transformation on a single XML file or batch transformations on a directory of files, outputting the transformed XML files to a directory.  
  
  
  
  