Translator Framework
The translator framework lets authors build web translators without most of the boilerplate usually required for new translators, making it possible to write simple content scrapers in just a few lines of JavaScript.
The framework was written and contributed by Erik Hetzner and is licensed under the GPLv3+. It currently resides at http://e6h.org/~egh/hg/zotero-transfw/, but there are plans to include it in Zotero itself.
To use the framework, simply insert the framework code at the beginning of your translator, after the translator information block (JSON header). If you are using Scaffold to develop your translator, you won't see the information block, and you can just insert the framework at the top of the code box. The latest version of the code is here.
You'll start writing beneath the line that reads:
/* End generic code */
Example Translator
From APN.ru.js (GPLv3+ licensed):
function detectWeb(doc, url) { return FW.detectWeb(doc, url); }
function doWeb(doc, url) { return FW.doWeb(doc, url); }

/** Articles */
FW.Scraper({
  itemType         : 'newspaperArticle',
  detect           : FW.Xpath('//div[@class="block_div"]/div/*[@class="article_title"]'),
  title            : FW.Xpath('//div[@class="block_div"]/div/*[@class="article_title"]').text().trim(),
  attachments      : FW.Url().replace(/article/,"print").makeAttachment("text/html", "APN.ru Printable"),
  creators         : FW.Xpath('//div[@class="block_div"]/div/a[@class="pub_aname"]').text().cleanAuthor("author"),
  date             : FW.Xpath('//div[@class="block_div"]/div/span[@class="pub_date"]').text(),
  publicationTitle : "Агенство политических новостей"
});

/** Search results */
FW.MultiScraper({
  itemType : "multiple",
  detect   : FW.Xpath('//div[@class="search_content"]'),
  titles   : FW.Xpath('//div[@class="search_content"]/div/a[@class="searchtitle"]').text(),
  urls     : FW.Xpath('//div[@class="search_content"]/div/a[@class="searchtitle"]').key('href').text()
});
This is the functional portion of a real, working web translator using the translator framework. It defines two scrapers, in this case one for newspaper articles and one for multiple result pages.
This is the general model for creating a translator using the framework – define several scrapers that are triggered by different kinds of page content or URLs.
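The dispatch idea behind this model can be sketched without the framework itself. The following is a minimal, self-contained illustration and not the framework's actual code: registerScraper, detectPage, and the plain-object page stand-in are hypothetical names, and detection here is a simple URL predicate rather than an XPath query.

```javascript
// Hypothetical stand-in for a registry of scraper definitions.
const scrapers = [];

function registerScraper(definition) {
  scrapers.push(definition);
}

// Return the itemType of the first scraper whose detect condition matches,
// or false when no registered scraper applies to the page.
function detectPage(page) {
  for (const scraper of scrapers) {
    if (scraper.detect(page)) {
      return scraper.itemType;
    }
  }
  return false;
}

// One scraper for single articles, one for result listings.
registerScraper({
  itemType: "newspaperArticle",
  detect: (page) => page.url.includes("/article/"),
});
registerScraper({
  itemType: "multiple",
  detect: (page) => page.url.includes("/search"),
});

console.log(detectPage({ url: "http://example.com/article/123" }));     // newspaperArticle
console.log(detectPage({ url: "http://example.com/search?q=zotero" })); // multiple
console.log(detectPage({ url: "http://example.com/about" }));           // false
```

Scrapers are tried in registration order in this sketch, so a more specific detect condition would need to be registered before a more general one.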
Scrapers
As the example translator above shows, there are two kinds of scrapers in the framework, defined using the functions FW.Scraper() and FW.MultiScraper(). The first kind identifies item metadata for a single item from a single page, while the second kind identifies item page URLs on a single page and is usually used for things like search results or journal issue tables of contents.
Both kinds of scrapers are defined by passing an object with the scraper's item type (itemType), detect conditions (detect) and other keys to the corresponding function.
FW.Scraper
- Required keys: detect, itemType
- Optional keys: attachments, all Zotero item fields
FW.MultiScraper
- Required keys: detect, itemType, titles, urls
- Optional keys: attachments
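The required keys listed above can be checked mechanically before a definition is handed to the framework. The sketch below is only an illustration: REQUIRED_KEYS and missingKeys are hypothetical names that do not exist in the framework, and the placeholder detect values stand in for real FW.Xpath conditions.

```javascript
// Required keys per scraper kind, as listed above.
const REQUIRED_KEYS = {
  Scraper: ["detect", "itemType"],
  MultiScraper: ["detect", "itemType", "titles", "urls"],
};

// Hypothetical helper (not part of the framework): return the required
// keys that a scraper definition is missing for the given kind.
function missingKeys(kind, definition) {
  const required = REQUIRED_KEYS[kind];
  if (!required) {
    throw new Error("Unknown scraper kind: " + kind);
  }
  return required.filter((key) => !(key in definition));
}

// A complete single-item definition has nothing missing.
const article = { itemType: "newspaperArticle", detect: true };
console.log(missingKeys("Scraper", article).length); // 0

// A multi-item definition without titles and urls is incomplete.
const results = { itemType: "multiple", detect: true };
console.log(missingKeys("MultiScraper", results).join(", ")); // titles, urls
```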
Delegation
It is possible to have a translator using this framework delegate processing to another translator by setting the key itemTrans, as in this example from the framework-derived version of the Google Scholar translator:
itemTrans : FW.DelegateTranslator({
  translatorType : "import",
  translatorId   : "9cb70025-a888-4a29-a210-93ec52da40d4"
}),
Functions
Functions that can be used with the framework.
Main functions
FW.PageText ( )
FW.Url ( )
FW.Xpath ( expression )
FW.Scraper ( {..} )
FW.MultiScraper ( {..} )
String functions
prepend ( text )
append ( text )
remove ( regex, flags )
  Note: empty entries are dropped silently, so this can be used to filter.
trim ( )
trimInternal ( )
match ( regex, [ group ] )
capitalizeTitle ( )
  Open question: should this support a flag?
unescapeHTML ( text )
unescape ( text )
key ( key )
split ( regex )
join ( separator )
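The way these string functions chain can be illustrated with a small stand-alone sketch. It is not the framework's implementation: the Chain class is hypothetical, and it assumes that each function is applied to every extracted value in turn and that values reduced to empty strings are discarded, as the note on remove suggests.

```javascript
// Hypothetical illustration of filter chaining; not the framework's code.
class Chain {
  constructor(values) {
    this.values = values;
  }

  // Apply fn to every value, then silently drop empty results.
  map(fn) {
    return new Chain(this.values.map(fn).filter((value) => value !== ""));
  }

  prepend(text) {
    return this.map((value) => text + value);
  }

  append(text) {
    return this.map((value) => value + text);
  }

  remove(regex, flags) {
    const pattern = new RegExp(regex, flags);
    return this.map((value) => value.replace(pattern, ""));
  }

  trim() {
    return this.map((value) => value.trim());
  }
}

// The middle entry becomes empty after remove() and is dropped.
const authors = new Chain(["  By Jane Doe ", "By ", " By John Smith"])
  .trim()
  .remove("^By\\s*", "i")
  .values;

console.log(authors.join("; ")); // Jane Doe; John Smith
```

Because every step returns a new chain, the order of calls matters: trimming before remove lets the anchored pattern match at the start of each value.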
Zotero functions
cleanAuthor ( text, useComma )
makeAttachment ( type, title )