Legacy Scraper

This module for Drupal 4.7 (PHP 5 with the XML DOM extension required) is not currently supported. It was created by David Donahue in 2006, and while Agaric Design Collective has adopted the scraper project and added a separate module for very basic scraping that uses a different method, we have not had reason to dive into this code. If you do, or want to add another module to the scraper family, let us know and we can add you as a maintainer of the project on Drupal.org (if you have CVS access).

Legacy Scraper scrapes data from web pages. This data can then be imported into a Drupal site as nodes (via CSV) or used for any other purpose.

To scrape data from a web page, an administrator creates a "scraper job", which she configures to point to the URLs of interest. She adds form field settings to permit the scraper to traverse logins, pagination, etc. She adds instructions for how to find the fields of data for each record, using a combination of PHP and XPath. The module includes a library of regular expression functions to permit extraction of phone numbers, postal codes, dates, times, etc. Scraper is capable of complex functionality, including submitting forms with dynamic form information and firing off subjobs from a job.
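
To give a feel for the underlying technique, here is a minimal sketch of XPath-plus-regex extraction in plain PHP 5. This is not the module's own API or job format; the URL, XPath expression, and function name are hypothetical, stand-ins for what you would configure in a scraper job.

<?php
// Illustrative sketch only (not module code): locate each record with
// XPath, then pull a specific field out of the text with a regular
// expression, much like the module's bundled extraction functions for
// phone numbers, postal codes, dates, and times.
function example_scrape_phone_numbers($url) {
  $html = file_get_contents($url);

  // Suppress warnings from real-world (often invalid) markup.
  $doc = new DOMDocument();
  @$doc->loadHTML($html);

  $xpath = new DOMXPath($doc);
  // The XPath expression locates each record's container; adjust it to
  // the structure of the target page.
  $cells = $xpath->query('//table[@id="directory"]//tr/td[2]');

  $records = array();
  foreach ($cells as $cell) {
    if (preg_match('/\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}/', $cell->textContent, $matches)) {
      $records[] = $matches[0];
    }
  }
  return $records;
}
?>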

This module requires PHP 5 with the XML DOM and Tidy extensions. To use Legacy Scraper, you need some proficiency in PHP and a willingness to learn XPath.
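
As a rough illustration of why both extensions matter (again a sketch under assumptions, not code from the module, and the URL is hypothetical): real-world markup is often too broken for the XML DOM on its own, so the Tidy extension can repair it before parsing and querying.

<?php
// Minimal sketch (not module code): repair messy markup with the Tidy
// extension before handing it to the DOM/XPath layer.
$html = file_get_contents('http://example.com/listing');

// tidy_repair_string() returns well-formed XHTML that DOMDocument can
// parse without choking on unclosed tags or stray entities.
$clean = tidy_repair_string($html, array('output-xhtml' => TRUE), 'utf8');

$doc = new DOMDocument();
$doc->loadHTML($clean);

$xpath = new DOMXPath($doc);
echo $xpath->query('//title')->item(0)->textContent;
?>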

Do read the notes on using XPath with this module; they would also serve as a basis for an improved version of this module.
