Scrape Table
By default, the package ships with an abstract Table Scraping Service class that scrapes data from the source URL, traces the table structure, and generates data matching the traced structure.

To extract table contents, extend this abstract class. Keep in mind that it does not implement the Table Tracer methods itself. Two traits provide these implementations (see usage in Step 2 below):

- Using PHP DOMNode: uses PHP's DOM API and supports extracting multiple tables from an HTML document.
- Using only string: uses regex and supports extracting only a single table from an HTML document.
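As a sketch of the difference, a tracer class opts into one of the two traits with a `use` statement. Note: this is illustration only — `HtmlTableFromDOMNode` is an assumed name for the DOM-based trait (only `HtmlTableFromString` appears in this guide's imports); check the package's `Traits\Table` namespace for the actual trait names.

```php
<?php
use TheWebSolver\Codegarage\Scraper\Interfaces\TableTracer;
use TheWebSolver\Codegarage\Scraper\Traits\Table\HtmlTableFromString;

// Single-table document: the regex/string-based trait is enough.
/** @template-implements TableTracer<string> */
class SingleTableTracer implements TableTracer {
	/** @use HtmlTableFromString<string> */
	use HtmlTableFromString;
}

// Multi-table document: swap in the DOM-based trait instead, e.g.:
// use TheWebSolver\Codegarage\Scraper\Traits\Table\HtmlTableFromDOMNode; // name assumed
// class MultiTableTracer implements TableTracer {
//     use HtmlTableFromDOMNode;
// }
```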
**Step 1:** Create an enum that maps to each table column by defining cases in the same order they appear at the source.
```php
/** @template-implements BackedEnum<string> */
enum DeveloperDetails: string {
	case Name    = 'name';
	case Title   = 'title';
	case Address = 'address';
	case Age     = 'age';

	public function isValid( string $value ): bool {
		return match ( $this ) {
			self::Name, self::Title => strlen( $value ) < 20,
			self::Address           => strlen( $value ) === 3,
			self::Age               => is_numeric( $value ),
		};
	}
}
```
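As a quick sanity check of the validation rules, each case validates its own column's raw string value. The snippet below repeats the enum so it runs standalone (it is a copy of the definition above, not new API):

```php
<?php
// Copy of the DeveloperDetails enum above, repeated so this snippet runs standalone (PHP 8.1+).
enum DeveloperDetails: string {
	case Name    = 'name';
	case Title   = 'title';
	case Address = 'address';
	case Age     = 'age';

	public function isValid( string $value ): bool {
		return match ( $this ) {
			self::Name, self::Title => strlen( $value ) < 20,
			self::Address           => strlen( $value ) === 3,
			self::Age               => is_numeric( $value ),
		};
	}
}

var_dump( DeveloperDetails::Name->isValid( 'John Doe' ) );     // bool(true)  — under 20 chars
var_dump( DeveloperDetails::Address->isValid( 'Ktm' ) );       // bool(true)  — exactly 3 chars
var_dump( DeveloperDetails::Address->isValid( 'Kathmandu' ) ); // bool(false) — not 3 chars
var_dump( DeveloperDetails::Age->isValid( '22' ) );            // bool(true)  — numeric
```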
**Step 2:** Then, create concretes that scrape and parse an HTML table containing accented characters. We want each table column to return a string value, so we use string as the generic return type.
```php
use DeveloperDetails;
use TheWebSolver\Codegarage\Scraper\Enums\Table;
use TheWebSolver\Codegarage\Scraper\Traits\Diacritic;
use TheWebSolver\Codegarage\Test\DOMDocumentFactoryTest;
use TheWebSolver\Codegarage\Scraper\Error\ValidationFail;
use TheWebSolver\Codegarage\Scraper\Attributes\ScrapeFrom;
use TheWebSolver\Codegarage\Scraper\Interfaces\TableTracer;
use TheWebSolver\Codegarage\Scraper\Interfaces\Validatable;
use TheWebSolver\Codegarage\Scraper\Attributes\CollectUsing;
use TheWebSolver\Codegarage\Scraper\Proxy\ItemValidatorProxy;
use TheWebSolver\Codegarage\Scraper\Service\TableScrapingService;
use TheWebSolver\Codegarage\Scraper\Interfaces\AccentedIndexableItem;
use TheWebSolver\Codegarage\Scraper\Traits\Table\HtmlTableFromString;
// NOTE: "TableTraced", "Indexable" & "MarshallTableRow" are used below but were not
// imported in the original snippet. Their namespace paths here are assumptions based
// on the package layout above — verify them against the package before use.
use TheWebSolver\Codegarage\Scraper\Event\TableTraced;
use TheWebSolver\Codegarage\Scraper\Interfaces\Indexable;
use TheWebSolver\Codegarage\Scraper\Transformer\MarshallTableRow;

/** @template-implements TableTracer<string> */
#[CollectUsing( DeveloperDetails::class, DeveloperDetails::Name )]
// To collect only the "name" and "age" table columns (omitting enum cases
// "Title" & "Address") and not use an index key, use like so:
// #[CollectUsing( DeveloperDetails::class, null, DeveloperDetails::Name, null, null, DeveloperDetails::Age )]
class DeveloperTableTracer implements TableTracer, AccentedIndexableItem, Validatable {
	/**
	 * @use HtmlTableFromString<string>
	 * Because there is only one table, we'll use the string-based trait.
	 */
	use HtmlTableFromString, Diacritic;

	/**
	 * @var list<string> $accentedItemIndices
	 * We need to transliterate the title, so we provide its index here.
	 */
	public function __construct(
		protected array $accentedItemIndices = [ DeveloperDetails::Title->value ]
	) {}

	public function indicesWithAccentedCharacters(): array {
		return $this->accentedItemIndices;
	}

	public function validate( $content ): void {
		$column = DeveloperDetails::from( $this->getCurrentItemIndex() );

		$column->isValid( $content ) || throw new ValidationFail( "Fail for {$column->value}" );
	}
}

/**
 * @template TTracer of TableTracer<string>
 * @template-extends TableScrapingService<string,TTracer>
 */
#[ScrapeFrom( 'Wiki Dev List', url: 'https://fake.wiki.org/dev-list', filename: 'single-table.html' )]
class DeveloperTableScraper extends TableScrapingService {
	/** @param TTracer $tableTracer */
	public function __construct( protected TableTracer $tableTracer, ?ScrapeFrom $scrapeFrom = null ) {
		// We also let the client provide the scraper source via constructor injection.
		// This is just for demonstration, as we already provide it as a class attribute.
		$scrapeFrom && $this->setScraperSource( $scrapeFrom );

		parent::__construct( $tableTracer );
	}

	public function parse( string $content ): Iterator {
		// Provide default transformers if the client has not provided any.
		$this->getTableTracer()->addEventListener( Table::Row, $this->hydrateWithDefaultTransformers( ... ) );

		yield from $this->currentTableIterator( $content );
	}

	protected function defaultCachePath(): string {
		// ...path/to/directory-name where the file "single-table.html" (as set in the ScrapeFrom attribute) is cached.
		// Below is the real path used for test files.
		return DOMDocumentFactoryTest::RESOURCE_PATH;
	}

	protected function hydrateWithDefaultTransformers( TableTraced $event ): void {
		$tracer = $event->tracer;

		if (
			! $tracer->hasTransformer( Table::Column )
			&& $tracer instanceof AccentedIndexableItem
			&& $tracer instanceof Validatable
		) {
			// This transformer is always used for "DeveloperTableTracer"
			// because that tracer implements both of the above interfaces.
			$tracer->addTransformer( Table::Column, new ItemValidatorProxy() );
		}

		if ( $tracer->hasTransformer( Table::Row ) ) {
			return;
		}

		$rowTransformer = new MarshallTableRow(
			invalidCountMsg: $this->getScraperSource()->name . ' ' . Indexable::INVALID_COUNT,
			indexKey: $tracer->getIndicesSource()?->indexKey
		);

		$tracer->addTransformer( Table::Row, $rowTransformer );
	}
}
```
**Step 3:** Client code performs the tasks below using the scraping and parsing concretes created at Step 2:

- Fetches content from https://fake.wiki.org/dev-list
- Caches it to Tests/Resource/single-table.html
- Parses the content of the cached single-table.html file

Using the Test Table Source as an example:

- Generates an iterator with all four columns (name | title | address | age) as a single dataset.
- Indexes each dataset by its parsed "name" value.
```php
$service = new DeveloperTableScraper( new DeveloperTableTracer() );

$service->toCache( $service->scrape() );

$parsedDataIterator = $service->parse( $service->fromCache() );

// OPTION 1: Either retrieve each dataset:
$firstDataKey = $parsedDataIterator->key();
// Will be: "John Doe"
$firstDataSet = $parsedDataIterator->current()->getArrayCopy();
// Will be: ['name' => 'John Doe', 'title' => 'PHP Developer', 'address' => 'Ktm', 'age' => '22']

$parsedDataIterator->next();
// Then, extract the second data key & set, and so on.

// OPTION 2: Or, collect everything at once as an array:
$collectAsArray = iterator_to_array( $parsedDataIterator );
// Will be: [
//   'John Doe'    => ['name' => 'John Doe', 'title' => 'PHP Developer', 'address' => 'Ktm', 'age' => '22'],
//   'Lorem Ipsum' => ['name' => 'Lorem Ipsum', 'title' => 'JS Developer', 'address' => 'Bkt', 'age' => '19']
// ]
```
- Generates an iterator with only two columns (title | age) as a single dataset.
- Does not index the dataset by any parsed value.
```php
$service = new DeveloperTableScraper(
	new DeveloperTableTracer(),
	// Use a different source but cache to the same file (for simplicity's sake, to corroborate the basic-usage examples).
	new ScrapeFrom( 'Dev List Alt', url: 'https://fake.devs.org/list', filename: 'single-table.html' )
);

$service->getTableTracer()
	->traceWithout( Table::Caption, Table::THead ) // Don't trace caption and table-head contents, if they exist.
	->addEventListener( Table::Row, static fn( TableTraced $e ) => $e->tracer->setIndicesSource( new CollectUsing(
		// Collect only the "title" and "age" columns without indexing the dataset.
		DeveloperDetails::class, null, null, DeveloperDetails::Title, null, DeveloperDetails::Age
	) ) );

// To replace the row transformer provided by $service:
// $service->addTransformer( Table::Row, new CustomRowTransformer() );

$service->toCache( $service->scrape() );

$parsedDataIterator = $service->parse( $service->fromCache() );

// OPTION 1: Either retrieve each dataset:
$firstDataKey = $parsedDataIterator->key();
// Will be: "0" (no index key provided)
$firstDataSet = $parsedDataIterator->current()->getArrayCopy();
// Will be: [ 'title' => 'PHP Developer', 'age' => '22' ]

$parsedDataIterator->next();
// Then, extract the second data key & set, and so on.

// OPTION 2: Or, collect everything at once as an array:
$collectAsArray = iterator_to_array( $parsedDataIterator );
// Will be: [ [ 'title' => 'PHP Developer', 'age' => '22' ], [ 'title' => 'JS Developer', 'age' => '19' ] ]
```