slav0nic
nuklea
Without a code example, hardly anyone will be able to help you solve "exactly the same problem".
Yes, of course.
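from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class AliexpressSpider(CrawlSpider):
    name = 'aliexpress'
    allowed_domains = ['www.aliexpress.com']
    start_urls = ['http://www.aliexpress.com/all-wholesale-products.html']
    # Extract category links and pass each category page to parse_category
    rules = [Rule(SgmlLinkExtractor([r'/category/(\d+)/.+\.html$']), 'parse_category')]

    def parse_category(self, response):
        # Page-parsing code (omitted)
        return l.load_item()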
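To see which links the rule is expected to pick up, here is a minimal sketch that tests the same pattern with plain `re`, outside Scrapy. It assumes the link extractor applies the pattern with search semantics (a match anywhere in the URL). The helper name `matches_category` is hypothetical, and the sample URLs are taken from the crawl log in this post.

```python
import re

# Same pattern as in the spider's Rule
CATEGORY_RE = re.compile(r'/category/(\d+)/.+\.html$')


def matches_category(url):
    """Return the captured category id if the URL matches the pattern, else None."""
    match = CATEGORY_RE.search(url)
    return match.group(1) if match else None


if __name__ == '__main__':
    sample_urls = [
        'http://www.aliexpress.com/all-wholesale-products.html',
        'http://www.aliexpress.com/category/3/apparel-accessories.html',
        'http://www.aliexpress.com/category/200000369/Car-Electronics.html',
    ]
    for url in sample_urls:
        print(url, '->', matches_category(url))
```

The pattern does not care about letter case in the slug, so both the mixed-case URLs and their lowercase redirect targets pass this check.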
Output:
2012-07-25 18:05:59+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: Grabber)
2012-07-25 18:05:59+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, CoreStats, MemoryUsage, SpiderState
2012-07-25 18:05:59+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-25 18:05:59+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-25 18:05:59+0800 [scrapy] DEBUG: Enabled item pipelines: CategoryFiltersPipeline, SaveCategoryPipeline
2012-07-25 18:05:59+0800 [aliexpress] INFO: Spider opened
2012-07-25 18:05:59+0800 [aliexpress] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-25 18:05:59+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-25 18:06:00+0800 [aliexpress] DEBUG: Crawled (200) <GET http://www.aliexpress.com/all-wholesale-products.html> (referer: None)
2012-07-25 18:06:02+0800 [aliexpress] DEBUG: Crawled (200) <GET http://www.aliexpress.com/category/3/apparel-accessories.html> (referer: http://www.aliexpress.com/all-wholesale-products.html)
2012-07-25 18:06:08+0800 [aliexpress] DEBUG: Scraped from <200 http://www.aliexpress.com/category/3/apparel-accessories.html>
{'filters': {},
'item_id': 3,
'orig_name': u'Apparel & Accessories',
'remote_url': 'http://www.aliexpress.com/category/3/apparel-accessories.html',
'uri': u'apparel-accessories'}
2012-07-25 18:06:23+0800 [aliexpress] DEBUG: Redirecting (301) to <GET http://www.aliexpress.com/category/200000369/car-electronics.html> from <GET http://www.aliexpress.com/category/200000369/Car-Electronics.html>
2012-07-25 18:06:56+0800 [aliexpress] DEBUG: Redirecting (301) to <GET http://www.aliexpress.com/category/200000361/transporting-storage.html> from <GET http://www.aliexpress.com/category/200000361/Transporting-Storage.html>
2012-07-25 18:07:02+0800 [aliexpress] DEBUG: Redirecting (301) to <GET http://www.aliexpress.com/category/200000321/tools-maintenance-care.html> from <GET http://www.aliexpress.com/category/200000321/Tools-Maintenance-Care.html>
2012-07-25 18:07:02+0800 [aliexpress] INFO: Crawled 2 pages (at 2 pages/min), scraped 1 items (at 1 items/min)
2012-07-25 18:07:04+0800 [aliexpress] DEBUG: Redirecting (301) to <GET http://www.aliexpress.com/category/200000165/car-accessories.html> from <GET http://www.aliexpress.com/category/200000165/Car-Accessories.html>
2012-07-25 18:07:31+0800 [aliexpress] DEBUG: Redirecting (301) to <GET http://www.aliexpress.com/category/200000191/replacement-parts.html> from <GET http://www.aliexpress.com/category/200000191/Car-Styling-Parts.html>