Web Scraper for phpBB Forum

Budget: $8.0 - $25.0 HOURLY / PART_TIME ⭐ 4.96 (31) United Kingdom

crawlers, data-scraping, scrapy-framework, data-mining

Job Description:

I am looking for an experienced web scraper to extract posts and download specific file attachments from an online forum. The goal is to save the forum's historical data into structured text files (.txt or .md) and archive the PDFs so I can use them as source documents in Google NotebookLM.

Target Website: A small, niche automotive community forum running on standard phpBB software. I will share the exact URL with you via Upwork private messages so you can evaluate the site structure before accepting the contract.

Project Scope & Requirements:

Data Extraction: Scrape all threads and posts from the designated subforums. I will provide a standard user account so your script can access the required areas.

PDF Downloads: Your script must maintain the authenticated session to download any .pdf files attached to the forum posts. (You can ignore image attachments like .jpg or .png, as I only need text-based documents.)

Data Formatting: The extracted data must be clean, with no HTML tags or website navigation junk. When a post contains a PDF attachment, you must save the PDF locally and insert a reference note in the text file. Each post needs to be formatted consistently like this:

    Forum Section: [Name of Subforum]
    Thread Title: [Title of the thread]
    Author: [Username]
    Date: [Date of post]
    Post: [The actual text of the post]
    Attachments: [Attachment downloaded: exact_filename.pdf]
    ---

File Splitting: Because Google NotebookLM has a 500,000-word limit per file, you must output the data as separate .txt or .md files divided by subforum (e.g., ForumSection_1.txt, ForumSection_2.txt).

Polite Scraping: To prevent an IP ban, please implement a strict crawl delay (e.g., 2-3 seconds per request) and use a standard User-Agent. This is a personal, non-commercial project, and we do not want to strain the server.

Deliverables: I do not need the scraping code itself. The final deliverable should be a ZIP file containing:

    The clean, separated .txt or .md files.
    A folder containing all downloaded PDF attachments, retaining their original file names so they match the references in the text files.
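
For applicants who want to check their understanding of the output layout, the per-post format and the one-file-per-subforum split could look roughly like the sketch below. It is only an illustration under stated assumptions: the `Post` record, its field names, and the helpers `format_post` and `group_by_section` are hypothetical, and printing `None` for posts without PDFs is a guess, since the description does not say what to write in that case.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    # Hypothetical record; the field names are illustrative only.
    section: str
    thread_title: str
    author: str
    date: str
    text: str
    pdf_names: list[str] = field(default_factory=list)


def format_post(post: Post) -> str:
    """Render one post in the layout requested in the job description."""
    if post.pdf_names:
        attachments = " ".join(
            f"[Attachment downloaded: {name}]" for name in post.pdf_names
        )
    else:
        # Assumption: the posting does not specify the no-attachment case.
        attachments = "None"
    lines = [
        f"Forum Section: {post.section}",
        f"Thread Title: {post.thread_title}",
        f"Author: {post.author}",
        f"Date: {post.date}",
        f"Post: {post.text.strip()}",
        f"Attachments: {attachments}",
        "---",
    ]
    return "\n".join(lines) + "\n"


def group_by_section(posts: list[Post]) -> dict[str, str]:
    """Map each output file name (ForumSection_N.txt) to its full text."""
    files: dict[str, str] = {}
    numbering: dict[str, int] = {}
    for post in posts:
        # Subforums are numbered in order of first appearance.
        n = numbering.setdefault(post.section, len(numbering) + 1)
        name = f"ForumSection_{n}.txt"
        files[name] = files.get(name, "") + format_post(post)
    return files
```

Keeping the PDF file name in the `Attachments:` line identical to the name saved on disk is what lets the references in the text files match the attachments folder in the ZIP.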
