Job Scraping Platform - 1000 websites to be crawled

Budget: $1000.0 FIXED / ⭐ 4.50 (32) United Kingdom

data-scraping, clay

Legal Job Scraper

We have built the system described below but are having trouble with third-party websites such as Workday and other vendor-powered job boards whose URLs appear to change frequently. We are looking to scrape the top 1,000 law firms in the USA, many of which are powered by vendors like Workday, iCIMS, viDesktop / viRecruit, Greenhouse, Lever, etc. The issues we are having are with the vendor / ATS-powered sites.

Project Name: Legal Job Scraper

Objective
• Automatically scrape attorney job listings from 1,000+ law firm websites, filter by AI, and push relevant results to Airtable. The system must maintain a list of open jobs at any point in time, not just feed in new jobs; i.e. we need to know when a job closes, or at least have a list of which jobs are open at any point in time.

Input: A list of target website URLs (provided by the user as a standalone URL or a CSV file with multiple URLs).
Output: Filtered attorney job records in an Airtable base.
Out of Scope: Airtable matching logic and downstream candidate processing.

1. System Overview
The system consists of three sequential stages:

Stage | Name | Description
1 | Web Scraper | Crawls career/jobs pages of target websites and extracts raw job listings.
2 | AI Filter | Analyzes each listing and discards anything that does not meet the defined criteria.
3 | Airtable Push | Sends approved job records to the designated Airtable base via API.

2. Stage 1: Web Scraper

2.1 Input
• A flat list of target website URLs is provided by the client.
• The scraper does not discover URLs on its own; it visits each URL in the provided list and crawls to the careers page.

2.2 Crawling Behaviour
• For each URL, the scraper must locate the careers or jobs section of the site.
• Common patterns to detect: /careers, /jobs, /join-us, /opportunities, or nav links containing those keywords.
• The scraper must handle pagination and load all available job listings per site.
• JavaScript-rendered pages (SPAs) must be supported.
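As an illustration only, the careers-link detection described in 2.2 might be sketched as follows. The keyword list comes from the spec; the matching rule, the class `_LinkCollector`, and the function `find_careers_links` are hypothetical names and assumptions, not part of the existing system. The sketch only handles static HTML; JavaScript-rendered pages would first need to be rendered by a headless browser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Keywords taken from section 2.2; the matching rule itself is an assumption.
CAREERS_KEYWORDS = ("careers", "jobs", "join-us", "opportunities")


class _LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_careers_links(html, base_url):
    """Return absolute URLs whose href or anchor text contains a careers keyword."""
    parser = _LinkCollector()
    parser.feed(html)
    found = []
    for href, text in parser.links:
        # Normalise spaces to hyphens so the text "Join Us" matches "join-us".
        haystack = f"{href} {text}".lower().replace(" ", "-")
        if any(keyword in haystack for keyword in CAREERS_KEYWORDS):
            url = urljoin(base_url, href)
            if url not in found:
                found.append(url)
    return found
```

Because the match also runs against the href, links that point straight at an ATS host whose name contains one of the keywords are picked up too, which may help with the vendor-powered sites mentioned above.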
2.3 Data Fields to Extract
The following fields must be extracted for each job listing:

Field | Required? | Notes
Job URL | Yes | Direct link to the individual job posting page.
Job Title | Yes | As listed on the posting.
Job Location | Yes | City, state, or remote; capture as-is.
Salary | Yes | Capture only if explicitly stated on the page.
Date Posted | Yes | Capture only if explicitly stated on the page.
Job Description | Yes | Full text of the job description (deep scrape the detail page).

3. Stage 2: AI Filter
Each scraped job is passed to an AI model, which evaluates it against the mandatory criteria below. A job must pass all of them to proceed. Any job that fails even one criterion is discarded.

3.1 Criterion 1: Job Type (Attorney Roles Only)
The job must be a practicing attorney/lawyer role.
KEEP examples:
• Attorney, Associate, Senior Attorney, Partner, Of Counsel, Counsel
DISCARD examples:
• Sales, Marketing, HR, Operations, Business Development, Finance, IT, Paralegal, Legal Secretary, Law Clerk, Legal Assistant

3.2 Criterion 2: Seniority Level
The role must match one of the following seniority levels:
• Associate
• Attorney
• Senior Attorney
• Partner
DISCARD:
• Law Clerk
• Paralegal
• Legal Assistant / Secretary
• Intern / Summer Associate

3.3 Criterion 3: Location (10 Target US Cities)
The job must be located in one of the following markets. Remote roles that explicitly list one of these cities as the base are also acceptable.

Target Market | Notes
New York City, NY | -
Chicago, IL | -
Houston, TX | -
Dallas, TX | -
California (statewide) | Includes San Francisco, Los Angeles, and all other CA locations marked as 'California Remote'
Boston, MA | -
Miami, FL | -
Seattle, WA | -
Atlanta, GA | -

DISCARD: Any job outside the above locations, including fully remote roles with no listed office location in a target city.

3.4 Criterion 4: Specializations (updated 26-Jun-2026)
The job must be for attorneys specializing in any of the following domains:
• IP
• Commercial
• Corporate

4. Stage 3: Airtable Push
• All jobs that pass the AI filter must be pushed to the designated Airtable base via the Airtable REST API.
• Each job becomes one record in the base.
• The following fields must be mapped to Airtable columns:
    • Job URL
    • Job Title
    • Job Location
    • Salary
    • Date Posted
    • Job Description
    • Source Website (the original URL from the input list)
    • Date Scraped (auto-populated timestamp)
• Duplicate handling: before inserting, check whether the Job URL already exists in Airtable. If it does, skip or update; developer to confirm preferred behaviour.
• Airtable Base ID and API Key to be provided by the client.

5. Scheduling & Frequency
• The scraper must run on a configurable periodic schedule (e.g. every 15 or 30 days).
• Default frequency: once every 30 days.
• Each run should process all URLs in the input list.
• The schedule must be adjustable without code changes (config file or environment variable).
• The system must maintain a list of open jobs at any point in time, not just feed in new jobs; i.e. we need to know when a job closes, or at least have a list of which jobs are open at any point in time.

6. Error Handling & Logging
• If a website is unreachable, log the failure and continue to the next URL. Do not halt the run.
• If a page structure cannot be parsed, log the URL and skip it.
• All errors must be recorded in a run log with: URL, error type, timestamp.
• A run summary must be generated after each cycle: total sites crawled, total jobs found, total passed AI filter, total pushed to Airtable, total errors.

7. Technical Assumptions
• The developer will choose the appropriate scraping stack based on JS rendering requirements.
• AI filtering can use a suitable LLM API (e.g. OpenAI GPT-4, Claude API) with a prompt implementing the criteria above.
• Airtable integration uses the official Airtable REST API.

8. Out of Scope
• Discovering new website URLs beyond the provided list.
• Matching scraped jobs against existing Airtable records for candidates.
• Any front-end UI or admin dashboard.
• Sending notifications or emails.
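For illustration, the requirement to keep a list of open jobs at any point in time (rather than only feeding in new ones) could be approached by comparing each run's results with the previously open set, keyed by Job URL. This is a minimal sketch under stated assumptions: the `Job` type and `reconcile_open_jobs` function are hypothetical, and treating jobs from sites that errored during the run as still open is an assumption, not a stated requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    url: str             # Job URL, used as the unique key (as in section 4)
    source_website: str  # original URL from the input list


def reconcile_open_jobs(previously_open, scraped_now, failed_sources=()):
    """Split jobs into new, still-open, and closed, keyed by Job URL.

    Jobs whose source site errored this run are kept open rather than closed,
    because their absence is not evidence of closure (an assumption).
    """
    failed = set(failed_sources)
    current = {job.url: job for job in scraped_now}
    previous = {job.url: job for job in previously_open}

    new = [job for url, job in current.items() if url not in previous]
    still_open = [job for url, job in previous.items()
                  if url in current or job.source_website in failed]
    closed = [job for url, job in previous.items()
              if url not in current and job.source_website not in failed]
    return new, still_open, closed
```

Since the vendor / ATS-powered sites are reported to change URLs frequently, keying on Job URL alone may report one posting as both closed and new; a more stable key (for example a vendor requisition ID, where one is exposed) would be worth considering.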
