The client is an AI-based recruitment platform that enables talent discovery and personalized interaction based on organizational alignment and profile matching. Its data-driven hiring solution helps companies identify candidates who best fit their needs and are likely to move, and then approach them through personalized interaction.
The client needed a solution to automate the search, retrieval and storage of publicly available data from multiple websites. This process had been done manually and required a dedicated resource to manage it end to end. A data pipeline and data warehousing solution was needed to manage the movement and transformation of data, and to support quick retrieval and analysis for reporting and decision making.
How Velotio Helped
Velotio developed a crawling and scraping solution that automates the search for and extraction of data from multiple sources and uploads the data to the client database for further processing. The solution crawls, extracts and stores data based on pre-specified rules. These rules specify the kinds of URLs to crawl and the types of data to extract and store in the database. The time interval between crawling/extraction runs and the quantity of data extracted can also be specified as required.
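The pre-specified rules described above could be modeled roughly as follows. This is a minimal sketch, not the delivered implementation: the `CrawlRule` class, its field names, the `select_urls` helper and the example.com URLs are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class CrawlRule:
    """Hypothetical rule describing what to crawl and what to keep."""
    name: str
    url_patterns: list[str]      # regexes for URLs that may be crawled
    fields: dict[str, str]       # field name -> data type to store
    interval_hours: int = 24     # time between crawl runs
    max_pages: int = 100         # cap on pages fetched per run

    def allows(self, url: str) -> bool:
        """Return True if the URL matches any configured pattern."""
        return any(re.search(p, url) for p in self.url_patterns)


def select_urls(rule: CrawlRule, candidates: list[str]) -> list[str]:
    """Keep only URLs the rule allows, up to the per-run page limit."""
    allowed = [u for u in candidates if rule.allows(u)]
    return allowed[: rule.max_pages]


# Illustrative rule: crawl numeric profile pages, at most 2 per run.
rule = CrawlRule(
    name="example-profiles",
    url_patterns=[r"^https://example\.com/profiles/\d+$"],
    fields={"name": "text", "title": "text"},
    interval_hours=12,
    max_pages=2,
)
urls = [
    "https://example.com/profiles/1",
    "https://example.com/about",
    "https://example.com/profiles/2",
    "https://example.com/profiles/3",
]
selected = select_urls(rule, urls)
```

Here `selected` holds only the first two profile URLs: the "about" page fails the URL pattern, and the third profile is dropped by the `max_pages` limit. A scheduler would read `interval_hours` to decide when the next run is due.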
Key Technologies & Platforms
The solution was organized into three layers: Crawler, Data Extractor and Backend API Layer. Domain-specific rules and intelligence were used to crawl, extract and store data. Basic NLP and machine learning were also used to reduce the effort of scraping platform websites.
Crawler: Crawls specific websites and platforms, following domain-specific rules to collect data. The data is then uploaded to cloud storage (Amazon S3 or equivalent) for processing. The crawler has multiple spiders (processes), most likely with a customized spider for each kind of platform, and runs periodically to keep the content up to date.
Extractor: The data extractor takes HTML page content and applies generic and platform/website-specific rules to get the relevant data, which is then saved to a database.
API Layer: A REST API server provides access to the data in persistent storage over HTTP.
Database: MySQL was used as the relational database for persistent storage of the data. MySQL's full-text search feature was used to support the search APIs. The solution described above was delivered as a Docker container.
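The extract-and-store step described above can be sketched as follows. This is an illustration under stated assumptions, not the delivered code: the rule dictionaries, CSS class names, `profiles` table and sample page are hypothetical, and Python's built-in `sqlite3` stands in for MySQL only so that the example runs without a database server.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical rules: CSS class -> output field name.
GENERIC_RULES = {"name": "name"}                        # applied to every site
SITE_RULES = {"example.com": {"job-title": "title"}}    # per-site additions


class ClassTextParser(HTMLParser):
    """Collects the text of flat (non-nested) elements by CSS class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted     # CSS class -> field name
        self.current = None      # field currently being captured
        self.result = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.wanted:
                self.current = self.wanted[cls]

    def handle_data(self, data):
        # Keep the first non-empty text seen for the active field.
        if self.current and data.strip():
            self.result.setdefault(self.current, data.strip())

    def handle_endtag(self, tag):
        self.current = None


def extract(html, site):
    """Apply generic rules plus any site-specific rules to one page."""
    wanted = {**GENERIC_RULES, **SITE_RULES.get(site, {})}
    parser = ClassTextParser(wanted)
    parser.feed(html)
    parser.close()
    return parser.result


def save(conn, record):
    """Persist one extracted record (SQLite stand-in for MySQL)."""
    conn.execute("CREATE TABLE IF NOT EXISTS profiles (name TEXT, title TEXT)")
    conn.execute(
        "INSERT INTO profiles (name, title) VALUES (?, ?)",
        (record.get("name"), record.get("title")),
    )
    conn.commit()


page = '<div><h1 class="name">Jane Doe</h1><p class="job-title">Data Engineer</p></div>'
record = extract(page, "example.com")
conn = sqlite3.connect(":memory:")
save(conn, record)
rows = conn.execute("SELECT name, title FROM profiles").fetchall()
```

Merging the generic and site-specific dictionaries keeps shared fields in one place while letting each platform add its own selectors. A real extractor would need to handle nested markup and malformed HTML, which this sketch does not.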
The client was able to automate the entire process of searching for, extracting and storing content in their database. This resulted in considerable savings in the resources and effort needed to extract data.
Managing profiles on the client platform became simpler thanks to automated, regular uploads to the database. Because the rules for crawling and scraping the data can be pre-specified, the client database now contains highly specific and relevant data.