Web scraping of unstructured webpages
This PowerBI proof-of-concept (POC) illustrates how the content of webpages can be transformed into an interactive report. The sample counts the number of 'Jobs for R-users' by city. The POC specifically demonstrates web scraping of unstructured webpages and reporting within PowerBI (i.e., ETL).
The source data comes from the following URL:
http://www.r-users.com/jobs
The job board is for companies looking to hire R users. Jobs are listed by date, 25 items per page across about 20 pages. The purpose of the dashboard is to report the loaded data grouped by city rather than listed by date. The webpage layout is custom and not tabular (rows and columns).
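The regrouping described above, from a date-ordered listing to counts per city, can be sketched as follows. This is a minimal Python illustration, not the R script used in the POC; the sample records and the field names `date` and `city` are made up for the example and are not scraped data.

```python
from collections import Counter

# Hypothetical records in the order a job board would list them (by date).
# Field names and values are illustrative assumptions only.
jobs = [
    {"date": "2016-05-03", "city": "London"},
    {"date": "2016-05-02", "city": "Boston"},
    {"date": "2016-05-02", "city": "London "},  # stray whitespace is stripped
    {"date": "2016-05-01", "city": "Berlin"},
    {"date": "2016-05-01", "city": ""},         # missing city is skipped
]


def count_by_city(records):
    """Return (city, count) pairs, most frequent city first."""
    cities = (r.get("city", "").strip() for r in records)
    counts = Counter(c for c in cities if c)
    return counts.most_common()


city_counts = count_by_city(jobs)
for city, n in city_counts:
    print(city, n)
```

A list of (city, count) pairs like this is the shape of data a bar chart or map tile would consume.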
The site has no HTML table to load for processing, so the web data had to be scraped page by page following a pattern. An R script loaded the data by cycling through each page. Once the data was loaded, two reports were chosen as PowerBI tiles. The first tile is a world map showing the cities in the data. The second tile is a bar chart showing job counts by city. The tiles are inherently interactive on the PowerBI platform. The POC sample was published in a way that does not require a PowerBI user account or license. So now I can fill in some meaningful documentation.
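The page-by-page loading described above can be sketched roughly as follows. This is a Python illustration of the general pattern, not the R script used in the POC. The pagination pattern (`/page/N/`) and the `location` class name are assumptions made for the example; the real page markup would need to be inspected to find the actual pattern. The `fetch` argument is a hypothetical hook so the sketch runs offline.

```python
from html.parser import HTMLParser

BASE_URL = "http://www.r-users.com/jobs"  # URL given in the text


class LocationParser(HTMLParser):
    """Collect the text of every element whose class includes "location".

    The class name is an assumption for illustration only.
    """

    def __init__(self):
        super().__init__()
        self._in_location = False
        self.locations = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "location" in classes:
            self._in_location = True
            self.locations.append("")

    def handle_data(self, data):
        if self._in_location:
            self.locations[-1] += data

    def handle_endtag(self, tag):
        # Any closing tag ends the capture; this assumes the location
        # element holds plain text with no nested elements.
        self._in_location = False


def scrape_locations(fetch, n_pages):
    """Cycle through pages 1..n_pages and return all locations found.

    `fetch` is any callable that takes a URL and returns HTML text.
    """
    found = []
    for page in range(1, n_pages + 1):
        url = f"{BASE_URL}/page/{page}/"  # hypothetical pagination pattern
        parser = LocationParser()
        parser.feed(fetch(url))
        parser.close()
        found.extend(loc.strip() for loc in parser.locations if loc.strip())
    return found


def fake_fetch(url):
    """Stand-in for a real HTTP request; returns made-up markup."""
    page = url.rstrip("/").rsplit("/", 1)[-1]
    return (
        f'<div class="job"><h2>Job {page}</h2>'
        f'<span class="location"> City {page} </span></div>'
    )


locations = scrape_locations(fake_fetch, 3)
print(locations)
```

For a live run, `fake_fetch` could be swapped for a function built on `urllib.request.urlopen` that returns the decoded page text. Passing the fetcher in as an argument keeps the parsing logic testable without network access.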
PowerBI is easy once your data sources are loaded, so I am not targeting typical dashboards. I have also been working on a second PowerBI dashboard with 'time-line' data and animation.