Wednesday 24 June 2015

Data Scraping - Increasing Accessibility by Scraping Information From PDF

You may have heard about data scraping which is a method that is being used by computer programs in extracting data from an output that comes from another program. To put it simply, this is a process which involves the automatic sorting of information that can be found on different resources including the internet which is inside an html file, PDF or any other documents. In addition to that, there is the collection of pertinent information. These pieces of information will be contained into the databases or spreadsheets so that the users can retrieve them later.

Most of the websites today have text that can be accessed and written easily in the source code. However, there are now other businesses nowadays that choose to make use of Adobe PDF files or Portable Document Format. This is a type of file that can be viewed by simply using the free software known as the Adobe Acrobat. Almost any operating system supports the said software. There are many advantages when you choose to utilize PDF files. Among them is that the document that you have looks exactly the same even if you put it in another computer so that you can view it. Therefore, this makes it ideal for business documents or even specification sheets. Of course there are disadvantages as well. One of which is that the text that is contained in the file is converted into an image. In this case, it is often that you may have problems with this when it comes to the copying and pasting.

This is why there are some that start scraping information from PDF. This is often called PDF scraping in which this is the process that is just like data scraping only that you will be getting information that is contained in your PDF files. In order for you to begin scraping information from PDF, you must choose and exploit a tool that is specifically designed for this process. However, you will find that it is not easy to locate the right tool that will enable you to perform PDF scraping effectively. This is because most of the tools today have problems in obtaining exactly the same data that you want without personalizing them.

Nevertheless, if you search well enough, you will be able to encounter the program that you are looking for. There is no need for you to have programming language knowledge in order for you to use them. You can easily specify your own preferences and the software will do the rest of the work for you. There are also companies out there that you can contact and they will perform the task since they have the right tools that they can use. If you choose to do things manually, you will find that this is indeed tedious and complicated whereas if you compare this to having professionals do the job for you, they will be able to finish it in no time at all. Scraping information from PDF is a process where you collect the information that can be found on the internet and this does not infringe copyright laws.

Source: http://ezinearticles.com/?Increasing-Accessibility-by-Scraping-Information-From-PDF&id=4593863

Friday 19 June 2015

Web Scraping: working with APIs

APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted together HTTP request, receive JSON or XML in return. Today we tap into a range of APIs to get comfortable sending queries and processing responses.

These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences

This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few ‘social’ APIs such as LinkedIn and Twitter, as well as APIs less off the beaten track (Cricket scores, anyone?).

I enjoyed teaching this course and hope to repeat and improve on it next year. When designing the course I tried to cram in everything I wish I had been taught early on in my PhD (resulting in information overload, I fear). Still, hopefully it has been useful to students getting started with digital data collection, showing on the one hand what is possible, and on the other giving some idea of key steps in achieving research objectives.

Download the .Rpres file to use in Rstudio here

A regular R script with code-snippets only can be accessed here

Slides from the first session here

Slides from the second session here

Slides from the third session here

Source: http://www.r-bloggers.com/web-scraping-working-with-apis/

Monday 8 June 2015

Web Scraping Services : Making Modern File Formats More Accessible

Data scraping is the process of automatically sorting through information contained on the internet inside html, PDF or other documents and collecting relevant information to into databases and spreadsheets for later retrieval. On most websites, the text is easily and accessibly written in the source code but an increasing number of businesses are using Adobe PDF format (Portable Document Format: A format which can be viewed by the free Adobe Acrobat software on almost any operating system. See below for a link.). The advantage of PDF format is that the document looks exactly the same no matter which computer you view it from making it ideal for business forms, specification sheets, etc.; the disadvantage is that the text is converted into an image from which you often cannot easily copy and paste. PDF Scraping is the process of data scraping information contained in PDF files. To PDF scrape a PDF document, you must employ a more diverse set of tools.

There are two main types of PDF files: those built from a text file and those built from an image (likely scanned in). Adobe's own software is capable of PDF scraping from text-based PDF files but special tools are needed for PDF scraping text from image-based PDF files. The primary tool for PDF scraping is the OCR program. OCR, or Optical Character Recognition, programs scan a document for small pictures that they can separate into letters. These pictures are then compared to actual letters and if matches are found, the letters are copied into a file. OCR programs can perform PDF scraping of image-based PDF files quite accurately but they are not perfect.

Once the OCR program or Adobe program has finished PDF scraping a document, you can search through the data to find the parts you are most interested in. This information can then be stored into your favorite database or spreadsheet program. Some PDF scraping programs can sort the data into databases and/or spreadsheets automatically making your job that much easier.

Quite often you will not find a PDF scraping program that will obtain exactly the data you want without customization. Surprisingly a search on Google only turned up one business, that will create a customized PDF scraping utility for your project. A handful of off the shelf utilities claim to be customizable, but seem to require a bit of programming knowledge and time commitment to use effectively. Obtaining the data yourself with one of these tools may be possible but will likely prove quite tedious and time consuming. It may be advisable to contract a company that specializes in PDF scraping to do it for you quickly and professionally.

Let's explore some real world examples of the uses of PDF scraping technology. A group at Cornell University wanted to improve a database of technical documents in PDF format by taking the old PDF file where the links and references were just images of text and changing the links and references into working clickable links thus making the database easy to navigate and cross-reference. They employed a PDF scraping utility to deconstruct the PDF files and figure out where the links were. They then could create a simple script to re-create the PDF files with working links replacing the old text image.

A computer hardware vendor wanted to display specifications data for his hardware on his website. He hired a company to perform PDF scraping of the hardware documentation on the manufacturers' website and save the PDF scraped data into a database he could use to update his webpage automatically.

PDF Scraping is just collecting information that is available on the public internet. PDF Scraping does not violate copyright laws.

PDF Scraping is a great new technology that can significantly reduce your workload if it involves retrieving information from PDF files. Applications exist that can help you with smaller, easier PDF Scraping projects but companies exist that will create custom applications for larger or more intricate PDF Scraping jobs.

Source: http://ezinearticles.com/?PDF-Scraping:-Making-Modern-File-Formats-More-Accessible&id=193321

Tuesday 2 June 2015

Scraping the Royal Society membership list

To a data scientist any data is fair game, from my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007 which are available as a single PDF file. I’ve scraped the membership list before: the first time around I wrote a C# application which parsed a plain text file which I had made from the original PDF using an online converting service, looking back at the code it is fiendishly complicated and cluttered by boilerplate code required to build a GUI. ScraperWiki includes a pdftoxml function so I thought I’d see if this would make the process of parsing easier, and compare the ScraperWiki experience more widely with my earlier scraper.

The membership list is laid out quite simply, as shown in the image below, each member (or Fellow) record spans two lines with the member name in the left most column on the first line and information on their birth date and the day they died, the class of their Fellowship and their election date on the second line.

Later in the document we find that information on the Presidents of the Royal Society is found on the same line as the Fellow name and that Royal Patrons are formatted a little differently. There are also alias records where the second line points to the primary record for the name on the first line.

pdftoxml converts a PDF into an xml file, wherein each piece of text is located on the page using spatial coordinates, an individual line looks like this:

<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>

This makes parsing columnar data straightforward you simply need to select elements with particular values of the “left” attribute. It turns out that the columns are not in exactly the same positions throughout the whole document, which appears to have been constructed by tacking together the membership list A-J with that of K-Z, but this can easily be resolved by accepting a small range of positions for each column.

Attempting to automatically parse all 395 pages of the document reveals some transcription errors: one Fellow was apparently elected on 16th March 197 – a bit of Googling reveals that the real date is 16th March 1978. Another fellow is classed as a “Felllow”, and whilst most of the dates of birth and death are separated by a dash some are separated by an en dash which as far as the code is concerned is something completely different and so on. In my earlier iteration I missed some of these quirks or fixed them by editing the converted text file. These variations suggest that the source document was typed manually rather than being output from a pre-existing database. Since I couldn’t edit the source document I was obliged to code around these quirks.

ScraperWiki helpfully makes putting data into a SQLite database the simplest option for a scraper. My handling of dates in this version of the scraper is a little unsatisfactory: presidential terms are described in terms of a start and end year but are rendered 1st January of those years in the database. Furthermore, in historical documents dates may not be known accurately so someone may have a birth date described as “circa 1782″ or “c 1782″, even more vaguely they may be described as having “flourished 1663-1778″ or “fl. 1663-1778″. Python’s default datetime module does not capture this subtlety and if it did the database used to store dates would need to support it too to be useful – I’ve addressed this by storing the original life span data as text so that it can be analysed should the need arise. Storing dates as proper dates in the database, rather than text strings means we can query the database using date based queries.

ScraperWiki provides an API to my dataset so that I can query it using SQL, and since it is public anyone else can do this too. So, for example, it’s easy to write queries that tell you the the database contains 8019 Fellows, 56 Presidents, 387 born before 1700, 3657 with no birth date, 2360 with no death date, 204 “flourished”, 450 have birth dates “circa” some year.

I can count the number of classes of fellows:

select distinct class,count(*) from `RoyalSocietyFellows` group by class

Make a table of all of the Presidents of the Royal Society

select * from `RoyalSocietyFellows` where StartPresident not null order by StartPresident desc

…and so on. These illustrations just use the ScraperWiki htmltable export option to display the data as a table but equally I could use similar queries to pull data into a visualisation.

Comparing this to my earlier experience, the benefits of using ScraperWiki are:

•    Nice traceable code to provide a provenance for the dataset;

•    Access to the pdftoxml library;

•    Strong encouragement to “do the right thing” and put the data into a database;

•    Publication of the data;

•    A simple API giving access to the data for reuse by all.

My next target for ScraperWiki may well be the membership lists for the French Academie des Sciences, a task which proved too complex for a simple plain text scraper…

Source: https://scraperwiki.wordpress.com/2012/12/28/scraping-the-royal-society-membership-list/