Friday 3 July 2015

Scraping data from a list of web pages using Google Docs

Quite often when you’re looking for data as part of a story, that data will not be on a single page, but on a series of pages. To manually copy the data from each one – or even scrape the data individually – would take time. Here I explain a way to use Google Docs to grab the data for you.

Some basic principles

Although Google Docs is a pretty clumsy tool to use to scrape webpages, the method used is much the same as if you were writing a scraper in a programming language like Python or Ruby. For that reason, I think this is a good quick way to introduce the basics of certain types of scrapers.

Here’s how it works:

Firstly, you need a list of links to the pages containing data.

Quite often that list might be on a webpage which links to them all, but if not you should look at whether the links have any common structure, for example “http://www.country.com/data/australia” or “http://www.country.com/data/country2”. If they do, then you can generate a list by filling in the part of the URL that changes each time (in this case, the country name or number), assuming you have a list to fill it from (i.e. a list of countries, codes or simple addition).

Second, you need the destination pages to have some consistent structure to them. In other words, they should look the same (although looking the same doesn’t mean they have the same structure – more on this below).

The scraper then cycles through each link in your list, grabs particular bits of data from each linked page (because it is always in the same place), and saves them all in one place.
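
To make that concrete, here is a minimal sketch of that same loop written in Python rather than Google Docs. The URLs, the choice of the requests and BeautifulSoup libraries, and the element being grabbed are all just illustrative assumptions:

import requests
from bs4 import BeautifulSoup

# A hypothetical list of pages that all share the same structure
urls = ["http://www.country.com/data/australia",
        "http://www.country.com/data/brazil"]

rows = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    cell = soup.find("td")   # the bit that is "always in the same place"
    rows.append((url, cell.get_text(strip=True) if cell else ""))

print(rows)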

Scraping with Google Docs using =importXML – a case study

If you’ve not used =importXML before it’s worth catching up on my previous 2 posts How to scrape webpages and ask questions with Google Docs and =importXML and Asking questions of a webpage – and finding out when those answers change.

This takes things a little bit further.

In this case I’m going to scrape some data for a story about local history – the data for which is helpfully published by the Durham Mining Museum. Their homepage has a list of local mining disasters, with the date and cause of the disaster, the name and county of the colliery, the number of deaths, and links to the names and to a page about each colliery.

However, there is not enough geographical information here to map the data. That, instead, is provided on each colliery’s individual page.

So we need to go through this list of webpages, grab the location information, and pull it all together into a single list.

Finding the structure in the HTML

To do this we need to isolate which part of the homepage contains the list. If we right-click on the page to ‘view source’ and search for ‘Haig’ (the first colliery listed), we can see it’s in a table whose opening tag looks like this: <table border=0 align=center style="font-size:10pt">

We can use =importXML to grab the contents of the table like so:

=Importxml("http://www.dmm.org.uk/mindex.htm", "//table[starts-with(@style, 'font-size:10pt')]")

But we only want the links, so how do we grab just those instead of the whole table contents?

The answer is to add more detail to our request. If we look at the HTML that contains the link, it looks like this:

<td valign=top><a href="http://www.dmm.org.uk/colliery/h029.htm">Haig&nbsp;Pit</a></td>

So it’s within a <td> tag – but all the data in this table is, not surprisingly, contained within <td> tags. The key is to identify which <td> tag we want – and in this case, it’s always the fourth one in each row.

So we can add “//td[4]” (‘look for the fourth <td> tag’) to our function like so:

=Importxml("http://www.dmm.org.uk/mindex.htm", "//table[starts-with(@style, 'font-size:10pt')]//td[4]")

Now we should have a list of the collieries – but we want the actual URL of the page that is linked to with that text. That is contained within the value of the href attribute – or, put in plain language: it comes after the bit that says href=".

So we just need to add one more bit to our function: “//@href”:

=Importxml("http://www.dmm.org.uk/mindex.htm", "//table[starts-with(@style, 'font-size:10pt')]//td[4]//@href")

So, reading from the far right inwards, this is what it says: “Grab the value of href, within the fourth <td> tag on every row, of the table that has a style value of font-size:10pt”

Note: if there was only one link in every row, we wouldn’t need to include //td[4] to specify the link we needed.

Scraping data from each link in a list

Now we have a list – but we still need to scrape some information from each link in that list.

Firstly, we need to identify the location of the information that we need on the linked pages. Taking the first page, view source and search for ‘Sheet 89’, which are the first two words of the ‘Map Ref’ line.

The HTML code around that information looks like this:

<td valign=top>(Sheet 89) NX965176, 54° 32' 35" N, 3° 36' 0" W</td>

Looking a little further up, the table that contains this cell uses HTML like this:

<table border=0 width="95%">

So if we needed to scrape this information, we would write a function like this:

=importXML("http://www.dmm.org.uk/colliery/h029.htm", "//table[starts-with(@width, '95%')]//tr[2]//td[2]")

…And we’d have to write it for every URL.

But because we have a list of URLs, we can do this much quicker by using cell references instead of the full URL.

So. Let’s assume that your formula was in cell C2 (as it is in this example), and the results have formed a column of links going from C2 down to C11. Now we can write a formula that looks at each URL in turn and performs a scrape on it.

In D2 then, we type the following:

=importXML(C2, "//table[starts-with(@width, '95%')]//tr[2]//td[2]")

If you copy the cell all the way down the column, it will change the function so that it is performed on each neighbouring cell.

In fact, we could simplify things even further by putting the second part of the function in cell D1 – without the quotation marks – like so:

//table[starts-with(@width, '95%')]//tr[2]//td[2]

And then in D2 change the formula to this:

=ImportXML(C2,$D$1)

(The dollar signs keep the D1 reference the same even when the formula is copied down, while C2 will change in each cell)

Now it works – we have the data from each of 8 different pages. Almost.

Troubleshooting with =IF

The problem is that the structure of those pages is not as consistent as we thought: the scraper is producing extra cells of data for some, which knocks out the data that should be appearing there from other cells.

So I’ve used an IF formula to clean that up as follows:

In cell E2 I type the following:

=if(D2="", ImportXML(C2,$D$1), D2)

Which says: ‘If D2 is empty, then run the importXML formula again and put the results here, but if it’s not empty then copy the values across’.

That formula is copied down the column.

But there’s still one empty column even now, so the same formula is used again in column F:

=if(E2="", ImportXML(C2,$D$1), E2)

A hack, but an instructive one

As I said earlier, this isn’t the best way to write a scraper, but it is a useful way to start to understand how they work, and a quick method if you don’t have huge numbers of pages to scrape. With hundreds of pages, it’s more likely you will miss problems – so watch out for inconsistent structure and data that doesn’t line up.

Source: http://onlinejournalismblog.com/2011/10/14/scraping-data-from-a-list-of-webpages-using-google-docs/

Wednesday 24 June 2015

Data Scraping - Increasing Accessibility by Scraping Information From PDF

You may have heard about data scraping, a method used by computer programs to extract data from the output of another program. To put it simply, it is a process that automatically sorts through information found in different resources, including HTML files on the internet, PDFs and other documents, and collects the pertinent information. That information is then placed into databases or spreadsheets so that users can retrieve it later.

Most websites today have text that can be accessed and read easily in the source code. However, many businesses now choose to use Adobe PDF (Portable Document Format) files instead. This is a type of file that can be viewed simply by using the free Adobe Acrobat software, which almost any operating system supports. There are many advantages to using PDF files; among them is that the document looks exactly the same when you view it on another computer, which makes it ideal for business documents or specification sheets. Of course there are disadvantages as well. One of them is that the text contained in the file is converted into an image, and because of this you may often have problems copying and pasting it.

This is why some people have started scraping information from PDF. This is often called PDF scraping, a process just like data scraping except that the information you are getting is contained in PDF files. To begin scraping information from PDF, you must choose and use a tool specifically designed for this process. However, you will find that it is not easy to locate the right tool to perform PDF scraping effectively, because most of the tools available today have problems obtaining exactly the data you want without being customized.
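
As a rough illustration only (not a specific tool the article has in mind), a text-based PDF can be read with a few lines of Python using the pdfminer.six library; the file name below is hypothetical:

from pdfminer.high_level import extract_text

# Pull the raw text out of a text-based (not scanned) PDF
text = extract_text("report.pdf")
print(text[:500])   # first 500 characters, just to inspect the result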

Nevertheless, if you search well enough, you will be able to find the program you are looking for. There is no need to know a programming language in order to use such tools: you simply specify your own preferences and the software does the rest of the work for you. There are also companies you can contact that will perform the task for you, since they have the right tools to use. Doing things manually is tedious and complicated, whereas professionals can finish the job in no time at all. Scraping information from PDF is a process of collecting information that can be found on the internet, and it does not infringe copyright laws.

Source: http://ezinearticles.com/?Increasing-Accessibility-by-Scraping-Information-From-PDF&id=4593863

Friday 19 June 2015

Web Scraping: working with APIs

APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted-together HTTP request, receive JSON or XML in return. Today we tap into a range of APIs to get comfortable sending queries and processing responses.
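
The course itself works in R, but the underlying pattern is language-agnostic. A minimal Python sketch of the request-then-parse cycle might look like this; the endpoint and parameters are purely hypothetical:

import requests

# Paste together an HTTP request, then read the JSON that comes back
response = requests.get("https://api.example.com/geocode",
                        params={"address": "Berlin", "format": "json"})
response.raise_for_status()
data = response.json()    # the parsed response, as a Python dict
print(data)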

These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences

This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few ‘social’ APIs such as LinkedIn and Twitter, as well as APIs less off the beaten track (Cricket scores, anyone?).

I enjoyed teaching this course and hope to repeat and improve on it next year. When designing the course I tried to cram in everything I wish I had been taught early on in my PhD (resulting in information overload, I fear). Still, hopefully it has been useful to students getting started with digital data collection, showing on the one hand what is possible, and on the other giving some idea of key steps in achieving research objectives.

Download the .Rpres file to use in Rstudio here

A regular R script with code-snippets only can be accessed here

Slides from the first session here

Slides from the second session here

Slides from the third session here

Source: http://www.r-bloggers.com/web-scraping-working-with-apis/

Monday 8 June 2015

Web Scraping Services: Making Modern File Formats More Accessible

Data scraping is the process of automatically sorting through information contained on the internet inside HTML, PDF or other documents and collecting relevant information into databases and spreadsheets for later retrieval. On most websites, the text is easily and accessibly written in the source code, but an increasing number of businesses are using Adobe PDF format (Portable Document Format: a format which can be viewed by the free Adobe Acrobat software on almost any operating system). The advantage of PDF format is that the document looks exactly the same no matter which computer you view it from, making it ideal for business forms, specification sheets, etc.; the disadvantage is that the text is converted into an image from which you often cannot easily copy and paste. PDF scraping is the process of data scraping information contained in PDF files. To PDF scrape a PDF document, you must employ a more diverse set of tools.

There are two main types of PDF files: those built from a text file and those built from an image (likely scanned in). Adobe's own software is capable of PDF scraping from text-based PDF files but special tools are needed for PDF scraping text from image-based PDF files. The primary tool for PDF scraping is the OCR program. OCR, or Optical Character Recognition, programs scan a document for small pictures that they can separate into letters. These pictures are then compared to actual letters and if matches are found, the letters are copied into a file. OCR programs can perform PDF scraping of image-based PDF files quite accurately but they are not perfect.
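
As a sketch of what the OCR step looks like in code, assuming the Tesseract engine plus the pytesseract and Pillow Python packages are installed, and using a hypothetical page image:

from PIL import Image
import pytesseract

# Recognise the characters in a scanned page saved as an image
page = Image.open("scanned_page.png")
text = pytesseract.image_to_string(page)
print(text)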

Once the OCR program or Adobe program has finished PDF scraping a document, you can search through the data to find the parts you are most interested in. This information can then be stored into your favorite database or spreadsheet program. Some PDF scraping programs can sort the data into databases and/or spreadsheets automatically making your job that much easier.

Quite often you will not find a PDF scraping program that will obtain exactly the data you want without customization. Surprisingly, a search on Google turned up only one business that will create a customized PDF scraping utility for your project. A handful of off-the-shelf utilities claim to be customizable, but seem to require a bit of programming knowledge and time commitment to use effectively. Obtaining the data yourself with one of these tools may be possible but will likely prove quite tedious and time consuming. It may be advisable to contract a company that specializes in PDF scraping to do it for you quickly and professionally.

Let's explore some real world examples of the uses of PDF scraping technology. A group at Cornell University wanted to improve a database of technical documents in PDF format by taking the old PDF file where the links and references were just images of text and changing the links and references into working clickable links thus making the database easy to navigate and cross-reference. They employed a PDF scraping utility to deconstruct the PDF files and figure out where the links were. They then could create a simple script to re-create the PDF files with working links replacing the old text image.

A computer hardware vendor wanted to display specifications data for his hardware on his website. He hired a company to perform PDF scraping of the hardware documentation on the manufacturers' website and save the PDF scraped data into a database he could use to update his webpage automatically.

PDF Scraping is just collecting information that is available on the public internet. PDF Scraping does not violate copyright laws.

PDF Scraping is a great new technology that can significantly reduce your workload if it involves retrieving information from PDF files. Applications exist that can help you with smaller, easier PDF Scraping projects but companies exist that will create custom applications for larger or more intricate PDF Scraping jobs.

Source: http://ezinearticles.com/?PDF-Scraping:-Making-Modern-File-Formats-More-Accessible&id=193321

Tuesday 2 June 2015

Scraping the Royal Society membership list

To a data scientist any data is fair game. Through my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007, which are available as a single PDF file. I’ve scraped the membership list before: the first time around I wrote a C# application which parsed a plain text file I had made from the original PDF using an online converting service; looking back at the code, it is fiendishly complicated and cluttered by boilerplate code required to build a GUI. ScraperWiki includes a pdftoxml function, so I thought I’d see if this would make the process of parsing easier, and compare the ScraperWiki experience more widely with my earlier scraper.

The membership list is laid out quite simply: each member (or Fellow) record spans two lines, with the member name in the left-most column on the first line, and information on their birth and death dates, the class of their Fellowship and their election date on the second line.

Later in the document we find that information on the Presidents of the Royal Society is found on the same line as the Fellow name and that Royal Patrons are formatted a little differently. There are also alias records where the second line points to the primary record for the name on the first line.

pdftoxml converts a PDF into an XML file, wherein each piece of text is located on the page using spatial coordinates. An individual line looks like this:

<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>

This makes parsing columnar data straightforward: you simply need to select elements with particular values of the “left” attribute. It turns out that the columns are not in exactly the same positions throughout the whole document, which appears to have been constructed by tacking together the membership list A-J with that of K-Z, but this can easily be resolved by accepting a small range of positions for each column.
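
The original scraper ran on ScraperWiki, so the following is only a sketch of that selection step in Python using lxml; the file name and the column positions are assumptions based on the example line above:

from lxml import etree

tree = etree.parse("members.xml")      # the output of pdftoxml
names = []
for el in tree.iter("text"):
    left = int(el.get("left", "0"))
    # Accept a small range of positions, since the columns drift slightly
    if 130 <= left <= 140:
        names.append("".join(el.itertext()).strip())

print(names[:5])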

Attempting to automatically parse all 395 pages of the document reveals some transcription errors: one Fellow was apparently elected on 16th March 197 – a bit of Googling reveals that the real date is 16th March 1978. Another fellow is classed as a “Felllow”, and whilst most of the dates of birth and death are separated by a dash, some are separated by an en dash, which as far as the code is concerned is something completely different, and so on. In my earlier iteration I missed some of these quirks or fixed them by editing the converted text file. These variations suggest that the source document was typed manually rather than being output from a pre-existing database. Since I couldn’t edit the source document I was obliged to code around these quirks.

ScraperWiki helpfully makes putting data into a SQLite database the simplest option for a scraper. My handling of dates in this version of the scraper is a little unsatisfactory: presidential terms are described in terms of a start and end year but are rendered as 1st January of those years in the database. Furthermore, in historical documents dates may not be known accurately, so someone may have a birth date described as “circa 1782” or “c 1782”; even more vaguely, they may be described as having “flourished 1663-1778” or “fl. 1663-1778”. Python’s default datetime module does not capture this subtlety, and if it did, the database used to store dates would need to support it too to be useful – I’ve addressed this by storing the original life span data as text so that it can be analysed should the need arise. Storing dates as proper dates in the database, rather than as text strings, means we can query the database using date-based queries.

ScraperWiki provides an API to my dataset so that I can query it using SQL, and since it is public anyone else can do this too. So, for example, it’s easy to write queries that tell you that the database contains 8019 Fellows, 56 Presidents, 387 born before 1700, 3657 with no birth date, 2360 with no death date, 204 who “flourished”, and 450 with birth dates “circa” some year.

I can count the number of classes of fellows:

select distinct class,count(*) from `RoyalSocietyFellows` group by class

Make a table of all of the Presidents of the Royal Society

select * from `RoyalSocietyFellows` where StartPresident not null order by StartPresident desc

…and so on. These illustrations just use the ScraperWiki htmltable export option to display the data as a table but equally I could use similar queries to pull data into a visualisation.

Comparing this to my earlier experience, the benefits of using ScraperWiki are:

•    Nice traceable code to provide a provenance for the dataset;

•    Access to the pdftoxml library;

•    Strong encouragement to “do the right thing” and put the data into a database;

•    Publication of the data;

•    A simple API giving access to the data for reuse by all.

My next target for ScraperWiki may well be the membership lists for the French Academie des Sciences, a task which proved too complex for a simple plain text scraper…

Source: https://scraperwiki.wordpress.com/2012/12/28/scraping-the-royal-society-membership-list/

Thursday 28 May 2015

Data Scraping Services - Scraping Yelp Business Data With Python Scraping Script

Yelp is a great source of business contact information, with details like address, postal code, contact information and website addresses that other sites like Google Maps just do not provide. Yelp also provides reviews about the particular business. The Yelp business database can be useful for telemarketing, email marketing and lead generation.

Are you looking for a Yelp business details database? Are you looking to scrape data from the Yelp website/business directory? Are you looking for Yelp screen scraping software? Are you looking to scrape business contact information from Yelp? Then you are at the right place.

Here I am going to discuss how to scrape Yelp data for lead generation and email marketing. I have made a simple and straightforward Yelp data scraping script in Python that can scrape data from the Yelp website. You can use this Yelp scraper script absolutely free.

I have used the urllib and BeautifulSoup packages: urllib to make the HTTP requests, BeautifulSoup to parse the HTML, and threads to make the scraping faster.

Yelp Scraping Python Script

import urllib
from bs4 import BeautifulSoup
import re
from threading import Thread

# List of yelp urls to scrape
url = ['http://www.yelp.com/biz/liman-fisch-restaurant-hamburg',
       'http://www.yelp.com/biz/casa-franco-caramba-hamburg',
       'http://www.yelp.com/biz/o-ren-ishii-hamburg',
       'http://www.yelp.com/biz/gastwerk-hotel-hamburg-hamburg-2',
       'http://www.yelp.com/biz/superbude-hamburg-2',
       'http://www.yelp.com/biz/hotel-hafen-hamburg-hamburg',
       'http://www.yelp.com/biz/hamburg-marriott-hotel-hamburg',
       'http://www.yelp.com/biz/yoho-hamburg']

i = 0

# function that will do actual scraping job
def scrape(ur):
    html = urllib.urlopen(ur).read()
    soup = BeautifulSoup(html)
    title = soup.find('h1', itemprop="name")
    saddress = soup.find('span', itemprop="streetAddress")
    postalcode = soup.find('span', itemprop="postalCode")
    print title.text
    print saddress.text
    print postalcode.text
    print "-------------------"

threadlist = []

# making threads
while i < len(url):
    t = Thread(target=scrape, args=(url[i],))
    t.start()
    threadlist.append(t)
    i = i + 1

for b in threadlist:
    b.join()

Recently I worked for a German company on a Yelp scraping project and delivered data as per their requirements. If you are looking to scrape data from business directories like Yelp, then send me your requirements and I will get back to you with a sample.

Source: http://webdata-scraping.com/scraping-yelp-business-data-python-scraping-script/

Tuesday 26 May 2015

Data Extraction Services

Are you finding it tedious to perform your routine tasks as well as finding time to research for some information? Don't worry; all you have to do is outsource data extraction requirements to reliable service providers such as Hi-Tech BPO Services.

We can assist you in finding, extracting, gathering, processing and validating all the required data through our effective data extraction services. We can extract data from any given source such as websites, databases, printed documents, directories, etc.

With a whole plethora of data extraction solutions, we are definitely a one-stop solution to all your data extraction services requirements.

For utilizing our data extraction services, all you have to do is outsource data extraction requirements to us, and we will create effective strategies and extract the required data from all preferred sources. Then we will arrange all the extracted data in a systematic order.

Types of data extraction services provided by our data extraction India unit:

The data extraction India unit of Hi-Tech BPO Services can attend to all types of outsource data extraction requirements. Following are just some of the data extraction services we have delivered:

•    Data extraction from websites
•    Data extraction from databases
•    Extraction of data from directories
•    Extracting data from books
•    Data extraction from forms
•    Extracting data from printed materials

Features of Our Data Extraction Services:

•    Reliable collection of resources for data extraction
•    Extensive range of data extraction services
•    Data can be extracted from any available source be it a digital source or a hard copy source
•    Proper researching, extraction, gathering, processing and validation of data
•    Reasonably priced data extraction services
•    Quality and confidentiality ensured through various strict measures

Our data extraction India unit has the competency to handle any of your data extraction services requirements. Just provide us with your specific requirements and we will extract data accordingly from your preferred resources, if particularly specified. Otherwise we will completely rely on our collection of resources for extracting data for you.

Source: http://www.hitechbposervices.com/data-extraction.php

Monday 25 May 2015

Which language is the most flexible for scraping websites?


I'm new to programming. I know a little Python and a little Objective-C, and I've been going through tutorials for each. Then it occurred to me: I need to know which language is more flexible (Python, Objective-C, something else) for screen scraping a website for content.

What do I mean by "flexible"?

Well, ideally, I need something that will be easy to refactor and tweak for similar projects. I'm trying to avoid doing a lot of re-writing (well, re-coding) if I wanted to switch some of the variables in the program (i.e., the website to be scraped, the content to fetch, etc).

Anyways, if you could please give me your opinion, that would be great. Oh, and if you know any existing frameworks for the language you recommend, please share. (I know a little about Selenium and BeautifulSoup for python already).

4 Answers

I recently wrote a relatively complex web scraper to harvest a TON of data. It had to do some relatively complex parsing, I needed it to stuff it into a database, etc. I'm C# programmer now and formerly a Perl guy.

I wrote my original scraper using Python. I started on a Thursday and by Sunday morning I was harvesting about a million scores from a show horse site. I used Python and SQLite because they were fast.

HOWEVER, as I started putting together programs to regularly keep the data updated and to populate the SQL Server that would backend my MVC3 application, I kept hitting snags and gaps in my Python knowledge.

In the end, I completely rewrote the scraper/parser in C# using the HtmlAgilityPack and it works better than before (and just about as fast).

Because I KNEW THE LANGUAGE and the environment so much better I was able to add better database support, better logging, better error handling, etc. etc.

So... short answer.. Python was the fastest to market with a "good enough for now" solution, but the language I know best (C#) was the best long-term solution.

EDIT: I used BeautifulSoup for my original crawler written in Python.


The most flexible is the one that you're most familiar with.

Personally, I use Python for almost all of my utilities. For scraping, I find that its functionality specific to parsing and string manipulation requires little code, is fast, and there are a ton of examples out there (strong community). Chances are that someone has already written whatever you're trying to do, or there's at least something along the same lines that needs very little refactoring.


I think it's safe to say that Python is a better place to start than Objective-C. Honestly, just about any language meets the "flexible" requirement. All you need is well thought out configuration parameters. Also, a dynamic language like Python can go a long way in increasing flexibility, provided that you account for runtime type errors.


I recently wrote a very simple web-scraper; I chose Common Lisp as I'm learning the language.

On the basis of my experience - both of the language and the availability of help from experienced Lispers - I recommend investigating Common Lisp for your purpose.

There are excellent XML-parsing libraries available for CL, as well as libraries for parsing invalid HTML, which you'll need unless the sites you're parsing consist solely of valid XHTML.

Also, Common Lisp is a good language in which to implement DSLs; a DSL for web-scraping may be a solution to your requirement for flexibility & re-use.

Source: http://programmers.stackexchange.com/questions/74998/which-language-is-the-most-flexible-for-scraping-websites/75006#75006


Friday 22 May 2015

Web scraping using Python without using large frameworks like Scrapy

If you need publicly available data from scraping the Internet, before creating a web scraper it is best to check if this data is already available from public data sources or APIs. Check the site’s FAQ section or Google for their API endpoints and public data.

Even if their API endpoints are available you have to create some parser for fetching and structuring the data according to your needs.

Scrapy is a well established framework for scraping, but it is also a very heavy framework. For smaller jobs, it may be overkill and for extremely large jobs it is very slow.

So if you would like to roll up your sleeves and build your own scraper, continue reading.

Here are some basic steps performed by most webspiders:

1) Start with a URL and use an HTTP GET or POST request to access the URL
2) Fetch all the contents in it and parse the data
3) Store the data in any database or put it into any data warehouse
4) Enqueue all the URLs in a page
5) Use the URLs in the queue and repeat from process 1

Here are the 3 major modules in every web crawler:
1) Request/Response handler.
2) Data parsing/data cleansing/data munging process.
3) Data serialization/data pipelines.

Let’s look at each of these modules and see what they do and how to use them.

Request/Response handler

Request/response handlers are managers which make HTTP requests to a URL or a group of URLs, fetch the response objects as HTML contents, and pass this data to the next module. If you use Python for the request/response URL-opening process, libraries such as the following are most commonly used:

1) urllib (20.5. urllib – Open arbitrary resources by URL – Python v2.7.8 documentation) – a basic Python library that nonetheless offers a high-level interface for fetching data across the World Wide Web.

2) urllib2 (20.6. urllib2 – extensible library for opening URLs – Python v2.7.8 documentation) – an extensible successor to urllib, which handles basic HTTP requests, digest authentication, redirections, cookies and more.

3) requests (Requests: HTTP for Humans) – a much more advanced request library, built on top of the basic request handling libraries.
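
For example, fetching a page with requests takes only a couple of lines (the URL here is hypothetical); the raw HTML is then handed on to the parsing module:

import requests

response = requests.get("http://example.com/listing.html")
html = response.text    # raw HTML passed on to the parsing module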

Data parsing/data cleansing/data munging process

This is the module where the fetched data is processed and cleaned. Unstructured data is transformed into structured data during this processing. Usually a set of regular expressions (regexes), which perform pattern matching and text processing tasks on the HTML data, is used for this processing.

In addition to regexes, basic string manipulation and search methods are also used to perform this cleaning and transformation. You must have a thorough knowledge of regular expressions so that you can design the regex patterns.
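
As a tiny illustration of that kind of pattern matching, here is a regex pulling href values out of a made-up HTML snippet:

import re

html = '<td><a href="http://example.com/item/1">Item 1</a></td>'
# Find every href value in the fetched page
links = re.findall(r'href="([^"]+)"', html)
print(links)    # ['http://example.com/item/1']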

Data serialization/data pipelines

Once you get the cleaned data from the parsing and cleaning module, the data serialization module will be used to serialize the data according to the data models that you require. This is the final module that will output data in a standard format that can be stored in databases, JSON/CSV files or passed to any data warehouses for storage. These tasks are usually performed by libraries listed below

1) pickle (pickle – Python object serialization) –  This module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure

2) JSON (JSON encoder and decoder)

3) CSV (https://docs.python.org/2/library/csv.html)

4) Basic database interface libraries like pymongo (Tutorial – PyMongo), mysqldb, and sqlite3 (sqlite3 – DB-API interface for SQLite databases)

And many more such libraries based on the format and database/data storage.
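
For instance, writing a list of cleaned records out as JSON and CSV takes only a few lines of standard-library code; the field names here are invented:

import csv
import json

records = [{"name": "Item 1", "price": "9.99"}]

with open("output.json", "w") as f:
    json.dump(records, f, indent=2)

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)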

Basic spider rules

The rules to follow while building a spider are to be nice to the sites you are scraping and to follow the spider policies outlined in the site’s robots.txt.

Limit the number of requests per second and build enough delays into the spiders so that you don’t adversely affect the site.
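
In Python 3 both of those courtesies can be handled with the standard library; the URLs below are hypothetical:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

for url in ["http://example.com/page1", "http://example.com/page2"]:
    if rp.can_fetch("*", url):
        # ... fetch and parse the page here ...
        time.sleep(2)    # pause between requests so the site is not overloaded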

It just makes sense to be nice.

We will cover more techniques in future articles

Source: http://learn.scrapehero.com/webscraping-using-python-without-using-large-frameworks-like-scrapy/

Tuesday 19 May 2015

How Web Data Extraction Services Impact Startups

Starting a business has its fair share of ebbs and flows – it can be extremely challenging to get a new business off the blocks, and extremely rewarding when everything goes according to plan and yields desired results. For startups, it is important to get the nuances of running a business right from day one. To succeed in an immensely competitive space, startups need to perform above and beyond expectation right from the start, and one of the factors that can be of great help during the growing years of a startup is web data extraction.

Web data extraction through crawling and scraping, a highly efficient information gathering process, can be used in many creative ways to bring about major change in the performance graph of a startup. With effective web data extraction services acquired by outsourcing to a reputed company, the business intelligence gathered and the numerous possibilities associated with it, web crawling and extraction services can indeed become the difference maker for a startup, propelling it to the heights of success.

What drives the success of web data extraction?

When it comes to figuring out the perfect, balanced web data collection methodology for startups, there are a lot of crucial factors that come into play. Some of these are associated with the technical aspects of data collection, the approach used, the time invested, and the tools involved. Others have more to do with the processing and analysis of collected information and its judicious use in formulating strategies to take things forward.

Web Crawling Services & Web Scraping Services

With the advent of highly professional web data extraction services providers, massive amounts of structured, relevant data can be gathered and stored in real time, and in time, productively used to further the business interests of a startup. As a new business owner, it is important to have a high-level knowledge of the modern and highly functional web scraping tools available for use. This will help to utilize the prowess of competent data extraction services. This in turn can assist both in the immediate and long-term revenue generation context.

Web Data Extraction for Startups

From the very beginning, the dynamics of startups are different from those of older, well-established businesses. The time taken by the new business entity in proving its capabilities and market position needs to be used completely and effectively. Every day of growth and learning needs to add up to make a substantial difference. In this period, every plan and strategy, every execution effort, and every move needs to be properly thought out.

In such a trying situation where there is little margin for error, it pays to have accurate, reliable, relevant and actionable business intelligence. This can put you in firm control of things by allowing you to make informed business decisions and formulate targeted, relevant and growth oriented business strategies. With powerful web crawling, the volume of data gathered is varied, accurate and relevant. This data can then be studied minutely, analyzed in detail and arranged into meaningful clusters. With this weapon in your arsenal, you can take your startup a long way with smart decisions and clever implementations.

Web data extraction is a task best handled by professionals who have had rich experience in the field. Often, in-house web scraping teams are difficult to assemble and not economically viable to maintain, especially for startups. For a better solution, you can outsource your web scraping needs to a reliable web data extraction service for data collection. This way, you can get all the relevant intelligence you need without overstraining your workforce or having to employ additional personnel to handle web scraping. The company you outsource your work to can easily scrape data from multiple sources as per your requirements, and furnish you with actionable business intelligence that can help you take a lead in a competitive market.

Different Ways for Startups to use Web Data Extraction

Web scraping can be employed for many different purposes to yield different kinds of relevant data that generate actionable insights. For a startup, the important decision is how to use this powerful technique to provide valuable information that can make a difference for the future prospects of the company. Here are some interesting possibilities when it comes to impactful web data extraction for startups –

Fishing for Social Rankings and Backlinks

One of the most important business processes for a startup is competition analysis. This is one area where web data extraction can come across as an invaluable enabler. In the past, many startups have effectively used web scraping to fish for backlinks and social rankings related to competing companies.

Backlinks are important to reach a greater mass of better-targeted audiences, which can go on to increase customer base with minimal efforts. Social ranking is also an immensely important factor, as social actions on the internet are building blocks of opinion and reputation generation in this day and age. Keeping this in mind, you can use web data extraction to scrape for social rankings and backlinks related to content generated by your competing companies. After careful analysis, it is possible to arrive at concrete conclusions regarding what your competitors are doing well, and what sells the best.

This information is gold for marketers and sales personnel, and can be used to discern exactly what needs to be done to increase social buzz, generate favorable opinion, and win over customers from your competitors. You can also use this technique to develop high authority backlinks that help with SEO, targeted reach and organic traffic for your business website. For competition analysis, web scraping is a formidable tool.

Sourcing Contact Information

Another important aspect of business that startups can never ignore is good networking. Whether it is with customers, prospective customers, industry peers, partners, or competitors, excellent networking and open, transparent communication is essential for the success of your startup. For effective communication and networking, you need a large, solid list of contact information pertaining to your exact requirements.

Scraping data from multiple web sources gives you the perfect method of achieving this. With automated, fast web scraping, you can in a short time collect a wealth of important contact information that can be leveraged in many different ways. Whether it is the formation of lasting business relationships or making potential customers aware of what you have on offer, this information has the power to propel your startup to new levels of recognition.

For Ecommerce

If you sell your products and services online and want to stay on top of the competition when it comes to variety, pricing analysis, and special deals and offers, web scraping is the way to go. For many e-commerce startups, the problem of high CTR and low conversion is a stumbling block to higher bottom lines. To remedy problems like these and to ensure better sales, it is always a good idea to have a clear insight about your competition.

With web data extraction, you can be always aware of what competing companies are doing in terms of pricing strategies, product diversity and special customer offers. By considering that information while evaluating and cementing your own strategies, you can always ensure that you provide better value and range of products and services than your competitors, and therefore stay ahead of the competition.

For Marketing, Brand Promotion and Advertisement

For startups, the first wave of promotion and marketing is the one that holds the key to your long-term business success. It is during this phase that the first and most important public perception of your company is formed, and the rudiments of public opinion start taking shape. For this reason, it is crucial to be on point with your marketing and promotion during the early, formative years of your business.

To achieve this, you need a clear, in-depth understanding of your target audience. You need to categorize your target audience on the basis of many factors like age, gender, demographics, income groups and tastes and preferences. Such detailed understanding can only be possible when you have a large wealth of social data pertaining to your target audience. There is no better way of achieving this than by web data extraction.

With the help of data extraction services, you can gather large chunks of relevant data regarding your target audience, which can help you accurately evaluate the potential of each prospective customer as a possible addition to your business family. To ensure that you have a steady, early wave of customers to take your business off the blocks at a rapid pace, you need to devise marketing campaigns, promotional strategies and advertisements in accordance with the customer knowledge you derive through your web scraping efforts. This is a foolproof strategy to have marketing and promotional plans in place that achieve goals, bring in new business and provide your company with enough initial momentum to carry it through the later years of success.

To conclude, web data extraction can be a veritable tool in the hands of a startup. With the proper use and leveraging of this technique, your startup can gather the required business intelligence to shine in a competitive market and become a favorite with the customer base. Working with the right web data extraction company can be one of the most important business decisions you make as a startup owner.

Source: https://www.promptcloud.com/blog/web-data-extraction-services-for-startups/

Wednesday 6 May 2015

Web Scraping - Data Collection or Illegal Activity?

Web Scraping Defined

We've all heard the term "web scraping" but what is this thing and why should we really care about it?  Web scraping refers to an application that is programmed to simulate human web surfing by accessing websites on behalf of its "user" and collecting large amounts of data that would typically be difficult for the end user to access.  Web scrapers process the unstructured or semi-structured data pages of targeted websites and convert the data into a structured format.  Once the data is in a structured format, the user can extract or manipulate the data with ease.  Web scraping is very similar to web indexing (used by most search engines), but the end motivation is typically much different.  Whereas web indexing is used to help make search engines more efficient, web scraping is typically used for different reasons like change detection, market research, data monitoring, and in some cases, theft.

Why Web Scrape?

There are lots of reasons people (or companies) want to scrape websites, and there are tons of web scraping applications available today.  A quick Internet search will yield numerous web scraping tools written in just about any programming language you prefer.  In today's information-hungry environment, individuals and companies alike are willing to go to great lengths to gather information about all sorts of topics.  Imagine a company that would really like to gather some market research on one of their leading competitors...might they be tempted to invoke a web scraper that gathers all the information for them?  Or, what if someone wanted to find a vulnerable site that allowed otherwise not-so-free downloads?  Or, maybe a less than honest person might want to find a list of account numbers on a site that failed to properly secure them.  The list goes on and on.

I should mention that web scraping is not always a bad thing.  Some websites allow web scraping, but many do not.  It's important to know what a website allows and prohibits before you scrape it.

The Problem With Web Scraping

Web scraping rides a fine line between collecting information and stealing information.  Most websites have a copyright disclosure statement that legally protects their website information.  It's up to the reader/user/scraper to read these disclosure statements and follow along legally and ethically.  In fact, the F5.com website presents the following copyright disclosure:  "All content included on this site, such as text, graphics, logos, button icons, images, audio clips, and software, including the compilation thereof (meaning the collection, arrangement, and assembly), is the property of F5 Networks, Inc., or its content and software suppliers, except as may be stated otherwise, and is protected by U.S. and international copyright laws."  It goes on to say, "We reserve the right to make changes to our site and these disclaimers, terms, and conditions at any time."

So, scraper beware!  There have been many court cases where web scraping turned into felony offenses.  One case involved an online activist who scraped the MIT website and ultimately downloaded millions of academic articles.  This guy is now free on bond, but faces dozens of years in prison and $1 million if convicted.  Another case involves a real estate company who illegally scraped listings and photos from a competitor in an attempt to gain a lead in the market.  Then, there's the case of a regional software company that was convicted of illegally scraping a major database company's websites in order to gain a competitive edge.  The software company had to pay a $20 million fine and the guilty scraper is serving three years probation.  Finally, there's the case of a medical website that hosted sensitive patient information.  In this case, several patients had posted personal drug listings and other private information on closed forums located on the medical website.  The website was scraped by a media-research firm, and all this information was suddenly public.

While many illegal web scrapers have been caught by the authorities, many more have never been caught and still run loose on websites around the world.  As you can see, it's increasingly important to guard against this activity.  After all, the information on your website belongs to you, and you don't want anyone else taking it without your permission.

The Good News

As we've noted, web scraping is a real problem for many companies today.  The good news is that F5 has web scraping protection built into the Application Security Manager (ASM) of its BIG-IP product family.  The ASM provides web scraping protection against bots, session opening anomalies, session transaction anomalies, and IP address whitelisting.

The bot detection works with clients that accept cookies and process JavaScript.  It counts the client's page consumption speed and declares a client as a bot if a certain number of page changes happen within a given time interval.  The session opening anomaly spots web scrapers that do not accept cookies or process JavaScript.  It counts the number of sessions opened during a given time interval and declares the client as a scraper if the maximum threshold is exceeded.  The session transaction anomaly detects valid sessions that visit the site much more than other clients.  This defense is looking at a bigger picture and it blocks sessions that exceed a calculated baseline number that is derived from a current session table.  The IP address whitelist allows known friendly bots and crawlers (i.e. Google, Bing, Yahoo, Ask, etc), and this list can be populated as needed to fit the needs of your organization.

I won't go into all the details here because I'll have some future articles that dive into the details of how the ASM protects against these types of web scraping capabilities.  But, suffice it to say, ASM does a great job of protecting your website against the problem of web scraping.

You may also have noticed lots of other protection capabilities the ASM provides...brute force attack prevention, customized attack signatures, Denial of Service protection, etc.  You might be wondering how it does all that stuff as well.  Give us a little feedback on the topics you would like to see, and we'll start posting some targeted tech tips for you!

Thanks for reading this introductory web scraping article...and, be sure to come back for the deeper look into how the ASM is configured to handle this problem. For more information, check out this video from Peter Silva where he discusses ASM botnet and web scraping defense.

Source: https://devcentral.f5.com/articles/web-scraping-data-collection-or-illegal-activity

Tuesday 28 April 2015

Web Scraping – An Illegal Activity or Simple Data Collection?

Gone are the days when skillful extraction of information pertaining to real estate such as foreclosures, homes for sale, or mortgage records was considered difficult. Now, it is not only easy to extract data from real estate websites but also to scrape real estate data on a consistent basis to add more value to your portal, or ensure that updated data is available to your visitors at all times. From downloading actual scanned documents in the form of PDF files to scraping websites for deeds or mortgages, smartly designed data extraction tools can do it all.

However, the one question that still manages to come to the front in the minds of those who scrape real estate listings and other data is whether the act is illegal in nature or a simple way of collecting data.

Take a look.

Web Scraping—What is it?

Generally speaking, web scraping refers to programs that are designed to simulate human internet surfing and access websites on behalf of their users. These tools are effective in collecting large quantities of data that are otherwise difficult for end users to access. They process semi-structured or unstructured data pages of targeted websites and transform available data into a more structured format that can be extracted or manipulated by the user easily.

Although web scraping is quite similar to the web indexing used by search engines, its end motivation is much different. While web indexing makes search engines far more efficient, web scraping is used for reasons like market research, change detection, data monitoring, or in some cases, theft. But then, it is not always a bad thing. You just need to know if a website allows web scraping before proceeding with the act.

Fine Line between Stealing and Collecting Information

Web scraping rides an extremely fine line between the acts of collecting relevant information and stealing the same. The websites that have copyright disclosure statements in place to protect their website information are offended by outsiders raiding their data without due permission. In other words, it amounts to trespassing on their portal, which is unacceptable—both ethically and legally. So, it is very important for you to read all disclosure statements carefully and follow along in the right way. As web scraping cases may turn into felony offenses, it is best to guard against any kind of unscrupulous activity and take permission before scraping data.

The Good News

However, all is not grey in data extraction processes. Reputed agencies are helping their clients scrape valuable data for gaining more value through legal means and carefully used tools. If you are looking for such services, then do get in touch with a reliable web scraping company of your choice and take your business to the next levels of success.

Source: https://3idatascraping.wordpress.com/2015/03/11/web-scraping-an-illegal-activity-or-simple-data-collection/

Wednesday 22 April 2015

How to Properly Scrape Windows During The Cleaning Process

Removing ordinary dirt such as dust, fingerprints, and oil from windows seems simple enough. However, sometimes you may find stubborn caked-on dirt or debris on your windows that cannot be removed by standard window cleaning techniques such as scrubbing or using a squeegee. The best way to remove caked-on dirt from your windows is to scrape it off. Nonetheless, you have to be extra careful when you are scraping your windows, because they can be easily scratched and damaged. Here are a number of rules that you need to follow when you are scraping windows.

Rule No. 1: It is recommended that you use a professional window scraper to remove caked-on dirt and debris from your windows. This type of scraper is specially made for use on glass, and it comes with certain features that can prevent scratching and other kinds of damage.

Rule No. 2: It is important to inspect your window scraper before using it. Take a look at the blade of the scraper and make sure that it is not rusted. Also, it must not be bent or chipped off at the corners. If you are not certain whether the blade is in a good enough condition, you should just play it safe by using a new blade.

Rule No. 3: When you are working with a window scraper, always use forward plow-like scraping motions. Scrape forward and lift the scraper off the glass, and then scrape forward again. Try not to slide the scraper backwards, because you may trap debris under the blade when you do so. Consequently, the scraper may scratch the glass.

Rule No. 4: Be extra cautious when you are using a window scraper on tempered glass. Tempered glass may have raised imperfections, which make it more vulnerable to scratches. To find out if the window that you are scraping is made of tempered glass, you have to look for a label in one of its corners.

Window Scraping Procedures

Before you start scraping, you have to wet your window with soapy water first. Then, find out how the window scraper works by testing it in a corner. Scrape on the same spot three or four times in forward motion. If you find that the scraper is moving smoothly and not scratching the glass, you can continue to work on the rest of the window. On the other hand, if you feel as if the scraper is sliding on sandpaper, you have to stop scraping. This indicates that the glass may be flawed and have raised imperfections, and scraping will result in scratches.

After you have ascertained that it is safe to scrape your window, start working along the edges. It is best that you start scraping from the middle of an edge, moving towards the corners. Work in a one or two inch pattern, until all the edges of the glass are properly scraped. After that, scrape the rest of the window in a straight pattern of four or five inches, working from top to bottom. If you find that the window is beginning to dry while you are working, wet it with soapy water again.

Source: http://ezinearticles.com/?How-to-Properly-Scrape-Windows-During-The-Cleaning-Process&id=6592930

Tuesday 7 April 2015

The Nasty Problem with Scraping Results from the Engines

One theme that I've been concerned with this week centers around data transparency in the search engine world. Search engines provide information that is critical to the business of optimizing and growing a business on the web, yet barriers to this data currently force many companies to use methods of data extraction that violate the search engines' terms of service.

Specifically, we're talking about two pieces of information that no large-scale, successful web operation should be without. These include rankings (the position of their site(s) vs. their competitors) for important keywords and link data (currently provided most accurately through Yahoo!, but also available through MSN and in lower quality formats from Google).

Why do marketers and businesses need this data so badly? First we'll look at rankings:

•    For large sites in particular, rankings across the board will go up or down based on their actions and the actions of their competition. Any serious company who fails to monitor tweaks to their site, public relations, press and optimization tactics in this way will lose out to competitors who do track this data and, thus, can make intelligent business decisions based on it.

•    Rankings provide a benchmark that helps companies estimate their global reach in the search results and make predictions about whether certain areas of extension or growth make logical sense. If a company must decide on how to expand their content or what new keywords to target or even if they can compete in new markets, the business intelligence that can be extracted from large swaths of ranking data is critical.

•    Rankings can be mapped directly to traffic, allowing companies to consider advertising, extending their reach or forming partnerships.

And, on the link data side:

•    Temporal link information allows marketers to see what effects certain link building, public relations and press efforts have on a site's link profile. Although some of this data is available through referring links in analytics programs, many folks are much more interested in the links that search engines know about and count, which often includes many more than those that pass traffic (and also ignores/doesn't count some that do pass traffic).

•    Link data may provide references for reputation management or tracking of viral campaigns - again, items that analytics don't entirely encompass.

•    Competitive link data may be of critical importance to many marketers - this information can't be tracked any other way.

I admit it. SEOmoz is a search engine scraper - we do it for our free public tools, for our internal research and we've even considered doing it for clients (though I'm seriously concerned about charging for data that's obtained outside TOS). Many hundreds of large firms in the search space (including a few that are 10-20X our size) do it, too. Why? Because search engine APIs aren't accurate.

Let's look at each engine's abilities and data sources individually. Since we've got a few hundred thousand points of data (if not more) on each, we're in a good position to make calls about how these systems are working.

Google (all APIs listed here):

•    Search SOAP API - provides ranking results that are massively different from almost every datacenter. The information is often less than useless; it's actually harmful, since you'll get a false sense of what's happening with your positions.

•    AJAX Search API - This is really designed to be integrated with your website, and the results can be of good quality for that purpose, but it really doesn't serve the job of providing good stats reporting.

•    AdSense & AdWords APIs - In all honesty, we haven't played around with these, but the fact that neither will report the correct order of the ads, nor will they show more than 8 ads at a time tells me that if a marketer needed this type of data, the APIs wouldn't work.

Yahoo! (APIs listed here):

•    Search API - Provides ranking information that is a somewhat accurate map to Yahoo!'s actual rankings, but is occasionally so far off-base that it's not reliable. Our data points show a lot more congruity with Yahoo!'s results than Google's, but not nearly enough, when compared with scraped results, to be valuable to marketers and businesses.

•    Site Explorer API - Shows excellent information as far as number of pages indexed on a site and the link data that Yahoo! knows about. We've been comparing this information with that from scraped Yahoo! search results (for queries like linkdomain: and site:) and those at the Site Explorer page and find that there's very little quality difference in the results returned, though the best estimate numbers can still be found through a last page search of results.

•    Search Marketing API - I haven't played with this one at all, so I'd love to hear comments from those who have.

MSN:

•    Doesn't mind scraping as long as you use the RSS results. We do, we love them and we commend MSN for giving them out - bravo! (A minimal sketch of consuming results this way follows this list.) They've also got a web search SDK program, but we've yet to give it a whirl. The only problem is the MSN estimates, which are so far off as to be useless. The links themselves, though, are useful.

Ask.com

•    Though it's somewhat hidden, the XML.Teoma.com page allows for scraping of results and Ask doesn't seem to mind, though they haven't explicitly said anything. Again, bravo! - the results look solid, accurate and match up against the Ask.com queries. Now, if Ask would only provide links...
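To make the RSS route concrete, here is a minimal Python sketch of consuming search results from a feed rather than scraping HTML. The feed URL and query parameters are placeholders, not a real endpoint, and each engine's feed layout may differ slightly from the standard RSS 2.0 structure assumed here.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Placeholder feed URL - substitute the engine's actual RSS endpoint and query
    FEED_URL = "http://search.example.com/results?q=web+scraping&format=rss"

    with urllib.request.urlopen(FEED_URL) as response:
        tree = ET.parse(response)

    # Standard RSS 2.0 layout: rss/channel/item, each with a title and link
    for item in tree.findall("channel/item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        print(title, "-", link)

Because the results arrive as structured XML, there is no fragile HTML parsing involved, which is exactly why feeds are the friendlier option for both sides.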

I know a lot of you are probably asking:

•    "Rand, if scraping is working, why do you care about the search engines fixing the APIs?"

The straight answer is that scraping hurts the search engines, hurts their users and isn't the most practical way to get the data. Let me give you some examples:

•    Scraped queries have to look as much like real users as possible to avoid detection and banning - thus, they affect the query data that search engineers use to improve web search.

•    These queries also hit advertisers - falsifying the number of "real" impressions that advertisers see and lowering their CTRs unnaturally.

•    They take up search engine resources and though even the heaviest scraping barely impacts their server loads, it's still an annoyance.

With all these negative elements, and so many positive incentives to have the data, it's clear what's needed - a way for marketers/businesses to get the data they need without hurting the search engines. Here's how they can do it:

•    Provide the search ranking position of a site in the referral string - this works for ranking data, but not for link data, and since Yahoo! (and Google) both send referrals through re-directs at times, it wouldn't be a hard piece to add (a sketch of parsing such a parameter follows this list).

•    Make the APIs accurate, complete and unlimited

•    If the last option is too ambitious, the search engines could charge for API queries - anyone who needs the data would be more than happy to pay for it. This might help with quality control, too.

•    For link data - serve up accurate, holistic data in programs like Google Sitemaps and Yahoo! Search Submit (or even Google Analytics). Obviously, you'd only get information about your own site after verifying.
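As a rough illustration of the referral-string idea, here is a short Python sketch. It assumes, purely hypothetically, that the engine appends the keyword as "q" and the ranking position as "cd" to the referring URL; the parameter names and URL shape are invented for illustration, not something the engines are documented to provide here.

    from urllib.parse import urlparse, parse_qs

    def rank_from_referrer(referrer):
        """Pull a hypothetical keyword and rank parameter out of a search referrer URL."""
        query = parse_qs(urlparse(referrer).query)
        keyword = query.get("q", [""])[0]          # the search phrase
        position = query.get("cd", [None])[0]      # the assumed rank parameter
        return keyword, int(position) if position else None

    # Example referrer string an analytics package might record
    print(rank_from_referrer("http://www.example-engine.com/search?q=web+scraping&cd=4"))

With something like this in an analytics pipeline, marketers could read rankings from the traffic they already receive instead of firing scraped queries at the engines.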

I've talked to lots of people at the search engine level about making changes this week (including Jeremy, Priyank, Matt, Adam, Aaron, Brett and more). I can only hope for the best...

Source: http://moz.com/blog/the-nasty-problem-with-scraping-results-from-the-engines

Friday 27 March 2015

The Great Advantages of Data Extraction Software – Why a Company Needs it?

Data extraction has become a huge challenge for large corporations and businesses, and it needs to be handled technically and safely. There are many different approaches to extracting data from the web, and various tools have been designed to solve specific problems.

Moreover, algorithms and advanced techniques have also been developed for data extraction. Among these, data extraction software is widely used to pull information from the web.

Data Extraction Software:

This is a program specifically designed to collect and organize the information or data from a website or webpage and reformat it.

Uses of Data Extraction Software:

Data extraction software can be used at various levels, including the social web and enterprise levels.

Enterprise Level: At the enterprise level, data extraction techniques are used as the prime tool for analyzing data in business process re-engineering, business systems and competitive intelligence systems.

Social Web Level: This type of web data extraction technique is widely used for gathering, in large amounts, the structured data continuously generated by Web 2.0 applications, online social network users and social media.

To specify other uses of Data Extraction software:

  •     It helps in assembling stats for business plans
  •     It helps to gather data from public or government agencies
  •     It helps to collect data for legal needs

Does the Data Extraction Software make Your Job Simple?

The use of data extraction software has been widely appreciated by many large corporations. Here are a few points in favor of using such software:
  •     A data toolbar provides web scraping tools to automate the process of web data extraction
  •     Point at the data fields to be collected and the tool will do the rest
  •     No technical skills are required to use the tool
  •     It is possible to extract a huge number of data records in just a few seconds

Benefits of Data Extraction Software:

Data extraction software benefits many computer users. Here are a few remarkable benefits of such software (a minimal scraping sketch follows the list):
  •     It can extract detailed data such as descriptions, names, prices, images and more, as defined, from a website
  •     It is possible to create projects in the extractor and extract the required information automatically from the site without user intervention
  •     The process saves a huge amount of effort and time
  •     It makes extracting data from many kinds of websites easy, such as online auctions, online stores, real estate portals, business directories and shopping portals
  •     It makes it possible to export extracted data to various formats such as Microsoft Excel, HTML, SQL, XML, Microsoft Access, MySQL and more
  •     This allows the data to be processed and analyzed in any custom format
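As a rough sketch of what such a tool does under the hood, the following Python snippet pulls a product name, price and description from a page and writes them to a CSV file that Excel or a database import can read. The URL and the CSS selectors are invented for illustration; a real site would need its own selectors, and you should confirm the site permits scraping first.

    import csv
    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL and selectors - adjust for a site you are allowed to scrape
    url = "http://www.example-store.com/product/123"
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    record = {
        "name": soup.select_one("h1.product-title").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
        "description": soup.select_one("div.description").get_text(strip=True),
    }

    # Export in a format that Excel (or any database import) can open
    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price", "description"])
        writer.writeheader()
        writer.writerow(record)

Commercial extractors essentially wrap this loop in a point-and-click interface and repeat it across many pages.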

Who majorly Benefits from Data Extraction Software?


Any computer user can benefit from data extraction software, but it is especially useful for:
  •     Business people collecting market figures, real estate data and product pricing data
  •     Book lovers extracting information about titles, authors, images, descriptions, prices and more
  •     Collectors and hobbyists extracting auction and bidding information
  •     Journalists extracting articles and news from news websites
  •     Travelers extracting information about holiday destinations, vacations, prices, images and more
  •     Job seekers extracting information about available jobs, employers and more

Websitedatascraping.com is fully capable of handling web data scraping, website data scraping, web scraping services, website scraping services, data scraping services, product information scraping and Yellow Pages data scraping.

Tuesday 17 March 2015

Professional Web Scraping Process

Web scraping is usually regarded as a form of data mining and knowledge discovery. It is the process of extracting useful data and relationships from data sources such as web pages, databases and search engines, and it employs pattern matching and statistical techniques. It is important to note that web scraping does not so much borrow from fields like machine learning, databases and data visualization as support them.

The web scraping process is a complex one that requires not only time but also people with expertise in the field. This is because the internet is a dynamic resource that changes all the time; the data you could extract from a certain website a month ago will not be the same data you would extract now. This rapid change makes it difficult to rely on old data and therefore calls for a regular web scraping process, performed repeatedly so that the data you obtain is accurate and can be relied upon.

It is important to understand that many areas of business, science and other fields use large amounts of data, and that data needs to be meaningful and applied with knowledge. Web scraping is sometimes overlooked, but in essence it can provide more useful information than statistical methods alone can produce, and web scraping methods give you more control over the data.

Usually the data found on the internet is noisy data, full of advertisements and pop-ups. Data on the internet can also be dynamic, sparse, static or heterogeneous, and so on. Such problems occur on a very large scale, and this is where professional web scraping companies come in; statistical methods alone would never succeed against them, which is why web scraping is needed.

Process of web scraping

1. Identification of data sources and selection of target data. You should not harvest just any data, but data that is relevant and useful in its application - data that will benefit your company. This is an important step in the web scraping process.

2. Pre-process. This involves cleaning the data and selecting attributes before it is harvested. Web scraping is usually done on specific websites that are relevant to your business. For instance, if you have an online store and need information about your competitors' products, then you need data from other relevant websites such as e-commerce stores.

3. Web scraping. This involves mining the data so as to extract patterns and models that are beneficial to your business.

4. Post-process. After web scraping is done, it is important to identify the useful data that can be used in your business for decision making and other purposes.

It is important to note that the patterns identified need to be novel, understandable, potentially viable and valid for web scraping process to make sense in business data harvesting.
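A minimal Python sketch of those four steps might look like the following. The target URLs, the CSS selectors and the "cheapest items" rule at the end are all assumptions made up for illustration.

    import requests
    from bs4 import BeautifulSoup

    # 1. Identification of data sources and selection of target data
    urls = ["http://www.example-competitor.com/products?page=%d" % n for n in range(1, 4)]

    records = []
    for url in urls:
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")

        # 2. Pre-process: keep only the product listings, ignoring ads and navigation
        for product in soup.select("div.product"):      # selector is an assumption
            name = product.select_one("h2").get_text(strip=True)
            price_text = product.select_one(".price").get_text(strip=True)
            price = float(price_text.lstrip("$").replace(",", ""))

            # 3. Web scraping: extract the fields of interest
            records.append({"name": name, "price": price})

    # 4. Post-process: a trivial pattern for decision making - the ten cheapest items
    cheapest = sorted(records, key=lambda r: r["price"])[:10]
    print(cheapest)

In practice the pre-processing and post-processing stages are where most of the expertise mentioned above is spent.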

Source:http://www.loginworks.com/blogs/web-scraping-blogs/professional-web-scraping-process/

Monday 16 March 2015

6 Benefits Associated with Data Mining

Data has been used from time immemorial by companies to manage their operations. Organizations need data to expand their business operations, reduce costs, improve their marketing force and, above all, improve profitability. Data mining is aimed at creating information assets and using them to leverage these objectives.

In this article, we discuss some of the common questions asked about the data mining technology. Some of the questions we have addressed include:

•    How can we define data mining?
•    How can data mining affect my organization?
•    How can my business get started with data mining?

Data Mining Defined

Data mining can be regarded as a new concept in the enterprise decision support system, usually abbreviated as DSS. It does more than complement and interlock with DSS capabilities such as reporting and query; it can also be used alongside on-line analytical processing (OLAP), traditional statistical analysis and data visualization. Those technologies come up with tables, graphs and reports of past business history.

We may define data mining as the modeling of hidden patterns and the discovery of relationships in large volumes of data. It is important to note that data mining is very different from other retrospective technologies because it involves the creation of models. Using this technology, the user can discover patterns and use them to build models without even knowing exactly what they are looking for. It helps explain why past events happened and even predicts what is likely to happen.

Some of the information technologies that can be linked to data mining include neural networks, fuzzy logic, rule induction and genetic algorithms. In this article we do not cover those technologies, but focus on how data mining can be used to meet your business needs and how the solutions can then be translated into dollars.

Setting Your Business Solutions and Profits

One of the common questions asked about this technology is; what role can data mining play for my organization? At the start of this article we described some of the opportunities that can be associated with the use of data. Some of those benefits include cost reduction, business expansion, sales and marketing and profitability. In the following paragraphs we look into some of the situations where companies have used data mining to their advantage.

Business Expansion

Equity Financial Limited wanted to expand its customer base and attract new customers, and used a LoanCheck offer to meet that objective. To initiate the loan, a customer simply had to go to any Equity branch and cash the $6,000 LoanCheck that Equity had mailed to its existing customers as a promotion. The Equity database tracked about 400 characteristics of every customer, covering the customer's loan history, active credit cards, current credit card balances and whether they had responded to previous loan offers. Equity used data mining to sift through the 400 customer features, find the significant ones, and build a model based on responses to the LoanCheck offer. It then applied this model to 500,000 potential customers obtained from a credit bureau and selectively mailed those the model identified as the most likely prospects. At the end of the process, Equity generated a total of $2.1M in extra net income from 15,000 new customers.

Reduction of Operating Costs

Empire is one of the largest insurance companies in the country. In order to compete with other insurance companies, it has to offer quality services while reducing costs, and therefore it has to attack costs that come in the form of fraud and abuse. This demands considerable investigation skill and the use of data management technology - in particular, a data mining application that can profile every physician in the network based on the claims records of every patient in the data warehouse. The application can detect subtle deviations in a physician's behavior relative to his or her peer group, and these deviations are then reported to the fraud and intelligence investigators as a "suspicion index." With this data mining effort, the company was able to save $31M, $37M and $41M from fraud in the first three years respectively.

Sales Effectiveness and Profitability

In this case we look at the pharmaceutical sector. Sales representatives there have a wide assortment of tools for promoting products to physicians, including product samples, clinical literature, dinner meetings, golf outings, teleconferences and many more. Knowing which promotional methods are ideal for a particular physician is therefore extremely valuable; getting it wrong costs the company sales-call dollars and lost revenue.

Through data mining, a drug maker was able to link eight months of promotional activity with the corresponding sales recorded in its database, and used this information to build a predictive model for each physician. The model revealed that of the six promotional alternatives, only three had a significant impact. The company then used the knowledge from the data mining models to tailor its promotions accordingly, improving the return on that spending.

Looking at these case studies, ask yourself: was data mining necessary?

Getting Started


All the cases presented above show how data mining yielded results for various businesses. Some of the results led to increased revenue and a bigger customer base; others were bottom-line improvements in cost savings and productivity. In the next few paragraphs we try to answer the question: how can my company get started and begin realizing the benefits of data mining?

The right time to start your data mining project is now. With the emergence of specialized data mining companies, starting the process has been simplified and the costs greatly reduced. A data mining project can offer important insights into your business and can also strengthen the case for creating a data warehouse.

In this article we have addressed some of the common questions regarding data mining, the benefits associated with the process and how a company can get started. With this knowledge, your company can start with a pilot project and then continue building a data mining capability to improve profitability, market products more effectively, expand the business and reduce costs.

Source: http://www.loginworks.com/blogs/web-scraping-blogs/255-benefits-associated-with-data-mining/

Monday 9 March 2015

Why Web Scraping Software Won't Help

How do you get a continuous stream of data from these websites without getting stopped? Scraping logic depends on the HTML the web server sends out for each page request; if anything changes in that output, it is most likely going to break your scraper setup.

If you are running a website that depends on continuously updated data from other websites, it can be dangerous to rely on software alone.

Some of the challenges you should think about:

1. Webmasters keep changing their websites to be more user friendly and better looking, which in turn breaks the delicate data extraction logic of a scraper (see the sketch after this list).

2. IP address block: If you continuously keep scraping from a website from your office, your IP is going to get blocked by the "security guards" one day.

3. Websites are increasingly using better ways to send data - Ajax, client-side web service calls and so on - making it ever harder to scrape data off these websites. Unless you are an expert in programming, you will not be able to get the data out.

4. Think of a situation where your newly set-up website has started flourishing and suddenly the dream data feed that you used to get stops. In today's world of abundant resources, your users will switch to a service that is still serving them fresh data.
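To make challenge 1 concrete, here is a small Python sketch of the kind of defensive extraction a maintained scraper needs. The two selectors stand for an old page layout and a redesigned one; both are made up for illustration.

    from bs4 import BeautifulSoup

    def extract_price(html):
        """Try the selector for the current layout first, then fall back to the old one."""
        soup = BeautifulSoup(html, "html.parser")
        for selector in ("span.price-now", "div#price"):   # both selectors are assumptions
            node = soup.select_one(selector)
            if node:
                return node.get_text(strip=True)
        return None  # layout changed again - time for a human to update the scraper

    print(extract_price('<div id="price">$9.99</div>'))            # old markup still works
    print(extract_price('<span class="price-now">$9.99</span>'))   # redesigned markup

Even with fallbacks like this, every redesign still needs someone watching and updating the logic, which is exactly the point being made here.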

Getting over these challenges


Let experts help you, people who have been in this business for a long time and have been serving clients day in and out. They run their own servers which are there just to do one job, extract data. IP blocking is no issue for them as they can switch servers in minutes and get the scraping exercise back on track. Try this service and you will see what I mean here.

Dheeraj Juneja, Founder & CEO, Loginworks Softwares, A Virtual IT Team for your business

Loginworks Softwares Web Scraping Service

Read more about various technical stuff at our blogs: Technical Blogs

Source: http://ezinearticles.com/?Why-Web-Scraping-Software-Wont-Help&id=4550594

Monday 2 March 2015

Basics of Web Data Mining and Challenges in Web Data Mining Process

Today the World Wide Web is flooded with billions of static and dynamic web pages created with programming languages such as HTML, PHP and ASP. The web is a great source of information, offering a lush playground for data mining. Since the data stored on the web is in various formats and dynamic in nature, it is a significant challenge to search, process and present the unstructured information available there.

The complexity of a web page far exceeds that of any conventional text document. Web pages on the internet lack uniformity and standardization, while traditional books and text documents are much simpler in their consistency. Further, search engines, with their limited capacity, cannot index all the web pages, which makes data mining extremely inefficient.

Moreover, the Internet is a highly dynamic knowledge resource and grows at a rapid pace. Sports, news, finance and corporate sites update their pages on an hourly or daily basis. Today the web reaches millions of users with different profiles, interests and usage purposes. Each of them needs good information but often doesn't know how to retrieve relevant data efficiently and with the least effort.

It is important to note that only a small section of the web possesses really useful information. There are three usual methods that a user adopts when accessing information stored on the internet:

• Random surfing i.e. following large numbers of hyperlinks available on the web page.

• Query based search on Search Engines - use Google or Yahoo to find relevant documents (entering specific keywords queries of interest in search box)

• Deep query searches i.e. fetching searchable database from eBay.com's product search engines or Business.com's service directory, etc.

To make the web an effective resource for knowledge discovery, researchers have developed efficient data mining techniques to extract relevant data easily, smoothly and cost-effectively.

Source:http://ezinearticles.com/?Basics-of-Web-Data-Mining-and-Challenges-in-Web-Data-Mining-Process&id=4937441

Wednesday 25 February 2015

Achieving Sustainability in Mining

There's so much that our planet gives us for our consumption. These things come in different shapes and sizes, and some of the most abundant of them are minerals. Minerals are essential for living in these modern times, and when it comes to extracting them, mining is still the primary method used.

One of the biggest issues that any industry faces is sustainability, and the mining sector is certainly no exception. Some of the things that constrain sustainability in this industry are the ever-increasing demand for minerals, the consumption of resources needed to extract and process metals, and the pollution caused by the extraction process.

Increasing Demand for Minerals

There's no question that there's growth in the extraction of construction minerals. As more and more countries become more industrialized, the demand for such minerals is almost directly proportional to the growth in the construction industry. In the 20th century, we saw a growth in the extraction of construction materials. Demand for ores and industrial minerals also increased.

Impacts

Aside from the obvious impact mining has on the environment, it can also have a negative social impact. In order to keep up with the demand for mined resources, there is a corresponding increase in mining activities. In the course of conducting those activities, certain things can be overlooked, including the short-, medium- and even long-term effects on the communities where they take place. This is where the need arises to balance the economic benefits of mining against its potential harm to the environment.

Sustainability and Maximizing Mining Benefits

There are ways to maximize the benefits we can get from mining as we improve sustainability both on the environmental and social fronts. This was specifically addressed in the Plan of Implementation of the World Summit on Sustainable Development. It identified three priority areas:

a. Support efforts to address the environmental, economic, health and social impacts and benefits of mining, minerals and metals throughout their life cycle;

b. Enhance the participation of stakeholders, including local and indigenous communities and women, to play an active role in minerals, metals and mining development throughout the life cycles of mining operations; and

c. Foster sustainable mining practices through the provision of financial, technical and capacity-building support to developing countries and countries with economies in transition for the mining and processing of minerals.

As long as efforts are made for mining to be environmentally, economically, and socially sustainable, we can enjoy the many benefits of mining without worrying about and suffering the potentially harmful effects mining can have on people and nature.

Source: http://ezinearticles.com/?Achieving-Sustainability-in-Mining&id=8108499

Tuesday 24 February 2015

How Gold Mining of the Past Creates New Gold for Cash

The history of gold mining and the different methods for retrieving and producing gold from the earth is quite extensive. Gold has been extracted as early as 2000 BC, through the ancient Roman civilization, and has never stopped being mined since then. While the techniques have shifted slightly over the years, the process of finding gold and converting it into wealth has remained consistent, and today you can sell gold for cash with ease. It is interesting to examine how this gold originally came from the earth.

The Romans, as we know, were well advanced in their approach to science, and this benefited their approach to gold mining. Hydraulic mines and pumps were used to excavate gold from the various regions where it was discovered, and gold discoveries prompted the capture and expansion of several territories and countries by the Roman Empire. When gold was mined it was often used to produce coinage and served as the primary currency of exchange for goods and services, money that truly represented its value. Mining with gold panning techniques also most likely goes back to the Romans. This technique requires a prospector to slosh gold-bearing sediment in a pan with water, using the naturally higher density of the metal to shift it to the bottom of the pan while the other dirt and rock is forced out over the top.

Between 1840 and 2000, the capacity for gold extraction exploded, with world gold production growing from a mere 1 ton to currently around 2,500 tons. Over this time frame the financial significance of gold hasn't changed, and people continue to sell gold for cash. The increase in production, however, is attributable to more advanced machinery and mining practices. Hard rock mining is performed when gold is found encased in rock and requires heavy machinery or explosives to grind the rock down to the point where the gold can be separated. For gold found in veins in loose soil, rock or sediment, a common process of either dredging or sluicing is performed. These techniques are very similar to panning, in that they allow the gold to settle to the bottom, but are more practical in commercial applications.

These are just a few of the paths your gold may have followed over the course of its use in human history. Whatever the case, it is always possible to sell gold for cash so that it may continue to follow its path and purpose in the human world.

Derek Robertson is a financial market analyst and writer. He has written for several years for publications in print and now for many blogs and online information resources. His background in investigative journalism enables him to provide a unique and unbiased perspective on many of the subjects he writes about.

He also specializes in cash for Gold and Sell Gold.

Source:http://ezinearticles.com/?How-Gold-Mining-of-the-Past-Creates-New-Gold-for-Cash&id=5012666

Saturday 21 February 2015

CSR in the Extraction Sector

A study commissioned by the Canadian Mining industry found that Canadian mining companies were involved in 4 times as many mining "incidents" as companies from other countries. The study was intended for internal consumption only but has been leaked to the press recently. The study found that Canadian mining companies were involved in nearly two thirds of the 171 "high profile" environmental and human rights violations it studied occurring between 1999 and 2009. Members of the mining industry pointed out that the occurrences are in proportion to their representation on the global mining scene, indicating that they were no better or worse than companies from other countries.

First some background on the study. The study findings were captured in a report titled "Corporate Social Responsibility & the Canadian International Extractive Sector: A Survey". The report was prepared for the Prospectors and Developers Association of Canada (PDAC) by the Canadian Centre for the Study of Resource Conflict (CCSRC). The purpose of the study was to measure the level of Corporate Social Responsibility (CSR) in the "extractive" sector. The extractive sector, for those of us untutored in the terminology means exploration, gas, oil, and mining companies. The document leaked to the press was a first draft of the report, not the final draft. I should also mention that there is a bill, C-300, before the Canadian parliament which would make financing for foreign ventures contingent on meeting federally defined CSR standards. The exploration, gas, oil, and mining companies, and the organizations which represent them are very much against this bill. Leaking the negative aspects of this report was fortuitous for those in support of bill C-300 and disastrous for those opposed to it.

One of the observations the report makes is that adoption of formal CSR policies by companies with international interests is "remarkably low", but that those companies which have adopted CSR policies have experienced positive outcomes. The CCSRC contacted 584 companies which they felt met their criteria to participate in the study. Of those, 202 chose to participate. The first survey question was "Do you have a CSR policy or Code of International Business Conduct?" 56 of the 202 companies had documented policies in place. The study broke the 202 companies they surveyed into "junior" and "major" companies. 50% of the companies designated as major had documented CSR policies while only 21% of junior companies had one.

The survey also asked about the positive effects of a CSR policy. 24% of respondents claimed a reduction in conflicts or complications, 62% claimed better community relations (relations with the communities they were doing business in), and 25% reported increased shareholder interest. On the downside, 24% reported increased administration costs and 25% reported increased operating costs. One question they failed to ask was whether the benefits outweighed the costs.

The information I've stated in the preceding 2 paragraphs was gleaned from the final draft of the report. I don't have access to the first draft but apparently it described some of the 171 violations they were addressing in the study. I reported on one such violation in Project Management Tips section of this web site under the title "CSR Problems". The incidents reported on reflect the difficulty faced by companies who conduct business in some international locations. These incidents juxtapose our Canadian values and ethics with those of the countries our exploration, gas, mining, and oil companies do business in. One incident reported on, and attributed to the mining company's lack of CSR by the media, pitted one host community against another with the resulting violence blamed on the Canadian mining company. I'm not suggesting here that these companies have not made mistakes in the past, or that improvements cannot be made in their CSR efforts, I am suggesting that we should have realistic expectations about the effectiveness of a CSR policy to prevent any problems in a foreign venture.

A reasonable expectation in some cases would be that the company have a documented CSR policy which conforms to the standards and ethics of this country (Canada), abides by the laws of the host country, and conforms to the standards and ethics of the host country. The expectation should be tempered with the acknowledgment that the operating environment these companies encounter in host countries can be radically different than that found here. For example, when one community is in conflict with another over whether a mining operation should take place, we tend to look to non-violent forms of dispute resolution where some countries may resort to extreme violence to settle the dispute. Canadian companies frequently hire locals as security guards to protect their property as local authorities cannot perform this duty for one reason or another. It is reasonable to expect the hiring company to do its due diligence in hiring these people to ensure they don't create a threat to the surrounding community. It is not reasonable to expect that there will be no conflicts arising out of these situations. Where it is suspected that a security guard overstepped their authority, or engaged in illegal behaviour, it is reasonable to expect the employer to cooperate with the local authorities in the investigation.

North American companies doing business internationally have long had to deal with conflicts between acceptable corporate behaviour in their own country and acceptable behaviour in the host country. Bribery is the classic example. There are countries where bribery is not only accepted but essential to conducting business. Our laws will convict anyone proved to have offered a bribe but failure to pay the bribe may result in a failure to perform on the part of the North American company. Failure to perform might result in the loss of all or part of the company's investment in the project. Holding a company to this type of double standard can only result in one of 2 outcomes: the company will break the rule against bribery, or the company will cease to do business in that host country.

Since this web site is aimed at the project management community, let's draw some conclusions from the survey and CSR in general that may help project managers. The first conclusion I would draw from all of the above is that the CSR policy that governs your project must describe achievable goals. By this I mean that the goals, objectives, and standards stated in the policy must be within the project's power to achieve, or comply with. The second conclusion is that the right CSR policy carefully implemented can provide a business benefit to the organization. It is the project manager's job to ensure that those benefits are realized.

The goals and objectives of the project must include goals and objectives in support of the CSR policy. Those goals and objectives should be spelled out in the Project Charter and the connection between those goals and objectives and the CSR policy clearly defined. Make sure that the CSR related goals and objectives you set for the project are clearly defined, measurable, and obtainable and then agree with your stakeholders on the conditions that will indicate the goals have been met. Check for CSR policy goals and objectives that might conflict with each other and any of your project's goals and objectives, both CSR related and non-CSR. Goals and objectives you feel might conflict with each other, or with the CSR policy should be resolved by senior management. Start your escalation by drawing the project sponsor's attention to the conflict and ask for their help with resolution.

Source: http://ezinearticles.com/?CSR-in-the-Extraction-Sector&id=5675024