Monday 29 December 2014

How to scrape address from Google Maps

If you want to build a new online directory based website and want it to be popular with latest web contents, then you need the help of web scraping services from iWeb scraping. If you want to scrape address from maps.google.com, there is a specialized web scraping tool developed by iWeb scraping which can do the job for you. There are plenty of benefits with web scraping which includes market research, gathering customer information, managing product catalogs, compare prices, gather real estate data, gather job posting information etc. Web scraping technology is very popular nowadays and it saves lot of time and effort involved in manual extraction of data from websites.

The web scraping tools developed iWeb Scraping is very user-friendly and can extract specific information from targeted websites. It converts data from HTML web pages to useful formats like Excel spread sheets or Access database. Whatever web scraping requirements you have, you can contact iWeb Scraping as they have more than 3.5 years of web data extraction experience and offer the best prices in the industry. Also their services are available in 24x7 basis and free pilot projects will be done based on request.

Companies which require specific web data and look for an application which can automate the process and export the HTML data in structured format could benefit greatly from web scraping applications of iWeb scraping. You can easily extract data from multiple target websites, parse and re-assemble the information in HTML format to database or spread sheets as you wish. The application has simple point-and-click user-interface and any beginner can use it scrape address from Google Maps. If you want to gather address of people in particular region from Google maps, you can do it with help of web scraping application developed by iWebscraping.

Web Scraping is a technology that able to digest target website databases that are visible only as HTML web pages, and create a local, identical replica of those databases as a information or result. With our web scraping & web data extraction service we can capture web pages, then pin-point specific pieces of data/information you'd like to extract from web pages. What is needed in this process is much more than a Website crawler and set of Website wrappers. The time required to do web data extraction goes down in comparison to manually data copying and pasting job.

Source:http://www.articlesbase.com/information-technology-articles/how-to-scrape-address-from-google-maps-4683906.html

Wednesday 24 December 2014

Limitations and Challenges in Effective Web Data Mining

Web data mining and data collection is critical process for many business and market research firms today. Conventional Web data mining techniques involve search engines like Google, Yahoo, AOL, etc and keyword, directory and topic-based searches. Since the Web's existing structure cannot provide high-quality, definite and intelligent information, systematic web data mining may help you get desired business intelligence and relevant data.

Factors that affect the effectiveness of keyword-based searches include:

• Use of general or broad keywords on search engines result in millions of web pages, many of which are totally irrelevant.

• Similar or multi-variant keyword semantics my return ambiguous results. For an instant word panther could be an animal, sports accessory or movie name.

• It is quite possible that you may miss many highly relevant web pages that do not directly include the searched keyword.

The most important factor that prohibits deep web access is the effectiveness of search engine crawlers. Modern search engine crawlers or bot can not access the entire web due to bandwidth limitations. There are thousands of internet databases that can offer high-quality, editor scanned and well-maintained information, but are not accessed by the crawlers.

Almost all search engines have limited options for keyword query combination. For example Google and Yahoo provide option like phrase match or exact match to limit search results. It demands for more efforts and time to get most relevant information. Since human behavior and choices change over time, a web page needs to be updated more frequently to reflect these trends. Also, there is limited space for multi-dimensional web data mining since existing information search rely heavily on keyword-based indices, not the real data.

Above mentioned limitations and challenges have resulted in a quest for efficiently and effectively discover and use Web resources. Send us any of your queries regarding Web Data mining processes to explore the topic in more detail.

Source: http://ezinearticles.com/?Limitations-and-Challenges-in-Effective-Web-Data-Mining&id=5012994

Thursday 18 December 2014

Data Extraction - A Guideline to Use Scrapping Tools Effectively

So many people around the world do not have much knowledge about these scrapping tools. In their views, mining means extracting resources from the earth. In these internet technology days, the new mined resource is data. There are so many data mining software tools are available in the internet to extract specific data from the web. Every company in the world has been dealing with tons of data, managing and converting this data into a useful form is a real hectic work for them. If this right information is not available at the right time a company will lose valuable time to making strategic decisions on this accurate information.

This type of situation will break opportunities in the present competitive market. However, in these situations, the data extraction and data mining tools will help you to take the strategic decisions in right time to reach your goals in this competitive business. There are so many advantages with these tools that you can store customer information in a sequential manner, you can know the operations of your competitors, and also you can figure out your company performance. And it is a critical job to every company to have this information at fingertips when they need this information.

To survive in this competitive business world, this data extraction and data mining are critical in operations of the company. There is a powerful tool called Website scraper used in online digital mining. With this toll, you can filter the data in internet and retrieves the information for specific needs. This scrapping tool is used in various fields and types are numerous. Research, surveillance, and the harvesting of direct marketing leads is just a few ways the website scraper assists professionals in the workplace.

Screen scrapping tool is another tool which useful to extract the data from the web. This is much helpful when you work on the internet to mine data to your local hard disks. It provides a graphical interface allowing you to designate Universal Resource Locator, data elements to be extracted, and scripting logic to traverse pages and work with mined data. You can use this tool as periodical intervals. By using this tool, you can download the database in internet to you spread sheets. The important one in scrapping tools is Data mining software, it will extract the large amount of information from the web, and it will compare that date into a useful format. This tool is used in various sectors of business, especially, for those who are creating leads, budget establishing seeing the competitors charges and analysis the trends in online. With this tool, the information is gathered and immediately uses for your business needs.

Another best scrapping tool is e mailing scrapping tool, this tool crawls the public email addresses from various web sites. You can easily from a large mailing list with this tool. You can use these mailing lists to promote your product through online and proposals sending an offer for related business and many more to do. With this toll, you can find the targeted customers towards your product or potential business parents. This will allows you to expand your business in the online market.

There are so many well established and esteemed organizations are providing these features free of cost as the trial offer to customers. If you want permanent services, you need to pay nominal fees. You can download these services from their valuable web sites also.

Source: http://ezinearticles.com/?Data-Extraction---A-Guideline-to-Use-Scrapping-Tools-Effectively&id=3600918

Monday 15 December 2014

Do blog scraping sites violate the blog owner's copyright?

I noticed that my blog has been posted on one of these website scraping sites. This is the kind of site that has no original content, but just repeats or scrapes content others have written and does it to get some small amount of ad income from ads on the scraping site. In essence the scraping site is taking advantage of the content of the originating site in order to make a few dollars from people who go to the site looking for something else. Some of these websites prey on misspelling. If you accidentally misspell the name of an original site, you just may end up with one of these patently commercial scraping sites.

Google defines scraping as follows:

•    Sites that copy and republish content from other sites without adding any original content or value
•    Sites that copy content from other sites, modify it slightly (for example, by substituting synonyms or using automated techniques), and republish it
•    Sites that reproduce content feeds from other sites without providing some type of unique organization or benefit to the user

My question, as set out in the title to this post, is whether or not scraping is a violation of copyright. It turns out that the answer is likely very complicated.  You have to look at the definition of a scraping site very carefully. Let me give you some hypotheticals to show what I mean.

Let's suppose that I write a blog and put a link in my blog post to your blog. Does that link violate your copyright? I can't imagine that anyone would think that there was problem with linking to another website on the Web. In this case, there is no content from the originating site, just a link.

But let's carry the hypothetical a little further. What if I put a link to your site and quote some of your content? Does this violate copyright law? If you are acquainted with any of the terminology of copyright law; think fair use. The issue here is whether or not the "quoted" material is a substantial reproduction of the entire original content? I would have the opinion that duplicating an entire blog post either with or without attribution would be a violation of the originator's copyright.

So is the scraping website protected by the "fair use" doctrine? Does the fact that the motivation for listing the original websites is to make money have anything to do with how you would decide if there was or was not a violation of the originator's copyright? By the way, the copyright does not make a distinction between a commercial and non-commercial use of the original constituting or not constituting a violation of copyright. The fact that the reproducing (scraping) party does not make money from the reproduction is not a factor in the issue of violation, although it may ultimately be an issue as to the amount of damages assessed.

Does the fact that the actions of the scraper annoy me, make any difference? I would answer, not in the least. Whether or not you are annoyed by the violation of the copyright makes no difference as to whether or not there is a violation. Likewise, you have no independent claims for your wounded feelings because of the copied content. Copyright is a statutory action (i.e. based on statutory law) and unless the cause of action is recognized by the law, there is no cause of action. Now, in an outrageous case, you may have  some kind of tort (personal injury) claim, but that is way outside of my hypothetical situation.

So what is the answer? Does scraping violate the originator's copyright? If only a small portion of the blog is copied (scraped) then I would have to have the opinion that it is not. Essentially, no matter what the motivation of the scrapper, there is not enough content copied to violate the fair use doctrine. Now, that is my opinion. Your's might differ. That is what makes lawsuits.

Do I think there are other reasons why scraping websites are objectionable? Certainly, but those reasons have nothing to do with copyright and they are probably the subject of another different blog post. So, if you are reading this from scraping website, bear in mind that there may be a serious problem with that type of website.

Source:http://genealogysstar.blogspot.in/2013/05/do-blog-scraping-sites-violate-blog.html

Thursday 11 December 2014

Seven tools for web scraping – To use for data journalism & creating insightful content

I’ve been creating a lot of (data driven) creative content lately and one of the things I like to do is gathering as much data as I can from public sources. I even have some cases it is costing to much time to create and run database queries and my personal build PHP scraper is faster so I just wanted to share some tools that could be helpful. Just a short disclaimer: use these tools on your own risk! Scraping websites could generate high numbers of pageviews and with that, using bandwidth from the website you are scraping.

1. Scraper (Chrome plugin)

    Scraper is a simple data mining extension for Google Chrome™ that is useful for online research when you need to quickly analyze data in spreadsheet form.

You can select a specific data point, a price, a rating etc and then use your browser menu: click Scrape Similar and you will get multiple options to export or copy your data to Excel or Google Docs. This plugin is really basic but does the job it is build for: fast and easy screen scraping.

2. Simple PHP Scraper

PHP has a DOMXpath function. I’m not going to explain how this function works, but with the script below you can easily scrape a list of URLs. Since it is PHP, use a cronjob to hourly, daily or weekly scrape the desired data. If you are not used to creating Xpath references, use the Scraper for Chrome plugin by selecting the data point and see the Xpath reference directly.

scraper-example

– Click here to download the example script.

3. Kimono Labs

Kimono has two easy ways to scrape specific URLs: just paste the URL into their website or use their bookmark. Once you have pointed out the data you need, you can set how often and when you want the data to be collected. The data is saved in their database. I like the facts that their learning curve is not that steep and it doesn’t look like you need a PHD in engineering to use their software. The disadvantage of this tool is the fact you can’t upload multiple URLs at once.

4. Import.io

Import.io is a browser based web scraping tool. By following their easy step-by-step plan you select the data you want to scrape and the tool does the rest. It is a more sophisticated tool compared to Kimono. I like it because of the fact it shows a clear overview of all the scrapers you have active and you can scrape multiple URLs at once.

5. Outwit Hub

I will start with the two biggest differences compared to the previous tool: it is a softwarepackage to use on your PC or laptop and to use its full potential it will cost you 75 USD. The free version can only scrape 100 rows of data. What I do like is the number of preprogrammed options to scrape which makes it easy to start and learn about web scraping.

6. ScraperWiki

This tool is really for people wanting to scrape on a massive scale. You can code your own scrapers (in PHP, Ruby & Python) and pricing is really cheap looking to what you can get: 29USD / month for 100 datasets. You are completely free in using libraries and timers. And if your programming skills are not good enough, they can help you out (paid service though). Compared to other tools, this is the most advanced tool that offers the basics of web scraping.

7. Fminer.com

This tool made it possible to finally scrape all the data inside Google Webmaster Tools since it can deal with JavaScript and AJAX interfaces. Read my extensive review on this page: Scraping Webmaster Tools with FMiner!

But on the end, building your individual project scrapers will always be more effective than using predefined scrapers. Am I missing any tools in this sum up of tools?

Source: http://www.notprovided.eu/7-tools-web-scraping-use-data-journalism-creating-insightful-content/

Sunday 30 November 2014

Web Scraping’s 2013 Review – part 2

As promised we came back with the second part of this year’s web scraping review. Today we will focus not only on events of 2013 that regarded web scraping but also Big data and what this year meant for this concept.

First of all, we could not talked about the conferences in which data mining was involved without talking about TED conferences. This year the speakers focused on the power of data analysis to help medicine and to prevent possible crises in third world countries. Regarding data mining, everyone agreed that this is one of the best ways to obtain virtual data.

Also a study by MeriTalk  a government IT networking group, ordered by NetApp showed this year that companies are not prepared to receive the informational revolution. The survey found that state and local IT pros are struggling to keep up with data demands. Just 59% of state and local agencies are analyzing the data they collect and less than half are using it to make strategic decisions. State and local agencies estimate that they have just 46% of the data storage and access, 42% of the computing power, and 35% of the personnel they need to successfully leverage large data sets.

Some economists argue that it is often difficult to estimate the true value of new technologies, and that Big Data may already be delivering benefits that are uncounted in official economic statistics. Cat videos and television programs on Hulu, for example, produce pleasure for Web surfers — so shouldn’t economists find a way to value such intangible activity, whether or not it moves the needle of the gross domestic product?

We will end this article with some numbers about the sumptuous growth of data available on the internet.  There were 30 billion gigabytes of video, e-mails, Web transactions and business-to-business analytics in 2005. The total is expected to reach more than 20 times that figure in 2013, with off-the-charts increases to follow in the years ahead, according to researches conducted by Cisco, so as you can see we have good premises to believe that 2014 will be at least as good as 2013.

Source:http://thewebminer.com/blog/2013/12/

Wednesday 26 November 2014

Web Data Extraction: driving data your way

Most businesses rely on the web to gather data such as product specifications, pricing information, market trends, competitor information, and regulatory details. More often than not, companies collect this data manually—a process that not only takes a significant amount of time, but also has the potential to introduce costly errors.

By automating data extraction, you're able to free yourself (and your pointer finger) from hours of copy/pasting, eliminate human errors, and focus on the parts of your job that make you feel great.

Web data extraction: What it is, why it's used, and how to get it right on an ongoing basis

Web data extraction, screen scraping, web harvesting—while these terms may have different connotations they all essentially point to the same thing: plucking data from the web and populating it in an organized way in another place for further analysis or more focused use. In an era where “big data” has become a commonplace concept, the appeal of web data extraction has grown; it’s an extremely efficient alternative to web browsing, and culls very specific data for a focused purpose.

How it's used

While each company’s needs vary, data extraction is often used for:

    Competitive intelligence, including web popularity, social perception, other sites linking to them, and placement of competitor advertisements

    Gathering financial data including stock market movement, product pricing, and more

    Creating continuity between price sheets and online websites, catalogs, or inventory databases

    Capturing product specifications like dimensions, color, and materials

    Pulling tabular data from multiple sources for in-depth analysis

Interestingly, some people even find that web data extraction can aid them in their leisure time as well, pulling data from blogs and websites that pertain to their hobbies or interests, and creating their own library of organized information on a topic. Perhaps, for instance, you want a list of all the designers that George Clooney wears (hey- we won’t question what you do in your free time). By using web scraping tools, you could automatically extract this type of data from, say, a fashion blogger who follows celebrity style, and create your own up-to-date shopping list of items.

How it's done

When you think of gathering data from the web, you should mentally juxtapose two different images: one of gathering a bucket of sand one grain at a time, and one of filling a bucket with a shovel that has the perfect scoop size to fill it completely in one sitting. While clearly the second method makes the most sense, the majority of web data extraction happens much like the first process--manually, and slowly.

Let’s take a look at a few different ways organizations extract data today.

The least productive way: manually

While this method is the least efficient, it’s also the most widespread. On the plus side, you need to learn absolutely nothing except “Ctrl+C/V” to use this method, which explains why it is the generally preferred method, despite the hours of time it can take. Imagine, for instance, managing a sales spreadsheet that keeps inventory up to date so that the information can be properly disseminated to a global sales team. Not only does it take a significant amount of time to update the spreadsheet with information from, say, your internal database and manufacturer’s website, but information may change rapidly, leaving sales reps with inaccurate information regardless.

Finding someone in the organization with a talent for programming languages like Python

Generally, automating a task without dedicated automation software requires programming, and therefore an internal resource with a solid familiarity with programming languages to create the task and corresponding script. While most organizations do, in fact, have a resource in IT or engineering with this type of ability, it often doesn’t seem like a worthy time investment for that person to derail the initiatives he or she is working on to automate web data extraction. Additionally, if companies do choose to automate using in-house resources, that person will find himself beholden to a continuing obligation, since he or she will need to adjust scripting if web objects and attributes change, disabling the task.

Outsourcing via Elance or oDesk

Unless there is a dedicated resource ready to automate and maintain data extraction processes (and most organizations wouldn’t necessarily choose to use their in-house employee time this way), companies might turn to outsourcing companies such as Elance or oDesk to hire contract help. While this is an effective way to automate a task using a resource that has a level of acumen in automation, it represents an additional cost--be it one time or on a regular basis as data extraction requirements change or increase.

Using Excel web queries

Since more often than not, data extracted from the web is often populated into an Excel spreadsheet, it’s no wonder that Excel includes web query tools expressly for that purpose. These tools are particularly useful in pulling tabular data from a website (such as product specifications, legal codes, stock prices, and a host of other information) and automatically pushing the data into a spreadsheet. Excel queries do have limitations and a learning curve, however, particularly when creating dynamic web queries. And clearly, if you’re hoping to populate the information in other sources, such as external databases, there is yet another level of difficulty to navigate.

How automation simplifies web data extraction

Culling web data quickly

Using automation is the simplest way to extract web data. As you execute the steps necessary to perform the task one time, a macro recorder captures each action, automatically generates an easily-editable script, and lets you specify how often you would like to repeat the task, and at what speed.

Maintaining the highest level of accuracy

With humans copy/pasting data, or comparing between multiple screens and entering data manually into a spreadsheet, you’re likely to run into accuracy issues (sometimes directly proportionate to the amount of time spent on the task and amount of coffee in the office!) Automation software ensures that “what you see is what you get,” and that data is picked up from the web and put back down where you want it without a hitch.

Storing web data in your preferred format

Not only can you accurately transfer data with automation software, you can also ensure that it’s populated into spreadsheets or databases in the format you prefer. Rather than simply dumping the data into a spreadsheet, you can ensure that the right information is put into the proper column, row, field, and style (think, for instance, of the difference between writing a birth date as “03/13/1912” and “12/3/13”).

Simplifying data analysis

Automation software allows you to aggregate data from disparate sources or enormous stockpiles of structured or unstructured data in a way that makes sense for your business analysis needs. This way, the majority of employees in an organization can perform some level of analysis on their own, making it easier to surface information that informs business decisions.

Reacting to changes without a hitch

Because automation software is built to recognize icons, images, symbols, and other objects regardless of their position on a screen, it can automate processes in a self-perpetuating manner. For example, let’s say you automate data retrieval from a certain chart on a retailer’s website without automation software. If the retailer decides to move that object to another area of the screen, your task would no longer produce accurate results (or work at all), leaving you to make changes to the script (or find someone who can), or re-record the task altogether. With image recognition capabilities, however, the system “memorizes” the object itself, not merely its coordinates, so that the task can continue to run irrespective of changes.

The wide sweeping appeal of automation software

Companies often pick a comprehensive automation solution not only because of its ability to effectively automate any web data extraction task, but also because it goes beyond data extraction. Automation software can permeate into other areas of the business as well, making tasks such as application integration, data migration, IT processes, Excel automation, testing, and routine tasks such as launching applications or formatting files faster and more accurate. Because it requires no programming experience to use, adoption rates are higher and businesses get more “bang for their buck.”

Almost any organization can benefit from using automation software, particularly as they grow and scale. If you are looking to quit “moving grains of sand” and start claiming back time in your day, there are a few steps you can take:

 Watch a short video that shows how web data extraction is done with automation software

 Download a free trial and start reaping the benefits of downloading even just a couple of tasks today.

 See how tasks are automated with our short, step-by-step how-to-sheets (and then give it a try yourself!)

Source: https://www.automationanywhere.com/web-data-extraction

Wednesday 19 November 2014

Web Scraping for Fun & Profit

There’s a number of ways to retrieve data from a backend system within mobile projects. In an ideal world, everything would have a RESTful JSON API – but often, this isn’t the case.Sometimes, SOAP is the language of the backend. Sometimes, it’s some proprietary protocol which might not even be HTTP-based. Then, there’s scraping.

Retrieving information from web sites as a human is easy. The page communicates information using stylistic elements like headings, tables and lists – this is the communication protocol of the web. Machines retrieve information with a focus on structure rather than style, typically using communication protocols like XML or JSON. Web scraping attempts to bridge this human protocol into a machine-readable format like JSON. This is what we try to achieve with web scraping.

As a means of getting to data, it don’t get much worse than web scraping. Scrapers were often built with Regular Expressions to retrieve the data from the page. Difficult to craft, impossible to maintain, this means of retrieval was far from ideal. The risks are many – even the slightest layout change on a web page can upset scraper code, and break the entire integration. It’s a fragile means for building integrations, but sometimes it’s the only way.

Having built a scraper service recently, the most interesting observation for me is how far we’ve come from these “dark days”. Node.js, and the massive ecosystem of community built modules has done much to change how these scraper services are built.

Effectively Scraping Information

Websites are built on the Document Object Model, or DOM. This is a tree structure, which represents the information on a page.By interpreting the source of a website as a DOM, we can retrieve information much more reliably than using methods like regular expression matching. The most popular method of querying the DOM is using jQuery, which enables us to build powerful and maintainable queries for information. The JSDom Node module allows us to use a DOM-like structure in serverside code.

For purpose of Illustration, we’re going to scrape the blog page of FeedHenry’s website. I’ve built a small code snippet that retrieves the contents of the blog, and translates it into a JSON API. To find the queries I need to run, first I need to look at the HTML of the page. To do this, in Chrome, I right-click the element I’m looking to inspect on the page, and click “Inspect Element”.

Screen Shot 2014-09-30 at 10.44.38

Articles on the FeedHenry blog are a series of ‘div’ elements with the ‘.itemContainer’ class

Searching for a pattern in the HTML to query all blog post elements, we construct the `div.itemContainer` query. In jQuery, we can iterate over these using the .each method:

var posts = [];

$('div.itemContainer').each(function(index, item){

  // Make JSON objects of every post in here, pushing to the posts[] array

});

From there, we pick off the heading, author and post summary using a child selector on the original post, querying the relevant semantic elements:

    Post Title, using jQuery:

    $(item).find('h3').text()trim() // trim, because titles have white space either side

    Post Author, using jQuery:

    $(item).find('.catItemAuthor a').text()

    Post Body, using jQuery:

    $(item).find('p').text()

Adding some JSDom magic to our snippet, and pulling together the above two concept (iterating through posts, and picking off info from each post), we get this snippet:

var request = require('request'),

jsdom = require('jsdom');

jsdom.env(

  "http://www.feedhenry.com/category/blog",

  ["http://code.jquery.com/jquery.js"],

  function (errors, window) {

    var $ = window.$, // Alias jQUery

    posts = [];

    $('div.itemContainer').each(function(index, item){

      item = $(item); // make queryable in JQ

      posts.push({

        heading : item.find('h3').text().trim(),

        author : item.find('.catItemAuthor a').text(),

        teaser : item.find('p').text()

      });

    });

    console.log(posts);

  }

);

A note on building CSS Queries

As with styling web sites with CSS, building effective CSS queries is equally as important when building a scraper. It’s important to build queries that are not too specific, or likely to break when the structure of the page changes. Equally important is to pick a query that is not too general, and likely to select extra data from the page you don’t want to retrieve.

A neat trick for generating the relevant selector statement is to use Chrome’s “CSS Path” feature in the inspector. After finding the element in the inspector panel, right click, and select “Copy CSS Path”. This method is good for individual items, but for picking repeating patterns (like blog posts), this doesn’t work though. Often, the path it gives is much too specific, making for a fragile binding. Any changes to the page’s structure will break the query.

Making a Re-usable Scraping Service

Now that we’ve retrieved information from a web page, and made some JSON, let’s build a reusable API from this. We’re going to make a FeedHenry Blog Scraper service in FeedHenry3. For those of you not familiar with service creation, see this video walkthrough.

We’re going to start by creating a “new mBaaS Service”, rather than selecting one of the off-the-shelf services. To do this, we modify the application.js file of our service to include one route, /blog, which includes our code snippet from earlier:

// just boilerplate scraper setup

var mbaasApi = require('fh-mbaas-api'),

express = require('express'),

mbaasExpress = mbaasApi.mbaasExpress(),

cors = require('cors'),

request = require('request'),

jsdom = require('jsdom');

var app = express();

app.use(cors());

app.use('/sys', mbaasExpress.sys([]));

app.use('/mbaas', mbaasExpress.mbaas);

app.use(mbaasExpress.fhmiddleware());

// Our /blog scraper route

app.get('/blog', function(req, res, next){

  jsdom.env(

    "http://www.feedhenry.com/category/blog",

    ["http://code.jquery.com/jquery.js"],

    function (errors, window) {

      var $ = window.$, // Alias jQUery

      posts = [];

      $('div.itemContainer').each(function(index, item){

        item = $(item); // make queryable in JQ

        posts.push({

          heading : item.find('h3').text().trim(),

          author : item.find('.catItemAuthor a').text(),

          teaser : item.find('p').text()

        });

      });

      return res.json(posts);

    }

  );

});

app.use(mbaasExpress.errorHandler());

var port = process.env.FH_PORT || process.env.VCAP_APP_PORT || 8001;

var server = app.listen(port, function() {});

We’re also going to write some documentation for our service, so we (and other developers) can interact with it using the FeedHenry discovery console. We’re going to modify the README.md file to document what we’ve just done using API Blueprint documentation format:

# FeedHenry Blog Web Scraper

This is a feedhenry blog scraper service. It uses the `JSDom` and `request` modules to retrieve the contents of the FeedHenry developer blog, and parse the content using jQuery.

# Group Scraper API Group

# blog [/blog]

Blog Endpoint

## blog [GET]

Get blog posts endpoint, returns JSON data.

+ Response 200 (application/json)

    + Body

            [{ blog post}, { blog post}, { blog post}]

We can now try out the scraper service in the studio, and see the response:

Scraping – The Ultimate in API Creation?

Now that I’ve described some modern techniques for effectively scraping data from web sites, it’s time for some major caveats. First,  WordPress blogs like ours already have feeds and APIs available to developers - there’s no need to ever scrape any of this content. Web Scraping is not a replacement for an API. It should be used only as a last resort, after every endeavour to discover an API has already been made. Using a web scraper in a commercial setting requires much time set aside to maintain the queries, and an agreement with the source data is being scraped on to alert developers in the event the page changes structure.

With all this in mind, it can be a useful tool to iterate quickly on an integration when waiting for an API, or as a fun hack project.

Source: http://www.feedhenry.com/web-scraping-fun-profit/

Saturday 15 November 2014

A Web Scraper’s Guide to Kimono

Being a frequent reader of Hacker News, I noticed an item on the front page earlier this year which read, “Kimono – Never write a web scraper again.” Although it got a great number of upvotes, the tech junta was quick to note issues, especially if you are a developer who knows how to write scrapers. The biggest concern was a non-intuitive UX, followed by the inability of the first beta version to extract data items from websites as smoothly as the demo video suggested.

I decided to give it a few months before I tested it out, and I finally got the chance to do so recently.

Kimono is a Y-Combinator backed startup trying to do something in a field where others have failed. Kimono is focused on creating APIs for websites which don’t have one, another term would be web scraping. Imagine you have a website which shows some data you would like to dynamically process in your website or application. If the website doesn’t have an API, you can create one using Kimono by extracting the data items from the website.

Is it Legal?

Kimono provides an FAQ section, which says that web scraping from public websites “is 100% legal” as long as you check the robots.txt file to see which URL patterns they have disallowed. However, I would advise you to proceed with caution because some websites can pose a problem.

A robots.txt is a file that gives directions to crawlers (usually of search engines) visiting the website. If a webmaster wants a page to be available on search engines like Google, he would not disallow robots in the robots.txt file. If they’d prefer no one scrapes their content, they’d specifically mention it in their Terms of Service. You should always look at the terms before creating an API through Kimono.

An example of this is Medium. Their robots.txt file doesn’t mention anything about their public posts, but the following quote from their TOS page shows you shouldn’t scrape them (since it involves extracting data from their HTML/CSS).

    For the remainder of the site, you may not duplicate, copy, or reuse any portion of the HTML/CSS, JavaScipt, logos, or visual design elements without express written permission from Medium unless otherwise permitted by law.

If you check the #BuiltWithKimono section of their website, you’d notice a few straightforward applications. For instance, there is a price comparison API, which is built by extracting the prices from product pages on different websites.

Let us move on and see how we can use this service.

What are we about to do?

Let’s try to accomplish a task, while exploring Kimono. The Blog Bowl is a blog directory where you can share and discover blogs. The posts that have been shared by users are available on the feeds page. Let us try to get a list of blog posts from the page.

The simple thought process when scraping the data is parsing the HTML (or searching through it, in simpler terms) and extracting the information we require. In this case, let’s try to get the title of the post, its link, and the blogger’s name and profile page.

Source: http://www.sitepoint.com/web-scrapers-guide-kimono/

Monday 10 November 2014

Data Scraping vs. Data Crawling

One of our favorite quotes has been- ‘If a problem changes by an order, it becomes a totally different problem’ and in this lies the answer to- what’s the difference between scraping and crawling?

Crawling usually refers to dealing with large data-sets where you develop your own crawlers (or bots) which crawl to the deepest of the web pages. Data scraping on the other hand refers to retrieving information from any source (not necessarily the web). It’s more often the case that irrespective of the approaches involved, we refer to extracting data from the web as scraping (or harvesting) and that’s a serious misconception.

=>Below are some differences in our opinion- both evident and subtle

1.    Scraping data does not necessarily involve the web. Data scraping could refer to extracting information from a local machine, a database, or even if it is from the internet, a mere “Save as” link on the page is also a subset of the data scraping universe. Crawling on the other hand differs immensely in scale as well as in range. Firstly, crawling = web crawling which means on the web, we can only “crawl” data. Programs that perform this incredible job are called crawl agents or bots or spiders (please leave the other spider in spiderman’s world). Some web spiders are algorithmically designed to reach the maximum depth of a page and crawl them iteratively (did we ever say scrape?).

2.    Web is an open world and the quintessential practising platform of our right to freedom. Thus a lot of content gets created and then duplicated. For instance, the same blog might be posted on different pages and our spiders don’t understand that. Hence, data de-duplication (affectionately dedup) is an integral part of data crawling. This is done to achieve two things- keep our clients happy by not flooding their machines with the same data more than once, and saving our own servers some space. However, dedup is not necessarily a part of data scraping.

3.    One of the most challenging things in the web crawling space is to deal with coordination of successive crawls. Our spiders have to be polite with the servers that they hit so that they don’t piss them off and this creates an interesting situation to handle. Over a period of time, our intelligent spiders have to get more intelligent (and not crazy!) and learn to know when and how much to hit a server in order to crawl data on its web pages while complying with its politeness policies.

4.    Finally, different crawl agents are used to crawl different websites and hence you need to ensure they don’t conflict with each other in the process. This situation never arises when you intend to just scrape data.

On a concluding note, scraping represents a very superficial node of crawling which we call extraction and that again requires few algorithms and some automation in place.

Source:https://www.promptcloud.com/blog/data-scraping-vs-data-crawling/

Wednesday 5 November 2014

Web Scraping Popularity Soars

The world is stirred because of the ever-growing web scraping success in almost all of its services. Success stories pertaining to the benefits of online data collection in business, research, politics, health, and almost all aspects of human life are endless. With this popularity surge, it has become a hot issue and many are questioning its legality and reliability.

Looking back, this simple harvesting of pertinent data from competitors and the global market in general like anything else started as a non-threatening and advanced form of web research. Eventually, when the benefits begin to manifest and the system improves, many are lured into it that it has become one of the strongest and fastest growing business in the world.

Simple Beginnings

As naturally as a law of life that great things come from small beginnings, data mining was conceived as a process in gaining information, mostly in research. This act of collecting information through the internet was never imagined to be what it has become nowadays.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-popularity-soars/

Monday 8 September 2014

Scraping webdata from a website that loads data in a streaming fashion

I'm trying to scrape some data off of the FEC.gov website using python for a project of mine. Normally I use python

mechanize and beautifulsoup to do the scraping.

I've been able to figure out most of the issues but can't seem to get around a problem. It seems like the data is

streamed into the table and mechanize.Browser() just stops listening.

So here's the issue: If you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A ... you get the first 500

contributors whose last name starts with A and have given money to candidate P80003338 ... however, if you use

browser.open() at that url all you get is the first ~5 rows.

I'm guessing its because mechanize isn't letting the page fully load before the .read() is executed. I tried putting a

time.sleep(10) between the .open() and .read() but that didn't make much difference.

And I checked, there's no javascript or AJAX in the website (or at least none are visible when you use the 'view-

source'). SO I don't think its a javascript issue.

Any thoughts or suggestions? I could use selenium or something similar but that's something that I'm trying to avoid.

-Will

2 Answers

Why not use an html parser like lxml with xpath expressions.

I tried

>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>



Similarly, you can create xpath expression of your choice to work with.


Source: http://stackoverflow.com/questions/9435512/scraping-webdata-from-a-website-that-loads-data-in-a-streaming-

fashion

How can I circumvent page view limits when scraping web data using Python?

I am using Python to scrape US postal code population data from http:/www.city-data.com, through this directory: http://www.city-data.com/zipDir.html. The specific pages I am trying to scrape are individual postal code pages with URLs like this: http://www.city-data.com/zips/01001.html. All of the individual zip code pages I need to access have this same URL Format, so my script simply does the following for postal_code in range:

    Creates URL given postal code
    Tries to get response from URL
    If (2), Check the HTTP of that URL
    If HTTP is 200, retrieves the HTML and scrapes the data into a list
    If HTTP is not 200, pass and count error (not a valid postal code/URL)
    If no response from URL because of error, pass that postal code and count error
    At end of script, print counter variables and timestamp

The problem is that I run the script and it works fine for ~500 postal codes, then suddenly stops working and returns repeated timeout errors. My suspicion is that the site's server is limiting the page views coming from my IP address, preventing me from completing the amount of scraping that I need to do (all 100,000 potential postal codes).

My question is as follows: Is there a way to confuse the site's server, for example using a proxy of some kind, so that it will not limit my page views and I can scrape all of the data I need?

Thanks for the help! Here is the code:

##POSTAL CODE POPULATION SCRAPER##

import requests

import re

import datetime

def zip_population_scrape():

    """
    This script will scrape population data for postal codes in range
    from city-data.com.
    """
    postal_code_data = [['zip','population']] #list for storing scraped data

    #Counters for keeping track:
    total_scraped = 0
    total_invalid = 0
    errors = 0


    for postal_code in range(1001,5000):

        #This if statement is necessary because the postal code can't start
        #with 0 in order for the for statement to interate successfully
        if postal_code <10000:
            postal_code_string = str(0)+str(postal_code)
        else:
            postal_code_string = str(postal_code)

        #all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        #try to get current URL
        try:
            response = requests.get(url, timeout = 5)
            http = response.status_code

            #print current for logging purposes
            print url +" - HTTP:  " + str(http)

            #if valid webpage:
            if http == 200:

                #save html as text
                html = response.text

                #extra print statement for status updates
                print "HTML ready"

                #try to find two substrings in HTML text
                #add the substring in between them to list w/ postal code
                try:           

                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)

                    #add to # scraped counter
                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])

                    #print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                #if substrings not found, try searching for others
                #and doing the same as above   
                except AttributeError:
                    found = re.search('population in 2010:</b> (.*)<br>', html).group(1)

                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."

            #if http =404, zip is not valid. Add to counter and print log        
            elif http == 404:
                total_invalid +=1

                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."

            #other http codes: add to error counter and print log
            else:
                errors +=1

                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."

        #if get url fails by connnection error, add to error count & pass
        except requests.exceptions.ConnectionError:
            errors +=1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
            pass

        #if get url fails by timeout error, add to error count & pass
        except requests.exceptions.Timeout:
            errors +=1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
            pass


    #print final log/counter data, along with timestamp finished
    now= datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total unavailable zips."
    print str(errors) + " total errors."



Source: http://stackoverflow.com/questions/25452798/how-can-i-circumvent-page-view-limits-when-scraping-web-data-using-python

Sunday 7 September 2014

Web data scraping (online news comments) with Scrapy (Python)

Since you seem like the try-first ask-question later type (that's a very good thing), I won't give you an answer, but a

(very detailed) guide on how to find the answer.

The thing is, unless you are a yahoo developer, you probably don't have access to the source code you're trying to

scrape. That is to say, you don't know exactly how the site is built and how your requests to it as a user are being

processed on the server-side. You can, however, investigate the client-side and try to emulate it. I like using Chrome

Developer Tools for this, but you can use others such as FF firebug.

So first off we need to figure out what's going on. So the way it works, is you click on the 'show comments' it loads

the first ten, then you need to keep clicking for the next ten comments each time. Notice, however, that all this

clicking isn't taking you to a different link, but lively fetches the comments, which is a very neat UI but for our

case requires a bit more work. I can tell two things right away:

    They're using javascript to load the comments (because I'm staying on the same page).
    They load them dynamically with AJAX calls each time you click (meaning instead of loading the comments with the

page and just showing them to you, with each click it does another request to the database).

Now let's right-click and inspect element on that button. It's actually just a simple span with text:

<span>View Comments (2077)</span>

By looking at that we still don't know how that's generated or what it does when clicked. Fine. Now, keeping the

devtools window open, let's click on it. This opened up the first ten. But in fact, a request was being made for us to

fetch them. A request that chrome devtools recorded. We look in the network tab of the devtools and see a lot of

confusing data. Wait, here's one that makes sense:

http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-

3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_commen

ts.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_com

ment=1

See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object

(looks like a python dictionary) containing all the ten comments which that request fetched. Now that's the request you

need to emulate, because that's the one that gives you what you want. First let's translate this to some normal reqest

that a human can read:

go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}

Now it's just a matter of trial-and-error. However, a few things to note here:

    Obviously the count is what decides how many comments you're getting. I tried changing it to 100 to see what

happens and got a bad request. And it was nice enough to tell me why - "Offset should be multiple of total rows". So

now we understand how to use offset

    The content_id is probably something that identifies the article you are reading. Meaning you need to fetch that

from the original page somehow. Try digging around a little, you'll find it.

    Also, you obviously don't want to fetch 10 comments at a time, so it's probably a good idea to find a way to fetch

the number of total comments somehow (either find out how the page gets it, or just fetch it from within the article

itself)

    Using the devtools you have access to all client-side scripts. So by digging you can find that that link to

/get_comments/ is kept within a javascript object named YUI. You can then try to understand how it is making the

request, and try to emulate that (though you can probably figure it out yourself)

    You might need to overcome some security measures. For example, you might need a session-key from the original

article before you can access the comments. This is used to prevent direct access to some parts of the sites. I won't

trouble you with the details, because it doesn't seem like a problem in this case, but you do need to be aware of it in

case it shows up.

    Finally, you'll have to parse the JSON object (python has excellent built-in tools for that) and then parse the

html comments you are getting (for which you might want to check out BeautifulSoup).

As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task

either.

So don't panic.

It's just a matter of digging and digging until you find gold (also, having some basic WEB knowledge doesn't hurt).

Then, if you face a roadblock and really can't go any further, come back here to SO, and ask again. Someone will help

you.


Source: http://stackoverflow.com/questions/20218855/web-data-scraping-online-news-comments-with-scrapy-python

Saturday 6 September 2014

A good web data extraction/screen scraper program?


I need to capture product data from a site on a regular basis and wondered if any one knows of a good software program? I've trialed Mozenda but its a monthly subscription and pricey in the long term. Obviously something thats free would be best but I don't mind paying either. Just need a decent program thats reliable and doesn't require much programming knowledge.

You can try ScraperWiki.com if you know python.

I've experimented with Screen-Scraper and found it easy to use. The application comes in multiple versions: basic (which is free), professional, and enterprise. Also, multiple platforms are supported.

Hire a programmer to do it so that there is only a one off cost. I often see similar projects on freelancing websites like Elance and oDesk.

I really like iMacros. You can give it a test drive to see if it meets your needs with the totally free Firefox extension (there's also IE versions), but there are also more full featured application and "server" versions that have more features and ability to do thing in an unattended manner.

Here are some other alternatives to consider:

    License the data from the provider. Call em up and ask 'em.

    Use Amazon Mechanical Turk to get humans to copy and paste and format it for ya. They are cheap.

    For automation, it depends on how complicated the HTML is and how often it changes. You could use Excel's Web Data Import if it's really simple.


You can use irobot from IRobotSoft, which is totally free, and provides more functionalityies than other paid software. Watch demos here http://irobotsoft.com/help/ for how simple it is.

Questions on their forum were answered very quickly.


Source: http://stackoverflow.com/questions/2334164/a-good-web-data-extraction-screen-scraper-program

Friday 5 September 2014

How to login to website and extract data using PHP [closed]


I have installed the tiny tiny rss on to my computer (Windows) and also have Xampp installed (localhost).

I want to be able to use PHP to extract data from the Tiny tiny RSS webpage.

I have tried this it which just opens the front page:

<?php
$homepage = file_get_contents('my install tiny tiny rss url');
echo $homepage;
?>

But how do I login and extract the data.

You can use cURL to send post data and headers. To login you need to replicate the exact data exchange between the client and the server.


SOurce: http://stackoverflow.com/questions/20611918/how-to-login-to-website-and-extract-data-using-php

Is it ok to scrape data from Google results?

I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?

Google will eventually block your IP when you exceed a certain amount of requests.



Google disallows automated access in their TOS, so if you accept their terms you would break them.

That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google, they powered their search engine Bing with it. They got caught in 2011 red handed :)

There are two options to scrape Google results:

1) Use their API

    You can issue around 40 requests per hour You are limited to what they give you, it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.

    If you want a higher amount of API requests you need to pay.
    60 requests per hour cost 2000 USD per year, more queries require a custom deal.

2) Scrape the normal result pages

    Here comes the tricky part. It is possible to scrape the normal result pages. Google does not allow it.
    If you scrape at a rate higher than 15 keyword requests per hour you risk detection, higher than 20/h will get you blocked from my experience.
    By using multiple IPs you can up the rate, so with 100 IP addresses you can scrape up to 2000 requests per hour. (50k a day)
    There is an open source search engine scraper written in PHP at http://scraping.compunect.com It allows to reliable scrape Google, parses the results properly and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart, otherwise the code will still be useful to learn how it is done.


Source: http://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results

Thursday 4 September 2014

Data Scraping from PDF and Excel

I am doing a little data scraping, There are 3 types of file from which i am scraping data.

1- HTML
2- PDF
3- Excel(xls)

For HTML i am comfortable, i am using HTML Agility for that.

For PDF and excel i need suggestions from anyone.



Concerning Excel. If you are in a MS environment you can either do Office Automation or use OLEDB. In a Java

environment look at Apache POI.

EDIT: Concerning PDF in Java try Apache PDFBox . Can also work in .NET using IKVM

I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, to extract data from tables in PDF

files into Excel. We have used it with great success.

HTML Agility is a library. Its good to use. But then, why do you need separate tools for different data extraction

purposes? Use Automation Anywhere to extract data from any source. As far as I know, it would work for all the three

sources you have specified. Google it.

Source: http://stackoverflow.com/questions/3147803/data-scraping-from-pdf-and-excel

Wednesday 3 September 2014

Excel VBA Data Mining Real-Time Data from a Web Page that Refreshes Data

I want to capture real-time data that updates into a table on a webpage; I prefer capturing it into excel using VBA, but I will write it in .NET C# or VB if I that is easier.

the data updates about 1 or 2 seconds, and I want to just grab the latest data quotes and log it into my spreadsheet; the table names are the same, only the data refreshes, and it does so automatically on the web page.

I've done a lot of Excel VBA and I know how to download a URL to a file--this is NOT what I want; I want to gain access to my webpage that is active and grab the data updates after I've logged into my site and selected a webpage that I like.

Is there a simple way to access this data on the webpage from Excel or .Net? Because it refreshes no more than once every 1 or 2 seconds, it is easy to just keep checking it for updates, and I can compare the latest data to see if it actually refreshed.


In Excel 2003, use Data/Import External Data/New Web Query
Browse to your page and select the table you want to import.
After that you can either do a manual Refresh, or use a timer procedure to do something like:

Source: http://stackoverflow.com/questions/9855794/excel-vba-data-mining-real-time-data-from-a-web-page-that-refreshes-data

Tuesday 2 September 2014

Need to pull data from a website…web query? macro?


I have a list of every DOT # (Dept. of Trans.) in the country. I want to find out insurance effective date for each one of these companies. If you go to http://li-public.fmcsa.dot.gov --> "continue" --> then from the dropdown select "carrier search" and hit "go" it'll take you to a search form (that is the only way to get to this screen).

From there, you can input a DOT # X (use 61222 as an example) and it'll bring you to another screen. Click "view report in HTML" and then down on the bottom you'll see "Active/Pending Insurance". I want to pull the "effective date" from that page and stick it in the spreadsheet next to the DOT # X that I already know.

Of the thousands of DOT #'s in my list, not all will have filings on this website, if that makes a difference.

Can this be done with a Macro or Excel Web Query? I know I probably sound like a total novice, but I'd appreciate any help I could get.

Can you do it? Frankly even if you could you'd lock up the spreadsheet while it's doing that processing. And in the end, how would you handle an error half-way through?

I'd not do this in a client-facing application. This sounds more like something to do in server-side app that can do the processing and gather the information in a more controlled environment. Then you Excel spreadsheet could query that app and get the information in one fell swoop. Error handling is much simpler and you don't end up sitting there staring at Excel why it works its way through thousands of web sites. It was not built to do that elegantly.

What do you write the web service I'm describing in? Well it depends on your preference. Me, I'd write it in Ruby on Rails since it can easily handle the scraping aspect of the task and can report the data out easily as well. But it really falls back to whatever you're most comfortable coding in.


Source: http://stackoverflow.com/questions/15286429/need-to-pull-data-from-a-website-web-query-macro

How to extract data from web 2.0 graphs using a scraper


I have recently come across a web page containing a graph object that displays the (x, y) values on the object as the

mouse is rolled across it. Is there any way to automate the extraction of this data?

How is the graph data loaded? If embedded in the page source then you can extract it with xpath or regex. Else use

Firebug to see how it is loaded.



You will need a solution that works inside the web browser, so the AJAX/Javascript is properly rendered.

I have used iMacros with good success for web scraping in the past. There are free/open-source and "PRO" paid editions

(comparison table here).

Another option is always to custom code something with the Microsoft webbrowser control.


Source: http://stackoverflow.com/questions/3980774/how-to-extract-data-from-web-2-0-graphs-using-a-scraper

Legality of Web Scraping vs Normal Use


I know the topic of web scraping has been discussed before (example), and I understand it's a bit of a grey area

depending on a lot of factors (e.g. website's terms of use).

What I'd like to ask is: how is web scraping any different from (a) how we access the webpage via a web browser, and

(b) how web crawlers (e.g. Google) download and index webpages?

Without knowing the legal background, I can't help but think that they're all just HTTP requests. If web scraping is

illegal, then so should crawling and indexing (for instance be illegal).

Of course if your program is hitting the server so hard that it causes a denial of service, it's a different story

altogether... my point is simply accessing and using data that is already open to the public.



I know this is a dead thread, but it would be nice to place some legal implications here due to its ranking in my

Google Search. I cannot help but figure I am not the only one who searches like I do.

Legally, in the US, there are a few factors that seem to be important.

    Are you doing anything that is akin to hacking or gaining unauthorized access via the Computer Fraud and Abuse Act.

Exploiting vulnerabilities and passing SQL in the URL to open a database no matter how bad the idiot programming like

that was is illegal with a 15 year sentence (see the cases where an individual exploited security vulnerabilities in

Verizon). Also, add a time out even if you round robin or use proxies. DDoS attacks are attacks. 1000 requests per

second can shut down a lot of servers providing public information. The result here is up to 15 years in jail.

    Copyright Law: As mentioned, pure replication of data is illegal. Even 4% replication has been deemed a breach.

With the recent gutting of the DMCA, a person is even more vulnerable to civil and criminal penalties.

    Trespass and Chattels: The following from wikipedia says it all.

    U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to

chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a

scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering

Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site.

    Paywalls and Product: When going behind paywalls and breaching contract by clicking an agreement not to do

something and then doing it, you add fuel to the protection of negligence v. willingness [an issue for damages and

penalties not guilt] in civil and any criminal trials. (sorry originally wanted to say ignorance but it really isn't a

defense)

    International: EU law and other law is way more lax. Corporations with big budgets dominate our legal landscape.

They control the system in a very real way with their $$$.

Basically, get public information and information that is available without going behind a pay wall. Think like a user

of the internet and combine a bunch of sources into a unique product. Don't just 'steal' an entire site (it isn't

really stealing if it is a government site that offers public data especially for download but is if you download all

or even more than a couple of the listings on ebay). Read the terms and conditions to know who actually owns the

content.

Here are a few examples. Trulia owns its information but you could use it to go to an agents website and collect a

legal amount of information. The legal amount is determinable. However, a public MLS listing lookup site with no

agreement or terms and offering data to the public is fair game. The MLS numbers lists, however, are normally not fair

game.

If a researcher can get to data, so can you. If a researcher needs permission, so do you. A computer is like having a

million corporate researchers at your disposal.

AS for company policy, it is usually used internally to shield from liability and serves as a warning but is not

entirely enforceable. The legal parts letting you know about copyrights and such are and usually are supposed to be

known by everyone. Complete ignorance is not a legal protection. It does provide a ground set of rules. Be nice, or get

banned is that message as far as I know.

My personal strategy is to start with public data and embellish it within legal means.


Source: http://stackoverflow.com/questions/14735791/legality-of-web-scraping-vs-normal-use

Anyone knows an online tool that can scrape a page and create a REST API for the scraped data?


I'm looking for a SaaS solution that is able to login to a platform, scrape data (reports) and then allow accessing the

data through an API. I have some reporting platforms that provide web reporting and email reporting but with no API.

Online reporting doesn't help and email reporting, although can be automated and scraped, isn't so reliable.

If you are willing to do the scraping through your own connection, have a look at Import IO. They have a desktop

application that you use to teach the system how to scrape a page, and then you run the crawler from that application -

and you can run it for as long as you like, as far as I can tell.

You may then upload your data to the Import cloud, from where it is available via an API on the import.io servers.

Useful data can be made public to donate it "to the commons" if you wish.


I did some more digging, found iMacros as a possible solution. Its Windows based, which is a drawback in my case, but

it does allow automation of the scraping and afterwards interaction via common web scripting languages like PHP and

ASP.net.


If you are familiar with jQuery, I think you can use node.js and Cheerio module, then you can create a simple

application to do auto scraping. Actually I have already built a site to do on line web scraping based on the above

mentioned tech, the site is www.datafiddle.net, you can take a look at it.


Source: http://stackoverflow.com/questions/19646028/anyone-knows-an-online-tool-that-can-scrape-a-page-and-create-a-

rest-api-for-the

Wednesday 27 August 2014

Extract data from Web Scraping C#


I am MVC ASP.NET developer.

I have received the contents from any url, i.e. http, https etc. using WebRequest class.

I have received all the content of that particular url. (for now I took http://google.com)

My next step is to extract buttons, header, footer, colors, text etc.

Here is my code for now:

public ActionResult GetContent(UrlModel model) //model having a string URL
which is entered in a text box and method hits using submit button.
{
    //WebRequest request = WebRequest.Create(model.URL);

    WebRequest request = WebRequest.Create(model.URL);

    request.Credentials = CredentialCache.DefaultCredentials;

    WebResponse response = request.GetResponse();

    Stream dataStream = response.GetResponseStream();

    StreamReader reader = new StreamReader(dataStream);

    string responseFromServer = reader.ReadToEnd();
    ViewBag.Response = responseFromServer;

    reader.Close();
    response.Close();
    return View();
}

Can someone help me with writing the code ?

Also do suggest me with some techniques of data extraction in C#.



Source: http://stackoverflow.com/questions/21901162/extract-data-from-web-scraping-c-sharp

Scrapy, scraping price data from StubHub


I've been having a difficult time with this one.

I want to scrape all the prices listed for this Bruno Mars concert at the Hollywood Bowl so I can get the average price.

http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/

I've located the prices in the HTML and the xpath is pretty straightforward but I cannot get any values to return.

I think it has something to do with the content being generated via javascript or ajax but I can't figure out how to send the correct request to get the code to work.

Here's what I have:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from deeptix.items import DeeptixItem

class TicketSpider(BaseSpider):
    name = "deeptix"
    allowed_domains = ["stubhub.com"]
    start_urls = ["http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/"]

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//div[contains(@class, "q_cont")]')
    items = []
    for site in sites:
        item = DeeptixItem()
        item['price'] = site.xpath('span[contains(@class, "q")]/text()').extract()
        items.append(item)
    return items

Any help would be greatly appreciated I've been struggling with this one for quite some time now. Thank you in advance!


Source: http://stackoverflow.com/questions/22770917/scrapy-scraping-price-data-from-stubhub

Tuesday 26 August 2014

How do you scrape AJAX pages?

Overview:

All screen scraping first requires manual review of the page you want to extract resources from. When dealing with AJAX you usually just need to analyze a bit more than just simply the HTML.

When dealing with AJAX this just means that the value you want is not in the initial HTML document that you requested, but that javascript will be exectued which asks the server for the extra information you want.

You can therefore usually simply analyze the javascript and see which request the javascript makes and just call this URL instead from the start.

Example:

Take this as an example, assume the page you want to scrape from has the following script:

<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
  {
  // Firefox, Opera 8.0+, Safari
  xmlHttp=new XMLHttpRequest();
  }
catch (e)
  {
  // Internet Explorer
  try
    {
    xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
    }
  catch (e)
    {
    try
      {
      xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
      }
    catch (e)
      {
      alert("Your browser does not support AJAX!");
      return false;
      }
    }
  }
  xmlHttp.onreadystatechange=function()
    {
    if(xmlHttp.readyState==4)
      {
      document.myForm.time.value=xmlHttp.responseText;
      }
    }
  xmlHttp.open("GET","time.asp",true);
  xmlHttp.send(null);
  }
</script>

Then all you need to do is instead do an HTTP request to time.asp of the same server instead. Example from w3schools.


Sporce: http://stackoverflow.com/questions/260540/how-do-you-scrape-ajax-pages

using Perl to scrape a website


I am interested in writing a perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219

That website is the amount of white men born in the year 1923 who live in San Diego County, California in 1940. I am trying to do this in a loop structure to generalize over multiple counties and birth years.

In the file, locations.txt, I put the list of counties, such as San Diego County.

The current code runs, but instead of the # 1975, it displays unknown. The number 1975 should be in $val\n.

I would very much appreciate any help!

#!/usr/bin/perl

use strict;

use LWP::Simple;

open(L, "locations26.txt");

my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';

open(O, ">out26.txt");
 my $oldh = select(O);
 $| = 1;
 select($oldh);
 while (my $location = <L>) {
     chomp($location);
     $location =~ s/ /+/g;
      foreach my $year (1923..1923) {
                 my $u = $url;
                 $u =~ s/%LOCATION%/$location/;
                 $u =~ s/%YEAR%/$year/;
                 #print "$u\n";
                 my $content = get($u);
                 my $val = 'unknown';
                 if ($content =~ / of .strong.([0-9,]+)..strong. /) {
                         $val = $1;
                 }
                 $val =~ s/,//g;
                 $location =~ s/\+/ /g;
                 print "'$location',$year,$val\n";
                 print O "'$location',$year,$val\n";
         }
     }

Update: API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage. Hence, any solution pertaining to JSON will not be applicbale.



Source: http://stackoverflow.com/questions/14654288/using-perl-to-scrape-a-website

Monday 25 August 2014

Data Scraping using php


Here is my code

    $ip=$_SERVER['REMOTE_ADDR'];

    $url=file_get_contents("http://whatismyipaddress.com/ip/$ip");

    preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);

    $isp=$output[1][2];

    $city=$output[9][2];

    $state=$output[8][2];

    $zipcode=$output[12][2];

    $country=$output[7][2];

    ?>
    <body>
    <table align="center">
    <tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
    <tr><td>City :</td><td><?php echo $city;?></td></tr>
    <tr><td>State :</td><td><?php echo $state;?></td></tr>
    <tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
    <tr><td>Country :</td><td><?php echo $country;?></td></tr>
    </table>
    </body>

How do I find out the ISP provider of a person viewing a PHP page?

Is it possible to use PHP to track or reveal it?

Error: http://i.imgur.com/LGWI8.png

Curl Scrapping

<?php
$curl_handle=curl_init();
curl_setopt( $curl_handle, CURLOPT_FOLLOWLOCATION, true );
$url='http://www.whatismyipaddress.com/ip/132.123.23.23';
curl_setopt($curl_handle, CURLOPT_URL,$url);
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Your application name');
$query = curl_exec($curl_handle);

curl_close($curl_handle);
preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);
echo $query;
$isp=$output[1][2];

$city=$output[9][2];

$state=$output[8][2];

$zipcode=$output[12][2];

$country=$output[7][2];
?>
<body>
<table align="center">
<tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
<tr><td>City :</td><td><?php echo $city;?></td></tr>
<tr><td>State :</td><td><?php echo $state;?></td></tr>
<tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
<tr><td>Country :</td><td><?php echo $country;?></td></tr>
</table>
</body>

Error: http://i.imgur.com/FJIq6.png

What's is wrong with my code here? Any alternative code , that i can use here.

I am not able to scrape that data as described here. http://i.imgur.com/FJIq6.png

P.S. Please post full code. It would be easier for me to understand.



Source: http://stackoverflow.com/questions/10461088/data-scraping-using-php

PDF scraping using R

I have been using the XML package successfully for extracting HTML tables but want to extend to PDF's. From previous questions it does not appear that there is a simple R solution but wondered if there had been any recent developments

Failing that, is there some way in Python (in which I am a complete Novice) to obtain and manipulate pdfs so that I could finish the job off with the R XML package

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext is installable on any Linux system



Source: http://stackoverflow.com/questions/7918718/pdf-scraping-using-r

Sunday 24 August 2014

Php Scraping data from a website

I am very new to programming and need a little help with getting data from a website and passing it into my PHP script.

The website is http://www.birthdatabase.com/.

I would like to plug in a name (First and Last) and retrieve the result. I know you can query the site by passing the name in the URL, but I am having problems scraping the results.

http://www.birthdatabase.com/cgi-bin/query.pl?textfield=FIRST&textfield2=LAST&age=&affid=

I am using the file_get_contents($URL) function to get the page but need help after that. Specifically, I would like to scrape only the results from a certain state if there are multiple results for that name.



You need the awesome simple_html_dom class.

With this class you can query the webpage's DOM in a similar way to jQuery.

First include the class in your page, then get the page content with this snippet:

$html = file_get_html('http://www.birthdatabase.com/cgi-bin/query.pl?textfield=' . $first . '&textfield2=' . $last . '&age=&affid=');

Then you can use CSS selections to scrape your data (something like this):

$n = 0;
foreach($html->find('table tbody tr td div font b table tbody') as $element) {
    @$row[$n]['tr']  = $element->find('tr')->text;
    $n++;
}

// output your data
print_r($row);



Source: http://stackoverflow.com/questions/15601584/php-scraping-data-from-a-website

Obtaining reddit data

I am interested in obtaining data from different reddit subreddits. Does anyone know

if there is a reddit/other api similar like twitter does to crawl all the pages?


Yes, reddit has an API that can be used for a variety of purposes such as data

collection, automatic commenting bots, or even to assist in subreddit moderation.

There are a few places to discover information on reddit's API:

    github reddit wiki -- provides the overview and rules for using reddit's API

(follow the rules)
    automatically generated API docs -- provides information on the requests needed to

access most of the API endpoints
    /r/redditdev -- the reddit community dedicated to answering questions both about

reddit's source code and about reddit's API

If there is a particular programming language you are already familiar with, you

should check out the existing set of API wrappers for various languages. Despite my

bias (I am the package maintainer) I am quite certain PRAW, for python, has support

for the largest number of reddit API features.



Source: http://stackoverflow.com/questions/14322834/obtaining-reddit-data