Saturday, 23 August 2014

Scraping data in dynamic sites

I'm trying to scrape data from our local government. What I want are the addresses of child adoption offices. Here in Brazil, all adoptions go through the government. So I have the URL of one office; there are two or three thousand more, but if I can manage to get one, the others will be easy. I made many attempts; below I show three.

The problem could be related to JavaScript (Ajax, maybe) that refreshes the page.

Note: I am not a PHP developer.

First attempt

echo '<html><head></head><body>';
echo '<h1>Scraper PHP GET 1</h1>';

echo ini_get("allow_url_fopen");
echo ini_get("allow_url_fopen");

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$html = file_get_contents($url);
var_dump($html);

echo '</body></html>';

// Output
// 11
// Warning: file_get_contents(http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673)
// [function.file-get-contents]: failed to open stream: HTTP request failed!
// HTTP/1.1 404 Not Found in /home/rsl/www/sc01_get.php on line 14
// bool(false)

Second attempt

echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 3</h1>';

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");

$html=@curl_exec($curl);

if (!$html) {
    echo "<br />cURL error number:" .curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
    exit;
}
else{
   echo '<br>begin HTML[';
    echo  $html;
   echo '<br>]end html ';
}
echo '</body></html>';

// Output
// 1

Third attempt

function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.windowsphone.com");

    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 5</h1>';

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");

$html=@curl($curl);


if (!$html) {
    echo "<br />cURL error number:" .curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
    exit;
}
else{
    echo '<br>begin HTML[';
    echo  $html;
    echo '<br>]end html ';
}
echo '</body></html>';

// Output
// cURL error number:0
// cURL error:

If the pages are really Ajax-based, meaning the information you need to scrape is loaded or shown through JavaScript execution, you will need another approach: you will have to automate a real browser. You can go the Selenium route, which can be driven from a number of languages, or use CasperJS with JavaScript as the programming language.
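
For example, a minimal Selenium sketch in Python might look something like this (an illustration only: it assumes Firefox and the selenium bindings are installed, and it simply dumps the rendered HTML rather than parsing out the office addresses):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673')

# page_source holds the HTML after any JavaScript/Ajax has run,
# which is what file_get_contents() and cURL never see.
html = browser.page_source
browser.quit()

print(html)

From there the rendered HTML can be parsed with any HTML parser, or the addresses located with Selenium's own element-finding methods.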



Source: http://stackoverflow.com/questions/24611046/scraping-data-in-dynamic-sites

Friday, 22 August 2014

Scraping dynamic data


I am scraping profiles on ask.fm for a research question. The problem is that only the most recent questions are viewable, and I have to click "view more" to see the next 15.

The source code for clicking view more looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of calling this four times before scraping? I want the most recent 60 posts on the site. Python is preferable.

You could probably use Selenium to browse to the website and click on the button/link a few times (a rough sketch follows after the links below). You can get that here:

    https://pypi.python.org/pypi/selenium

Or you might be able to do it with mechanize:

    http://wwwsearch.sourceforge.net/mechanize/

I have also heard good things about twill, but never used it myself:

    http://twill.idyll.org/
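
Taking the Selenium suggestion, a rough sketch might look like the following (assumptions: Firefox and the selenium bindings are installed, the profile URL is a placeholder, the button selector is built from the snippet in the question, and the exact locator call may vary with your Selenium version):

from selenium import webdriver
import time

browser = webdriver.Firefox()
browser.get('http://ask.fm/some_profile')  # placeholder profile URL

# Click "View more" four times so that roughly 60 posts are loaded.
for _ in range(4):
    button = browser.find_element_by_css_selector('input[value="View more"]')
    button.click()
    time.sleep(2)  # crude wait for the next batch; an explicit wait is more robust

html = browser.page_source  # the fully expanded page, ready to be scraped
browser.quit()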



Source: http://stackoverflow.com/questions/19437782/scraping-dynamic-data

Wednesday, 20 August 2014

Web Scraping data from different sites


I am looking for a few ideas on how I can solve a design problem I'm going to face when building a web scraper to scrape multiple sites. Writing the scraper(s) is not the problem; matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
        ....
    }

If I scrape this from two different sites, I could encounter the situation where I have the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to consider these two sets of data the same (they are the same person but with their name slightly different on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then, when I scrape site A or B, I can over time build up a list of failures and store them in a list for each scraper, so that it can know (if I find id=100 then I know that the firstname will be William, etc.). I can't help but feel this is a rubbish idea!

If you need any more info, or if you think my description is a bit naff, let me know!

Thanks,

DMcB
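
As a rough illustration of the "reference list" idea described above, here is a small Python sketch (hedged: the nickname table and the matching rule are illustrative assumptions, not a recommended design):

# Map common nickname variants to a canonical first name.
NICKNAMES = {
    'bill': 'william',
    'will': 'william',
    'billy': 'william',
}

def canonical(record):
    # Return a normalised (firstname, surname) key for matching records.
    first = record['firstname'].strip().lower()
    first = NICKNAMES.get(first, first)
    return (first, record['surname'].strip().lower())

site_a = {'id': 100, 'firstname': 'William', 'surname': 'Doe'}
site_b = {'id': 1974, 'firstname': 'Bill', 'surname': 'Doe'}

print(canonical(site_a) == canonical(site_b))  # True: treated as the same person

In practice the alias table would be the slowly built "reference list" the question describes, and the match key could be extended with more fields.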


Source: http://stackoverflow.com/questions/23970057/web-scraping-data-from-different-sites

Scrape Data Point Using Python


I am looking to scrape a data point using Python from the URL http://www.cavirtex.com/orderbook.

The data point I am looking to scrape is the lowest bid offer, which at the current moment looks like this:

<tr>
 <td><b>Jan. 19, 2014, 2:37 a.m.</b></td>
 <td><b>0.0775/0.1146</b></td>
 <td><b>860.00000</b></td>
 <td><b>66.65 CAD</b></td>
</tr>

The relevant point is the 860.00000. I am looking to build this into a script which can send me an email to alert me to certain price differentials compared to other exchanges.

I'm quite a newbie, so if in your explanations you could share your thought process on why you've done certain things, it would be very much appreciated.

Thank you in advance!

Edit: This is what I have so far, which returns the page title correctly; I'm having trouble grabbing the table data, though.

import urllib2, sys
from bs4 import BeautifulSoup

site= "http://cavirtex.com/orderbook"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup.title



Here is the code for scraping the lowest bid from the 'Buying BTC' table:

from selenium import webdriver

fp = webdriver.FirefoxProfile()
browser = webdriver.Firefox(firefox_profile=fp)
browser.get('http://www.cavirtex.com/orderbook')

lowest_bid = float('inf')
elements = browser.find_elements_by_xpath('//div[@id="orderbook_buy"]/table/tbody/tr/td')

for element in elements:
    # Each cell's innerHTML looks like "<b>860.00000</b>"; strip the tag
    # characters so only the cell text remains.
    text = element.get_attribute('innerHTML').strip('<b>|</b>')
    try:
        bid = float(text)
        if lowest_bid > bid:
            lowest_bid = bid
    except ValueError:
        # Non-numeric cells (dates, CAD amounts, etc.) are skipped.
        pass

browser.quit()
print lowest_bid

In order to install Selenium for Python on your Windows-PC, run from a command line:

pip install selenium (or pip install selenium --upgrade if you already have it).

If you want the 'Selling BTC' table instead, then change "orderbook_buy" to "orderbook_sell".

If you want the 'Last Trades' table instead, then change "orderbook_buy" to "orderbook_trades".

Note:

If performance is critical, then you can implement the data scraping via a plain URL connection instead of Selenium and have your program run much faster. However, your code will probably end up being a lot "messier", due to the tedious HTML parsing that you'll be obliged to apply...
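
For reference, a hedged sketch of that lighter-weight approach, reusing urllib2 and BeautifulSoup from the question's own snippet (it assumes the bid table sits inside the same "orderbook_buy" div used above and that the table is present in the static HTML):

import urllib2
from bs4 import BeautifulSoup

req = urllib2.Request('http://www.cavirtex.com/orderbook',
                      headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib2.urlopen(req).read())

lowest_bid = float('inf')
buy_table = soup.find('div', id='orderbook_buy')
if buy_table is not None:
    for cell in buy_table.find_all('td'):
        try:
            lowest_bid = min(lowest_bid, float(cell.get_text(strip=True)))
        except ValueError:
            pass  # skip non-numeric cells such as dates and CAD amounts

print(lowest_bid)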

Here is the code for sending the previous output in an email from yourself to yourself:

import smtplib,ssl

def SendMail(username,password,contents):
    server = Connect(username)
    try:
        server.login(username,password)
        server.sendmail(username,username,contents)
    except smtplib.SMTPException,error:
        print(error)
    Disconnect(server)

def Connect(username):
    serverName = username[username.index("@")+1:username.index(".")]
    while True:
        try:
            server = smtplib.SMTP(serverDict[serverName])
        except smtplib.SMTPException,error:
            print(error)
            continue
        try:
            server.ehlo()
            if server.has_extn("starttls"):
                server.starttls()
                server.ehlo()
        except (smtplib.SMTPException,ssl.SSLError),error:
            print(error)
            Disconnect(server)
            continue
        break
    return server

def Disconnect(server):
    try:
        server.quit()
    except smtplib.SMTPException,error:
        print(error)

serverDict = {
    "gmail"  :"smtp.gmail.com",
    "hotmail":"smtp.live.com",
    "yahoo"  :"smtp.mail.yahoo.com"
}

SendMail("your_username@your_provider.com","your_password",str(lowest_bid))

The above code should work if your email provider is Gmail, Hotmail, or Yahoo.

Please note that, depending on your firewall configuration, it may ask for your permission the first time you try it...



Source: http://stackoverflow.com/questions/21217034/scrape-data-point-using-python

Sunday, 15 June 2014

Data Entry: Why Data Entry Outsourcing?

With the development of technology, the world has become a global village and, in effect, a single marketplace. This has created heavy competition, and one of the few ways for a business to reduce its overheads is to outsource work to developing countries.

This helps in reducing the overhead costs of your business. Companies outsource work in many areas, such as call centres, software development, CAD, AutoCAD, design, writing, etc. Of all of these, data entry is the most popular, probably because it can be learned easily. The professionals providing these services do not only enter your data, but also help you manage it for your requirements.

These services cover several business areas, such as online and offline data entry, audio and video entry, image entry, insurance claims entry, numeric data entry, etc. Data entry outsourcing has become popular because of the impressive benefits associated with it.

Why outsource data entry work?

Consider the benefits associated with it:

Cost effective: Outsourcing your data entry is a perfect solution for saving extra overheads. Outsourcing companies cut extra expenditure on resources while delivering the same level of competency, giving you the maximum return on your investment.

Superior quality: Developing countries with a huge, skilled human resource pool are able to provide equal or superior work quality. Being able to recruit experienced and trained staff means higher-quality work.

Satisfied workforce: As companies get the work done at a lower cost than in developed countries, they can pay their staff better than other work available in their country. This satisfied workforce enthusiastically delivers the results the client requires.

Consistent data in a timely manner: Companies providing outsourced data entry give you consistent, accurate data in a timely manner which can easily be used to meet organisational needs. This keeps the workflow efficient and saves time.

Developed technology: Although you are outsourcing your data entry needs to a developing country, you will find that the technology used is efficient and upgraded to international standards.

All-in-one service: Companies providing outsourcing services are also capable of providing related services to their clients in areas such as online and offline data entry, audio and video entry, image entry, insurance claims, numeric data, PDF conversion, image scanning or OCR scanning, image editing, etc.

Report management set-up: Companies providing outsourcing services have a well-developed reporting system, with contacts at different levels of seniority, which periodically gives you an accurate progress report on the work assigned.

These services will certainly help you to focus on your core business operations and thereby improve your overall productivity.

Offshore Data Entry - we are a leading provider of affordable services, at rates up to 60% lower.

Source: http://ezinearticles.com/?Data-Entry:-Why-Data-Entry-Outsourcing?&id=6425948

Monday, 19 May 2014

Tips to Use While Picking Online Data Recovery Services

Most businesses use technology day and night to make their daily tasks easier and faster. Keeping technology at our fingertips makes our lives simpler and gives us the best opportunities to carry out business processes with the smartest technological advantages. Gradually advancing technology will soon make it possible for every individual to use comprehensive services within easy reach of their home or workplace. Irrespective of the distance and the type of technology service, every technical problem will be resolved in the minimum turnaround time so that work is not hampered. The same concept applies to data recovery solutions!

No matter whether you are using a new PC or a new hard drive, it is not going to work perfectly forever. You might face technical errors in your machine that should not be ignored; instead, they must be diagnosed and fixed promptly to avoid disruption and data loss. There are times when bad luck brings computer viruses onto your hard drive, or some other external damage occurs to it, causing data loss. In such a scenario it is impossible to get the original data back on your own, and you need to turn to a professional who can handle the data recovery emergency. However, to obtain the most trusted online data recovery services, it is important to hire a reliable data recovery provider.

Below are a few helpful tips for finding a reputable data recovery service.

•    Look for Recommendations

As word of mouth is considered the most trusted conventional method of marketing, it is recommended to ask your business partners, colleagues and friends for suggestions when looking for a reputable service company. If a similar tragedy happened to another business owner and they were successful in retrieving their data, you will want the contact details of the data recovery company they used, in case you ever need to call it. This is how recommendations go a long way towards finding expert online data recovery services.

•    Perform Internet Research

As every data retrieval service provider has an online presence today, it is easy to separate the good ones from the rest of the options through Internet research. Type relevant keywords into the Google search box and look at the best search results. Once you have prepared a list of potential service providers, check out the services they offer and their detailed descriptions. This way you will easily be able to shortlist the best data recovery companies without compromising your comfort.

•    Do Homework

If you are unable to get any referrals, you will need to look at more companies and check which ones offer services that fit within your budget. Some may have been in the industry for a long time, while others may have entered it recently. Make sure you measure and analyse all the parameters before making the final investment and choosing the right data recovery solution.

Source: http://blogs.siliconindia.com/Techvedic/Technology/Tips-to-Use-While-Picking-Online-Data-Recovery-Services-bid-FTE076KD72078993.html

Sunday, 17 November 2013

Data scraping tool for non-coding journalists launches

A tool which helps non-coding journalists scrape data from websites has launched in public beta today.

Import.io lets you extract data from any website into a spreadsheet simply by mousing over a few rows of information.

Until now import.io, which we reported on back in April, has been available in private developer preview and has been Windows only. It is now also available for Mac and is open to all.

Although import.io plans to charge for some services at a later date, there will always be a free option.

The London-based start-up is trying to solve the problem that there is "lots of data on the web, but it's difficult to get at", Andrew Fogg, founder of import.io, said in a webinar last week.

Those with the know-how can write a scraper or use an API to get at data, Fogg said. "But imagine if you could turn any website into a spreadsheet or API."

Uses for journalists

Journalists can find stories in data. For example, if I wanted to do a story on the type of journalism jobs being advertised and the salaries offered, I could research this by looking at various websites which advertise journalism jobs.

If I were to gather the data from four different jobs boards and enter the information manually into a spreadsheet, it would take hours if not days; if I were to write a screen scraper for each of the sites, it would require technical knowledge and would probably take a couple of hours. Using import.io I can create a single dataset from multiple sources in a few minutes.

I can then search and sort the dataset and find out different facts, such as how many unpaid internships are advertised, or how many editors are currently being sought.

How it works

When you download the import.io application you see a web browser. This browser allows you to enter a URL for any site you want to scrape data from.

To take the example of the jobs board, this is structured data, with the job role, description and salaries displayed.

The first step is to set up 'connectors' and to do this you need to teach the system where the data is on the page. This is done by hitting a 'record' button on the right of the browser window and mousing over a few examples, in this case advertised jobs. You then click 'train rows'.

It takes between two and five examples to teach import.io where all of the rows are, Fogg explained in the webinar.

The next step is to declare the type of data and add column names. For example there may be columns for 'job title', 'job description' and 'salary'. Data is then extracted into the table below the browser window.

Data from different websites can then be "mixed" into a single searchable database.

In the example used in the webinar, Fogg demonstrated how import.io could take data relating to rucksacks for sale on a shopping website. The tool can learn the "extraction pattern", Fogg explained, and apply that to another product. So rather than mousing over the different rows of sleeping bags advertised, for example, import.io was automatically able to detect where the price and product details were on the page, as it had learnt the structure from how the rucksacks were organised. The really smart bit is that the data from all products can then be automatically scraped and pulled into the spreadsheet. You can then search 'shoes' and find the data has already been pulled into your database.

When a site changes its code a screen scraper would become ineffective. Import.io has a "resilience to change", Fogg said. It runs tests twice a day and users get notified of any changes and can retrain a connector.

It is worth noting that a site that has been scraped will be able to detect that import.io has extracted the data as it will appear in the source site's web logs.

Case studies

A few organisations have already used import.io for data extraction. Fogg outlined three.

    British Red Cross

The British Red Cross wanted to create an iPhone app with data from the NHS Choices website. The NHS wanted the charity to use the data but the health site does not have an API.

By using import.io, data was scraped from the NHS site. The app is now in the iTunes store and users can use it to enter a postcode to find hospital information based on the data from the NHS site.

"It allowed them to build an API for a website where there wasn't one," Fogg said.

    Hewlett Packard

Fogg explained that Hewlett Packard wanted to monitor the prices of its laptops on retailers' websites.

They used import.io to scrape the data from the various sites and were able to monitor the prices at which the laptops were being sold in real time.

    Recruitment site

A US recruitment firm wanted to set up a system so that when any job vacancy appeared on a competitor's website, they could extract the details and push that into their Salesforce software. The initial solution was to write scrapers, Fogg said, but this was costly and in the end they gave up. Instead they used import.io to scrape the sites and collate the data.


Source: http://www.journalism.co.uk/news/data-scraping-tool-for-non-coding-journalists-launches/s2/a554002/