Information System

Scraping

http://wiki.issuecrawler.net/twiki/bin/view/Dmi/MethodsByTheme#Scrape

Sources

Uncategorized section of ResearchGuide.

Coop America is probably the best resource we currently have. They have a search page, and here's what I get from searching for "cocoa": http://www.googlesyndicatedsearch.com/u/coopamerica?q=cocoa&x=0&y=0

Dotherightthing.com is a site that provides easy-to-read, user-generated summaries that quickrate would be ideal for. Also check out KnowMore.org, which has a search function that you might find helpful. Here's what I got from searching for "cocoa": http://knowmore.org/wiki/index.php?search=cocoa&fulltext=

Proposal

I bet that you can parse the searches of Coop America in a consistent way because their reports are organized well.

First off, a search can come up with several similar (or perhaps identical) pages, so we have to be careful. What we really want (at least to start off with) are the company pages, which seem to share a common format.

For example, the sections are the same for both of these companies: http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=187 and http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=198.
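Here's a rough sketch of the "be careful about duplicates" step. It assumes we have already pulled the result links out of a search page (that part is not shown), and it assumes a company page is identified by the `id` query parameter, as in the two URLs above. The function name is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Path taken from the two example company URLs; assumed to hold for all companies.
COMPANY_PATH = "/programs/responsibleshopper/company.cfm"


def unique_company_pages(urls):
    """Return {company_id: url}, keeping only company pages.

    When several results point at the same company id, the first one wins.
    """
    pages = {}
    for url in urls:
        parts = urlparse(url)
        if not parts.path.lower().endswith(COMPANY_PATH):
            continue  # not a company page, skip it
        ids = parse_qs(parts.query).get("id")
        if not ids:
            continue  # company page without an id, skip it
        pages.setdefault(ids[0], url)
    return pages
```

Anything that is not a company page is simply dropped here; we may want to log those instead so someone can look through them by hand.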

So each of their company listings probably follows this format:

Opening section - This section has a series of bullet points each of which could be a quick review.

About x - This section contains information that could be added to the description of a company.

Campaigns - This section includes summaries that are listed once you click on "read more". These could be reviews in and of themselves. It also includes links to the more detailed sources where the stories come from (which we should include, and perhaps also scrape or save somewhere to be gone through manually).

Affiliates - This section has a listing of any known affiliates. These are the equivalent of other nodes, and it would be great if they were automatically added to our system as nodes connected to the one being focused on, as its subsidiaries. The first two companies shown are not included in the longer list that appears when "read more" is clicked, so they should be added to that longer list.
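A small sketch of the affiliates step, assuming the parser has already produced two lists of names: the two shown up front and the ones in the expanded "read more" list. The function names and the "subsidiary" edge label are made up for illustration; how nodes and connections are really stored in our system is not decided here.

```python
def merge_affiliates(preview, expanded):
    """Combine the up-front affiliates with the expanded list.

    Order is preserved and names are de-duplicated, ignoring case and
    surrounding whitespace.
    """
    seen = set()
    merged = []
    for name in list(preview) + list(expanded):
        cleaned = name.strip()
        key = cleaned.casefold()
        if key and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return merged


def subsidiary_edges(parent, affiliates):
    """Return (parent, child, relation) tuples for each affiliate."""
    return [(parent, child, "subsidiary") for child in affiliates]
```

Matching on the name alone is crude; two spellings of the same company would still end up as two nodes.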

Contact x - This contact information can also be added to the description of the company. (Note: once a node has been marked as having automatically received information from Coop America, it should no longer accept more "contact" or "about" information from later Coop America queries. Either that, or perhaps there is a section of the company info which gets rewritten by newer automatic reviews. Let's talk more about this.)
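To make the two options in the note concrete, here's a sketch where a node is just a dict and automatically imported info lives in its own per-source section. With `overwrite=False` the first import sticks and later ones are ignored; with `overwrite=True` the newer import rewrites that section. The dict layout, the `"auto"` key and the function name are all assumptions for discussion, not how nodes are actually stored.

```python
AUTO_FIELDS = ("about", "contact")


def apply_auto_info(node, source, fields, overwrite=False):
    """Store automatically scraped 'about'/'contact' info on a node.

    node      -- dict describing a company (hypothetical layout)
    source    -- name of the site the info came from
    fields    -- dict that may contain 'about' and/or 'contact'
    overwrite -- False: keep what an earlier import stored;
                 True: let the newer import replace it
    """
    section = node.setdefault("auto", {}).setdefault(source, {})
    for key in AUTO_FIELDS:
        if key not in fields:
            continue
        if key in section and not overwrite:
            continue  # already filled by an earlier import
        section[key] = fields[key]
    return node
```

Keeping the automatic info apart from anything people typed in means neither policy can clobber manual edits.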

Alerts - These should be treated the same as campaigns, except that there is an additional feature provided. Each "review" is listed as part of a subsection like "environment" or "ethics and governance". The parser should be able to tell which subsection the review is within and include that subsection at the top of the review. Perhaps these subsections could also help people by narrowing down the choice of interests that they tag a review with (for example, the "environment" subsection might prompt users to choose between "ecology", "climate change", "pollution", etc.)
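Here's a sketch of how the parser could keep track of which subsection an alert sits in. I have not checked the real markup, so this assumes (purely for illustration) that subsection titles are `<h3>` elements and each alert is a `<p>` element; the tags would need to be swapped for whatever the pages actually use. It only relies on Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class AlertParser(HTMLParser):
    """Collect alerts together with the subsection heading above them.

    ASSUMPTION: headings are <h3>, alerts are <p>. Not verified against
    the real Coop America pages.
    """

    def __init__(self):
        super().__init__()
        self.subsection = None  # most recent heading seen
        self.alerts = []
        self._tag = None        # tag we are currently collecting text for
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "p"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if tag == "h3":
            self.subsection = text
        elif text:
            self.alerts.append({"subsection": self.subsection, "text": text})
        self._tag = None


def parse_alerts(html):
    parser = AlertParser()
    parser.feed(html)
    parser.close()
    return parser.alerts


def review_text(alert):
    """Put the subsection at the top of the review, as proposed above."""
    if alert["subsection"]:
        return alert["subsection"] + "\n" + alert["text"]
    return alert["text"]
```

The same `subsection` value could later be used to look up which interests to suggest to the user.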

Each review must also contain: the name of the company being discussed - the url where the info came from - the date that the review was retrieved - the name of the site where the info came from. Anything else?
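As a starting point for that list, here's a sketch of a review record that refuses to be created without the four required pieces. The class and field names are made up; `text` and `subsection` are extras I added so the record can hold the review itself and the alert subsection.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Review:
    company: str          # name of the company being discussed
    source_url: str       # URL where the info came from
    retrieved_on: date    # date the review was retrieved
    source_site: str      # name of the site where the info came from
    text: str = ""        # the review itself (assumed extra field)
    subsection: Optional[str] = None  # e.g. "environment" (assumed extra field)

    def __post_init__(self):
        # Reject records that are missing any of the required fields.
        for field_name in ("company", "source_url", "source_site"):
            if not getattr(self, field_name).strip():
                raise ValueError(field_name + " is required")
        if not isinstance(self.retrieved_on, date):
            raise ValueError("retrieved_on must be a date")
```

If we think of more required fields, they only need to be added in one place.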