PeARS Responsible Data Blog Post

Responsible Data Challenges In The PeARS Project.

The part of PeARS development that I am responsible for is processing URLs into a format that is useful for semantic processing. I am also responsible for the user experience of blacklisting domains so that they are excluded from the search results. At this time the script only works on modern Linux distributions (tested on Ubuntu and Arch) that use Firefox as their browser.

How It Works

At this time, running the script takes the user's Firefox history, retrieves the links, extracts the body text from each document and stores it in a SQLite database called history.db. This database is located on the user's hard drive, outside the PeARS directory, so it will not be accidentally "pushed" to the PeARS repository. The reasons for this are both technical (you don't want to try to version a large binary from each user) and privacy related (users will not want to see their own history database in a public Git repository).
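The first and last steps of that pipeline can be sketched as follows. This is a minimal illustration, not the actual create_history_db.py: it assumes the history is read from the `moz_places` table of Firefox's `places.sqlite` file, and the function names and the `pages` table layout are hypothetical. Fetching each page and extracting its body is left out. Firefox may keep `places.sqlite` locked while it is running, so in practice you would probably work on a copy.

```python
import sqlite3


def read_firefox_urls(places_db):
    """Return every URL stored in a Firefox places.sqlite file."""
    conn = sqlite3.connect(str(places_db))
    try:
        rows = conn.execute("SELECT url FROM moz_places").fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


def store_pages(history_db, pages):
    """Save (url, body) pairs in history.db.

    The 'pages' table and its columns are assumptions made for this sketch.
    """
    conn = sqlite3.connect(str(history_db))
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
        )
        # INSERT OR REPLACE keeps one row per URL when the script is rerun.
        conn.executemany(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", pages
        )
        conn.commit()
    finally:
        conn.close()
```

Keying the table on the URL means that rerunning the script after editing the blacklist overwrites existing rows rather than duplicating them.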

So far, running these scripts gives the user a scrolling commentary of every URL stored in their Firefox history on that computer! This is desired behaviour, and the output can be filtered using the .pearsignore file which, for privacy reasons, is kept in the root of the user's home directory and is not versioned with the PeARS project.

The .pearsignore file consists of a comma-delimited list of domain names that the user wishes to be excluded from their history.db.
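As a rough sketch of how such a list could be applied, the snippet below parses comma-delimited text and checks a URL's host against it. The function names are hypothetical, and treating subdomains of a listed domain as blocked is an assumption on my part, not something PeARS is confirmed to do.

```python
from urllib.parse import urlparse


def parse_ignore_list(text):
    """Turn comma-delimited text into a set of lowercase domain names."""
    return {item.strip().lower() for item in text.split(",") if item.strip()}


def is_ignored(url, ignored_domains):
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in ignored_domains
    )
```

For example, with `example.com` in the list, `https://www.example.com/page` would be filtered out while `https://notexample.com/` would not.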

Web pages browsed in Firefox's private browsing (incognito) mode are not recorded in the web history and are therefore not picked up by PeARS either.

Challenge 1 - Getting User Participation

Once someone has committed to running PeARS, they have the choice of either creating a .pearsignore and pasting in the predefined contents, or taking the supplied file, editing it and placing it in their home directory. Either way, the program will not run unless the file exists. This is very important from a privacy standpoint: the user is forced to acknowledge that they need a blacklist in order to participate.
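A check of this kind might look like the sketch below. It is only an illustration: the function name and the error message are hypothetical, and the real script may handle a missing file differently.

```python
from pathlib import Path


def load_pearsignore(path=None):
    """Return the contents of .pearsignore, refusing to continue without it.

    By default the file is looked up in the user's home directory.
    """
    path = Path(path) if path is not None else Path.home() / ".pearsignore"
    if not path.is_file():
        # Stop here so that no history is processed without a blacklist.
        raise FileNotFoundError(
            f"{path} not found: create a blacklist before running PeARS"
        )
    return path.read_text(encoding="utf-8")
```

Raising an error instead of falling back to an empty list keeps the blacklist an explicit, deliberate step for the user.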

Once the program runs, all links scroll by for the user to see, so if they spot a domain they want filtered, all they need to do is stop the program, edit .pearsignore to include the new domain name and rerun create_history_db.py.

Once this has been accomplished, a user will typically end up with a focused set of URLs that reflects their interests. This is the data that is saved to history.db.

Challenge 2 - Presenting It In A Palatable Way

So far, creating a focused .pearsignore file is a tedious process of watching lines of text scroll by. This has to change to make the process user-friendly and therefore more likely to be adopted. The plan is to present similar URLs together as a block, and give the user a GUI from which to select either individual URLs or blocks of URLs to mark for "denial". The blocking will happen at the domain level rather than the individual URL level.
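One way the grouping step behind such a GUI could work is to bucket URLs by host, so that each block corresponds to one domain the user can approve or deny. This is a sketch of the idea only; the function name is hypothetical and the planned interface may group URLs differently.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_domain(urls):
    """Group URLs into blocks keyed by their lowercase host name."""
    groups = defaultdict(list)
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        groups[host].append(url)
    return dict(groups)
```

Because blocking happens at the domain level, selecting any URL in a block would amount to adding that block's key to .pearsignore.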
