Citybot: A word about crowd-sourced location data

One of the challenges in creating an algorithmic travel guide, or any location-based service for that matter, is acquiring quality data. The Internet made all the content in the world available to everyone, yet it is becoming increasingly difficult to find quality content underneath a growing pile of duplicated and inaccurate results. Companies whose main business is search recognize the problem and are trying to solve it on a global scale. That's why we see many startups and big companies trying to apply a social angle to search, power search by real people, and use semantics to make sense of the confusion of content. Someone has to solve this problem and hopefully someone will.

Crowd-sourcing is a great idea. With the explosion of self-publishing and Web 2.0, crowd-sourcing became the primary way to amass huge databases of travel-related content. In no time, sites like TripAdvisor, Yelp, YP and Citysearch were able to collect hundreds of thousands of reviews from amateur writers and upset customers. However, there is a downside, and it's the same reason you get laughed at for citing a Wikipedia article in a research paper. Lacking a reliable validation system, crowd-sourcing tends to produce unreliable and often incorrect results. The primary solution so far (manual review by armies of editors) works to an extent, but ultimately is as flawed as the approach that fueled the rise of crowd-sourcing in the first place. Two articles explain why reviews on TripAdvisor and Yelp are often misleading: "Why Online Review Sites Get One Star" and a Wall Street Journal article about TripAdvisor. Granted, both articles are fairly old, the internet has changed a lot since 2007, and many consumer review sites have since started hiring armies of professional editors and writers to weed out and edit bad content. However, the problems are still there. Let's take a look.

Duplication of results

Look at this screenshot from Yelp.com:

Obviously SFMOMA and San Francisco Museum of Modern Art are the same place, yet the two listings above have slightly different addresses and different phone numbers. The first listing is likely a red herring or the result of someone's inaccurate submission, as it has no ratings and no reviews. Yelp's editors are probably working tirelessly to weed out these kinds of entries, but Yelp's growing popularity makes this an uphill battle. They may even have some automation tools that flag duplicate results (I hope they don't do it all manually!), but reliably identifying duplicate results algorithmically is also a non-trivial problem.
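To give a feel for why this is non-trivial, here is a minimal sketch in Python of how such a duplicate flag might work. Everything in it is my own assumption for illustration: the function names, the thresholds, the tiny abbreviation table, and the sample listings (the addresses are invented, not the ones in the screenshot). It is certainly not how Yelp does it. The idea is to treat two listings as likely duplicates when their names are near-identical, or one name is an acronym of the other, and their normalized addresses are close.

```python
import re
from difflib import SequenceMatcher

# Assumed, deliberately tiny lookup tables; a real system would need far more.
STOPWORDS = {"of", "the", "and", "for"}
ABBREVIATIONS = {"st": "street", "ave": "avenue", "blvd": "boulevard"}


def tokens(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def acronyms(name):
    """Return acronyms built with and without stopwords (e.g. 'of')."""
    words = tokens(name)
    with_stopwords = "".join(w[0] for w in words)
    without_stopwords = "".join(w[0] for w in words if w not in STOPWORDS)
    return {with_stopwords, without_stopwords}


def normalize_address(address):
    """Expand a few street abbreviations so simple variants compare equal."""
    return " ".join(ABBREVIATIONS.get(w, w) for w in tokens(address))


def names_match(a, b, threshold=0.85):
    """True if the names are near-identical or one is an acronym of the other."""
    a_norm, b_norm = "".join(tokens(a)), "".join(tokens(b))
    if a_norm in acronyms(b) or b_norm in acronyms(a):
        return True
    return SequenceMatcher(None, a_norm, b_norm).ratio() >= threshold


def likely_duplicates(x, y, address_threshold=0.8):
    """Flag two listings for human review; nothing is merged automatically."""
    address_score = SequenceMatcher(
        None, normalize_address(x["address"]), normalize_address(y["address"])
    ).ratio()
    return names_match(x["name"], y["name"]) and address_score >= address_threshold


# Hypothetical listings with invented addresses.
short_name = {"name": "SFMOMA", "address": "100 Example St"}
long_name = {"name": "San Francisco Museum of Modern Art",
             "address": "100 Example Street"}
unrelated = {"name": "Example Tattoo Studio", "address": "900 Sample Ave"}

print(likely_duplicates(short_name, long_name))
print(likely_duplicates(short_name, unrelated))
```

Even this toy version shows where the trouble starts: the acronym only matches because "of" is counted as a word, the thresholds are guesses, and two different businesses in the same building with similar names would be flagged just as readily. That is why a flag like this is best treated as a hint for a human editor rather than a verdict.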

Ambiguous categorization

This is my personal favorite. It seems that Yelp has quite a liberal approach when it comes to classifying locations, perhaps in the interest of erring on the side of extra information. This leads to some interesting observations: apparently, in San Francisco, every tattoo studio, coffee shop, hair salon and day spa that happens to have photos or paintings on the walls is considered an "Art Gallery":

Yes, I get it. The folks at Ginger Rubio are not just hair stylists, they are artists, they are obviously great at what they do, and they have a 4.5-star rating and 106 reviews on Yelp to show for it. And yet we are to believe that it is an Art Gallery? Check out this screenshot of browsing the "Art Galleries" category on Yelp:

Notice how the first two items in the list are actually tattoo studios. If you drill down and read the reviews you will find out that they are great tattoo studios that deliver outstanding results and top-notch customer service. All that's great - unless, of course, you are an art connoisseur visiting San Francisco looking to spend a day browsing the city's finest art offerings. It may be said that Yelp's content is tailored towards locals searching for consumer reviews rather than towards tourists looking for destination recommendations, but this only underscores the limitations inherent in such systems.

Conflicting or incorrect information

Let's pretend I am in New York and would like to visit some museums. Checking Lonely Planet's website, I see that the Tenement Museum comes up on the first page. (I should perhaps mention here that I love Lonely Planet guide books; I have enough of them to fill a bookshelf.) Here's what it says about museum hours:

Looks good, right? It tells me everything I need to know about the opening hours of this museum so I can plan my trip to be there on time. However, if I were to check Yelp for customer reviews, I might stumble onto the following:

This does not exactly contradict Lonely Planet's information, but it seems more precise. Just for the heck of it, let's check one more respected source, Frommer's (again, my collection of these guide books rivals my Lonely Planet collection):

Interesting. This may be the most useful information so far, since it warns that the schedule is complicated and implores you to check for yourself. If you take this advice and go to the museum's website, you are presented with a neat AJAXy calendar and tour schedule:

Problem solved.

And yet, something is wrong with this picture. In an age when practically everyone and everything is connected by endless streams of real-time information, is it too much to expect to be able to quickly and easily find the operating hours of a large museum in New York City? Is it unreasonable to expect to do this in two minutes, right on your smartphone, instead of cross-checking multiple sources, browsing websites, and calling around? We at Citybot definitely don't think so. Someone has to solve this problem, and if we can play even a small part in the solution, we would be helping travelers everywhere enjoy their day.

Thursday, 8 September 2011
