Friday, March 27, 2009

How does the technology spider and track infringing images on the web?

Hi Randy,

Thanks for taking the time to create and open this blog. Can you give us some idea of the technology that your web spider uses to track infringing images on the web?

Thanks

Stan Rowin

(This question is from a comment to the Welcome message at 9:38 AM, March 27, 2009, by Stan Rowin ... see reply in comments in this thread)

4 comments:

  1. Hi Stan,

    Thanks for asking such an intelligent question. The spidering technology that we’ve developed combines targeting and scalability to accomplish its core task, which is to wander around the Internet and find files. What the spider gives us back are text URLs to “objects”. One URL points to the web page on which the object was found, and it is recorded with the date the object was found there. The other URL points to where the object is stored. In practical terms, we might find an image of, say, Marilyn Monroe on a MySpace page while the image is actually stored at, and hotlinked from, PhotoBucket.

    In targeting, each spider or web crawler can be tasked to look for files with specific search parameters. For example, we can tell it to search for only one file type, such as “.jpg”, “.gif”, “.mp3”, “.mov” or about 50 other file types. We can tell it to only spider one web domain if we really want to focus on one place that is using a lot of images, for example. We can also tell the spider to look for web pages that contain specific words instead of just wandering. This is helpful if we want to target pages with content relating to trademark phrases of companies, people’s names or other trigger words.
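
    As an illustration only, this kind of targeting could be sketched in Python as follows. `SpiderTask` and `matches` are hypothetical names, and treating a page as a hit when it contains any one of the trigger words is an assumption; this is not the spider's actual code.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse
import posixpath


@dataclass
class SpiderTask:
    """Hypothetical search parameters for a single spider."""
    file_types: frozenset = frozenset({".jpg", ".gif"})  # extensions to collect
    domain: Optional[str] = None   # restrict to one web domain, if set
    keywords: tuple = ()           # trigger words the host page should contain


def matches(task, object_url, page_url, page_text):
    """Return True if a found object satisfies the task's parameters."""
    # File type: compare the extension of the object's URL path.
    ext = posixpath.splitext(urlparse(object_url).path)[1].lower()
    if ext not in task.file_types:
        return False
    # Domain: the page must live on the target domain or one of its subdomains.
    if task.domain is not None:
        host = urlparse(page_url).hostname or ""
        if host != task.domain and not host.endswith("." + task.domain):
            return False
    # Keywords: with none set the spider just wanders; otherwise any one word is enough.
    if not task.keywords:
        return True
    text = page_text.lower()
    return any(word.lower() in text for word in task.keywords)
```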

    In making it scalable, each spider is a self-contained software program. This means that we can (and do) run multiple spiders simultaneously on the same computer. And we can (and do) run multiple spiders on multiple servers from multiple locations. There is almost no limit to how many spiders can run.
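
    A rough sketch of running several spiders side by side is shown below. Threads stand in for the separate self-contained programs described above, and `run_spider` is a hypothetical placeholder that only fabricates one result per seed rather than crawling anything.

```python
from concurrent.futures import ThreadPoolExecutor


def run_spider(seed):
    """Placeholder for one spider; a real one would crawl outward from `seed`."""
    # Pretend result: one (page URL, object URL) pair per seed.
    return [(seed, seed.rstrip("/") + "/image.jpg")]


def run_spiders(seeds, max_workers=4):
    """Run one spider per seed concurrently and merge the findings into one list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(run_spider, seeds)  # results come back in seed order
        return [pair for found in results for pair in found]
```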

    All these URLs and dates that are captured are placed in a list. A second process then goes to each URL and creates a unique ID number from the file at that URL. Because a digital image file is basically ones and zeros, every file by its very nature has a unique ID number. This unique ID is then stored with the URL and the date it was found and re-found and re-found again at that URL. This creates a history of use of each image file at each location.
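
    To make the idea concrete, here is a minimal sketch of an ID-plus-history store. Using SHA-256 over the file's bytes is an assumption chosen for illustration, not a statement of how the ID is actually generated, and `UseHistory` is a hypothetical name.

```python
import hashlib
from collections import defaultdict
from datetime import date


def file_id(data: bytes) -> str:
    """Derive an ID from the file's bytes (SHA-256 is an assumed stand-in)."""
    return hashlib.sha256(data).hexdigest()


class UseHistory:
    """Hypothetical store: file ID -> object URL -> dates the file was found there."""

    def __init__(self):
        self._seen = defaultdict(lambda: defaultdict(list))

    def record(self, data, object_url, found_on):
        """Log one sighting of a file at a URL and return the file's ID."""
        fid = file_id(data)
        self._seen[fid][object_url].append(found_on)
        return fid

    def history(self, fid):
        """Return every URL where the file was seen, with the dates of each sighting."""
        return {url: list(dates) for url, dates in self._seen.get(fid, {}).items()}
```

    With an ID of this kind, byte-identical copies share one ID wherever they are stored, while a resized or re-encoded copy would get a different one.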

    Because we store and compare each unique ID, a user can find the owner of an image or other creative work from most any web site in the world. Note that we’ve reversed the logic on this. So, we don’t need to spider the entire Internet to display owner contact information for a particular image file. As long as a photographer has “reclaimed” his or her ownership of an image file in our database, someone can click the C-Tools bookmark at most any web page in the world that has a copy of that file – and it’s probably a copy that the photographer didn’t already know about. Our process then generates a unique ID from that new file and compares it to what’s already in the C-Registry database, finds the match, and displays the contact information, ecommerce link, U.S. copyright registration link, etc. that the photographer has chosen to associate with that image file.
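
    The reversed lookup can be sketched in a few lines. `Registry`, `reclaim` and `find_owner` are hypothetical names, the contact fields are invented for the example, and the SHA-256 ID is again an assumption rather than the registry's documented method.

```python
import hashlib


def file_id(data: bytes) -> str:
    """Derive an ID from the file's bytes (SHA-256 is an assumed stand-in)."""
    return hashlib.sha256(data).hexdigest()


class Registry:
    """Hypothetical owner database keyed by file ID."""

    def __init__(self):
        self._owners = {}

    def reclaim(self, data, contact):
        """A creator associates contact details with one of their files."""
        self._owners[file_id(data)] = contact

    def find_owner(self, data):
        """Given a copy found anywhere, return the owner's details or None."""
        return self._owners.get(file_id(data))


registry = Registry()
registry.reclaim(b"original image bytes", {"name": "Jane Doe", "email": "jane@example.com"})
# A user later finds an identical copy on some third-party page:
owner = registry.find_owner(b"original image bytes")
```

    No crawl is needed at lookup time: the copy the user is looking at supplies the ID, and the database only has to contain the reclaimed original.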

    So, what’s important is that photographers reclaim as many of their images as possible. In doing so, all the copies of those reclaimed files will lead a user back to the photographer, regardless of where in the world the user finds the copy.

    As the database grows from a combination of ongoing spidering, creator additions and user-triggered additions, the likelihood of matches also grows. Photographers will see more URLs of Use for their image files, and users will find more content owners. I should reiterate that photographers do NOT need to upload any images with our process. It’s all done with the text URLs. This makes it extremely efficient.

    And finally, I should point out that we’re not actually finding infringements. We’re just finding files that match. It’s up to each photographer to decide what is an infringement, a previously paid use or a fair use. Each photographer chooses whether to send an invoice and keep 100% of that revenue, to turn the sales lead over to their agent, or to send a DMCA Take Down order. The photographer is in control of what they do with the information provided by our spider.

  2. Hi Randy,

    From the above information I can't tell if your spider is searching for your watermark, or if it is actually doing a pattern recognition to figure out what the picture is of. Can you elaborate?

    Thanks again,
    Stan

  3. The spider does not search for the watermark, nor does it do image recognition. It just goes about finding photos on its own. More accurately, it’s finding the URLs that point to the photos. Either we direct it to go look for something specific, or we tell it to just wander around and find whatever images it can. The spider is a bunch of robots that are roaming, looking for image files (or other file types).

    A second process then takes each URL and does some analysis on each image file, but without downloading the file. This ingestion process checks whether the file’s IPTC metadata is populated, then captures and displays it if it is. It also creates the unique ID that is specific to that image file.

    (Note: It is true that we plan to offer image recognition. But, that is not required to find URLs of use or image owners. Image recognition search is a future product that will be part of the upgraded account level that is free for ASMP members.)

    I should note that the spider and the watermark (which we call Veripixel™) are different processes for different purposes. Veripixel is intended to be a new form of copyright notice, supplementing IPTC and a text credit line adjacent to the image.

  4. Image recognition added. To update this blog, image recognition was added to C-Registry.us in 2009.

    This function assists users who are looking for the rights holders of content, such as photographic images. So, if a user finds an image at a third party web site and clicks the C-Registry bookmark (or other trigger) to see who owns it, the process first looks for an exact match, then expands the search to look for the most similar pictures in order of similarity.
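
    The two-stage search described here (exact match first, then nearest matches in order of similarity) might look roughly like the sketch below. The similarity measure is an assumption: each record is given a hypothetical integer fingerprint, and similarity is taken as the Hamming distance between fingerprints. The actual matching method is not described in this thread.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_matches(query_id, query_fp, records, limit=5):
    """Search `records`, a list of (file_id, fingerprint, owner) tuples.

    Stage 1 returns exact ID matches if there are any.
    Stage 2 otherwise returns the `limit` most similar records, closest first.
    """
    exact = [r for r in records if r[0] == query_id]
    if exact:
        return exact
    ranked = sorted(records, key=lambda r: hamming(query_fp, r[1]))
    return ranked[:limit]
```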

    In this way, once a photograph is registered at The Copyright Registry at C-Registry.us, every copy of that image on the Internet, including uncredited copies, can be traced back to its owner using this process. (Some estimates say there is an average of 100 unauthorized copies for every authorized use of each professional image, and many thousands of copies for more popular pictures.)

    There are nearly a dozen companies that employ image recognition with stock photos and other types of photography. These services are focused primarily on catching copyright infringers. C-Registry's patent-pending process is different, indeed unique, in that this technology is optimized in the opposite direction - to find the owners from the copies of the images. The process is language neutral, which means the third-party web page that contains the image can be in any language.

    (Note: Photographers often describe this as "image recognition". But in fact, this type of search is more accurately described as "pattern recognition" because it also works for video and music.)
