Web Scraping Images at Scale

A huge portion of this project revolved around examining a phenomenon that I was observing at scale. After failing to build a systematic catalog out of documents, screenshots, and spreadsheets, I realized that I needed to roll up my sleeves and learn how to scrape the web. Collecting data at scale required learning the tools and technology behind web scraping, developing a systematic approach to collecting the data, and, lastly, finding a way to "scrub" the data. The initial scraping push yielded over 125,000 images. Using Python, I did some aggressive "garbage collection" to programmatically weed out images that were not the woman, along with multiple copies of the same photoshopped flag. By taking a more aggressive approach, I likely lost some versions of photoshopped flags along the way. What remained was a collection of about 12,500 images of the woman holding flags with unique designs.
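One way to do that kind of programmatic weeding is perceptual hashing, where near-identical images hash to nearly identical values even after resizing or recompression. A minimal sketch using the imagehash library (not necessarily the exact approach used here):

```python
from pathlib import Path

from PIL import Image
import imagehash

def remove_near_duplicates(folder, threshold=4):
    """Keep one copy of each visually distinct image; delete the rest."""
    kept = {}  # perceptual hash -> path of the image we kept
    for path in sorted(Path(folder).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        # Hash difference is a Hamming distance; a small value means
        # "visually the same image"
        if any(h - seen <= threshold for seen in kept):
            path.unlink()  # near-duplicate of something already kept
        else:
            kept[h] = path
    return list(kept.values())
```

Raising the threshold makes the weeding more aggressive, which is the trade-off described above: fewer duplicates survive, but some genuinely distinct photoshopped flags get thrown out with them.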

Learning to Scrape

At the beginning of this project, I did not realize how many variations of this woman holding the flag existed. Each time I discovered her image being used, I felt like I had happened upon another one. As I found myself documenting more and more images, the tools for collecting them needed to scale up to tackle the problem. Web scraping is incredibly powerful for this type of data collection.

The first thing I needed to do was get acquainted with the tools and processes. For this, I decided to start with a DataCamp course on Web Scraping in Python. I find that when first getting introduced to a topic, a course can be a helpful way to have someone walk you through the theory while showing you the tools and tech stacks they use.

DataCamp

This course got me familiar with CSS locators and XPath notation. It also got me started working in PyCharm and re-acquainted with Python. Lastly, it introduced me to Scrapy and the concepts behind building a spider. Armed with some new knowledge, I set about writing my first scraper. Following along with a few YouTube videos specifically geared towards scraping Amazon with Scrapy, I was able to get my first spider running. It looked at a specific brand on Amazon and scraped information about the search results. I have run it multiple times, as I was interested in seeing how this data changed over time. While it didn't work perfectly (it kept getting kicked off of Amazon after about 100 pages), it was an encouraging enough first step for me to keep on keeping on.
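That spider followed the basic Scrapy pattern: parse each search-results page, yield one item per product card, and follow the pagination link. A minimal sketch along those lines (the search URL and CSS selectors are illustrative guesses, and Amazon's markup changes often):

```python
import scrapy

class FlagBrandSpider(scrapy.Spider):
    name = "flag_brand"
    # Hypothetical search URL; substitute the brand being tracked
    start_urls = ["https://www.amazon.com/s?k=example+brand+flag"]

    def parse(self, response):
        # Each search result card carries a data-asin attribute
        for card in response.css("div[data-asin]"):
            yield {
                "asin": card.attrib.get("data-asin"),
                "title": card.css("h2 a span::text").get(),
                "price": card.css("span.a-offscreen::text").get(),
            }
        # Follow pagination until Amazon stops serving pages
        next_page = response.css("a.s-pagination-next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```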

I also realized that because scraping is so much about structure, your console log often ends up being very beautiful.

Web scraping can create accidental patterns

Scrapy + SplashRequest

Looking more specifically at my thesis area of interest, I realized that there was a challenge standing between me and scraping information about this image. At this point, I can pretty reliably spot which flag products on Amazon are going to have an image of the woman with the flag (the first image is at a slight angle, there are waves in the flag itself, and the flag pole is black). But that wasn't exactly something I could easily communicate to a computer. After sleeping on the problem for a few nights, I had a solution: what if, instead of searching and scraping Amazon directly, I searched, scraped, and accessed Amazon from a Google reverse image search?

With that, I started to sketch out the parameters and information that I would get from my ideal spider.

Pseudocode for my ideal spider

I realized that I wanted to be able to do a few things that I really did not know how to do while scraping: take screenshots of pages and save images. I also wanted to be able to search Google Images and TinEye at the same time. Given that my scraping experience was grounded in JSON as the output, I started by building some of the structure for a Scrapy spider. I had a few things in place and felt confident that I could suss out the informational data. So, armed with a not-terribly-useful JSON, I went to tackle a stickier problem.

The World's Least Helpful JSON Feed
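For a sense of what the ideal spider was meant to capture, here is a minimal sketch of an item definition; the field names are illustrative guesses, not an actual schema:

```python
import scrapy

class FlagSightingItem(scrapy.Item):
    # Illustrative fields for one reverse-image-search hit
    source_engine = scrapy.Field()     # e.g. "google" or "tineye"
    result_page_url = scrapy.Field()   # page where the flag image appears
    image_url = scrapy.Field()         # direct link to the image file
    screenshot_path = scrapy.Field()   # where the page screenshot was saved
```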

Next, I started tackling the screenshot problem, which led me to SplashRequest. After some noodling around, I was able to get this working, but I had a problem. Things were working okay with TinEye, but Google Images was a bust. Nothing was rendering in the screenshot, and I was really struggling to reliably collect data. I assumed that my CSS locators were wrong. I also assumed that if I was pinging the locators correctly, that would force enough of the page to load to let me get a nice little screenshot. After working under these wrong assumptions for longer than I'd care to admit, I came upon a neat little Chrome extension, ScrapeMate (https://github.com/hermit-crab/ScrapeMate#readme). It helped me realize that the problem wasn't in my locators, but somewhere else. I started investigating infinite scrolls, using the browser's network tool to locate the data source, and a tool named zenserp.
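For reference, requesting a page screenshot through scrapy-splash looks roughly like this. This is a minimal sketch that assumes a Splash instance is running and the scrapy-splash middleware is configured in settings.py; the URL is a placeholder:

```python
import base64

import scrapy
from scrapy_splash import SplashRequest

class ScreenshotSpider(scrapy.Spider):
    name = "screenshots"

    def start_requests(self):
        # Placeholder URL; in practice these came from
        # reverse-image-search result pages
        yield SplashRequest(
            "https://tineye.com/search/example",
            callback=self.parse,
            endpoint="render.json",
            args={"png": 1, "wait": 2.0},  # render, wait 2s, return a PNG
        )

    def parse(self, response):
        # With endpoint="render.json" and png=1, the screenshot comes
        # back as a base64 string under response.data["png"]
        with open("screenshot.png", "wb") as f:
            f.write(base64.b64decode(response.data["png"]))
```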

Selenium

Eventually, I found my way onto this fantastically helpful article: https://medium.com/@wwwanandsuresh/web-scraping-images-from-google-9084545808a2. I set about integrating Selenium on top of Scrapy and Splash. While I was getting some success using both tools together, I decided that this was a silly approach and started fresh, using Selenium alone and focusing on just scraping the images from the Google Images search page.

This was successful and extremely satisfying to get up and running!
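The core of the Selenium approach, following the same general pattern as the article above, is a loop that scrolls the results page to trigger lazy loading and harvests the image URLs. A minimal sketch, not the exact script:

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

def collect_image_urls(search_url, max_scrolls=10):
    driver = webdriver.Chrome()  # launches a visible Chrome window
    driver.get(search_url)
    urls = set()
    for _ in range(max_scrolls):
        # Scroll to the bottom to trigger Google Images' lazy loading
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give the new thumbnails time to load
        for img in driver.find_elements(By.CSS_SELECTOR, "img"):
            src = img.get_attribute("src")
            if src and src.startswith("http"):
                urls.add(src)
    driver.quit()
    return urls
```

From there, each collected URL can be downloaded with something like urllib.request.urlretrieve.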

End result! I now have a tool that I can use to scrape oh so many versions of this woman with a flag. I ran it twice on two different start images and just collected 2,633 images in half an hour, including this gem, which might be my new fav flag.

Possibly my favorite flag

Process

Watching the scraper open an instance of Chrome and pull every image
Reviewing the images pulled from one source
Signs of scraping, multiple instances of Chrome

Dealing with the Results

It took almost two weeks, but I was able to scrape 215 reverse image links and collect 125,559 images. Some of these images are not of the woman holding the flag, and I'm sure that plenty of them are duplicates. The next step in this process will be to collate those images and see if I can remove the duplicates to get a total count of how many variations on this flag I was able to collect.

The goal is to get as many variations of the image as possible. I landed on a process utilizing Google's reverse image search to yield pages of flag variations.

As I've been working on this project, I have built up a sizeable list of different brands that are using the image in question. This list was the starting point for collecting the images.

Going to the product page, I discovered that I got the best results by using Google Lens to provide a reverse image link (as opposed to uploading the image or pasting the image URL directly into Google's reverse image search). From there, if the results offered an "All sizes" option, I grabbed that URL and added it to my spreadsheet of URLs to be scraped, which I then worked through methodically, pulling images.

I did this for each flag variation on the product page to get the most results possible; the collection loop is sketched below.
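Tying it together, the loop amounts to reading the spreadsheet of "All sizes" URLs and feeding each one to the scraper. A sketch, assuming the spreadsheet is exported as urls.csv with a "url" column and reusing the collect_image_urls helper sketched earlier:

```python
import csv
import pathlib
import urllib.request

# Hypothetical export of the spreadsheet: urls.csv with a "url" column
out_dir = pathlib.Path("flag_images")
out_dir.mkdir(exist_ok=True)

with open("urls.csv", newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):
        # collect_image_urls is the Selenium helper sketched above
        for j, img_url in enumerate(collect_image_urls(row["url"])):
            try:
                urllib.request.urlretrieve(img_url, out_dir / f"{i:03d}_{j:05d}.jpg")
            except Exception:
                pass  # skip URLs that fail to download
```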

The first step of scrubbing the data was manually removing images that were not of the woman holding the flag.