Fetching Data from Multiple Pages Using Beautiful Soup

Shivaji Ray Chaudhuri
5 min readOct 17, 2020

It is rightly said that data is one of the most important things in the world we live in today. Almost everything we do in our day-to-day lives has something to do with data, whether it is stored in our neurons or in a giant data center in some foreign country. Humans have a progressive mind, which is why we always try to make our lives easier, and if evolution has taught us anything, it is that we are headed in a very good direction overall.

We aim to make our lives easier, and this is only possible when we observe, learn from our surroundings, recollect that knowledge, and think of possible ways to overcome obstacles. Technology has helped by giving us the resources to store that information as a record. Data has helped shape the modern technological world: everything that makes our lives easy is the result of some computation and processing of data, either in real time or in research labs. Since data holds such importance, it is also worthwhile to know some methods for collecting it.

In this article, I present a simple method for getting started with collecting data, using the example of building a small dataset from a website. First we will look at how to scrape the data, and then at how, with the help of some simple preprocessing, we can turn it into a dataset.

The library used for collecting data from websites is Beautiful Soup, genuinely one of the best libraries available for scraping data from web pages. It collects data from HTML pages by walking through the tags.

Now, first steps first… let's import the libraries required for this segment.

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd  # used at the end to build the dataset

driver = webdriver.Chrome("chromedriver.exe")

Now, before we go further, we need to know what a web driver is. Basically, web drivers allow automated testing of web-based applications, and ChromeDriver is simply a standalone server that implements this standard. A point to note: make sure the chromedriver.exe file is present in the directory where the Python file is; if not, provide the correct path. Next, I am going to fetch data from the Laptops section of the Flipkart.com website. So the next step is to list all the web pages from which you want to fetch data.

pages=["https://www.flipkart.com/laptops/pr?sid=6bo%2Cb5g&p%5B%5D=facets.processor%255B%255D%3DCore%2Bi5&pageUID=1591711874987&otracker=clp_metro_expandable_2_32.metroExpandable.METRO_EXPANDABLE_i5_laptops-store_32W7H5NMHFBQ_wp16&fm=neo%2Fmerchandising&iid=M_61e23960-eda8-4178-bdbc-6bdf00e0bc39_32.32W7H5NMHFBQ&ppt=clp&ppn=laptops-store&ssid=44d452czs00000001602750012860"]

Here I have listed only one page, otherwise it would have been a mess. Do check my code on GitHub, where I have shared the entire thing.
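If you do want multiple pages, many listing sites paginate with a simple page query parameter. A small sketch, assuming Flipkart's listing URLs accept such a parameter (verify the actual pattern in your browser; the shortened base URL below is hypothetical):

```python
# Hypothetical, shortened base URL; the real listing URL is much longer
base_url = "https://www.flipkart.com/laptops/pr?sid=6bo%2Cb5g"

# Assumption: the site paginates via an appended page parameter
pages = [base_url + "&page=" + str(n) for n in range(1, 4)]
print(pages)
```

Each URL in this list can then be fed to the scraping loop shown further below.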

# Create lists for storing the fetched data
products=[] #For names of products
prices=[] # For product prices
ratings=[] # For storing ratings
revs=[] # For calculation processes
nratings=[] #For number of ratings
nreviews=[] #For number of reviews

Now let's start with the fetching part; this is where the real fun begins. First, we take a look at what we want to extract.

(Screenshot of a product card on the listing page; we want to extract the highlighted parts.)

When extracting from a website where all the products are listed, most of the time the products are enclosed in identical divs that repeat over and over again.
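The idea can be seen on a toy page (the HTML and class names below are invented for illustration; the real Flipkart classes appear in the next steps): one findAll over the shared container class returns every product card.

```python
from bs4 import BeautifulSoup

# Toy listing: three products, each inside a container div with the same class
html = """
<div class="item"><div class="name">Laptop A</div><div class="price">Rs.45,990</div></div>
<div class="item"><div class="name">Laptop B</div><div class="price">Rs.52,490</div></div>
<div class="item"><div class="name">Laptop C</div><div class="price">Rs.61,990</div></div>
"""

soup = BeautifulSoup(html, "html.parser")
names = []
for item in soup.find_all("div", attrs={"class": "item"}):
    # Each container has the same inner structure, so the same find() works for all
    names.append(item.find("div", attrs={"class": "name"}).text)

print(names)  # ['Laptop A', 'Laptop B', 'Laptop C']
```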

First of all we want the class of each container in which a product's details are enclosed. In this case it is _31qSD5.
The name of the product is contained within the div with class '_3wU53n'. When we extract this div, we receive a bs4 Tag object from which we can pull out the text part.
The price of the product can be extracted the same way as the name.
However, when we want to extract the rating, the number of reviews, and the number of ratings from their respective div and span classes, there is a problem.
#Code for data scraping
for j in range(len(pages)):
    driver.get(pages[j])
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
    for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
        name = a.find('div', attrs={'class': '_3wU53n'})
        price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
        rating = a.find('div', attrs={'class': 'hGSR34'})
        rev = a.find('span', attrs={'class': '_38sUEc'})
        products.append(name.text)
        prices.append(price.text)
        ratings.append(rating)
        revs.append(rev)

Now, the problem I mentioned earlier shows up in the ratings and reviews part. Various tutorials and videos extract these by the same method as the product name and price, but that results in a "'NoneType' object has no attribute 'text'" error whenever the element is missing. To solve this, we extract the entire tag from that div and apply some text preprocessing to get the required information.
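A defensive alternative (a sketch, not the method used in the article's code) is to check the tag before touching .text, since find() returns None when the element is absent. The toy HTML below mimics a listing where one card has no rating:

```python
from bs4 import BeautifulSoup

html = '<div class="card"><div class="rating">4.8</div></div><div class="card"></div>'
soup = BeautifulSoup(html, "html.parser")

ratings = []
for card in soup.find_all("div", attrs={"class": "card"}):
    tag = card.find("div", attrs={"class": "rating"})
    # tag is None when the product has no rating; tag.text would raise AttributeError
    ratings.append(tag.text if tag is not None else None)

print(ratings)  # ['4.8', None]
```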
First we convert the price values, which are currently strings, to floats; for that we need to remove the rupee symbol along with the commas.

for i in range(len(prices)):
    prices[i] = float(prices[i][1:].replace(',', ''))
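The same conversion on a standalone string, to make the slicing explicit (the sample price is invented):

```python
raw_price = "₹52,990"
# Drop the leading rupee sign with [1:], strip the thousands separators, then cast
price = float(raw_price[1:].replace(",", ""))
print(price)  # 52990.0
```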

Now for the ratings, some string slicing is needed. First, let's see what we are dealing with.

This is one element of the list; first we need to convert it to a string and then extract the number, e.g. 4.8. The point to note here is that the number will always start at the same index.
#Extracting the rating and converting it to float
for i in range(len(ratings)):
    ratings[i] = str(ratings[i])[20:23]
for i in range(len(ratings)):
    if ratings[i] == '':
        ratings[i] = 0.0
    elif ratings[i][1] == "<" or ratings[i] == "1nL":
        ratings[i] = float(ratings[i][0])
    else:
        ratings[i] = float(ratings[i])

What is done here: first we convert each element of the list into a string, then we convert it to a float. The first two conditions handle exceptions, such as when no rating is present or some other value appears instead ("<", "1nL"); there might be something different in your case, so just print the ratings list and add whatever value appears to the elif condition.
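Index-based slicing breaks if the markup shifts even slightly. A more robust alternative (a sketch, not part of the original code; it assumes ratings always render as a single-decimal number like 4.8) is a regular expression that finds the first such number in the tag's string form:

```python
import re

def extract_rating(tag_str):
    """Return the first x.y number in the tag's string form, or 0.0 if absent."""
    m = re.search(r"\d\.\d", tag_str)
    return float(m.group()) if m else 0.0

# The class name below mirrors the one used on the listing page
print(extract_rating('<div class="hGSR34">4.8</div>'))  # 4.8
print(extract_rating('<div class="hGSR34"></div>'))     # 0.0
```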
Now coming to the number of reviews and the number of ratings; a similar procedure is used for them as well.

(Screenshot of the reviews span; we are interested in the two highlighted numbers, the ratings count and the reviews count.)

To extract these two numbers, various methods are available, and it is up to you to experiment. The procedure is the same: convert to a string first, set conditions for exceptions, and then typecast to float or integer.
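As one such alternative to hard-coded index arithmetic, a regex can pull every comma-separated number out of the span's string form. The sample string below is a mock-up of what str(rev) might render to; the exact markup is an assumption:

```python
import re

rev_str = '<span>1,12,351 Ratings&nbsp;<span>9,354 Reviews</span></span>'
# Grab every run of digits (with optional comma groups), then strip commas and cast
numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", rev_str)]
print(numbers)  # [112351, 9354]
```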

for i in range(len(revs)):
    if str(revs[i]) != "None":
        a = str(revs[i])[34:str(revs[i]).index(' ', 34)]
        b = 56 + len(a)
        ind1 = str(revs[i]).index("<span>", b) + 7
        ind2 = str(revs[i]).index(' ', ind1)
        nratings.append(a)
        nreviews.append(str(revs[i])[ind1:ind2])
    else:
        nratings.append('0')
        nreviews.append('0')
for i in range(len(nratings)):
    nratings[i] = int(nratings[i].replace(',', ''))
    nreviews[i] = int(nreviews[i].replace(',', ''))

Finally we can create our dataset…

df=pd.DataFrame({'Product Name':products,'Price':prices,'Ratings':ratings,'Number of Ratings':nratings,'Number of Reviews':nreviews})
df.to_csv("dataset.csv")

So with just a few simple steps you can create a dataset of your own from scratch.

I did it for an e-commerce website, but this can be done on any website. Just look for the tags and extract them; if there are problems extracting certain parts, extract the whole element and do some string slicing here and there to get the desired data.

Check the entire code for fetching information from multiple pages by clicking here.

Thanks for reading !

Shivaji Ray Chaudhuri

Data Scientist fueled by a passion for travel and capturing moments. A die-hard Chelsea fan, always striving to turn data into insightful stories.