Fetching Data from Multiple Pages Using Beautiful Soup
Data is often called one of the most valuable resources in the world today. Almost everything we do in our day-to-day lives involves data in some way, whether it is stored in our neurons or in a giant data center in some other country. Humans have a progressive mind, which is why we constantly try to make our lives easier, and if evolution has taught us anything, it is that we are headed in a good overall direction.
We aim to make our lives easier, and this is only possible when we observe our surroundings, learn from them, recollect what we have learned, and think of ways to overcome obstacles. Technology has helped by giving us the means to store that information as a record. Data has shaped the modern technological world: everything that makes our lives easy is the result of some computation and processing of data, either in real time or in research labs. Since data holds such importance, it is also worthwhile to know some methods of collecting it.
In this article, I present a simple way to get started with collecting data, using the example of building a small dataset from a website. We will first look at how to scrape the data, and then at how, with some simple preprocessing, we can turn it into a dataset.
The library used here for collecting data from websites is Beautiful Soup, genuinely one of the best libraries available for scraping webpages. It extracts data from HTML pages by walking through their tags.
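To give a feel for how Beautiful Soup walks through tags before we tackle a real site, here is a minimal sketch on a made-up HTML snippet (the class names and values below are invented for illustration):

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a real product listing
html = """
<div class="product">
  <div class="name">Example Laptop</div>
  <div class="price">₹49,990</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# find() walks the tag tree and returns the first matching tag
name = soup.find("div", attrs={"class": "name"}).text
price = soup.find("div", attrs={"class": "price"}).text
print(name, price)  # Example Laptop ₹49,990
```

The same `find`/`findAll` calls, pointed at the right class names, are all we need for the real page later on.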
Now, first steps first: let's import the libraries required for this segment.
import pandas as pd  # needed later to build the dataset
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome("chromedriver.exe")
Now, before we go further, we need to know what a web driver is. Web drivers allow automated testing of web-based applications, and ChromeDriver is simply a standalone server that implements this standard. A point to note: make sure the chromedriver.exe file is present in the same directory as the Python file; if not, provide its correct location. Next, I am going to fetch data from the Laptops section of the Flipkart.com website, so the following step is to list all the webpages from which you want to fetch data.
pages=["https://www.flipkart.com/laptops/pr?sid=6bo%2Cb5g&p%5B%5D=facets.processor%255B%255D%3DCore%2Bi5&pageUID=1591711874987&otracker=clp_metro_expandable_2_32.metroExpandable.METRO_EXPANDABLE_i5_laptops-store_32W7H5NMHFBQ_wp16&fm=neo%2Fmerchandising&iid=M_61e23960-eda8-4178-bdbc-6bdf00e0bc39_32.32W7H5NMHFBQ&ppt=clp&ppn=laptops-store&ssid=44d452czs00000001602750012860"]
Here I have listed only one page, otherwise it would have been a mess. Do check my code on GitHub, where I have shared the entire thing.
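Since listing pages of a category usually differ only by a page-number query parameter, the `pages` list can also be built in a loop instead of by hand. A sketch of that idea, with the long query string shortened to a hypothetical `base_url` for readability:

```python
# Hypothetical, shortened base URL; the real one carries many more query parameters
base_url = "https://www.flipkart.com/laptops/pr?sid=6bo%2Cb5g"

# Build the list of listing pages by varying the page number
pages = [f"{base_url}&page={n}" for n in range(1, 4)]
print(pages[0])
```

The scraping loop below then works unchanged over however many pages the list contains.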
# Create lists for storing the fetched data
products = []  # For names of products
prices = []    # For product prices
ratings = []   # For storing ratings
revs = []      # For calculation processes
nratings = []  # For number of ratings
nreviews = []  # For number of reviews
Now let's start with the fetching part; this is where the real fun begins. First, we take a look at what we want to extract.
When we extract from a website where all the products are listed, the products are most of the time enclosed in identical divs that are repeated over and over.
# Code for data scraping
for j in range(len(pages)):
    driver.get(pages[j])
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
    # Each product card is an <a> tag with the class below
    for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
        name = a.find('div', attrs={'class': '_3wU53n'})
        price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
        rating = a.find('div', attrs={'class': 'hGSR34'})
        rev = a.find('span', attrs={'class': '_38sUEc'})
        products.append(name.text)
        prices.append(price.text)
        ratings.append(rating)  # keep the whole tag; parsed later
        revs.append(rev)        # keep the whole tag; parsed later
Now the problem I was mentioning earlier shows up in the ratings and reviews part. In various tutorials and videos I have seen the data extracted with the same method used for the product name and price, but here that results in a "'NoneType' object has no attribute 'text'" error, because not every product card contains a rating tag. To solve this, we extract the entire tag from that div and apply some text preprocessing afterwards to get the required information.
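A common defensive pattern for the same problem, shown here as a sketch rather than the method used in this article, is to check for None before reading `.text` (the HTML snippet below is invented; only the `hGSR34` class name comes from the article):

```python
from bs4 import BeautifulSoup

# Made-up product card that has a name but no rating tag
html = '<a href="#"><div class="name">Laptop A</div></a>'
soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")

rating = a.find("div", attrs={"class": "hGSR34"})  # not present, so find() returns None
# Guard against the missing tag instead of calling .text on None
rating_text = rating.text if rating is not None else "0.0"
print(rating_text)  # 0.0
```

Either approach works; the article's approach of storing the whole tag and slicing it later keeps all the raw information around for inspection.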
First, we convert the price values, which are currently strings, to floats; for that we need to remove the rupee symbol along with the commas.
for i in range(len(prices)):
    # Drop the leading rupee symbol and the thousands separators
    prices[i] = float(prices[i][1:].replace(',', ''))
Now for the ratings, some text slicing is needed. First, let's see what we are dealing with.
# Extracting the rating and converting it to float
for i in range(len(ratings)):
    ratings[i] = str(ratings[i])[20:23]
for i in range(len(ratings)):
    if ratings[i] == '':
        ratings[i] = 0.0  # no rating present
    elif ratings[i][1] == "<" or ratings[i] == "1nL":
        ratings[i] = float(ratings[i][0])  # only the first character is the rating digit
    else:
        ratings[i] = float(ratings[i])
What is done here is that we first convert every entry of the list into a string and then convert it to a float. The first two conditions handle exceptions, such as when no rating is present and some other value appears instead ("<", "1nL"). The stray values might be different in your case; just print the ratings list and add whatever value appears to the elif condition.
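The fixed-index slicing above depends on the exact length of the tag string, which makes it brittle. A more robust alternative, sketched here with an assumed example of what `str(ratings[i])` looks like, is to pull the rating out with a regular expression:

```python
import re

# Hypothetical stringified rating tags, mimicking what str(ratings[i]) might yield
raw = ['<div class="hGSR34">4.3</div>', None]

parsed = []
for r in raw:
    # Look for a "digit.digit" pattern between tag brackets
    m = re.search(r">(\d\.\d)<", str(r))
    parsed.append(float(m.group(1)) if m else 0.0)

print(parsed)  # [4.3, 0.0]
```

This way, a change in the surrounding class names does not silently shift the slice indices.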
Now coming to the number of reviews and the number of ratings: a similar procedure is used for them as well.
There are various methods available to extract these two numbers, and it is up to you to experiment here. The procedure is the same: convert to a string first, set conditions for the exceptions, and then typecast to float or integer.
for i in range(len(revs)):
    if str(revs[i]) != "None":
        # Slice the number of ratings out of the stringified span
        a = str(revs[i])[34:str(revs[i]).index(' ', 34)]
        b = 56 + len(a)
        # Locate the inner <span> holding the number of reviews
        ind1 = str(revs[i]).index("<span>", b) + 7
        ind2 = str(revs[i]).index(' ', ind1)
        nratings.append(a)
        nreviews.append(str(revs[i])[ind1:ind2])
    else:
        nratings.append('0')
        nreviews.append('0')

for i in range(len(nratings)):
    nratings[i] = int(nratings[i].replace(',', ''))
    nreviews[i] = int(nreviews[i].replace(',', ''))
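As with the ratings, the index-based slicing can be swapped for a regular expression if you prefer. A sketch, assuming the span's visible text has the common "N Ratings & M Reviews" shape (the sample string below is invented):

```python
import re

# Assumed example of the text inside the reviews span
text = "1,234 Ratings & 567 Reviews"

# Grab every run of digits (with thousands separators), then strip the commas
counts = [int(n.replace(",", "")) for n in re.findall(r"[\d,]+", text)]
n_ratings, n_reviews = counts
print(n_ratings, n_reviews)  # 1234 567
```

Whichever method you use, the end result is two clean integer columns for the dataset.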
Finally we can create our dataset…
df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Ratings': ratings,
                   'Number of Ratings': nratings, 'Number of Reviews': nreviews})
df.to_csv("dataset.csv")
So with just these few simple steps, you can create a dataset of your own from scratch.
I did it for an e-commerce website, but this can be done on any website. Just look for the tags and extract them, and if there are problems extracting certain parts, extract the whole tag and do some text slicing here and there to get the desired data.
Check the entire code for fetching information from multiple pages by clicking here.
Thanks for reading !