How to scrape Amazon customer reviews:
In this article, we are going to see how to scrape Amazon customer reviews using Beautiful Soup in Python.
Modules needed and installation:
BeautifulSoup: Our primary module. It parses the HTML of a webpage so that data can be extracted from it; it does not fetch the page itself.
pip install bs4
lxml: Helper library to process HTML and XML in Python; Beautiful Soup can use it as a parser.
pip install lxml
requests: Makes the process of sending HTTP requests simple.
pip install requests
To begin with web scraping, we first have to do some setup. Import all the required modules. Then create a header for the request to Amazon that contains a browser User-Agent and, if needed, your request cookies; without it Amazon usually returns an error page instead of the product data. A user agent lookup website will show you the specific user agent of your browser.
Pass the URL to the getdata() function (a user-defined function), which sends a request to that URL and returns the response. We use the get method of requests to retrieve information from the server at the given URL.
Syntax:
requests.get(url, args)
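As a minimal sketch of such a request, the helper below adds a timeout and an explicit status check on top of the plain call. The name fetch_html, the shortened User-Agent value and the 10-second timeout are assumptions made for illustration, not part of the original program. The last lines build the request without sending it, so the outgoing headers can be inspected offline.

```python
import requests

# Assumed header values; replace the User-Agent with your own browser's string.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US, en;q=0.5",
}


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP error codes."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx replies
    return response.text


# Build the request without sending it, to check which headers would go out.
prepared = requests.Request("GET", "https://www.amazon.in/", headers=HEADERS).prepare()
print(prepared.headers["User-Agent"])
```

Calling fetch_html(url) with a product URL would then return the page source, or raise an exception if the server refuses the request.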
Next, parse the returned HTML with BeautifulSoup.
Syntax:
soup = BeautifulSoup(r.content, 'html.parser')
Parameters:
- r.content : It is the raw HTML content.
- html.parser : Specifying the HTML parser we want to use.
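To see what the parser returns without touching the network, here is a small sketch that parses a hand-written HTML string. The markup and the values in it are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for r.content
html = (
    "<html><head><title>Demo product</title></head>"
    "<body><p id='price'>499</p></body></html>"
)

# The second argument names the parser; 'html.parser' ships with Python.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.get_text()                  # text of the <title> tag
price_text = soup.find("p", id="price").get_text()  # first <p> with id="price"
print(title_text, price_text)
```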
Now filter the required data using the soup.find_all() function.
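The filtering step can be sketched on a small, made-up fragment as follows. The tag and class names here are hypothetical and only show how find_all() selects elements by tag name and class.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page
html = """
<div>
  <span class="name">Asha</span>
  <span class="date">12 May</span>
  <span class="name">Ravi</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because class is a Python keyword.
name_tags = soup.find_all("span", class_="name")
names = [tag.get_text() for tag in name_tags]
print(names)  # ['Asha', 'Ravi']
```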
Program:
# import modules
import requests
from bs4 import BeautifulSoup

# headers that make the request look like it comes from a browser
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/90.0.4430.212 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5'
}


# user-defined function
# scrape the data
def getdata(url):
    r = requests.get(url, headers=HEADERS)
    return r.text


def html_code(url):
    # pass the url into the getdata function
    htmldata = getdata(url)
    soup = BeautifulSoup(htmldata, 'html.parser')
    # return the parsed html code
    return soup


url = ("https://www.amazon.in/Columbia-Mens-wind-resistant-Glove/dp/B0772WVHPS/"
       "?_encoding=UTF8&pd_rd_w=d9RS9&pf_rd_p=3d2ae0df-d986-4d1d-8c95-aa25d2ade606"
       "&pf_rd_r=7MP3ZDYBBV88PYJ7KEMJ&pd_rd_r=550bec4d-5268-41d5-87cb-8af40554a01e"
       "&pd_rd_wg=oy8v8&ref_=pd_gw_cr_cartx&th=1")

soup = html_code(url)
print(soup)
Output:
Note: This is only the raw HTML code of the page.
Now that the core setup is done, let us see how to scrape specific pieces of data.
Scrape Customer Name:
Now find the customer names, which are in span tags with class a-profile-name. You can open the webpage in the browser, right-click the relevant element and inspect it, as shown in the figure.
You have to pass the tag name and attribute with its corresponding value to the find_all() function.
Code:
def cus_data(soup):
    # find every span tag with class "a-profile-name"
    # using find_all() and collect its text as a string
    cus_list = []
    for item in soup.find_all("span", class_="a-profile-name"):
        cus_list.append(item.get_text())
    return cus_list


cus_res = cus_data(soup)
print(cus_res)
Output:
['Amaze', 'Robert', 'D. Kong', 'Alexey', 'Charl', 'RBostillo']
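The same lookup can also be written with a CSS selector. The sketch below runs on a hypothetical fragment, strips stray whitespace, and drops repeated names in case the same name shows up more than once in the page. Whether a real page contains such repeats is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; a real page is far larger.
html = """
<span class="a-profile-name"> Amaze </span>
<span class="a-profile-name">Robert</span>
<span class="a-profile-name">Amaze</span>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector equivalent of find_all("span", class_="a-profile-name")
raw_names = [tag.get_text(strip=True) for tag in soup.select("span.a-profile-name")]

# dict.fromkeys keeps the first occurrence of each name, in order
unique_names = list(dict.fromkeys(raw_names))
print(unique_names)  # ['Amaze', 'Robert']
```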
Scrape User Review:
Now find the customer reviews using the same method as above. Find a unique class name on a specific tag; here we use the div tag.
Code:
def cus_rev(soup):
    # find the review div tags with find_all()
    # and convert their text into one string
    data_str = ""
    for item in soup.find_all(
            "div",
            class_="a-expander-content reviewText review-text-content "
                   "a-expander-partial-collapse-content"):
        data_str = data_str + item.get_text()
    # split the text into lines
    result = data_str.split("\n")
    return result


rev_data = cus_rev(soup)

# drop the empty lines
rev_result = []
for i in rev_data:
    if i != "":
        rev_result.append(i)
print(rev_result)
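One detail worth knowing when a tag carries several classes: passing a multi-class string to class_ matches only the exact attribute value, so it breaks if the classes appear in a different order. The sketch below, on a hypothetical fragment, contrasts that with a CSS selector, which matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the same classes, written in a different order
html = """
<div class="reviewText a-expander-content review-text-content">
  <span>Warm and comfortable.</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match: finds nothing because the class order differs.
exact = soup.find_all(
    "div", class_="a-expander-content reviewText review-text-content")

# CSS selector: matches any div that carries both classes, in any order.
flexible = soup.select("div.reviewText.review-text-content")

reviews = [div.get_text(strip=True) for div in flexible]
print(len(exact), reviews)
```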
Scraping Product Information:
Here we will scrape product information such as the product name, ASIN number, weight and dimensions. To do this, we use the ul tag with its specific, unique class name.
Code:
def product_info(soup):
    # find the product detail list (ul tag) with find_all()
    # and convert its text into a list of lines
    pro_info = []
    for item in soup.find_all(
            "ul",
            class_="a-unordered-list a-nostyle a-vertical "
                   "a-spacing-none detail-bullet-list"):
        pro_info.append(item.get_text().split("\n"))
    return pro_info


pro_result = product_info(soup)

# filter the required data: print only the non-empty lines
for item in pro_result:
    for j in item:
        if j != "":
            print(j)
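If the details are needed as label and value pairs rather than printed lines, one possible approach is to split each bullet at its first colon. The markup below is a hypothetical stand-in for the detail list, and the weight value is invented; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on a "label : value" bullet list
html = (
    '<ul class="detail-bullet-list">'
    '<li><span><span class="a-text-bold">ASIN : </span>'
    '<span>B0772WVHPS</span></span></li>'
    '<li><span><span class="a-text-bold">Item Weight : </span>'
    '<span>150 g</span></span></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

details = {}
for li in soup.select("ul.detail-bullet-list li"):
    # Join the text pieces with a space, then split at the first colon.
    text = li.get_text(" ", strip=True)
    label, sep, value = text.partition(":")
    if sep:  # skip bullets that contain no colon
        details[label.strip()] = value.strip()
print(details)
```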
Scraping Review Image:
Here we will extract the image links from the product reviews using the same method as above. The tag name and the attribute of the tag are passed to find_all() (findAll() is an older alias of the same method), as above.
Code:
def rev_img(soup):
    # find the review thumbnail img tags with find_all()
    # and collect their src links
    images = []
    for img in soup.find_all('img', class_="cr-lightbox-image-thumbnail"):
        images.append(img.get('src'))
    return images


img_result = rev_img(soup)
print(img_result)
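Since get('src') returns None for a tag without that attribute, a slightly more defensive variant can skip such tags. This is a sketch on made-up markup with placeholder example.com links.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the second tag has no src attribute.
html = """
<img class="cr-lightbox-image-thumbnail" src="https://example.com/a.jpg">
<img class="cr-lightbox-image-thumbnail">
<img class="other" src="https://example.com/b.jpg">
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for img in soup.find_all("img", class_="cr-lightbox-image-thumbnail"):
    src = img.get("src")  # None when the attribute is missing
    if src:
        links.append(src)
print(links)  # ['https://example.com/a.jpg']
```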
Saving details into CSV file:
Here we will save the details into a CSV file. We will convert the data into a Pandas DataFrame and then export it using the to_csv() function, which saves a DataFrame as a CSV file.
Syntax : to_csv(parameters)
Parameters :
- path_or_buf : File path or object, if None is provided the result is returned as a string.
Code:
import pandas as pd

# initialise data of lists
# (both lists must have the same length)
data = {'Name': cus_res, 'review': rev_result}

# create the DataFrame
df = pd.DataFrame(data)

# save the output
df.to_csv('amazon_review.csv')
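In practice the list of names and the list of reviews may not have the same length, in which case building the DataFrame from plain lists raises a ValueError. One possible workaround, sketched here with made-up sample data, is to wrap each list in a Series so that the shorter column is padded with missing values. The example also leaves out the file path, so to_csv() returns the CSV text as a string.

```python
import pandas as pd

# Made-up sample data of unequal length
names = ["Amaze", "Robert", "D. Kong"]
reviews = ["Good gloves", "Warm"]

# Wrapping each list in a Series lets pandas pad the shorter column with NaN.
df = pd.DataFrame({"Name": pd.Series(names), "review": pd.Series(reviews)})

# With no path given, to_csv() returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file name such as 'amazon_review.csv' as the first argument writes the same content to disk instead.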