
News scraping - La Discusión newspaper
Last script test - 03/03/2022
This script is a web scraper: it extracts content from https://www.ladiscusion.cl, a newspaper from my city. The scraping is divided into two parts:
Before starting, we checked robots.txt and verified the permissions granted by the newspaper.
Once ready, we continue.
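The robots.txt check can be sketched with the standard library's urllib.robotparser. The rules below are a hypothetical sample for illustration only; the newspaper's real file lives at https://www.ladiscusion.cl/robots.txt and should be fetched and checked directly:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; the real permissions come from
# https://www.ladiscusion.cl/robots.txt and may differ.
sample_rules = [
    "User-agent: *",
    "Disallow: /wp-admin/",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(sample_rules)

# True for paths the rules allow, False for disallowed ones.
allowed = rp.can_fetch("*", "https://www.ladiscusion.cl/ciudad/")
blocked = rp.can_fetch("*", "https://www.ladiscusion.cl/wp-admin/")
print(allowed, blocked)
```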
First step - Extracting the news links by category from the navigation bar and building dataframes from them.

The script creates two folders, data and data_content. The categories with the links will be stored in a dataframe in the data folder, and the content will be stored in the data_content folder.
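The folder setup can be sketched in a few lines; `exist_ok=True` keeps reruns from raising an error when the folders already exist:

```python
import os

# Link tables go to "data", article contents to "data_content".
for folder in ("data", "data_content"):
    os.makedirs(folder, exist_ok=True)
```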
To start using the script, choose the first option:
- Once command [1] is entered, the following menu appears:
If 1 is entered here, the script starts scraping all the links from the navigation bar.
Option 2, however, lets us scrape only the categories we want. For demonstration purposes, I will show pictures of option 2.
- In this case, I entered that I only want categories 0, 3, and 4, then ran the code, and it started generating the tables.
Once the first process is finished:
- We will have the data folder with the dataframes.
- Here's a glimpse of the first dataframe.
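The nav-bar extraction can be sketched with the standard library's html.parser. The markup and category paths below are made up for illustration; the newspaper's real templates will differ:

```python
from html.parser import HTMLParser

class NavLinkParser(HTMLParser):
    """Collects href attributes of <a> tags found inside a <nav> element."""

    def __init__(self):
        super().__init__()
        self.in_nav = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.in_nav = True
        elif tag == "a" and self.in_nav:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "nav":
            self.in_nav = False

# Hypothetical navigation markup; real category names come from the site.
sample = """
<nav>
  <a href="/ciudad/">Ciudad</a>
  <a href="/deportes/">Deportes</a>
</nav>
"""

parser = NavLinkParser()
parser.feed(sample)
print(parser.links)
```

Each collected link would then be written into a per-category dataframe in the data folder.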
Second step - Creating dataframes with Title, Time, Content, and Subtitles
- With the first step, we will already have all the news links from the page that were available at the time of script execution.
- The next step is simple.
- In the script menu, option [2] is available.
- This option displays another menu, which works with all the links found in the files in the data folder.
- Option [1] is used to do it with all the files.
- Option [2] is used to scrape specific files.
- I will use option two as a test.
- We can see that some links cannot be opened, but the script continues.
- Once finished, we will have the dataframes and content ready for manipulation in our data_content folder.
- Let's look at the first one as an example of a dataframe, along with its df.info() output.
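The per-article step can be sketched in two pieces: a guarded GET that returns None when a link cannot be opened (so the loop keeps going, as noted above), and an extraction pass that pulls the four columns. The markup and regex patterns below are hypothetical; the real selectors on ladiscusion.cl will differ:

```python
import re
import urllib.error
import urllib.request

def fetch(url, timeout=10):
    """Return the page HTML, or None when the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, ValueError, TimeoutError) as exc:
        print(f"Could not open {url}: {exc}")
        return None

def extract_fields(html):
    """Pull Title / Time / Subtitles / Content out of one article page."""
    return {
        "title": re.search(r"<h1[^>]*>(.*?)</h1>", html, re.S).group(1),
        "time": re.search(r"<time[^>]*>(.*?)</time>", html, re.S).group(1),
        "subtitles": re.findall(r"<h2[^>]*>(.*?)</h2>", html, re.S),
        "content": " ".join(re.findall(r"<p>(.*?)</p>", html, re.S)),
    }

# Hypothetical article markup for illustration only.
article_html = (
    '<h1 class="title">Example headline</h1>'
    '<time datetime="2022-03-03T10:00">03/03/2022</time>'
    '<h2 class="subtitle">Example subtitle</h2>'
    '<div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>'
)
row = extract_fields(article_html)
print(row["title"])
```

Each row would then be appended to the corresponding dataframe in the data_content folder.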
Scraping duration
- First part: 40 minutes.
- Second part: 10 hours.
Aspects to optimize
- Threads can be used to reduce the scraping time by performing parallel scraping.
- Links whose GET requests failed can be captured for independent re-scraping.
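Both ideas can be sketched together with the standard library's concurrent.futures: a thread pool overlaps the network waits that dominate the 10-hour second pass, and a wrapper that never raises collects the failed URLs for a later retry pass. The `scrape` function here is a hypothetical stand-in for the real per-link scraper:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape(url):
    """Hypothetical stand-in for the real per-link scraper."""
    if "bad" in url:              # simulate a GET request that fails
        raise ValueError("request failed")
    return f"scraped {url}"

def safe_scrape(url):
    """Never raises: returns (url, data, error) so the pool keeps going."""
    try:
        return url, scrape(url), None
    except Exception as exc:
        return url, None, exc

urls = [
    "https://www.ladiscusion.cl/ok-1/",
    "https://www.ladiscusion.cl/bad-2/",
    "https://www.ladiscusion.cl/ok-3/",
]

# Threads overlap network-bound work well; CPU work stays light here.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(safe_scrape, urls))

failed = [url for url, data, err in results if err is not None]
print(failed)  # these could be written to disk and retried independently
```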