So I decided to write myself some "practice" code in the form of a small web scraping toolkit. I've been trying to secure freelance work through upwork.com.
I have not been successful at landing a contract in over a month, and I have applied to quite a few. It appears that my lack of Upwork reputation is likely keeping me from even getting up to bat...
My choice of solution is to just do the easier scrape jobs that are clearly defined on Upwork and send a proposal with a sample of the output requested.
Without much looking I found a job that asked for 95 index pages with 100 detail page links on each one: a nearly 10,000-unit data pull.
Wrote a startup script in python to do the following:

- create the data directory ./detail-page-data/
- create the executable script files ./index-page-wget.sh and ./detail-page-wget.sh
- populate ./index-page-wget.sh with `wget` command line calls to download the index HTML pages for local, offline processing
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# start.py - christopher.morton.leftsideways.42
# run this script first to create needed directories and script files.
import os
import stat

startscriptheader = ['#!/usr/bin/env bash', '# -*- coding: utf-8 -*-', '']
searchurl = 'https://website.com/directory/endpoint?page='

# create detail data directory: ./detail-page-data/
if not os.path.isdir('./detail-page-data/'):
    os.mkdir('./detail-page-data/')

# create executable script file; do not overwrite: ./index-page-wget.sh
if not os.path.exists('./index-page-wget.sh'):
    tmp = './index-page-wget.sh'
    os.mknod(tmp)
    st = os.stat(tmp)
    os.chmod(tmp, st.st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

# create executable script file; do not overwrite: ./detail-page-wget.sh
if not os.path.exists('./detail-page-wget.sh'):
    tmp = './detail-page-wget.sh'
    os.mknod(tmp)
    st = os.stat(tmp)
    os.chmod(tmp, st.st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

# populate ./index-page-wget.sh with wget instructions to run
with open('./index-page-wget.sh', 'w') as startscriptfile:
    for line in startscriptheader:
        startscriptfile.write(line + '\n')
    for x in range(1, 96):
        outstr = f"wget '{searchurl}{x}' -O './index-page-{str(x).zfill(2)}.html'"
        startscriptfile.write(outstr + '\n')
Running this data pull took less than one minute, and the 95 index files were saved in the same directory as the script. Below are the head and tail of the resulting bash script:
#!/usr/bin/env bash
# -*- coding: utf-8 -*-
wget 'https://website.com/directory/endpoint?page=1' -O './index-page-01.html'
wget 'https://website.com/directory/endpoint?page=2' -O './index-page-02.html'
wget 'https://website.com/directory/endpoint?page=3' -O './index-page-03.html'
...
wget 'https://website.com/directory/endpoint?page=93' -O './index-page-93.html'
wget 'https://website.com/directory/endpoint?page=94' -O './index-page-94.html'
wget 'https://website.com/directory/endpoint?page=95' -O './index-page-95.html'
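Before moving on, a quick sanity check that all 95 index pages actually landed and that none of them came back empty is cheap insurance. A minimal sketch (the script name is just a placeholder, not part of the job):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# check-index-pages.py - quick sanity check, not part of the pipeline above
import glob
import os

indexfiles = sorted(glob.glob('./index-page-*.html'))
emptyfiles = [f for f in indexfiles if os.path.getsize(f) == 0]

print(f'index files found: {len(indexfiles)}')   # expecting 95
print(f'empty index files: {len(emptyfiles)}')   # expecting 0; re-run those wget lines if not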
Coded up another python script to process the scraped index files into another `wget`
bash script.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# index-page-process.py - christopher.morton.leftsideways.42
# run after start.py and index-page-wget.sh
import os
from lxml.html import parse

directory = './'
datadirectory = './detail-page-data'
outputscriptheader = ['#!/usr/bin/env bash', '# -*- coding: utf-8 -*-', '']
wgetprefix = 'https://website.com/detail-directory'
filecount = 0
linecount = 0

# populate ./detail-page-wget.sh with wget instructions
allfiles = os.listdir(directory)
with open('./detail-page-wget.sh', 'w') as outputscript:
    for line in outputscriptheader:
        outputscript.write(line + '\n')
    for filename in allfiles:
        if filename.endswith('.html'):
            thefile = os.path.join(directory, filename)
            doc = parse(thefile).getroot()
            for item in doc.cssselect('div.directory-item a'):
                link = item.get('href')
                outstr = f"wget '{wgetprefix}{link}' -O '{datadirectory}{link}.html'"
                outputscript.write(outstr + '\n')
                linecount = linecount + 1
            filecount = filecount + 1

print(f'html scrape files processed: {filecount}')
print(f'html files in output script: {linecount}')
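Given the 95 index pages and the 9413 detail links mentioned below, the two closing print statements should come out along these lines:

html scrape files processed: 95
html files in output script: 9413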
Running this data pull took nearly two hours, and the 9413 detail HTML files were saved for local, offline processing. Below are the head and tail of the resulting bash script:
#!/usr/bin/env bash
# -*- coding: utf-8 -*-
wget 'https://website.com/detail-directory/ann-taylor-pittman' -O './detail-page-data/ann-taylor-pittman.html'
wget 'https://website.com/detail-directory/christine-pittman-1' -O './detail-page-data/christine-pittman-1.html'
wget 'https://website.com/detail-directory/melodytravels' -O './detail-page-data/melodytravels.html'
...
wget 'https://website.com/detail-directory/paulina-piwowarek' -O './detail-page-data/paulina-piwowarek.html'
wget 'https://website.com/detail-directory/laura-plant' -O './detail-page-data/laura-plant.html'
wget 'https://website.com/detail-directory/ellie-plass' -O './detail-page-data/ellie-plass.html'
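A pull this long is painful to restart from scratch, so one option (a sketch, not something the original scripts do) is a small helper that copies detail-page-wget.sh into a resume script containing only the wget lines whose -O target is not on disk yet; the helper name and output filename here are made up:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# resume-detail-wget.py - hypothetical helper, not part of the original pipeline
# keep only the wget lines whose -O target file does not exist yet
import os

with open('./detail-page-wget.sh') as infile, \
     open('./detail-page-wget-resume.sh', 'w') as outfile:
    for line in infile:
        if line.startswith('wget'):
            # the -O target is the last single-quoted token on the line
            target = line.rsplit("-O '", 1)[1].rstrip().rstrip("'")
            if os.path.exists(target):
                continue
        outfile.write(line)

Running the result with bash ./detail-page-wget-resume.sh picks up wherever the first pass stopped.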
One more python file processes the detail pages into a CSV as requested. There were a few things in the data that had to be coded around, most notably names containing more than one comma:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# detail-page-process.py - christopher.morton.leftsideways.42
# run after start.py, index-page-wget.sh, index-page-process.py
# and detail-page-wget.sh
import os
import csv
from lxml.html import parse

directory = './'
detaildirectory = './detail-page-data'
baseurl = 'https://website.com/detail-directory'
csvheader = ['detail_url', 'lastname', 'firstname', 'media_outlet1', 'media_outlet2', 'media_outlet3', 'media_outlet4', 'media_outlet5', 'media_outlet6', 'media_outlet7']

allfiles = os.listdir(directory)
allfiles.sort()
with open('detail-page-data.csv', 'w', newline='') as csvfile:
    csvout = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL)
    csvout.writerow(csvheader)
    for filename in allfiles:
        if filename.endswith('.html'):
            thefile = os.path.join(directory, filename)
            doc1 = parse(thefile).getroot()
            for item1 in doc1.cssselect('div.directory-item a'):
                link = item1.get('href')
                # first and last names with multiple commas
                names = item1.text_content().split(',')
                if len(names) == 2:
                    firstname = names[1].strip()
                    lastname = names[0].strip()
                else:
                    i = len(names) - 1
                    firstname = names[i].strip()
                    lastname = ''.join(names[0:i])
                detailfilename = f'{detaildirectory}{link}.html'
                csvline = [f'{baseurl}{link.strip()}', lastname.strip(), firstname.strip()]
                doc2 = parse(detailfilename).getroot()
                for item2 in doc2.cssselect('div.details-item div a'):
                    # there can be up to seven of these for each detail page
                    csvline = csvline + [item2.text_content()]
                csvout.writerow(csvline)
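The comma handling above is the fiddly part, so here is that branch pulled out into a standalone function with a couple of made-up names (not from the data set) to show what it does:

# sketch of the name-splitting branch from detail-page-process.py
def split_name(text):
    names = text.split(',')
    if len(names) == 2:
        return names[0].strip(), names[1].strip()     # (lastname, firstname)
    i = len(names) - 1
    return ''.join(names[0:i]).strip(), names[i].strip()

print(split_name('Doe, Jane'))        # ('Doe', 'Jane')
print(split_name('Doe, Jr., John'))   # ('Doe Jr.', 'John')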
Running this script took a bit of time, but not as much as I was anticipating. Below is a sample of the 9413 lines of data that this produced.
details_url,lastname,firstname,media_outlet1,media_outlet_url1,media_outlet2,media_outlet_url2,media_outlet3,media_outlet_url3,media_outlet4,media_outlet_url4,media_outlet5,media_outlet_url5,media_outlet6,media_outlet_url6,media_outlet7,media_outlet_url7
https://website.com/detail-directory/alexis-rubinstein-barriere,(Rubinstein) Barriere,Alexis,INTL FCStone
https://website.com/detail-directory/alleigh-a,A.,Alleigh,A Glass After Work
https://website.com/detail-directory/fredacourt,A'Court,Fred,Freelance
https://website.com/detail-directory/steve-osmith,O'Smith,Steve,B&I Catering,EDUcatering,H2O Publishing,OOH magazine,Pub & Bar,Sports & Leisure Catering,TUCO Magazine
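Since each row only carries as many media outlet columns as that person actually has, the rows are ragged; the stdlib csv reader copes with that fine if anything downstream needs to read the file back. A minimal sketch:

# sketch: read detail-page-data.csv back in; rows vary in length
import csv

with open('detail-page-data.csv', newline='') as csvfile:
    rows = list(csv.reader(csvfile))

header, data = rows[0], rows[1:]
print(f'data rows: {len(data)}')
print(f'most outlets on a single row: {max(len(r) for r in data) - 3}')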
Done deal, and I now have a better grasp on doing this with python; time to find a few more data sets to grab.
I let the client know that I have their data, and then some. Let's see if doing the work and asking for the money helps.