So I decided to write myself some "practice" code in the form of a small web scraping toolkit. I've been trying to secure freelance work through upwork.com.
I have not been successful at landing a contract in over a month, and I have applied to quite a few. It appears that my lack of Upwork reputation is likely keeping me from even getting up to bat...
My choice of solution is to just do the easier scrape jobs that are clearly defined on Upwork and send a proposal with a sample of the output requested.
Without much looking I found a job that asked for 95 index pages with 100 detail page links on each one: a nearly 10,000-unit data pull.
Wrote a startup script in python to do the following:

- create the data directory ./detail-page-data/
- create the executable script files ./index-page-wget.sh and ./detail-page-wget.sh
- populate ./index-page-wget.sh with `wget` command line calls to download the index HTML pages for local, offline processing
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# start.py - christopher.morton.leftsideways.42
# run this script first to create needed directories and script files.
import os
import stat

startscriptheader = ['#!/usr/bin/env bash', '# -*- coding: utf-8 -*-', '']
searchurl = 'https://website.com/directory/endpoint?page='

# create detail data directory: ./detail-page-data/
if not os.path.isdir('./detail-page-data/'):
    os.mkdir('./detail-page-data/')

# create executable script file; do not overwrite: ./index-page-wget.sh
if not os.path.exists('./index-page-wget.sh'):
    tmp = './index-page-wget.sh'
    os.mknod(tmp)
    st = os.stat(tmp)
    os.chmod(tmp, st.st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

# create executable script file; do not overwrite: ./detail-page-wget.sh
if not os.path.exists('./detail-page-wget.sh'):
    tmp = './detail-page-wget.sh'
    os.mknod(tmp)
    st = os.stat(tmp)
    os.chmod(tmp, st.st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

# populate ./index-page-wget.sh with wget instructions to run
with open('./index-page-wget.sh', 'w') as startscriptfile:
    for line in startscriptheader:
        startscriptfile.write(line + '\n')
    for x in range(1, 96):
        outstr = f"wget '{searchurl}{x}' -O './index-page-{str(x).zfill(2)}.html'"
        startscriptfile.write(outstr + '\n')
Running this data pull took less than one minute, and the 95 index files were saved in the same directory as the script. Below are the head and tail of the resulting bash script:
#!/usr/bin/env bash
# -*- coding: utf-8 -*-
wget 'https://website.com/directory/endpoint?page=1' -O './index-page-01.html'
wget 'https://website.com/directory/endpoint?page=2' -O './index-page-02.html'
wget 'https://website.com/directory/endpoint?page=3' -O './index-page-03.html'
...
wget 'https://website.com/directory/endpoint?page=93' -O './index-page-93.html'
wget 'https://website.com/directory/endpoint?page=94' -O './index-page-94.html'
wget 'https://website.com/directory/endpoint?page=95' -O './index-page-95.html'
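Before moving on, a quick sanity check that all 95 index pages actually landed and that none of them came back empty is cheap insurance. A minimal sketch (the script name is just a placeholder, not part of the job):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# check-index-pages.py - quick sanity check, not part of the pipeline above
import glob
import os

indexfiles = sorted(glob.glob('./index-page-*.html'))
emptyfiles = [f for f in indexfiles if os.path.getsize(f) == 0]

print(f'index files found: {len(indexfiles)}')   # expecting 95
print(f'empty index files: {len(emptyfiles)}')   # expecting 0; re-run those wget lines if not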
Coded up another python script to process the scraped index files into another `wget`
bash script.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# index-page-process.py - christopher.morton.leftsideways.42
# run after start.py and index-page-wget.sh
import os
from lxml.html import parse

directory = './'
datadirectory = './detail-page-data'
outputscriptheader = ['#!/usr/bin/env bash', '# -*- coding: utf-8 -*-', '']
wgetprefix = 'https://website.com/detail-directory'
filecount = 0
linecount = 0

# populate ./detail-page-wget.sh with wget instructions
allfiles = os.listdir(directory)
with open('./detail-page-wget.sh', 'w') as outputscript:
    for line in outputscriptheader:
        outputscript.write(line + '\n')
    for filename in allfiles:
        if filename.endswith('.html'):
            thefile = os.path.join(directory, filename)
            doc = parse(thefile).getroot()
            for item in doc.cssselect('div.directory-item a'):
                link = item.get('href')
                outstr = f"wget '{wgetprefix}{link}' -O '{datadirectory}{link}.html'"
                outputscript.write(outstr + '\n')
                linecount = linecount + 1
            filecount = filecount + 1

print(f'html scrape files processed: {filecount}')
print(f'html files in output script: {linecount}')
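Given the 95 index pages and the 9413 detail links mentioned below, the two closing print statements should come out along these lines:

html scrape files processed: 95
html files in output script: 9413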
Running this data pull took nearly two hours, and the 9413 detail HTML files were saved for local, offline processing. Below are the head and tail of the resulting bash script:
#!/usr/bin/env bash
# -*- coding: utf-8 -*-
wget 'https://website.com/detail-directory/ann-taylor-pittman' -O './detail-page-data/ann-taylor-pittman.html'
wget 'https://website.com/detail-directory/christine-pittman-1' -O './detail-page-data/christine-pittman-1.html'
wget 'https://website.com/detail-directory/melodytravels' -O './detail-page-data/melodytravels.html'
...
wget 'https://website.com/detail-directory/paulina-piwowarek' -O './detail-page-data/paulina-piwowarek.html'
wget 'https://website.com/detail-directory/laura-plant' -O './detail-page-data/laura-plant.html'
wget 'https://website.com/detail-directory/ellie-plass' -O './detail-page-data/ellie-plass.html'
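A pull this long is painful to restart from scratch, so one option (a sketch, not something the original scripts do) is a small helper that copies detail-page-wget.sh into a resume script containing only the wget lines whose -O target is not on disk yet; the helper name and output filename here are made up:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# resume-detail-wget.py - hypothetical helper, not part of the original pipeline
# keep only the wget lines whose -O target file does not exist yet
import os

with open('./detail-page-wget.sh') as infile, \
     open('./detail-page-wget-resume.sh', 'w') as outfile:
    for line in infile:
        if line.startswith('wget'):
            # the -O target is the last single-quoted token on the line
            target = line.rsplit("-O '", 1)[1].rstrip().rstrip("'")
            if os.path.exists(target):
                continue
        outfile.write(line)

Running the result with bash ./detail-page-wget-resume.sh picks up wherever the first pass stopped.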
One more python file processes the detail pages into a CSV as requested. There were a few things in the data that had to be coded around, most notably names containing more than one comma:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# detail-page-process.py - christopher.morton.leftsideways.42
# run after start.py, index-page-wget.sh, index-page-process.py
# and detail-page-wget.sh
import os
import csv
from lxml.html import parse

directory = './'
detaildirectory = './detail-page-data'
baseurl = 'https://website.com/detail-directory'
csvheader = ['detail_url', 'lastname', 'firstname', 'media_outlet1', 'media_outlet2', 'media_outlet3', 'media_outlet4', 'media_outlet5', 'media_outlet6', 'media_outlet7']

allfiles = os.listdir(directory)
allfiles.sort()
with open('detail-page-data.csv', 'w', newline='') as csvfile:
    csvout = csv.writer(csvfile, quoting=csv.QUOTE_MINIMAL)
    csvout.writerow(csvheader)
    for filename in allfiles:
        if filename.endswith('.html'):
            thefile = os.path.join(directory, filename)
            doc1 = parse(thefile).getroot()
            for item1 in doc1.cssselect('div.directory-item a'):
                link = item1.get('href')
                # first and last names with multiple commas
                names = item1.text_content().split(',')
                if len(names) == 2:
                    firstname = names[1].strip()
                    lastname = names[0].strip()
                else:
                    i = len(names) - 1
                    firstname = names[i].strip()
                    lastname = ''.join(names[0:i])
                detailfilename = f'{detaildirectory}{link}.html'
                csvline = [f'{baseurl}{link.strip()}', lastname.strip(), firstname.strip()]
                doc2 = parse(detailfilename).getroot()
                for item2 in doc2.cssselect('div.details-item div a'):
                    # there can be up to seven of these for each detail page
                    csvline = csvline + [item2.text_content()]
                csvout.writerow(csvline)
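The comma handling above is the fiddly part, so here is that branch pulled out into a standalone function with a couple of made-up names (not from the data set) to show what it does:

# sketch of the name-splitting branch from detail-page-process.py
def split_name(text):
    names = text.split(',')
    if len(names) == 2:
        return names[0].strip(), names[1].strip()     # (lastname, firstname)
    i = len(names) - 1
    return ''.join(names[0:i]).strip(), names[i].strip()

print(split_name('Doe, Jane'))        # ('Doe', 'Jane')
print(split_name('Doe, Jr., John'))   # ('Doe Jr.', 'John')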
Running this script took a bit of time, but not as much as I was anticipating. Below is a sample of the 9413 lines of data that this produced.
details_url,lastname,firstname,media_outlet1,media_outlet_url1,media_outlet2,media_outlet_url2,media_outlet3,media_outlet_url3,media_outlet4,media_outlet_url4,media_outlet5,media_outlet_url5,media_outlet6,media_outlet_url6,media_outlet7,media_outlet_url7
https://website.com/detail-directory/alexis-rubinstein-barriere,(Rubinstein) Barriere,Alexis,INTL FCStone
https://website.com/detail-directory/alleigh-a,A.,Alleigh,A Glass After Work
https://website.com/detail-directory/fredacourt,A'Court,Fred,Freelance
https://website.com/detail-directory/steve-osmith,O'Smith,Steve,B&I Catering,EDUcatering,H2O Publishing,OOH magazine,Pub & Bar,Sports & Leisure Catering,TUCO Magazine
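Since each row only carries as many media outlet columns as that person actually has, the rows are ragged; the stdlib csv reader copes with that fine if anything downstream needs to read the file back. A minimal sketch:

# sketch: read detail-page-data.csv back in; rows vary in length
import csv

with open('detail-page-data.csv', newline='') as csvfile:
    rows = list(csv.reader(csvfile))

header, data = rows[0], rows[1:]
print(f'data rows: {len(data)}')
print(f'most outlets on a single row: {max(len(r) for r in data) - 3}')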
Done deal, and I now have a better grasp on doing this with python; time to find a few more data sets to grab.
I let the client know that I have their data, and then some. Let's see if doing the work and asking for the money helps.