Web Scraping 102
Introduction
Welcome to Web Scraping 102! In Web Scraping 101 we conducted our initial Recon and Research for scraping CivitAI. This article covers retrieving and saving the data from the images endpoint we discovered earlier.
If you're sitting comfortably, let's begin.
Code Recap
Currently our code looks something like this:
from curl_cffi import requests
import json
from urllib.parse import quote

url = "https://civitai.com/api/trpc/image.getInfinite"
payload = {
    "json": {
        "period": "Week",
        "sort": "Most Reactions",
        "types": ["image"],
        "browsingLevel": 1,
        "include": ["cosmetics"],
        "cursor": None,
    },
    "meta": {"values": {"cursor": ["undefined"]}},
}
query = f'input={quote(json.dumps(payload, separators=(",", ":")))}'
r = requests.get(f"{url}?{query}", impersonate="chrome")
j = r.json()
print(j)
We have defined the endpoint and payload, encoded the payload to match the original request, made the request, and printed the retrieved data.
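To see what the encoding step produces, here is a minimal offline sketch. It uses only the standard library and makes no request; the shortened payload is a stand-in for the real one, for illustration only.

```python
import json
from urllib.parse import quote, unquote

# Shortened stand-in for the payload above (illustration only).
payload = {
    "json": {"period": "Week", "cursor": None},
    "meta": {"values": {"cursor": ["undefined"]}},
}

# separators=(",", ":") drops the spaces json.dumps adds by default,
# so the JSON is as compact as possible.
compact = json.dumps(payload, separators=(",", ":"))

# quote() percent-encodes characters such as { } " and : so the JSON
# can be carried inside a query string.
query = f"input={quote(compact)}"

# Decoding the query value gives back the original payload unchanged.
decoded = json.loads(unquote(query[len("input="):]))

print(compact)
print(query)
```

Note that Python's None becomes null in the JSON, which is what the cursor field carries on the first request.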
Our goal now is to retrieve more data and save it for later processing.
Stage 2: Retrieval
We are going to use jsonlines, also known as JSONL, to efficiently save the data as it's retrieved.
If you haven't already, install jsonlines now using pip.
pip install jsonlines
jsonlines provides an easy-to-use interface that automatically handles encoding. Let's set that up:
import jsonlines
import pathlib
base_path = "/your/base/path"
BASE = pathlib.Path(base_path)
IMAGES = BASE / "images.jsonl"
writer = jsonlines.open(IMAGES, mode="a")
We're using pathlib for its helpful functionality, which will be very useful in later stages when we download the images.
We use mode="a" so that each run of the script appends its results to the existing file instead of overwriting it.
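For intuition, JSON Lines is simply one JSON document per line. The sketch below reproduces that idea with only the standard library, writing to a temporary file. The records and file location are made up for illustration, and this is not how the jsonlines package is implemented, only the same format.

```python
import json
import pathlib
import tempfile

# Hypothetical records and a throwaway location, for illustration only.
records = [{"id": 1, "name": "a.png"}, {"id": 2, "name": "b.png"}]
tmp_dir = pathlib.Path(tempfile.mkdtemp())
path = tmp_dir / "images.jsonl"

# The file is reopened in append mode for every record, so each write
# adds a line to the file instead of replacing what is already there.
for record in records:
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Reading back: each non-empty line is parsed on its own.
with path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```

Because every line is independent, a run that stops halfway still leaves a file whose earlier lines can be read.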
Let's begin by saving the data from that first request.
{'result': {'data': {'json': {'nextCursor': '798|170|525|24907786',
'items': [{'id': 24294279,
...
The data we are looking for is result.data.json.items, which is a list of the images.
>>> j['result']['data']['json']['items'][0]
{'id': 24294279,
'name': '00018-2785547559.png',
'url': 'bc6700f4-7fd3-4fc8-84ce-dcb154161850',
'nsfwLevel': 1,
'width': 1080,
'height': 1680,
'hash': 'U7BMuu0000~V%N%M4.OY00_4^*R457E1_4IT',
'hideMeta': False,
'hasMeta': True,
'onSite': False,
'generationProcess': 'img2img',
'createdAt': '2024-08-14T17:20:13.567Z',
'sortAt': '2024-08-14T17:20:49.238Z',
'mimeType': 'image/png',
'type': 'image',
'metadata': {'hash': 'U7BMuu0000~V%N%M4.OY00_4^*R457E1_4IT',
'size': 1928858,
'width': 1080,
'height': 1680},
'ingestion': 'Scanned',
'scannedAt': '2024-08-14T17:20:22.351Z',
'needsReview': None,
'postId': 5424571,
'postTitle': None,
'index': 1,
'publishedAt': '2024-08-14T17:20:49.238Z',
'modelVersionId': None,
'availability': 'Public',
'user': {'id': 1409647,
'username': 'Tommu',
'image': 'e162fba0-61d0-4884-8da6-a2ad760e2b4f',
'deletedAt': None,
'cosmetics': [{'data': None,
'cosmetic': {'id': 106,
'data': {'url': 'e34e2479-8a48-4b7b-8e62-31a70fe1490c'},
'type': 'Badge',
'source': 'Trophy',
'name': 'Bronze Generator Badge'}}],
'profilePicture': None},
'stats': {'cryCountAllTime': 154,
'laughCountAllTime': 312,
'likeCountAllTime': 3302,
'dislikeCountAllTime': 0,
'heartCountAllTime': 1277,
'commentCountAllTime': 0,
'collectedCountAllTime': 165,
'tippedAmountCountAllTime': 360,
'viewCountAllTime': 0},
'reactions': [],
'tags': None,
'tagIds': [4,
5262,
3642,
3628,
122902,
161864,
162034,
5148,
16001,
120104,
3629,
111850,
2687,
114877,
6997,
161945,
115513,
111656,
5453,
1853,
27182,
140255,
112994,
162490,
531,
118075,
153354,
119779,
162489,
5773,
7084,
4213,
161952,
5784,
163963,
120504],
'cosmetic': None}
You may notice the url field is just a UUID; the full URL must be built from this and other metadata. Let's head back to the Explore all images page for some more recon. As there are images on this page, we are looking for an example URL so that we can determine the format. Often the easiest way to do this is to right-click an image and copy the image link.
https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/bc6700f4-7fd3-4fc8-84ce-dcb154161850/anim=false,width=450/00018-2785547559.jpeg
We recognize some of this from the data:
url: bc6700f4-7fd3-4fc8-84ce-dcb154161850
name: 00018-2785547559 (here with a .jpeg extension rather than the .png in our data)
The extension of the copied image URL is jpeg; this either means a jpeg version is being served, or the extension is wrong, as you may have experienced before when you've saved a webp thinking you're getting a jpg.
https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/
This part looks like a prefix. We can confirm xG1nkqKTMzGDvpLrqFT7WA doesn't change by checking some more image URLs.
anim=false,width=450
This part appears to affect the size of the image, let's test what happens when we remove it:
https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/bc6700f4-7fd3-4fc8-84ce-dcb154161850/00018-2785547559.jpeg
Cool! We're getting the full-size image.
Let's test with a different extension:
https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/bc6700f4-7fd3-4fc8-84ce-dcb154161850/00018-2785547559.png
You'll notice the file size is the same. This means the name in our data is the original filename, and the server is configured to return the image it has under any extension. We will stick to using jpeg as the extension, as there is likely some URL-based caching in place behind the scenes, meaning the jpeg extension will probably load faster.
So, our image url format is something like:
"https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/{url}/{name}.jpeg"
We can process each record before saving it to replace the url with this format, or process it at a later point. Let's do it as we save the records.
url_format = "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/{url}/{name}.jpeg"
for image in j['result']['data']['json']['items']:
    image["url"] = url_format.format(
        url=image["url"], name=image["name"].split(".")[0]
    )
    writer.write(image)
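The same formatting can be wrapped in a small helper. This is a sketch under assumptions: build_image_url and its width parameter are hypothetical names, the optional width segment is inferred only from the single copied link above, and rsplit is used so that a filename containing several dots keeps everything before the last one (a slight difference from split(".")[0]).

```python
PREFIX = "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA"


def build_image_url(uuid, name, width=None):
    """Build an image URL from a record's url (a UUID) and name fields."""
    # Drop only the final extension: "a.b.png" -> "a.b".
    stem = name.rsplit(".", 1)[0]
    parts = [PREFIX, uuid]
    if width is not None:
        # Assumption based on the copied link: an optional resize segment.
        parts.append(f"anim=false,width={width}")
    parts.append(f"{stem}.jpeg")
    return "/".join(parts)


full = build_image_url(
    "bc6700f4-7fd3-4fc8-84ce-dcb154161850", "00018-2785547559.png"
)
thumb = build_image_url(
    "bc6700f4-7fd3-4fc8-84ce-dcb154161850", "00018-2785547559.png", width=450
)
print(full)
print(thumb)
```

Keeping the prefix in one constant also means there is only one place to update if it ever turns out to change.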
Great work! At this point we could also process other fields or remove ones we don't want. There's something more important though: what about duplicate records? The current payload retrieves images from period: Week with sort: Most Reactions; naturally the results will change over time as users react to images. We need to keep track of what we already have.
If we were using a database like MongoDB or PostgreSQL we could simply add a unique index on the image's id. We can accomplish the same thing by storing the id of images we've already seen. We'll use a set rather than a list, because membership checks on a set are much faster.
image_ids = set()
for image in j['result']['data']['json']['items']:
    if image['id'] in image_ids:
        continue
    image_ids.add(image['id'])
    image["url"] = url_format.format(
        url=image["url"], name=image["name"].split(".")[0]
    )
    writer.write(image)
Pretty simple changes: if the image's id is already in the set, we skip it. However, we need this to be more robust so we can restart the script at any point. We'll use regular json for this:
IMAGE_IDS = BASE / "image_ids.json"
if IMAGE_IDS.exists():
    image_ids = set(json.load(IMAGE_IDS.open()))
else:
    image_ids = set()
and at the end of our script we'll use something like:
json.dump(list(image_ids), IMAGE_IDS.open("w"))
writer.close()
We also close the jsonlines file at that point.
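One way to make the load and save steps a little safer is to wrap them in helpers that close their files explicitly. This is a minimal sketch: load_ids and save_ids are hypothetical names, and the demo uses a temporary directory rather than your real base path.

```python
import json
import pathlib
import tempfile


def load_ids(path):
    """Return the set of ids stored at path, or an empty set if it is missing."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return set(json.load(f))


def save_ids(path, ids):
    """Overwrite path with the ids as a JSON list (a set is not JSON serializable)."""
    with path.open("w", encoding="utf-8") as f:
        json.dump(sorted(ids), f)


# Demo in a throwaway directory.
ids_path = pathlib.Path(tempfile.mkdtemp()) / "image_ids.json"
first_run = load_ids(ids_path)  # nothing saved yet, so this is empty
first_run.update([24294279, 24469485])
save_ids(ids_path, first_run)
second_run = load_ids(ids_path)  # picks up where the first run stopped
print(second_run)
```

Writing in "w" mode matters here: appending a second JSON list to the same file would leave it unparseable on the next load.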
Putting everything so far together we have something like:
from curl_cffi import requests
import json
import jsonlines
import pathlib
from urllib.parse import quote

base_path = "/your/base/path"
BASE = pathlib.Path(base_path)
IMAGES = BASE / "images.jsonl"
IMAGE_IDS = BASE / "image_ids.json"

if IMAGE_IDS.exists():
    image_ids = set(json.load(IMAGE_IDS.open()))
else:
    image_ids = set()

writer = jsonlines.open(IMAGES, mode="a")

url_format = "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/{url}/{name}.jpeg"
url = "https://civitai.com/api/trpc/image.getInfinite"
payload = {
    "json": {
        "period": "Week",
        "sort": "Most Reactions",
        "types": ["image"],
        "browsingLevel": 1,
        "include": ["cosmetics"],
        "cursor": None,
    },
    "meta": {"values": {"cursor": ["undefined"]}},
}
query = f'input={quote(json.dumps(payload, separators=(",", ":")))}'
r = requests.get(f"{url}?{query}", impersonate="chrome")
j = r.json()

for image in j['result']['data']['json']['items']:
    if image['id'] in image_ids:
        continue
    image_ids.add(image['id'])
    image["url"] = url_format.format(
        url=image["url"], name=image["name"].split(".")[0]
    )
    writer.write(image)

json.dump(list(image_ids), IMAGE_IDS.open("w"))
writer.close()
Great! Now how do we get more images? You may have noticed nextCursor in the data and cursor as part of the payload.
{'result': {'data': {'json': {'nextCursor': '799|152|530|24469485',
That's what we'll be using. We'll need to set up a loop, replacing the cursor for each subsequent request. We'll use a while loop, but we need to consider a stop condition; we can do some further recon and research to figure that out.
Head back to the Explore all images page with Developer Tools open as before, and apply some filters to get a smaller set of results; something like Time Period: Day and Base Model: PixArt E should work. Yep, only a few results. Let's check the response: nextCursor is null (None in Python), so that's our stop condition. Let's implement the loop:
cursor = None
process = True
while process:
    payload['json']['cursor'] = cursor
    query = f'input={quote(json.dumps(payload, separators=(",", ":")))}'
    r = requests.get(f"{url}?{query}", impersonate="chrome")
    j = r.json()
    for image in j['result']['data']['json']['items']:
        if image['id'] in image_ids:
            continue
        image_ids.add(image['id'])
        image["url"] = url_format.format(
            url=image["url"], name=image["name"].split(".")[0]
        )
        writer.write(image)
    process = j['result']['data']['json']['nextCursor'] is not None
    cursor = j['result']['data']['json']['nextCursor']
We've set the initial cursor to None. On each iteration we set cursor in the payload, then update cursor to the value of nextCursor. We set process by checking whether nextCursor is None.
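The loop's control flow can be tried without touching the network. In this sketch fetch_page is a hypothetical stand-in for the real request and FAKE_PAGES is made-up data shaped like result.data.json; only the cursor handling and de-duplication mirror the code above.

```python
# Made-up pages keyed by cursor; None is the cursor of the first request.
FAKE_PAGES = {
    None: {"nextCursor": "c1", "items": [{"id": 1}, {"id": 2}]},
    "c1": {"nextCursor": "c2", "items": [{"id": 2}, {"id": 3}]},
    "c2": {"nextCursor": None, "items": [{"id": 4}]},
}


def fetch_page(cursor):
    """Hypothetical stand-in for the HTTP request."""
    return FAKE_PAGES[cursor]


image_ids = set()
saved = []
requests_made = 0

cursor = None
process = True
while process:
    page = fetch_page(cursor)
    requests_made += 1
    for image in page["items"]:
        if image["id"] in image_ids:
            continue  # already seen, e.g. id 2 appears on two pages
        image_ids.add(image["id"])
        saved.append(image)
    # Stop once the server reports there is no further page.
    cursor = page["nextCursor"]
    process = cursor is not None

print(requests_made, [image["id"] for image in saved])
```

The duplicate on the second page is skipped, and the loop ends on the page whose nextCursor is None.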
Awesome! Our code in full now looks something like:
from curl_cffi import requests
import json
import jsonlines
import pathlib
from urllib.parse import quote

base_path = "/your/base/path"
BASE = pathlib.Path(base_path)
IMAGES = BASE / "images.jsonl"
IMAGE_IDS = BASE / "image_ids.json"

if IMAGE_IDS.exists():
    image_ids = set(json.load(IMAGE_IDS.open()))
else:
    image_ids = set()

writer = jsonlines.open(IMAGES, mode="a")

url_format = "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/{url}/{name}.jpeg"
url = "https://civitai.com/api/trpc/image.getInfinite"
payload = {
    "json": {
        "period": "Week",
        "sort": "Most Reactions",
        "types": ["image"],
        "browsingLevel": 1,
        "include": ["cosmetics"],
        "cursor": None,
    },
    "meta": {"values": {"cursor": ["undefined"]}},
}

cursor = None
process = True
while process:
    payload['json']['cursor'] = cursor
    query = f'input={quote(json.dumps(payload, separators=(",", ":")))}'
    r = requests.get(f"{url}?{query}", impersonate="chrome")
    j = r.json()
    for image in j['result']['data']['json']['items']:
        if image['id'] in image_ids:
            continue
        image_ids.add(image['id'])
        image["url"] = url_format.format(
            url=image["url"], name=image["name"].split(".")[0]
        )
        writer.write(image)
    process = j['result']['data']['json']['nextCursor'] is not None
    cursor = j['result']['data']['json']['nextCursor']

json.dump(list(image_ids), IMAGE_IDS.open("w"))
writer.close()
Let's run it and see what happens:
KeyError: 'result'
{'error': {'json': {'message': 'Please use the public API instead: https://github.com/civitai/civitai/wiki/REST-API-Reference',
'code': -32001,
'data': {'code': 'UNAUTHORIZED',
'httpStatus': 401,
'path': 'image.getInfinite'}}}}
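As an aside, the KeyError comes from indexing a response that has no result key, which hides the server's actual message. A small sketch of a guard that surfaces it: extract_page is a hypothetical helper, and the error shape it expects is taken only from the single response shown above.

```python
def extract_page(j):
    """Return result.data.json, or raise with the server's message on error."""
    if "error" in j:
        err = j["error"].get("json", {})
        code = err.get("data", {}).get("code")
        raise RuntimeError(f"API error {code}: {err.get('message')}")
    return j["result"]["data"]["json"]


# The error response from above, with the message abbreviated.
error_response = {
    "error": {
        "json": {
            "message": "Please use the public API instead: ...",
            "code": -32001,
            "data": {
                "code": "UNAUTHORIZED",
                "httpStatus": 401,
                "path": "image.getInfinite",
            },
        }
    }
}

try:
    extract_page(error_response)
except RuntimeError as exc:
    message = str(exc)
    print(message)
```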
Oh no! There must be something wrong. Let's go back and do some more recon. Scroll down the page to let more images load, check the new request and look at the payload. It looks like the meta field is removed when cursor is not None. Let's sort that out:
while process:
    payload["json"]["cursor"] = cursor
    if cursor is not None:
        _ = payload.pop("meta", None)
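A variation that avoids mutating the shared payload is to build a fresh copy for each request. This is a sketch rather than the approach used in this article: build_query and BASE_PAYLOAD are hypothetical names, and the rule it encodes (send meta only when cursor is None) is the one observed in the browser above.

```python
import copy
import json
from urllib.parse import quote, unquote

BASE_PAYLOAD = {
    "json": {
        "period": "Week",
        "sort": "Most Reactions",
        "types": ["image"],
        "browsingLevel": 1,
        "include": ["cosmetics"],
        "cursor": None,
    },
    "meta": {"values": {"cursor": ["undefined"]}},
}


def build_query(cursor):
    """Return the input=... query string for the given cursor."""
    payload = copy.deepcopy(BASE_PAYLOAD)  # leave the template untouched
    payload["json"]["cursor"] = cursor
    if cursor is not None:
        # Observed in the browser: follow-up requests carry no meta field.
        payload.pop("meta", None)
    return f'input={quote(json.dumps(payload, separators=(",", ":")))}'


# Decode both variants again to inspect what would be sent.
first = json.loads(unquote(build_query(None)[len("input="):]))
later = json.loads(unquote(build_query("798|170|525|24907786")[len("input="):]))
print("meta" in first, "meta" in later)
```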
Oh no! It's still not working. There must be something else. Our goal is to match the original requests, so let's check out that request in Developer Tools as before. We'll notice a bunch of Request Headers; let's try replicating those.
You'll notice the standard headers, like accept and content-type, but these look special:
x-client: web
x-client-date: 1724139341492
x-client-version: 4.0.169
Indeed they are. These are custom headers set by CivitAI's web application.
x-client-date looks like a Unix timestamp in milliseconds, so we should generate this when we send the request.
We'll also add accept, accept-language, content-type and Referer. As we need to generate x-client-date, we'll use a function to return the headers:
import datetime

def headers():
    return {
        "accept": "*/*",
        "accept-language": "en-US,en;q=0.9",
        "content-type": "application/json",
        "x-client": "web",
        "x-client-date": str(int(datetime.datetime.now().timestamp() * 1000)),
        "x-client-version": "4.0.169",
        "Referer": "https://civitai.com/images",
    }
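To check that the captured x-client-date value really is a Unix timestamp in milliseconds, we can convert it back to a date. A small sketch using only the standard library; client_date is a hypothetical helper name, and time.time() is used here as an equivalent of datetime.datetime.now().timestamp().

```python
import time
from datetime import datetime, timezone


def client_date():
    """Current Unix time in milliseconds, as a string."""
    return str(int(time.time() * 1000))


# The value captured from the browser request above.
captured = 1724139341492
as_datetime = datetime.fromtimestamp(captured / 1000, tz=timezone.utc)
print(as_datetime.isoformat(timespec="seconds"))  # 2024-08-20T07:35:41+00:00

now = client_date()
print(now, len(now))
```

The decoded date falls a few days after the createdAt values in our sample data, which fits a request made while browsing the past week's images.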
Then modify the request to include these headers:
r = requests.get(f"{url}?{query}", headers=headers(), impersonate="chrome")
While we're making changes, let's add some basic progress reporting with a print:
    ...
    process = j["result"]["data"]["json"]["nextCursor"] is not None
    cursor = j["result"]["data"]["json"]["nextCursor"]
    print(len(image_ids))
To recap, our code now looks something like this:
from curl_cffi import requests
import json
import jsonlines
import pathlib
from urllib.parse import quote
import datetime


def headers():
    return {
        "accept": "*/*",
        "accept-language": "en-US,en;q=0.9",
        "content-type": "application/json",
        "x-client": "web",
        "x-client-date": str(int(datetime.datetime.now().timestamp() * 1000)),
        "x-client-version": "4.0.169",
        "Referer": "https://civitai.com/images",
    }


base_path = "/your/base/path"
BASE = pathlib.Path(base_path)
IMAGES = BASE / "images.jsonl"
IMAGE_IDS = BASE / "image_ids.json"

if IMAGE_IDS.exists():
    image_ids = set(json.load(IMAGE_IDS.open()))
else:
    image_ids = set()

writer = jsonlines.open(IMAGES, mode="a")

url_format = "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/{url}/{name}.jpeg"
url = "https://civitai.com/api/trpc/image.getInfinite"
payload = {
    "json": {
        "period": "Week",
        "sort": "Most Reactions",
        "types": ["image"],
        "browsingLevel": 1,
        "include": ["cosmetics"],
        "cursor": None,
    },
    "meta": {"values": {"cursor": ["undefined"]}},
}

cursor = None
process = True
while process:
    payload["json"]["cursor"] = cursor
    if cursor is not None:
        _ = payload.pop("meta", None)
    query = f'input={quote(json.dumps(payload, separators=(",", ":")))}'
    r = requests.get(f"{url}?{query}", headers=headers(), impersonate="chrome")
    j = r.json()
    for image in j["result"]["data"]["json"]["items"]:
        if image["id"] in image_ids:
            continue
        image_ids.add(image["id"])
        image["url"] = url_format.format(
            url=image["url"], name=image["name"].split(".")[0]
        )
        writer.write(image)
    process = j["result"]["data"]["json"]["nextCursor"] is not None
    cursor = j["result"]["data"]["json"]["nextCursor"]
    print(len(image_ids))

json.dump(list(image_ids), IMAGE_IDS.open("w"))
writer.close()
Time to run the script again:
100
200
300
...
2598
2698
2798
...
Wow! Awesome! Data acquired!
We've learned the importance of matching not only the request payload but the headers too, which can distinguish your request from the original, and how to efficiently save our acquired data while keeping track of data we already have. Great work!
We'll take a short break at this point while we prepare for stage 3, where we'll refine our process by adding error checking, better progress reporting and more!