
Re: trying to parse lines from an awkwardly formatted HAR file ...



>On Sat, Mar 23, 2024 at 1:44 AM <tomas@tuxteam.de> wrote:
>> On Sat, Mar 23, 2024 at 12:53:24AM -0500, Albretch Mueller wrote:
>> out of a HAR file containing lots of obfuscating js cr@p and all kinds of
>> nonsense I was able to extract lines looking like:

>It's not "js cr@p"; it is called JSON. And there's a spec for
>it.

 Well, I am old enough to remember when JSON meant: "JavaScript Object
Notation" in the form of human-readable attribute:value text files.

 a) using a Chromium-derived browser, which can dump a HAR log of the
network traffic going back and forth, go to, e.g.:
  https://en.wikipedia.org/wiki/Anaxagoras
 b) click on the link that says: "Works by or about Anaxagoras" (at
Internet Archive)
 c) on the archive.org page, select "texts" and "always available"
(meaning texts which are in the public domain; he died 25 centuries ago)
 d) then to produce the HAR file, go:
 d.1) More Tools > Developer Tools;
 d.2) click on "Network" tab;
 d.3) Filter: GET
 d.4) check: "Preserve Log"
 d.5) scroll all the way down the page to trigger the cascade of
client-server requests
 d.6) save the network log as HAR file to then open and eyeball it!
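
 Instead of eyeballing the saved file, it can also be inspected
programmatically. The following is only a minimal sketch, assuming the
export follows the usual HAR 1.2 layout (log > entries > request /
response > content); the function name summarize_har and the sample
data are made up for illustration and are not taken from the actual
archive.org log:

```python
import json

def summarize_har(har_text):
    """Return (method, url, mime_type, body_length) for each HAR entry."""
    har = json.loads(har_text)
    rows = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        content = entry["response"].get("content", {})
        body = content.get("text") or ""  # response body, stored as one string
        rows.append((request["method"], request["url"],
                     content.get("mimeType", ""), len(body)))
    return rows

# Made-up one-entry HAR, only to show the shape; not real archive.org data.
sample_har = json.dumps({
    "log": {"entries": [{
        "request": {"method": "GET", "url": "https://example.org/search"},
        "response": {"content": {"mimeType": "application/json",
                                 "text": '{"index":"x"}'}},
    }]}
})

for row in summarize_har(sample_har):
    print(row)
```

 Note that json.dumps writes the sample body into the file as
{\"index\":\"x\"}, because a JSON body stored inside a JSON string gets
its quotes escaped with backslashes.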

>> I have tried substring substitution, sed and tr to no avail.
>You might have a lot of fun trying to parse JSON with sed and
>tr.

 1) That HAR file is not properly formatted. Instead of plain
"attribute":value pairs in the standard way, they have used backslash
+ quote pairs (\" instead of just ") erratically all around the file.
That is why you can't use jq.
 2) since they (archive.org) have been changing the format they use on
their pages (to avoid HTML scrapers?), I don't try to make sense of
what they do. I would just use quick hacks and "keep moving".
 2.a) make an editing copy of the file
 2.b) using sed I would parse out the lines with the data I need:
  sed --in-place --expression \
    's/{\\"index\\":\\"/\n{\\"index\\":\\"/g' "<editing copy>"
 2.c) once you extract them, you then need to parse the fields for
post processing.
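
 For comparison, the same line-splitting step can be sketched in
Python. This only mirrors what the sed expression in 2.b does (put a
newline in front of every literal {\"index\":\" marker); the names
MARKER and split_records and the sample string are hypothetical:

```python
# Literal text {\"index\":\" as it appears in the file (backslashes included).
MARKER = r'{\"index\":\"'

def split_records(raw):
    """Insert a newline before every marker, like the sed substitution."""
    return raw.replace(MARKER, "\n" + MARKER)

# Made-up fragment standing in for one long line of the editing copy.
raw = r'junk{\"index\":\"a\"},{\"index\":\"b\"}'

for line in split_records(raw).split("\n"):
    print(line)
```

 Each record then starts on its own line, still with its backslash +
quote pairs, ready for the field parsing of step 2.c.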

 I have tried substring substitution, sed and tr to first replace all
backslash + quote pairs with plain quotes, so that jq can then be used
in the normal way. I haven't been successful (is that the reason why
they obfuscate their pages in that way?)
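
 One possible way to do that replacement, sketched in Python: if the
backslash + quote pairs are ordinary JSON string escaping (an
assumption; I have not confirmed this for the archive.org file), then
decoding the fragment once as a JSON string and once more as JSON
undoes them without any hand-written substitution. The function name
decode_embedded_json and the sample fragment are made up:

```python
import json

def decode_embedded_json(escaped_fragment):
    """Turn text like {\\"index\\":\\"x\\"} into a Python object."""
    # Wrapping the fragment in quotes makes it a JSON string literal;
    # the first loads() undoes the backslash escaping.
    inner = json.loads('"' + escaped_fragment + '"')
    # The second loads() parses the now plain JSON text.
    return json.loads(inner)

# Made-up fragment with backslash + quote pairs, not real archive.org data.
fragment = r'{\"index\":\"x\",\"title\":\"sample\"}'
record = decode_embedded_json(fragment)
print(record["index"], record["title"])
```

 Unlike a blind substitution of \" by ", this also copes with other
escapes such as \\ or \n inside the values, but it will raise an error
if the fragment is not valid JSON string content.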

 lbrtchx

