Re: trying to parse lines from an awkwardly formatted HAR file ...

To: debian-user@lists.debian.org
Cc: debian-user <debian-user@lists.debian.org>
Subject: Re: trying to parse lines from an awkwardly formatted HAR file ...
From: Albretch Mueller <lbrtchx@gmail.com>
Date: Sat, 23 Mar 2024 09:54:05 -0500
Message-id: <CAFakBwiUq78DsXh+V_ey+FaHQJjDoHsymuBzH-PeA_Ge8XViFQ@mail.gmail.com>
In-reply-to: <Zf56MtJn4IyGRFHw@tuxteam.de>
References: <CAFakBwhVPFPUpiaYPxhgO4motEKFPCaFB5860c_od=Zhs72nnA@mail.gmail.com> <Zf56MtJn4IyGRFHw@tuxteam.de>

>On Sat, Mar 23, 2024 at 1:44 AM <tomas@tuxteam.de> wrote:
>> On Sat, Mar 23, 2024 at 12:53:24AM -0500, Albretch Mueller wrote:
>> out of a HAR file containing lots of obfuscating js cr@p and all kinds of
>> nonsense I was able to extract line looking like:

>It's not "js cr@p", It is called JSON. And there's a spec for
>it.

 Well, I am old enough to remember when JSON meant: "JavaScript Object
Notation" in the form of human-readable attribute:value text files.

 a) using a chromium-derived browser, which can dump the log of the
network back and forth as a HAR file, go to, e.g.:
  https://en.wikipedia.org/wiki/Anaxagoras
 b) click on the link that says: "Works by or about Anaxagoras" (at
Internet Archive)
 c) on the archive.org page, select "texts" and "always available"
(meaning texts in the public domain; he died 25 centuries ago)
 d) then to produce the HAR file, go:
 d.1) More Tools > Developer Tools;
 d.2) click on "Network" tab;
 d.3) Filter: GET
 d.4) check: "Preserve Log"
 d.5) scroll down the page all the way to make the client-server back
and forth cascade
 d.6) save the network log as HAR file to then open and eyeball it!

>> I have tried substring substitution, sed et tr to no avail.
>You might have a lot of fun trying to parse JSON with sed and
>tr.

 1) That HAR file is not properly formatted. Instead of
"attribute":value pairs in the standard way, they have used backslash
+ quote pairs (instead of just quotes) erratically all around the
file. That is why you can't use jq.
 2) since they (archive.org) have been changing the format they use on
their pages (to avoid HTML scrapers?), I don't try to make sense of
what they do. I would just use quick hacks and "keep moving".
 2.a) make an editing copy of the file
 2.b) using sed I would split out the lines with the data I need, one
record per line:
  sed --in-place --expression \
    's/{\\"index\\":\\"/\n{\\"index\\":\\"/g' "<editing copy>"
 2.c) once you extract them, you then need to parse the fields for
post processing.
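
 Steps 2.a to 2.c could be sketched roughly as below. This is only a
sketch: it assumes GNU sed (which turns \n in the replacement into a
newline) and that every record starts with the literal text
{\"index\":\". The function name split_records and its two file
arguments are made up for illustration.

```shell
#!/bin/sh
# Sketch only: split_records and its arguments are hypothetical names.
# Assumes GNU sed, which turns \n in the replacement into a newline.
split_records() {
    src=$1     # original HAR file, left untouched
    copy=$2    # editing copy that sed rewrites in place
    cp -- "$src" "$copy"
    # Put every {\"index\":\" marker at the start of its own line;
    # & in the replacement stands for the text that was matched.
    sed --in-place --expression 's/{\\"index\\":\\"/\n&/g' "$copy"
    # Print only the lines that begin with the marker.
    grep -- '^{\\"index\\":\\"' "$copy"
}
```

 Called as: split_records "<HAR file>" "<editing copy>". Whatever
follows a record on the same line is kept, so the fields still need to
be cut out afterwards.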

 I have tried substring substitution, sed and tr to first replace all
backslash + quote pairs with quotes, to then be able to use jq in the
happy way you should. I haven't been successful (is that the reason
why they obfuscate their pages in that way?)
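
 For reference, the substitution I am after would look something like
the sketch below. It assumes that every backslash + quote pair stands
for a plain quote and that no field value contains a quote of its own;
I have not checked either against the whole file. The function name
unescape_quotes is made up.

```shell
#!/bin/sh
# Sketch only: unescape_quotes is a hypothetical helper name.
# Assumes each \" in the input stands for a plain " and that no field
# value contains an escaped quote of its own.
unescape_quotes() {
    # In the sed script, \\ matches one literal backslash.
    sed 's/\\"/"/g'
}
```

 For example:
  printf '%s\n' '{\"index\":\"a\"}' | unescape_quotes
prints {"index":"a"}, which jq should accept as long as each line is
one complete object.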

 lbrtchx
