|
jonnyawsom3
|
2025-12-03 04:19:38
|
I'm not sure how that would work... You can't just mix coefficients of different sizes
|
|
2025-12-03 04:20:18
|
Block merging is an encoder heuristic, not a coding tool
|
|
|
AccessViolation_
|
2025-12-03 04:20:27
|
yeah merging blocks requires decoding them to pixel data
|
|
|
Magnap
|
|
AccessViolation_
yeah merging blocks requires decoding them to pixel data
|
|
2025-12-03 04:21:07
|
I know, I was thinking that by looking at the coefficients you could make really good guesses as to which blocks could be merged
|
|
|
AccessViolation_
|
2025-12-03 04:21:37
|
oh maybe. I was thinking of using the encoder's native method
|
|
|
Magnap
|
|
I'm not sure how that would work... You can't just mix coefficients of different sizes
|
|
2025-12-03 04:21:42
|
you can create a 16x16 block that decodes to the same pixels as 4 8x8 blocks is what I had in mind
|
|
|
AccessViolation_
|
2025-12-03 04:21:47
|
libjxl starts with 8x8 blocks for everything and then sees which can be "merged" i.e. replaced by a larger block
|
|
|
Magnap
|
2025-12-03 04:23:03
|
well that wouldn't be very coding-system-aware π
|
|
2025-12-03 04:25:49
|
I don't have quite the necessary intuition about DCT to be sure, but I feel like e.g. a 2x2 block of 8x8 blocks with mostly low-frequency coefficients would do well as a 16x16 block, etc
|
|
|
AccessViolation_
|
|
Magnap
well that wouldn't be very coding-system-aware π
|
|
2025-12-03 06:16:06
|
oh to be clear I don't mean that libjxl merges blocks after they've already been encoded. it probably looks at the blocks after encoding them and if they seem good candidates for a merge, it encodes a larger block there, from the original pixel data. or I assume so at least, I haven't looked at the source code in depth
|
|
|
Magnap
|
|
AccessViolation_
oh to be clear I don't mean that libjxl merges blocks after they've already been encoded. it probably looks at the blocks after encoding them and if they seem good candidates for a merge, it encodes a larger block there, from the original pixel data. or I assume so at least, I haven't looked at the source code in depth
|
|
2025-12-03 06:26:09
|
I didn't think that's what you meant. my point was just that if you have to go all the way to pixels, without being able to use the coefficients to speed things up or otherwise make better decisions, then it's not using the known format for anything. but I think in this case you might be able to make decisions on the coefficients
|
|
|
AccessViolation_
|
2025-12-03 06:26:36
|
ah
|
|
2025-12-03 06:27:09
|
well, you can still use the knowledge of where block borders are for example
|
|
2025-12-03 06:28:05
|
so that you're not potentially wasting bits to recompress artifacts from block borders, and instead can reuse the block layout
|
|
|
jonnyawsom3
|
|
AccessViolation_
oh to be clear I don't mean that libjxl merges blocks after they've already been encoded. it probably looks at the blocks after encoding them and if they seem good candidates for a merge, it encodes a larger block there, from the original pixel data. or I assume so at least, I haven't looked at the source code in depth
|
|
2025-12-03 06:37:28
|
From what I can tell, it just checks the entropy of a block and if it's below a threshold, it checks the entropy of a larger block against a higher threshold
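That described heuristic is easy to sketch. The sketch below is a toy version of the shape of that check, not libjxl's actual code: the `entropy` estimate and the threshold values are made up for illustration.

```python
import math

def entropy(coeffs):
    """Shannon entropy of a list of quantized coefficients, in bits (toy estimate)."""
    counts = {}
    for c in coeffs:
        counts[c] = counts.get(c, 0) + 1
    n = len(coeffs)
    return -sum(k / n * math.log2(k / n) for k in counts.values())

def try_merge(blocks_8x8, merged_16x16, t_small=2.0, t_large=2.5):
    """Merge heuristic as described above: if every 8x8 block is below one
    threshold, accept the 16x16 block when it stays below a higher one.
    Threshold values are hypothetical."""
    if all(entropy(b) < t_small for b in blocks_8x8):
        return entropy(merged_16x16) < t_large
    return False
```

So four flat 8x8 blocks whose merged 16x16 block is also flat would merge, while a merged block full of distinct values would not.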
|
|
|
AccessViolation_
|
2025-12-04 09:45:26
|
I've been thinking about how you can reversibly blur an image by multiplying it by a kernel, and then unblur it by dividing it by the kernel. and since blurry images are easier to predict, you could potentially compress images by blurring them and losslessly compressing the blurred version
|
|
2025-12-04 09:46:49
|
this might count as lossless if you have enough precision, but otherwise it'll be interesting to see how this compresses even with slight data loss due to precision issues
|
|
2025-12-04 09:51:51
|
I haven't figured out the logic error in this yet, because you could potentially do this to a very noisy image and compress it really well, and still be able to unblur it, effectively compressing random data, which shouldn't be possible. I suspect the problem is that the stronger your blur, the more bits are required in the blurred representation for enough precision to keep it perfectly reversible in practice
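The frequency-domain multiply/divide version of this can be sketched in a few lines of numpy. A toy sketch, not lossless in the integer sense: everything stays float64, it uses circular convolution (so no boundary issue), and the kernel here is deliberately chosen so its DFT has no zeros, which is what makes the division well defined.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32)).astype(np.float64)

# Hypothetical 2x2 kernel (outer product of [2, 1]/3 with itself), zero-padded
# to the image size. Its 2D DFT has no zeros, so pointwise division is defined.
kernel = np.zeros((32, 32))
kernel[:2, :2] = np.outer([2.0, 1.0], [2.0, 1.0]) / 9.0

K = np.fft.fft2(kernel)
blurred = np.real(np.fft.ifft2(np.fft.fft2(img) * K))       # circular blur
restored = np.real(np.fft.ifft2(np.fft.fft2(blurred) / K))  # "unblur"

print(np.max(np.abs(restored - img)))  # tiny, but not exactly zero
```

Rounding `blurred` to 8-bit integers before the divide wrecks the reconstruction, which is the precision problem being discussed.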
|
|
|
lonjil
|
2025-12-04 10:26:37
|
correct
|
|
2025-12-04 10:26:57
|
also, I think that the blurred image needs to be bigger than the original image for it to be theoretically lossless?
|
|
2025-12-04 10:27:07
|
as in, dimensionally, more pixels
|
|
|
AccessViolation_
|
2025-12-04 10:27:59
|
not as far as I'm aware
|
|
2025-12-04 10:29:01
|
https://www.youtube.com/watch?v=xDLxFGXuPEc
|
|
|
lonjil
|
2025-12-04 10:43:14
|
I don't think it's reversible along the edges of the images unless there are more pixels for the unblur kernel to read outside of the original image bounds. Since a blur kernel "spreads out" the data from each pixel, some of that must end up outside of the original image.
|
|
|
AccessViolation_
|
2025-12-04 10:56:31
|
ah like that, yeah that might be the case
|
|
2025-12-04 10:56:46
|
I thought you meant it had to be several times bigger than the original or something like that
|
|
2025-12-04 10:57:12
|
I guess it would if the kernel is as large as the image, then
|
|
|
Exorcist
|
|
AccessViolation_
I've been thinking about how you can reversibly blur an image by multiplying it by a kernel, and then unblur it by dividing it by the kernel. and since blurry images are easier to predict, you could potentially compress images by blurring them and losslessly compressing the blurred version
|
|
2025-12-04 11:08:10
|
> since blurry images are easier to predict
Prove it?
|
|
|
_wb_
|
2025-12-04 11:20:24
|
gaborish does exactly the opposite of this: it sharpens before encode and blurs after decode.
I don't think it would work to do blur before encode and then sharpen after decode. The reason blurred images compress better is that the amplitude of high freqs drops by blurring, so more of those coeffs will get quantized away, but that also means the information cannot be restored by undoing the blur.
|
|
|
|
ignaloidas
|
2025-12-04 11:45:58
|
I think even if you were to avoid quantization, you wouldn't get good results with blurring: if your kernel uses only integer multiplies, then you have to bump the bitdepth of the image by ceil(log2(sum of kernel coeffs))
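As a concrete instance of that bound (using the classic 3x3 binomial kernel as a stand-in): its integer coefficients sum to 16, so an exact integer blur costs ceil(log2(16)) = 4 extra bits per sample.

```python
import math

kernel = [1, 2, 1,
          2, 4, 2,
          1, 2, 1]  # 3x3 binomial blur, unnormalized integer coefficients

extra_bits = math.ceil(math.log2(sum(kernel)))
print(extra_bits)  # 4: an 8-bit image needs 12 bits to hold the blur exactly
```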
|
|
|
AccessViolation_
|
|
Exorcist
> since blurry images are easier to predict
Prove it?
|
|
2025-12-04 11:46:44
|
|
|
|
|
ignaloidas
|
2025-12-04 11:48:17
|
now check if the blurry image can be de-blurred into the original, I expect significant loss π
|
|
|
AccessViolation_
|
2025-12-04 11:51:00
|
(RE: now deleted message)
they're the same base image, one has a gaussian blur applied. the point was that depending on the type of blur used, you can losslessly undo the blur if you know how the image was blurred. hence the unblurred image can, in theory, be reconstructed into the original
|
|
|
Exorcist
|
2025-12-04 11:51:37
|
> losslessly undo the blur
Prove it can "losslessly"?
|
|
|
|
ignaloidas
|
2025-12-04 11:52:05
|
not with floats, you're getting implicit quantization when adding up numbers of different magnitudes
|
|
2025-12-04 11:52:20
|
and with ints, you'll be increasing bitdepth
|
|
|
AccessViolation_
|
|
Exorcist
> losslessly undo the blur
Prove it can "losslessly"?
|
|
2025-12-04 11:54:48
|
first you multiply the image by a blur kernel, blurring it, and then when you divide the image by the same blur kernel you get the original image back. the losslessness is a guarantee because that's how multiplication and division work. but you need enough bits to represent them, so that if you're working with say 8-bit-per-channel images, you're not losing so much precision that distinct values get rounded to the same integer
|
|
|
Exorcist
|
2025-12-04 11:56:40
|
Talk is cheap, show me the result
Can you restore from `blur-modular-e9.jxl`?
|
|
|
AccessViolation_
|
|
_wb_
gaborish does exactly the opposite of this: it sharpens before encode and blurs after decode.
I don't think it would work to do blur before encode and then sharpen after decode. The reason blurred images compress better is that the amplitude of high freqs drops by blurring, so more of those coeffs will get quantized away, but that also means the information cannot be restored by undoing the blur.
|
|
2025-12-04 11:56:54
|
yeah, if you touch the blurred image basically at all it significantly corrupts the final result, which is also why it's not trivially possible to unblur a lossily compressed image even if you know exactly how it was blurred. there are apparently ways to get around this issue but the result then presumably is more like an estimation and not lossless
|
|
|
Exorcist
Talk is cheap, show me the result
Can you restore from `blur-modular-e9.jxl`?
|
|
2025-12-04 11:57:53
|
this is an open conversation I started out of curiosity and interest, I'm not trying to convince anyone
|
|
|
Exorcist
|
2025-12-04 11:58:59
|
> I'm not trying to convince anyone
Nice cope
|
|
|
Tirr
|
2025-12-04 11:59:54
|
"dividing by the kernel" implies that there is an inverse of the convolution kernel, which I doubt
|
|
|
AccessViolation_
|
|
Exorcist
> I'm not trying to convince anyone
Nice cope
|
|
2025-12-04 12:00:08
|
I don't even know how to respond to that
|
|
|
Quackdoc
|
2025-12-04 12:00:31
|
Blur can easily be undone in a LOT of contexts, which is why blur is never used as a serious destructive mechanism for protecting people's identity
|
|
|
Magnap
|
|
Exorcist
> I'm not trying to convince anyone
Nice cope
|
|
2025-12-04 12:02:58
|
why so hostile?
|
|
|
AccessViolation_
|
|
_wb_
gaborish does exactly the opposite of this: it sharpens before encode and blurs after decode.
I don't think it would work to do blur before encode and then sharpen after decode. The reason blurred images compress better is that the amplitude of high freqs drops by blurring, so more of those coeffs will get quantized away, but that also means the information cannot be restored by undoing the blur.
|
|
2025-12-04 12:06:21
|
I do remember reading that the blur at decode time isn't the inverse of the sharpen at encode time; specifically, the blur kernel is a bit larger than the sharpen kernel, unlike this, where the kernel needs to be exactly the same for it to work
|
|
|
Quackdoc
|
2025-12-04 12:06:30
|
perhaps instead of lossless it would be better to say "without a significant degree of loss"
|
|
|
Exorcist
|
2025-12-04 12:06:37
|
I think he can unblur it "so easily" since he knows the kernel
|
|
|
_wb_
|
2025-12-04 12:07:56
|
you can't invert a blur with a kernel the same size, we currently use a 5x5 'inverse' for the 3x3 but even that is not an exact inverse, not sure if an exact one is possible with a finite kernel size
|
|
2025-12-04 12:08:15
|
(in any case 'exact' is modulo the precision of float32 anyway)
|
|
|
Tirr
|
2025-12-04 12:08:27
|
approximation would be possible, yeah
|
|
2025-12-04 12:08:47
|
but then it's not mathematically lossless
|
|
|
AccessViolation_
|
2025-12-04 12:10:02
|
as I understand it, it's lossless in the realm of mathematics, but in reality you're going to hit precision limits at some point
|
|
|
_wb_
you can't invert a blur with a kernel the same size, we currently use a 5x5 'inverse' for the 3x3 but even that is not an exact inverse, not sure if an exact one is possible with a finite kernel size
|
|
2025-12-04 12:10:30
|
oh! I didn't know that. the video I linked above implied you had to use the exact same kernel for multiplication and division. maybe they presented it like that as a simplification
|
|
2025-12-04 12:23:55
|
I forgot to mention, the unblurring trick may only work if you're working in the frequency domain, like with a Fourier transform of the image and a Fourier transform of the kernel. or it may not. but that's what they did in the video
|
|
|
Exorcist
|
2025-12-04 12:25:31
|
https://gemini.google.com/share/0694c2e514c4
|
|
|
AccessViolation_
|
2025-12-04 12:29:11
|
I will say posting LLM slop is a bold move after expecting me to prove to you whether the thing I was theorizing about worked
|
|
|
Exorcist
|
2025-12-04 12:30:04
|
I only wanted to recall the formula, but Gemini reminded me there's a noise term
|
|
2025-12-04 12:32:06
|
Every rounding error is a kind of noise, so I hope float32 is precise enough for your theory
|
|
|
|
ignaloidas
|
|
Tirr
"dividing by the kernel" implies that there is an inverse of the convolution kernel, which I doubt
|
|
2025-12-04 12:32:23
|
if the kernel has no 0 terms, there always is an inverse
|
|
|
Exorcist
|
2025-12-04 02:09:29
|
https://boards.4chan.org/g/thread/107330239/anyone-else-sticking-with-webp-at-reasonable
|
|
2025-12-04 02:09:49
|
Google returns a 4chan result<:FeelsAmazingMan:808826295768449054>
|
|
|
Quackdoc
|
|
Exorcist
I think he can "so easy to unblur" since he know the kernel
|
|
2025-12-04 02:21:36
|
Yes, this is the case we're talking about. But even without that, there's only so many popular kernels out in the wild.
|
|
2025-12-04 02:21:55
|
You can actually often just brute force to get a usable result, but that's not the use case we're talking about here.
|
|
|
Magnap
|
2025-12-04 02:25:59
|
encoded a correct header (SizeHeader and ImageHeader) π my library is coming along nicely π
|
|
2025-12-04 02:26:35
|
now to debug my FrameHeader/TOC/LfGlobal encoders because the image still doesn't actually decode
|
|
|
|
ignaloidas
|
|
Magnap
encoded a correct header (SizeHeader and ImageHeader) π my library is coming along nicely π
|
|
2025-12-04 02:26:52
|
what you're writing it in?
|
|
|
Magnap
|
|
ignaloidas
what you're writing it in?
|
|
2025-12-04 02:27:34
|
Rust
|
|
|
spider-mario
|
2025-12-04 03:24:41
|
cursed realisation of the day about C preprocessor macros (or perhaps rediscovery of something I once knew and then chose to forget):
```c
#define MY_MACRO() const int kVar##__LINE__ = whatever
```
nope, not substituted, just produces `kVar__LINE__` so it conflicts if you call it twice
```c
#define CONCAT(a, b) a##b
#define MY_MACRO() const int CONCAT(kVar, __LINE__) = whatever
```
still not
```c
#define CONCAT1(a, b) a##b
#define CONCAT2(a, b) CONCAT1(a, b)
#define MY_MACRO() const int CONCAT2(kVar, __LINE__) = whatever
```
_now_ yes
|
|
|
lonjil
|
2025-12-04 04:09:06
|
Aaaaa
|
|
|
Magnap
|
|
Magnap
now to debug my FrameHeader/TOC/LfGlobal encoders because the image still doesn't actually decode
|
|
2025-12-04 04:17:47
|
FrameHeader and TOC are fine, but I cannot for the life of me figure out why my GlobalModular isn't working
|
|
2025-12-04 06:03:40
|
Encoded my first correct JXL π
|
|
2025-12-04 06:04:05
|
a 16-byte "animation" of a single black 256x256 frame π
|
|
|
AccessViolation_
|
2025-12-04 09:24:05
|
nice!
|
|
|
monad
|
2025-12-04 09:37:27
|
that's freaking amaze
|
|
|
Magnap
|
2025-12-04 09:38:41
|
I am extremely not aiming for a full encoder, I just wanna make little sprite-based animations π
|
|
|
AccessViolation_
nice!
|
|
2025-12-04 09:39:56
|
It was actually your comment about making animations of chess games that inspired me, that's what I'm aiming for with this project, and then ideally I'll keep the library reasonably general so it can be used for more general animations
|
|
|
AccessViolation_
|
2025-12-04 09:40:23
|
oooo cool
|
|
2025-12-04 09:41:22
|
it would be awesome to eventually have the "export to JPEG XL" option in Lichess, especially if they can be crazy small :3
|
|
|
Magnap
|
2025-12-04 09:41:43
|
I have this idea that you can have a single reference-only frame for each piece and then build a sprite sheet at decode time
|
|
|
|
veluca
|
2025-12-04 09:42:00
|
yep, that would work
|
|
|
AccessViolation_
|
2025-12-04 09:42:09
|
I had the same idea I think
|
|
2025-12-04 09:42:13
|
I don't remember what exactly I wrote
|
|
|
Magnap
|
2025-12-04 09:42:24
|
I still need to read up on the specifics of what you can use as a source when and where, but I think I can leave at least 2 slots available to the user
|
|
|
|
veluca
|
2025-12-04 09:42:44
|
there's 4 saved frames at any point in time
|
|
|
AccessViolation_
|
2025-12-04 09:43:02
|
oh shoot you need to propagate the reference frame?
|
|
|
Magnap
|
|
veluca
there's 4 saved frames at any point in time
|
|
2025-12-04 09:43:04
|
Yeah but you can't really control slot 0
|
|
|
|
veluca
|
2025-12-04 09:43:05
|
any frame that is saved before color transforms can be used for patches
|
|
|
AccessViolation_
|
2025-12-04 09:43:05
|
hadn't thought of that
|
|
|
|
veluca
|
|
Magnap
Yeah but you can't really control slot 0
|
|
2025-12-04 09:43:13
|
why not?
|
|
2025-12-04 09:43:22
|
I mean, it gets used for blending by the encoder
|
|
2025-12-04 09:43:32
|
but it doesn't *have* to
|
|
2025-12-04 09:43:41
|
(IIRC)
|
|
|
Magnap
|
|
veluca
why not?
|
|
2025-12-04 09:43:53
|
IIRC duration-0 frames get automatically saved there? But again, I'm not super clear on the specifics
|
|
|
|
veluca
|
2025-12-04 09:44:03
|
lemme check...
|
|
|
Magnap
|
2025-12-04 09:44:45
|
Something about the can_reference condition, I was using the HTML version of the spec and I killed Firefox to free up RAM π
|
|
|
|
veluca
|
2025-12-04 09:44:48
|
I don't think so
|
|
2025-12-04 09:45:59
|
(of course, jxl-rs *could* be broken here)
|
|
2025-12-04 09:46:12
|
anyway, one sprite sheet is more than enough, I think
|
|
2025-12-04 09:46:27
|
write it in slot 3 or whatever
|
|
2025-12-04 09:46:38
|
and then pick good offsets for copying the sprites π
|
|
|
Magnap
|
2025-12-04 09:49:40
|
Found it!
> Let `can_reference` denote the expression `!is_last and (duration == 0 or save_as_reference != 0) and frame_type != kLFFrame`
> [...]
> If `can_reference`, then the samples of the decoded frame are recorded as Reference[`save_as_reference`] and may be referenced by subsequent frames
|
|
2025-12-04 09:50:41
|
So normal or reference-only frames get saved to slot 0 iff they have duration 0 and you don't save them elsewhere
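The quoted condition is simple enough to transcribe directly. This is just a Python rendering of the spec text above, with `frame_type` as a plain string; the real decoder's fields are obviously richer.

```python
def can_reference(is_last, duration, save_as_reference, frame_type):
    # spec: !is_last and (duration == 0 or save_as_reference != 0)
    #       and frame_type != kLFFrame
    return (not is_last
            and (duration == 0 or save_as_reference != 0)
            and frame_type != "kLFFrame")

# a duration-0 non-last frame is referenceable even with save_as_reference == 0
print(can_reference(False, 0, 0, "kRegularFrame"))   # True
# a non-zero-duration frame must set save_as_reference != 0 to be saved
print(can_reference(False, 10, 0, "kRegularFrame"))  # False
```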
|
|
|
jonnyawsom3
|
|
veluca
write it in slot 3 or whatever
|
|
2025-12-04 09:51:22
|
Maybe not 3 yet ;P https://github.com/libjxl/libjxl/pull/4512
|
|
2025-12-04 09:51:54
|
Wait no I'm dumb, that's if you're doing stuff *other* than patches
|
|
2025-12-04 09:52:56
|
I was actually going to try increasing patch density soon too. Right now it's hardcoded to use the gradient predictor, but patch detection is so slow already, we might as well not handicap the results
|
|
|
|
veluca
|
|
Magnap
So normal or reference-only frames get saved to slot 0 iff they have duration 0 and you don't save them elsewhere
|
|
2025-12-04 09:57:59
|
that's what the encoder does, but not what that condition says π (save_as_reference is encoded unconditionally if is_last is not true, I believe)
|
|
|
Magnap
|
|
veluca
that's what the encoder does, but not what that condition says π (save_as_reference is encoded unconditionally if is_last is not true, I believe)
|
|
2025-12-04 10:00:41
|
And not an LF frame, yes. But doesn't the second paragraph imply that if `!can_reference` then you can't refer to the frame later?
|
|
|
|
veluca
|
2025-12-04 10:02:41
|
it does, indeed
|
|
2025-12-04 10:03:21
|
in that way, slot 0 is special, in that if a frame is not zero duration then it *can't* be saved in slot 0
|
|
|
Magnap
|
2025-12-04 10:05:38
|
But you also can't prevent duration-0 frames (that are not LF frames and are not the last frame) from being saved, right? I guess slot 0 isn't that special in that case, tho, it's just the default
|
|
2025-12-04 10:05:49
|
That's a more useful way of thinking about it than "you can't control slot 0" (if it's correct)
|
|
|
|
veluca
|
2025-12-04 10:10:01
|
yep
|
|
2025-12-04 10:10:20
|
and "the default" doesn't mean much since you can't actually omit specifying where it should be saved
|
|
|
Magnap
|
2025-12-05 08:12:09
|
BTW, anyone got any tips for prefix coding with distribution clustering? I am, uh, not about to write an ANS encoder π
but I figured there might be some nice tricks for deciding when to merge distributions, rather than the somewhat brute force idea I had in mind of "greedily merge the pair of distributions with the smallest Wasserstein distance until the increased symbol signaling cost outweighs the lowered cost of distribution signaling"
|
|
2025-12-05 08:19:34
|
Gotta learn how to construct a Huffman tree first and the Brotli way of coding a prefix code, but that still sounds easier than ANS π
|
|
|
_wb_
|
2025-12-05 08:45:29
|
I think you could take a look at how e1 lossless does it
|
|
|
|
veluca
|
|
_wb_
I think you could take a look at how e1 lossless does it
|
|
2025-12-05 09:23:17
|
no, because it doesn't cluster distributions π
|
|
|
Magnap
BTW, anyone got any tips for prefix coding with distribution clustering? I am, uh, not about to write an ANS encoder π
but I figured there might be some nice tricks for deciding when to merge distributions, rather than the somewhat brute force idea I had in mind of "greedily merge the pair of distributions with the smallest Wasserstein distance until the increased symbol signaling cost outweighs the lowered cost of distribution signaling"
|
|
2025-12-05 09:24:24
|
the algorithm libjxl uses is relatively simple, something similar to the initialization step of kmeans++ to pick candidate centers, then assigning every other distribution to the closest center
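A toy sketch of that shape of algorithm, treating distributions as plain histograms: deterministic farthest-point seeding (a cousin of kmeans++ initialization), then nearest-center assignment. The L1 distance and every detail here are assumptions for illustration, not libjxl's actual implementation.

```python
def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def cluster_histograms(hists, k):
    """Pick k candidate centers by farthest-point seeding, then assign each
    histogram to its nearest center. Returns (center_indices, assignment)."""
    centers = [0]  # seed with the first histogram (arbitrary choice)
    while len(centers) < k:
        # distance from each histogram to its nearest existing center
        d = [min(l1(h, hists[c]) for c in centers) for h in hists]
        far = max(range(len(hists)), key=lambda i: d[i])
        if d[far] == 0:  # fewer than k distinct distributions
            break
        centers.append(far)
    assignment = [min(range(len(centers)), key=lambda j: l1(h, hists[centers[j]]))
                  for h in hists]
    return centers, assignment

hists = [[10, 0, 0], [9, 1, 0], [0, 0, 10], [0, 1, 9]]
print(cluster_histograms(hists, 2))  # the two similar pairs end up together
```

Compared to the greedy pairwise-merge idea, this is O(n·k) distance evaluations instead of repeatedly scanning all pairs.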
|
|
|
Magnap
|
|
veluca
the algorithm libjxl uses is relatively simple, something similar to the initialization step of kmeans++ to pick candidate centers, then assigning every other distribution to the closest center
|
|
2025-12-05 09:28:07
|
Oh sweet, didn't know there was an algorithm for picking good k-means starting points
|
|
|
lonjil
|
2025-12-05 12:13:28
|
<@&807636211489177661>
|
|
|
Jyrki Alakuijala
|
|
AccessViolation_
it would be awesome to eventually have the "export to JPEG XL" option in Lichess, especially if they can be crazy small :3
|
|
2025-12-05 10:52:47
|
I made chess AI games in my early 30s as a night-time hobby -- pychess (rubychess was based on it), and 'Deep Pocket Chess', which later powered Mephisto Mobile Edition for its AI. I even made some pocket money with this hobby
|
|
|
AccessViolation_
|
2025-12-05 10:54:10
|
oh wow!
|
|
2025-12-05 10:54:30
|
I investigated chess programming not too long ago, but never got into it unfortunately
|
|
|
Jyrki Alakuijala
|
2025-12-05 10:55:59
|
the mephisto mobile edition sold 100k+ copies
|
|
2025-12-05 10:56:09
|
in 2003 or so
|
|
|
AccessViolation_
|
2025-12-05 10:57:25
|
was it this?
|
|
|
Jyrki Alakuijala
|
2025-12-05 10:58:01
|
yes, I made the graphics design, too, and drew the 'soft' graphics with colored pencils
|
|
2025-12-05 10:59:21
|
Numeric Garden was the company I had together with Jarkko Oikarinen (the author of IRC)
|
|
2025-12-05 11:00:25
|
I learned a lot about coding when doing chess programming -- I can recommend it as an exercise
|
|
|
AccessViolation_
|
2025-12-05 11:01:00
|
that's some nice history
|
|
|
Jyrki Alakuijala
|
2025-12-05 11:01:30
|
Unlike every other chess program I used 9x8 squares on Nokia's 96x64 pixels black and white display
|
|
2025-12-05 11:02:06
|
it was an un-engineer-like solution; every usual engineer would use 8x8 squares (64 pixels per piece), because otherwise a square is not square
|
|
2025-12-05 11:02:50
|
I understood that if I use odd squares, I get much more expressivity, since most pieces need to have mirror symmetry -- I could have the central axis one pixel wide this way
|
|
2025-12-05 11:03:09
|
and my b&w design was thus far far superior to any competition
|
|
|
AccessViolation_
|
2025-12-05 11:06:29
|
oh you're talking about the size of the tiles in pixels, I thought you meant the chessboard was 9x8 tiles and was very confused
|
|
2025-12-05 11:06:58
|
that's neat
|
|
|
Jyrki Alakuijala
|
2025-12-05 11:07:15
|
yes, 8x8 board, but 9x8 "squares"
|
|
2025-12-05 11:07:24
|
https://mobile.phoneky.com/games/?id=j4j88342
|
|
2025-12-05 11:08:34
|
Mephisto Chess M.E. Java Game -- Disney Mobile was selling it at a time
|
|
2025-12-05 11:09:42
|
looks like someone has pirated it here -- if there is a working J2ME emulator somewhere to try it -- https://javagamephone.blogspot.com/2009/02/mephisto-chess-me.html
|
|
|
AccessViolation_
|
2025-12-05 11:10:19
|
once again piracy proves useful for game preservation :p
|
|
|
Jyrki Alakuijala
|
2025-12-05 11:10:28
|
the B&W version was ~30 KB
|
|
|
AccessViolation_
|
2025-12-05 11:11:17
|
it would be interesting to recreate the algorithm and see how it fares against today's chess engines at different levels of 'difficulty'
|
|
2025-12-05 11:13:00
|
if I can find an emulator I can exchange moves between it and another engine
|
|
|
Jyrki Alakuijala
|
2025-12-05 11:14:08
|
heh
|
|
2025-12-05 11:14:26
|
I think it can be a ~1600 level player
|
|
|
AccessViolation_
|
2025-12-05 11:14:40
|
oh that's not bad at all
|
|
|
Jyrki Alakuijala
|
2025-12-05 11:14:57
|
it was the best chess AI for those kind of phones
|
|
|
AccessViolation_
|
2025-12-05 11:15:22
|
how did you have enough compute for this?
|
|
|
Jyrki Alakuijala
|
2025-12-05 11:15:37
|
well, it didn't have much
|
|
2025-12-05 11:16:05
|
on the 6310i it ran 1 ply, and extended up to 3 ply for difficult situations, so the evaluator needed to be somewhat balanced
|
|
2025-12-05 11:16:23
|
(by default skill level)
|
|
2025-12-05 11:17:02
|
on the s60 phones you could run 3-5 ply search, but it still depended a lot on the sensibility built into the evaluator
|
|
2025-12-05 11:17:14
|
those phones didn't have JIT
|
|
2025-12-05 11:17:27
|
JIT was too expensive so they just didn't do it
|
|
2025-12-05 11:17:35
|
making J2ME a terrible terrible idea
|
|
2025-12-05 11:18:25
|
non-jitted Java was like a factor of 50 slowdown on those ~20 MHz slow thingies, essentially making every J2ME phone a C64 (or slower) in speed
|
|
2025-12-05 11:19:29
|
and to fit things in 30 kB one would end up doing "Java" programming with basic types, like int arrays instead of classes, and use c preprocessor to add some level of abstraction
|
|
2025-12-05 11:20:40
|
we used the C preprocessor before Java, really -- in our brochure we called it 'Java NanoBeans' or something like that
|
|
|
AccessViolation_
|
2025-12-05 11:24:41
|
that sounds delightfully cursed
|
|
2025-12-05 11:28:48
|
this is the video series that got me interested in chess programming
https://www.youtube.com/playlist?list=PLFt_AvWsXl0cvHyu32ajwh2qU1i6hl77c
|
|
2025-12-05 11:31:29
|
I might pick it up again eventually π I still have the code for the board and everything
|
|
|
jonnyawsom3
|
2025-12-06 11:41:41
|
Reminds me of the HDR flashbangs we discussed here a long time ago
https://x.com/i/status/1997088673645686925
|
|
|
dogelition
|
|
Reminds me of the HDR flashbangs we discussed here a long time ago
https://x.com/i/status/1997088673645686925
|
|
2025-12-06 12:24:54
|
i pulled the icc profile in <#1288790874016190505> from that exact image lol, it came up on my fyp
|
|
|
AccessViolation_
|
|
Reminds me of the HDR flashbangs we discussed here a long time ago
https://x.com/i/status/1997088673645686925
|
|
2025-12-06 12:49:28
|
this is so funny
|
|
2025-12-06 12:51:28
|
I welcome a future of retina scorching HDR memes to replace eardrum ripping audio clipping memes
|
|
|
Magnap
|
|
AccessViolation_
speculating, but I feel like improvements would be minimal, since higher effort features usually enable more destructive coding tools like merging 8x8 blocks into larger ones. something that could work is higher effort entropy coding, a better attempt at modular encoding the LF frame and other modular sub-bitstreams, and coefficient reordering, like replacing the zigzag pattern with a circular pattern (https://discord.com/channels/794206087879852103/794206170445119489/1445159505536090152)
|
|
2025-12-06 01:04:42
|
https://github.com/libjxl/libjxl-tiny/blob/main/doc/coding_tools.md libjxl-tiny output looks like it would provide good material for testing "losslessly recompress a JXL image by applying more effort"
|
|
2025-12-06 01:09:14
|
if that sort of tool exists for other formats, what do they call it? I know PNG has `pngopt`, `jxlopt` sounds like a good name to me
|
|
|
AccessViolation_
|
2025-12-06 01:09:44
|
`jxlopt` sounds nice yeah
|
|
|
dogelition
|
|
Magnap
if that sort of tool exists for other formats, what do they call it? I know PNG has `pngopt`, `jxlopt` sounds like a good name to me
|
|
2025-12-06 01:14:53
|
~~write it in rust and call it oxijxl~~
|
|
|
AccessViolation_
|
2025-12-06 01:21:38
|
I think using a png optimizer naming scheme might imply to people that it's for lossless jxl specifically
|
|
2025-12-06 01:22:22
|
which cjxl can already do, `cjxl input-lossless.jxl output-lossless.jxl -d 0 [good params]`
|
|
2025-12-06 01:25:18
|
if these tools exist eventually, it'd be nice if they were part of the main encoder CLI. just like how cjxl implicitly does lossless JPEG recompression, why not also do lossless optimization of lossy jxl if the input is VarDCT JXL and the distance is 0
|
|
2025-12-06 01:27:00
|
but I'm not writing it in C that's for sure <:KekDog:805390049033191445>
eagerly and patiently awaiting the first commits to the encoder side of jxl-rs so I can start tinkering with it
|
|
2025-12-06 01:34:23
|
petition to create a lossless un-optimizer that turns ANS into huffman with randomized trees, randomizes the coefficient order, etc
|
|
|
Magnap
|
2025-12-06 01:37:53
|
for the "add a frame that imports this already encoded frame to the sprite sheet" API I need to decode and then encode the frame header so I can change it to support animation, and it was so much work to write the image header type in a way where you can only represent valid headers. I might just decode with jxl-oxide and then write the modified frame header immediately without ever having my own type for it. then I can have a much simpler frame header type for the type of frames I am planning on allowing users to build
|
|
|
AccessViolation_
|
2025-12-06 02:12:05
|
the ideal coefficient order
|
|
2025-12-06 02:13:33
|
although: it's symmetrical along the diagonal, so if you signal the diagonal later or earlier you might be able to LZ77 away roughly half of it. not sure if LZ77 can be used but I'll look into it
|
|
|
Magnap
|
|
AccessViolation_
the ideal coefficient order
|
|
2025-12-06 02:15:57
|
is that "in order of increasing quantization coefficient"?
|
|
|
AccessViolation_
|
2025-12-06 02:16:39
|
yeah, and because there are many repeating values that means those will all show up next to each other which would benefit compression
|
|
|
Magnap
|
2025-12-06 02:20:19
|
btw <@384009621519597581> idk if you have a searchable version of the specs but https://canary.discord.com/channels/794206087879852103/1021189485960114198/1169087534060556339 is very neat although ofc not the ~~True Name~~ actual final ISO standard
|
|
|
AccessViolation_
|
2025-12-06 02:29:02
|
ah I don't have one, thanks
|
|
2025-12-06 02:29:27
|
π
the blue and purple parts are identical
|
|
2025-12-06 02:32:23
|
this example is the default spec-included quant table so presumably the order is spec-defined as well and doesn't need to be signaled at all, but for other diagonally symmetric tables this could be a good order (again, if LZ77 is available for this)
|
|
2025-12-06 02:33:54
|
lemme check the spec
|
|
|
Magnap
|
|
AccessViolation_
this example is the default spec-included quant table so presumably the order is spec-defined as well and doesn't need to be signaled at all, but for other diagonally symmetric tables this could be a good order (again, if LZ77 is available for this)
|
|
2025-12-06 02:33:56
|
you signal a permutation (of the default order, I believe, but not very confidently) in a way I haven't read up on yet
|
|
|
AccessViolation_
|
2025-12-06 02:36:54
|
yeah, they're ordered using a Lehmer code. I haven't read up on how it works yet
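The basic idea behind a Lehmer code is small enough to sketch. This is the textbook construction; how the spec actually packs it into the bitstream is a separate matter.

```python
def lehmer_encode(perm):
    # code[i] = how many elements after position i are smaller than perm[i]
    return [sum(x < v for x in perm[i + 1:]) for i, v in enumerate(perm)]

def lehmer_decode(code):
    pool = list(range(len(code)))  # remaining symbols, in increasing order
    return [pool.pop(c) for c in code]

perm = [3, 0, 2, 1]
print(lehmer_encode(perm))                 # [3, 0, 1, 0]
print(lehmer_decode(lehmer_encode(perm)))  # [3, 0, 2, 1]
```

A permutation close to the default order produces a code full of zeros and small values, which is what makes a near-default coefficient order cheap to signal.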
|
|
2025-12-06 02:38:55
|
I'm curious if this order would offset the cost of signaling it
|
|
|
Quackdoc
|
|
Reminds me of the HDR flashbangs we discussed here a long time ago
https://x.com/i/status/1997088673645686925
|
|
2025-12-06 03:12:12
|
*just as the author intended* π
|
|
|
AccessViolation_
|
|
AccessViolation_
π
the blue and purple parts are identical
|
|
2025-12-06 03:22:05
|
it might be better for LZ77 if the repeating segments aren't interrupted by other data, so this
|
|
2025-12-06 03:39:08
|
actually, if this order is reused for the coefficients themselves as well this might not be good
|
|
|
Magnap
|
|
AccessViolation_
actually, if this order is reused for the coefficients themselves as well this might not be good
|
|
2025-12-06 03:40:51
|
wait, what do you mean by coefficient order then?
|
|
2025-12-06 03:41:03
|
I thought that's what it meant π
|
|
|
AccessViolation_
|
2025-12-06 03:41:29
|
ah I said coefficient order before, I meant the quant table itself
|
|
|
Magnap
|
2025-12-06 03:41:37
|
if you're talking about how the quant tables are encoded, they're either little Modular images, or, smaller, parameterized, or even smaller, all parameters are default
|
|
|
AccessViolation_
|
2025-12-06 03:42:11
|
yeah, I'm talking about "raw" quant tables that happen to be symmetrical
|
|
2025-12-06 03:47:08
|
ah I was mixing things up, the table itself cannot be reordered before modular encoding it seems. just the coefficients can be reordered
|
|
|
Tirr
|
2025-12-06 04:05:00
|
yeah raw quant table is just ordinary single channel Modular image
|
|
2025-12-06 04:05:33
|
which means you can squeeze the quant table if you really want <:KekDog:805390049033191445>
|
|
|
AccessViolation_
|
2025-12-06 04:06:46
|
can I put splines in there :3c
|
|
|
Magnap
|
|
AccessViolation_
can I put splines in there :3c
|
|
2025-12-06 04:07:08
|
nope, frame-level feature
|
|
|
Tirr
|
2025-12-06 04:07:09
|
unfortunately splines are image features, not a part of Modular sub-bitstream
|
|
|
AccessViolation_
|
2025-12-06 04:07:19
|
sobbing
|
|
|
Magnap
|
2025-12-06 04:07:47
|
mfw I can't have photon noise in my quant tables
|
|
|
Tirr
|
2025-12-06 04:07:48
|
but you can do delta-palette
|
|
|
AccessViolation_
|
2025-12-06 08:03:20
|
https://www.leedavison.me/Files/DCT/DCTvis.html
|
|
2025-12-06 08:07:25
|
you can scroll on cells to get values between the min and max
|
|
2025-12-06 08:49:13
|
https://www.youtube.com/watch?v=-NzfTE8RI4w
this microscope can create 3D exports of surfaces by shining a light from 6 different angles, taking monochrome pictures and looking at how the shadows are laid out. JPEG XL would be a good format for these exports, you could store the 6 monochrome layers and even add the depth map after it's been computed
|
|
2025-12-09 06:53:14
|
I had the idea to try to make a URL shortener that cannot possibly go down, that works by locally compressing/decompressing URLs. other people could clone the code and set up their own frontends for the same compression and decompression logic, so as long as the code exists (and ideally someone is hosting it) it'll keep "working" forever. there's no risk of link rot in the traditional sense
|
|
2025-12-09 06:54:11
|
I also realize it wouldn't work nearly as well as traditional link shorteners that map a short ID to a URL
|
|
2025-12-09 06:58:22
|
my first thought was to collect a representative database of URLs and derive a zstd or brotli dictionary from those, but those compressors still need to be able to represent *any* data while URLs have some formatting invariants that they would never have to represent (maybe I'll do this aggressively, I'd be okay if some weird subset of URLs could not be encoded, if that means the larger subset can be compressed more efficiently. for example limiting it to HTTP/HTTPS URLs only)
|
|
2025-12-09 07:21:30
|
that's the idea, I don't know if I'm going to get this into a working state but compressing URLs specifically seems like a nice exercise to experiment with context modeling
|
|
|
Adrian The Frog
|
2025-12-09 08:11:00
|
afaik any url can be represented by ascii
|
|
2025-12-09 08:11:24
|
If that's the issue
|
|
2025-12-09 08:13:05
|
With punycode
|
|
2025-12-09 09:26:17
|
Lowercase ascii actually
|
|
|
Meow
|
|
AccessViolation_
I had the idea to try to make a URL shortener that cannot possibly go down, that works by locally compressing/decompressing URLs. other people could clone the code and setup their own frontends for the same compression and decompression logic, so as long as the code exists (and ideally someone is hosting it) it'll keep "working" forever. there's no risk of link rot in the traditional sense
|
|
2025-12-10 02:16:14
|
Some services like X use local URL shortener
|
|
|
AccessViolation_
|
2025-12-10 04:11:41
|
they're "local" in a different sense. they're in-house, but still rely on a database of shortened URLs
|
|
2025-12-10 04:13:11
|
my proposed idea is "local" in that it doesn't need a database of URLs. you can turn a short URL into the original without internet access
|
|
|
monad
|
2025-12-11 06:01:59
|
Could it possibly have general utility? You'd not only have to share the compressed URL, but share the decompression method somehow. Maybe it only works in some niche where community norms can be established. Anyway, it sounds intriguing even as a gimmick. I would use it.
|
|
|
|
ignaloidas
|
2025-12-11 09:06:26
|
https://www.youtube.com/watch?v=KSvjJGbFCws very cool camera
|
|
2025-12-11 09:08:24
|
I wonder how well could a JXL encoder be made to fit such a camera - encoding line by line is a bit more annoying because you'd have to have several groups in progress
|
|
|
jonnyawsom3
|
2025-12-11 09:15:06
|
The current hardware encoder handles 4096 x 4096 per unit, but can be parallelized and combined at the end. So it should be possible
|
|
|
AccessViolation_
|
|
monad
Could it possibly have general utility? You'd not only have to share the compressed URL, but share the decompression method somehow. Maybe it only works in some niche where community norms can be established. Anyway, it sounds intriguing even as a gimmick. I would use it.
|
|
2025-12-11 09:19:09
|
the compression/decompression method is provided in the website frontend. similarly to the user experience of typical URL shorteners: go to `tinyurl.com`, enter your URL: `https://archive.org/download/dominos-miku-1.15/dominos-miku-1.15_archive.torrent` and turn it into `https://tinyurl.com/4bmeu2cb`. then people visit `https://tinyurl.com/4bmeu2cb` and it redirects to the original
for my thing, you would go to some domain `shortener.com` and enter the same url, it'll be compressed to `shortener.com/oUACQKkIlQ7ltr4WDhbWskWQIfM=`.
that segment contains the whole URL, compressed, so if people visit that short URL it'll be decompressed locally and redirect you to the result
|
|
|
|
ignaloidas
|
|
The current hardware encoder handles 4096 x 4096 per unit, but can be parallelized and combined at the end. So it should be possible
|
|
2025-12-11 09:42:07
|
this takes images in ~40000px tall, 2px wide lines, so it's very much out of the range of simple HW encoders. The full raw image from this camera is around 19GB, so a streaming encode for taking the image is basically a requirement
|
|
2025-12-11 09:44:04
|
I think the annoying part is that you'd have to hold at least 40 groups in progress before finally writing them into the file, that's a non-trivial amount of memory usage as well
|
|
2025-12-11 09:47:12
|
It's briefly touched upon in the video at around 19:10 mark, but only looks at BMP and PNG
|
|
|
AccessViolation_
|
|
ignaloidas
https://www.youtube.com/watch?v=KSvjJGbFCws very cool camera
|
|
2025-12-11 10:06:36
|
could've saved a bunch of time if they just put the whole scanner in the back of a massive box with a single lens hole <:KekDog:805390049033191445>
|
|
|
|
ignaloidas
|
|
AccessViolation_
could've saved a bunch of time if they just put the whole scanner in the back of a massive box with a single lens hole <:KekDog:805390049033191445>
|
|
2025-12-11 10:09:53
|
there were attempts to kinda do just that, but there's issues with that approach https://petapixel.com/2014/12/29/medium-format-camera-made-using-parts-epson-scanner/
|
|
|
AccessViolation_
|
2025-12-11 10:21:06
|
I was just thinking, you could potentially keep the sensor still and use a traditional camera shutter with a dual curtain. the horizontal resolution will be that of the sensor, the vertical resolution will be determined by how narrow the shutter gap is (or rather, there will be a blur applied that has the width of the shutter gap)
|
|
2025-12-11 10:21:35
|
though any traditional shutter is obviously way too fast for the readout speeds of these sensors
|
|
2025-12-11 10:23:23
|
not sure if this is necessarily any better
|
|
|
|
ignaloidas
|
2025-12-11 10:35:58
|
Not really? If the sensor doesn't move it will stay at the same position in the lens's projected image, so no matter what the shutter does it can only gather information about one particular part of the image
|
|
|
AccessViolation_
|
2025-12-11 10:38:42
|
think about it like this: if you focus all light onto a single point, the entire image is merged into that point. if you then slide a lid with a slit over the lens, the point will see different parts of the image at different times
|
|
|
|
ignaloidas
|
2025-12-11 12:10:22
|
that's uhh, not how lenses for cameras work
|
|
|
AccessViolation_
|
2025-12-11 12:14:42
|
well yeah, but this isn't a normal camera
|
|
|
|
ignaloidas
|
2025-12-11 12:14:46
|
camera lenses work essentially as camera obscura, just with passing more light with a larger aperture and no image flipping
|
|
2025-12-11 12:16:08
|
and I'm generally not aware of a lens system that would focus light from different angles into a single point, because you essentially never want that
|
|
|
AccessViolation_
|
2025-12-11 12:17:27
|
oh wait I see the problem now
|
|
|
|
ignaloidas
|
2025-12-11 12:19:54
|
also, blocking light deeper in the optical path will often result in counterintuitive results https://www.youtube.com/shorts/hgjK57kYGks
|
|
2025-12-11 12:20:39
|
the old type of shutters only work in an intuitive way because they are close to the image plane
|
|
|
monad
|
|
AccessViolation_
the compression/decompression method is provided in the website frontend. similarly to the user experience of typical URL shorteners: go to `tinyurl.com`, enter your URL: `https://archive.org/download/dominos-miku-1.15/dominos-miku-1.15_archive.torrent` and turn it into `https://tinyurl.com/4bmeu2cb`. then people visit `https://tinyurl.com/4bmeu2cb` and it redirects to the original
for my thing, you would go to some domain `shortener.com` and enter the same url, it'll be compressed to `shortener.com/oUACQKkIlQ7ltr4WDhbWskWQIfM=`.
that segment contains the whole URL, compressed, so if people visit that short URL it'll be decompressed locally and redirect you to the result
|
|
2025-12-11 12:23:17
|
In general, when the short URL rots there must be some signal for a random passerby that the payload is recoverable. Or am I misunderstanding the point?
|
|
|
AccessViolation_
|
2025-12-11 12:27:16
|
right, if you're unfamiliar with the project and the short URL doesn't resolve it won't help. but it's still recoverable by those that know they can replace the domain by a different one hosting the same thing. so information isn't lost, but it could still be impractical if the site hosting it goes down
|
|
|
monad
|
2025-12-11 12:30:27
|
It's at least slightly more transparent for digital archaeologists.
|
|
|
AccessViolation_
|
2025-12-11 12:31:54
|
there are people actively trying to save us from link rot when the URL shortening service goes down: https://wiki.archiveteam.org/index.php/URLTeam
this wouldn't be necessary for this
|
|
|
jonnyawsom3
|
|
ignaloidas
this takes images in ~40000px tall, 2px wide lines, so it's very much out of the range of simple HW encoders. The full raw image from this camera is around 19GB, so a streaming encode for taking the image is basically a requirement
|
|
2025-12-11 12:37:10
|
Well, good news is we already have a streaming encoder
|
|
|
Magnap
|
|
ignaloidas
and I'm generally not aware of a lens system that would focus light from different angles into a single point, because you essentially never want that
|
|
2025-12-11 12:40:15
|
Isn't that exactly what a camera obscura / pinhole camera does at the pinhole?
|
|
|
|
ignaloidas
|
|
Magnap
Isn't that exactly what a camera obscura / pinhole camera does at the pinhole?
|
|
2025-12-11 12:41:31
|
subject to sensor's directional light sensitivity, but I guess yes
|
|
2025-12-11 12:42:03
|
most sensors don't react great to light that's really off-axis
|
|
|
Magnap
|
2025-12-11 12:43:45
|
Ah, I wasn't thinking about a sensor at all, just the old-fashioned "box/room with a tiny hole in the side"
|
|
|
|
ignaloidas
|
2025-12-11 01:02:08
|
though also, now that I've thought about it a bit, with that approach you're limited to the amount of light that falls on an area the size of the sensor, while most lenses gather light proportional to the size of the first lens in the system.
|
|
|
AccessViolation_
|
2025-12-11 04:57:48
|
I'm thinking about ANS with adaptive probabilities, and it doesn't really matter what causes them to adapt, does it? so long as the encoder and decoder agree, it can be any signal based on the previous data
|
|
2025-12-11 05:00:59
|
if so, you could probably retrofit the neural network from some (relatively) tiny LLM and bake it into an ANS compressor. then say the probability distribution for the next token is decided by that neural network
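The key property is that any deterministic function of the already-decoded history works as the model, neural or otherwise. A minimal sketch (the `AdaptiveModel` name and Laplace-smoothed counts are illustrative, not from any real codec): both sides run the same update rule, so they produce identical distributions at every step without any side-channel.

```python
# The entropy coder (ANS, arithmetic, ...) only needs a probability
# distribution for the next symbol. If that distribution is a deterministic
# function of the history, encoder and decoder stay in sync automatically.
class AdaptiveModel:
    def __init__(self, alphabet_size):
        self.counts = [1] * alphabet_size  # Laplace-smoothed symbol counts

    def probs(self):
        total = sum(self.counts)
        return [c / total for c in self.counts]

    def update(self, symbol):  # called after each symbol on both sides
        self.counts[symbol] += 1

enc, dec = AdaptiveModel(4), AdaptiveModel(4)
for sym in [0, 2, 2, 1, 2]:
    assert enc.probs() == dec.probs()  # identical at every step
    enc.update(sym)
    dec.update(sym)  # decoder updates with the symbol it just decoded
```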
|
|
|
|
ignaloidas
|
2025-12-11 05:28:58
|
tiny large language model is a bit of an oxymoron
|
|
2025-12-11 05:29:44
|
but the architecture could work, question is how useful you can make it
|
|
|
spider-mario
|
2025-12-11 05:31:37
|
MLM: ~~multi-level marketing~~ medium language model
|
|
|
jonnyawsom3
|
2025-12-11 05:54:49
|
Little Language Model (it has no need for language, you want a neural network)
|
|
|
AccessViolation_
|
2025-12-11 06:20:09
|
it sounds adorable if anything
|
|
|
lonjil
|
2025-12-11 06:46:22
|
by using a semantic embedding, you can do lossy compression on text
|
|
|
HCrikki
|
2025-12-11 06:46:48
|
https://web.dev/blog/upvote-features
upvote whatever you as a (web?)dev or enduser would want interoperable. where possible, elaborating on how they'd improve or change your experience would be more informative than a silent like
|
|
|
_wb_
|
|
lonjil
by using a semantic embedding, you can do lossy compression on text
|
|
2025-12-11 06:59:38
|
Basically the same as "summarize this text" (or "summarize and then expand again")
|
|
|
lonjil
|
2025-12-11 07:00:22
|
yesish, I bet using an actual semantic encoding would work a bit better than that
|
|
2025-12-11 07:01:55
|
e.g. Facebook's SONAR embedding is pretty good at exactly representing a sentence, and if you disturb the vector a little, you get a different sentence back out that has a veeery similar meaning.
|
|
2025-12-11 07:03:56
|
It even manages to appropriately represent idioms and cultural references, such that if you use the model trained on English to get a sentence vector, and then use the model trained on e.g. Hebrew to convert from semantic vector space back into sentence form, it'll *usually* spit something out that is appropriately comprehensible in a Hebrew cultural context.
|
|
|
AccessViolation_
|
2025-12-11 10:41:33
|
I trained a zstd dictionary on 3500 URLs from a dump of URLs that were actually put into URL shorteners. the results are definitely better than without a dictionary, and definitely shorter than the originals, but not short enough to be useful
|
|
2025-12-11 10:43:15
|
I'm going to have to implement my own context model which ramps up the complexity by several orders of magnitude
|
|
2025-12-11 10:46:56
|
I have some ideas though. instead of encoding from left to right `news.google.com`, it's better to start with the main domain `google`, which for well-known domains significantly limits the options of likely subdomains, and also TLDs. if you start with `news` instead, it could be followed by anything, the probabilities are all over the place
|
|
2025-12-11 10:59:13
|
I can also undo base64 and percent-encoding to get the original text or data before compressing
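A sketch of that round trip with Python's `urllib.parse`. The re-encoding step is the part that needs signaling: `quote` with a chosen `safe` set only reproduces the original byte-for-byte for well-behaved URLs like this one, so a real coder would have to record deviations.

```python
from urllib.parse import quote, unquote

encoded = "https://en.wikipedia.org/wiki/Percent-encoding#%C3%9Cbersicht"
decoded = unquote(encoded)          # raw UTF-8 text, denser to model
assert decoded.endswith("Übersicht")

# re-encode for output; safe=":/#" keeps the URL structure characters as-is
reencoded = quote(decoded, safe=":/#")
assert reencoded == encoded         # exact round trip for this example
```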
|
|
|
TheBigBadBoy - πΈπ
|
2025-12-11 11:01:35
|
what about [CMIX](https://github.com/byronknoll/cmix) with custom dict ?
since the data to compress is really short, I don't see a problem to use really slow compression
|
|
|
lonjil
|
2025-12-11 11:02:30
|
train an llm on urls
|
|
|
AccessViolation_
|
2025-12-11 11:03:06
|
this doesn't seem right. zstd put whole URLs in the dictionary, including segments that only appeared once
|
|
2025-12-11 11:03:26
|
it did warn the dictionary wasn't optimal
|
|
|
lonjil
|
2025-12-11 11:03:40
|
how big is the dictionary?
|
|
|
AccessViolation_
|
2025-12-11 11:03:51
|
3500 files, one url per file
|
|
|
TheBigBadBoy - πΈπ
|
|
AccessViolation_
this doesn't seem right. zstd put whole URLs in the dictionary, including segments that only appeared once
|
|
2025-12-11 11:03:55
|
> Small-Penis-Humiliation
π
|
|
|
AccessViolation_
|
2025-12-11 11:03:55
|
oh, the dictionary
|
|
2025-12-11 11:04:10
|
112 KB
|
|
|
TheBigBadBoy - πΈπ
> Small-Penis-Humiliation
π
|
|
2025-12-11 11:05:07
|
oh god I didn't see that
|
|
2025-12-11 11:05:40
|
the files themselves are 14 MB, combined
|
|
|
lonjil
|
2025-12-11 11:05:46
|
ah, the default dictionary size
|
|
|
TheBigBadBoy - πΈπ
|
2025-12-11 11:06:02
|
btw you'll perhaps win a byte if you don't give it an extra newline by using `echo -n`
|
|
|
AccessViolation_
|
2025-12-11 11:06:22
|
yeah maybe, I couldn't be bothered to figure out how to do that
|
|
2025-12-11 11:06:37
|
probably does help
|
|
|
lonjil
|
2025-12-11 11:06:47
|
that's 32 bytes of dictionary per input URL. Once similarities between the URLs are factored out, there's plenty of room for unique stuff to fit in there
|
|
|
AccessViolation_
|
2025-12-11 11:07:38
|
it saved four bytes!
|
|
|
lonjil
that's 32 bytes of dictionary per input URL. Once similarities between the URLs are factored out, there's plenty of room for unique stuff to fit in there
|
|
2025-12-11 11:07:50
|
oh hmm, makes sense
|
|
2025-12-11 11:08:12
|
also I assume these zstd blobs include a bunch of headers
|
|
2025-12-11 11:08:28
|
some of which could maybe be implicit
|
|
|
lonjil
|
2025-12-11 11:08:36
|
yes
|
|
2025-12-11 11:08:50
|
though tbh the headers are pretty small
|
|
2025-12-11 11:08:57
|
btw what's your goal here?
|
|
|
AccessViolation_
|
2025-12-11 11:09:37
|
I'm creating a URL shortener that stores the original entirely in the short URL using compression, rather than using a database of IDs mapped to the originals
|
|
|
lonjil
|
|
AccessViolation_
|
2025-12-11 11:10:23
|
I'm anticipating it won't be nearly as good in terms of space savings, this is for experimentation :)
|
|
|
lonjil
|
2025-12-11 11:10:34
|
try <https://github.com/siara-cc/Unishox2> and see how well it works
|
|
|
AccessViolation_
|
2025-12-11 11:13:00
|
I'll check out CMIX and that tomorrow
|
|
2025-12-11 11:13:06
|
they look promising
|
|
|
lonjil
try <https://github.com/siara-cc/Unishox2> and see how well it works
|
|
2025-12-11 11:17:38
|
actually this makes sense because things in the same language will be in the same 'region' of unicode and as such will have a significant amount of overlap in their symbols' bits
|
|
2025-12-11 11:18:26
|
I assume that's what they mean with delta encoding, encoding the numerical difference between symbols
|
|
|
lonjil
|
2025-12-11 11:19:07
|
keep in mind that URLs generally use a special ASCII encoding for unicode characters
|
|
|
AccessViolation_
|
2025-12-11 11:20:42
|
yeah, they use [percent encoding](<https://en.wikipedia.org/wiki/Percent-encoding>)
|
|
2025-12-11 11:21:16
|
I think I can detect that and decode it before compressing, then signal that it should be re-encoded like that during decompression
|
|
|
lonjil
|
2025-12-11 11:21:17
|
oh wait wait no right URLs are more than just domain names :)
|
|
|
AccessViolation_
|
2025-12-11 11:22:38
|
though I've also seen URLs with non-ascii unicode characters encoded as they are, I think?
|
|
|
lonjil
|
2025-12-11 11:23:03
|
they get punycode encoded, at least before they're used by the browser
|
|
|
AccessViolation_
|
2025-12-11 11:41:18
|
I have gigabytes of URLs scraped from actual URL shorteners but they're in broken xz files :/
|
|
2025-12-11 11:41:40
|
I wonder if I'm doing something wrong or if there was some truth to "xz is not a good format for long-term archival"
|
|
|
lonjil
|
|
TheBigBadBoy - πΈπ
|
2025-12-11 11:50:28
|
little test I made```sh
$ printf 'https://discord.com/channels/794206087879852103/794206087879852106' > a.txt
$ cmix -c a.txt a.cmix
Detected block types: DEFAULT: 100.0%
66 bytes -> 54 bytes in 4.29 s.
cross entropy: 6.545
$ find a.* -printf '%s %p\n' | sort -n
34 a.txt.unishox2
37 a.txt.unishox1
42 a.txt.unishox3_alpha
48 a.txt.br
54 a.txt.cmix
65 a.txt.lz4
66 a.txt
69 a.txt.zst
70 a.txt.gz
90 a.txt.bz2
124 a.txt.xz
162 a.txt.7z```
|
|
2025-12-11 11:52:52
|
all without custom dict ofc
CMIX is very powerful, but symmetric compression so it took 4.3s too to decompress that <:KekDog:805390049033191445>
|
|
|
AccessViolation_
|
2025-12-11 11:52:54
|
with the zstd dictionary that becomes 48 bytes
|
|
2025-12-11 11:53:51
|
4.3 seconds is crazy haha
|
|
|
TheBigBadBoy - πΈπ
|
2025-12-11 11:54:28
|
indeed [β ](https://cdn.discordapp.com/emojis/867794291652558888.webp?size=48&name=dogelol)
I have version 19, there's now version 21 in the git repo
so now it should be a little faster but idk by how much
|
|
|
AccessViolation_
|
2025-12-11 11:54:29
|
`echo -n "https://discord.com/channels/794206087879852103/794206087879852106" | zstd -c -D ../dict/zstd-dict --ultra --no-dictID | wc -c`
|
|
2025-12-11 11:56:15
|
interestingly brotli for me often does better with its spec-defined dictionary
|
|
2025-12-11 11:58:26
|
I think URLs can get a lot smaller with context modeling tuned specifically for URLs though
|
|
|
TheBigBadBoy - πΈπ
|
|
AccessViolation_
`echo -n "https://discord.com/channels/794206087879852103/794206087879852106" | zstd -c -D ../dict/zstd-dict --ultra --no-dictID | wc -c`
|
|
2025-12-12 12:04:54
|
note that using `--ultra` here does not do anything
> unlocks high compression levels 20+ (maximum 22)
default level is `-3`, and other levels give different output lengths:```$ parallel -k 'printf {#}\\t; echo -n "https://discord.com/channels/794206087879852103/794206087879852106" | zstd -c -D zstd-dict --ultra -{} --no-dictID | wc -c' ::: {1..22}
1 50
2 48
3 48
4 48
5 48
6 47
7 47
8 47
9 47
10 47
11 47
12 47
13 49
14 49
15 49
16 49
17 49
18 49
19 49
20 49
21 49
22 49```
|
|
2025-12-12 12:05:22
|
so by using levels 6~12 you win a byte for this specific URL
|
|
|
AccessViolation_
|
2025-12-12 12:05:49
|
oh huh
|
|
|
TheBigBadBoy - πΈπ
|
2025-12-12 12:07:03
|
at least here it's easy to bruteforce
|
|
2025-12-12 12:07:16
|
doesn't take that much time
|
|
|
A homosapien
|
2025-12-12 04:59:56
|
I wonder how brotli would fare
|
|
|
jonnyawsom3
|
|
Reminds me of the HDR flashbangs we discussed here a long time ago
https://x.com/i/status/1997088673645686925
|
|
2025-12-12 08:09:13
|
https://fixupx.com/i/status/1999267124376690898
|
|
2025-12-12 08:09:43
|
Nice to see people using tonemapping
|
|
|
Tirr
|
2025-12-12 08:38:53
|
I happened to memorize 58%PQ and 75%HLG while writing jxl-oxide
|
|
|
|
ignaloidas
|
2025-12-12 09:03:21
|
For that URL shortener thing, if you want there's all of the goo.gl links archived in here (around 7B links total), should be pretty good for testing how much you can squeeze and for tuning the dictionary https://archive.org/details/archiveteam_googl
|
|
|
AccessViolation_
|
2025-12-12 09:51:38
|
I checked a bunch of archiveteam exports and they're not available for download, just like that one
|
|
2025-12-12 09:52:06
|
I found one that was accessible, the one with the xz archives that I had trouble decompressing
|
|
2025-12-12 09:52:22
|
though my old file manager seems to be able to handle them
|
|
2025-12-12 10:02:02
|
I was too lazy to implement a streaming parser so let's see what happens when I try to open and parse a 5.5 GB file of URLs
|
|
2025-12-12 10:04:23
|
```
called `Result::unwrap()` on an `Err` value: Error { kind: InvalidData, message: "stream did not contain valid UTF-8" }
```
ugh, I guess some URL shorteners don't care about input validation
|
|
2025-12-12 10:05:03
|
I skimmed the dataset and also saw some *massive* entries of people shortening huge base64 blobs, which they were able to do because they led with "http://"
|
|
|
|
ignaloidas
|
|
AccessViolation_
I checked a bunch of archiveteam exports and they're not available for download, just like that one
|
|
2025-12-12 10:05:32
|
ah, wait, there's another source, checking IRC logs I found this command to get a whole bunch of archive.org torrents to get stuff from periodic dumps
`curl 'https://archive.org/services/search/v1/scrape?q=subject:terroroftinytown&count=10000' | jq -r '.items[].identifier | ("https://archive.org/download/" + . + "/" + . + "_archive.torrent")'`
|
|
|
AccessViolation_
|
2025-12-12 10:36:13
|
I'm second-guessing my decision to create 41 million tiny files in a folder
|
|
2025-12-12 10:37:36
|
surely my SSD is fine with this
|
|
2025-12-12 10:42:05
|
I decided to cut it off early. I now have 50 GB of tiny files containing one URL each
|
|
2025-12-12 10:44:14
|
...I should probably have tried to find a way to use the zstd API and do this by iterating over the lines of a single file
|
|
|
TheBigBadBoy - πΈπ
|
|
A homosapien
I wonder how brotli would fare
|
|
2025-12-12 11:17:39
|
without a custom dict, 48B on my example above 0.o
really nice
|
|
2025-12-12 11:25:39
|
also added bzip2, 90B
|
|
2025-12-12 11:33:44
|
UniShox3_Alpha, 42B
|
|
2025-12-12 11:38:38
|
UniShox2, 34B [β ](https://cdn.discordapp.com/emojis/852007419474608208.webp?size=48&name=woag%7E1)
|
|
2025-12-12 11:38:58
|
and it gives 16 different presets to work with
|
|
|
AccessViolation_
|
2025-12-12 12:01:08
|
yeah brotli is really good, from quick testing I did before it often matched zstd with the dict
|
|
2025-12-12 12:05:24
|
> POSIX upper limit on argument length (this system): 2092017
that would've been great to know before hammering my disk with way more files than I can specify zstd should use
|
|
2025-12-12 12:06:51
|
it would appear I'm not yet out of the finding out phase of fucking around
|
|
2025-12-12 12:24:51
|
the dict derived from a massive amount of URLs wasn't *that* much better. I'm going to look into context modeling now
|
|
|
Quackdoc
|
|
https://fixupx.com/i/status/1999267124376690898
|
|
2025-12-12 01:33:26
|
I wonder what mappers they use
|
|
|
dogelition
|
|
Quackdoc
I wonder what mappers they use
|
|
2025-12-12 02:13:44
|
sounds like it's just a colorimetric conversion with a luminance boost? idk though
|
|
|
Quackdoc
|
|
dogelition
sounds like it's just a colorimetric conversion with a luminance boost? idk though
|
|
2025-12-12 03:26:06
|
no idea, will look into it eventually I'm sure ~~I won't~~
|
|
|
AccessViolation_
|
2025-12-12 04:09:10
|
my idea currently is to compress different parts of the URL with their respective contexts (dictionaries, histograms, transforms, etc).
signaled in this order:
- scheme
- second-level domain
- top-level domain (correlated with the above, as are the following subdomain fields with *their* above)
- subdomain 1 (right-most)
- subdomain 2
- any further subdomains, concatenated right-to-left
- port (context signaled by scheme)
- path language (doesn't appear in URL, used for signaling other contexts. it's detected by some library beforehand and itself encoded with a context signaled by the TLD)
- path (context signaled by path language)
- query parameters (keys and values have a respective context. possibly also per path language? not sure yet)
- fragment (usually article headers, following a `#` in the url. context signaled by path language)
- text fragments (doesn't appear in the image, it's for those links to a section of highlighted text. context signaled by path language)
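The field split above could be sketched with `urlsplit`; the TLD/SLD/subdomain split here is naive and ignores multi-label public suffixes like `.co.uk`, which a real model would need a suffix list for.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://news.google.com:443/topics/world?hl=en#top")
labels = parts.hostname.split(".")            # ['news', 'google', 'com']
tld, sld = labels[-1], labels[-2]
subs = labels[:-2][::-1]                      # subdomains, right-most first
assert (parts.scheme, tld, sld, subs) == ("https", "com", "google", ["news"])
assert parts.port == 443 and parts.path == "/topics/world"
assert parts.query == "hl=en" and parts.fragment == "top"
```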
|
|
2025-12-12 04:17:42
|
"signaled by path language" doesn't mean there's a single context per language in total, it means that the context for that specific thing is further divided into language-specific versions of it
|
|
|
Traneptora
|
|
AccessViolation_
my idea currently is to compress different parts of the URL with their respective contexts (dictionaries, histograms, transforms, etc).
signaled in this order:
- scheme
- second-level domain
- top-level domain (correlated with the above, as are the following subdomain fields with *their* above)
- subdomain 1 (right-most)
- subdomain 2
- any further subdomains, concatenated right-to-left
- port (context signaled by scheme)
- path language (doesn't appear in URL, used for signaling other contexts. it's detected by some library beforehand and itself encoded with a context signaled by the TLD)
- path (context signaled by path language)
- query parameters (keys and values have a respective context. possibly also per path language? not sure yet)
- fragment (usually article headers, following a `#` in the url. context signaled by path language)
- text fragments (doesn't appear in the image, it's for those links to a section of highlighted text. context signaled by path language)
|
|
2025-12-12 04:20:32
|
that sounds somewhat unnecessary for a mostly-word-character URL
|
|
2025-12-12 04:20:50
|
I have a feeling that the context model will be larger than the URL zlibbed with no wrappers
|
|
|
AccessViolation_
|
2025-12-12 04:21:50
|
it would be, but it's largely implicit. it's fixed, the encoder and decoder work with the same hard coded contexts
|
|
2025-12-12 04:22:49
|
the only part of the context not implicit is the detected language of certain elements (which is still signaled with probabilities according to the TLD, so it should be relatively small. but no more than a byte even if not entropy coded)
|
|
|
|
ignaloidas
|
2025-12-12 04:35:07
|
re domain - could make sense encoding it in reverse order? as in com.google.news?
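Reversing the labels makes related domains lexicographic neighbors, which is exactly the locality a context model or dictionary can exploit:

```python
domains = ["news.google.com", "mail.google.com", "google.com", "example.org"]
rev = sorted(".".join(reversed(d.split("."))) for d in domains)
# related domains cluster together once reversed:
assert rev == ["com.google", "com.google.mail", "com.google.news", "org.example"]
```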
|
|
|
AccessViolation_
|
2025-12-12 04:37:57
|
hmmm yeah probably
|
|
2025-12-12 04:39:46
|
I already do that for the SLD and subdomains, but it might make sense to also do TLD first and have a context for the popular SLDs per TLD, rather than deriving the likely TLD from the SLD
|
|
2025-12-12 04:42:56
|
TLDs will be super cheap to signal since it's mostly a fixed set. sometimes they invent new ones and those would need to be encoded in a way that's more expensive
|
|
2025-12-12 04:43:13
|
so I agree, I think TLD first is best
|
|
|
jonnyawsom3
|
2025-12-12 04:44:10
|
Sounds like you want https://github.com/facebook/openzl
|
|
|
|
ignaloidas
|
2025-12-12 04:44:15
|
I think it would be a bit better with languages of the SLD, some national TLDs will be very heavily weighted towards words of a certain language
|
|
|
Sounds like you want https://github.com/facebook/openzl
|
|
2025-12-12 04:45:15
|
that feels very much focused towards data-heavy applications where you need quicker (de)compression, not extracting the maximum, no?
|
|
|
AccessViolation_
|
|
Sounds like you want https://github.com/facebook/openzl
|
|
2025-12-12 04:45:35
|
oh that's really cool π
|
|
2025-12-12 04:46:11
|
it'll be interesting to see if it can create a better format, after I finish this
|
|
|
jonnyawsom3
|
|
ignaloidas
that feels very much focused towards data-heavy applications where you need quicker (de)compression, not extracting the maximum, no?
|
|
2025-12-12 04:47:00
|
It allows splitting the payload to handle different sections more suitably. Seemed similar to splitting the URL into its components
|
|
|
|
ignaloidas
|
|
It allows splitting the payload to handle different sections more suitably. Seemed similar to splitting the URL into its components
|
|
2025-12-12 04:49:30
|
yeah, what I meant more is that all of the tested datasets in the paper are several GBs big, are in tabular formats, and the "compression units" are at least a couple hundred MB. While here you want to compress a single URL by itself
|
|
|
AccessViolation_
|
2025-12-12 04:49:47
|
it does look like this requires signaling the DAG though
|
|
2025-12-12 04:50:05
|
or at least, I assume so, because they say they have a universal decompressor
|
|
2025-12-12 04:50:59
|
and any signaling overhead should be minimal since URLs are on the order of 100-1000 bytes
|
|
|
jonnyawsom3
|
2025-12-12 04:51:08
|
Yeahhh
|
|
|
AccessViolation_
|
2025-12-12 04:54:58
|
but the ideas are similar. some of the ways they transform data (like interpreting number strings as integers) are things I've also thought of doing, esp. in places where it would be cheap to signal, like in the query parameter `?date=2025-01-01` I could use the fact that the key is `date` to signal relatively cheaply that the value is encoded in some date format
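That `?date=` idea can be sketched as a typed transform: if the value parses as an ISO date, store a small integer instead of ten literal characters. Everything here (the epoch choice, the function names) is illustrative, not part of any real format:

```python
from datetime import date, timedelta

EPOCH = date(2000, 1, 1)  # arbitrary epoch, chosen just for this sketch

def encode_date_param(value: str):
    """Return days since EPOCH if the value is a YYYY-MM-DD date,
    else None so the caller falls back to literal string coding."""
    try:
        d = date.fromisoformat(value)
    except ValueError:
        return None
    return (d - EPOCH).days

def decode_date_param(days: int) -> str:
    """Inverse transform back to the literal query value."""
    return (EPOCH + timedelta(days=days)).isoformat()
```

The integer then fits in a couple of bytes under an entropy coder, versus ten bytes of digits and dashes.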
|
|
2025-12-12 04:56:31
|
still, if you trained OpenZL on some set of URLs and made the resulting context model and dictionary implicit I bet it'd do pretty well
|
|
|
jonnyawsom3
|
|
AccessViolation_
and any signaling overhead should be minimal since URLs are on the order of 100-1000 bytes
|
|
2025-12-12 04:57:53
|
Bear in mind, how many people are using a URL shortener if the URL is only a hundred bytes?
|
|
|
AccessViolation_
|
2025-12-12 04:59:32
|
I have the answer to that question in the form of millions of tiny ass files that are a serious pain to handle with anything at all
|
|
2025-12-12 05:00:51
|
oh wait I have the code that generated them from the main data. I can create a histogram of url sizes, gimme a minute
|
|
2025-12-12 05:26:51
|
```rust
[src/main.rs:17:5] histogram = Histogram {
lt_10: 5251,
lt_20: 250149,
lt_40: 3439354,
lt_80: 17487309,
lt_160: 14442660,
lt_320: 4025378,
lt_640: 1013693,
lt_1280: 180991,
lt_2560: 31046,
lt_5120: 114630,
lt_10240: 2796,
lt_20480: 1106,
lt_40960: 2683,
gt_40960: 5630,
}
```
|
|
|
|
ignaloidas
|
2025-12-12 05:30:27
|
40960?? That's more than most browsers support no?
|
|
|
AccessViolation_
|
2025-12-12 05:31:47
|
I found out that some of them encode massive blobs of base64 by just putting `http://` before it. maybe they're using it for file sharing?
|
|
2025-12-12 05:33:22
|
or worse, cloud storage <:KekDog:805390049033191445>
|
|
|
Magnap
|
2025-12-12 05:48:06
|
I may be more cynical but my mind goes to malware distribution and command-and-control
|
|
|
AccessViolation_
|
2025-12-12 05:49:17
|
oh...yea
|
|
2025-12-12 05:49:41
|
the largest is 1,024,000 bytes, by the way
|
|
2025-12-12 05:52:39
|
oh?
|
|
2025-12-12 05:52:41
|
that's the largest one
|
|
2025-12-12 06:01:21
|
sample of some large URLs (trimmed)
|
|
|
HCrikki
|
2025-12-13 10:55:34
|
are there specific types of jpegs that would badly transcode losslessly to jxl?
|
|
2025-12-13 10:59:34
|
almost everything i tried gives a consistent 20% filesize reduction (more specifically, for the visual data)
|
|
|
jonnyawsom3
|
2025-12-13 11:01:48
|
I assume that's in reference to the Reddit post?
|
|
|
HCrikki
|
2025-12-13 11:02:48
|
correct. not possible to answer here, and I thought the subreddit could use some activity since the site's lost its shine
|
|
|
username
|
2025-12-13 11:03:38
|
|
|
2025-12-13 11:03:38
|
for context (present or future)
|
|
|
HCrikki
|
2025-12-13 11:04:54
|
i specified visual data as it could've been jpegs with huge metadata (i.e. generated from long-lived images edited using photoshop)
|
|
|
A homosapien
|
2025-12-13 11:06:01
|
Even then, metadata is brotli compressed. So the 15-25% savings should still hold true.
|
|
|
HCrikki
|
2025-12-13 11:06:14
|
if your images are below 400kb, it'd make sense that the size of some is messing with the count. conversion utils should really generalize some basic logging for best and worst cases
|
|
|
jonnyawsom3
|
2025-12-13 11:07:49
|
The answer https://www.reddit.com/r/jpegxl/comments/1pletwf/comment/ntsfieb/
|
|
|
HCrikki
|
2025-12-13 11:09:55
|
for transcoding, efforts above 8 are a carbon footprint menace
|
|
|
|
veluca
|
2025-12-13 11:16:47
|
huh
|
|
|
HCrikki
for transcoding, efforts above 8 are a carbon footprint menace
|
|
2025-12-13 11:16:56
|
yeah I don't think they ever help either
|
|
|
HCrikki
|
2025-12-13 01:04:30
|
testers needed https://github.com/FossifyOrg/Gallery/pull/816
|
|
2025-12-13 01:10:28
|
with 'open with' for jxl, fossify gallery seems to now cover jxl comprehensively enough
|
|
|
whatsurname
|
2025-12-13 02:59:02
|
That PR actually makes JXL support better than AVIF
"Open with" for AVIF is not supported before Android 12
|
|
|
Quackdoc
|
2025-12-13 03:03:07
|
oh yeah I forgot about that
|
|
|
RaveSteel
|
2025-12-13 03:03:09
|
JXL support was better before this too. Fossify Gallery does not tonemap HDR AVIFs for example
|
|
|
NovaZone
|
2025-12-14 06:39:05
|
https://discord.com/channels/794206087879852103/805722506517807104/1449491912611463339 but...why?
|
|
|
|
ignaloidas
|
2025-12-14 01:38:20
|
I got nerdsniped with the URL compression discussion and now have almost 2TB worth of URLs (uncompressed).
|
|
2025-12-14 01:39:58
|
Here's a chart of how the lengths of them end up (only looking at ones shorter than 1500 characters)
|
|
|
AccessViolation_
|
2025-12-14 04:44:18
|
wow!
|
|
2025-12-14 04:44:23
|
where did you get these from?
|
|
|
|
ignaloidas
|
2025-12-14 04:52:39
|
Got all of the urlteam dumps via this https://discord.com/channels/794206087879852103/794206087879852106/1448979074335899769
|
|
2025-12-14 04:53:17
|
now slowly cleaning it up from garbage (there's a lot of garbage URLs in there)
|
|
|
AccessViolation_
|
2025-12-14 04:58:06
|
oh nice! I wasn't sure whether these were URLs from shorteners specifically
|
|
|
|
ignaloidas
|
2025-12-14 04:58:23
|
they are, some from goo.gl, some from other smaller ones
|
|
2025-12-14 04:58:30
|
mostly goo.gl tho
|
|
|
AccessViolation_
|
2025-12-14 04:58:34
|
how are you cleaning them up? in my case I just filtered those that did not contain valid unicode
|
|
|
|
ignaloidas
|
2025-12-14 04:59:46
|
trying to parse the URL, and then checking its components, older dumps have stuff like this `http://REVIEW: 'Bruce Lee: The Immortal Dragon - 70th Anniversary Special Edition' (DVD - Stax Entertainment) - http://www.kungfucinema.com/review-bruce-lee-the-immortal-dragon-70th-anniversary-special-edition-dvd-stax-entertainment-14611`
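A cheap first-pass filter in that spirit, using Python's stdlib parser (the specific rules here are guesses for illustration, not the actual cleaning code):

```python
from urllib.parse import urlsplit

def looks_valid(url: str) -> bool:
    """Sanity filter for shortener dumps: require an http(s) scheme,
    a plausible hostname, and no raw whitespace (the 'REVIEW:' example
    above fails on the embedded spaces alone)."""
    if any(ch.isspace() for ch in url):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # e.g. malformed IPv6 brackets or port
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    return bool(host) and "." in host
```

A real cleaner would be looser (IP-address hosts, other schemes), but a filter like this already drops the worst garbage.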
|
|
2025-12-14 05:01:02
|
though there's also some just HTML in some dumps, seemingly? They aren't of amazingly high quality, but it's a lot of data
|
|
|
AccessViolation_
|
2025-12-14 05:04:31
|
I just started working on a parser too (wrote the function headers, then had to leave for a few hours). I figured I could take some existing one, like from Firefox, but I need to actually segment it for context modeling. also didn't seem too hard to write one. I know those in browsers are oddly resilient to incorrectly formatted URLs, but because my context model heavily relies on the format of the URL I think I'm going to be more strict in what it can take in
|
|
|
|
ignaloidas
|
2025-12-14 05:08:05
|
do note that there are URLs with IP addresses, including IPv6 addresses
|
|
|
AccessViolation_
|
2025-12-14 05:08:11
|
initially I was going to allow only a subset of URLs, like just HTTP(S), and I even considered dropping the scheme entirely and just encode the domain and all that follows, but it'd be nice if I could also encode `ftp://`, `data:`, and more uncommon, custom ones like `gemini://`
|
|
|
|
ignaloidas
|
2025-12-14 05:08:47
|
that one was a bit annoying for me to figure out (especially since I'm trying to split the URLs by TLD to start)
|
|
2025-12-14 05:10:10
|
also, there's a non-trivial amount of URLs with authentication info (user/password) - if you want to support that
|
|
|
AccessViolation_
|
2025-12-14 05:12:57
|
yeah I'm aware, I do think I want to support that. the only reason I'd consider not supporting certain edge cases was if it resulted in a higher signaling overhead and thus worse compression for all URLs. but since ANS allows symbols to be represented by fractions of a bit, I think it'll be relatively cheap to signal the common features, and make uncommon ones more expensive
|
|
|
|
ignaloidas
|
2025-12-14 05:15:51
|
tbh I'm thinking of not supporting them, because they're likely to be so uncommon that signaling such features would remove all of the compression for the rest of the URL
|
|
2025-12-14 05:24:23
|
consider that for URL compression, the pigeonhole principle doesn't exist - if the compressed URL ends up being longer (certainly a possibility), then you may as well use the original URL without any signaling overhead, so if you end up with a longer result, the compressed url is mostly useless
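That escape hatch is simple to make concrete. A hypothetical framing with one tag byte (with ANS the flag could cost a fraction of a bit, and a shortener's database row could carry it for free):

```python
import zlib

def store(url: str) -> bytes:
    """Keep whichever representation is shorter; a leading tag byte
    says which one it is (hypothetical framing, not a real format)."""
    raw = url.encode("utf-8")
    packed = zlib.compress(raw, 9)  # stand-in for the URL compressor
    if len(packed) < len(raw):
        return b"\x01" + packed
    return b"\x00" + raw  # compression lost; keep the URL verbatim

def load(blob: bytes) -> str:
    """Undo store(): dispatch on the tag byte."""
    body = blob[1:]
    if blob[0] == 1:
        body = zlib.decompress(body)
    return body.decode("utf-8")
```

The worst case is thus bounded at one byte of overhead rather than an expanded URL.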
|
|
|
AccessViolation_
|
2025-12-14 05:25:41
|
a large sample of URLs in my much smaller dataset are news.google.com URLs because they're so large
`https://news.google.com/read/CBMisAFBVV95cUxQNE5SMEpremdyNTdPbGswM2Vxb01tRnRTUGJ3ZndPUHFOYk1tNy04TXJJYWdLeGhnd2ljM2IyMGQ5RmNrSmJ6WVBTOHhYZXo0YVBhM0pydExvV3hxTUlROUk2TTNDcHFqVHJzRVVKSkN6VGhZRzJWT1VXM2FLc0kyRVJFWnE0Sjc4M0t3N2M2bzVhODgwdkFxZkRRNFlReEk2VVFRVDNZeTNGQk1MOG42UNIBtgFBVV95cUxOVVVLbDg3a1IxU2tmZWo2RUZnNUQ2ZVVScnpkbE9fdXlnZFBER0pBVUh2TFVvcFA2RnM2RVZxTFJFV2NqYUg1aWFhdmpvR09yaGl6c2ZzMDlaV3d5VWliZm9ELWhtZzVYVFlmdndIV3o0X1dRcmIxR3I1a3I3QW9CU0tCSDA2TEQwcEVkRjk2eV9oRTN2OWtBX0dvMHVhRWV5VFRWRFVIM3BvcnlWU3phTXY4Y1YtQQ?hl=en-US&gl=US&ceid=US%3Aen`
Unfortunately, it seems in the past the base64 decoded directly to the URL of the source web page (since they all redirect to the original article's website), but now they decode to some weird other string that I guess they decode to the original server-side, so that you can't skip sending a request to google
|
|
|
Exorcist
|
2025-12-14 05:33:52
|
You can design special rule for each website
but I think it's impossible to better than general lossless compression, for general URL
|
|
|
AccessViolation_
|
2025-12-14 05:35:43
|
I thought about doing a context per very commonly seen domain but I'm not sure I want to do that
|
|
2025-12-14 05:39:29
|
oh like that, yeah I do plan to add that to the frontend. so if the service you're creating a shortlink for has internal shortlinks, it'll create one and then compress *that*.
|
|
2025-12-14 05:40:16
|
it doesn't need to be explicitly supported by the compressor either since it does it before the URL is compressed
|
|
|
jonnyawsom3
|
|
AccessViolation_
I thought about doing a context per very commonly seen domain but I'm not sure I want to do that
|
|
2025-12-14 05:41:19
|
Huffman tree of most common entries for each URL segment?
|
|
|
AccessViolation_
|
2025-12-14 05:44:11
|
my plan is to treat the 'domain' part and 'path' part separately. so domains themselves will form a context of commonly seen ones, but the context of the path that follows is not tied to the specific domain directly
|
|
2025-12-14 05:46:03
|
if it was, it would surely compress better, but I'd have to hard-code thousands of different contexts for common websites in the encoder and decoder and that doesn't feel right
|
|
|
jonnyawsom3
|
2025-12-14 05:51:46
|
Hmmm, what if you parsed the top 10/100 out of the corpus you downloaded. Use Huffman for the overwhelmingly dominant options of HTTPS, goo.gl, etc
|
|
|
Exorcist
|
2025-12-14 05:55:02
|
You may want this before Huffman:
1. sort the list
2. mark [the prefix that's the same as the previous line] as an offset
|
|
|
AccessViolation_
|
|
Hmmm, what if you parsed the top 10/100 out of the corpus you downloaded. Use Huffman for the overwhelmingly dominant options of HTTPS, goo.gl, etc
|
|
2025-12-14 05:56:33
|
I will create dictionaries based on these URLs, but not different dictionaries per domain, is what I mean. or that's not the plan, at least
|
|
2025-12-14 06:02:55
|
the thing is, I would like it to be generally good, and it feels sort of cheaty to give the most popular websites that much special treatment, in a sense
|
|
|
jonnyawsom3
|
2025-12-14 06:04:50
|
That should still work pretty well, was just thinking of different combinations of techniques
|
|
|
AccessViolation_
|
2025-12-14 06:07:06
|
it also felt wrong to give english special treatment, even though english is represented a lot more in sampled URLs. so instead I'm creating per-language dictionaries, and the language can be signaled, and that signal itself is entropy coded with probabilities according to the TLD.
for example, if you want to signal that a URL itself has many french words, it will be cheaper to do so if the TLD is `.fr`, but more expensive if the TLD is `.de`
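The cheap/expensive intuition is just the entropy-coding identity cost(s) = -log2 p(s). With made-up priors (the numbers below are purely illustrative):

```python
import math

# Hypothetical language priors conditioned on the TLD (invented numbers).
LANG_GIVEN_TLD = {
    "fr": {"fr": 0.70, "en": 0.25, "de": 0.05},
    "de": {"de": 0.70, "en": 0.25, "fr": 0.05},
}

def signal_cost_bits(tld: str, lang: str) -> float:
    """An entropy coder spends about -log2(p) bits on a symbol of
    probability p, so 'fr' is cheap under .fr and dear under .de."""
    return -math.log2(LANG_GIVEN_TLD[tld][lang])
```

Under these priors, signaling French costs about 0.51 bits under `.fr` but about 4.32 bits under `.de`.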
|
|
2025-12-14 06:08:09
|
this way, compressing non-english URLs won't be more expensive in most cases, even though the vast majority of URLs are english
|
|
2025-12-14 06:10:08
|
I do sort of give wikipedia special treatment, because I will also allow country codes in subdomains if the TLD is generic, so `fr.wikipedia.org` will signal language probabilities similarly to how `wikipedia.fr` would :>
|
|
2025-12-14 06:10:40
|
of course this works for any site that segregates different language versions this way
|
|
|
Meow
|
2025-12-15 02:55:51
|
Uh WMF owns `wikipedia.cat`
|
|
|
_wb_
|
2025-12-15 07:26:40
|
instead of manually creating per-language dictionaries, can't you use context modeling and use tld, subdomains, highest level dirs, query params as context? And then use context clustering. Which will probably have the same effect of making `fr.wikipedia.org` or `blabla.fr` or `foo.com/fr/` or `bar.com/quux?lang=fr` all map to the same context, but you wouldn't manually define such rules, they could be derived from a large training set.
|
|
|
Magnap
|
|
Meow
Uh WMF owns `wikipedia.cat`
|
|
2025-12-15 09:20:03
|
Ah, Catalan, makes sense
|
|
|
AccessViolation_
|
|
_wb_
instead of manually creating per-language dictionaries, can't you use context modeling and use tld, subdomains, highest level dirs, query params as context? And then use context clustering. Which will probably have the same effect of making `fr.wikipedia.org` or `blabla.fr` or `foo.com/fr/` or `bar.com/quux?lang=fr` all map to the same context, but you wouldn't manually define such rules, they could be derived from a large training set.
|
|
2025-12-15 10:15:54
|
that seems like a good idea, I'll have to think about that. one potential issue I see would be that things I'm using to derive the context from can't themselves be a part of it.
for example this made up URL:
`https://wiki.factorio.com/w/index.php?lang=fr&title=Chargeur_de_munitions_performantes&theme=sombre`
if I use query params as the context, it would read all three of them and detect a language from `lang=fr`, but since I'm using the queries as part of the signal it can't then compress `&title=Chargeur_de_munitions_performantes&theme=sombre` using the french language context
|
|
2025-12-15 10:17:11
|
I could explicitly signal which parameters and path components are used for deriving the context and encode those separately, so `query[0]` is encoded in a default context, then a language context is signaled from that, and then `query[1..=2]` are encoded using that context
|
|
|
_wb_
|
2025-12-15 01:02:36
|
You could have a default ordering for the various url components, where of course you can only use already-decoded components as context for to-be-decoded components, that's why the ordering matters.
And then do the same thing jxl does: first 1 bit to indicate if you're using the default or not, and if not, then a lehmer coded permutation. That way an encoder can use a more optimal ordering.
|
|
|
AccessViolation_
|
2025-12-15 03:34:57
|
oh hmmmm that's smart
|
|
2025-12-15 03:35:43
|
I hadn't thought about allowing different ordering per encoded URL
|
|
2025-12-15 03:39:04
|
I'm assuming the lehmer code signalling overhead is proportional to how different you make the ordering
|
|
2025-12-15 03:40:46
|
on average URLs people put into link shorteners are like 80 characters, so signaling overhead in general adds up really quickly
|
|
|
_wb_
|
2025-12-15 04:17:17
|
yes, so the default order should be pretty good, and changing it will only make sense for long urls where the change is really useful.
|
|
|
AccessViolation_
|
2025-12-15 04:45:31
|
it might be interesting to support any URI, not just HTTP URLs. I'm implementing RFC 3986 in my parser anyway, so why not π
|
|
|
|
ignaloidas
|
2025-12-15 04:46:30
|
Fun stuff from cleaning up my dataset - apparently US government had a URL shortener https://go.usa.gov/
|
|
2025-12-15 04:47:07
|
noticed only because I found it somewhat strange that .mil had ~24M URLs in shortening services
|
|
|
AccessViolation_
|
2025-12-15 04:49:23
|
how's your project going so far?
|
|
2025-12-15 04:49:27
|
you were working on the same thing right?
|
|
|
|
ignaloidas
|
2025-12-15 04:52:43
|
Yeah, though I'm thinking of maybe doing a harder thing and trying to train a ML model to work with adaptive ANS - will likely take way longer to get to something useable than your approach
|
|
2025-12-15 04:53:42
|
will try to put up the dataset I'm cleaning fairly soon, though it's going to end up around ~400GB compressed
|
|
|
AccessViolation_
|
2025-12-15 04:56:35
|
that sounds interesting
|
|
2025-12-15 04:57:40
|
I'm considering using ANS with adaptive probabilities to get around the inefficiencies that arise from having multiple streams
|
|
2025-12-15 05:00:39
|
for example, instead of having one ANS stream per context, I could also have a single ANS stream and adapt the probabilities per context. that's probably going to be harder to manage if I also want predefined LZ77 dictionaries per context, though
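For intuition, here's a toy rANS with exact big-int arithmetic and a static table (a real adaptive single-stream coder would also renormalize and update counts per context, and has to model probabilities in a forward pass because encoding runs in reverse):

```python
def build_table(freqs: dict[str, int]):
    """Cumulative frequencies and total count for a static rANS table."""
    cum, c = {}, 0
    for s, f in freqs.items():
        cum[s] = c
        c += f
    return cum, c

def rans_encode(msg: str, freqs: dict[str, int]) -> int:
    """Toy rANS on Python big ints (no renormalization): symbols are
    pushed in reverse because decoding pops them LIFO."""
    cum, total = build_table(freqs)
    x = 1
    for s in reversed(msg):
        f = freqs[s]
        x = (x // f) * total + cum[s] + (x % f)
    return x

def rans_decode(x: int, n: int, freqs: dict[str, int]) -> str:
    """Pop n symbols off the state, inverting rans_encode exactly."""
    cum, total = build_table(freqs)
    out = []
    for _ in range(n):
        slot = x % total
        s = next(k for k in freqs if cum[k] <= slot < cum[k] + freqs[k])
        x = freqs[s] * (x // total) + slot - cum[s]
        out.append(s)
    return "".join(out)
```

With a skewed table like `{"a": 6, "b": 1, "c": 1}`, each `a` grows the state by well under one bit, which is the fractional-bit property mentioned earlier.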
|
|
2025-12-15 05:04:30
|
but this is all theoretical, I'm going to start with segmenting URLs, then context modeling, then just using zstd blobs per segment per context :p
|
|
2025-12-15 05:05:37
|
implementing a custom compressor with ANS doesn't sound particularly easy so that'll be the last step
|
|
|
mincerafter42
|
2025-12-16 07:53:27
|
hello i believe i've implemented the [pairwise nearest neighbour palette generation algorithm in Θ(n log n) time](https://vivivi.leprd.space/software/pairwise-nearest-neighbour/), better than the previous O(n²) unless someone did it better and i couldn't find it
|
|
|
lonjil
|
|
mincerafter42
|
2025-12-16 07:57:21
|
i wanted to make some GIFs and was unsatisfied with the existing state of algorithms, so after guessing "eh a K-D tree is probably gonna reduce the time complexity for nearest neighbour search right? even though the distances are non-Euclidean?" i wrote this and yes it works
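For reference, the quadratic PNN baseline being improved on looks roughly like this: weighted-centroid merges with the usual w1*w2/(w1+w2) * d² cost, where the k-d tree version replaces the brute-force pair search. This is a generic sketch, not the linked implementation:

```python
def pnn_palette(colors: dict, k: int) -> list:
    """Naive pairwise-nearest-neighbour palette generation: start with
    one weighted cluster per (color, count) histogram entry and merge
    the cheapest pair until k clusters remain. O(n^2) per merge."""
    clusters = [(w, list(c)) for c, w in colors.items()]
    while len(clusters) > k:
        best, bi, bj = None, 0, 0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (w1, c1), (w2, c2) = clusters[i], clusters[j]
                d2 = sum((a - b) ** 2 for a, b in zip(c1, c2))
                cost = w1 * w2 / (w1 + w2) * d2  # merge error increase
                if best is None or cost < best:
                    best, bi, bj = cost, i, j
        (w1, c1), (w2, c2) = clusters[bi], clusters[bj]
        merged = (w1 + w2, [(w1 * a + w2 * b) / (w1 + w2)
                            for a, b in zip(c1, c2)])
        clusters = [c for i, c in enumerate(clusters) if i not in (bi, bj)]
        clusters.append(merged)
    return [tuple(c) for _, c in clusters]
```

The speedup in the linked post comes from answering "which cluster is my nearest weighted neighbour?" with a k-d tree instead of scanning every pair.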
|
|
|
AccessViolation_
|
2025-12-16 09:15:56
|
oh wow
|
|
2025-12-16 09:15:58
|
nice
|
|
|
_wb_
|
2025-12-16 09:23:38
|
That's cool! I wonder if this would also be useful to create a better palette ordering?
libjxl by default only uses palette in a lossless way, i.e. if the image already does contain a small number of colors (e.g. when recompressing a GIF or PNG8). But the palette ordering it uses right now is pretty basic, just sorting on luma basically iirc with maybe some attempts at doing slightly better than that but nothing principled.
|
|
|
mincerafter42
|
2025-12-16 10:28:23
|
my code doesn't do anything with the order of the palette entries; they look somewhat ordered in the output but only because i don't change the order after making an initial histogram in the source colour space
|
|
2025-12-16 10:44:53
|
would palette ordering benefit from finding the nearest neighbour of a palette entry, with all entries weighted by the number of pixels using the entry? that's essentially what's going on here
|
|
|
|
veluca
|
2025-12-16 11:23:34
|
I am *somewhat* convinced it's n log n average case, not worst case
|
|
2025-12-16 11:24:19
|
still I doubt I can cook up an image that makes it actually take quadratic (at least not in the time I am willing to spend on it :P)
|
|
|
mincerafter42
|
2025-12-16 11:38:50
|
i did say average case yes
|
|
|
TheBigBadBoy - πΈπ
where is the 186B SVG ?
|
|
2025-12-17 01:15:48
|
i can get it down to 168B :p
|
|
2025-12-17 01:17:09
|
(admittedly with a hacky trick that saves bytes by making the background an enormous triangle that gets cropped lol)
(i am knowledgeable in the ways of SVG and not very active in this discord-group :p )
|
|
|
AccessViolation_
|
2025-12-17 01:21:37
|
132 bytes brotli compressed :)
|
|
|
Adrian The Frog
|
2025-12-21 07:30:50
|
|
|
2025-12-21 07:31:05
|
|
|
2025-12-21 07:34:16
|
1024 bytes, `cjxl -d 23.005 -e 10 -m 1 --gaborish 1 -I 98`
|
|
2025-12-21 07:35:23
|
from this 21kb png
|
|
|
spider-mario
|
2025-12-22 06:30:08
|
https://youtu.be/izxXGuVL21o
|
|
|
AccessViolation_
|
2025-12-25 11:40:12
|
I can't wait for JXLs with noise synthesis to become so prominent that the noise pattern starts showing up in AI generated images, similarly to how charlie kirk's face showed up on people in generated images because that face was overrepresented in memes used in training <:KekDog:805390049033191445>
|
|
|
jonnyawsom3
|
2025-12-26 03:17:41
|
Thought this was neat
https://fixupx.com/i/status/2004210496489357414
|
|
|
RaveSteel
|
2025-12-26 03:23:59
|
The images were posted [here](https://discord.com/channels/794206087879852103/794206170445119489/1363684268760367184) some time ago
|
|
|
jonnyawsom3
|
2025-12-26 04:22:54
|
Oh yeah, I even reacted to it.... It's been a long week
|
|
|
DZgas Π
|
|
Adrian The Frog
|
|
2025-12-27 09:12:38
|
|
|
2025-12-27 09:12:57
|
cjxl -m 1 -d 11.92 -e 10 --resampling=1
|
|
2025-12-27 09:13:18
|
|
|
2025-12-27 09:14:38
|
or
cjxl -m 1 -d 2.983 -e 10 --resampling=2
|
|
2025-12-27 09:15:03
|
|
|
2025-12-27 09:17:05
|
great rectangle
|
|
|
A homosapien
|
2025-12-28 10:31:07
|
avif `tune=iq -d 10`
|
|
|
username
|
2025-12-28 10:36:10
|
WebP with `-size 1019 -pass 10 -mt -pre 0 -sharpness 1 -m 6 -af -sharp_yuv -q 2 -sns 50 -noalpha`
|
|
|
DZgas Π
|
2025-12-29 08:46:17
|
JPEG mozjpeg `cjpeg.exe -baseline -notrellis -sample 2x2 -quality 8 -tune-psnr -quant-table 3`
|
|
|
TheBigBadBoy - πΈπ
|
|
DZgas Π
JPEG mozjpeg `cjpeg.exe -baseline -notrellis -sample 2x2 -quality 8 -tune-psnr -quant-table 3`
|
|
2025-12-29 09:11:49
|
19B less with jpegultrascan
|
|
|
jonnyawsom3
|
2026-01-02 05:40:19
|
Well, props to Irfanview, it just did a difference map of a gigapixel image without exploding. Some UI flickering and my music stuttered, but not bad
|
|