Bytes are all you need: Transformers operating directly on file bytes (arxiv.org)
263 points by pmoriarty on June 3, 2023 | hide | past | favorite | 96 comments


Fascinating philosophical question: if an AI observes a stream of bytes but doesn't decode a camera's output into an image -- and yet can raise an alarm if the camera is aimed at a certain class of subject -- does the AI "see" the camera image?

They should have named the system "Blindsight"[1].

Apple has a serious corporate concern: proposed legal requirements would force it to detect and report child porn[2], while other legal requirements try to protect user privacy. This appears to be an attempt to say, "We can detect your CP image storage without ever actually looking at your stored images."

[1] https://en.wikipedia.org/wiki/Blindsight

[2] https://en.wikipedia.org/wiki/Regulation_to_Prevent_and_Comb...


It is worth doing some ML courses to understand what is going on.

I am beginning this journey, and it takes a lot of the magic out of it.

Really what is happening is just mathematical operations: a lot of matrix multiplications, some simple non-linear functions (because linear algebra alone cannot represent all logic), and some statistical tricks to stop the numbers getting out of control.

Importantly, there are millions or billions of "magic numbers", called parameters, that get updated as the model learns.

Whether you train the network on representation A of the image or representation B of the image doesn't change much philosophically. It is just a function f(x) applied before the function g(x, parameters), so you get g(f(x), parameters).
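As a toy sketch of that composition (all names here are illustrative, not from any real framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x, params):
    # stand-in "model": a matrix multiplication plus a simple nonlinearity
    return np.maximum(params @ x, 0.0)

def f(x):
    # stand-in change of representation (representation A -> B)
    return x[::-1].copy()

params = rng.standard_normal((3, 3))  # the "magic numbers"
x = rng.random(3)

# feeding the model representation B is just composition: g(f(x), params)
y = g(f(x), params)
```

Switching representations only swaps out f; the learning machinery around g stays exactly the same.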

Now you could argue, well, it might mean "something". After all, you can reduce our brains to a big mathematical function.

Possibly.

But I think in this case it is too simple. The AI is just looking at a slightly different representation, similar to a human switching VS Code to or from dark mode, or something like that. And the modellers might be changing representations all the time anyway as they tinker with things.


I took Andrew Ng’s famous Coursera ML course and I still find this stuff extremely fascinating and as close to magic as you can get (in the digital realm).

I agree that a higher degree of understanding lessens the emotional impact but often you just need to pause for a second, look back and appreciate what we’ve accomplished as a species.

Sometimes I get on a plane and mid flight I realize “wait a second, I’m in a metal box flying in the sky at 800 km/h and I can breathe, eat, drink and watch tv” and I get goosebumps, even though I’m kind of familiar with the physics of it.

Maybe some things just have an inherent “awe factor” and no matter how well you understand them you still get those butterflies. Or maybe I’m just blabbering :)


Yeah, but this is an ML 101 view of the thing, so let me put it this way:

The topology of the (real) input data is part of the dimensionality reduction needed for identification to work, especially at a given number of parameters.

Understanding how the data is connected in the input (images have lines, and the lines form a "shape") helps a lot, and CNNs used this to their advantage.

Maybe a very powerful RNN could deduce this by itself, but that is wasted processing.

GPT also uses the input topology to its advantage when it uses tokens, processes text sequentially, etc.


If this is the case, and I believe you, then it implies that "bytes are NOT all you need": you need a good representation of your image for the network to work well. It might so happen that one of the popular formats, maybe JPEG, does a good job. But a dedicated ML person could most likely do better.

This is probably what you would classically call "feature engineering"? And this is all hyperparameter search stuff in a way.


"I am beginning this journey, and it takes a lot of the magic out of it.

Really what is happening is just mathematical operations: a lot of matrix multiplications, some simple non-linear functions (because linear algebra alone cannot represent all logic), some statistical stuff to stop the numbers getting out of control."

And all that happens in the human brain is "just" chemical reactions.

All that happens on the Earth is "just" chemistry.


Also I said "Now you could argue, well it might mean "something". After all you can reduce our brains to being a big mathematical functions.

Possibly."


Well, it’s not just an aside. It takes all the wind out of your sails. What’s interesting here is the end result, not pointing out the basic elements, for the same reason that reducing love or consciousness to chemical reactions doesn’t take away the mystique and wonder of those things. “Heh, sorry to ruin it for you all :b but it’s all just chemical reactions” just isn’t as interesting as you might have thought when writing that comment.


Do you "see" the image projected onto your eye's retina, if you just observe a massively parallel stream of electric stimuli going through your optic nerve, but never decode it into a cluster of pixels and the internal reconstruction of the scene in your brain is nothing like that?

(wait until you know that there are single-pixel compressive sensing cameras that bypass the wasteful image reconstruction phase, directly mapping the raw signal to the output representation...)


Models never “see” anything to begin with; it’s all matrices. And since we ditched convnets, locality doesn’t even matter anymore (almost). RGB is just convenient for humans; nothing says it’s optimal for deep learning.


Yeah, real humans see with a fourier transform in a highly optimized basis for projecting 3d down to the 2d retina. Not cold soulless math like those machines!

http://www.math.utah.edu/~bresslof/publications/01-3.pdf


Much hardware and software for codecs uses wave bases as well.

https://en.wikipedia.org/wiki/Discrete_cosine_transform


Humans see more than 8-bit RGB. Humans can see light polarization and stereo disparity, but more importantly we can interact with things we look at.

XYZ is the "most optimal" 3-channel colorspace but it's a simple transformation from RGB, so it doesn't matter - the model can learn it if it wants to.
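The "simple transformation" claim is easy to make concrete: the linear-sRGB-to-XYZ conversion (D65 white point) is a fixed 3x3 matrix, so a single linear layer can represent it exactly.

```python
import numpy as np

# linear sRGB -> CIE XYZ (D65 white point): a fixed 3x3 linear map,
# exactly representable by one linear layer with these weights
RGB_TO_XYZ = np.array([
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
])

white = np.array([1.0, 1.0, 1.0])  # linear-light white
xyz = RGB_TO_XYZ @ white           # ~[0.9505, 1.0000, 1.0890], the D65 white point
```

Since the map is linear and invertible, a network fed RGB loses nothing relative to one fed XYZ: the first layer can absorb the conversion.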


Yeah, yeah, we know it is all matrices and numbers. But the Apple group seems to be asserting that there is a difference:

> We also demonstrate ByteFormer's ability to perform inference with a hypothetical privacy-preserving camera which avoids forming full images by consistently masking 90% of pixel channels, while still achieving 71.35% accuracy on ImageNet.

So the way this algorithm smooshes the matrices is "privacy-preserving" because it doesn't actually take in all the bytes, just some of them. So it is not an invasion of your privacy, if your phone checks all your stored photos for child pornography, because it never actually _looks_ at the images. It just _peeks_ at a bit of them.
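Mechanically, the "peeking" amounts to something like the following sketch (illustrative only, not the paper's actual pipeline): keep ~10% of pixel channels and throw away their sensor coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative stand-in for the paper's hypothetical camera
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
channels = image.reshape(-1)             # every pixel channel, flattened
keep = rng.random(channels.size) < 0.10  # mask ~90% of them
captured = channels[keep]                # a bare 1-D buffer, no positions kept

# a model would run inference on `captured`; an observer holding only this
# buffer cannot place the surviving values back onto the sensor grid
```

Whether a bag of positionless channel values still leaks too much is exactly the point being argued in this thread.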


If I ask to see just a bit of your private documents (I can choose a corner of the document, say the top left) and I can infer with 70% accuracy who you're getting services from, did I respect your privacy? This seems a bit like smoke and mirrors. Either scan the image or don't!

Inference from partial images is super cool research, for example to deal with corrupted image files, but not for privacy, imo.


Which is a hardcore effort to pretend the issue was something to do with looking at images, and not indiscriminate motivated surveillance.

No one cares about software touching their bytes; they should care a lot about software built with the null hypothesis of "this person has child pornography, notify the authorities".


That sounds like a variant of the Chinese Room problem: if a non-Chinese speaker follows a rule book to text-chat in Chinese with someone on the other side, does s/he, in actuality, “speak Chinese”?

https://en.m.wikipedia.org/wiki/Chinese_room


Except the person in the room also had to write the book by looking at examples of Chinese.


No, the book does.


Or perhaps the book + human system does


Or perhaps it's the universe speaking Chinese with itself. The chain of cause and effect stretches 13.8 billion light years.



The AI doesn't decode the image with a human-written decoder, but it does decode it, with learned model parameters. So it still decodes.


You might want to read about Daniel Kish, and maybe listen to the Batman story on NPR.

They raise the philosophical question of what it means to see.

According to brain scans of Daniel Kish, it seems that despite being blind, he can actually see through echolocation.

Reference: https://www.npr.org/2015/11/20/455906507/how-can-you-see-wit...


Wouldn’t the transformer just be learning a jpeg decoder at some point?


You might like the scifi novel of the same name by Peter Watts


The real conclusion from your question is that the AI never “sees” the image even if it did decode it.


And how do you draw that conclusion?


What does it mean to “see”? Are we assuming that eyes are involved?


How does the human brain "see" the signal impulses that pass through the optic nerve?


Both TIFF and WAV are lossless, so this doesn't really surprise me.

It is, however, nice to see Apple researchers publishing; you don't see that often at the forefront of transformer or generative-model research. I hope it means that everyone here on HN is right and they are seriously looking into running the generative models locally on their own silicon.


They publish research all the time:

https://machinelearning.apple.com/research

Including research about fitting models locally onto Apple Silicon, for example: https://machinelearning.apple.com/research/neural-engine-tra...


I had no idea. I've never seen one mentioned in the wild.


I mean, they published their own Metal Stable Diffusion implementation on GitHub very quickly last year. I don't think this was in doubt.


And then… nothing after?


I suspect they prefer to use their standard marketing avenues to better control the narrative. I'm sure we'll be hearing a lot more next week, and in Apple fashion it'll all be grounded in customer facing use cases and associated developer frameworks.


That's not AI research


It's desperately needed, though.

If AMD or Intel started going around GitHub and writing in acceleration for random popular ML repos (and not just one-off abandoned demos), that would change so much.


These models run on CUDA. Making a version that ran on Metal filled a feature gap on Apple's proprietary GPUs. CPUs aren't involved; the last thing Stable Diffusion, for example, needs is help from AMD and Intel.


CPUs are actually involved, especially with SIMD instruction sets like AVX and VNNI. It all depends on the scale of your compute.

https://www.intel.com/content/www/us/en/developer/tools/open...


Linking to an Intel SDK with “deep learning” in it does prove it’s possible to do matmuls on CPUs.

But this is a nitpicked-to-death thread: it's good to see Apple releasing research, the Metal stuff was to make it run on their GPU, and there simply aren't any relevant uses of CPUs.

Academically, yes, CPUs have SIMD, so they can implement the same algorithms that drive GPU performance, but without some massive breakthrough, the side that can do more matmuls wins, and Nvidia wins massively, in theory and in practice.


I'm not just linking it as proof: I now work at Intel and collaborate with Microsoft on bringing new platform technologies to life. I'm currently involved in VNNI and hybrid-core-scheduling KPIs. OpenVINO allows developers to target execution engines on the client based on their needs, and the CPU is a well-supported engine for lightweight, latency-sensitive inference.

Matmuls (FMAC especially) did evolve as a SIMD workload on CPUs, but it's still forcing vector execution on cores designed around scalar workloads; you just have to be mindful of the tradeoffs between executing a bunch of AVX/VNNI instructions for inference vs. setting up a GPU/VPU/IPU context and performing that inference asynchronously.

AFAIK, ARM is also interested in using NEON/SIMD co-processors as low-latency inference engines.

Microsoft and Intel have a lot of interesting stuff in the works for AI compute (as, I'm sure, do Apple, Nvidia, and to a lesser extent AMD).


You know that Intel makes GPUs, right? In fact, they are by far the biggest producer of GPUs by volume.

Sure, those aren't being used in datacenters, but if you are making an application that needs to run locally, then you absolutely need good support for Intel GPUs.


Intel actually just released a set of data centre GPUs too


You're talking about integrated GPUs I presume?


iGPUs aren’t necessarily the worst solution where the problem is one where “can I do it at all” is often constrained by RAM available to the GPU, even if dGPUs have better compute performance.


Apple GPUs are also integrated. (But "unified memory" is a better technical distinction.)


Mainly yes, although they do also make dedicated GPUs.


AMD and Intel both ship a lot of GPUs.

In fact, there are good AMD/Intel implementations; there just isn't enough dev effort to add them to popular SD projects.


In Intel’s case, they have their own SD-based project (OpenVINO AI Plugins for GIMP), instead of putting effort into popular SD projects.


Oh, and also: lossless does not necessarily mean readable. Hence they note that zlib-compressed PNGs are untenable.


It has been well documented that Apple has permitted its AI teams to publish papers in a manner that appears to contradict Apple's culture of secrecy. This is explicitly allowed, as previously this was a hindrance to recruiting talent.


Drip by drip gearing up for "one more thing" on Monday...


> Additionally, we demonstrate that ByteFormer has applications in privacy-preserving inference. ByteFormer is capable of performing inference on particular obfuscated input representations with no loss of accuracy. We also demonstrate ByteFormer's ability to perform inference with a hypothetical privacy-preserving camera which avoids forming full images by consistently masking $90\%$ of pixel channels, while still achieving $71.35\%$ accuracy on ImageNet.

I'm not certain they know what "privacy-preserving" means. All the claims they've made around privacy look, to the lay-person (me), to be meaningless:

• Permuting the input values doesn't change anything, because a fixed permutation preserves the statistics of the input (compare ECB mode's famous penguin). If anything, this suggests that transformers might be able to approximate the original image given an unknownly-permuted image – but so can humans, so that's nothing new. https://en.wikipedia.org/wiki/Block_cipher_modes_of_operatio...

• Their "partially-masked image" just looks like a noisy image, not a redacted one. Basic information theory suggests it's not really privacy-preserving at all.

Is it normal for AI papers to be so hypey? Like, this part is literally security by obscurity:

> As our method can handle highly nonlinear JPEG encodings, we expect it to perform well on a variety of alternative encodings that an outside observer might not be able to easily guess.

I don't see how any of section 4.2 contributes to the paper, other than letting them make a bold claim about a buzzword in a disproportionate amount of the abstract and conclusion.


I'm interested in that claim as well, though I'm maybe not so overtly hostile to the research.

My first thought was certainly that, at the very least, an adversary can perform image classification just as well. I think this is an obvious limit on how much privacy preservation is possible, so maybe it's just taken for granted that the reader understands it. And, BIG IF, if the adversary can't do much more than that, it would still be appreciated.


They say their model works for a hypothetical privacy preserving camera that masks 90% of the pixels.

I’m not sure there’s much more to it; I think you’re reading too far into it. It shows the power of their model and possible applications towards privacy.

It’s not a stance that that hypothetical camera is actually great for privacy.


But a camera that masks 90% of the pixels isn't privacy-preserving. It's just a 90s consumer-grade webcam. They haven't shown that the approach works with actual privacy measures, which makes their claims in this area dubious.


> It's just a 90s consumer-grade webcam.

The positions of the masked pixels are not stored in the resulting data -- it's not like they are just making some pixels black. The channel information is actually removed from the buffer entirely, and then inference is performed on that buffer:

> The camera stores the remaining unmasked pixel channels in an array without retaining the coordinates of pixel channels on the image sensor. In this scenario, an adversary could not obtain a faithful reconstruction of the input image. Even if the adversary could guess pixel channel locations, the low resolution of captured data prevents the adversary from recovering a high-fidelity image

Also:

> Their "partially-masked image" just looks like a noisy image, not a redacted one

The caption for the figure (assuming you're talking about Figure 4) makes it clear that the figure is illustrative, since it includes the positions of the masked pixels. The pixel positions would not be present in the actual data that comes from this hypothetical camera. So what the figure "looks like" is irrelevant; it's purely illustrative.


ByteFormer. All You Need. The applied papers that claim dubious "sota" status on some benchmark do this. The foundational work that is actually worth reading doesn't.


Had a quick skim.

I was really struggling to see the practical advantage of this, because you can easily convert different formats to whatever the model needs.

It feels like some strawmen are set up. In the world of the paper, people have to painstakingly convert the image to the correct format for the model, "hand-crafting a model stem for each modality".

But once you have set up a model and got it working for, say, .tiff, then to work on any other format you can just use ImageMagick or something? Unless you want the metadata too?

I think the use case for a model that works on bytes is as a kind of "easy to install" package on local devices, like a security camera, regular camera etc that will work with whatever the local file format is.

Also, a dollar into the "is/are all you need" jar please.


> Also, a dollar into the "is/are all you need" jar please.

"Is/Are All You Need Considered Harmful"


The Unreasonable Effectiveness Considered Harmful for Fun and Profit is All You Need


Harm is all you need.


  All you need is harm (ba da-da da-da)
  All you need is harm (ba da-da da-da)
  All you need is harm
  Harm, harm is all you need


One interesting application is to imagine you have a bunch of data in some proprietary binary format, with metadata in plain text.

This is saying you could potentially make a network that generates new instances of the proprietary format without having to know the details of that format.


Interesting thought to have a model training co-processor that reads all of the data inputs and outputs from the actual processor. There's a ton of sequence information flowing through there even on a single machine. Then you'd basically have a model that was a "virtual machine" mirror of your actual cpu and the data it's interacted with. I'm not sure what would emerge from that, but it's super interesting.


Could be used to "compress"/predict common computations, a la Nvidia's AI video compression. Distributed computing that's distributed across time as well as space.


Wouldn’t be surprised if this was already done to some degree for branch prediction.



I definitely appreciate the technical privacy focus here and I’m curious if that is the whole purpose here.

I don’t see many benefits to this method other than privacy. The ability to process multiple data types with a single architecture is really cool, but it’s not SOTA-beating or really comparable to large multimodal systems like ImageBind [1].

[1] https://arxiv.org/pdf/2305.05665.pdf


I wonder what would happen if someone connected a Transformer between the inputs and output of an embedded system.

Could we get a robotic arm to catch a falling ball?

I suspect the lack of awareness of time would mess it up.

What if we had the Transformer take inputs from the state of the world (as determined by other software) and output commands? I wonder what it could do.


I think PALM-E[1] is pretty close to what you're describing.

[1]: https://arstechnica.com/information-technology/2023/03/embod...


> In a video example, a researcher grabs the chips from the robot and moves them, but the robot locates the chips and grabs them again.

Well, that’s it boys, I think we’ve successfully created a Terminator. To paraphrase Kyle Reese, “It is a chip-grabbing machine, and it absolutely will not stop!! Until it has acquired all the chips.”


Replace chips with money and bam, it's man.


As they note, most media compression is going to throw a monkey wrench into the whole thing.

But I was kinda hoping they would test GPU texture compression; AFAIK it's a much simpler compression scheme.


Why not enforce that decompression is a necessary part of the data cleaning? There's no reason to operate on complex formats like mp4 directly.


Is data compression really a problem? Compression algorithms are meticulously tuned to preserve important details. In theory, a model that operates on compressed data has a real advantage over one that operates on raw data. You’ve reduced the memory requirements without sacrificing any important features of the input data.


I tried to get GPT-4 to act as a compiler and it didn't go so well, but it felt like that was mainly because it didn't believe it could. After much consternation, it was willing to put together a hello world in x86 assembly.


A lot of problems are avoided with strategic "as a [suitable role], [task]" or "you are a ...". Not sure what you tried, but variations of that have been enough to bypass all kinds of weird objections for me.


The little language model that could.


Interesting release from Apple but there's been a similar (and I think more promising architecturally) paper from Meta lately:

> MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

https://arxiv.org/abs/2305.07185


Seems similar in spirit to the “Perceiver” architecture from DeepMind a couple of years ago:

https://arxiv.org/abs/2107.14795


This seems like a downgrade. Intuition suggests it should lead to better performance if you first parse/process the input to represent the actual input space better.


Yes, that's the intuition, but there's an idea from Richard Sutton called "the bitter lesson": every piece of feature engineering is eventually swamped and overtaken by the raw power of stacking more layers with more parameters, more exaflops, and bigger datasets.


All that's really happening there is a brute-force search for engineered features.

Such brute-forcing has a very, very low ROI when the necessary huge datasets don't already exist.

Indeed, creating the experimental conditions to collect that data often means engaging in some form of imagined feature engineering.


I wondered something like this: would it be better to train a CNN in, say, YUV color space? But if you consider that a NN approximates any function, then if using YUV performed better, the network would learn to convert RGB to YUV itself. (It’s a simple linear relationship that would take a single layer.)
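That absorption argument can be checked numerically: for any invertible linear recoding M (here the standard BT.601 RGB-to-YUV matrix), a linear first layer trained on RGB has an exact YUV-equivalent obtained by folding M's inverse into the weights.

```python
import numpy as np

# BT.601 RGB -> YUV: a fixed, invertible 3x3 linear map
M = np.array([
    [ 0.299,    0.587,    0.114  ],
    [-0.14713, -0.28886,  0.436  ],
    [ 0.615,   -0.51499, -0.10001],
])

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3))  # a linear first layer "trained" on raw RGB

rgb = rng.random(3)
yuv = M @ rgb

# a network fed YUV recovers the RGB behaviour by absorbing M into the layer:
# W_yuv @ yuv equals W @ rgb up to floating-point error
W_yuv = W @ np.linalg.inv(M)
```

So the colorspace choice only matters to the extent that training dynamics (not expressiveness) differ between the two parameterisations.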

I then found a paper that confirmed that converting input to different color spaces did not cause significant differences in performance.


Similarly I wondered if partially decompressing video and using that format as both the input and output might work. The logic is that a fully decompressed video is huge, and that extra data is by definition wasteful: it’s exactly what’s thrown away by compression! We’ve designed compression to efficiently match the human visual system and not waste bytes on irrelevant things.

So I wonder if a NN trained on something like a quantised DCT as both input and output might be dramatically more efficient, roughly in line with the compression ratio of applying the same transforms to a raw video.

Obviously we’d have to avoid bit-level streaming algorithms like Huffman coding.

However even reusing image tiles might work via methods such as differentiable hash tables, as seen in NVIDIA’s reverse rendering neural nets!

Food for thought…
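The DCT idea can be prototyped directly. Here is a toy sketch (my own construction, not from the paper) of the JPEG-style representation such a model would consume: an orthonormal 8x8 DCT-II, with a crude "quantisation" that zeroes high-frequency coefficients.

```python
import numpy as np

N = 8  # JPEG-style 8x8 block
k = np.arange(N)

# orthonormal DCT-II basis, as used (up to scaling) in JPEG-style codecs
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
C[0] *= 1.0 / np.sqrt(2.0)

rng = np.random.default_rng(0)
block = rng.random((N, N))
coeffs = C @ block @ C.T           # 2-D DCT of one block

# energy compacts into low frequencies; zero the rest as crude "quantisation"
mask = np.add.outer(k, k) < 6      # keep a low-frequency triangle
quantised = np.where(mask, coeffs, 0.0)
recon = C.T @ quantised @ C        # inverse transform, for comparison
```

A model trained on `quantised` blocks would see roughly the compression ratio's worth of fewer nonzero inputs, at the cost of whatever the codec deemed perceptually irrelevant.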


You'd think this, but RGB color is ~meaningless and all the image stuff works great. I long for a model trained in a perceptual space, and yet I doubt it'll matter.


It's not faithfully representative of the post-processing performed by the average person's visual system, but that doesn't make it meaningless by any stretch of the imagination.


Intuitively, it’s meaningless. What’s brighter, 255 G or 255 R?


That seems much more like "incomplete" than "meaningless."

There's enough meaning to answer this question, no: "what's greener, 255 G or 255 R"? Or "what's greener, #FF6600 or #FF0066?"


what does it mean to have "more green"? what is this "green"?



That's what color science calls viewing conditions: it's sort of like answering "in a pitch-black room, the same!"

Let's assume the same viewing conditions, i.e. it's not a trick question: the answer is green, by about 30% IIRC.


Green is brighter in all typical colorspaces. That's why it's the largest contribution to Y in the RGB->YUV conversion matrix.
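For concreteness, the BT.601 luma weights (one common RGB->YUV definition) put green well ahead:

```python
# BT.601 luma: Y = 0.299 R + 0.587 G + 0.114 B
luma = {"R": 0.299, "G": 0.587, "B": 0.114}

y_red = luma["R"] * 255    # full-intensity red
y_green = luma["G"] * 255  # full-intensity green
# at equal channel value, green carries nearly twice the luma of red
```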


In terms of processing raw bytes, a more relevant application is inference on raw camera data, bypassing the ISP. This can reduce latency in camera inference pipelines.



