• 0 Posts
  • 123 Comments
Joined 1 year ago
Cake day: July 8, 2023

  • Oh yes, let me just contact the manufacturer for this appliance and ask them to update it to support automated certificate renewa–

    What’s that? “Device is end of life and will not receive further feature updates?” Okay, let me ask my boss if I can replace i–

    What? “Equipment is working fine and there is no room in the budget for a replacement?” Okay, then let me see if I can find a workaround with existing equipme–

    Huh? “Requested feature requires updating subscription to include advanced management capabilities?” Oh, fuck off…


  • I keep thinking of the anticapitalist manifesto that a spinoff team from the Disco Elysium developers dropped, and this part in particular stands out to me and helps crystallize exactly why I don’t like AI art:

    All art is communication — dialogue across time, space and thought. In its rawest, it is one mind’s ability to provoke emotion in another. Large language models — simulacra, cold comfort, real-doll pocket-pussy, cyberspace freezer of an abandoned IM-chat — which are today passed off for “artificial intelligence”, will never be able to offer a dialogue with the vision of another human being.

    Machine-generated works will never satisfy or substitute the human desire for art, as our desire for art is in its core a desire for communication with another, with a talent who speaks to us across worlds and ages to remind us of our all-encompassing human universality. There is no one to connect to in a large language model. The phone line is open but there’s no one on the other side.




  • Did you read the article, or the actual research paper? They present a mathematical proof that any hypothetical method of training an AI that produces an algorithm that performs better than random chance could also be used to solve a known intractable problem, which is impossible with all known current methods. This means that any algorithm we can produce that works by training an AI would run in exponential time or worse.

    The paper authors point out that this also has severe implications for current AI, too–since the current AI-by-learning method that underpins all LLMs is fundamentally NP-hard and (unless P = NP) can’t run in polynomial time, “the sample-and-time requirements grow non-polynomially (e.g. exponentially or worse) in n.” They present a thought experiment of an AI that handles a 15-minute conversation, assuming 60 words are spoken per minute (keep in mind the average is roughly 160). The input size for this conversation would be n = 60*15 = 900 words. The authors then conclude:

    “Now the AI needs to learn to respond appropriately to conversations of this size (and not just to short prompts). Since resource requirements for AI-by-Learning grow exponentially or worse, let us take a simple exponential function O(2^n) as our proxy of the order of magnitude of resources needed as a function of n. 2^900 ∼ 10^270 is already unimaginably larger than the number of atoms in the universe (∼10^81). Imagine us sampling this super-astronomical space of possible situations using so-called ‘Big Data’. Even if we grant that billions of trillions (10^21) of relevant data samples could be generated (or scraped) and stored, then this is still but a miniscule proportion of the order of magnitude of samples needed to solve the learning problem for even moderate size n.”
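
If you want to sanity-check the arithmetic in that quote, here is a rough sketch. It only reproduces the numbers the authors give (n = 900, the 2^n proxy, 10^81 atoms, 10^21 samples); the variable names are mine and nothing here comes from the paper's own code.

```python
import math

# Thought experiment from the quote: 15 minutes at 60 words per minute.
n = 15 * 60

# The paper's proxy for resource needs, 2^n, computed exactly with big ints.
resources = 2 ** n

digits = len(str(resources))      # decimal digits in 2^900
exponent = n * math.log10(2)      # log10(2^900)

atoms_exp = 81                    # quoted estimate: ~10^81 atoms in the universe
samples_exp = 21                  # quoted estimate: 10^21 data samples

print(f"n = {n}")
print(f"2^{n} has {digits} digits, i.e. roughly 10^{exponent:.1f}")
print(f"orders of magnitude past the atom count: {exponent - atoms_exp:.0f}")
print(f"10^{samples_exp} samples cover about 10^-{exponent - samples_exp:.0f} of that space")
```

So the quoted 10^270 is, if anything, rounded down slightly, and 10^21 samples would still cover a vanishingly small fraction of that space.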

    That’s why LLMs are a dead end.


  • When IT folks say devs don’t know about hardware, they’re usually talking about the forest-level overview in my experience. Stuff like how the software being developed integrates into an existing environment and how to optimize code to fit within the bounds of reality–it may be practical to dump a database directly into memory when it’s a 500 MB testing dataset on your local workstation, but it’s insane to do that with a 500+ GB database in a production environment. Similarly, a program may run fine when it’s using an NVMe SSD, but lots of environments even today still depend on arrays of traditional electromechanical hard drives because they offer the most capacity per dollar, and aren’t as prone as flash media to suddenly tombstoning when they die. Suddenly, once the program is in production, it turns out that same program’s making a bunch of random I/O calls that could be optimized into a more sequential request or batched together into a single transaction, and now it runs like dogshit and drags down every other VM, container, or service sharing that array with it. That’s not accounting for the real dumb shit I’ve read about, like “dev hard coded their local IP address and it breaks in production because of NAT” or “program crashes because it doesn’t account for network latency.”
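
To make the "batched together into a single transaction" bit concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, the row data, and the function names are all made up for illustration, and the actual gap depends entirely on the storage underneath, so treat the timings as something to measure on your own hardware, not a benchmark.

```python
import os
import sqlite3
import tempfile
import time

# Hypothetical sensor readings, purely for illustration.
ROWS = [(i % 8, i * 0.5) for i in range(500)]

def open_db(path):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor INTEGER, value REAL)")
    conn.commit()
    return conn

def write_row_by_row(conn, rows):
    # One transaction per row: every commit forces its own flush to storage.
    for row in rows:
        conn.execute("INSERT INTO readings VALUES (?, ?)", row)
        conn.commit()

def write_batched(conn, rows):
    # One transaction for the whole batch: the connection context manager
    # commits once on success and rolls back if an exception is raised.
    with conn:
        conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

def timed(write, rows):
    # Use a real file so commits actually hit the storage layer.
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_db(os.path.join(tmp, "demo.sqlite3"))
        start = time.perf_counter()
        write(conn, rows)
        elapsed = time.perf_counter() - start
        count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
        conn.close()
    return count, elapsed

slow_count, slow_s = timed(write_row_by_row, ROWS)
fast_count, fast_s = timed(write_batched, ROWS)
print(f"row-by-row: {slow_count} rows in {slow_s:.3f}s")
print(f"batched:    {fast_count} rows in {fast_s:.3f}s")
```

Both versions write the same 500 rows; the only difference is how many times the program asks the disk to make the data durable, which is exactly the kind of thing that goes unnoticed on a dev workstation and hurts on a shared spinning-disk array.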

    Game dev is unique because you’re explicitly targeting a single known platform (for consoles) or targeting an extremely wide range of performance specs (for PC), and hitting an acceptable level of performance pre-release is (somewhat) mandatory, so this kind of mindfulness is drilled into devs much more heavily than it is in business software dev, especially in-house dev. Business development is almost entirely focused on “does it run without failing catastrophically” and almost everything else–performance, security, cleanliness, resource optimization–is given bare lip service at best.


  • Fine, you win, I misunderstood. I still disagree with your actual point, however. To me, intelligence implies the ability to learn in real time, to adapt to changes in circumstance, and to improve itself. Once an LLM is trained, it is static and unchanging until you re-train it with new data and update the model. Even if you strip out the sapience/consciousness-related stuff like the ability to think critically about a scenario, proactively make decisions, etc., an LLM is only capable of regurgitating facts and responding to its immediate input. By design, any “learning” it can do is forgotten the instant the session ends.


  • The commercial aspect of the reproduction is not relevant to whether it is an infringement–it is simply a factor in damages and Fair Use defense (an affirmative defense that presupposes infringement).

    What you are getting at when it applies to this particular type of AI is effectively whether it would be a fair use, presupposing there is copying amounting to copyright infringement. And what I am saying is that, ignoring certain stupid behavior like torrenting a shit ton of text to keep a local store of training data, there is no copying happening as a matter of necessity. There may be copying as a matter of stupidity, but it isn’t necessary to the way the technology works.

    You’re conflating whether something is infringement with defenses against infringement. Believe it or not, basically all data transfer and display of copyrighted material on the Internet is technically infringing. That includes the download of a picture to your computer’s memory for the sole purpose of displaying it on your monitor. In practice, nobody ever bothers suing art galleries, social media websites, or web browsers, because they all have ironclad defenses against infringement claims: art galleries & social media include a clause in their TOS that grants them a license to redistribute your work for the purpose of displaying it on their website, and web browsers have a basically bulletproof fair use claim. There are other non-infringing uses such as those which qualify for a compulsory license (e.g. live music productions, usually involving royalties), but they’re largely not very relevant here. In any case, the fundamental point is that any reproduction of a copyrighted work is infringement, but there are varied defenses against infringement claims that mean most infringing activities never see a courtroom in practice.

    All this gets back to the original point I made: Creators retain their copyright even when uploading data for public use, and that copyright comes with heavy restrictions on how third parties may use it. When an individual uploads something to an art website, the website is free and clear of any claims for copyright infringement by virtue of the license granted to it by the website’s TOS. In contrast, an uninvolved third party–e.g. a non-registered user or an organization that has not entered into a licensing agreement with the creator or the website (*cough* OpenAI)–has no special defense against copyright infringement claims beyond the baseline questions: was the infringement for personal, noncommercial use? And does the infringement qualify as fair use? Individual users downloading an image for their private collection are mostly A-OK, because the infringement is done for personal & noncommercial use–theoretically someone could sue over it, but there would have to be a lot of aggravating factors for it to get beyond summary judgment. AI companies using web scrapers to download creators’ works do not qualify as personal/noncommercial use, for what I hope are bloody obvious reasons.

    As for a model trained purely for research or educational purposes, I believe that it would have a very strong claim for fair use as long as the model is not widely available for public use. Once that model becomes publicly available, and/or is leveraged commercially, the analysis changes, because the model is no longer being used for research, but for commercial profit. To apply it to the real world, when OpenAI originally trained ChatGPT for research, it was on strong legal ground, but when it decided to start making it publicly available, it should have thrown out its training dataset and built up a new one using data in the public domain and data that it had negotiated a license for, trained ChatGPT on the new dataset, and then released it commercially. If they had done that, and if individuals had been given the option to opt their creative works out of this dataset, I highly doubt that most people would have any objection to LLMs from a legal standpoint. Hell, they probably could have gotten licenses to use most websites’ data to train ChatGPT for a song. Instead, they jumped the gun and tipped their hand before they had all their ducks in a row, and now everybody sees just how valuable their data is to OpenAI and is pricing it accordingly.

    Oh, and as for your edit, you contradicted yourself: in your first line, you said “The commercial aspect of the reproduction is not relevant to whether it is an infringement.” In your edit, you said “the infringement happens when you reproduce the images for a commercial purpose.” So which is it? (To be clear, the initial download is infringing copyright both when I download the image for personal/noncommercial use, and also when I download it to make T-shirts with. The difference is that the first case has a strong defense against an infringement claim that would likely get it dismissed in summary, while the case of making T-shirts would be a straightforward claim of infringement.)


  • That factor is relative to what is reproduced, not to what is ingested. A company is allowed to scrape the web all they want as long as they don’t republish it.

    The work is reproduced in full when it’s downloaded to the server used to train the AI model, and the entirety of the reproduced work is used for training. Thus, they are using the entirety of the work.

    I would argue that LLMs devalue the author’s potential for future work, not the original work they were trained on.

    And that makes it better somehow? Aereo got sued out of existence because their model threatened the retransmission fees that broadcast TV stations were being paid by cable TV providers. There wasn’t any devaluation of broadcasters’ previous performances; the entire harm they presented was in terms of lost revenue in the future. But hey, thanks for agreeing with me?

    Again, that’s the practice of OpenAI, but not inherent to LLMs.

    And again, LLM training so egregiously fails two out of the four factors for judging a fair use claim that it would fail the test entirely. The only difference is that OpenAI is failing it worse than other LLMs.

    It’s honestly absurd to try and argue that they’re not transformative.

    It’s even more absurd to claim something that is transformative automatically qualifies for fair use.



  • FFS, the issue is not that the AI model “copies” the copyrighted works when it trains on them–I agree that after an AI model is trained, it does not meaningfully retain the copyrighted work. The problem is that the reproduction of the copyrighted work–i.e. downloading the work to the computer, and then using that reproduction as part of AI model training–is being done for a commercial purpose that infringes copyright.

    If I went to DeviantArt and downloaded a random piece of art to my hard drive for my own personal enjoyment, that is a non-infringing reproduction. If I then took that same piece of art, and uploaded it to a service that prints it on a T-shirt, the act of uploading it to the T-shirt printing service’s server would be infringing, since it is no longer being reproduced for personal enjoyment, but is instead an unlawful reproduction of copyrighted material for a commercial purpose. Similarly, if I downloaded a piece of art and used it to print my own T-shirts for sale, using all my own computers and equipment, that would also be infringing. This is straightforward, non-controversial copyright law.

    The exact same logic applies to AI training. You can try to camouflage the infringement with flowery language like “mere extraction of relationships between components,” but the purpose and intent behind AI companies reproducing copyrighted works via web scraping and downloading copyrighted data to their servers is to build and provide a commercial, for-profit service that is designed to replace the people whose work is being infringed. Full stop.


  • They literally do not pass the criteria. LLMs use the entirety of a copyrighted work for their training, which fails the “amount and substantiality” factor. By their very nature, LLMs would significantly devalue the work of every artist, author, journalist, and publishing organization, on an industry-wide scale, which fails the “Effect upon work’s value” factor.

    Those two alone would be enough for any sane judge to rule that training LLMs would not qualify as fair use, but then you also have OpenAI and other commercial AI companies offering the use of these models for commercial, for-profit purposes, which also fails the “Purpose and character of the use” factor. You could maybe argue that training LLMs is transformative, but the commercial, widespread nature of this infringement would weigh heavily against that. So that’s at least two, and arguably three out of four factors where it falls short.



  • unique

    “unique new IP right?” Bruh you’re talking about basic fucking intellectual property law. Just because someone posts something publicly on the internet doesn’t mean that it can be used for whatever anybody likes. This is so well-established that every major art gallery and social media website has a clause in their terms of service stating that you are granting them a license to redistribute that content. And most websites also explicitly state that when you upload your work to their site, you still retain your copyright of that work.

    For example (emphasis mine):

    FurAffinity:

    4.1 When you upload content to Fur Affinity via our services, you grant us a non-exclusive, worldwide, royalty-free, sublicensable, transferable right and license to use, host, store, cache, reproduce, publish, display (publicly or otherwise), perform (publicly or otherwise), distribute, transmit, modify, adapt, and create derivative works of, that content. These permissions are purely for the limited purposes of allowing us to provide our services in accordance with their functionality (hosting and display), improve them, and develop new services. These permissions do not transfer the rights of your content or allow us to create any deviations of that content outside the aforementioned purposes.

    Inkbunny:

    Posting Content

    You keep copyright of any content posted to Inkbunny. For us to provide these services to you, you grant Inkbunny non-exclusive, royalty-free license to use and archive your artwork in accordance with this agreement.

    When you submit artwork or other content to Inkbunny, you represent and warrant that:

    * you own copyright to the content, or that you have permission to use the content, and that you have the right to display, reproduce and sell the content. You license Inkbunny to use the content in accordance with this agreement;

    DeviantArt:

    1. Copyright in Your Content

    DeviantArt does not claim ownership rights in Your Content. For the sole purpose of enabling us to make your Content available through the Service, you grant DeviantArt a non-exclusive, royalty-free license to reproduce, distribute, re-format, store, prepare derivative works based on, and publicly display and perform Your Content. Please note that when you upload Content, third parties will be able to copy, distribute and display your Content using readily available tools on their computers for this purpose although other than by linking to your Content on DeviantArt any use by a third party of your Content could violate paragraph 4 of these Terms and Conditions unless the third party receives permission from you by license.

    e621:

    When you upload content to e621 via our services, you grant us a non-exclusive, worldwide, royalty-free, sublicensable, transferable right and license to use, host, store, cache, reproduce, publish, display (publicly or otherwise), perform (publicly or otherwise), distribute, transmit, downsample, convert, adapt, and create derivative works of, that content. These permissions are purely for the limited purposes of allowing us to provide our services in accordance with their functionality (hosting and display), improve them, and develop new services. These permissions do not transfer the rights of your content or allow us to create any deviations of that content outside the aforementioned purposes.

    Xitter:

    Your Rights and Grant of Rights in the Content

    You retain your rights to any Content you submit, post or display on or through the Services. What’s yours is yours — you own your Content (and your incorporated audio, photos and videos are considered part of the Content).

    By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods now known or later developed (for clarity, these rights include, for example, curating, transforming, and translating). This license authorizes us to make your Content available to the rest of the world and to let others do the same.

    Facebook:

    The permissions you give us We need certain permissions from you to provide our services:

    • Permission to use content you create and share: Some content that you share or upload, such as photos or videos, may be protected by intellectual property laws.

    • You retain ownership of the intellectual property rights (things like copyright or trademarks) in any such content that you create and share on Facebook and other Meta Company Products you use. Nothing in these Terms takes away the rights you have to your own content. You are free to share your content with anyone else, wherever you want.

    • However, to provide our services we need you to give us some legal permissions (known as a “license”) to use this content. This is solely for the purposes of providing and improving our Products and services as described in Section 1 above.

    • Specifically, when you share, post, or upload content that is covered by intellectual property rights on or in connection with our Products, you grant us a non-exclusive, transferable, sub-licensable, royalty-free, and worldwide license to host, use, distribute, modify, run, copy, publicly perform or display, translate, and create derivative works of your content (consistent with your privacy and application settings). This means, for example, that if you share a photo on Facebook, you give us permission to store, copy, and share it with others (again, consistent with your settings) such as Meta Products or service providers that support those products and services. This license will end when your content is deleted from our systems.

    I could go on, but I think I’ve made my point very clear: Every social media website and art gallery is built on the assumptions that A) the person uploading art retains the copyright over the items they upload, B) other people and organizations have NO rights to copyrighted works unless explicitly stated otherwise, and C) 3rd parties accessing this material do not have any rights to uploaded works, since they never negotiated a license to use these works.


  • Bear in mind that training AI does not involve copying content into its database, so copyright is not an issue.

    Wrong. The infringement is in obtaining the data and presenting it to the AI model during the training process. It makes no difference that the original work is not retained in the model’s weights afterwards.

    You can train AI in a book and it will give you information from the book - information is not copyrightable. You can read a book a talk about its contents on TV - not illegal if you’re a human, should it be illegal if you’re a machine?

    Yes, because copyright law is intended to benefit human creativity.

    If you try to outlaw Automating this process by computers, there will be side effects such as search engines will no longer be able to index data.

    Wrong. Search engines retain a minimal amount of the indexed website’s data, and the purpose of the search engine is to generate traffic to the website, providing benefit for both the engine and the website (increased visibility, the opportunity to show ads to make money). Banning the use of copyrighted content for AI training (which uses the entire copyrighted work and whose purpose is to replace the organizations whose work is being used) will have no effect.



  • This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages.

    Like fuck it is. An LLM “learns” by memorization and by breaking down training data into their component tokens, then calculating the weight between these tokens. This allows it to produce an output that resembles (but may or may not perfectly replicate) its training dataset, but produces no actual understanding or meaning–in other words, there’s no actual intelligence, just really, really fancy fuzzy math.

    Meanwhile, a human learns by memorizing training data, but also by parsing the underlying meaning and breaking it down into the underlying concepts, and then by applying and testing those concepts, and mastering them through practice and repetition. Where an LLM would learn “2+2 = 4” by ingesting tens or hundreds of thousands of instances of the string “2+2 = 4” and calculating a strong relationship between the tokens “2+2,” “=,” and “4,” a human child would learn 2+2 = 4 by being given two apple slices, putting them down next to another pair of apple slices, and counting the total number of apple slices to see that they now have 4 slices. (And then being given a treat of delicious apple slices.)

    Similarly, a human learns to draw by starting with basic shapes, then moving on to anatomy, studying light and shadow, shading, and color theory, all the while applying each new concept to their work, and developing muscle memory to allow them to more easily draw the lines and shapes that they combine to form a whole picture. A human may learn off other people’s drawings during the process, but at most they may process a few thousand images. Meanwhile, an LLM learns to “draw” by ingesting millions of images–without obtaining the permission of the person or organization that created those images–and then breaking those images down to their component tokens, and calculating weights between those tokens. There’s about as much similarity between how an LLM “learns” and how a human learns as there is between my cat and my refrigerator.

    And YET FUCKING AGAIN, here’s the fucking Google Books argument. To repeat: Google Books used a minimal portion of the copyrighted works, and was not building a service to compete with book publishers. Generative AI is using the ENTIRE COPYRIGHTED WORK for its training set, and is building a service TO DIRECTLY COMPETE WITH THE ORGANIZATIONS WHOSE WORKS THEY ARE USING. They have zero fucking relevance to one another as far as claims of fair use. I am sick and fucking tired of hearing about Google Books.

    EDIT: I want to make another point: I’ve commissioned artists for work multiple times, featuring characters that I designed myself. And pretty much every time I have, the art they make for me comes with multiple restrictions: for example, they grant me a license to post it on my own art gallery, and they grant me permission to use portions of the art for non-commercial uses (e.g. cropping a portion out to use as a profile pic or avatar). But they all explicitly forbid me from using the work I commissioned for commercial purposes–in other words, I cannot slap the art I commissioned on a T-shirt and sell it at a convention, or make a mug out of it. If I did so, that artist would be well within their rights to sue the crap out of me, and artists charge several times as much to grant a license for commercial use.

    In other words, there is already well-established precedent that even if something is publicly available on the Internet and free to download, there are acceptable and unacceptable use cases, and it’s broadly accepted that using other people’s work for commercial use without compensating them is not permitted, even if I directly paid someone to create that work myself.



  • Fucking Christ I am so sick of people referencing the Google Books lawsuit in any discussion about AI.

    The publishers lost that case because the judge ruled that Google Books was copying a minimal portion of the books, and that Google Books was not competing against the publishers, thus the infringement was ruled as fair use.

    AI training does not fall under this umbrella, because it’s using the entirety of the copyrighted work, and the purpose of this infringement is to build a direct competitor to the people and companies whose works were infringed. You may as well talk about OJ Simpson’s criminal trial, it’s about as relevant.