Rage Against the Machine Learning

Woah, it’s been a while since I actually made a blog post. I’m going to try to make a blog post every day until I head off for my Facebook internship in June (super psyched!). I’m putting it in writing, so that’s how you know I’m definitely going to break my word…

I owe the above title to Bryan Ford, who listened to me raging against Machine Learning this afternoon. Bryan is visiting for our first systems lunch in a while. It’s been a long hiring season…

Anyway, back to the complaint at hand. Today was the final machine learning class of the semester and people were presenting their projects. One of the groups wasn’t really a group, so much as two students working loosely on the same approach to different problems. At the end of their presentation, I asked what commonalities there were between the two projects, other than both using CRFs to model the problem. They said that they were both trying to show that some kind of regularization would cause improvement on the state of the art of their respective data sets. I then asked if they were using the same base code. The students responded that they were not: one was implemented in Matlab and the other in Scala. They had reasons for this (mainly, the Scala student uses Scala in his lab, and he seemed to have scared off the Matlab student from using Scala because of its “learning curve” — see, Scala, you seriously need some better PR). I was troubled that they weren’t using comparable backends, since this meant they would definitely not be able to generalize without modeling the impact of measurement on their results. I didn’t ask if they were implementing the state of the art in their systems, so as to make them comparable (although I should have). In the end, I thought maybe it wouldn’t have that much of an impact.

At the end of the class, I picked up my final “mini-project.” We were given some sample output to test against, and for my implementation of a one-vs-all multiclass classifier using a polynomial kernel, I was only able to get a test accuracy of 78.70%, versus the expected test accuracy of 87.3%. The TA (whom I don’t mean to criticize; he’s done a great job in this class, and I really appreciate it!) had written “too low, but I couldn’t find errors in your code. May be some small bugs.” Now, I had originally implemented other parts of this project differently, and those versions had also scored too low. Those older versions of the code were written using for-loops. After porting my loops over to matrix multiplications, I was able to bring my test accuracy up to match the provided checkpoints. All this is to say that I strongly suspect the offending code (the one-vs-all multiclass classifier using a polynomial kernel) is semantically equivalent to a matrix-operation-only version, and that the matrix version would produce “better” results. This (of course!) is an exercise in dealing with numeric stability.
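
To make the suspicion concrete, here is a minimal Python sketch of how two "semantically equivalent" dot products can feed very different values into a polynomial kernel. The inputs are deliberately contrived and are not from my homework; the function names, the kernel degree, and the offset `c` are all assumptions chosen for illustration. The "careful" version uses `math.fsum` from the standard library as a stand-in for a more numerically careful routine, not as a claim about what any particular matrix library does internally.

```python
import math

def dot_loop(x, w):
    """Accumulate products left to right, as a plain for-loop would."""
    total = 0.0
    for xi, wi in zip(x, w):
        total += xi * wi
    return total

def dot_careful(x, w):
    """Same dot product, but summed with math.fsum to avoid losing low-order bits."""
    return math.fsum(xi * wi for xi, wi in zip(x, w))

def poly_kernel(dot, degree=2, c=1.0):
    """Polynomial kernel applied to a precomputed dot product (degree and c are illustrative)."""
    return (dot + c) ** degree

# Contrived inputs: the 1.0 is absorbed when it is added to 1e16 in a double.
x = [1e16, 1.0, -1e16]
w = [1.0, 1.0, 1.0]

loop_value = poly_kernel(dot_loop(x, w))        # dot product comes out as 0.0
careful_value = poly_kernel(dot_careful(x, w))  # dot product comes out as 1.0

print(loop_value, careful_value)  # 1.0 4.0
```

Real feature vectors are rarely this pathological, but the kernel's exponent amplifies whatever discrepancy the accumulation introduces, and a one-vs-all classifier then takes an argmax over scores built from those values.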

The futility of comparisons?

So of course, looking at my homework and reflecting on my question for my colleagues, I felt confirmation: since we can’t even compare semantically equivalent computations when written in the same language, how can we compare across languages? Part of the reason why people use libraries like BLAS is that they are very efficient, but using them also has the effect of making results more consistent when comparing against other implementations. We don’t have faithful models of computation to provide meaningful comparisons between implementations when people use different floating point operations on different architectures. These issues are clearly tied to problems of reproducibility that have been getting a lot of buzz in our community lately.

Topical: Measurement

Immediately before class I had seen Andrew Gelman’s post on how we undervalue measurement in statistics. Although numerical analysis has been tackled by a lot of people who are way smarter than I am, it still causes problems for practitioners. With the rise of deep learning, big data, and data science, more and more people are using tools whose shortcomings are non-obvious. As a computer scientist, I am aware that floating point numbers are problematic, and that compounded floating point error can be a problem. I was still shocked by how much of a problem it was. I expected a small amount of variation, but was surprised that my result deviated as much as it did. Of course, the problem could still be in the code itself; after all, it isn’t statically verified. However, I still noticed differences in test accuracy for the other code that I converted from loops to matrix operations. These differences were on the order of a few percentage points. Although several times smaller than the difference in the kernelized one-vs-all classifier, a difference of two percentage points or so can actually make a result publishable (at least in some ML/NLP conferences I’ve seen).

I wonder if we would get some variation in the test accuracy using loops just from randomizing the layout of some junk under the surface. Lisps have been partially addressing numerical stability by having ratio types. There must be some languages that do some semi-symbolic computations, simplifying and deferring floating point operations until a value is needed. If this were true, they would be able to return some bound on the error of that computation.
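
As a rough illustration of the ratio-type idea, here is a small Python sketch using the standard library's `fractions.Fraction`. It is only a sketch of the concept, not a description of how any Lisp implements ratios: the helper names are hypothetical, and the input of ten copies of `0.1` is chosen purely for illustration. The exact rational arithmetic is deferred until the end, which lets us report exactly how far the plain floating point loop drifted.

```python
from fractions import Fraction

def float_sum(values):
    """Ordinary left-to-right floating point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def exact_sum(values):
    """Sum the same doubles exactly; Fraction(float) converts the binary value without rounding."""
    total = Fraction(0)
    for v in values:
        total += Fraction(v)
    return total

values = [0.1] * 10

approx = float_sum(values)   # rounds after every addition
exact = exact_sum(values)    # no rounding at all
rounding_error = abs(Fraction(approx) - exact)  # exact error of the float loop

# With a true ratio type, one tenth is exactly one tenth, so ten of them make one.
ratio_total = sum([Fraction(1, 10)] * 10, Fraction(0))

print(approx == 1.0)          # False
print(float(exact) == 1.0)    # True: rounding once, at the end, lands on 1.0
print(float(rounding_error))  # small but nonzero
print(ratio_total == 1)       # True
```

The catch is cost: rational numerators and denominators can grow without bound, so this is better suited to checking a computation than to running one at scale.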

Rise of the Machines

So is there a way to convert loops to matrix operations automatically? I asked John if he had ever heard of such a thing and he said that there’s “vectorization” (a term I must have heard before, but have forgotten due to my advanced age), but he isn’t aware of any compiler optimizations for as high a level as I was suggesting. Since the Julia folks seem to be at the forefront of numerical computing right now (or are at least getting a lot of inflated press about it right now), I thought I’d look into what they have to say, but it doesn’t look like they offer anything like this. Are there constrained cases where we could do this? Maybe there’s an FFTW-style meta-programming approach someone could employ.

11 thoughts on “Rage Against the Machine Learning”

  3. Punjabi

    Interesting read! Your experience really highlights an under-discussed aspect of machine learning: how seemingly minor implementation details can lead to vastly different outcomes. I wonder if standardizing code structures or even languages for ML models would help improve reproducibility, especially in academic settings. Have you looked into tools like Docker or Singularity for environment consistency?
    Punjabi from https://punjabimonthname.com/

