
10 things a library needs

When I first started working on httpx, the mission was clear: make an easy-to-use open-source HTTP client that could support both legacy and future versions of the protocol, and all, or at least most, of its quirks.

However, developing a library which can potentially be used and improved by several people I'll never meet personally, all working on different setups and solving different problems and use-cases, made me think about which key aspects could keep community interest and participation high and levels of frustration low, while not placing too much burden on me (remember, this is not my day job).

Having actively contributed to several open-source projects, I've experienced the pain of creating reproducible scripts for my issues, building a project locally, reluctantly writing in a project maintainer's preferred flavour or philosophy of the language, or getting significant contributions rejected for missing the specifics of what the project goals were. A combination of some or all of these factors has contributed to an overall decrease in community interest in, and in some cases the eventual abandonment of (by the community or the maintainer(s)), some of the projects I've contributed to.

(Disclaimer: I’m not saying that, by addressing all of these, your project will be immediately successful, but it will certainly be healthier).

Not having all the time in the world to maintain httpx, or any of my other projects, I set myself a list of side-goals, with a focus on:

  • Make project onboarding easy;
  • User perspective first;
  • Help the user communicate their issues with the right level of detail;
  • Follow the standard style of the language (as unopinionated as possible);
  • Set your main project goals, and stick to them;

Here’s 10 things I did to accomplish these in httpx.

1. Test what the users will use.

As a veteran rubyist, I'm a big believer in TDD. However, that didn't help me much early in the project's life, when I was still trying to figure out how to implement the internals of the library. Things changed so drastically that, had I TDD'd my early HTTP connection implementations, the effort of changing tests every time I changed the API would have resulted in a lot of redundant work: instead of focusing on getting the right API, I'd have been more concerned with reducing the time spent rewriting tests. And all of that for something which is private API.

So, although TDD is a valuable practice, one can't fall into the fallacy of "TDD all the things"; rather, one should test what one will really use, and make sure that those tests cover the internals well enough that you feel confident about the overall outcome.

My first “test” was actually a one-liner that you can still see in the project’s main page:

HTTPX.get("https://news.ycombinator.com")

This means that my approach was to use the .get call as my MVP, and after making it work (both for HTTP/1 and HTTP/2), I could start extending it to other HTTP verbs, different kinds of request bodies, etc., and make it work for all versions of the protocol and transport modes, support "100-continue" flows, and so on. So at first glance the bulk of httpx's test suite could be perceived as "integration tests"; however, the philosophy here is: if I can make it work in my tests, the user will most likely make it work as well.
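
To make it concrete, a user-level test in that spirit can be as small as this sketch (the status accessor on the response is assumed, not prescribed):

require "minitest/autorun"
require "httpx"

# exercise the same one-liner a user would write, against the public API only.
class SmokeTest < Minitest::Test
  def test_get_over_https
    response = HTTPX.get("https://news.ycombinator.com")

    assert response.status == 200, "expected a 200 response, got #{response.status}"
  end
end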

Do make sure that your tests cover a significant amount of the code you wrote, though, and make code coverage a variable of your CI builds.
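
With simplecov (one of the development dependencies mentioned later), that can be as simple as this sketch, where the threshold is hypothetical:

# test_helper.rb — start coverage before loading the library, and fail the
# test run if overall coverage drops below a (hypothetical) threshold.
require "simplecov"
SimpleCov.start do
  minimum_coverage 90
end

require "httpx"
require "minitest/autorun"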

2. Integration tests, predictable CI builds.

An HTTP client needs an HTTP server.

From my experience maintaining or contributing to network-focused libraries (including other HTTP ones), I identified two kinds of test strategies: the ones that "mock" network communication, and the ones that rely on the network. Both approaches have advantages, and both come with their own set of disadvantages:

You mock the network because:

  • You use OS sockets, so your focus is not on testing them;
  • You want to test only the particular things you implement (like encoding/decoding, error handling…);
  • You want your tests to run quickly;
  • You want a predictable test build;

however, what you get is:

  • Your code might not handle all network failure scenarios;
  • Edge-case uses of your APIs will be very unreliable;
  • Parsers might not handle incomplete payloads;
  • Your tests might be quick, but they won’t assure anyone about the big picture;

You test with the network because:

  • You want to make sure that everything works end-to-end;
  • You want to tackle most edge cases and failure scenarios;

however, what you get is:

  • The network is not reliable;
  • You might get rate-limited, or worse, blocked by the peer you test against;
  • Peer might be down (server down, DDoS from China…);
  • Your CI will fail often, and you’ll not take it seriously;

Mocking the network was never an option for httpx. I decided to go with the following method:

  • httpbin to test most of HTTP features and quirks;
  • nghttp2.org as a proxy to httpbin that could do HTTP/1, HTTP/2, server push and h2c negotiation.
  • www.sslproxies.org to get an available HTTP or Connect proxy to nghttp2.org;
  • www.socks-proxy.net with the same intent (but testing SOCKS proxy connections);

However, I knew that nghttp2.org could be down, DNS could fail, httpbin could change its endpoints, there could be no available proxy, proxies could time out on handshake (when chosen at random, sometimes they're on the other side of the world), and all sorts of possible combinations of these could make the CI builds fail well over 50% of the time.

Alas, I wanted the CI builds to be reliable and reproducible. How could I do that and keep my integration tests? The answer I found was docker.

docker to the rescue

A common use of docker setups is to expose service base images to avoid installing them locally, like databases. By adapting this approach, I could bake all my external services into their own containers, link them through docker-compose.yml, and run my tests inside the local docker bridge network.

This is how I came to the idea behind the CI pipeline which has been running the builds to this day. It took some time and tweaks to make it work reliably, but now my integration tests run 99% of the time with no worries about network failures, peer availability or versioning. It was eventually extended to also deploy the project’s main website. It’s one of the things I’m most proud of in the project.

P.S.: There are exceptions to the rule, of course. I couldn’t test all the HTTP features I wanted with httpbin (like brotli encoding), so I’m using available peers on the internet. And I’m also resorting to mocking to test the DNS over HTTPS feature, until I come up with a plan for running it as a service in this setup. On the other hand, I’m only testing the SSH tunneling feature when running inside docker, as it’d otherwise be pretty hard to set up. There’s no free lunch.

Another nice benefit from this setup is: if contributors have docker and docker-compose installed, they can use the same setup to develop locally, which brings me to the next point…

3. Make development setup easy

One of the main reasons mentioned for not contributing to open-source projects is that it takes a lot of effort to set up the development environment. For some, it comes with the territory, e.g. linux has a manual about its development tools.

Rails, to use an example from ruby, relies on a lot of services being installed and available in order to run its test suite, like DBs for activerecord (mysql, postgres, …), or node/yarn since webpacker became a thing (there are more). It maintains a separate project to set up its development environment, which relies on vagrant (I assume this was made before docker became a thing). Without a doubt, Rails requires significant cognitive load from a potential contributor. Of course, as a “batteries included” project, there’s not a lot Rails can do in that department.

Smaller projects, however, have their own sets of requirements: a lot of them rely directly or indirectly on gcc being available to compile extensions, some require git or bash (even if they don’t use git for more than git ls-files; blame bundle gem for that); some require the network to be available, even if all they do is calculate primes.

All of these things are potential hindrances to a first-time contributor, who might never come back after having tried and failed to contribute.

Always include a section in your wiki / manuals / documentation describing how one can set up a development environment. Whenever possible, automate what the user has to do. Whenever necessary, explain to the user why they need it. httpx provides such a section in the wiki.

4. Make style standard and hassle-free

If you’re using go, this is not even a debate. Between its small set of features, its limited (not limiting) API, and code that isn’t considered acceptable until you run it through gofmt, go projects seem to be written by the same person.

Not ruby though. It’s a very big language. Object, the base class of all ruby objects, has 56 instance methods. Integer adds 62 more. Array adds another 120. And that’s just the primitive types. A lot of them do the same thing. Ruby provides many ways to skin a cat, and everyone has a particular opinion about which to use, and usually carries it into their personal and professional projects, which becomes a point of friction when onboarding into new projects, open source or not. Don’t get me wrong, it’s my language of choice, but it fails at preventing a lot of pointless discussions. I’ve lost count of the times I’ve had contributions rejected because I used tabs instead of spaces, << instead of concat, or the ternary operator instead of if/else…

So when I started creating projects, I decided to do the exact same thing and implement my own flavour of ruby, until I realised: that’s not how you build a community! So I went where other projects had gone before, to find some balance.

rubocop

By now, all of you working with ruby have heard of rubocop. It’s the de-facto ruby linter, and its features go beyond plain linting. Its default configuration follows the ruby style guide, and even if that guide is not consensual (the ruby core team disputes whether it is truly standard), it’s still a benchmark that brings some order to the chaos of developing ruby projects collaboratively.

And so I adopted it, and made a significant effort not to add many rules to my rubocop configs beyond the strictly necessary. By following the ruby style guide, whether or not I agree with it personally, I’m reducing noise. And if contributors still have their preferences, they can happily develop in their own style, and then rubocop -a their way to a clean and mergeable change request I can work with.

5. Kill your dependencies

You might recognize this from Mike Perham’s post of the same name. It’s a cautionary tale about the cost one incurs when adopting a dependency one doesn’t own, and how hard it might be to pay that debt later. It’s easier said than done (in some cases, building and maintaining certain functionality might just be too big a burden compared to the publicly available alternative), but sidekiq’s own changelog is a story of the benefits one can potentially reap when going down this road.

As a former contributor to the celluloid ecosystem, I can still remember the disappointment I felt when sidekiq removed it as a core dependency, as I felt it was a disservice to the dependency that made celluloid viable in the first place.

Only as I matured did I realize what was going on: celluloid was holding sidekiq back, with its constant API-breaking changes, lack of scope, subtle memory leaks and unstable features. Little by little, every code path dependent on a celluloid feature was rewritten, until the actor performing jobs was a glorified thread worker. The removal of this dependency was a great success, especially if you consider the current “abandonware” status of celluloid.

HTTP parsers

httpx went through a similar discovery path: during its inception, both its HTTP/1 and HTTP/2 parsers were external dependencies (http_parser.rb and http-2, respectively).

http_parser.rb was the first one to be removed. The decision to include it was made because it seemed to be working fine as httprb’s parser, and I didn’t want to write a parser from scratch. However, as I kept developing around its flaws, I realized that 1) it was massively outdated (it was based on an old version of node’s HTTP parser), 2) it didn’t support all the features I wanted, 3) it was buggy, 4) there was no full parity between the C and Java HTTP parsers (so I couldn’t guarantee JRuby support), and 5) both http_parser.rb and the Java parser it was based on were barely maintained (the Java parser was pretty much abandonware by the time I started using it). By the time I was developing yet another workaround for a parser misfeature, I knew it was time to remove that dependency. So I built my own HTTP/1 parser from scratch, in pure ruby, supporting all the HTTP/1 quirks I wanted. And I never had to think about HTTP/1 parsing again.

(P.S.: httprb has since dropped http_parser.rb for the same reasons and because of the amount of issues it generated, only to replace it with another dependency called http-parser, an FFI binding for a more recent version of the same node HTTP parser. They still get parsing-related issues they can’t easily fix by themselves.)

http-2 should no longer be a dependency by the time this post goes public. There aren’t a lot of HTTP/2 parsers available for ruby, and this pure ruby implementation has the benefit of being very readable and easily extensible, because, duh, it’s ruby. I’d been an active contributor until recently, when activity in the main repo kind of stalled (one of the still-open PRs in the project is mine, as of the publishing time of this article).

I started receiving bug reports which didn’t seem to come from httpx itself. After some investigation, I came to the conclusion that the parser was the issue in some cases. Although a pretty interesting project, it never fully complied with the HTTP/2 spec, so I had to either accept that httpx would probably break in different ways for a not-so-small number of HTTP/2-compliant servers, or do something about it.

So I decided to fork http-2 and release it as http-2-next (no, I’m not writing an HTTP/2 parser from scratch, ahahahah). The result is a parser that passes all specs of the h2spec suite. It’s probably not going to stop there, but it’s pretty good for now.

So now I own the runtime dependencies httpx started with, and a lot of worries I used to have about external dependencies (API breakage, bugs not getting fixed, project abandonment) are no longer concerns.

6. Forwards compatibility

Not all dependencies are worth replacing, though.

Besides the already mentioned rubocop, httpx also depends on minitest, simplecov and pry, to name a few of the development/test dependencies I couldn’t possibly maintain on my own. Some of the plugins come with their own set of dependencies (brotli, http-cookie, http-form_data, faraday…). Although valuable, they still carry the same risks mentioned in 5.: What if the API changes? What if there’s a CVE reported? What if the project is abandoned?

But the main question is: how do I make sure that I stay compatible with newer versions of my dependencies? This is usually overlooked, as maintainers are kind of expected to keep track of the latest changelogs and announcements, until something breaks in production and suddenly quick-fixing is the goal. Some of us might feel too overwhelmed. This might, dare I say, be another cause of maintainer burnout.

How can one make sure that all these changes won’t catch up with the project? First, accept that you won’t be able to control all of this. But in my experience, a strategy that works out pretty well is to limit your exposure to potential changes in a library’s API: limit the number of features you rely on, and find a subset of APIs with a high probability of never changing.

minitest

Although it sells itself as a minimal test library, as with all things ruby, there’s nothing minimal about it. It comes with both the “test” and “spec” ways of writing your tests, it ships with many assertion helpers, and mocking is a very verbose task. So how do I limit my exposure?

Answer: just use assert.

Really, just use assert. It powers all the other (arguably) redundant assert/refute methods polluting the namespace. Its API is simple enough to never change (a boolean and an error message). Its origin can probably be traced back to JUnit’s assertTrue. Every case where you could use one of the assert helpers can be deconstructed and stripped down to a call to assert. Update minitest in 5 years, and you’re most likely guaranteed to still have assert available with the same signature.

assert is all you need. So, define your own helper methods using assert under the hood, and own them.

(P.S.: I do exceptionally use other features, but this is not the rule).
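
For instance, a sketch of what owning your helpers on top of assert can look like (the helper names and the response object's accessors are made up):

require "minitest/autorun"

# project-owned assertion helpers built only on assert, so that upgrading
# minitest never changes their behaviour; names and accessors are made up.
module CustomAssertions
  def verify_status(response, expected)
    assert response.status == expected,
           "expected status #{expected}, got #{response.status}"
  end

  def verify_header(response, name, value)
    assert response.headers[name] == value,
           "expected header #{name} to equal #{value.inspect}"
  end
end

class HTTPTest < Minitest::Test
  include CustomAssertions
end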

Ruby itself is another example where you can keep it simple by using proven APIs that haven’t changed in years. By limiting your exposure to experimental and controversial methods, you’ll be ensuring stability of your project for years to come, as ruby upgrades (looking at you, lonely operator).

7. Help users give the right feedback

One of the hardest parts of maintaining a project is deciphering a user’s error report.

A user of your library works in a completely different setup from yours. Not only will they feel frustrated if your library doesn’t work as expected, their patience will also be short. Most users will never report an issue, let alone find your bug tracker. Therefore, the ones arriving at your inbox are the ones who made it, and only a fraction of them will be able to articulate what went wrong in a meaningful way. How can you help the user give you a description of the problem you can actually work with? How can you avoid the ping-pong of question/answer that only makes the user more frustrated?

Github tried to solve this with templates. And most projects took them to such a level of detail that they’ve become a separate form no one has the time nor the desire to fill out. What version? What CPU? What browser? Templates can become just another filter limiting the pool of users who want to reach out to you.

Asking for a stacktrace from the get-go can be invaluable. Some users struggle, but most of them know how to get one. Asking for a reproducible script might help, but sometimes the error lies so deep in the logic of the user’s application that asking them to take it out of its context is not only an awful lot of work, it might even mask the error.

Finding an error that happened to a remote user boils down to knowing 1) when the problem happened, and 2) what the state of the world was at the time. Stacktraces help with the former, but not with the latter.

httpx solves this with debug logs. Although a user-facing feature, you can turn it on and define the severity level by setting an environment variable, HTTPX_DEBUG. By asking the user to set it to “2” and rerun their example, we get a detailed trace of HTTP request/response interactions and, in the HTTP/2 case, frame traces. It’s important to note that setting an environment variable is something a user can easily do, and the output is very valuable. In fact, this is how I found out that http-2 does not allow max frame size negotiation.
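
As an illustration, this is roughly all a user has to do (the reproduction script itself is made up):

# repro.rb — a made-up reproduction script; the user runs it with debug
# logging turned on via the environment variable:
#
#   HTTPX_DEBUG=2 ruby repro.rb
#
# and pastes the resulting request/response (and, for HTTP/2, frame) traces
# into the issue.
require "httpx"

puts HTTPX.get("https://news.ycombinator.com").status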

It’s also worth noting that httpx didn’t invent the practice: verbose logs with different levels are also a feature of ssh and curl. The node community (afaik) also uses environment variables with library prefixes to turn on logs only for a subset of the dependencies.

8. Examples, tutorials, how-to’s

Your library is useless if no one understands how to use it.

You can be pretty confident when announcing its release, and about how much your piece of software will change the world, but if you don’t post a code snippet with a disclaimer of “insert this line to achieve greatness”, you’ll never get people to use it. In fact, if you don’t actively use your library yourself, you might never realize that people don’t pick it up because of how complicated it is to use. So use it, and write examples.

But don’t write just any examples. Write simple examples. Write exhaustive examples. Cover edge cases. Write examples for things you would like to do, and if something doesn’t work, write the code to make it work, and share that code as an example.

That’s what I did to justify working on a faraday adapter: I wanted to be able to use the Stripe SDK gem through its faraday adapter (stripe has since moved away from faraday, oh well…). So I wrote an example of how it should look, and then the feature came.

So now httpx has a README with examples, a wiki with examples, an examples/ project folder, and also a cheatsheet!

9. Open for extension

This is ruby, and monkey-patching is king! If your objects aren’t extensible, this is what’ll happen: people will include module after module after overwritten method until they accomplish the feature they want, barely (a sin I’ve been guilty of).

For all the praise it gets, metaprogramming can be a very sharp knife: wield it well, and you’ll achieve great things, but the most likely outcome is that you’ll cut a finger!

A problem ruby never figured out was how to make extending existing features and classes easy. Half of the time, I see extensions corrupting the main module, often without considering call order or whether there’s already another extension extending that extended module, and then it’s a big mess of wild proportions. Then ruby brought Module#prepend, and some (just some) sanity was brought to call orders.

Then refinements came: one of the most controversial recent features of ruby, not because of its potential, but because of its limitations. There is a wide array of scenarios where refinements work in non-obvious ways, or just won’t work (try to refine a class method, or refine a module). Ruby just can’t seem to solve this itself.

httpx implements a plugin pattern, where certain classes considered “core” can be extended in a sane way, without the core classes themselves being globally extended. Most of its features are actually implemented as plugins. If this looks familiar to you, it’s because it’s not a novel idea: I kind of stole it from sequel and roda, both maintained by Jeremy Evans. Go read their source code; it’s one of the finest examples of metaprogramming in ruby, and its best approach to the “open for extension, closed for modification” principle.
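
To give a feel for the pattern, here's an illustrative sketch in the sequel/roda style (not httpx's actual internals): a plugin is just a module whose submodules get mixed into the "core" class when, and only when, the plugin is loaded.

# illustrative sketch of the sequel/roda-style plugin pattern, not httpx's
# actual implementation: the core class stays untouched until a plugin is
# explicitly loaded into a subclass.
class Client
  def self.plugin(mod)
    include(mod::InstanceMethods) if mod.const_defined?(:InstanceMethods)
    extend(mod::ClassMethods) if mod.const_defined?(:ClassMethods)
    self
  end

  def request(uri)
    "GET #{uri}"
  end
end

module Retries
  module InstanceMethods
    def request(uri)
      super # plugins decorate core behaviour through regular method lookup
    end
  end

  module ClassMethods
    def max_retries
      3
    end
  end
end

retriable_client_class = Class.new(Client).plugin(Retries)
retriable_client_class.max_retries # => 3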

10. Backwards-compatibility is a feature!

A lot of people have heard about Linus Torvalds’ email rants. He used to be quite aggressive when commenting on the quality of one’s proposed changes to the linux source code, which was a consequence of his relentless drive for project stability. One of his most famous mantras was “never break user code”. He was not wrong there.

In ruby, breaking upgrades are nothing new. Rails, its most famous gem, is the biggest outlier: every upgrade is guaranteed to break your application. Gems extending rails break accordingly. It’s actually kind of amazing that rails has managed to keep its popularity while constantly breaking its users’ legs. Chaos is a ladder.

The latest trend in the ruby world is to “stop supporting older versions of ruby”, older versions being versions no longer officially supported by the ruby core team. This trend not only ignores how ruby is distributed and used globally (not everyone uses rvm, mates), but in some (most?) cases the API is already compatible with older versions, or would require minimal effort to make it so.

But rails is the exception, not the rule. If you break compatibility constantly, you will alienate your user base. The ones that can abandon your project will abandon your project. The ones that can’t will grunt. Python hard-broke compatibility from v2 to v3, and we’re still talking about it after 10 years. The nodeJS ecosystem is defined by “always be breaking builds”, to the point where this became an anecdote. The Facebook API has changed so much since the first time I had to work with it (2010, perhaps) that I’d have to write everything from scratch if I went back to one of those projects. Google discontinues developer products all the time. Angular broke compatibility from 1 to 2, and in a single blow handed the frontend framework crown to React. None of these failed per se (except Angular, maybe), but they generated a spiral of negativity that hovers around them, and no one’s truly happy with it.

In my day job, we use Stripe for our payments processing. One of the small pleasures I have is working with code around Stripe SDKs and APIs. Not only is their standard above everything else I’ve worked with, their approach to user compatibility is jaw-dropping: if you haven’t upgraded your API version since 2014, it still works as before. If you go to their documentation pages, you’ll see code snippets for the API version you’re on! They go above and beyond to make you write the right code. The API upgrade strategy is also very easy to grasp: all endpoints can receive an API version header, which means you can migrate endpoint by endpoint to the latest version; once done, you make the switch at the dashboard level, remove the API version headers, and you’re migrated. An API like that is every developer’s dream.

We also use AWS services in my day job. One of these days, I had to add a one-liner to an application that had been running for months on AWS lambda. The deploy failed: I was using a version of serverless-warmup-plugin, a lambda to keep your lambdas warm, which still used the node6.10 lambda configuration, and AWS no longer supports that node version, so I had to spend some time figuring out how to upgrade that package.

Now pause for a second. Stripe. AWS. Both serve companies of all sizes. Neither should create friction with the companies working with them. Stripe is always accommodating our demands. AWS doesn’t care. And it’s not a question of money, I’d say.

Conclusion

These 10 practices are not to be taken as commandments (I did mention when I couldn’t follow them), but they help me maintain a fairly wide and complex set of features with no budget. And that’s the key aspect of this: Open Source projects are not just about writing code; in order to survive long term, they must excel at communication, collaboration and education. And that’s the hardest task.

Enumerable IO Streams

I’ve recently been working on CSV generation with ruby in my day job, in order to solve a bottleneck we found because of a DB table whose number of rows grew too large for the infrastructure to handle with our poorly optimized code. This led me on a journey of discovery of how to use and play with raw ruby APIs to solve a complex problem.

The problem

So, let’s say we have a User ActiveRecord class, and our routine looks like this:


require "csv"

class UsersCSV
  def initialize(users)
    @users = users
  end

  def generate
    CSV.generate(force_quotes: true) do |csv|
      csv << %w[id name address has_dog]
      @users.find_each do |user|
        csv << [
          user.id,
          user.name,
          user.address.to_addr,
          user.dog.present?
        ]
      end
    end
  end
end

payload = UsersCSV.new(User.relevant_for_this_csv).generate

aws_bucket.upload(body: StringIO.new(payload))

The first thing you might ask is “why are you not using sequel”. That is a valid question, but for the purpose of this article, let’s assume we’re stuck with active record (we really kind of are).

The second might be “dude, that address seems to be a relationship, isn’t that a classic N+1 no-brainer?”. It kind of is, and good for you to notice, I’ll get back to it later.

But the third thing is “dude, what happens if you have, like, a million users, and you’re generating a CSV for all of them?”. And touché! That’s what I wanted you to focus on.

You see, this is a standard example you find all over the internet of how to generate CSV data using the csv standard library, so it’s not like I’m doing something out of the ordinary.

I could rewrite the generation to use CSV.open("path/to/file", "wb") to pipe the data to a file; however, if I can send the data to the AWS bucket in chunks, why can’t I just pipe it as I generate? There are many ways to do this, but it got me thinking, and I came up with a solution using the Enumerable module.

Enumerable to the rescue

I’ll change my code to enumerate the CSV rows as they’re generated:

class UsersCSV
  include Enumerable

  def initialize(users)
    @users = users
  end

  def each
    yield line(%w[id name address has_dog])
    @users.find_each do |user|
      yield line([
        user.id,
        user.name,
        user.address.to_addr,
        user.dog.present?
      ])
    end
  end

  private

  # generates a single CSV-encoded line (newline included)
  def line(row)
    CSV.generate_line(row, force_quotes: true)
  end
end

# I can eager-load the payload
payload = UsersCSV.new(User.relevant_for_this_csv).to_a.join
# you can select line by line
csv = UsersCSV.new(User.relevant_for_this_csv).each
headers = csv.next
first_row = csv.next
#...

But this by itself doesn’t solve my issue. If you look at the first example, specifically the line:

aws_bucket.upload(body: StringIO.new(payload))

I’m wrapping the payload in a StringIO; that’s because my aws client expects an IO interface. And Enumerables aren’t IOs.

The IO interface

An IO-like object must implement a few methods to be usable by certain functions which expect the IO interface. In other, more ruby-ish words, it must “quack like an IO”. And how does an IO quack? Here are a few examples:

  • An IO reader must implement #read(size, buffer)
  • An IO writer must implement #write(data)
  • A duplex IO must implement both
  • A closable IO must implement #eof? and #close
  • A rewindable IO must implement #rewind
  • IO wrappers must implement #to_io

You know some of ruby’s classes which implement a few (some, all) of these APIs: File, TCPSocket, and the aforementioned StringIO.

A few ruby APIs expect arguments which implement the IO interface, but aren’t necessarily instances of IO.

  • IO.select can be passed IO wrappers
  • IO.copy_stream(src, dst) takes an IO reader and an IO writer as arguments (see the example below).
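
As a quick illustration of that duck typing, IO.copy_stream happily accepts StringIO objects (or anything sufficiently IO-like) on both ends:

require "stringio"

# IO.copy_stream only cares about the IO interface: any reader/writer pair
# that quacks like an IO will do.
source = StringIO.new("id,name\n1,Jane\n")
destination = StringIO.new
IO.copy_stream(source, destination)
destination.string # => "id,name\n1,Jane\n"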

Enter Enumerable IO

So, what if our csv generator can turn itself into a readable IO?

I could deal with this behaviour directly in my routine, but I’d argue that this should be a feature provided by Enumerable, i.e. an enumerable could also be cast into an IO. The expectation is risky: the yielded data must be strings, for example. But for now, I’ll just monkey-patch the Enumerable module:

# practical example of a feature proposed to ruby core:
# https://bugs.ruby-lang.org/issues/15549

module Enumerable
  def to_readable_stream
    # pass the size along when the enumerable knows it (it's optional)
    Reader.new(self, respond_to?(:size) ? size : nil)
  end

  class Reader
    attr_reader :bytesize

    def initialize(enum, size = nil)
      @enum = enum
      @bytesize = size
      @buffer = "".b
    end

    def read(bytes, buffer = nil)
      @iterator ||= @enum.each
      buffer ||= @buffer
      buffer.clear
      # prepend whatever was left over from the previous read
      if @rest
        buffer << @rest
        @rest.clear
      end
      # pull chunks from the enumerable until we have enough bytes (or run out)
      while buffer.bytesize < bytes
        begin
          buffer << @iterator.next
        rescue StopIteration
          # nil signals EOF, like IO#read
          return if buffer.empty?
          break
        end
      end
      # keep anything beyond the requested size for the next call
      @rest = buffer.slice!(bytes..-1)
      buffer
    end
  end
end

With this extension, I can do the following:

csv = UsersCSV.new(User.relevant_for_this_csv).to_readable_stream
aws_bucket.upload(body: csv)

And voilà! Enumerable and IO APIs for the win!

Using this solution, there’s a performance benefit while using clean ruby APIs.

The main performance benefit is that the payload doesn’t need to be kept entirely in memory until the whole CSV is generated, so we get constant memory usage (in our case, the growth was exacerbated by that N+1 problem: the longer you wait for the rows, the longer the csv payload is retained).
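
As for that N+1: a sketch of the fix, assuming address and dog are ActiveRecord associations on User, is to eager-load them so each find_each batch issues a constant number of queries:

# sketch: eager-load the associations used per row (assuming :address and
# :dog are ActiveRecord associations), so building each CSV line doesn't
# trigger extra queries per user.
relation = User.relevant_for_this_csv.includes(:address, :dog)
aws_bucket.upload(body: UsersCSV.new(relation).to_readable_stream)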

Caveat

Depending on what you’re using to upload the file, you might still need to buffer to a file first; at work, we use fog to manage our S3 uploads, which requires IO-like request bodies to implement rewind, so the easy way out is to buffer to a tempfile first:

csv = UsersCSV.new(User.relevant_for_this_csv).to_readable_stream
file = Tempfile.new
IO.copy_stream(csv, file)
file.rewind
fog_wrapper.upload(file)

Conclusion

There are many ways to skin this cat, but I’d argue that this one is the easiest to maintain: you can tell any developer that their CSV/XML/${insert format here} generator must implement #each and yield formatted lines, and then you just have to pass it to your uploader. You ensure that memory usage won’t grow linearly with the payload, and no one will ever have to read another tutorial on “How to write CSV files in ruby” ever again.

This doesn’t mean that all of our problems are solved: as the number of records grows, so does the time needed to generate the CSV, and it will become a bottleneck. So how can you guarantee that the time needed to generate the data won’t derail?

I’ll let you know when I have the answer.

Fallacies about HTTP

When I first started working on httpx, I wanted to support as many HTTP features and corner cases as possible. I wasn’t exhaustively devouring the RFCs looking for things to implement; rather, I was hoping that my experience with and knowledge of different http tools (cURL, postman, different http libraries from different languages) could help me narrow them down.

My experience working in software development for product teams also taught me that most software developers aren’t aware of these corner cases. In fact, they aren’t even aware of the most basic rules of the network protocols they use daily, and many just don’t care. When your goal is to get shit done before you go home to your family, these protocols are just a means to an end, and the commoditization of “decent-enough” abstractions around them has resulted in the professional devaluation of thorough knowledge of them.

Recently, the explosion of packages in the software registries of many open source languages/platforms has also led to the multiplication of packages which solve the same problem, each just a little bit differently from the others to justify its existence. awesome-ruby, a self-proclaimed curated list of ruby gems, lists 13 http clients as of the time of writing this article. And this list prefers to omit net-http, the http client available in the standard library (the fact that at least 13 alternatives exist for a library shipped in the standard library should already raise some eyebrows).

Some of these packages were probably created by the developers mentioned above. And the desire to get shit done while ignoring the fundamentals of how the network and its protocols work has led to this state of mostly average implementations which have survived by sheer popularity or “application dependency ossification” (a term I just cobbled together, meaning “components which use too many resources inefficiently but accomplish the task reasonably, and whose rewrite effort is offset by the amount of money it takes to keep the elephant running”). This list of fallacies is for them.

1. 1 request - 1 response

One of the most widespread axioms about HTTP is that it is a “request-response” protocol. And in a way, this might have been how it was designed in the beginning: send a request, receive a response. However, things started getting more complicated.

First, redirects came along. A request would be sent, a response would come back, but “oh crap!”, it has status code 302 or 301 and a new “Location” header, so let me send a new request to the new location. It could take quite a few “hops” (see how high-level protocols tend to re-use concepts from lower-level protocols) until we got to our resource with a 2XX status code. What is the response in this case?

But this is the simplest bending of “request-response”. Then HTTP started being used to upload files. It’s quite good at it, actually, but people started noticing that waiting for the whole request to be sent only to then fail on an authentication error was not a great use of the resources at our disposal. Along came 100 Continue: in this variant, a request sends its headers with the “Expect: 100-continue” header, waits for a response from the server, and if that response has status code 100, the body is then sent, after which we get our final response. So, I count two responses for that interaction. Never mind that a lot of servers don’t implement it (cURL, for instance, circumvents this by sending the body anyway if the server doesn’t respond within a few seconds).

Or take HTTP CONNECT tunnels: in order to send our desired request, we have to first send an HTTP CONNECT request, receive a successful response (tunnel established), then send our request and get our response.

But one could argue that, for all of the examples above, usually there is a final desired response for a request. So what?

Well, along came HTTP/2 Push. Now, whenever you send a request, you might get N responses, where N - 1 of them are for potential follow-up requests.

All this to say that, although it looks like a “request-response” protocol, it is actually more complex than that.

Most client implementations choose not to implement these semantics, as they may perceive them as being of little value for server-to-server communication, which is where most of them are used.

2. Write the request to the socket, read the response from the socket

Like many of the examples described here, this is a legacy from the early days of TCP/IP, when TCP sockets were the default and simpler message interactions were favoured for ease of use. Like SMTP before it, the first versions of HTTP had these semantics built in: open a socket, write the full request, receive the full response, close the socket, repeat for the next one.

However, things started getting complex really fast: HTML pages required multiple resources before being fully rendered. The TCP handshake (and later, SSL/TLS) added so much overhead to getting content to the end user that “user hacks” were developed to limit the number of connections. A big chunk of the following revision of HTTP (1.1) revolved around re-using TCP connections (aka “keep-alive”) and streaming data to the end user (aka “chunked encoding”), improvements which were widely adopted by the browsers and improved things for us, browser users. HTTP proxies, the “Host” header, Alt-Svc, TLS SNI: all of them were created to help decrease and manage the number of open/close intermediate links.

Other things were proposed that were good in theory, but hard to deploy in practice. HTTP pipelining was the first attempt at getting multiple responses at once to the end user, but middlebox interference and the limited net gains due to request head-of-line blocking meant that it was never going to be a winning strategy; hence there were very few implementations of this feature, and browsers never adopted it widely.

And along came HTTP/2, and the TCP-to-HTTP mapping was never the same. Multiple requests and responses multiplexed over the same TCP stream. Push promises. And maybe most importantly, connection coalescing: if you need to contact 2 hosts which resolve to the same IP and share the same TLS certificate, you can now safely pipe them through the same TCP stream!

Many of these improvements have benefited browsers first and foremost, and things have evolved to minimize the number of network interactions necessary to render an HTML page. HTTP/2 decreased the number of TCP connections necessary; HTTP/3 will aim at decreasing the number of round-trips. All of this without breaking request and response semantics.

Most of these things aren’t as relevant when all you want is to send a notification request to a third party. Therefore, most client implementations choose not to implement most of these semantics, and most are fine implementing “open socket, write request, read response, close socket”.

Ruby’s net-http by default closes the TCP socket after receiving the response (even sending the Connection: close header). It does implement keep-alive, but this requires a bit more set-up.
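
For the record, that keep-alive set-up looks something like this (host and paths are placeholders): requests issued inside the Net::HTTP.start block reuse the same connection.

require "net/http"

# inside the block, both requests go over the same TCP (and TLS) connection;
# outside of it, each call would open and close its own socket.
Net::HTTP.start("example.com", 443, use_ssl: true) do |http|
  first  = http.request(Net::HTTP::Get.new("/"))
  second = http.request(Net::HTTP::Get.new("/about"))
  puts first.code, second.code
end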

3. A network error is an error, an HTTP error is just another response

HTTP status codes can be split into 5 groups:

  • 100-199 (informational)
  • 200-299 (successful)
  • 300-399 (redirection)
  • 400-499 (client errors)
  • 500-599 (server errors)

In most server-to-server interactions, your code will aim at handling and processing “successful” responses. But for this to happen, the status code has to be checked, to ensure that we are getting the expected payload.

In most cases, this check has to be explicit, as 400-599 responses aren’t considered errors by clients, and end users have to recover from them themselves.

This is usually not the case for network-level errors. No matter whether the language implements errors as exceptions or return values, this is where network errors will be communicated. A 404 response is a different kind of error, from that perspective. But it is still an error.

This lack of consistency makes code very confusing to read and maintain. 429 and 424 error responses can be retried. 503 responses can be retried. Timed-out DNS lookups too. All of these represent operations that can be retried after N seconds, yet all of them require different error handling schemes, depending on the programming language.

A very interesting solution to this can be found in python’s requests library: although network-level errors are bubbled up as exceptions, a 400-599 response can be forced to become an exception by calling response.raise_for_status. It’s a reasonable trade-off to reach error consistency, and it works well in practice.

However, this becomes a concern when supporting concurrent requests: if you recover from an exception, how do you know which request/response pair caused it? For this case, there’s only one answer: prefer returning errors over raising them. Or raise exceptions only after you know whom to address them to.

But one thing is clear: Network errors and HTTP errors should be handled at the same level.
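
Here's a sketch of what that could look like in ruby, with both kinds of failure coming back as return values so each request can still be matched to whatever went wrong (the helper and error struct are made up):

require "net/http"
require "uri"

# made-up helper: network-level failures and 4xx/5xx responses are both
# surfaced as values, at the same level, instead of one raising and the
# other having to be checked explicitly.
FailedRequest = Struct.new(:url, :cause)

def fetch(url)
  response = Net::HTTP.get_response(URI(url))
  return FailedRequest.new(url, "HTTP #{response.code}") if response.code.to_i >= 400

  response
rescue SocketError, Errno::ECONNREFUSED, Net::OpenTimeout, Net::ReadTimeout => e
  FailedRequest.new(url, e)
end

results = ["https://example.com/", "https://example.com/missing"].map { |url| fetch(url) }
results.each { |result| puts result.is_a?(FailedRequest) ? "failed: #{result.cause}" : result.code }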

4. You send headers, then you send data, then it’s done

Earlier, we talked about the “open socket, write request, receive response, close socket” fallacy. But what is “write a request, receive a response” exactly?

HTTP requests and responses are often described as being composed of headers and data (not to be confused with HTTP/2 frames). Most examples and use cases show the headers being sent first, then the data. This is one of HTTP’s basic semantics: no data can be sent before the headers (again, this probably came from SMTP, if I had to bet).

Things have become a bit more complicated than that. When HTTP started being used for more than just sending “hypertext”, other frame sub-sets started showing up.

Along came multipart uploads. Based on MIME multipart messages, which were already being used to transfer non-text data over e-mail (SMTP, again), they defined a format for encoding payload-specific information as headers within the HTTP data. Here’s an example of a 3-file upload request:

# from https://stackoverflow.com/questions/913626/what-should-a-multipart-http-request-with-multiple-files-look-like
POST /cgi-bin/qtest HTTP/1.1
Host: aram
User-Agent: Mozilla/5.0 Gecko/2009042316 Firefox/3.0.10
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://aram/~martind/banner.htm
Content-Type: multipart/form-data; boundary=----------287032381131322
Content-Length: 514

------------287032381131322
Content-Disposition: form-data; name="datafile1"; filename="r.gif"
Content-Type: image/gif

GIF87a.............,...........D..;
------------287032381131322
Content-Disposition: form-data; name="datafile2"; filename="g.gif"
Content-Type: image/gif

GIF87a.............,...........D..;
------------287032381131322
Content-Disposition: form-data; name="datafile3"; filename="b.gif"
Content-Type: image/gif

GIF87a.............,...........D..;
------------287032381131322--

Later, an addition was made to HTTP: trailer headers. These are defined as headers which are sent by the peer after the data has been transmitted. Their main benefits are beyond the scope of this mention, but they fundamentally changed the expectation of what an HTTP message looks like: after all, headers can be transmitted both before and after the data.

A lot of client implementations re-use an already existing HTTP parser. Others write their own. I’ve seen very few supporting trailer headers. I don’t know of any, other than httpx, that does (and httpx only reliably supports them since ditching http_parser.rb, the ruby bindings for an outdated version of the node HTTP parser). I also don’t know of any in python. Go’s net/http client supports them.

5. HTTP bytes are readable

This was particularly talked about during the SPDY days and the initial HTTP/2 draft, when it was decided that the new version was going to adopt binary framing. A lot of different stakeholders voiced their opposition. One of the main arguments was that HTTP’s plaintext-based framing was a key factor in its adoption, debuggability and success, and that losing it was going to make HTTP more dependent on the main companies driving its development (the Googles of this planet).

They were talking about the “telnet my HTTP” days when, due to its text-based nature, it was possible to use telnet to open a connection to port 80, just type your request, headers/data, and see the response come in on your terminal.

This hasn’t been so black-and-white for many years. Due to better resource management, there are time constraints on how long that “telnet” connection will be kept open by the server (in many cases, if servers don’t receive anything within 15 seconds, the connection is terminated). HTTPS and encoding negotiation have also made telnet-based debugging less effective.

Also, better tooling has shown up to take over this problem space: Wireshark has been able to debug HTTP/2 almost since day one, and will be able to debug HTTP/3 in no time.

To sum it up, this fallacy is a leftover legacy from the early TCP/IP protocol days (surprise: you can also send SMTP messages over telnet!). No one should use telnet in 2019 (and I know for a fact that many network providers still do). Better tooling has come up for this problem space. Network and system administrators of 20 years past: just raise the bar.

A hole in a lot of http clients is that they don’t provide introspection/debug logging, so one has to resort to network-level tools to inspect payloads (net-http actually does provide it, however). Maintainers, that should be an easy problem to fix.

6. Response is an IO stream

Some features introduced in the HTTP/1.1 days, like chunked encoding or the event stream API, brought streaming capabilities to the protocol. This might have given the wrong idea that an HTTP connection is itself streamable, a concept that has “leaked” into a few client implementations.

Usually, in these interactions, you create an HTTP connection (and its underlying TCP/TLS connection), and there is an API that returns the next stream “chunk”, after which you can perform some operation, and then loop back to the beginning.

Besides the implicit socket-to-HTTP-connection mapping here, which was debunked a few fallacies ago, there’s also the fact that “draining” the connection is only performed when acquiring the next chunk. If your client is not consuming the payload as fast as possible, and the server keeps sending, many buffers along the way will fill up waiting for you to consume it. You might have just caused “bufferbloat”.

If there are timing constraints on network operations, there is no guarantee that you’ll request the next chunk before the TCP connection itself times out and/or the peer aborts. Most of these constraints can be controlled in a dev-only setup, so such interactions will result in “production-only” errors which can’t be easily reproduced locally. Surprise: you might have just programmed a “slow client”.

This is not to say that you should not react to data frames as they are sent, but usually a callback-based approach is preferred and causes less unexpected behaviour, provided you keep your callbacks small and predictable. But whatever happens, always consume data from the underlying socket as soon as possible.
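
net-http's block form of read_body is one example of that callback shape: the client drains the socket as data arrives and hands each chunk to a small callback (the URL and file name below are placeholders).

require "net/http"
require "uri"

# the client reads from the socket as data arrives and yields each chunk;
# keep the callback small and predictable.
uri = URI("https://example.com/big-file")
Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(Net::HTTP::Get.new(uri)) do |response|
    File.open("big-file", "wb") do |file|
      response.read_body { |chunk| file.write(chunk) }
    end
  end
end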

Besides, if you’re using HTTP/2, there is no other choice: unless you can guarantee that there’s only one HTTP/2 connection per socket, you can’t just read chunks from it. And even if you can, reading a data chunk involves so much ceremony (flow control, other streams, etc…) that you might end up regretting using it in the first place.

Client implementations that map a 1-to-1 relationship between socket and HTTP connection are able to provide such an API, but it won’t save you from the trouble. If connections hang on the server side, time out, or you get blocked from accessing an origin, consider switching.

7. Using HTTP as a transport “dumb pipe”

According to the OSI model, HTTP belongs to layer 7, the so-called application protocols. These are perceived as the higher-level interfaces which programs use to communicate with each other over the network. HTTP is actually a very feature-rich protocol, supporting features like content negotiation, caching, virtual hosting, cross-origin resource sharing, tunneling, load balancing; the list goes on.

However, most clients use HTTP as a dumb pipe where data is sent and received, as if it were a plain TCP stream.

It is like that for many reasons, I’d say. First, there is a big incentive to use HTTP for all the things: bypassing firewalls! Second, implementing all the features of HTTP in a transparent way is rather hard. Some implementers even think that only richer user-agents like browsers would benefit from such features.

Even cURL is partially to blame: it is probably the most widely used and deployed HTTP client around, but its mission is to allow downloading content over many other protocols, of which HTTP is just one. If you’re doing:

> curl https://www.google.com

you’re a) not negotiating payload compression; b) not checking whether a cached version of the resource is still up-to-date. Can you do it with cURL? Yes. Do you have to be verbose to do it? Pretty much.
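
Doing those two things by hand isn't much terser in ruby with net-http either; a sketch (header values are illustrative):

require "net/http"
require "uri"

# negotiate compression and revalidate a cached copy with a conditional
# request; the date below is an illustrative placeholder.
uri = URI("https://www.google.com/")
request = Net::HTTP::Get.new(uri)
request["Accept-Encoding"] = "gzip"
request["If-Modified-Since"] = "Tue, 01 Oct 2019 00:00:00 GMT"

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts response.code # a 304 means the cached copy is still fresh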

Most 3rd-party JSON API SDKs suffer from this issue, because the underlying library isn’t doing these things. The only reason we’re sending JSON over HTTP is that proxies have to be bypassed, and yet it’s done in an inefficient way.

Conclusion

I could add a few more thoughts, but 7 sounds official, so I’ll let that sink in.

Enjoy the week, y’all!

Welcome to HTTPX the blog

First of all, welcome. This is the first post about HTTPX, the ruby http client library for the future.

In it, I’ll talk about this library, http, the ruby ecosystem, decisions and choices, and anything I can relate to the motivation for creating and maintaining httpx. I’ve realized that there are a lot of resources (blogs, tutorials, etc…) about other ruby (and not only ruby) http client libraries, and practically nothing about httpx, mostly due to it being very recent, while the others have stabilized and become part of mature code bases. Hopefully I can tackle this superficial perception and build some community around it (no matter how good your product is, if no one uses it, it does not exist).

But a milestone has been reached: httpx is now part of the awesome-ruby resources, so I’d like to celebrate that in this first post.

So long folks. Be excellent to each other.