Wednesday, August 9, 2017

On Contempt Culture, a Reply to Aurynn Shaw

I saw an interesting presentation recorded and delivered on LinkedIn on contempt culture by Aurynn Shaw, delivered this year at PyCon.  I had worked with Aurynn on projects back when she used to work for Command Prompt.  You can watch the video below:



Unfortunately comments on a social media network are not sufficient for discussing nuance so I decided to put this blog post together.  In my view she is very right about a lot of things but there are some major areas where I disagree and therefore wanted to put together a full blog post explaining what I see as an alternative to what she rightly condemns.

To start out, I think she is very much right that there often exists a sort of tribalism in tech with people condemning each others tools, whether it be Perl vs PHP (her example) or vi vs emacs, and I think that can be harmful.  The comments here are aimed at fostering a sort of inclusive and nuanced conversation that is needed.

The Basic Problem

Every programming culture has norms, and many times groups from outside those norms tend to be condemned in some way or another. There are a number of reasons for this.  One is competition and the other is seeking approval in one's in group.   I think one could take her points further and argue that in part it is about an effort to improve the relative standing of one's group relative to others around it.

Probably the best example we can come up with in the PostgreSQL world is the way MySQL is looked at.  A typical attitude is that everyone should be using PostgreSQL and therefore people choosing MySQL are optimising for the wrong things.

But where I would start to break with Aurynn's analysis would be when we contrast how we look at MySQL with how we look at Oracle.  Oracle, too, has some major oversights (empty string being null if it is a varchar, no transactional DDL, etc).  Almost all of us may dislike the software and the company.  But people who work with Oracle still have prestige.  So bashing tools isn't quite the same thing as bashing the people who use them.  Part of it, no doubt, is that Oracle is more established, is an older player in the market, and therefore there is a natural degree of prestige that comes from working with the product.  But the question I have is what can we learn from that?

Some time ago, I wrote a the most popular blog post in the history of this blog.  It was a look at the differences in design between MySQL and PostgreSQL and was syndicated on DZone, featured in Hacker News, and otherwise got a fairly large review.   In general, aside from a couple of historical errors, the PostgreSQL-using audience loved the piece.  What surprised me though was that the MySQL-users also loved the piece.  In fact one comment that appeared (I think on Reddit) said that I had expressed why MySQL was better.

The positive outpouring from MySQL users, I think, came from the fact that I sympathetically looked at what MySQL was designed to do and what market it was designed for (applications that effectively own the database), describing how some things I considered misfeatures actually could be useful in that environment, but also being brutally honest about the tradeoffs.

Applying This to Programming Language Debates

Before I start discussing this topic, it is worth a quick tour of my experience as a software developer.

The first language I ever worked with was BASIC on a C64.  I then dabbled in Logo and some other languages, but the first language I taught myself professionally was PHP.  From there I taught myself some very basic Perl, Python, and C.  For a few years I worked with PHP and bash scripting, only to fall into doing Perl development by accident.  I also became mildly proficient in Javascript.

My PostgreSQL experience grew out of my Perl experience.  And about 3 years ago I was asked to start teaching Python courses.  I rose to this challenge.  Around the same time, I had a small project where we used Java and quickly found myself teaching Java and now I feel like I am moderately capable in that language.   I am now teaching myself Haskell (something I think I could not have done before really mastering Python). So I have worked with a lot of languages.  I can pick up new languages with ease.  Part of it is because I generally seek to understand a language as a product of its own history and the need it was intended to address.

As we all know different programming languages are associated with stereotypes.  Moreover, I would argue that stereotypes are usually imperfect understandings that out-group people have of in-group dynamics, so dismissing stereotypes is often as bad as simply accepting them.

PHP as a case study, compared to C.

I would like to start with an example of PHP, since this is the one specifically addressed in the talk and it is a language I have some significant experience writing software in.

PHP often is seen to be insecure because it is easy to write insecure software in the language.  Of course it is easy to write insecure software in any language, but certain vulnerabilities are a particular problem in PHP due to lexical structure and (sometimes) standard library issues.

Lexically, the big issue with PHP is the fact that the language is designed to be a preprocessor to SGML files (and in fact it used to be called the PHP Hypertext Preprocessor).  For this reason everything, PHP is easy to embed in SGML PI tags (so you can write a PHP template as a piece of valid HTML).  This is a great feature but it makes cross site scripting particularly easy to overlook.  A lot of the standard library in the 1990's had really odd behaviour, though much of this has been corrected.

Aurynn is right to point to the fact that these were exacerbated by a flood of new programmers during the rise of PHP, but one thing she does not discuss in the talk is how software and internet security were also changing during the time.  In essence, the late 1990's saw the rise of SSH (20k users in 1995 to over 2M in 2000), the end of transmission of passwords across the internet in plain text, the rise of concern about SQL injection and XSS, and so forth.  PHP's basic features were in place just before this really got going, and adding to this a new developer community, and you have a recipe for security problems.  Of course, today, PHP has outgrown a lot of this and PHP developers today have codified best practices to deal with a lot of the current threats.

If we contrast this with C as programming language, C has even more glaring lexical issues regarding security, from double free bug possibilities to buffer overruns.  C, however, is a very unforgiving language and consequently, it doesn't tend to be a language that has a large, novice developer community. At the same time, a whole lot of security issues come out of software in C.

Conclusions

There is no such thing as a perfect tool (database, programming language, etc).  As we grow as professionals, part of that process is learning to better use the strengths of the technologies we work with and part of it is learning to overcome the oversights and problems of the tools as well.

It is further not the case that just because a programmer primarily uses a tool with real oversights that this reflects poor judgment from the programmer.  Rather this process of learning can have the opposite impact.  C programmers tend to be very knowledgeable because they have to be.  The same is true for Javascript programmers for very different reasons.  And one doesn't have to validate all language design decisions in order to respect others.

Instead of attacking developers of other languages, my recommendation is, when you see a problem, to neutrally and respectfully point it out, not from a position of superiority but a position of respectful assistance and also to understand that often what may seem like poor decisions in the design of a language may in fact have real benefits in some cases.

For example, Java as a language encourages mediocrity of code. It is very easy to become a mediocre Java developer.  But once you understand Java as a language, this becomes a feature because it means that the barrier to understanding and debugging (and hence maintaining!) code is reduced, and once you understand that you can put emphasis instead on design and tooling.    This, of course, also has costs since it is easy for legacy patterns to emerge in the tooling (JavaBeans for example) but it allows some really amazing frameworks, such as Spring.

On the other extreme, Javascript is a language characterised by shortcuts taken during the initial design stage (for time constraint reasons) and some of those cause real problems, but others make hard things possible.  Javascript makes it, also, very easy to be a bad Javascript programmer.  But perhaps for this reason I have found that professional Javascript programmers tend to be extremely knowledgeable, and have had to work very hard to master software development in the language, and they usually bring to the table great insights into computing problems generally.

So what I would recommend that people take away is the idea that in fact we do grow out of hardship, and that problems in tools are overcome over time.  So for that reason discussing real shortcomings of tools while at the same time respecting communities and their ability to grow and overcome problems is important.

Monday, February 13, 2017

PostgreSQL at 10TB and Beyond Recorded Talk

The PostgreSQL at 10 TB And Beyond talk has now been released on Youtube. Feel free to watch.  For the folks seeing this on Planet Perl Iron Man, there is a short function which extends SQL written in Perl that runs in PostgreSQL in the final 10 minutes or so of the lecture.

This lecture discusses human and technical approaches to solving volume, velocity, and variety problems on PostgreSQL in the 10TB range on a single, non-sharded large server.

As a side but related note, I am teaching a course through Edument on the topics discussed in Sweden discussing many of the technical aspects discussed here, called Advanced PostgreSQL for Programmers.  You can book the course for the end of this month.  It will be held in Malmo, Sweden.

Thursday, January 26, 2017

PL/Perl and Large PostgreSQL Databases

One of the topics discussed in the large database talk is the way we used PL/Perl to solve some data variety problems in terms of extracting data from structured text documents.

It is certainly possible to use other languages to do the same, but PL/Perl has an edge in a number of important ways.  PL/Perl is light-weight, flexible and fills this particular need better than any other language I have worked with.

While one of the considerations has often been knowledge of Perl in the team, PL/Perl has a number of specific reasons to recommend it:

  1. It is light-weight compared to PL/Java and many other languages
  2. It excels at processing text in general ways.
  3. It has extremely mature regular expression support
These features combine to create a procedural language for PostgreSQL which is particularly good at extracting data from structured text documents in the scientific space.  Structured text files are very common and being able to extract, for example, a publication date or other information from the file is very helpful.

Moreover when you mark your functions as immutable, you can index the output, and this is helpful when you want ordered records starting at a certain point.

So for example, suppose we want to be able to query on plasmid lines in UNIPROT documents but we have not set this up before we loaded the table.  We could easily create a PL/Perl function like:

CREATE OR REPLACE FUNCTION plasmid_lines(uniprot text) 
RETURNS text[]
LANGUAGE PLPERL IMMUTABLE AS
$$
use strict;
use warnings;
my ($uniprot) = @_;
my @lines = grep { /^OG\s+Plasmid/ } split /\n/ $uniprot;
return [ map {  my $l = $_; $l =~ s/^OG\s+Plasmid\s*//; $l } @lines ];
$$;


You could  then create a GIN index on the array elements:

CREATE INDEX uniprot_doc_plasmids ON uniprot_docs USING gin (plasmid_lines(doc));

Neat!

Tuesday, January 24, 2017

PostgreSQL at 10 TB and Above

I have been invited to give a talk on PostgreSQL at 10TB and above in Malmo, Sweden.  The seminar is free to attend.  I expect to be talking for about 45 minutes with some time for questions and answers.  I also have been invited to give the talk at PG Conf Russia in March.  I do not know whether either will be recorded.  But for those in the Copenhagen/Malmo area, you can register for the seminar at the Event Brite page.

I thought it would be helpful to talk about what problems will be discussed in the talk.

We won't be talking about the ordinary issues that come with scaling up hardware, or the issues of backup or recovery, or of upgrades. Those could be talks of their own.  But we will be talking about some deep, specific challenges we faced and along the way talking about some of the controversies in database theory that often come up in these areas, and we will talk about solutions.

Two of these challenges concern a subsystem in the database which handled large amounts of data in high-throughput tables (lots of inserts and lots of deletes).   The other two address volume of data.

  1. Performance problems in work queue tables regarding large numbers of deletions off the head of indexes with different workers deleting off different indexes.  This is an atypical case where table partitioning could be used to solve a number of underlying problems with autovacuum performance and query planning.
  2. Race conditions in stored procedures between mvcc snapshots and advisory locks in the work queue tables.  We will talk about how this race condition happens and we solved it without using row locks.  We solved this by rechecking results in a new snapshot which we decided was the cheapest solution to this problem.
  3. Slow access and poor plans regarding accessing data in large tables.  We will talk about what First Normal Form really means, why we opted to break the requirements in this case, what problems this caused, and how we solved them.
  4. Finally, we will look at how new requirements on semi-structured data were easily implemented using procedural languages, and how we made these perform well.
In the end there are a number of key lessons one can take away regarding monitoring and measuring performance in a database.  These include being willing to tackle low-level details, measure, and even simulate performance.

Please join me in Malmo or Moscow for this talk.