Podcasts

061 - Discussing DevOps and Elastic Stack with Mike Place

December 29, 2020

Today's episode is a conversation between Jeremy Morgan and Mike Place. Mike is a Senior Engineer on the Elastic developer productivity team, and former lead on the Salt Open project.

They talk about DevOps, Elastic Stack, and the tactics for effectively managing developers.


If you enjoy this episode, please consider leaving a review on Apple Podcasts or wherever you listen.

Please send any questions or comments to podcast@pluralsight.com.

Transcript

[00:00:00.000]

Hello, and welcome to All Hands on Tech. I'm Daniel Blaser. Today's episode features a conversation between Jeremy Morgan and Mike Place. Mike is a senior engineer on the Elastic developer productivity team and former lead on the Salt Open project. This episode covers a range of topics, including DevOps, Elastic Stack, and the tactics for effectively managing developers.

 

[00:00:32.383]

So how are you doing today, Mike?

 

[00:00:33.844]

I'm great. How are you, Jeremy?

 

[00:00:35.812]

I'm doing well. So tell us a little bit about yourself and what you do.

 

[00:00:40.528]

Sure. My name is Mike Place. I work for Elastic. I live in Paris, France. And let's see, I have been in tech now for going on 25 years, if you can believe that. And yeah, I work for Elastic on the developer productivity team, specifically supporting developer teams in Elastic observability. And I have been on that particular team now just about a year. And yeah, I really, really enjoy it.

 

[00:01:19.607]

Awesome. So how did you get started in tech? Is it something that you've always wanted to be in, or was it something that you kind of fell into?

 

[00:01:27.353]

Sure. You know, I'm sure you get these stories a lot, but when I was a kid, basically the leading technology at the time was made by Timex. And what you could do was, you could go to the store and you could get a little Timex computer, and you could hook it up to your television, which in and of itself was not the easiest feat. And what this thing consisted of, basically, was a computer with an embedded keyboard. There was very little memory, I really have no idea, but some small number of kilobytes on the thing, certainly no type of permanent storage. And to the best of my recollection, perhaps somebody can correct me, but to the best of my recollection, basically what you got when you started this thing up was a prompt in BASIC. And that was it. That was the whole experience. If you managed to get this thing connected to your television, you had a BASIC prompt, and you could do whatever you wanted in BASIC from there. But, of course, there was no permanent storage, so as soon as you turned the thing off, that was it. I thought this was fantastic. I mean, I was, I can't remember, probably 6 or 7 years old at the time, and I taught myself BASIC, poorly, I should add. I mean, I was 7. And I went on from there to university. I have a degree in English literature, with a focus on modernist poetry. And at the time, this would've been sort of post-college, mid-1990s, give or take, it wasn't particularly common for what at the time were system administrators to have any type of CS degree. Instead, it was mostly just people from varying backgrounds who thought computers were really cool, and we just sort of all found each other. And so in Salt Lake City, which is where I was living at the time, we all found each other at an absolutely top-shelf independent internet service provider, which exists to this day, called XMission, and I worked there for over a decade, had a fantastic time. And those were really the good old days. When we needed a server, there was no cloud; when we needed a server, we ordered parts online. They came in a box, we put them together, we racked them up, we installed an OS, and some number of hours or days later, usually days, we had a mail server. And if you needed a second one, well, you had to do all of that again. And anyway, it's sort of a typical story from there. I bounced around a lot, spent a number of years at SaltStack. Now, years later, here I am working at Elastic.

 

[00:05:08.496]

It seems like the Stone Age now, when you really sit and think about it. Because I was one of those guys, I worked at an ISP also, and I was one of those guys that didn't have a CS degree, had a degree in something else but thought computers were really cool. And yeah, we were having so much fun back then, but looking back now, it's like, well, I'm really glad that's not the way it is anymore.

 

[00:05:30.145]

I have mixed feelings about it, if I'm honest. There are certainly parts of it that I miss. I think the part that I miss more than anything is constraint. At the time, whether or not your mail server ran in 100 MB, or even 10 MB, or 15 or 20 MB of memory, was not an inconsequential problem. Whether or not you had particular kernel tuning options turned on or off, and how you configured your file system, were all relatively important, especially if you were like us: relatively poor and trying to make a buck as a bunch of folks running an ISP. I fear that a little bit of that, unfortunately, has been lost, when it costs pennies to spin up a server in the cloud that has 32 or 64 GB of memory. It's a little bit harder to think about constraint than it was at the time.

 

[00:06:44.450]

Yeah, do you think that's had kind of an adverse effect on software development in general? I've had discussions recently about optimizing code on the back end and things like that, and I've had a lot of people on Twitter and publicly being like, this doesn't matter anymore. That's cool that you shaved a few milliseconds, but we don't care anymore. And it was kind of a little bit of a shock to me, like, why wouldn't people care? And then I thought about it, and it's like, well, with the cloud, exactly like you said, there's no constraint, so if things are running slower, we'll just throw some more instances at it and call it good. But do you think that's affecting how people develop software?

 

[00:07:30.452]

I will make a case for why people should care. And, look, I'm sympathetic to the idea that, yeah, it probably doesn't necessarily affect application performance negatively. And, of course, vertical scaling and horizontal scaling are both very cheap and very easy. But at the same time, we're faced with a global challenge of climate change. Very few of these servers run on solar power, and we need to acknowledge that the efficiency of the software that we write has literal real-world consequences. We can and should be concerned about how efficiently our software runs, if for no other reason than software at the end of the day is another pollutant that exists out in the world. And even if we can afford it, once those cycles are burned, they're burned. And so I do really wish that people were a little bit more conscious about software, for that reason if nothing else.

 

[00:08:43.914]

That's where most of our electricity is probably going these days: data centers.

 

[00:08:49.578]

Yeah, it's been a while since I've looked, but I'm sure somebody will write me an email. But I remember numbers like 1.5% of global electricity utilization. It's in that area. Please don't email me. But I know it's somewhere in that range. And I would very much like to see a movement in software where actual energy utilization was something that we measured in the same way that we measure software performance, because that energy utilization, by the way, translates into cost from cloud providers. They're no dummies. They're not losing money. And I think it would be wonderful if I could pull up an AWS dashboard and say, oh, okay, well, these are the number of Lambda executions that you had at the end of the day. And by the way, here's the estimated number of kilowatt-hours that we think those burned. It should be a metric that we look at when we think about software. It's very frustrating to me that it's not, and I look at this sort of lack of consciousness about efficiency as being part of that problem.

 

[00:10:09.738]

So let's step back a little bit in your career. What can you tell us about the Salt Open project?

 

[00:10:15.122]

Sure. So the Salt Open project is a project, oh, golly, I used to know this stat by heart. I think it started in 2013, but it's been a couple of years now since I've been on that project, so I don't know for sure. It was started by Tom Hatch, a wonderful gentleman also living there in Utah, and after a little while a company grew up around it, SaltStack, which was recently acquired by VMware. Salt is a project that, interestingly, was originally designed for remote execution, and then some number of months or years later, it had configuration management layered on top of that. And it was really when configuration management started getting layered on top of it that it began to be known in the space of configuration management tools, opposite tooling like Puppet or Chef or Ansible. I worked on that project for a number of years. Like I mentioned earlier, I've since left there. But certainly, that automation space has been very interesting, and I think it's still ripe for a lot of innovation, a lot of interesting things to happen.

 

[00:11:40.072]

Yeah, what do you think some of the lesser-known benefits of automation are? Because we always talk about automating boring things or pushing out more features or something like that, but is there more to it than that, as far as the benefits of automation?

 

[00:11:56.471]

Right, yeah, that's a relatively broad subject. But the benefits of automation come explicitly in a couple of places. First off, we think of automation as something that can take repetitive tasks away from development teams. We know that developer time is exceedingly expensive; we know that certain tasks that would otherwise be left to developers could be automated. Therefore, it makes a lot of sense to use some sort of tooling that can get developers back to what they are good at and what provides the most value for the business, which is writing features and fixing bugs for customers. So automation, I think, should principally be thought of as a cost-saving tool, which it generally is. Additionally, I think one of the interesting, unexplored pieces of automation, I shouldn't say unexplored, that's certainly not true, but lesser-explored pieces, is what you might call full-stack automation: automation that is essentially seamlessly integrated from the application all the way down the stack into the infrastructure. We've talked about this in the DevOps world for quite a number of years, that we should be able to combine the development world and the operations world, when in fact what we've generally done is, we've just said, okay, we're going to have automation that represents the intended state of some application, and then we're going to overlay that on top of the actual state of our infrastructure, which is a fine and wonderful and worthy goal. But I also think there's some unexplored territory in the idea that automation can also play a role in the application itself, which ostensibly knows its own state, informing the infrastructure about what it needs. Certainly, we've made some strides in that. We've come from this world where we've basically said, okay, we're going to have this idea called configuration management, which we've had for many years, which can, again, describe the intended state of infrastructure. And then, like I said, we kind of weirdly just blended in this notion of application deployment on top of it, using generally the same type of tooling. But again, it's kind of problematic because it becomes very hard, I think, to reason about, which is a problem certainly endemic, in my view, to automation today. For example, I think most teams, especially teams that are heavily invested in automation, like to say, well, we've described our infrastructure as code, and I say, okay, well, fine. Show me what the first-day experience looks like for somebody on your team. And 9 times out of 10, they say, well, here's a pile of YAML and maybe some documentation that goes along with it. The problem that I have there is, it's extremely difficult to reason about the intended state of applications, or the actual or intended state of infrastructure, by looking at a pile of YAML. Automation, I think, has yet to even effectively address this problem, much less solve it. Yeah, I don't know. I covered five or six touchpoints there.

 

[00:16:11.544]

So what can you tell us about Elastic Stack?

 

[00:16:14.534]

Sure, so I should couch this by saying that I'm certainly not here to speak on behalf of Elastic or anything like that. But that said, Elastic has a very broad stack these days in terms of development. I work on the observability team, and so we focus on solutions like APM and other types of observability, sort of the ability to look at and represent an infrastructure or an application and understand the signaling that it emits as its runtime progresses, which might provide people with information about its present state. There's a quote I like, forgive me if I get it wrong, I think it's from Charity Majors, and I don't have it in front of me, so, again, I may get it slightly wrong. But it was something along the lines of, observability tells us what the application is doing, where monitoring looks at the application from the outside. So we're focused on the observability side of things, from the application looking outward. What is its state? Yeah, and like I mentioned, I work on the developer productivity team there, which is not one of the direct products.

 

[00:18:00.605]

How can the stack help with some of the other things like security? This is one of the things that I've noticed. I've been really excited researching the Elastic Stack, and I used something similar years ago. And one of the things I think that was most exciting that people don't talk about as much is the security aspect, like being able to, for lack of a better word, plug holes much faster than we've ever been able to before.

 

[00:18:29.576]

Now, I am not the best person to ask about this, so I would certainly direct people to better information than I have. But Elastic really genuinely does have a really compelling security story to tell, and it's generally told in two parts. The first part is a SIEM detection engine that allows security analysts to ingest the large number of events which are happening on any type of network and reconstruct events that may be necessary to conduct a security investigation, for example. So they have this really, really wonderful tool that's designed for analysts that allows you to, like I said, create sophisticated queries and conduct security investigations and what have you, and it's really, really wonderful. You should really take a look at it if you haven't had a chance. The second is the Elastic Endpoint Security story, which is also excellent. I'm a really big fan of it. And it's this notion that you can both detect threats and also respond to them. And so as you're engaged in this security analysis of a potential problem, you can wall off potentially compromised components, you can save their state for investigation later. Again, I'm not the best person to ask about this because really, honestly, I've primarily sat in demos with a lot of that tooling. But it's very, very impressive stuff. I really encourage everybody to take a look at it.

 

[00:20:21.097]

So what kind of skills are useful to be proficient as a developer with this stack?

 

[00:20:26.337]

I think first off, basic HTTP and REST skills, though certainly these days there are a number of client libraries that abstract those sorts of things away from you. Probably, as with anything, I think the most challenging piece for me when I was learning it was trying to unhinge my brain from relational database thinking. So really, as a person starts, certainly this isn't necessary prior to starting with Elasticsearch or with the Elastic Stack, but especially if you come from a relational database frame of mind, I would certainly encourage people to always stop as they're engaged in their learning journey and think, okay, am I trying to think of this problem as a relational database problem, with traditional tables and joins and things like that, or am I thinking of this in terms of the Elastic way of doing things, where you have a set of more or less freeform documents? Document duplicates are not necessarily problematic. You don't have the same type of very rigid structure that you might be used to. So that would be thing one. Thing two would be, and I've certainly run into this in the past, and I think most people who have used the stack would probably cop to this: if you're planning to use Kibana to visualize data, which certainly you don't have to, but if that's something that you're interested in doing, it pays, at least in my experience, to think a little bit about how your Kibana visualization is going to be structured as you're starting to think about the nature and structure of the documents that you insert, because things will be much easier for you when you create those visualizations down the road, versus arriving at a document structure that may be more challenging to visualize with Kibana. And the third thing, really, in my experience, and I don't know how well this would be received by everyone, but in my experience, sometimes the fastest or most efficient route to visualizing data is to restructure it outside of the Elastic Stack. Now, they have solutions that were released very recently, in the 7.8 series, I think it is, that make this dramatically easier. But sometimes I have a particular visualization that I want to make, and I've got data that is already flowing in and it's in a particular shape, and what I do is, I pull it out, I reprocess it with Python, and then I stuff it back in. And maybe that's cheating, maybe that's not the officially sanctioned way, but sometimes it's the shortest path between me and getting work done, and quite frankly, I think there's no problem with that.
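
To make that last workflow concrete, here is a minimal sketch of the "pull it out, reshape it with Python, stuff it back in" approach, assuming the official elasticsearch-py client against a local cluster. The index names and document fields here are hypothetical placeholders for illustration, not anything from Elastic's own systems.

```python
# A rough sketch: stream documents out of one index, flatten them into a
# shape that is easier to visualize in Kibana, and bulk-insert them back
# into a second index. Index and field names are hypothetical.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch("http://localhost:9200")

def reshaped_docs():
    # Stream every document out of the hypothetical "raw_events" index...
    for hit in scan(es, index="raw_events", query={"query": {"match_all": {}}}):
        src = hit["_source"]
        # ...and emit a flatter document for a visualization-friendly index.
        yield {
            "_index": "events_for_kibana",
            "_source": {
                "service": src.get("service", {}).get("name", "unknown"),
                "duration_ms": src.get("event", {}).get("duration_ms"),
                "@timestamp": src.get("@timestamp"),
            },
        }

# Re-index the reshaped documents in bulk.
bulk(es, reshaped_docs())
```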

 

[00:24:05.554]

Yeah, I suppose starting with what you're looking for in a visualization and working back, I could see how that would be a little smoother than, as you're saying, starting with the data and trying to kind of wrangle it into something later down the road.

 

[00:24:19.935]

Yeah, I mean, it may not even be working backward, but I would say keeping that in mind. And as you're sort of formulating how you want your document structure to look, every now and then just pause and go, okay, does this still make sense for the visualization that I'm aiming for here? Because as wonderful as it would be for all visualization tools to be able to ingest and represent all types of data structures, that just isn't, unfortunately, the reality that we live in. So I've found that just checking in every now and then as I'm doing development is worthwhile.

 

[00:25:03.098]

Shifting gears a little bit, I know it's been a couple years, but at DevOpsDays you talked about event-driven infrastructure, and I really enjoyed that talk and was wondering, what can you tell us about event-driven infrastructure and how that works?

 

[00:25:19.022]

Right, so event-driven infrastructure is an idea that certainly isn't new. In fact, if you go back through the literature, interestingly, Oracle talked about this briefly. They got very excited about it for, I don't know, 5 or 6 months, gosh, 5 or 6 years ago now. But basically, event-driven infrastructure is the idea that infrastructure, or infrastructure tooling, rather, can listen to events and respond accordingly. That's the bottom line. Now, what does that really mean in practice? Let's imagine a very simple event-infrastructure relationship: if the infrastructure tooling receives an event called capacity warning, a new machine is spun up, or a new load balancer is provisioned. So, this is interesting because we can and should receive events from everywhere. Often, we think of automation in a very limited way. We think of automation as basically being glorified scripting. And, yes, I understand that we have all of these declarative paradigms and what have you. But by that, I mean we think of automation as something which is initiated by a human or by a timer, and then it runs, and then it's done, and that's it. Event-driven infrastructure asks, how can we think about the initiation of these automation routines as being more fluid? So, returning to our example of a load balancer and infrastructure provisioning, we might have the ability, for example, for our monitoring tools to give us relatively good signaling with respect to how we're doing in terms of capacity. So this is like your basic autoscaling use case. We might be able to say, okay, well, I know that if I've reached 80% of capacity, it's probably a good idea for me to do a little bit more scaling. Well, using event-driven infrastructure, the paradigm is basically: our monitoring system knows how to create events; our infrastructure management or automation system knows how to receive events and respond to them accordingly; therefore, when our monitoring system emits an event that matches the conditions known to our automation system, the automation system knows how to react to it. That's really it in a nutshell. Now, again, I used the example of a monitoring system. I think this stuff gets very interesting when you think of what other types of external services can potentially emit events. In these cases, I'm thinking of things like CI/CD and serverless routines. Amazon, when it comes to, for example, Lambda and DynamoDB, has this notion that, hey, we can take events that happen, like a database insert, and use them to trigger serverless routines. Well, we can use that same type of pattern in infrastructure. And like I said, we can receive those events not just from monitoring or CI/CD, but even in fact from the applications themselves. Because, again, we have this notion that applications should have a good understanding of their own state, they too should be able to emit telemetry-type events onto this bus that the infrastructure management system knows how to respond to. And so this covers cases, again, like autoscaling. It certainly covers cases like self-healing infrastructure. But I also think it has the potential to cover more traditional use cases, like simple application provisioning, for example. So that's the idea in a nutshell.
SaltStack, again, which is the project that I worked on, has this notion of a service called an event reactor, which ingests events on a message bus, against which you can write various rules, and then plug those rules and triggers into actions that you want Salt to undertake when a particular event or trigger occurs. Yeah, that's roughly the idea. I've certainly given a number of talks where I go into depth not just about the principles behind this, but about how you might create this sort of paradigm for yourself using sort of native, nonopinionated tooling.
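
To illustrate the listen/match/react pattern described above, here is a minimal, tool-agnostic sketch in Python. The event tags, the 80% threshold, and the scale_out action are invented for illustration; an in-process queue stands in for a real message bus, and in Salt this role is played by the reactor watching the master's event bus with rules you define yourself.

```python
# A minimal sketch of an event reactor: a bus carries tagged events, rules
# map tag + condition to an action, and the reactor triggers matching rules.
import json
import queue

event_bus = queue.Queue()  # stand-in for a real message bus

def scale_out(data):
    # React to the event, e.g. provision a new machine or load balancer.
    print(f"capacity at {data['capacity_pct']}%: provisioning a new instance")

# Each rule pairs an event tag and a condition with an action, much like
# reactor rules mapping events to the states or jobs they should trigger.
RULES = [
    {
        "tag": "monitoring/capacity_warning",
        "condition": lambda data: data.get("capacity_pct", 0) >= 80,
        "action": scale_out,
    },
]

def react_to_events():
    # Drain the bus, matching each event against the rules.
    while not event_bus.empty():
        event = json.loads(event_bus.get())
        for rule in RULES:
            if event["tag"] == rule["tag"] and rule["condition"](event["data"]):
                rule["action"](event["data"])

# A monitoring system (or CI/CD, or the application itself) emits an event:
event_bus.put(json.dumps(
    {"tag": "monitoring/capacity_warning", "data": {"capacity_pct": 85}}
))
react_to_events()
```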

 

[00:31:12.215]

Yeah, definitely. And so what are some misconceptions that folks might have about things like these big automation systems and high-tech tooling and how that interweaves with DevOps in their organization?

 

[00:31:27.982]

Right, yeah, that's a pretty important question. For a long time, we went around saying automate everything. And that, let's be honest, was not a good message. 

 

[00:31:45.916]

Yeah, I lived through those years.

 

[00:31:48.211]

Right, yeah, so did I, and certainly was responsible for my share of damage.

 

[00:31:54.359]

Same here.

 

[00:31:55.611]

Yeah, and I very quickly, hopefully, let's call it very quickly, backed away from that message to a more conservative one, something along the lines of: only automate things that you actually understand. These days I think a couple of things about automation. One, and this is potentially the most controversial, is that I tend to think that for a lot of cases, we've gone far, far, far too far down the declarative path. YAML, and sort of stateful descriptions of how certain things should behave, is wonderful. But we need to be honest about the idea that we're trying to build what are basically procedural workflows on top of these declarative systems. And we're trying to build these procedural workflows by shoving them into this system of requisites between these declarative states.

 

[00:33:19.136]

Now, I understand that I'm using a little bit of jargon here, so let me try to make that a little more clear. Sometimes Bash really is the better solution. And frankly, a lot of times Bash is really the better solution. And too often, I think, we get caught up with this idea that native tooling is better, that if we're using a configuration management tool, we should try to do as much as possible natively in that tooling. And we end up with patterns that just make no sense. And sometimes I look at these, and I come to them and I'm like, great, it's wonderful that we can do all of this complicated orchestration, but this orchestration works best, in my view, when it is the least. The orchestration or automation, and I'm talking about this high-level, sort of requisite-based system like you would see, again, in Salt or Puppet or other systems, works best when it orchestrates procedural scripts, as it were. And if you can find that balance, you can do a lot of really, really great things. But if you're trying to find the magic set of requisites in YAML and Jinja to run a service shutdown before you run a service start, and you can do that in two lines of Bash, you've lost your damn mind anyway. I strayed a bit from the question, but I think it's an important point for people, especially people who are just walking into these systems. The number one mistake that I would see is they would walk in, and they'd be like, oh, my God, there's all this power. I'm going to apply this power everywhere. I'm going to rewrite all my Bash scripts, I'm going to do all this. And I'm like, oh, my God, no. Use the power judiciously. There's no rule that says everything has to be native to this single orchestration tool, nor should it be.
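
As a rough illustration of that balance, here is a hedged sketch in which the procedural ordering (stop before start) stays in two lines of shell, and a thin Python wrapper, standing in for the higher-level orchestration layer, simply invokes it and checks the exit code. The service name is hypothetical and a systemd host is assumed.

```python
# A minimal sketch of "orchestration that orchestrates procedural scripts":
# the ordering logic lives in plain shell; the orchestrator just runs it.
import subprocess

def restart_service(name: str) -> None:
    # The procedural part stays procedural: stop, then start. No requisite
    # graph in YAML and Jinja needed for two lines of shell.
    script = f"systemctl stop {name}\nsystemctl start {name}\n"
    subprocess.run(["bash", "-c", script], check=True)

# Hypothetical service name, purely for illustration.
restart_service("myapp")
```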

 

[00:35:48.865]

Yeah, absolutely, and that's one of the things that I've told people who want to get into DevOps or automation-type work. One of the first things I always tell them is, if you're in Windows, learn PowerShell. If you're in Linux, use Bash, or both if you're in a mixed environment. Because sometimes just having that in the back of your mind, exactly like you say, you can stare at something, this big, huge, complex system, and think to yourself, I could do this with PowerShell, and it would clean it up. And it'll be simple. And how easy is it to manage a two-line Bash script?

 

[00:36:24.544]

That's right. And I think the other side of it, too, is to think of this in terms of representation of complexity. These systems, or a lot of the systems that we manage, are complex. They just are. But one of the questions that we need to be asking ourselves about that complexity is not just how do I manage all of this complexity, but how do I reason about this complexity? And one concrete way that we can ask ourselves this question is: when a new team member joins my team on day one and we show them our system, what is the mean time it's going to take this person to sit down with the system that I've designed and understand what's happening behind the scenes? Can I look at this automation system and reason about what it actually represents? If you can't, or if you think that's problematic, in my view, it's time to take a pause. And certainly, look, maybe the answer is we can solve this with more inline documentation. Maybe the answer is we need a better balance between a procedural approach and other approaches, and that's all fine. But all too often I walk into these situations and they say, yeah, here's 20,000 lines of YAML. And I'm like, you're going to have to draw me a picture.

 

[00:37:58.116]

Yeah, absolutely.

 

[00:37:59.480]

And that's a problem. And I think people need to respect that as a problem. And it's a potential problem when they're doing design.

 

[00:38:08.267]

Yeah, I think that's a really interesting empathetic point of view towards new developers or even developers who've been on the team a while. How hard is it for them to get up and get productive? So where does developer productivity kind of depart from what we usually talk about with DevOps?

 

[00:38:32.546]

So yeah, so I work on the developer productivity team at Elastic, and I think one of the things that's important to say is that if we rewind this all the way to the beginning and you go back to 2009, I was actually there. I was in the front row at Velocity 2009, where John Allspaw gave probably one of the most brilliant tech talks of all time. It was called 10+ Deploys Per Day at Flickr. If you haven't seen it, you should see it. And it's fantastic, and John is brilliant, and all of those ideas are brilliant. But, and this is a big but, I think it may be possible that right now we too frequently see DevOps as the singular way that we think about software development and deployment. And that may be harmful. And let me give you one simple reason why I think that's true, which is that we at Elastic, we don't do 10 deploys a day. We release software on a schedule. It's every couple of weeks, and that's fine. And a lot of organizations are in that same boat, and I fear that too many of them think that that's a problem, when in fact it might not be. And so developer productivity, in my view, we're somewhat similar to DevOps teams in that we think about things like release management, we think about things like continuous integration, we think about things like CI servers, what have you, but our sort of manifesto for what we're trying to achieve is to do all of the things for development teams that they wish they could do, the things that are often in their backlog or on the wish list of a manager or what have you, to try to make their experience more delightful and more efficient, and to help us produce better software. So like I said, not only do we think about things like CI, but, if you go to developers, in my view, and this is a very generic statement, and you say, what could I do to make your day-to-day experience of developing software happier and more pleasant and wonderful, I think a lot of them would say something that DevOps, for example, doesn't even acknowledge, which is: how do I deal with flaky tests? How do I make my test suite stable? Well, there's not much of a conversation in DevOps about how that works. So developer productivity teams, ours and teams like it, are here to think about those types of problems. Secondarily, we're here to think about problems like, how do we structure development environments for developers? We have a relatively complex stack. We have Elasticsearch and Kibana and APM servers and all these test servers and what have you. And all too often, you go to development teams and you're like, how do you develop software? And usually the answer is one of two things. It's like, well, I kind of maybe stand up some services on my laptop, or, yeah, I just reason about it and then I push it into CI and then maybe we see what the tests say. Not often do you go to development teams and they're like, oh, yeah, I have a super sophisticated development environment that has all of the same services that I need, I can stand them up, I can stand them down, and there's really wonderful tooling built around that. We think a lot about that. We think about stuff like, for example, issue reproduction. So when developers get an issue reported, we think about metrics like, what is the mean time for a developer to receive a bug report and be able to reproduce it? That's actually one of the metrics that we try to track, because it's indicative of how sophisticated and robust our developer tooling is.
If they have to download a bunch of software or muck about with repos or permissions or CI or other things, that's just burnt time that they're never going to get back. So we think a lot about that. And then we also think about what is generally on developers' wish lists, which are things like capacity and load testing, and how we can build up environments for developers to do that type of testing and facilitate it for them, so that they can come in and simply ask a question and get an answer, versus the weeks or months that it might take for them to hand-build all of that tooling on their own. I can virtually guarantee you that every development team out there has a list of backlog items like this. Stuff like that, and stuff like, how do we do project metrics, how do we do KPIs in a more sophisticated way? And so at Elastic, what we've done is, we've said, okay, there's this whole constellation of problems around CI, tests, testability, QA, developer tooling, and sort of project and developer metrics, and there's a lot of cross-discipline work in here. And as you can see, we've strayed kind of far from the traditional DevOps role there, because it's our role to in fact support the efficacy and the performance of the developer teams themselves. And so the way that we've structured that is, it's our job to do that, so developers are our consumers, and then we're consumers of the infrastructure team, where the infrastructure team probably does more of what these days people would call DevOps-type work, but what I just call system administration. And so by injecting a developer productivity team in between, again, what is traditionally a DevOps team and the developers themselves, we're able to build all of that tooling that makes our developers, A, really happy, and, B, extremely efficient. And if you look at the pace of software development and feature development at Elastic, and I encourage everyone to do that, I think our record speaks for itself.

 

[00:46:06.064]

And so I would really encourage a lot of these organizations to think about that set of priorities, that constellation that exists around developer support and developer productivity, as being an incredibly important piece that sits between that DevOps level on the top and the developer teams as consumers on the bottom. And what we've done is, we've said, look, we're going to build out these developer productivity teams, and we're going to staff them with really, really experienced developers. Most of the folks on our team have a decade or multiple decades of development or QA experience. And so we're able to step into that role of supporting these developer teams, hopefully in a way that really meets their needs.

 

[00:47:06.616]

So one question that I ask every technologist that I have on here: how do you learn a new technology? If you see something that catches your eye and you're like, I've got to learn this thing quick, what is your procedure for doing that, if any?

 

[00:47:24.090]

Right, yeah, thankfully it's not too often that I'm under pressure to learn something quickly. It does happen. But I'm a big reader, and I'm usually the guy who decides he's going to read 3 or 4 books over the weekend to understand all of the fundamentals and the thinking and the theory around the problem. Because I suppose the message, especially for people who are just starting out in technology, is that it's all fundamentals with a thin layer of syntax on top. And if you can understand that truth, you'll have a really successful career. Because all too often we get new folks who are like, I'm going to memorize every switch of grep, for God knows what reason. Forget about that. Focus on the things that don't change. Look for the patterns. If you're thinking about containerization, or, sorry, if you're thinking about Docker, that's all fine and good. But think about containerization as a principle. Why is it important? What's actually happening? How do containers work? What are competing paradigms in terms of container design, if you will? What's the history of containers? What came before? Which is a question, by the way, that new learners should always, always, always ask. If you're looking at a technology, part of your learning should be having a firm understanding of, A, why it exists, and, B, what came before and why this is better. And by the way, your answer can be "it's not," and that's fine too.

 

[00:49:34.392]

Awesome. So is there anything cool you're working on right now that you'd like to tell us about?

 

[00:49:38.484]

What am I working on right now? Yeah, so there are two things, really, that I've been working on recently. One is flaky test detection. I alluded to this earlier as a problem at almost every development team that you go to. Not everyone, but pretty much everyone has a problem with flaky tests. And so we asked ourselves, well, how do we solve this? And our answer was, well, we probably don't solve it perfectly, but what we can do is give developers better insight into what's flaky and what's not. So what I did to solve this problem was, we store all of our test results in Elasticsearch, of course, and I wrote tooling that walks back through that history of test results and tries to make a determination about flakiness: how often do these failures occur, and do they occur in isolation? And then I apply some fairly basic heuristics to that and come up with a set of scores that determine flakiness. And recently I've been working on trying to make those heuristics much smarter, trying to apply some of Elastic's ML technology to make them even smarter. But what we do then at the end of the day is, if you come and you file a PR with one of our projects and a test fails and it's a flaky test, we have a little bot that will come and tell you right there in your GitHub PR: hey, your test failed, but we think there's a pretty good chance that it's flaky, and we'll give you some insight and some suggestions about what to do with that. And we also, by the way, go and create issues for our developers for those flaky tests, so that they actually have something in their project tracking system to address them, versus what many teams do, which is, yeah, kind of when we see them, maybe we address them. But because we have a system which is detecting and analyzing and producing metrics around them, our teams have been able to dramatically reduce the number of flaky tests, because they have, again, strict analysis about what they are. So that's been pretty successful. We're certainly not the only ones to do this, but we rolled our own solution for doing it.
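
Here is a simplified sketch of what such a flakiness heuristic might look like, assuming the official Python Elasticsearch client (8.x-style API). The test_results index, its field layout, the scoring rule, and the threshold are illustrative guesses, not Elastic's actual tooling, which applies richer heuristics and ML on top.

```python
# A rough sketch: walk a test's recent history out of Elasticsearch and
# score how often it flips between pass and fail. A failure that occurs in
# isolation (pass -> fail -> pass) is a hint of flakiness.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def flakiness_score(test_name: str, window: int = 100) -> float:
    resp = es.search(
        index="test_results",  # hypothetical index of per-run test outcomes
        query={"term": {"test.name": test_name}},
        sort=[{"@timestamp": {"order": "desc"}}],
        size=window,
    )
    statuses = [hit["_source"]["test"]["status"] for hit in resp["hits"]["hits"]]
    if len(statuses) < 2:
        return 0.0
    # Score by how often consecutive runs disagree: a steadily failing test
    # scores low, a test that flips back and forth scores high.
    flips = sum(1 for a, b in zip(statuses, statuses[1:]) if a != b)
    return flips / (len(statuses) - 1)

# A bot could use a threshold like this to annotate a PR and open an issue.
if flakiness_score("test_cluster_restart") > 0.2:
    print("probably flaky: annotate the PR and file a tracking issue")
```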

 

[00:52:26.708]

And then the other thing I've been working on, actually today and yesterday, uses an outstanding tool by the folks over at Shopify. They wrote something called Toxiproxy. Toxiproxy is basically a proxy server, similar to many others, like HAProxy, for example. But what's unique about it is that it allows you to artificially insert different types of load or interference, or latency. So, for example, you can say, okay, take this connection and give it 250 ms of latency, or give it a random amount of latency between these values. And the really well-designed piece of this is that they actually have a REST server and a CLI tool, so that you can speak to your sort of centralized proxy and adjust these values. So what I built was, I talked a little bit about how we focus a lot on trying to recreate systems for developers to develop on, because they develop in this sort of complex environment that has many different services running. And so the system that I built today inserts Toxiproxy into all of the network connections between these things, and then I built a dashboard that looks like, well, you can imagine a graphic equalizer like you'd use in music production. And what that does is, it allows you to say, okay, now go ahead and tune the database: I'm going to drag down the slider and introduce 10 ms of latency here. And okay, now I'm going to drag down the slider for application response time over here. And where we're intending to go with that is to then write test cases, and again, Toxiproxy has this wonderful multilanguage support for test runners, that can represent these failure scenarios. So we can say, for example, and this comes from their documentation, so this isn't my idea, it's theirs: okay, run my services. Now introduce a serious amount of latency at the database layer. Now assert that the front page showed the correct error message that says, we seem to be having a little bit of trouble with our database, can you come back later. So I'm very interested in doing these sorts of test assertions against artificial constraints for the set of services that we build and test against. And, of course, we build application performance monitoring software, so this will be helpful for us in development, being able to produce the sort of test data that we need. We work on lots of things, but that's a sample.
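
To give a flavor of the REST-driven workflow described here, this rough sketch stands a proxy in front of a database and injects 250 ms of latency. It assumes a Toxiproxy server running on its default port (8474) and uses its documented HTTP endpoints; the proxy name, addresses, and toxic settings are chosen purely for illustration.

```python
# A sketch of driving Toxiproxy over its REST API with plain HTTP requests.
import requests

TOXIPROXY = "http://localhost:8474"

# Route database traffic through Toxiproxy: clients connect to :25432,
# Toxiproxy forwards to the real database on :5432.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "postgres",
    "listen": "127.0.0.1:25432",
    "upstream": "127.0.0.1:5432",
}).raise_for_status()

# Add a latency "toxic": every downstream response now takes an extra 250 ms,
# like dragging down one slider on the equalizer-style dashboard.
requests.post(f"{TOXIPROXY}/proxies/postgres/toxics", json={
    "name": "db_latency",
    "type": "latency",
    "stream": "downstream",
    "toxicity": 1.0,
    "attributes": {"latency": 250, "jitter": 0},
}).raise_for_status()

# A test could now assert the front page degrades gracefully, then remove
# the toxic to restore normal conditions.
requests.delete(f"{TOXIPROXY}/proxies/postgres/toxics/db_latency").raise_for_status()
```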

 

[00:56:00.217]

That is awesome. Yeah, I'm going to check out Toxiproxy for the personal projects that I'm working on. I kind of haphazardly create these things, exactly what you're talking about, these little issues that I then have to work around, and yeah, that sounds really cool. That's awesome. Well, thank you very much for talking with me today.

 

[00:56:22.633]

Yeah, cheers, thank you. I really enjoyed our conversation.

 

[00:56:32.943]

Thank you for listening to All Hands on Tech. To see show notes and more info, visit pluralsight.com/podcast. Have a great day.