The official blog of Codecademy Blog
Stay up to date with feature releases, events, and much more.
As product development becomes more and more data driven, the demand for essential data analysis tools has surged dramatically. Today, we are excited to announce that we've open sourced EventHub, an event analysis platform that enables startups to run their funnel and cohort analysis on their own servers. Getting EventHub deployed only requires downloading and executing a jar file. To give you a taste of what EventHub can do, we set up a demo server with example funnel and cohort analysis queries.
EventHub was designed to handle hundreds of events per second while running on a single commodity machine. With EventHub, you don’t need to worry about pricey bills. We did this to make it as frictionless as possible for anyone to start doing essential data analysis. While more details can be found in our repository, the following are some key observations, assumptions, and rationales behind the design.
- Basic funnel queries require only two indices: a sorted map from (event_type, event_time) pairs to events, and a sorted map from (user, event_time) pairs to events.
- Basic cohort queries require only one index: a sorted map from (event_type, event_time) pairs to events.
- A/B testing and power analysis are simply statistical calculations based on funnel conversion and predetermined thresholds.
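To make the first observation concrete, here is a minimal sketch of a funnel query answered from just those two indices. It is an illustration under stated assumptions, not EventHub's actual code: the sample events, the `build_index` helper, and the `funnel` function are all hypothetical, and plain Python lists stand in for the real index structures.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical sample data: (user, event_type, event_time), already in
# chronological order, so the list position doubles as the event id.
EVENTS = [
    ("alice", "signup", 1),
    ("bob", "signup", 2),
    ("alice", "view_lesson", 3),
    ("carol", "view_lesson", 4),
    ("bob", "view_lesson", 5),
    ("dave", "signup", 6),
    ("alice", "submit_code", 7),
]


def build_index(key_position):
    """Map a key (event_type or user) to parallel lists of times and ids."""
    index = defaultdict(lambda: ([], []))
    for event_id, event in enumerate(EVENTS):
        times, ids = index[event[key_position]]
        times.append(event[2])
        ids.append(event_id)
    return index


BY_TYPE = build_index(1)  # (event_type, event_time) -> events
BY_USER = build_index(0)  # (user, event_time) -> events


def funnel(steps, start, end):
    """Count users reaching each step, in order, between start and end."""
    counts = [0] * len(steps)
    # Index 1: find every occurrence of the first step in the time range.
    times, ids = BY_TYPE[steps[0]]
    entry_ids = ids[bisect_left(times, start):bisect_right(times, end)]
    seen_users = set()
    for entry_id in entry_ids:
        user = EVENTS[entry_id][0]
        if user in seen_users:
            continue
        seen_users.add(user)
        # Index 2: walk this user's events from the entry event onward.
        user_times, user_ids = BY_USER[user]
        first = bisect_left(user_ids, entry_id)
        step = 0
        for event_time, event_id in zip(user_times[first:], user_ids[first:]):
            if event_time > end or step == len(steps):
                break
            if EVENTS[event_id][1] == steps[step]:
                counts[step] += 1
                step += 1
    return counts


print(funnel(["signup", "view_lesson", "submit_code"], 1, 7))
```

The event-type index finds who entered the funnel, and the per-user index checks how far each of those users got. No other lookup structure is needed for the basic case.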
Therefore, as long as the two indices in the first bullet point fit in memory, all basic analysis (queries by event_type and date range) can be done efficiently. Now, consider a hypothetical scenario in which there are one billion events and one million users. A pointer-based sorted map implementation such as an AVL tree, red-black tree, or skip list can be dismissed, as the overhead of pointers would be prohibitively large. A B+tree, on the other hand, may seem to be a reasonable choice. However, since events are ordered and stored chronologically, sorted parallel arrays are a much more space-efficient and simpler implementation. That is, the first index, from (event_type, event_time) pairs to events, can be implemented as one array storing event_time for each event_type and another parallel array storing event_id, and similarly for the other index. Separate indices are still needed to look up the parallel arrays that belong to a given event_type or user, but since event types and users are orders of magnitude fewer than events, their space overhead is negligible.
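The parallel-array layout can be sketched as follows. This is a simplified illustration rather than EventHub's implementation: the `ParallelArrayIndex` class and its method names are hypothetical, and Python's typed `array` module stands in for the raw 8-byte arrays described above.

```python
from array import array
from bisect import bisect_left, bisect_right


class ParallelArrayIndex:
    """Sorted map from (key, event_time) to event ids, stored as
    two parallel arrays of 8-byte integers per key."""

    def __init__(self):
        # Small lookup from key (event_type or user) to its arrays.
        self._arrays = {}

    def append(self, key, event_time, event_id):
        times, ids = self._arrays.setdefault(key, (array("q"), array("q")))
        # Events arrive chronologically, so appending keeps both arrays
        # sorted without any tree rebalancing or per-entry pointers.
        if times and event_time < times[-1]:
            raise ValueError("events must be appended in chronological order")
        times.append(event_time)
        ids.append(event_id)

    def ids_between(self, key, start, end):
        """Return ids of events for `key` with start <= time <= end."""
        if key not in self._arrays:
            return []
        times, ids = self._arrays[key]
        lo = bisect_left(times, start)
        hi = bisect_right(times, end)
        return list(ids[lo:hi])


index = ParallelArrayIndex()
sample = [("signup", 10), ("submit", 20), ("signup", 30), ("signup", 40)]
for event_id, (event_type, event_time) in enumerate(sample):
    index.append(event_type, event_time, event_id)
print(index.ids_between("signup", 20, 40))
```

Each entry costs exactly two 8-byte slots, and a range query is two binary searches followed by a contiguous slice.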
With parallel array indices, the amount of memory needed is approximately (1B events * (8 bytes for timestamp + 8 bytes for event id)) * 2 = ~32GB, which still seems prohibitively large. However, one of the biggest advantages of using parallel arrays is that within each array the content is homogeneous and thus compression friendly. The compression requirement here is very similar to compressing posting lists in search engines, and with an algorithm like p4delta, the compressed indices can reasonably be expected to be <10GB. In addition, EventHub makes another important assumption: the date is the finest granularity for time filters. As event ids are assigned in monotonically increasing order, the event id itself can be thought of as a logical timestamp. Because EventHub maintains another sorted map from each date to the smallest event id on that date, all queries filtered by date range can be translated to queries filtered by event id range. With that assumption, EventHub is able to get rid of the time array and reduce the index size by a further half (<5GB). Lastly, since the indices are event ids stored chronologically in arrays, and the arrays are stored as memory-mapped files, the indices are very friendly to the kernel page cache. Also, assuming most analysis only cares about recent events, as long as the tail of the indices fits in memory, most analysis can still be computed without touching disk.
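The arithmetic and the date-to-id translation above can be sketched in a few lines. This is an illustration under assumptions, not EventHub's code: `index_bytes`, `DateIndex`, and `ids_in_date_range` are hypothetical names, and `next_event_id` is assumed to be the id the next incoming event would receive.

```python
from bisect import bisect_left, bisect_right
from datetime import date


def index_bytes(num_events, bytes_per_entry, num_indices=2):
    """Uncompressed size of the indices, as estimated in the text."""
    return num_events * bytes_per_entry * num_indices


class DateIndex:
    """Sorted map from date to the smallest event id on that date."""

    def __init__(self):
        self._dates = []
        self._first_ids = []

    def record(self, day, event_id):
        # Only the first event seen on each new date is remembered.
        if not self._dates or day > self._dates[-1]:
            self._dates.append(day)
            self._first_ids.append(event_id)

    def id_range(self, start_day, end_day, next_event_id):
        """Half-open id range [lo, hi) covering start_day..end_day."""
        i = bisect_left(self._dates, start_day)
        j = bisect_right(self._dates, end_day)
        lo = self._first_ids[i] if i < len(self._dates) else next_event_id
        hi = self._first_ids[j] if j < len(self._dates) else next_event_id
        return lo, hi


def ids_in_date_range(sorted_ids, date_index, start_day, end_day, next_event_id):
    """Filter one id-only index array by date without a time array."""
    lo, hi = date_index.id_range(start_day, end_day, next_event_id)
    return sorted_ids[bisect_left(sorted_ids, lo):bisect_left(sorted_ids, hi)]


print(index_bytes(10**9, 8 + 8))  # timestamp + event id per entry
print(index_bytes(10**9, 8))      # event id only
```

Dropping the 8-byte timestamp halves the uncompressed estimate from 32GB to 16GB, which is the same factor of two the text applies to the compressed figures.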
At this point, as the size of the basic indices is small enough, EventHub is able to answer basic funnel and cohort queries efficiently. However, since there are no indices for other event properties, queries filtered by event properties other than event_type still require EventHub to look up the event properties from disk and filter events accordingly. The space and time complexity of this type of query is not easy to estimate analytically, but in practice, when we ran our internal analysis at Codecademy, the running time for most funnel or cohort queries with event property filters was around a few seconds. To optimize query performance, the following key features are implemented; more details can be found in the repository.
- Each event has a Bloom filter to quickly reject events whose properties don't exactly match the filter
- LRU cache for events
Assuming the Bloom filters are in memory, EventHub only needs to do disk lookups for events that actually match the filter criteria (true positives) as well as the false positives from the Bloom filters. As the size of the Bloom filters can be configured, the false positive rate can be adjusted accordingly. Additionally, since most queries only involve recent events, EventHub also keeps an LRU cache for events to optimize query performance. Alternatively, EventHub could have implemented an inverted index, as search engines do, to facilitate fast equality filters. The primary reason for adopting Bloom filters with a cache is that this approach doesn't require adding more posting lists as new event properties are added, and we believe that for most use cases and with proper compression, EventHub can easily cache hundreds of millions of events in memory and achieve low query latency.
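A rough sketch of how a per-event Bloom filter and an LRU cache combine to limit disk lookups is shown below. Everything here is a simplified assumption for illustration (the `BloomFilter`, `LRUCache`, and `FakeDisk` classes, the `key=value` encoding, and the filter sizes); EventHub's actual data layout and hashing may differ.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit set."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits |= 1 << position

    def might_contain(self, item):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> p) & 1 for p in self._positions(item))


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, load):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        value = load(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
        return value


class FakeDisk:
    """Stand-in for on-disk event storage that counts lookups."""

    def __init__(self, events):
        self.events = events
        self.reads = 0

    def load(self, event_id):
        self.reads += 1
        return self.events[event_id]


def build_blooms(events):
    blooms = {}
    for event_id, properties in events.items():
        bloom = BloomFilter()
        for key, value in properties.items():
            bloom.add(f"{key}={value}")
        blooms[event_id] = bloom
    return blooms


def filter_events(event_ids, key, value, blooms, cache, disk):
    matches = []
    for event_id in event_ids:
        if not blooms[event_id].might_contain(f"{key}={value}"):
            continue  # rejected in memory, no disk lookup needed
        event = cache.get(event_id, disk.load)
        if event.get(key) == value:  # weed out Bloom false positives
            matches.append(event_id)
    return matches
```

Only events that pass the in-memory filter reach the cache, and only cache misses reach the disk. Making `num_bits` larger lowers the false positive rate at the cost of more memory per event.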
Lastly, EventHub as is doesn't compress the index, and we have left that as one of our to-dos. In addition, the following two features can be easily added to achieve higher throughput and lower latency if needed.
- Event properties can be stored in a column-oriented layout, which allows a high compression rate and great cache locality.
- Events from each user in funnel analysis, cohort analysis, and A/B testing are independent. As a result, horizontal scalability can be trivially achieved by sharding by user.
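The sharding idea can be sketched as follows. This is a hypothetical illustration, not part of EventHub: the hash-based `shard_for` routing, the two-step `converted_users` funnel, and the event tuples are all assumptions chosen to show why per-shard results can simply be added up.

```python
import hashlib


def shard_for(user, num_shards):
    """Route every event of a given user to the same shard."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(events, num_shards):
    shards = [[] for _ in range(num_shards)]
    for event in events:
        shards[shard_for(event[0], num_shards)].append(event)
    return shards


def converted_users(events, first, second):
    """Two-step funnel on one shard: (users at step 1, users at step 2)."""
    reached_first = set()
    converted = set()
    for user, event_type in events:
        if event_type == first:
            reached_first.add(user)
        elif event_type == second and user in reached_first:
            converted.add(user)
    return len(reached_first), len(converted)


def sharded_funnel(events, first, second, num_shards):
    """Run the funnel on each shard independently and sum the counts."""
    step_one = step_two = 0
    for shard in route(events, num_shards):
        a, b = converted_users(shard, first, second)
        step_one += a
        step_two += b
    return step_one, step_two


events = [
    ("alice", "signup"), ("bob", "signup"), ("alice", "submit_code"),
    ("carol", "submit_code"), ("dave", "signup"), ("dave", "submit_code"),
]
print(sharded_funnel(events, "signup", "submit_code", num_shards=4))
```

Because a user's events never span shards and keep their relative order within a shard, the summed per-shard counts equal the counts from a single machine.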
As always, it's open sourced, and pull requests are highly welcome.
If you like the post, you can follow me (@chengtao_chu) on Twitter or subscribe to my blog "ML in the Valley". Also, special thanks to Bob Ren (@bobrenjc93) for reading a draft of this.
More and more startups are looking to hire data scientists who can work autonomously to derive valuable insights from data. In principle, this sounds great: engineers and designers build the product, while data scientists crunch the numbers to gain insights. In practice, finding these data scientists and enabling them to be productive are very challenging tasks.
Before diving further, it's useful to note a few trends in data and product development that have emerged over the past decade:
- Companies such as Google, Amazon and Netflix have shown that proper storage and analysis of data can lead to tremendous competitive advantages.
- It’s now feasible for startups to instrument and collect vast amounts of usage data. Mobile is ubiquitous, and apps are constantly emitting data. Big data infrastructure has matured, which means large-scale data storage and analysis are affordable.
- The widely adopted lean startup philosophy has shifted product development to be much more data-driven. Startups now face the challenges of defining Key Performance Indicators (KPI), designing and implementing A/B tests, understanding growth and engagement funnel conversion, building machine learning models, etc.
Because of these trends, startups are eager to develop in-house data science capabilities. Unfortunately many of them have the wrong ideas about how to build such a team. Let me describe three popular misconceptions.
Misconception #1: It's okay to compromise the engineering bar for statistical skills.
For data scientists to work productively and independently, they must be able to navigate the entire technical stack and work effectively with existing systems to extract relevant data. The only exception is if a startup has already built out its data infrastructure. But in reality, very few startups have their infrastructure in place before building a data science team. In these cases, a data science team without strong engineering skills or engineering support will have a hard time doing their job. At best, they will produce a suboptimal solution that will be rewritten by another team for production.
To illustrate this, take the example of building the KPI dashboard at Codecademy. Before visualizing the data in d3, I had to extract and join (a) user data from MongoDB, (b) cohort data from Redis, and (c) pageview data from Google Analytics. The data collection alone would've been near impossible without an engineering background, let alone making the dashboard real-time, modularized and reusable.
Misconception #2: It's okay to compromise the statistics bar for engineering skills.
Proper interpretation of data is not easy, and misinterpreted data can do more damage than data that's not interpreted at all (check out "Statistics done wrong"). Building useful machine learning (ML) models is trickier than most people expect. A popular but misguided view holds that ML problems can be solved either by applying some black box algorithm (a.k.a. magic), or by hiring interns who are PhD students. In practice, hundreds of decisions and tradeoffs are made in solving such problems, and knowing which decisions to make requires a lot of experience. (I’ll expand upon this more in a future post titled “Machine learning done wrong”.) For a given ML problem, there are tens if not hundreds of ways to solve it. Each solution makes different assumptions, and it's not obvious how to identify which assumptions are reasonable and which model should be used. Some would argue: why not just try all the different approaches, cross-validate, and see which one works best? In reality, you never have the bandwidth to do so. A strong data scientist might be able to produce a sensible model right off the bat, while a weak one might spend months optimizing the wrong model without knowing where the problem is.
Misconception #3: It's okay to hire data scientists who lack product thinking.
Imagine asking someone who doesn't have a holistic view of the product to optimize your business KPIs. They may prematurely optimize the sign up funnel before making sure the product has reasonable retention, which would lead to more unretained users. Some think data-driven product development is a local optimization. This criticism is only correct when those who drive product development with data fail to think about the product holistically.
To sum up, a productive data science team requires data scientists that are strong in engineering, statistics, and product thinking. It's hard. And it becomes even harder to look for the first data science hire who will be spearheading data efforts in a startup. For startups that don't have the luxury to wait and hire these rare data scientists, it's important to be aware of the compromises made especially in terms of the hiring bar. Before the data team is strong enough across all three areas, make sure they have strong support for the skills they lack, and don't expect them to work autonomously.
If you like the post, you can follow me (@chengtao_chu) on Twitter or subscribe to "ML in the Valley". Also, special thanks to Ian Wong (@ihat), Leo Polovets, and Bob Ren (@bobrenjc93) for reading a draft of this.
We’ve been working hard over the past four months trying to reimagine Codecademy and we couldn’t be happier to finally unveil it to the world. We have redefined every component under our brand, from a single button on our dashboard to our email template, business cards, slides and even apparel.
We had been discussing a design refresh for a while, but somehow it always ended up being pushed to the side. Finally, in October last year, after completing a user segmentation project that brought to life the main user archetypes of Codecademy.com, it quickly became apparent that if we wanted to grow and mature as a brand, we required a thorough redesign of our entire product.
Why a redesign?
Reason #1 – Start fresh
First, there was the obvious problem of design incoherence and variation. This happened primarily because we lacked a well-defined color and font palette, a uniform visual language for our badges, a unified layout scheme (page types), and a cohesive strategy for all print materials – business cards, postcards, posters, etc. After two and a half years of multiple nip and tuck design fixes and additions, it was time to clean up the house and start fresh. This meant we could finally create an extensible UI pattern library (used and shared by designers and developers) and optimize our new face across multiple platforms by embracing a responsive design layout.
A random sampling of pages within our old web ecosystem, showcasing some visual design inconsistency.
Reason #2 – Brand maturity
Second was the realization that our young startup look and feel was slowly becoming incompatible with our future goals and aspirations. At a time when we are engaged in several partnerships with schools, companies, and governments across the globe, while also continuing to fulfill the needs of our growing user base, our brand should feel a bit more mature, inviting, professional, and sophisticated.
Codecademy’s quirky and undistinguished old logo was created by one of our co-founders in a few minutes by browsing through various fonts in a word processor. The logo featured the giddy Lobster font, which has become so popular that it is at times compared with Microsoft’s Comic Sans.
Our new look
Back in early November last year, we partnered with our friend Eddie Opara, and his immensely talented team of designers at Pentagram, in order to create a new visual identity that could better reflect the company’s age, ambition, and main attributes.
The first thing we tackled was the logo, as the key centerpiece of our new look. We spent some time talking to our users, colleagues, and our founders Zach and Ryan, to have a solid grasp of Codecademy's perception and future aspirations. After this important research period, we went through several revisions, continuously narrowing down on the mark that best represented our main traits.
While putting the finishing touches to our new logo, we began creating a complementary color, font and iconography palette. It was important to handle all these components simultaneously, so we could delineate a consistent design thread through all of them. Phase 1 gave us the most critical building blocks of our new brand, through our partnership with Pentagram, and marked the beginning of an exclusive in-house development period.
Various early directions for our new logo.
Narrowing down on a few favorite visual marks.
Our final logo with its underlying grid.
Our new graphical language used across the site to indicate different types of content (symbols), actions and controls (icons), and learning achievements (badges).
FF DIN Rounded - Our primary typeface.
Our new color palette.
After defining the main brand pieces with Pentagram (logo, typography, iconography, color), we started applying it internally to our entire web ecosystem by building a comprehensive number of reusable design patterns. For two weeks we built a sizable UI toolkit covering a variety of elements (see below).
Our first attempt at the UI toolkit, encompassing only a small number of elements.
Our growing UI toolkit covering every element, such as header, footer, form fields, button styles, sign up modules, grid, padding, typography, colors, and interactions.
This was the longest, and perhaps most exhausting, of all phases, where we redesigned 70+ webpages in tandem with other collateral material (email templates, slides, apparel, etc). Fortunately, we imposed a very well-defined timeline on ourselves, with multiple cycles and milestones, which helped guide us through this large task (see sitemap below).
First, we created a comprehensive sitemap of Codecademy.com and then divided the sitemap into four groups, each representing a 2-week delivery cycle. As we redesigned the various pages in each cycle, our brilliant team of developers built our UI styleguide and constructed many of the pages based on the shared design patterns.
Examples of our redesigned pages, from left to right: Enterprise, Stories, Profile.
Examples of our redesigned pages, from left to right: Blog, About, Jobs.
Examples of our redesigned pages, from left to right: Help Center, UK Curriculum, Hour of Code.
Our final phase was all about making sure we were building the thing right. We implemented and tested our new redesign, while in the process getting feedback from our community. We created a huge amount of redlines for all the new material, started experimenting with some versions live on the site, and listened to dozens of comments from our selected users and moderators.
An example of the various comps created to support the accurate implementation of all our redesigned pages.
Even though we spent a long time rethinking Codecademy, this hefty work is still unfinished. It certainly provides our team and product with a much-needed fresh face, one that we can feel proud of, and most importantly, one that our users can thoroughly enjoy. But this is just the beginning. We would love to hear what you have to say about our redesign and how we can continue improving our product. We have dozens of ideas to continue pushing this brand forward. Please keep coming back for more!
Two years ago, we started building a product that would help teach people the skills they needed to succeed in a digital world. As more than 24 million people took Codecademy courses on our web and iOS platforms, we too learned and grew. Now, we’re excited to show you our latest project — a new Codecademy designed from the ground up, aimed to help you learn skills hands-on, with real projects, and constant feedback. Better yet, the new Codecademy experience helps to connect you with the real skills you’ll need to succeed in today’s workplace.
Learning isn’t just about one exercise or “class,” but instead is a gateway to community, opportunities, ideas, and a better life. We’ve witnessed this through the millions of learners on Codecademy and through the thousands of inspiring teachers who have shared their knowledge with the world with our course creator. We listened to them while building what we think is the best learning experience — for anyone, anywhere — to learn the most important skills of today.
In two years, Codecademy has scaled to become larger than we had ever imagined. Our learners, spread across the globe in every country in the world, have:
- written more than a billion lines of code
- joined more than 24 million others in starting the journey of learning to code
- helped to create more than 100,000 courses using our course creator
- hosted meetups in more than 350 cities
- learned on-the-go through our mobile apps for iPhone and iPad, both of which have been featured in the App Store
- worked with nearly 100 partner organizations like The White House and Twitter to both learn and spread their knowledge even further
Today, we’re proud to show off the results of all of that to a few friends and, within days, the rest of the world. The first fruits of this effort are an experience that gets you from knowing nothing to building a website (in this case, Airbnb’s homepage). Along the way, you’ll experiment with blocks of code, see the results of adding and subtracting different parts of a page, and use the real terminology that developers and designers all over the world use to create websites just like Airbnb’s.
Our new platform leaves you not just with new knowledge, but with a portfolio of projects you can share with your friends, enabling them to learn from you. We’ve even built the capability for you to share your work with future employers, and to demonstrate your new skills. We’ve been testing our new learning interface for weeks and we’ve seen it applied in an amazing number of ways — from designers at major firms winning new consulting work because of their ability to build their designs to students in high school making personal webpages for themselves.
Codecademy’s learning experience comes not just from the data behind 24 million learners and billions of lines of code, but also from the individual stories we’ve heard from our wealth of committed learners. Former book critic Juliet Waters, for instance, started learning with her 11-year-old son as part of our Code Year program in 2012. Since then, she’s gone on to chronicle her journey in a book that’s coming soon, noting that programming helped her feel “more connected with others in our tech-driven society.” A parent named Shari told us that her 11- and 13-year-old sons had a “reaction to what they are learning [that] beats their enthusiasm for the [video games].” We work hard every day to deliver a similar experience for our users all around the world. With more than 65% of users outside the US, it’s important to us that we’re democratizing access to the fundamental blocks of knowledge that can improve people’s lives.
Tommy Nicholas’ story is just the sort we’re hoping our new learning environment will foster: he began with almost no programming knowledge at all, and gained enough skills to develop a website, Coffitivity, that was named one of TIME Magazine’s Top 50 websites in 2013.
Billions of lines of code, millions of users, and years after our founding, we’ve been astonished by what people can do when they can easily learn the fundamental skills that can transform their lives. Today, we’ve redesigned Codecademy to reflect that potential — and hopefully to help more people reach their goals and build the future they want to live in.
If you want to help us, we're hiring!
Codecademy started to help anyone learn the skills they need in order to succeed in the twenty-first century. We want Codecademy to be with you everywhere - learning shouldn't be confined to a classroom or a desktop computer.
We launched our first iPhone app, Codecademy: Hour of Code, for anyone in the world to get started learning to code on the go. We've built an entirely new Codecademy experience for mobile that includes the same things that make Codecademy on the web great - interactivity, "snack" sized content, and fun lessons. Our first app gives you the basics of programming and should help absolutely anyone get started with programming - it's almost too easy not to try!
We'll send content updates to this app with more courses for you to complete as time goes on. You'll see more feature updates as well.
Perhaps best of all, this is just the beginning of Codecademy on the go. We want to help you learn the skills that can help you change your life - anywhere and anytime. Download our first app and let us know what you think!
Teacher training is critically important and Codecademy is pleased to partner with CAS to be a part of the solution. Teachers can use Codecademy resources after they've attended training sessions to continue building their skills, and remote teachers can access the platform if they are unable to attend in person training sessions. Below is the official announcement:
From September 2014 schools in England will teach a new statutory computing curriculum, which aims to ensure all students can understand and apply the fundamental principles and concepts of computer science. This will make England the educational envy of almost every other country in the world, but it will also be a major step change from what schools currently teach. Not surprisingly this has left many teachers looking for support and further training.
CAS is running a national Network of Excellence for Teaching Computer Science, that aims to provide exactly this support and training. Codecademy, based in New York, will complement CAS’s in-person approach with a free online platform and interactive learning resources specifically designed to support the programming aspects of the new computing curriculum in England. Teachers can use it to learn programming themselves, or as a way to teach programming to their students.
Simon Peyton-Jones, Chair of CAS, says: “The UK has tens of thousands of teachers who need support and encouragement to deliver the new Computing curriculum with confidence and enthusiasm. Codecademy offers us the scalability of an online platform, and teachers can move smoothly from learning programming themselves to using Codecademy to teach their students. I’m delighted to have this support.”
It's that time of the year again. The air is crisper, lattes are spiced with pumpkin-flavor extract, and Codecademy is coming to a campus near you!
Codecademy is taking to the road for a whirlwind college tour up and down the eastern seaboard. We've had a great time sharing stories from our college fellowship program and spilling the technical details on how we evaluate up to 5,000 code submissions a second (over 25 million a day!).
So far, we've had a turnout at Brown that would scare any fire marshal and we've stayed up all night to hack with the best and brightest at MIT's blow-out HackMIT event. But we're just getting started, so come join us if we're visiting a campus near you!
Hope to see you soon!
MIT tech talk
10/8 at 7pm
Bldg 5, Room 233
Olin tech talk
10/9 at 7pm
Academic Center 126
10/11 at 9pm
MIT On campus interviews
Apply on careerbridge, if interested
A few of us at Codecademy spent the weekend at MIT for HackMIT - a hackathon involving over 1000 college students from around the world. It's been an awesome experience seeing so many coders in one location hacking away at amazing projects.
The atmosphere inspired us to start hacking ourselves. For our hack, we decided to programmatically find the most active Codecademy users at HackMIT and give them some love and swag.
By querying our database for all of the emails of attendees at HackMIT, we were able to find all the Codecademy users in attendance. We then sorted the list of users by points and achievements to find the top 3 most active Codecademy users at HackMIT.
Here is a photo of us with our top HackMIT user - spiltpeasoup!
We were surprised to find out that over 25% of hackers at HackMIT have a Codecademy account! If you are at HackMIT this weekend, stop by and say hi!
As some of you may have realized, Friday morning at about 10:00am, our site was not operable for 2 hours. We apologize for the inconvenience and wanted to explain to you why this happened.
Our hosting provider, Amazon Web Services (AWS), was having networking issues. This affected our app servers, app load balancer, and Redis boxes. Some of you may have noticed a 503 error, which was thrown by our CDN (content delivery network). During these two hours, we were able to restore the site, but because of the networking issues, the site was very slow. At 12:07 Amazon Web Services resolved the issue, and the site was back up and running as normal. Because this was a networking issue, no content or progress was lost.
Again we apologize for any inconvenience caused by this downtime. Unfortunately this particular issue was out of our control. We're investigating ways we can add greater redundancy to Codecademy to help ensure we're protected from similar issues in the future.
See what we posted on twitter Friday morning.
We're always asking ourselves how we can help users when they get stuck while working through our courses. I had an idea at a company hackathon that turned into a big project.
For the past few weeks we've been running my idea as an experiment: when certain users write code that returns a common syntax error, we show them a snippet of code from the glossary that's an example of what they were trying to do.
The idea was that people need to see examples of good code before they can learn how to write good code.
In order to do this we had to redo the internals of the glossary to be more dynamic. With that done, another great result has been that we were able to hand over editing privileges to our moderators, who have been thrilled to take ownership of this part of the site and improve the content. They've been doing a good job of cleaning it up and adding to it, and this will be an on-going thing.
The best part of this is that as the moderators improve our glossary, the code suggestion feature will get smarter and show more examples! They come right from the glossary.
The overhaul was also aesthetic; the glossary is now much prettier and easier to read. The typography improved thanks to some expertise from our designer Jason and the code samples now match the color of our code editor.
Go take a look!