Most sites claiming to catch AI-written text fail spectacularly

As the fervor around generative AI grows, critics have called on the creators of the tech to take steps to mitigate its potentially harmful effects. Text-generating AI in particular has gotten a lot of attention — and with good reason: students could use it to plagiarize, content farms could use it to spam, and bad actors could use it to spread misinformation.

OpenAI bowed to pressure several weeks ago, releasing a classifier tool that attempts to distinguish between human-written and synthetic text. But it’s not particularly accurate; OpenAI estimates that it misses 74% of AI-generated text.

In the absence of a reliable way to spot text originating from an AI, a cottage industry of detector services has sprung up. GPTZero, developed by a Princeton University student, claims to use criteria including “perplexity” to determine whether text might be AI-written. Plagiarism detector Turnitin has developed its own AI text detector. Beyond those, a Google search yields at least a half-dozen other apps that purport to separate the human-generated wheat from the AI-generated chaff, to torture the metaphor.
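GPTZero doesn’t publish its exact scoring method, but perplexity itself (roughly, how predictable a language model finds a piece of text) is easy to compute with an off-the-shelf model. Here’s a minimal sketch, assuming the Hugging Face transformers and torch packages and using GPT-2 as the scoring model; it illustrates the general signal, not any particular detector’s implementation:

```python
# Minimal sketch: score a text's perplexity with GPT-2.
# Illustrates the general idea behind "perplexity" as a detection signal;
# this is NOT GPTZero's (or any vendor's) actual implementation.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # The labels argument makes the model return the average
    # cross-entropy of predicting each token from its left context.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    # Perplexity is the exponentiated average negative log-likelihood.
    return torch.exp(loss).item()

# Lower perplexity means more predictable text, one (weak) hint that a
# machine may have produced it.
print(perplexity("Mesoamerica is a region that encompasses southern Mexico."))
```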

But are these tools truly accurate? The stakes are high. In an academic setting, one can imagine a scenario in which a missed detection means the difference between a passing and a failing grade. According to one survey, almost half of students say they’ve used ChatGPT for an at-home test or quiz, while over half admit to having used it to write an essay.

To find out whether today’s AI text detection tools are up to snuff, we tapped a ChatGPT-like system called Claude, developed by AI startup Anthropic, to create seven samples of writing across a range of styles. We specifically had Claude generate:

  • An encyclopedia entry for Mesoamerica
  • A marketing email for shoe polish
  • A college essay about the fall of Rome
  • A news article about the 2020 U.S. presidential election
  • A cover letter for a paralegal position
  • A resume for a software engineer
  • An outline for an essay on the merits of gun control

While admittedly not the most thorough approach, we wanted to keep it simple — the goal was to gauge the performance of AI text detectors on text originating from a single AI system (Claude). We tested several of the more popular detectors out there, namely OpenAI’s own classifier, AI Writing Check, GPTZero, Copyleaks, GPTRadar, CatchGPT and Originality.ai.
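The scoring protocol is simple enough to express in code: every sample is AI-written, so a detector is “correct” only when it flags a sample as AI-generated. Here’s a minimal bookkeeping sketch; classify_with() is a hypothetical placeholder, not any of these services’ real APIs, since we used each detector through its own interface:

```python
# Bookkeeping sketch for a test like ours. Every sample is AI-written,
# so a detector is scored "correct" only when it flags a sample as
# AI-generated. classify_with() is a hypothetical placeholder, not any
# service's real API.
DETECTORS = ["OpenAI classifier", "AI Writing Check", "GPTZero",
             "Copyleaks", "GPTRadar", "CatchGPT", "Originality.ai"]

def classify_with(detector: str, text: str) -> bool:
    """Return True if `detector` labels `text` as AI-generated (stub)."""
    raise NotImplementedError("stand-in for each detector's interface")

def tally(samples: dict[str, str]) -> dict[str, int]:
    # Count correct calls per detector across all samples.
    correct = dict.fromkeys(DETECTORS, 0)
    for text in samples.values():
        for detector in DETECTORS:
            if classify_with(detector, text):
                correct[detector] += 1
    return correct
```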

Encyclopedia entry

Claude’s encyclopedia entry reads like something out of Encyclopedia Britannica, complete with rich detail about the rise, fall and lasting impact of ancient Central American civilizations. The ideas flow well enough from paragraph to paragraph, albeit with a non sequitur (or two) thrown in, and the writing style aligns with what you might expect from an academic publication:

Mesoamerica is a region that encompasses southern Mexico and Central America. It was home to pre-Columbian societies that shared common cultural traits. The earliest complex societies in Mesoamerica arose around 3,000 years ago.

The Mesoamerican cultural region was dominated by maize cultivation and the rise of complex civilizations. Mesoamerican civilizations built iconic stepped pyramid temples and other monumental architecture. Many of these civilizations had rich mythological and religious traditions as well as complex calendrical systems and mathematics.

major Mesoamerican civilizations included the Olmec, Teotihuacan, Maya, Zapotec, Mixtec, Huastec, Tarascan state, and the Aztec, which was encountered by the Spanish conquistador Hernán Cortés in the early 16th century. These Pre-Columbian civilizations are known for architectural feats, as well as advances in mathematics, astronomy, and calendrics. They are also known for the development of complex market exchange networks and trade routes connecting the Pacific and Atlantic coasts.

After conquering the Aztec empire, the Spanish built Mexico City on the ruins of the Aztec capital Tenochtitlan. The language of the Aztecs, Nahuatl, remains the most widely spoken indigenous language in Mexico today. Mesoamerican cultural traits such as maize cultivation, buildings with stepped pyramids, and monumental sculptures, have continued in the post-Conquest period and are still present in contemporary Mexican culture.

For those reasons, we predicted that the text would give the detectors some trouble — and it did. Of those tested, only one, GPTZero, correctly classified the text as AI-generated. The others fell short. OpenAI’s classifier wasn’t confident enough to arrive at an answer, while Originality.ai gave the text only a 4% chance of being AI-authored. Not the best look.

CatchGPT was fooled by the AI-generated text.

  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified correctly
  • Copyleaks: Classified incorrectly
  • GPTRadar: Classified incorrectly
  • CatchGPT: Classified incorrectly
  • Originality.ai: Classified incorrectly

Marketing email

Claude’s marketing copy is a humorous blend of real and far-fetched details, but there’s no obvious tip-off that the text is AI-generated. It even includes a price and a call to action — how neat! Ad copywriters, be forewarned:

Subject: Get a Shine That Lasts

Are your shoes looking dull and worn? With Super Shine shoe polish, you can restore your shoes to a glossy, like-new shine and protect them from damage and wear.

Super Shine is made of the highest quality waxes and dyes and is available in a range of neutral and glossy colors to match any shoe type or leather. Our polish is uniquely formulated to clean, polish, and protect your shoes with a single application. The conditioning oils penetrate the leather to nourish it from within while the pigments cover scuffs and scratches and the protective wax shield seals the shine to repel water and other elements.

A perfect shine has never been easier—just wipe away dirt with a damp cloth, apply Super Shine with a soft brush, and buff to a brilliant shine. Our polish dries to a hard finish, so you won’t leave marks on your hands or clothes. And because a little goes a long way, a single tin will polish up to 100 pairs of shoes.

Never settle for lackluster-looking shoes again. For a shine that demands admiration, choose Super Shine—available for $9.99 at your local retailer

A poor showing from Originality.ai.

Incredibly, the snippet stumped all of the detectors. But to be fair, it was shorter than our encyclopedia entry, and detectors tend to perform better with lengthier samples of text, where the telltale patterns are more obvious.

  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified incorrectly
  • Copyleaks: Classified incorrectly
  • GPTRadar: Classified incorrectly
  • CatchGPT: Classified incorrectly
  • Originality.ai: Classified incorrectly

College essay

Claude couldn’t write us a very lengthy college essay owing to its technical limitations, but as if to make up for it, the AI packed as much detail as it could into a few short paragraphs. The sample has elements of a typical in-class essay, certainly, including a thesis statement, conclusion (if not an especially punchy one) and supporting references to historical events:

The Mighty Roman Empire: What Led to Its Downfall?

The Roman Empire was one of the greatest and longest-lasting empires in history. However, by the 5th century CE, the Roman Empire had entered a period of decline and fall. There were several major causes for the fall of Rome: political instability and corruption, barbarian invasions, economic decline, and over-reliance on slaves and the spoils of war.

Politically, the late Roman Empire was full of civil wars, coups, and rival claimants to the throne. This constant power struggle drained the empire of resources and made it difficult to deal with other threats. Emperors used the Praetorian Guard to seize power, but then they were assassinated and new emperors rose, continuing the cycle.

As the empire weakened, invading barbarian tribes saw an opportunity. In 410 CE, the Visigoths sacked Rome itself, a shocking blow to the empire’s prestige. Later, other tribes invaded and gained territory within the empire, sometimes being granted land in exchange for providing soldiers to the Roman army. These tribes weakened the empire, but they also introduced new customs and cultures.

Economically, Rome had a system based on conquest and spoils, not production. The cost of maintaining a large empire, legions of soldiers, and grand spectacles and entertainments was quite high. Once expansion slowed and new conquests brought in less wealth, the unsustainable economic system faltered. At the same time, slaves did much of the labor, and there was little incentive to improve agricultural technology or business practices.

In conclusion, the Roman Empire fell for many reasons, including political instability, barbarian invasions, economic decline, and over-reliance on slavery. Ultimately, it could not withstand the combined effects of these immense internal and external pressures. While the Roman Empire ultimately fell, its legacy lives on in our laws, culture, languages, and more—a lasting reminder of its enduring greatness.

The naturalness of the text was enough to defeat most of the classifiers once again, albeit fewer than the marketing copy did. That bodes poorly for educators hoping to rely on these tools; spotting AI-generated text is a far more nuanced task than detecting plagiarism.

A win for CatchGPT.

  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified correctly
  • Copyleaks: Classified incorrectly
  • GPTRadar: Classified incorrectly
  • CatchGPT: Classified correctly
  • Originality.ai: Classified incorrectly

Essay outline

Most grade school kids can outline an essay. So can AI — without breaking a sweat, Claude spit out an outline for a pros-and-cons essay on the merits of gun control. It helpfully labeled each paragraph (e.g. “Body paragraph,” “Analysis and discussion”), maintaining a dispassionate tone about the divisive topic:

I) Introduction: Introduce the topic of gun control and note that while many people argue that individuals should have the right to bear arms, others contend that gun control legislation could help reduce gun violence.

II) Body paragraph 1: Pro-gun control argument: Advocates argue that easy access to firearms leads to more homicides, suicides and other gun-related deaths. They point out that states with more restrictive gun laws have fewer gun-related deaths. Stricter measures like background checks, waiting periods, and limits on assault weapons could help save lives by keeping guns out of the hands of criminals, terrorists, and other dangerous individuals.

III) Body paragraph 2: Anti-gun control argument: Opponents counter that people have a right to self-defense and that “good” people should be able to have firearms to protect against “bad” people. They contend that gun ownership is an important freedom and that responsible, law-abiding citizens should be able to own firearms. They argue that gun control legislation would not deter criminals, who would always find ways to get access to firearms. Stricter laws would only restrict freedoms of ordinary people and make them less safe.

IV) Analysis and discussion: Discuss perspectives on both sides and analyze merits and issues with each argument. For example, would gun restrictions reduce gun violence or just limit freedoms? Do restrictions disarm good citizens and put them at risk, or mainly deter irresponsible people or criminals? Could some measures like background checks achieve a balance? Explain complexities around the issues and rates of gun violence in different areas with varying levels of restrictions.

V) Conclusion: Wrap up the essay by restating your analysis and overall conclusions on the merits of gun control legislation. Note the multifaceted nature of the issues and argue for a way forward that could potentially reduce gun violence while respecting rights.

The outline might’ve fooled us, but the detectors had an easier time. Three — the OpenAI classifier, GPTZero and CatchGPT — caught on.

OpenAI’s classifier spotted the AI-generated text.

  • OpenAI classifier: Classified correctly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified correctly
  • Copyleaks: Classified incorrectly
  • GPTRadar: Classified incorrectly
  • CatchGPT: Classified correctly
  • Originality.ai: Classified incorrectly

News article

As with the previous samples, there’s nothing obviously artificial about the news article we generated using Claude. It reads well, structured more or less in the inverted pyramid style. And it doesn’t contain obvious factual errors or logical inconsistencies:

Biden Defeats Trump in 2020 U.S. Election, Ending a Divisive Campaign

Former Vice President Joe Biden has defeated President Trump in an election that drew record numbers of voters and emerged as a referendum on Mr. Trump’s turbulent tenure.

Mr. Biden’s victory amounted to a repudiation of Mr. Trump by millions of voters after four profoundly turbulent years that fueled a heated debate about the nation’s values and future. Mr. Biden campaigned on plans to expand health care, address economic inequality and combat climate change, while vowing to ‘restore the soul of the nation’ and summon Americans to a common purpose.

In a brief statement on Saturday morning, Mr. Biden said he was ‘honored and humbled’ by the trust the American people had placed in him. ‘The battle is over, but the campaign to restore the soul of the nation has just begun,’ he said from Wilmington, Del., as celebrants cheered and honked car horns nearby. ‘It’s time for Americans to unite.’

Mr. Trump showed no sign of conceding, claiming without evidence that the election was ‘rigged’ and that his early leads in some states on election night showed he was the rightful winner. There were no major irregularities reported in an election that state election officials and outside experts said went smoothly in the middle of a deadly pandemic.

The outcome amounted to a repudiation of Mr. Trump’s divisive appeals to racial grievances and hard-line responses to the virus, which has claimed more than 232,000 lives in the United States, and left millions out of work.

It’s no wonder, then, that the detectors struggled. Only GPTZero managed to classify the article correctly. Originality.ai went so far as to give it a 0% chance of being AI-generated. Big yikes.

AI Writing Check got it very wrong.

  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified correctly
  • Copyleaks: Classified incorrectly
  • GPTRadar: Classified incorrectly
  • CatchGPT: Classified incorrectly
  • Originality.ai: Classified incorrectly

Cover letter

The cover letter we generated with Claude has all the hallmarks of straightforward, no-nonsense professional correspondence. It highlights the skills of a fictional paralegal job candidate, inventing the name of a law firm (somewhat peculiarly) and making references to legal research tools like Westlaw and LexisNexis:

Dear Hiring Manager,

I am writing to express my strong interest in the paralegal role at your firm. I believe my experience and education in the legal field make me a great candidate for this position.

Over the past two years, I have worked as a paralegal at Smith & Jones Law Firm, where I have gained extensive experience supporting attorneys in all aspects of civil litigation cases. I have consistently organized and maintained thousands of pages of legal documents, including transcripts, affidavits, and discovery material. I have also streamlined the firm’s file management system, resulting in significant time savings. In addition, I have drafted correspondence with clients, opposing counsel, and third parties; assisted at trials; and completed legal research projects to support pre‐trial motions and settlement negotiations.

Prior to my role as a paralegal, I earned an Associate’s Degree in Paralegal Studies from [College Name]. My coursework and internship experiences provided a strong foundation in key areas such as legal research and writing, as well as knowledge of relevant software and databases including Westlaw and LexisNexis. I have kept my skills and knowledge up-to-date through ongoing professional development.

Outside of my work and education experience, I am a diligent and detail-oriented person, with excellent organizational and communication skills. I thrive in a fast-paced environment and am adept at balancing and prioritizing complex, time-sensitive tasks to meet tight deadlines. I would appreciate the opportunity to contribute to the success of your firm’s clients and cases.

Thank you for your consideration. I look forward to speaking with you further about this opportunity.

Sincerely,

[Your name]

The letter stumped OpenAI’s classifier, which couldn’t say with confidence whether it was AI- or human-authored. GPTZero and CatchGPT managed to spot the AI-generated text for what it was, but the rest of the detectors failed to do the same.

GPTZero impressively detected the AI-originated bits.

  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified correctly
  • Copyleaks: Classified incorrectly
  • GPTRadar: Classified incorrectly
  • CatchGPT: Classified correctly
  • Originality.ai: Classified incorrectly

Resume

Pairing the fake cover letter with a fake resume seemed fitting. We told Claude to write one for a software engineer, and it delivered — mostly. Our imaginary candidate has an eclectic mix of programming skills, but none that stand out as particularly implausible:

• John Doe

• Software Engineer, 3 years of experience

• jdoe@email.com • 123-456-7890

• Technical Skills: Java, JavaScript, C++, SQL, MySQL, Git, Agile methodology, Software design, Algorithms, Data structures

• Professional Experience:

› ACME Corp, Software Engineer, 2018-Present

› Worked on core components of company’s flagship product, a SaaS-based big data analytics platform.

› Led design and development of the data ingestion module, capable of handling huge volumes of streaming data. Used Java and MySQL.

› Reduced upstream data errors by 42% through implementation of advanced data validation and correction algorithms.

› XYZ Tech Company, Software Engineer Intern, Summer 2017

› Developed back-end components for ecommerce company using JavaScript and Node.js.

› Prototyped and demonstrated scaling of core databases and APIs to handle 5x growth.

• Education:

› Bachelor’s degree in Computer Science, Big Tech University, 2017

› Courses included algorithms, operating systems, machine learning, software architecture, and theory of computation.

› 3.8 GPA

• Skills: analytical, communication, problem-solving, detail-oriented

• Interests: running, reading, and hiking

Evidently, most of the detectors found it plausible, too. The fake resume even stumped GPTZero, which up until this point had been the most reliable of the bunch.

GPTZero can’t win ’em all.

  • OpenAI classifier: Classified incorrectly
  • AI Writing Check: Classified incorrectly
  • GPTZero: Classified incorrectly
  • Copyleaks: Classified incorrectly
  • GPTRadar: Classified incorrectly
  • CatchGPT: Classified correctly
  • Originality.ai: Classified incorrectly

The trouble with classifiers

After all that testing, what conclusions can we draw? Generally speaking, AI text detectors do a poor job of… well, detecting. GPTZero was the only consistent performer, classifying AI-generated text correctly five out of seven times. As for the rest… not so much. CatchGPT was second best with four out of seven correct classifications, while the OpenAI classifier came in a distant third with one out of seven.
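For the record, the arithmetic checks out. Here’s a quick sketch that tallies the per-detector results exactly as recorded in the lists above:

```python
# Per-detector accuracy across the seven samples, exactly as recorded in
# the lists above (True = classified the AI-written sample correctly).
# Sample order: encyclopedia, marketing email, college essay, outline,
# news article, cover letter, resume.
results = {
    "OpenAI classifier": [False, False, False, True, False, False, False],
    "AI Writing Check":  [False] * 7,
    "GPTZero":           [True, False, True, True, True, True, False],
    "Copyleaks":         [False] * 7,
    "GPTRadar":          [False] * 7,
    "CatchGPT":          [False, False, True, True, False, True, True],
    "Originality.ai":    [False] * 7,
}
for detector, calls in results.items():
    print(f"{detector}: {sum(calls)}/7 ({sum(calls) / 7:.0%})")
# GPTZero lands at 5/7 (71%), CatchGPT at 4/7 (57%), the OpenAI
# classifier at 1/7 (14%), and everything else at 0/7.
```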

So why are AI text detectors so unreliable?

Detectors are essentially AI language models themselves, trained on many, many examples of publicly available text from the web and fine-tuned to predict how likely it is that a piece of text was generated by AI. During training, they compare AI-generated text to similar (but not identical) human-written text from websites and other sources, learning the statistical patterns that give a text’s origin away.

The trouble is, the quality of AI-generated text is constantly improving, and the detectors are likely trained on lots of examples of older generations. Unless they’re retrained on a near-continuous basis, the classifier models are bound to become less accurate over time.
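No vendor publishes its pipeline or its training data, but the general recipe described above (fine-tuning a pretrained model into a binary human-vs-AI classifier) looks something like the following sketch, using Hugging Face transformers with a two-example toy dataset standing in for the real corpus:

```python
# Sketch of the standard detector recipe: fine-tune a pretrained encoder
# into a binary human-vs-AI classifier. Illustrative only; no vendor
# publishes its exact pipeline, and the two-example dataset below is a
# toy stand-in for the real training corpus.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

train_ds = Dataset.from_dict({
    "text": ["I wrote this myself, typos and all.",
             "Mesoamerica is a region that encompasses southern Mexico."],
    "label": [0, 1],  # 0 = human-written, 1 = AI-generated
})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# The hard part isn't the training loop; it's keeping the corpus fresh,
# since samples from older generators go stale as the models improve.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector-sketch"),
    train_dataset=train_ds.map(tokenize, batched=True),
    tokenizer=tokenizer,  # enables padding via the default data collator
)
trainer.train()
```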

Of course, any of the classifiers can be evaded simply by modifying some words or sentences in AI-generated text. For determined students and fraudsters, it’ll likely become a cat-and-mouse game: as text-generating AI improves, so will the detectors.
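To make the evasion point concrete, here’s a toy illustration: a handful of mechanical word swaps is often enough to shift a detector’s score, and the substitution table below is invented purely for the example:

```python
# Toy illustration of the evasion problem: a few mechanical word swaps
# can be enough to shift a detector's score. The substitution table is
# invented purely for this example.
import re

SWAPS = {"utilize": "use", "individuals": "people", "commence": "begin",
         "numerous": "many", "additionally": "also"}

def perturb(text: str) -> str:
    # Replace whole words only, leaving the rest of the text untouched.
    pattern = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: SWAPS[m.group(0).lower()], text)

print(perturb("Numerous individuals utilize AI to commence essays."))
# -> "many people use AI to begin essays."
```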

While the classifiers might help in certain circumstances, they’ll never be a reliable sole piece of evidence in deciding whether text was AI-generated. That’s all to say that there’s no silver bullet to solve the problems AI-generated text poses. Quite likely, there won’t ever be.


TestGorilla scores $70M for skills tests aiming to replace the recruitment resume dump

Sending in a resume is the main way a person hopes to get noticed for a job. But a startup out of Amsterdam called TestGorilla is today announcing $70 million in funding for a very different approach: it has built an assessment platform that can be used to screen for a wide range of job categories, verticals and skills, and it says its approach is more effective, and more equitable, than when the initial screening is done via CVs.

The funding, a Series A, is being co-led by Atomico and Balderton Capital; and it’s coming about a year after TestGorilla raised $10 million in a seed round from Notion, Partech, and CapitalT. TestGorilla is not disclosing its valuation, but in two years of operations, the company has amassed 5,300 customers — including big names like H&M, Sony, PepsiCo and Oracle — that are selecting from a library of some 220 assessments, covering both hard skills such as a particular coding language or accounting program; and soft skills like communication, attention to detail and “culture add.”

“We strongly believe that you have to look at different things when considering job [and recruiting] success,” said Wouter Durville, the CEO who co-founded the company with Otto Verhage (COO). “Does this person know about coding? Okay great, but what about other motivations? We really see this as a first step. The assessment first, the CV second,” once the group is smaller. And for an even smaller group, a job interview, he added.

The gap in the market that TestGorilla is addressing has been a tough one to tackle in HR. For decades, the recruitment market has been hooked on resumes as the first point of entry for people applying for work. But it’s a game of diminishing returns: the more popular or specialized the job, the less effective resume screening becomes for the recruiter. It’s impossible (not to mention exhausting) to read between the lines, and for people to stand out, when each CV looks virtually the same as the next; or even worse, when CVs look basically the same except that one notes, for example, attendance at a more prestigious university — letting bias creep into the process.

This is something that Durville encountered directly: he and his wife had moved to Barcelona, Spain from Amsterdam after starting a social enterprise, a handmade rug business, in 2012. After the move, the company started to expand and they looked to hire. The economy being what it is, Durville and his partner suddenly found themselves fielding hundreds, and then thousands, of resumes for customer support, finance and related roles.

“We had no recruiting department, and we knew we were missing out on a lot of great talent,” he said. “We thought there must be a better way to do this. And that’s how the idea was born.”

The company first started with building tests of its own, and gradually improved its tooling and expanded its scope. The rug company is still operating today, although with TestGorilla taking off, the plan is to wind it down.

A number of companies have embarked on building out assessment tools for specific areas of expertise — for example, Turing has devised screening tests, but they are focused on technical/CS/engineering jobs. What is interesting about TestGorilla is that the company takes a more agnostic approach, testing general skills and complementing them with more specific skills assessments.

TestGorilla tackles that with a two-pronged strategy: it employs its own technical and operational teams to build and run the business, but it also crowdsources experts who work with the startup to build tests in their particular areas of expertise. These experts in turn earn a cut each time their tests are used.

This is one large area where the funding will be going: The plan is to expand TestGorilla’s library by another third — 100 more tests — by the end of 2022; and to hire 100 more people. It will also be building more and deeper integrations with the other tools typically used by those who run hiring processes, including application tracking systems, recruitment platforms and job boards and more.

Apart from providing a more practical way for companies to screen for people who might be better fits for roles, simply by automating the first stage of the process, it’s also, the company argues, a far more equitable approach that helps remove the potential for bias when hiring. It’s a compelling idea: everyone — a Stanford grad up against a self-made candidate who didn’t go to a top-shelf school — has to take the same assessments to gauge not just skills but also possible culture fit, and who comes out better might surprise you.

That, plus the company’s ability to make hiring more efficient, especially in distributed teams, are the two factors that swung the deal for investors.

“Finding the right people is increasingly challenging for even the world’s best brands, and the pandemic has opened employer eyes to huge untapped pools of global talent,” says Atomico’s partner Luca Eisenstecken, in a statement. “TestGorilla is seeing incredible growth with its automated, data-driven approach to solving this problem. And they’re doing this while delivering a fairer hiring process based on skills rather than resumes, eliminating the biases that prejudice decision making.”

“Our portfolio companies hire more than 10,000 people a year and across the board, they have been looking for ways to remove bias while giving candidates the best hiring experience. It’s clear that traditional hiring practices have failed on both these fronts, and that this has only been exacerbated by Covid-19 and the rise of remote hiring,” added James Wise, partner at Balderton Capital. “We’ve already seen TestGorilla become wildly popular within our portfolio as a more effective and fair way to identify people with the right skills for the role, and we’re excited to support the team on their mission to end the era of CV-based candidate screening.”

In search of engagement, Twitter brings algorithmic timelines to Communities

Twitter Communities — the private, interest-based networking feature launched last year — will now gain their own algorithmic timelines, similar to Twitter’s Home timeline, where the most relevant and engaging conversations are surfaced. The company announced on Wednesday it would begin testing this option within Communities across iOS, Android and the web, initially with a select group of users.

In this test, the algorithmic timeline will be called the “For You” feed, while the chronological timeline will be dubbed “Latest.” Users will be able to switch back and forth between the two options, Twitter notes, and whichever option you set for a given Community will become the default every time you return to that group.

The company said the option will help users keep up with the top conversations in Communities where there is a lot of activity. It pointed to Communities like a Harry Styles fan group, a cooking community and the Xbox Community as examples.

However, in our experience across over 20 Communities, the problem is not keeping up with all the conversations taking place; rather, it’s the lack of conversation.

But these new timeline options could help address that, too, as any tweets with engagement could be bumped up to the top of the feed. This could help make a quieter Community seem more active.

The idea behind Twitter Communities was to carve out a space within Twitter’s larger, public social network where people could connect with others who share the same interests. But in reality, there’s a lot of overlap between Communities and another Twitter feature, Topics, which helps people discover the conversation around a given subject by personalizing their feed with tweets, events, and even ads related to the Topics they follow. In other words, if you’re just looking to tune into the conversation about Apple or startups, for example, you may as well just follow that Topic.

Although Communities could allow users to connect more directly with people who regularly post about a particular topic, Twitter implemented the feature in an odd, semi-public format. Your tweets in Communities are public, but only other community members can reply. This design choice could be limiting participation, as users may not feel comfortable fanning out over their niche interests in public. And since Community tweets are associated with your main Twitter identity, you still feel as exposed as when you’re posting to the global, public feed.

If you’re in the test group, you’ll be able to choose how you want to view your Community timelines from a new setting in the upper right-hand corner on each Community page, Twitter says, just like on the main Twitter Home timeline.

The change follows other updates to Communities, including giving mods and admins the ability to pin their Community Tweets (web), the addition of communities search (on web and iOS), mod/admin member removal (on web and Android), and member search (across all platforms).

Twitter notes more features will roll out to Communities over the coming months as the feature is further developed — a statement that seems to somewhat contradict the latest Bloomberg report that says work on consumer-facing features like Spaces, Communities, and newsletters is now being scaled back amid a broader restructuring.

Diffblue launches a free community edition of its automated Java unit testing tool

Diffblue, a spin-out from Oxford University, uses machine learning to help developers automatically create unit tests for their Java code. Since few developers enjoy writing unit tests to ensure that their code works as expected, increased automation doesn’t just help developers focus on writing the code that actually makes a difference, but also leads to code with fewer bugs. Current Diffblue customers include the likes of Goldman Sachs and AWS.

So far, Diffblue only offered its service through a paid — and pricey — subscription. Today, however, the company also launched its free community edition, Diffblue Cover: Community Edition, which doesn’t include all of the enterprise features of the paid versions but still offers an IntelliJ plugin and the same AI-generated unit tests.

The company also plans to launch a new, lower-cost “individual” plan for Diffblue Cover soon, starting at $120 per month. This plan will offer access to support and other advanced features as well.

At its core, Diffblue uses reinforcement learning to build these unit tests. “What we’re doing is unique in the sense that there have been tools before that use what’s called static analysis,” Diffblue CEO Mathew Loge, who joined the company about a year ago, explained. “They look at the program and they basically understand the path through the program and try and work backwards from the path. So if the path gets to this point, what inputs do we need to put into the program in order to get here?” That approach has its limitations, though, which Diffblue’s reinforcement learning method aims to get around.

Once the process has run its course, Diffblue provides developers with readable tests. That’s important, Loge stressed, because if a test fails and a developer can’t figure out what happened, it’s virtually impossible for the developer to fix the issue. That’s something the team learned the hard way: early versions of Diffblue used a very aggressive algorithm that provided great test coverage (the key metric for unit tests) but made it very hard for developers to figure out what was happening.

With the community edition, which doesn’t offer the command-line interface (CLI) of Diffblue’s paid editions, developers can write their code in IntelliJ as before and then simply click a button to have Diffblue write the tests for that code.

“The Community Edition is designed to be very accessible. It is literally one click in the IDE and you get your tests. The CLI version is more sophisticated and it covers more cases and solves for teams and large deployments inside of an organization,” Loge explained.

The company plans to add support for other languages, including Python, JavaScript and C#, over time, but as Loge noted, Java has long been a mainstay in the business world and the team felt it would be the best language to start with.

Diffblue has actually been around for a bit. The company raised a $22 million Series A round in 2017, led by Goldman Sachs, with participation from Oxford Sciences Innovation and the Oxford Technology and Innovations Fund. You obviously don’t raise that kind of money to focus only on unit tests for Java code. Besides support for more languages, unit tests are just the first step in the company’s overall goal of automating more of the programming process with the help of AI.

“We started with testing because it’s an important and urgent problem, especially with the impact that it has on DevOps and the adoption of more rapid software cycles,” Loge said. The next obvious step is to then take a similar approach to automatically fixing bugs — and especially security bugs — in code as well.

“The idea is that there are these steppingstones to machines writing more and more code,” he said. “And also, frankly, it’s a way of getting developers used to that. Because developer acceptance is a crucial part of making this successful.”

3M and MIT partner to develop a new, affordable rapid COVID-19 test

A heavyweight partnership between industry and academia is throwing its considerable weight behind an important task: creating a new low-cost, rapid diagnostic test for COVID-19. Chemical industry leader 3M has partnered with MIT to create a diagnostic tool for COVID-19 that’s easy to use and that can be manufactured cheaply and in large volume for mass distribution.

The test is currently in the research phase, with a team led by MIT Professor Hadley Sikes of the school’s Department of Chemical Engineering. Sikes’ laboratory has a specific focus on creating and developing tech to enhance the performance of protein tests that are meant to provide rapid, accurate results.

3M is contributing its biomaterials and bioprocessing expertise, along with its experience in creating products designed to be manufactured at scale. The end goal is to create a test that detects viral antigens, a type of test first cleared for use in COVID-19 detection by the FDA at the beginning of May. These tests provide results much faster than molecular PCR-based tests, but do have a higher chance of false negatives. Still, their ability to be administered at point of care, and to return results within just minutes, could help considerably in ramping up testing efforts, especially for individuals who aren’t presenting symptoms but could pose a risk to others if carrying the virus while asymptomatic.

The new 3M and MIT project is part of the RADx Tech program created by the National Institutes of Health (NIH) specifically to fund the development of tests that can expand U.S. testing deployment. An initial $500,000 of funding was provided to MIT and 3M from the program, and the project can potentially receive further funding after achieving other development milestones.

Color is launching a high-capacity COVID-19 testing lab and will open-source its design and protocols

Genomics health technology startup Color is doing its part to address the global COVID-19 pandemic, detailing the steps it’s taking to support the expansion of testing efforts in a new blog post and a letter from CEO Othman Laraki on Tuesday. The efforts include development of a high-throughput lab that can process as many as 10,000 tests per day, with a turnaround time of within 24 hours for reporting results back to physicians. To provide the most benefit possible from the effort of standing this lab up, Color will also make the design, protocols and specifics of the lab available, open source, to anyone else looking to establish high-capacity lab testing.

Color’s lab is also nearly ready to begin processing samples — it’s going live “in the coming week,” according to Laraki. The Color team worked in tandem with MIT’s Broad Institute, as well as Harvard and Weill Cornell Medicine, to develop a process and testing techniques that allow for higher-throughput results than standard methods now in use.

The focus of Color’s efforts has been on using automation wherever possible, and on techniques that source parts and components, including reagents, from different supply chains. That’s actually a crucial ingredient in being able to ramp up efforts at scale nationally and globally: if everyone is using the same lab processing methods, you’re going to run up against a bottleneck pretty quickly in terms of supplies. Being able to process tens of thousands of tests per day is great on paper, but it means nothing if one ingredient you need to make that happen is also required by every other testing lab in the country.

Color has also made efforts to address COVID-19 response in two other key areas: testing for front-line and essential workers, and post-test follow-up and processing. To address the need for testing for those workers who continue to operate in public-facing roles despite the risks, Color has redirected its enterprise employee base to providing, in tandem with governments and employers, onsite clinical test administration, lab transportation and results reporting with patient physicians.

For its post-test workflow, Color is working to address the challenges reported by other clinicians and health officials around how difficult it is to be consistent and effective in following up on the results of tests, as well as next steps. So the company is opening up their own platform for doing so, which they’ve re-tooled in response to their experience to date, and making that available to any other COVID-19 testing labs for free use. These resources include test result reporting, guidelines and instructions for patients, follow-up questionnaires around contact tracing, and support for how to reach out to potentially exposed individuals tied to a patient who tests positive.

To date, Color says that it’s been able to operate at cost, backed in part by philanthropic public and private donations. The company is encouraging direct outreach via its covid-response@color.com email address in case anyone thinks they can contribute to, or benefit from, the project and the resources being made available.

SpaceX completes key Crew Dragon launch system static test fire

SpaceX has confirmed that it ran a static fire test of its Crew Dragon astronaut capsule’s launch escape system. That’s a key step it needed to complete, and one under especially high scrutiny, since a static fire of the capsule’s thrusters back in April resulted in an explosion that destroyed that spacecraft. After an investigation, SpaceX and NASA were confident that they had identified and corrected the cause of that faulty test, which seems to have been borne out by today’s firing.

Today’s static fire appears to have gone much more smoothly, with SpaceX noting that it ran for the full planned duration, and that its own engineers, along with NASA teams, will now review the results of the test and the data it provided. So long as what those teams find is within their expected range and criteria for success, they can move on to an in-flight demonstration of the crew escape system — the next and necessary step leading up to the eventual crewed flight of Crew Dragon with NASA astronauts on board.

The in-flight abort test that will be the next key step for Crew Dragon will demonstrate how the SuperDraco crew escape system would behave in the unlikely event of an actual emergency during a crewed mission, albeit with a Crew Dragon spacecraft that doesn’t actually have anyone on board. NASA requires that its commercial crew partners demonstrate this system to ensure the safety of those on board, by showing that they can quickly move the crew capsule to a safe distance away from the spacecraft in case of emergency. Musk has said they’d hope to fly an in-flight abort as early as mid-December, provided this static test shows that everything is behaving as predicted.

If everything goes as planned with that crucial demonstration, NASA and SpaceX are optimistic that a first mission with crew on board could fly as early as the first part of next year. Commercial crew co-contractor Boeing is tracking to a similar timeline with its own Starliner crew capsule program.

YouTube confirms a test where the comments are hidden by default

YouTube’s comments section has a bad reputation. It’s even been called “the worst on the internet,” and a reflection of YouTube’s overall toxic culture where creators are rewarded for outrageous behavior — whether that’s tormenting and exploiting their children, filming footage of a suicide victim, promoting dangerous “miracle cures” or sharing conspiracies, to name a few high-profile examples. Now, the company is considering a design change that hides the comments by default.

The website XDA Developers first spotted the test on Android devices in India.

Today, YouTube’s comments don’t have a prominent position on its mobile app. On both iOS and Android devices, the YouTube video itself appears at the top of the screen, followed by engagement buttons for sharing, liking, disliking, downloading and saving the video. Below that are recommendations from YouTube’s algorithm in a section titled “Up Next.” If you actually want to visit the comments, you have to scroll all the way to the bottom of the page.

In the test, the comments have been removed from this bottom section of the page entirely.

Instead, they’ve been relocated to a new section that users can only view after clicking a button.

The new Comments button is found between the Thumbs Down and Share buttons, right below the video.

It’s unclear if this change will reduce or increase user engagement with comments, or if engagement will remain flat — something that YouTube likely wants to find out, too.

On the one hand, comments are hidden unless the user manually taps the button to reveal them — users won’t happen upon them by scrolling down. On the other hand, putting the Comments button behind a click at the top of the page, instead of forcing users to scroll, could make the comments easier to access.

As XDA Developers reports, when you’ve loaded up this new Comments section, you can pull to refresh the page to see the newly-added comments appear. To exit, you tap the “X” button at the top of the window to close the section.

While XDA Developers reported the test was underway on Android devices in India, we’ve confirmed it’s also appearing on iOS and is not limited to a particular region. That means it’s something YouTube wants to test on a broader scale, rather than a feature it’s considering for a localized version of its app for Indian users.

The change comes at a time when YouTube’s comments section has been revealed to be more than just a home for bullying, abuse, arguments and other unhelpful content — it was also a tool exploited by pedophiles, a ring of whom had communicated through the comments to share videos and timestamps with one another.

YouTube reacted then by disabling comments on videos featuring kids. More recently, it’s been considering moving kids’ content to a separate app. (Unfortunately, it will never consider the appropriateness of having built a platform where young children can be put on public display for the whole world to see.)

A YouTube spokesperson confirmed the Comments test in a statement, but downplayed its importance by referring to it as one of many small experiments the company is running.

“We’re always experimenting with ways to help people more easily find, watch, share and interact with the videos that matter most to them,” the spokesperson told TechCrunch. “We are testing a few different options on how to display comments on the watch page. This is one of many small experiments we run all the time on YouTube, and we’ll consider rolling features out more broadly based on feedback on these experiments.”

A new ‘Hide Tweet’ button has been spotted in Twitter’s code

Twitter confirmed it has a new “Hide Tweet” option in development, but has yet to provide more detail about its plans for the feature. The new option, spotted in Twitter’s code, appears in a list of moderation choices that surface when you click the “Share” button on a tweet – a button whose icon, it seems, has also been given a refresh. As it sounds, “Hide Tweet” appears to function as an alternative to muting or blocking a user, while still offering some control over a conversation.

Related to this, an option to “View Hidden Tweets” was also found to be in the works, which appears to allow a user to unhide tweets that were previously hidden.

The “Hide Tweet” feature was first discovered by Jane Manchun Wong, who tweeted about her findings on Thursday.

Wong says she found the feature within the code of the Twitter Android application. That means it’s not necessarily something Twitter will release publicly, but something the company has at least taken seriously enough to develop.

Reached for comment earlier today, Twitter told us some employees would soon tweet out more context about the feature. As of the time of writing, those explanations had not gone live.

Immediately, there were concerns an option like this would allow users to silence their critics – not just for themselves, as is possible today with muting and blocking – but for anyone reading through a stream of Twitter Replies. Imagine, for example, if a controversial politician began to hide tweets they didn’t like or those that contradicted an outrageous claim with a fact check, people said.

On the flip side, putting the original poster back in control of which Replies are visible may allow people to feel more comfortable with sharing on Twitter, which could impact user growth – a number Twitter struggles with today.

But as of now, it’s not clear whether the “Hide Tweet” button would hide the tweet from everyone’s view, or just from the person who clicked the button.

It’s also unclear what stage of development the feature is in, or if it will be part of a larger change to moderation controls.

If Twitter chooses to comment, we’ll update with those answers.

The feature’s discovery comes at a time when Twitter has been under increased pressure to improve the conversational health on its platform.

In a recent interview, Twitter CEO Jack Dorsey admitted that the platform puts most of the burden on the victims of abuse, which has been “a huge fail.” He said Twitter was looking into new ways to proactively enforce and promote health, so that blocking and reporting become last resorts.

A “Hide Tweet” button doesn’t seem to fit into that plan, as it requires users’ direct involvement with the moderation process.

It’s worth also noting that Twitter already has a “hidden tweets” feature of sorts.

In 2018, the company introduced a new filtering strategy to hide disruptive tweets, which takes into consideration various behavioral signals – like whether the account has verified its email, is frequently blocked, or often tweets at accounts that don’t follow it back. If Twitter determines a tweet should be downranked, it moves it to its own secluded part of the reply thread, under a “Show more replies” button.

Twitter tests a number of things that never see the light of day in a public product. More recently, the company said it was weighing the idea of a “clarifying function” for explaining old tweets. It’s also launching a prototype app that will experiment with new ideas around conversation threads.


Safaricom rolls out Bonga social networking platform to augment M-Pesa

When it comes to monetizing digital social interactions, Kenya’s Safaricom is doing things in its own order. American tech companies such as Facebook and Twitter offered social networks first, then moved to commercialize them.

Through its M-Pesa mobile money product, Safaricom built one of Africa’s most robust commercial webs and now aims to leverage it as a social network.

The vehicle is the company’s new Bonga platform, which Kenya’s largest telco rolls out in pilot phase this week. An outgrowth of Safaricom’s Alpha innovation incubator, “Bonga is a conversational and transactional social network,” Shikoh Gitau, Alpha’s head of products, told TechCrunch.

“It’s focused on pay, play, and purpose…as the three main things our research found people do on our payment and mobile network,” she said. Gitau offered examples: pay could be using M-Pesa and SMS to coordinate anything from tuition payments to e-commerce, play spans online sports betting to gaming, and purpose includes SMS or WhatsApp chat groups that raise money for weddings, holidays, or Kenya’s informal investment groups.

“In our [Bonga] research we’ve said ‘what can we do to build upon those three network behaviors in our network that is Safaricom?,’” she said.

I recently sat in on an Alpha product development session in Nairobi, and late last year talked to Safaricom CIO Kamal Bhattacharya about his vision for the product, as reported at TechCrunch.

“Safaricom’s unique in that we have telco services and a financial services platform that connect nearly every household in Kenya largely on the basis of trade,” he said.

“We’d actually like to move beyond M-Pesa by leveraging its power as a social network to connect people to other product solutions.”

As a telco, Safaricom still has 69 percent of Kenya’s mobile subscribers. Its M-Pesa fintech app, which generated $525 million of the company’s $2 billion in annual revenue, boasts 27 million customers across a network of 136,000 agents.

Through in-house development and partnerships, the company continues to add consumer and small business-based products to its mobile and fintech network. These include digital TV, the M-Kopa solar-powered lighting kit, and Lipa-Na bill pay service.

This week Safaricom will offer Bonga to a test group of 600 users, before updating the product, allowing the initial group to refer it to friends, and then extending the platform in three phases.

Bonga Sasa will facilitate messaging and money transfer between individuals, “enabling users to send or receive money while conversing with each other,” according to a Safaricom release. For example, through Bonga Sasa a parent can send money to their child without having to leave the platform to access another money transfer tool.

Bonga Baraza, expected in mid-2018, will allow users to collect money for purpose driven events, including Kenya’s harambee collective fundraising drives.

Bonga Biashara will build on this use of social networks for commerce. Digitizing Kenya’s extensive informal trading commerce is at play here. Alpha’s research found roughly “2.5 million people doing side-hustles with a smartphone in Kenya” and 12.5 million total running small businesses on smart and USSD devices, according to Gitau.

Bonga will channel Facebook, YouTube, iTunes, PayPal, and eBay in one platform. Users will be able to create business profiles parallel to their personal social media profiles and M-Pesa accounts and sell online. Bonga will also include space for Kenya’s creative class to upload, shape, and distribute artistic products and content.

As for Safaricom’s plan to monetize Bonga, it’s not an immediate priority, according to the Alpha team members I spoke to. “We’ll offer it for free for now, and it’s connected to M-Pesa, which is already monetized,” said Gitau. “The more these services grow, and grow small businesses, the more they grow M-Pesa, which is already profitable.”

Safaricom is exploring how to take Bonga beyond Kenya’s borders, which could include markets where both M-Pesa and Vodafone are present: currently 10 in Europe, Africa, and South Asia.
