
Can Machine Learning Help Us Find New Earths?

Diego Hidalgo - October 20, 2022

As Machine Learning (ML) and Deep Learning (DL) techniques become more sophisticated, they are being applied to an increasing number of tasks. One such application is the search for exoplanets. In this blog post, we discuss how that search is currently done, with a particular focus on the Transit Method, and what role Machine Learning can play in overcoming its challenges.


Machine and Deep Learning techniques are here to stay. Their automated nature and their greatest strength, the ability to learn, make it possible to handle almost any type of data. Their wide variety of techniques supports a multitude of disciplines, including astrophysics and, in particular, the search for exoplanets, where they are especially relevant to the so-called transit method.

Exoplanets and the Transit Method

An exoplanet is a planet that orbits a star outside of our solar system. Exoplanets are found with several methods: the Radial Velocity Method, which detects the tiny wobble of a star caused by the gravitational tug of an exoplanet orbiting it; Transit Photometry, which looks for the slight dip in a star’s brightness that occurs when an exoplanet passes in front of it; and Direct Imaging, which captures images of exoplanets directly.

In this article, I’ll focus on the Transit Method. This method can be summarized in a single sentence: “a transit is when an exoplanet blocks light reaching the Earth from the star it orbits”. However, actually detecting one is an extremely complicated task.

Without getting too technical: the search for exoplanets has its main detection site in outer space, beyond the influence of our beloved atmosphere, for the simple reason that the atmosphere absorbs much of the light we receive from space. That is why exoplanet-hunting techniques had their greatest expansion once space telescopes were used for detection. First COROT, then Kepler/K2, and lately TESS have fired the starting shot for the expansion of our vision of the great variety of exoplanets we are surrounded by. As of August 23, 2022, the NASA Exoplanet Archive had confirmed the existence of 5,071 exoplanets, with more than 7,000 possible candidates.

When I started my Ph.D., one of the most daunting things was the immense amount of data processing that had to be done. The Kepler space telescope had recently finished its main mission and, fortunately, NASA had been able to recover it from a double failure in its pointing system to continue operating with the renamed K2 mission.

The Two Challenges: Too Many Light Curves and Too Much Noise

To give us an idea of the dimension of the problem in the search for exoplanets, Kepler yielded some 150,000 light curves in its first mission and some 40,000 light curves per quarter in its almost 6 years operating in the K2 mission. A light curve, in a nutshell, is a graph of light intensity of a celestial object or region as a function of time.

In other words, with the Kepler mission alone, we have more than a million light curves to analyze and classify. For the TESS mission we received millions of light curves. And this is one of the great problems facing the field today:

  • The existence of a lot of data to analyze, which for an expert or even a small group of experts would take years of work, and which would almost certainly lead to errors in the analysis and classification.

A further problem derives from dealing with images full of stars and having to select the target star unambiguously:

  • The choice of a suitable aperture for each star to reduce signal contamination from nearby stars as much as possible.

Let’s return to the definition of the light curve. The simplest and most practical definition is: “the amount of measured light that reaches the Earth per unit of collecting area (set by the diameter of our telescope) and per unit of observation time”. This form of observation is what astronomers call “Photometry”, and, typically, a Kepler light curve has the shape shown in Figure 1.

Figure 1: Light curve of K2-264 obtained with the Kepler space telescope. The system hosts two transiting planets, whose positions are indicated by the vertical red and blue lines. The upper panel shows the light curve before flattening, and the lower panel shows it after flattening.

But to understand this definition and its concrete application to this field of astrophysics, it is best to answer the following questions: How is a light curve obtained and what is an exoplanetary transit?

To understand this, we need to take a closer look at our solar system. By now, everybody knows the dynamics behind a solar eclipse, which, in essence, is the blocking of the light of our star, the Sun, by the Moon, our satellite (see figure 2).

Figure 2: Diagram of a solar eclipse.

We can apply this phenomenon of blocking the Sun’s light to Mercury, the closest planet to the Sun. The phenomenon that occurs when Mercury crosses the imaginary line that joins the Earth and the Sun (the line of sight), blocking a tiny fraction of the Sun’s light, is called a Transit. Normally this phenomenon is not visible to the naked eye, and we need a telescope to see it in detail (figure 3).

Figure 3: Image of the Sun and representation of the transit of Mercury at different times, which occurred on November 11, 2019.
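To get a feel for how tiny that blocked fraction is, note that the dip is roughly the ratio of the two projected disc areas, i.e. (R_planet / R_star)². A back-of-envelope calculation with rounded radii for Mercury and the Sun:

```python
# Back-of-envelope transit depth: the blocked fraction of starlight is
# roughly (R_planet / R_star)^2, the ratio of the projected disc areas.
R_MERCURY_KM = 2_440   # mean radius of Mercury (rounded)
R_SUN_KM = 696_000     # mean radius of the Sun (rounded)

depth = (R_MERCURY_KM / R_SUN_KM) ** 2
print(f"Mercury blocks about {depth:.1e} of the Sun's light ({depth * 100:.4f}%)")
```

A dip of about one part in a hundred thousand is far below anything the naked eye could perceive, hence the telescope.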

Imagine that I am a keen amateur photographer and I attach my camera to a telescope to take pictures of the Sun on the day of Mercury’s transit. In the morning, I take a picture (time unit, t), and by adding up all the pixels of the camera where the Sun is, I obtain a number of counts (photons) that I transform into a measurable physical quantity. Let’s call this quantity “Luminosity” (L). Having obtained a point (t, L), I take another and another, and so on at regular intervals, until night arrives. If I now make the typical Cartesian plot we learned at school, with time on the “x” axis and the Sun’s luminosity on the “y” axis, we would see something very similar to what you see in the following video.
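The procedure just described, summing the pixels where the star is once per exposure, is simple enough to sketch in a few lines of Python. This is a toy version on synthetic frames; the frame size, aperture, and count levels are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def measure_luminosity(frame, aperture_mask):
    """Sum the counts of every pixel inside the aperture (where the star is)."""
    return frame[aperture_mask].sum()

# Toy setup: 10x10 frames with a bright "star" inside a fixed 3x3 aperture.
size, n_frames = 10, 5
aperture = np.zeros((size, size), dtype=bool)
aperture[4:7, 4:7] = True

light_curve = []
for t in range(n_frames):                         # one frame per time unit
    frame = rng.normal(10.0, 1.0, (size, size))   # sky background counts
    frame[4:7, 4:7] += 1000.0                     # star counts
    light_curve.append((t, measure_luminosity(frame, aperture)))

for t, L in light_curve:
    print(t, round(L))
```

Each (t, L) pair is one point of the light curve; plotting them against each other gives exactly the Cartesian graph described above.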

Now that we are clear about what a light curve is, let’s apply the same reasoning to other stars. However, as the light we receive from them is much fainter, observing transits of other planets from the surface of the Earth is very difficult, and we have to go to space to get them. The difference between the transit of Mercury and that of an exoplanet observed from space is the observing cadence, that is, how often we take the pictures: typically, for the Kepler/K2 space mission, one photometric point roughly every 30 minutes, taken continuously for about 3 months. This is what we observe in Figure 1.
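To see what such a space-based light curve looks like in numbers, here is a toy simulation: a flat, slightly noisy flux series sampled every 30 minutes for 90 days, with a periodic 1% box-shaped dip injected. The period, duration, and depth are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

cadence_days = 0.5 / 24                      # one point every 30 minutes
t = np.arange(0.0, 90.0, cadence_days)       # ~3 months of observations
flux = 1.0 + rng.normal(0.0, 1e-4, t.size)   # normalized flux plus noise

period, duration, depth = 10.0, 0.2, 0.01    # days, days, fractional dip
in_transit = (t % period) < duration
flux[in_transit] -= depth                    # the planet blocks 1% of the light

measured = 1.0 - flux[in_transit].mean()
print(f"Recovered transit depth: {measured:.4f}")
```

Averaging the in-transit points recovers the injected depth despite the noise, which is the basic principle behind transit detection.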

Kepler and TESS: two ways, one goal

Without going into too much detail here either, it is important to differentiate between the pictures taken by Kepler/K2 and the pictures taken by TESS, because they are totally different solutions. The technical limitations of the Kepler mission meant that it transmitted to Earth only small cutouts of a few pixels from each of the 42 CCDs (Charge-Coupled Devices) of which it was composed. The pixels of a CCD work in a similar way to those of a flat-panel television, but instead of emitting light, they collect it, or rather, electrons. These pixel cutouts are shown in Figure 4.

Figure 4: Image centered on the target star, with at least three nearby sources capable of contaminating the sample.

In contrast, the photos from the TESS mission, which has much lower spatial resolution but incorporates better data-transmission technology, convey all the information collected by its 4 cameras back to Earth. A sample of this information can be seen in detail in Figure 5.

Figure 5: Single CCD image from the TESS space telescope showing the Large Magellanic Cloud and the thousands of stars surrounding it.

The technical difference between the two telescopes makes the observing strategy (how the pictures are taken) completely different. While the original Kepler mission focused on observing a single corner of the sky, and its extended K2 mission on certain regions of the plane of the ecliptic, the TESS space mission continues to observe half of the celestial vault each year. See Figure 6.

Figure 6: Map of the celestial vault. In orange, the field of view of the Kepler/K2 space missions; in green, the field of view of the TESS space mission.

In broad strokes, the Kepler/K2 mission focused on small regions of the sky, taking small image snippets of almost a million stars, while the TESS mission is able to take pictures of almost the entire sky (with countless stars) and send the entire picture to Earth.

With the problem explained and the characteristics of each mission briefly summarized, the scientific community faces the challenge of finding the best way to obtain the valuable information stored in each of the photos. So far, the solution has been more effective than efficient: basically, it consists of extracting the information from pixels previously defined by a fixed aperture, given a stellar brightness taken beforehand from star catalogs.

For the Kepler/K2 case, the problem is simpler, although not trivial, because the star is normally in the center of the photo cutout, which makes it more likely that if we find a signal, it will be associated with the main source within the aperture. But there is also the possibility that the signal detected in the source is contaminated by a background or nearby star intense enough to introduce its signal into the selected aperture. In addition, the target star has been previously selected to be bright enough that contamination from a background star is minimal.

The solution adopted by NASA for Kepler/K2 was simple: apply a fixed aperture to each star according to its apparent magnitude. In 80% to 90% of cases this aperture is effective and allows this simple technique to detect exoplanetary signals.
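The idea of a fixed aperture keyed to brightness can be sketched as follows. Note that the magnitude cut-offs and radii below are hypothetical illustrations, not the values of NASA's actual pipeline:

```python
import numpy as np

def fixed_aperture(shape, center, radius_px):
    """Circular pixel mask of a given radius around the target star."""
    yy, xx = np.indices(shape)
    return (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius_px ** 2

def radius_for_magnitude(mag):
    """Hypothetical rule: brighter stars (lower magnitude) get wider apertures."""
    if mag < 10:
        return 4
    if mag < 13:
        return 3
    return 2

# An 11x11 cutout with the target star in the middle, magnitude 12.
mask = fixed_aperture((11, 11), center=(5, 5), radius_px=radius_for_magnitude(12.0))
print(mask.sum(), "pixels in the aperture")
```

Summing the counts inside `mask`, frame after frame, yields the light curve of the target without any per-star tuning, which is exactly what makes the fixed-aperture approach so cheap at scale.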

For the TESS telescope, things change substantially. In this case, we have thousands of stars in each of the CCDs, so each star has to be selected individually, and since TESS has a lower spatial resolution than Kepler, it is very unlikely that a star is free of contamination from another nearby source of relevant brightness.

The solution adopted for the TESS mission depends on the research group analyzing the images, since NASA distributes the images without any prior filtering; that is, we obtain images directly taken by the space telescope, which we then need to “clean” and manipulate to extract good-quality data. In most, if not all, research groups, light sources are selected using the aforementioned catalogs of stars and galaxies, and then the brightest ones are chosen. This introduces an additional drawback, apart from the choice of aperture: it assumes that the light sources are exactly where the catalogs say they are, when, as we know, all light sources in the celestial vault have their own proper motion.

Shedding some light on the problem

Having said that, let’s spend a few lines explaining the problem better. One of the great disadvantages of performing photometry from space is the lack of monitoring of the components of the measurement system. What does this mean? Simply that we have no physical access to calibrate the instrumentation, which, in our particular case, makes it very difficult to know, among other parameters, how the sensitivity of the pixels degrades over time beyond the last measurement taken in the laboratory on Earth.

The special conditions of spatial resolution in space telescopes, whereby a light source takes up only a few pixels (at most tens of pixels), and the lack of calibration, make both aperture and calibration crucial tasks.

As for calibration, several well-established techniques with different methods give very good results, while the choice of aperture is a problem still to be refined beyond setting a fixed one. The large number of light sources available for study means that not much importance is given to this problem, which leads to a loss of efficiency in the search for signals, for the simple fact that there are too many light sources to search.

This problem is common to both the Kepler/K2 and TESS space missions: the first has finished collecting data, yet retains some potential for finding more promising signals, while TESS, still in operation, has much potential for improvement in locating bright sources and refining aperture selection. It is crucial to automate a form of pixel picking that takes into account the myriad of drawbacks we may encounter when measuring, such as:

  • Selecting the sky background. In photos with so few pixels, obtaining a sky background for comparison is very important for some methods of obtaining the light curve.
  • Selecting the other light sources present in the image and masking them so that they do not introduce information into our aperture.
  • Discarding pixels in bad columns caused by nearby light sources that have saturated the CCD and ended up covering more pixels than expected.
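The first two steps can be sketched on a synthetic frame. The source positions, count levels, and the 5-sigma threshold below are illustrative assumptions, and the saturated-column case is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 16x16 frame: target star in the center, one contaminating source nearby.
frame = rng.normal(10.0, 1.0, (16, 16))   # sky background counts
frame[7:10, 7:10] += 500.0                # target star
frame[2:5, 11:14] += 300.0                # nearby contaminant

# 1. Sky background: robust estimate from pixels that are not obviously sources.
median = np.median(frame)
sigma = 1.4826 * np.median(np.abs(frame - median))   # MAD-based scatter
threshold = median + 5 * sigma
sky = np.median(frame[frame < threshold])

# 2. Flag bright pixels outside the target aperture so they cannot leak
#    counts into our measurement.
aperture = np.zeros(frame.shape, dtype=bool)
aperture[7:10, 7:10] = True
contaminated = (frame >= threshold) & ~aperture

flux = frame[aperture].sum() - sky * aperture.sum()
print(f"Sky ~{sky:.1f} counts/px, {contaminated.sum()} contaminated pixels masked")
```

The median-of-absolute-deviations estimate is used instead of a plain standard deviation so that the bright sources themselves do not inflate the noise estimate and hide the contaminant.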

In general, the first point is the least important when selecting our aperture. However, the other two points are crucial. For TESS the problem is even greater, since its spatial resolution, as we have mentioned, is lower, and finding a bright source sufficiently separated from the rest to obtain a good aperture is a very unlikely situation. In addition, most sources are not cataloged, or are cataloged only in certain restricted parts of the sky, which makes it very difficult to know the nearby sources in advance. And even if we knew them, the movement of each source means that what we know today about its position is only valid for a certain period of time, albeit one measured in years. It is true that GAIA is doing a very important job on this point, and with every year that goes by the problem is being reduced, but we must take into account that once that mission is over, the problem will reappear.

And does that end all the problems? Nothing could be further from the truth. With Kepler, reviewing 150,000 light curves is not an easy task even for a person specialized in light curves, and if you divide the work among several experienced people, things improve, but the drawbacks remain: more people means more evaluation criteria, and what for some might be a signal, for others is not, or simply goes unnoticed. With Kepler, the fact that there are different research groups around the world solved the inconvenience of losing some signal among the hundreds of thousands of curves, but with K2 the situation worsened considerably: its increased noise and almost a million light curves to study meant that many periodic signals initially went unnoticed. With TESS, the situation is simply unmanageable from a human point of view; with several million light curves, you would need at least a couple of full-time people sorting all of them. That solution is unfeasible because the mere fatigue of performing such a routine task makes you lose efficiency.

Applying AI in science: level up!

How could we address these problems in order to optimize the search for exoplanets in these types of space missions? We have a few options available.

First of all, we can combine the problem of the choice of aperture and that of the unambiguous localization of light sources under the same umbrella. For these problems we could use the technique called image recognition, which would consist of:

  1. Detection of light sources in the image.
  2. Segmentation of each light source, i.e. the specific identification of the pixels occupied by each source in the image.
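A minimal version of both steps can be written as a brightness threshold followed by connected-component labeling, where each labeled region is one source and its pixels are its segmentation. The synthetic frame and threshold below are invented for illustration; real pipelines use far more robust detection:

```python
import numpy as np
from collections import deque

def segment_sources(image, threshold):
    """Label connected bright-pixel regions: each label is one light source."""
    bright = image > threshold
    labels = np.zeros(image.shape, dtype=int)
    current = 0
    for start in zip(*np.nonzero(bright)):
        if labels[start]:
            continue                      # pixel already belongs to a source
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:                      # flood-fill the connected region
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < image.shape[0] and 0 <= nx < image.shape[1] \
                        and bright[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = current
                    queue.append((ny, nx))
    return labels, current

# Synthetic frame with two separated "stars" on a flat background.
img = np.full((12, 12), 10.0)
img[2:4, 2:4] = 200.0
img[8:11, 7:9] = 150.0

labels, n_sources = segment_sources(img, threshold=50.0)
print(n_sources, "sources found")
```

Once every source has its own label, the pixels of each label form a natural, per-source aperture, and any label other than the target's becomes a mask, addressing both problems at once.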

The next big problem is the classification of the thousands of light curves we will obtain (almost certainly millions). From a quantitative point of view, they are all time series, albeit with different cadences, and most importantly, they all carry the same type of information even though the way it is obtained differs.

Traditionally, time series have been studied using auto-regressive predictive models. However, neural networks have seen very rapid growth in recent years because of their versatility and accuracy, especially recurrent networks combined with convolutional networks. Implementing these techniques so as to classify any type of light curve jointly is crucial to taking the transit-method search for exoplanets to a higher level; otherwise we will keep applying these methods to specific missions and, therefore, wasting all the previous information we have.
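In practice such networks are built with a deep learning framework, but the core convolutional idea is simple enough to sketch in plain NumPy: slide a bank of filters over the flux series, pool the responses, and feed them to a logistic output. Everything below (filter bank, output weights) is random and untrained, purely to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(7)

def conv1d_features(series, kernels):
    """Slide each kernel over the series (valid mode) and keep the max response."""
    return np.array([np.correlate(series, k, mode="valid").max() for k in kernels])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy light curve with a transit-like dip.
flux = np.ones(200) + rng.normal(0.0, 1e-3, 200)
flux[90:100] -= 0.01                       # the dip

kernels = rng.normal(0.0, 1.0, (4, 15))    # 4 untrained filters of width 15
weights = rng.normal(0.0, 1.0, 4)          # untrained output layer

features = conv1d_features(flux - flux.mean(), kernels)
score = sigmoid(features @ weights)        # would be a planet probability once trained
print(f"Untrained 'planet' score: {score:.3f}")
```

A trained model would learn kernels that respond to the characteristic box- or U-shaped dip rather than to noise, and the same architecture can ingest curves from any mission once cadences are normalized.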

These two approaches to solving a current problem in the search for exoplanets by the transit method are so extensive that each could be the main subject of a doctoral thesis. However, the benefits we would obtain in return would be a qualitative leap in the detection of exoplanets.


The Transit Method is the most common approach to planet hunting and involves detecting a decrease in star brightness as a planet crosses in front of it. However, this method is not perfect and can be hampered by noise from the star or other sources. Machine Learning can help to overcome some of these issues and improve our ability to detect planets using the Transit Method. We are excited about the potential for Machine Learning to play a role in exoplanet discovery and look forward to seeing its impact on this field in the future.


Diego Hidalgo

Diego Hidalgo is a data engineer with a Ph.D. in astrophysics, which he obtained at Instituto de Astrofísica de Canarias. After completing his studies, he worked in various fields before landing his current job at dyvenia. Diego loves working with data and finds the parallels between different types of data management fascinating.
